diff options
Diffstat (limited to '')
-rw-r--r-- | docs/why-netdata/1s-granularity.md | 21 | ||||
-rw-r--r-- | docs/why-netdata/README.md | 18 | ||||
-rw-r--r-- | docs/why-netdata/immediate-results.md | 24 | ||||
-rw-r--r-- | docs/why-netdata/meaningful-presentation.md | 26 | ||||
-rw-r--r-- | docs/why-netdata/unlimited-metrics.md | 12 |
5 files changed, 51 insertions, 50 deletions
diff --git a/docs/why-netdata/1s-granularity.md b/docs/why-netdata/1s-granularity.md index 0d12a2d4..195a0d8f 100644 --- a/docs/why-netdata/1s-granularity.md +++ b/docs/why-netdata/1s-granularity.md @@ -4,9 +4,9 @@ High resolution metrics are required to effectively monitor and troubleshoot sys ## Why? -- The world is going real-time. Today, customer experience is significantly affected by response time, so SLAs are tighter than ever before. It is just not practical to monitor a 2-second SLA with 10-second metrics. +- The world is going real-time. Today, customer experience is significantly affected by response time, so SLAs are tighter than ever before. It is just not practical to monitor a 2-second SLA with 10-second metrics. -- IT goes virtual. Unlike real hardware, virtual environments are not linear, nor predictable. You cannot expect resources to be available when your applications need them. They will eventually be, but not exactly at the time they are needed. The latency of virtual environments is affected by many factors, most of which are outside our control, like: the maintenance policy of the hosting provider, the work load of third party virtual machines running on the same physical servers combined with the resource allocation and throttling policy among virtual machines, the provisioning system of the hosting provider, etc. +- IT goes virtual. Unlike real hardware, virtual environments are not linear, nor predictable. You cannot expect resources to be available when your applications need them. They will eventually be, but not exactly at the time they are needed. The latency of virtual environments is affected by many factors, most of which are outside our control, like: the maintenance policy of the hosting provider, the work load of third party virtual machines running on the same physical servers combined with the resource allocation and throttling policy among virtual machines, the provisioning system of the hosting provider, etc. ## What do others do? @@ -16,9 +16,9 @@ They want to, but they can't, at least not massively. The reasons lie in their design decisions: -1. Time-series databases (prometheus, graphite, opentsdb, influxdb, etc) centralize all the metrics. At scale, these databases can easily become the bottleneck of the whole infrastructure. +1. Time-series databases (prometheus, graphite, opentsdb, influxdb, etc) centralize all the metrics. At scale, these databases can easily become the bottleneck of the whole infrastructure. -2. SaaS providers base their business models on centralizing all the metrics. On top of the time-series database bottleneck they also have increased bandwidth costs. So, massively supporting high resolution metrics, destroys their business model. +2. SaaS providers base their business models on centralizing all the metrics. On top of the time-series database bottleneck they also have increased bandwidth costs. So, massively supporting high resolution metrics, destroys their business model. Of course, since a couple of decades the world has fixed this kind of scaling problems: instead of scaling up, scale out, horizontally. That is, instead of investing on bigger and bigger central components, decentralize the application so that it can scale by adding more smaller nodes to it. @@ -30,9 +30,9 @@ Finally, per second data collection is a lot harder. Busy virtual environments h So, the monitoring industry fails to massively provide high resolution metrics, mainly for 3 reasons: -1. Centralization of metrics makes monitoring cost inefficient at that rate. -2. Data collection needs optimization, otherwise it will significantly affect the monitored systems. -3. Data collection is a lot harder, especially on busy virtual environments. +1. Centralization of metrics makes monitoring cost inefficient at that rate. +2. Data collection needs optimization, otherwise it will significantly affect the monitored systems. +3. Data collection is a lot harder, especially on busy virtual environments. ## What does Netdata do differently? @@ -45,9 +45,10 @@ To eliminate the error introduced by data collection latencies on busy virtual e Finally, Netdata is really fast. Optimization is a core product feature. On modern hardware, Netdata can collect metrics with a rate of above 1M metrics per second per core (this includes everything, parsing data sources, interpolating data, storing data in the time series database, etc). So, for a few thousands metrics per second per node, Netdata needs negligible CPU resources (just 1-2% of a single core). Netdata has been designed to: -- Solve the centralization problem of monitoring -- Replace the console for performance troubleshooting. + +- Solve the centralization problem of monitoring +- Replace the console for performance troubleshooting. So, for Netdata 1s granularity is easy, the natural outcome... -[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2F1s-granularity&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2F1s-granularity&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/why-netdata/README.md b/docs/why-netdata/README.md index df8c0d02..1003b07a 100644 --- a/docs/why-netdata/README.md +++ b/docs/why-netdata/README.md @@ -6,25 +6,25 @@ Netdata is built around 4 principles: -1. **[Per second data collection for all metrics.](1s-granularity.md)** +1. **[Per second data collection for all metrics.](1s-granularity.md)** - *It is impossible to monitor a 2 second SLA, with 10 second metrics.* + _It is impossible to monitor a 2 second SLA, with 10 second metrics._ -2. **[Collect and visualize all the metrics from all possible sources.](unlimited-metrics.md)** +2. **[Collect and visualize all the metrics from all possible sources.](unlimited-metrics.md)** - *To troubleshoot slowdowns, we need all the available metrics. The console should not provide more metrics.* + _To troubleshoot slowdowns, we need all the available metrics. The console should not provide more metrics._ -3. **[Meaningful presentation, optimized for visual anomaly detection.](meaningful-presentation.md)** +3. **[Meaningful presentation, optimized for visual anomaly detection.](meaningful-presentation.md)** - *Metrics are a lot more than name-value pairs over time. The monitoring tool should know all the metrics. Users should not!* + _Metrics are a lot more than name-value pairs over time. The monitoring tool should know all the metrics. Users should not!_ -4. **[Immediate results, just install and use.](immediate-results.md)** +4. **[Immediate results, just install and use.](immediate-results.md)** - *Most of our infrastructure is standardized. There is no point to configure everything metric by metric.* + _Most of our infrastructure is standardized. There is no point to configure everything metric by metric._ Unlike other monitoring solutions that focus on metrics visualization, Netdata's helps us troubleshoot slowdowns without touching the console. So, everything is a bit different. -[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2FWhy-Netdata&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2FWhy-Netdata&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/why-netdata/immediate-results.md b/docs/why-netdata/immediate-results.md index 12333671..f1f452ca 100644 --- a/docs/why-netdata/immediate-results.md +++ b/docs/why-netdata/immediate-results.md @@ -1,7 +1,7 @@ # Immediate results Most of our infrastructure is based on standardized systems and applications. - + It is a tremendous waste of time and effort, in a global scale, to require from all users to configure their infrastructure dashboards and alarms metric by metric. ## Why? @@ -22,20 +22,20 @@ Monitoring SaaS providers offer a very basic set of pre-configured metrics, dash ## What does Netdata do? -1. Metrics are auto-detected, so for 99% of the cases data collection works out of the box. -2. Metrics are converted to human readable units, right after data collection, before storing them into the database. -3. Metrics are structured, organized in charts, families and applications, so that they can be browsed. -4. Dashboards are automatically generated, so all metrics are available for exploration immediately after installation. -5. Dashboards are not just visualizing metrics; they are a tool, optimized for visual anomaly detection. -6. Hundreds of pre-configured alarm templates are automatically attached to collected metrics. +1. Metrics are auto-detected, so for 99% of the cases data collection works out of the box. +2. Metrics are converted to human readable units, right after data collection, before storing them into the database. +3. Metrics are structured, organized in charts, families and applications, so that they can be browsed. +4. Dashboards are automatically generated, so all metrics are available for exploration immediately after installation. +5. Dashboards are not just visualizing metrics; they are a tool, optimized for visual anomaly detection. +6. Hundreds of pre-configured alarm templates are automatically attached to collected metrics. The result is that Netdata can be used immediately after installation! Netdata: -- Helps engineers understand and learn what the metrics are. -- Does not require any configuration. Of course there are thousands of options to tweak, but the defaults are pretty good for most systems. -- Does not introduce any query languages or any other technology to be learned. Of course some familiarity with the tool is required, but nothing too complicated. -- Includes all the community expertise and experience for monitoring systems and applications. +- Helps engineers understand and learn what the metrics are. +- Does not require any configuration. Of course there are thousands of options to tweak, but the defaults are pretty good for most systems. +- Does not introduce any query languages or any other technology to be learned. Of course some familiarity with the tool is required, but nothing too complicated. +- Includes all the community expertise and experience for monitoring systems and applications. -[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Fimmediate-results&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Fimmediate-results&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/why-netdata/meaningful-presentation.md b/docs/why-netdata/meaningful-presentation.md index f6fd0756..2623d152 100644 --- a/docs/why-netdata/meaningful-presentation.md +++ b/docs/why-netdata/meaningful-presentation.md @@ -15,20 +15,20 @@ The result is that for most of the world, monitoring sucks. It is incomplete, in But even if all the metrics are collected, an even bigger challenge is revealed: What to do with them? How to use them? The existing monitoring solutions, assume the engineers will: - -- Design dashboards -- Configure alarms -- Use a query language to investigate issues + +- Design dashboards +- Configure alarms +- Use a query language to investigate issues However, all these have to be configured metric by metric. The monitoring industry believes there is this "IT Operations Hero", a person combining these abilities: -1. Has a deep understanding of IT architectures and is a skillful SysAdmin. -2. Is a superb Network Administrator (can even read and understand the Linux kernel networking stack). -3. Is a exceptional database administrator. -4. Is fluent in software engineering, capable of understanding the internal workings of applications. -5. Masters Data Science, statistical algorithms and is fluent in writing advanced mathematical queries to reveal the meaning of metrics. +1. Has a deep understanding of IT architectures and is a skillful SysAdmin. +2. Is a superb Network Administrator (can even read and understand the Linux kernel networking stack). +3. Is a exceptional database administrator. +4. Is fluent in software engineering, capable of understanding the internal workings of applications. +5. Masters Data Science, statistical algorithms and is fluent in writing advanced mathematical queries to reveal the meaning of metrics. Of course this person does not exist! @@ -46,11 +46,11 @@ So, they collect very limited metrics. Basic dashboards can be created with thes In Netdata, the meaning of metrics is incorporated into the database: -1. all metrics are converted and stored to human-friendly units. This is a data-collection process, not a visualization process. For example, cpu utilization in Netdata is stored as percentage, not as kernel ticks. +1. all metrics are converted and stored to human-friendly units. This is a data-collection process, not a visualization process. For example, cpu utilization in Netdata is stored as percentage, not as kernel ticks. -2. all metrics are organized into human-friendly charts, sharing the same context and units (similar to what other monitoring solutions call `cardinality`). So, when Netdata developer collect metrics, they configure the correlation of the metrics right in data collection, which is stored in the database too. +2. all metrics are organized into human-friendly charts, sharing the same context and units (similar to what other monitoring solutions call `cardinality`). So, when Netdata developer collect metrics, they configure the correlation of the metrics right in data collection, which is stored in the database too. -3. all charts are then organized in families, and chart families are organized in applications. These structures are responsible for providing the menu at the right side of Netdata dashboards for exploring the whole database. +3. all charts are then organized in families, and chart families are organized in applications. These structures are responsible for providing the menu at the right side of Netdata dashboards for exploring the whole database. The result is a system that can be browsed by humans, even if the database has 100,000 unique metrics. It is pretty natural for everyone to browse them, understand their meaning and their scope. @@ -60,4 +60,4 @@ But it simplifies everything else. Data collection, metrics database and visuali Netdata goes a step further, by enriching the dashboard with information that is useful for most people. So, to improve clarity and help users be more effective, Netdata includes right in the dashboard the community knowledge and expertise about the metrics. So, that Netdata users can focus on solving their infrastructure problem, not on the technicalities of data collection and visualization. -[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Fmeaningful-presentation&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Fmeaningful-presentation&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/why-netdata/unlimited-metrics.md b/docs/why-netdata/unlimited-metrics.md index a4ecaf3f..827138ff 100644 --- a/docs/why-netdata/unlimited-metrics.md +++ b/docs/why-netdata/unlimited-metrics.md @@ -10,8 +10,8 @@ Unfortunately, this does not work! Filtering out most metrics is like reading a For many people, monitoring is about: -- Detecting outages -- Capacity planning +- Detecting outages +- Capacity planning However, **slowdowns are 10 times more common** compared to outages (check slide 14 of [Online Performance is Business Performance ](https://www.slideshare.net/KenGodskind/alertsitetrac) reported by Trac Research/AlertSite). Designing a monitoring system targeting only outages and capacity planning solves just a tiny part of the operational problems we face. Check also [Downtime vs. Slowtime: Which Hurts More?](https://dzone.com/articles/downtime-vs-slowtime-which-hurts-more). @@ -29,9 +29,9 @@ So, why do monitoring solutions and SaaS providers filter out metrics? They can't do otherwise! -1. Centralization of metrics depends on metrics filtering, to control monitoring costs. Time-series databases limit the number of metrics collected, because the number of metrics influences their performance significantly. They get congested at scale. -2. It is a lot easier to provide an illusion of monitoring by using a few basic metrics. -3. Troubleshooting slowdowns is the hardest IT problem to solve, so most solutions just avoid it. +1. Centralization of metrics depends on metrics filtering, to control monitoring costs. Time-series databases limit the number of metrics collected, because the number of metrics influences their performance significantly. They get congested at scale. +2. It is a lot easier to provide an illusion of monitoring by using a few basic metrics. +3. Troubleshooting slowdowns is the hardest IT problem to solve, so most solutions just avoid it. ## What does Netdata do? @@ -41,4 +41,4 @@ Due to Netdata's distributed nature, the number of metrics collected does not ha Of course, since Netdata is also about [meaningful presentation](meaningful-presentation.md), the number of metrics makes Netdata development slower. We, the Netdata developers, need to have a good understanding of the metrics before adding them into Netdata. We need to organize the metrics, add information related to them, configure alarms for them, so that you, the Netdata users, will have the best out-of-the-box experience and all the information required to kill the console for troubleshooting slowdowns. -[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Funlimited-metrics&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Funlimited-metrics&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) |