From 50485bedfd9818165aa1d039d0abe95a559134b7 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Fri, 8 Feb 2019 08:31:03 +0100 Subject: Merging upstream version 1.12.0. Signed-off-by: Daniel Baumann --- docs/why-netdata/1s-granularity.md | 53 ++++++++++++++++++++++++ docs/why-netdata/README.md | 30 ++++++++++++++ docs/why-netdata/immediate-results.md | 41 +++++++++++++++++++ docs/why-netdata/meaningful-presentation.md | 63 +++++++++++++++++++++++++++++ docs/why-netdata/unlimited-metrics.md | 44 ++++++++++++++++++++ 5 files changed, 231 insertions(+) create mode 100644 docs/why-netdata/1s-granularity.md create mode 100644 docs/why-netdata/README.md create mode 100644 docs/why-netdata/immediate-results.md create mode 100644 docs/why-netdata/meaningful-presentation.md create mode 100644 docs/why-netdata/unlimited-metrics.md (limited to 'docs/why-netdata') diff --git a/docs/why-netdata/1s-granularity.md b/docs/why-netdata/1s-granularity.md new file mode 100644 index 000000000..089854543 --- /dev/null +++ b/docs/why-netdata/1s-granularity.md @@ -0,0 +1,53 @@ +# 1s granularity + +High resolution metrics are required to effectively monitor and troubleshoot systems and applications. + +## Why? + +- The world is going real-time. Today, customer experience is significantly affected by response time, so SLAs are tighter than ever before. It is just not practical to monitor a 2-second SLA with 10-second metrics. + +- IT goes virtual. Unlike real hardware, virtual environments are not linear, nor predictable. You cannot expect resources to be available when your applications need them. They will eventually be, but not exactly at the time they are needed. 
The latency of virtual environments is affected by many factors, most of which are outside our control, such as the maintenance policy of the hosting provider, the workload of third-party virtual machines running on the same physical servers combined with the resource allocation and throttling policy among virtual machines, the provisioning system of the hosting provider, etc. + +## What do others do? + +So, why don't most monitoring platforms and monitoring SaaS providers offer high resolution metrics? + +They want to, but they can't, at least not massively. + +The reasons lie in their design decisions: + +1. Time-series databases (Prometheus, Graphite, OpenTSDB, InfluxDB, etc.) centralize all the metrics. At scale, these databases can easily become the bottleneck of the whole infrastructure. + +2. SaaS providers base their business models on centralizing all the metrics. On top of the time-series database bottleneck they also have increased bandwidth costs. So, supporting high resolution metrics at scale would destroy their business model. + +Of course, the world solved this kind of scaling problem a couple of decades ago: instead of scaling up, scale out, horizontally. That is, instead of investing in bigger and bigger central components, decentralize the application so that it can scale by adding more, smaller nodes to it. + +There have been many attempts to fix this problem for monitoring. But so far, all solutions required centralization of metrics, which can only scale up. So, although the problem is somewhat managed, it is still the key problem of all monitoring platforms and one of the key reasons for increased monitoring costs. + +Another important factor is how resource efficient data collection can be when running per second. Most solutions fail to do it properly. Their data collection agents consume significant system resources when running "per second", considerably affecting the monitored systems and applications. 
+ +Finally, per second data collection is a lot harder. Busy virtual environments have [a constant latency of about 100ms, spread randomly to all data sources](https://docs.google.com/presentation/d/18C8bCTbtgKDWqPa57GXIjB2PbjjpjsUNkLtZEz6YK8s/edit#slide=id.g422e696d87_0_57). If data collection is not implemented properly, this latency introduces a random error of +/- 10%, which is quite significant for a monitoring system. + +So, the monitoring industry fails to massively provide high resolution metrics, mainly for 3 reasons: + +1. Centralization of metrics makes monitoring cost inefficient at that rate. +2. Data collection needs optimization, otherwise it will significantly affect the monitored systems. +3. Data collection is a lot harder, especially on busy virtual environments. + +## What does netdata do differently? + +Netdata decentralizes monitoring completely. Each Netdata node is autonomous. It collects metrics locally, it stores them locally, it runs checks against them to trigger alarms locally, and provides an API for the dashboards to visualize them. This allows Netdata to scale to infinity. + +Of course, Netdata can centralize metrics when needed. For example, it is not practical to keep metrics locally on ephemeral nodes. For these cases, Netdata streams the metrics in real-time, from the ephemeral nodes to one or more non-ephemeral nodes nearby. This centralization is again distributed. On a large infrastructure, there may be many centralization points. + +To eliminate the error introduced by data collection latencies on busy virtual environments, Netdata interpolates collected metrics. It does this using microsecond timings, per data source, offering measurements with an error rate of 0.0001%. 
When running [in debug mode, netdata calculates this error rate](https://github.com/netdata/netdata/blob/36199f449852f8077ea915a3a14a33fa2aff6d85/database/rrdset.c#L1070-L1099) for every point collected, ensuring that the database works with acceptable accuracy. + +Finally, Netdata is really fast. Optimization is a core product feature. On modern hardware, Netdata can collect metrics at a rate above 1M metrics per second per core (this includes everything: parsing data sources, interpolating data, storing data in the time series database, etc.). So, for a few thousand metrics per second per node, Netdata needs negligible CPU resources (just 1-2% of a single core). + +Netdata has been designed to: +- Solve the centralization problem of monitoring. +- Replace the console for performance troubleshooting. + +So, for Netdata, 1s granularity is easy; it is the natural outcome of its design. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2F1s-granularity&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/docs/why-netdata/README.md b/docs/why-netdata/README.md new file mode 100644 index 000000000..df8c0d02b --- /dev/null +++ b/docs/why-netdata/README.md @@ -0,0 +1,30 @@ +# Why Netdata + +> Any performance monitoring solution that does not go down to per second +> collection and visualization of the data, is useless. +> It will make you happy to have it, but it will not help you more than that. + +Netdata is built around 4 principles: + +1. **[Per second data collection for all metrics.](1s-granularity.md)** + + *It is impossible to monitor a 2-second SLA with 10-second metrics.* + +2. **[Collect and visualize all the metrics from all possible sources.](unlimited-metrics.md)** + + *To troubleshoot slowdowns, we need all the available metrics. The console should not provide more metrics.* + +3. 
**[Meaningful presentation, optimized for visual anomaly detection.](meaningful-presentation.md)** + + *Metrics are a lot more than name-value pairs over time. The monitoring tool should know all the metrics. Users should not!* + +4. **[Immediate results, just install and use.](immediate-results.md)** + + *Most of our infrastructure is standardized. There is no point in configuring everything metric by metric.* + +Unlike other monitoring solutions that focus on metrics visualization, +Netdata helps us troubleshoot slowdowns without touching the console. + +So, everything is a bit different. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2FWhy-Netdata&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/docs/why-netdata/immediate-results.md b/docs/why-netdata/immediate-results.md new file mode 100644 index 000000000..9afe4afdc --- /dev/null +++ b/docs/why-netdata/immediate-results.md @@ -0,0 +1,41 @@ +# Immediate results + +Most of our infrastructure is based on standardized systems and applications. + +It is a tremendous waste of time and effort, on a global scale, to require all users to configure their infrastructure dashboards and alarms metric by metric. + +## Why? + +Most existing monitoring solutions focus on providing a platform "for building your monitoring". So, they provide the tools to collect metrics, store them, visualize them, check them and query them. And we are all expected to go through this process. + +However, most of our infrastructure is standardized. We run well-known Linux distributions, the same kernel, the same database, the same web server, etc. + +So, why can't we have a monitoring system that can be installed and instantly provide feature-rich dashboards and alarms about everything we use? 
Is there any reason you would like to monitor your web server differently than me? + +What a waste of time and money! Hundreds of thousands of people doing the same thing over and over again, trying to understand what the metrics are, how to visualize them, how to configure alarms for them and how to query them when issues arise. + +## What do others do? + +Open-source solutions rely almost entirely on configuration. So, you have to go through endless metric-by-metric configuration yourself. The result will reflect your skills, your experience, your understanding. + +Monitoring SaaS providers offer a very basic set of pre-configured metrics, dashboards and alarms. They assume that you will configure the rest you may need. So, once more, the result will reflect your skills, your experience, your understanding. + +## What does netdata do? + +1. Metrics are auto-detected, so for 99% of the cases data collection works out of the box. +2. Metrics are converted to human readable units, right after data collection, before storing them into the database. +3. Metrics are structured, organized in charts, families and applications, so that they can be browsed. +4. Dashboards are automatically generated, so all metrics are available for exploration immediately after installation. +5. Dashboards are not just visualizing metrics; they are a tool, optimized for visual anomaly detection. +6. Hundreds of pre-configured alarm templates are automatically attached to collected metrics. + +The result is that Netdata can be used immediately after installation! + +Netdata: + +- Helps engineers understand and learn what the metrics are. +- Does not require any configuration. Of course there are thousands of options to tweak, but the defaults are pretty good for most systems. +- Does not introduce any query languages or any other technology to be learned. Of course some familiarity with the tool is required, but nothing too complicated. 
+- Includes all the community expertise and experience for monitoring systems and applications. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Fimmediate-results&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/docs/why-netdata/meaningful-presentation.md b/docs/why-netdata/meaningful-presentation.md new file mode 100644 index 000000000..6414d023f --- /dev/null +++ b/docs/why-netdata/meaningful-presentation.md @@ -0,0 +1,63 @@ +# Meaningful presentation + +Metrics are a lot more than name-value pairs over time. It is just not practical to require all users to have a deep understanding of all metrics for monitoring their systems and applications. + +## Why? + +There is a plethora of metrics. And each of them has a context, a meaning, a way to be interpreted. + +Traditionally, monitoring solutions instruct engineers to collect only the metrics they understand. This is a good strategy as long as you have a clear understanding of what you need and you have the skills, the expertise and the experience to select them. + +For most people, this is an impossible task. It is just not practical to assume that any engineer will have a deep understanding of how the kernel works, how the networking stack works, how the system manages its memory, how it schedules processes, how web servers work, how databases work, etc. + +The result is that for most of the world, monitoring sucks. It is incomplete, inefficient, and in most cases only useful for providing an illusion that the infrastructure is being monitored. It is not! According to the [State of Monitoring 2017](http://start.bigpanda.io/state-of-monitoring-report-2017), only 11% of companies are satisfied with their existing monitoring infrastructure, and on average they use 6-7 monitoring tools. 
+ +But even if all the metrics are collected, an even bigger challenge is revealed: What to do with them? How to use them? + +Existing monitoring solutions assume that engineers will: + +- Design dashboards +- Configure alarms +- Use a query language to investigate issues + +However, all these have to be configured metric by metric. + +The monitoring industry believes there is this "IT Operations Hero", a person combining these abilities: + +1. Has a deep understanding of IT architectures and is a skillful SysAdmin. +2. Is a superb Network Administrator (can even read and understand the Linux kernel networking stack). +3. Is an exceptional database administrator. +4. Is fluent in software engineering, capable of understanding the internal workings of applications. +5. Masters data science and statistical algorithms, and is fluent in writing advanced mathematical queries to reveal the meaning of metrics. + +Of course this person does not exist! + +## What do others do? + +Most solutions are based on a time-series database: a database that tracks name-value pairs over time. + +Data collection blindly collects metrics and stores them in the database; dashboard editors then query the database to visualize the metrics. They may also provide a query editor that users can use to query the database by hand. + +Of course, it is just not practical to work that way when the database has 10,000 unique metrics. Most of them will be just noise, not because they are not useful, but because no one understands them! + +So, they collect very limited metrics. Basic dashboards can be created with these metrics, but for any issue that needs troubleshooting, the monitoring system is just not adequate. It cannot help. So, engineers use the console to access the rest of the metrics and find the root cause. + +## What does netdata do? + +In Netdata, the meaning of metrics is incorporated into the database: + +1. all metrics are converted to, and stored in, human-friendly units. 
This is a data-collection process, not a visualization process. For example, CPU utilization in Netdata is stored as a percentage, not as kernel ticks. + +2. all metrics are organized into human-friendly charts, sharing the same context and units (similar to what other monitoring solutions call `cardinality`). So, when Netdata developers collect metrics, they configure the correlation of the metrics right in data collection, which is stored in the database too. + +3. all charts are then organized in families, and chart families are organized in applications. These structures are responsible for providing the menu at the right side of Netdata dashboards for exploring the whole database. + +The result is a system that can be browsed by humans, even if the database has 100,000 unique metrics. It is pretty natural for everyone to browse them, understand their meaning and their scope. + +Of course, this process makes data collection significantly more time-consuming. Netdata developers need to normalize, correlate and categorize every single metric Netdata collects. + +But it simplifies everything else. Data collection, metrics database and visualization are decoupled, so the query engine is simpler and the visualization is straightforward. + +Netdata goes a step further, by enriching the dashboard with information that is useful for most people. To improve clarity and help users be more effective, Netdata includes right in the dashboard the community knowledge and expertise about the metrics, so that Netdata users can focus on solving their infrastructure problems, not on the technicalities of data collection and visualization. 
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Fmeaningful-presentation&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/docs/why-netdata/unlimited-metrics.md b/docs/why-netdata/unlimited-metrics.md new file mode 100644 index 000000000..e35034a2b --- /dev/null +++ b/docs/why-netdata/unlimited-metrics.md @@ -0,0 +1,44 @@ +# Unlimited metrics + +All metrics are important and all metrics should be available when you need them. + +## Why? + +Collecting all the metrics breaks the first rule of every monitoring textbook: "collect only the metrics you need", "collect only the metrics you understand". + +Unfortunately, that rule does not work! Filtering out most metrics is like reading a book by skipping most of its pages... + +For many people, monitoring is about: + +- Detecting outages +- Capacity planning + +However, **slowdowns are 10 times more common** than outages (check slide 14 of [Online Performance is Business Performance](https://www.slideshare.net/KenGodskind/alertsitetrac) reported by Trac Research/AlertSite). Designing a monitoring system targeting only outages and capacity planning solves just a tiny part of the operational problems we face. Check also [Downtime vs. Slowtime: Which Hurts More?](https://dzone.com/articles/downtime-vs-slowtime-which-hurts-more). + +To troubleshoot a slowdown, a lot more metrics are needed. Actually all the metrics are needed, since the real cause of a slowdown is most probably quite complex. If we knew the possible reasons, chances are we would have fixed them before they became a problem. + +## What do others do? + +Most monitoring solutions, when they are able to detect something, provide just a hint (e.g. 
"hey, there is a 20% drop in requests per second over the last minute") and they expect us to use the console for determining the root cause. + +Of course this introduces a lot more problems: how to troubleshoot a slowdown using the console, if the slowdown lifetime is just a few seconds, randomly spread throughout the day? + +You can't! You will spend your entire day on the console, waiting for the problem to happen again while you are logged in. A blame war starts: developers blame the systems, sysadmins blame the hosting provider, someone says it is a DNS problem, another one believes it is network related, etc. We have all experienced this, multiple times... + +So, why do monitoring solutions and SaaS providers filter out metrics? + +They can't do otherwise! + +1. Centralization of metrics depends on metrics filtering, to control monitoring costs. Time-series databases limit the number of metrics collected, because the number of metrics influences their performance significantly. They get congested at scale. +2. It is a lot easier to provide an illusion of monitoring by using a few basic metrics. +3. Troubleshooting slowdowns is the hardest IT problem to solve, so most solutions just avoid it. + +## What does netdata do? + +Netdata collects, stores and visualizes everything, every single metric exposed by systems and applications. + +Due to Netdata's distributed nature, the number of metrics collected does not have any noticeable effect on the performance or the cost of the monitoring infrastructure. + +Of course, since netdata is also about [meaningful presentation](meaningful-presentation.md), the number of metrics makes Netdata development slower. We, the Netdata developers, need to have a good understanding of the metrics before adding them into Netdata. 
We need to organize the metrics, add information related to them, configure alarms for them, so that you, the Netdata users, will have the best out-of-the-box experience and all the information required to kill the console for troubleshooting slowdowns. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fwhy-netdata%2Funlimited-metrics&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() -- cgit v1.2.3