author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-19 02:57:58 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-19 02:57:58 +0000 |
commit | be1c7e50e1e8809ea56f2c9d472eccd8ffd73a97 (patch) | |
tree | 9754ff1ca740f6346cf8483ec915d4054bc5da2d /docs/category-overview-pages | |
parent | Initial commit. (diff) | |
Adding upstream version 1.44.3.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'docs/category-overview-pages')
21 files changed, 780 insertions, 0 deletions
diff --git a/docs/category-overview-pages/accessing-netdata-dashboards.md b/docs/category-overview-pages/accessing-netdata-dashboards.md new file mode 100644 index 00000000..97df8b83 --- /dev/null +++ b/docs/category-overview-pages/accessing-netdata-dashboards.md @@ -0,0 +1,38 @@ +# Accessing Netdata Dashboards + +This section contains documentation on how you can access the Netdata dashboard, which is the same for both the Agent and Cloud. + +A user accessing the Netdata dashboard **from the Cloud** will always be presented with the latest Netdata dashboard version. + +A user accessing the Netdata dashboard **from the Agent** will, by default, be presented with the latest Netdata dashboard version (the same as Netdata Cloud) except in the following scenarios: +* The Agent doesn't have Internet access and is unable to get the latest Netdata dashboards; as a result, it falls back to the Netdata dashboard version that +was shipped with the Agent. +* Users have defined, e.g. through a URL bookmark, that they want to see the previous version of the dashboard (accessible at `http://NODE:19999/v1`, replacing `NODE` with the IP address or hostname of your Agent). 
+ +## Main sections + +The Netdata dashboard consists of the following main sections: +* [Netdata charts](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/interact-new-charts.md) +* [Infrastructure Overview](https://github.com/netdata/netdata/blob/master/docs/visualize/overview-infrastructure.md) +* [Nodes view](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/nodes.md) +* [Custom dashboards](https://learn.netdata.cloud/docs/visualizations/custom-dashboards) +* [Alerts](https://github.com/netdata/netdata/blob/master/docs/monitor/view-active-alerts.md) +* [Anomaly Advisor](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/anomaly-advisor.md) +* [Functions](https://github.com/netdata/netdata/blob/master/docs/cloud/netdata-functions.md) +* [Events feed](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/events-feed.md) + +> ⚠️ Some sections of the dashboard, when accessed through the Agent, may require the user to be signed in to Netdata Cloud, or the Agent to be claimed to Netdata Cloud, for their full functionality. Examples include saving visualization settings on charts or custom dashboards, claiming the node to Netdata Cloud, or executing functions on an Agent. + +## How to access the dashboards? + +### Netdata Cloud + +You can access the dashboard at https://app.netdata.cloud/ and [sign in](https://github.com/netdata/netdata/blob/master/docs/cloud/manage/sign-in.md) with an account, or [sign up](https://github.com/netdata/netdata/blob/master/docs/cloud/manage/sign-in.md#dont-have-a-netdata-cloud-account-yet) if you don't have an account yet. + +### Netdata Agent + +Netdata starts a web server for its dashboard on port `19999`. Open up your web browser of choice and +navigate to `http://NODE:19999`, replacing `NODE` with the IP address or hostname of your Agent. If installed on localhost, you can access it through `http://localhost:19999`. 
+ + +Documentation for the previous Agent dashboard can still be found [here](https://github.com/netdata/netdata/blob/master/web/gui/README.md).
\ No newline at end of file diff --git a/docs/category-overview-pages/build-the-netdata-agent-yourself.md b/docs/category-overview-pages/build-the-netdata-agent-yourself.md new file mode 100644 index 00000000..99166ad9 --- /dev/null +++ b/docs/category-overview-pages/build-the-netdata-agent-yourself.md @@ -0,0 +1,3 @@ +# Build the Netdata Agent yourself + +This section contains documentation on all the ways that you can build the Netdata Agent.
\ No newline at end of file diff --git a/docs/category-overview-pages/deployment-strategies.md b/docs/category-overview-pages/deployment-strategies.md new file mode 100644 index 00000000..69daaf9f --- /dev/null +++ b/docs/category-overview-pages/deployment-strategies.md @@ -0,0 +1,268 @@ +# Deployment strategies + +Netdata can be used to monitor all kinds of infrastructure, from stand-alone tiny IoT devices to complex hybrid setups +combining on-premise and cloud infrastructure, mixing bare-metal servers, virtual machines and containers. + +There are 3 components to structure your Netdata ecosystem: + +1. **Netdata Agents** + To monitor the physical or virtual nodes of your infrastructure, including all applications and containers running on them. + + Netdata Agents are Open-Source, licensed under GPL v3+. + +2. **Netdata Parents** + To create data centralization points within your infrastructure, to offload Netdata Agents functions from your production + systems, to provide high-availability of your data, increased data retention and isolation of your nodes. + + Netdata Parents are implemented using the Netdata Agent software. Any Netdata Agent can be an Agent for a node and a Parent + for other Agents, at the same time. + + It is recommended to set up multiple Netdata Parents. They will all seamlessly be integrated by Netdata Cloud into one monitoring solution. + + +3. **Netdata Cloud** + Our SaaS, combining all your infrastructure, all your Netdata Agents and Parents, into one uniform, distributed, + scalable, monitoring database, offering advanced data slicing and dicing capabilities, custom dashboards, advanced troubleshooting + tools, user management, centralized management of alerts, and more. 
+ + +The Netdata Agent is a highly modular software piece, providing data collection via numerous plugins, an in-house crafted time-series +database, a query engine, health monitoring and alerts, machine learning and anomaly detection, metrics exporting to third party systems. + + +## Deployment Options Overview + +This section provides a quick overview of a few common deployment options. The next sections go into configuration examples and further reading. + +### Stand-alone Deployment + +To help our users have a complete experience of Netdata when they install it for the first time, a Netdata Agent with default configuration +is a complete monitoring solution out of the box, having all these features enabled and available. + +The Agent will act as a _stand-alone_ Agent by default, and this is great to start out with for small setups and home labs. By [connecting each Agent to Cloud](https://github.com/netdata/netdata/blob/master/claim/README.md), you can see an overview of all your nodes, with aggregated charts and centralized alerting, without setting up a Parent. + +![image](https://github.com/netdata/netdata/assets/116741/6a638175-aec4-4d46-85a6-520c283ab6a8) + +### Parent – Child Deployment + +An Agent connected to a Parent is called a _Child_. It will _stream_ metrics to its Parent. The Parent can then take care of storing metrics on behalf of that node (with longer retention), handle metrics queries for showing dashboards, and provide alerting. + +When using Cloud, it is recommended that just the Parent is connected to Cloud. Child Agents can then be configured to have short retention, in RAM instead of on Disk, and have alerting and other features disabled. Because they don't need to connect to Cloud themselves, those children can then be further secured by not allowing outbound traffic. 
+ +![image](https://github.com/netdata/netdata/assets/116741/cb65698d-a6b7-43ee-a2d1-c30d0a46f084) + +This setup allows for leaner Child nodes and is good for setups with more than a handful of nodes. Metrics data remains accessible if the Child node is temporarily unavailable or decommissioned, although there is no failover in case the Parent becomes unavailable. + + +### Active–Active Parent Deployment + +For high availability, Parents can be configured to stream their children's data to each other and keep the data sets in sync. Child Agents are configured with the addresses of both Parent Agents, but will only stream to one of them at a time. When that Parent becomes unavailable, the Child reconnects to the other. When the first Parent becomes available again, that Parent will catch up by receiving the backlog from the second. + +With both Parent Agents connected to Cloud, Cloud will route queries to either Parent transparently, depending on their availability. Alerts triggered on either Parent will stream to Cloud, and Cloud will deduplicate and debounce state changes to prevent spurious notifications. + +![image](https://github.com/netdata/netdata/assets/116741/6ae2b10c-7f7d-4503-aac4-0a9381c6f80b) + + +## Configuration Details + +### Stand-alone Deployment + +The stand-alone setup is configured out of the box with reasonable defaults, but please consult our [configuration documentation](https://github.com/netdata/netdata/blob/master/docs/cloud/cheatsheet.md) for details, including the overview of [common configuration changes](https://github.com/netdata/netdata/blob/master/docs/configure/common-changes.md). + +### Parent – Child Deployment + +For setups involving Child and Parent Agents, the Agents need to be configured for [_streaming_](https://github.com/netdata/netdata/blob/master/streaming/README.md), through the configuration file `stream.conf`. 
This will instruct the Child to stream data to the Parent and the Parent to accept streaming connections for one or more Child Agents. To secure this connection, both need to set up a shared API key (to replace the string `API_KEY` in the examples below). Additionally, the Child is configured with one or more addresses of Parent Agents (`PARENT_IP_ADDRESS`). + +An API key is a key created with `uuidgen` and is used for authentication and/or customization on the Parent side. That is, a Child streams using the API key, and a Parent is configured to accept connections from Children that present it; a Parent can also apply different options to different children by using multiple API keys. The easiest setup uses just one API key for all Child Agents. + +#### Child config + +As mentioned above, the recommendation is to not claim the Child to Cloud directly during your setup, avoiding establishing an [ACLK](https://github.com/netdata/netdata/blob/master/aclk/README.md) connection. + +To reduce the footprint of the Netdata Agent on your production system, some capabilities can be switched OFF on the Child and kept ON on the Parent. In this example, Machine Learning and Alerting are disabled in the Child, so that the Parent can take the load. We also use RAM instead of disk to store metrics with limited retention, covering temporary network issues. + +##### netdata.conf + +On the child node, edit `netdata.conf` by using the edit-config script (`/etc/netdata/edit-config netdata.conf`) and set the following parameters: + +```yaml +[db] + # https://learn.netdata.cloud/docs/agent/database + # none = no retention, ram = some retention in ram + mode = ram + # The retention in seconds. + # This provides some tolerance to the time the child has to find a parent in + # order to transfer the data. For IoT this can be lowered to 120. + retention = 1200 + # The granularity of metrics, in seconds. + # You may increase this to lower CPU resources. 
+ update every = 1 +[ml] + # Disable Machine Learning + enabled = no +[health] + # Disable Health Checks (Alerting) + enabled = no +[web] + # Disable remote access to the local dashboard + bind to = lo +[plugins] + # Uncomment the following line to disable all external plugins on extreme + # IoT cases by default. + # enable running new plugins = no +``` + +##### stream.conf + +To edit `stream.conf`, again use the edit-config script: `/etc/netdata/edit-config stream.conf`. + +Set the following parameters: + +```yaml +[stream] + # Stream metrics to another Netdata + enabled = yes + # The IP and PORT of the parent + destination = PARENT_IP_ADDRESS:19999 + # The shared API key, generated by uuidgen + api key = API_KEY +``` + +#### Parent config + +For the Parent, besides setting up streaming, the example will also provide an example configuration of multiple [tiers](https://github.com/netdata/netdata/blob/master/database/engine/README.md#tiering) of metrics [storage](https://github.com/netdata/netdata/blob/master/docs/store/change-metrics-storage.md), for 10 children, with about 2k metrics each. 
+ +- 1s granularity at tier 0 for 1 week +- 1m granularity at tier 1 for 1 month +- 1h granularity at tier 2 for 1 year + +Requiring: + +- 25GB of disk +- 3.5GB of RAM (2.5GB under pressure) + +##### netdata.conf + +On the Parent, edit `netdata.conf` with `/etc/netdata/edit-config netdata.conf` and set the following parameters: + +```yaml +[db] + mode = dbengine + storage tiers = 3 + # To allow memory pressure to offload index from ram + dbengine page descriptors in file mapped memory = yes + # storage tier 0 + update every = 1 + dbengine multihost disk space MB = 12000 + dbengine page cache size MB = 1400 + # storage tier 1 + dbengine tier 1 page cache size MB = 512 + dbengine tier 1 multihost disk space MB = 4096 + dbengine tier 1 update every iterations = 60 + dbengine tier 1 backfill = new + # storage tier 2 + dbengine tier 2 page cache size MB = 128 + dbengine tier 2 multihost disk space MB = 2048 + dbengine tier 2 update every iterations = 60 + dbengine tier 2 backfill = new +[ml] + # Enabled by default + # enabled = yes +[health] + # Enabled by default + # enabled = yes +[web] + # Enabled by default + # bind to = * +``` + +##### stream.conf + +On the Parent node, edit `stream.conf` with `/etc/netdata/edit-config stream.conf`, and then set the following parameters: + +```yaml +[API_KEY] + # Accept metrics streaming from other Agents with the specified API key + enabled = yes +``` + +### Active–Active Parent Deployment + +To set up active–active streaming between Parent 1 and Parent 2, Parent 1 needs to be instructed to stream data to Parent 2, and Parent 2 to stream data to Parent 1. The Child Agents need to be configured with the addresses of both Parent Agents. Each Child Agent will only connect to one Parent at a time, falling back to the next if the previous one fails. These examples use the same API key between Parent Agents as for connections from Child Agents. 
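
The wiring described in the paragraph above can be sketched as a small generator that renders the three `stream.conf` variants. This is only an illustration and not part of the commit: the function names (`stream_section`, `accept_section`, `active_active_configs`) are hypothetical, and the addresses and key are the same placeholders the document uses (`PARENT_1_IP_ADDRESS`, `PARENT_2_IP_ADDRESS`, `API_KEY`). It assumes a single shared API key for all connections, as the text states.

```python
def stream_section(destinations, api_key):
    """Render a [stream] section that sends metrics to the given destinations."""
    return "\n".join([
        "[stream]",
        "    enabled = yes",
        # Several destinations are separated by spaces, as in the Child example.
        f"    destination = {' '.join(destinations)}",
        f"    api key = {api_key}",
    ])


def accept_section(api_key):
    """Render the [API_KEY] section that lets a Parent accept incoming streams."""
    return "\n".join([f"[{api_key}]", "    enabled = yes"])


def active_active_configs(parent_1, parent_2, api_key):
    """Return stream.conf text for Parent 1, Parent 2 and the Child Agents."""
    return {
        # Each Parent streams to the other one and accepts incoming streams.
        "parent_1": stream_section([parent_2], api_key) + "\n" + accept_section(api_key),
        "parent_2": stream_section([parent_1], api_key) + "\n" + accept_section(api_key),
        # Children only send; they list both Parents as possible destinations.
        "child": stream_section([parent_1, parent_2], api_key),
    }


configs = active_active_configs(
    "PARENT_1_IP_ADDRESS:19999", "PARENT_2_IP_ADDRESS:19999", "API_KEY"
)
for name, text in configs.items():
    print(f"# stream.conf on {name}\n{text}\n")
```

Note the asymmetry this makes visible: only the Parents carry an `[API_KEY]` section, because only they accept connections, while a Child lists both Parents in one `destination` line.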
+ +On both Netdata Parent and all Child Agents, edit `stream.conf` with `/etc/netdata/edit-config stream.conf`: + +##### stream.conf on Parent 1 + +```yaml +[stream] + # Stream metrics to another Netdata + enabled = yes + # The IP and PORT of Parent 2 + destination = PARENT_2_IP_ADDRESS:19999 + # This is the API key for the outgoing connection to Parent 2 + api key = API_KEY +[API_KEY] + # Accept metrics streams from Parent 2 and Child Agents + enabled = yes +``` + +##### stream.conf on Parent 2 + +```yaml +[stream] + # Stream metrics to another Netdata + enabled = yes + # The IP and PORT of Parent 1 + destination = PARENT_1_IP_ADDRESS:19999 + api key = API_KEY +[API_KEY] + # Accept metrics streams from Parent 1 and Child Agents + enabled = yes +``` + +##### stream.conf on Child Agents + +```yaml +[stream] + # Stream metrics to another Netdata + enabled = yes + # The IP and PORT of the parent + destination = PARENT_1_IP_ADDRESS:19999 PARENT_2_IP_ADDRESS:19999 + # The shared API key, generated by uuidgen + api key = API_KEY +``` + +## Further Reading + +We strongly recommend the following configuration changes for production deployments: + +1. Understand Netdata's [security and privacy design](https://github.com/netdata/netdata/blob/master/docs/netdata-security.md) and + [secure your nodes](https://github.com/netdata/netdata/blob/master/docs/category-overview-pages/secure-nodes.md) + + To safeguard your infrastructure and comply with your organization's security policies. + +2. Set up [streaming and replication](https://github.com/netdata/netdata/blob/master/streaming/README.md) to: + + - Offload Netdata Agents running on production systems and free system resources for the production applications running on them. + - Isolate production systems from the rest of the world and improve security. + - Increase data retention. + - Make your data highly available. + +3. 
[Optimize the Netdata Agents system utilization and performance](https://github.com/netdata/netdata/blob/master/docs/guides/configure/performance.md) + + To save valuable system resources, especially when running on weak IoT devices. + +We also suggest that you: + +1. [Use Netdata Cloud to access the dashboards](https://github.com/netdata/netdata/blob/master/docs/quickstart/infrastructure.md) + + For increased security, user management and access to our latest tools for advanced dashboarding and troubleshooting. + +2. [Change how long Netdata stores metrics](https://github.com/netdata/netdata/blob/master/docs/store/change-metrics-storage.md) + + To control Netdata's memory use, when you have a lot of ephemeral metrics. + +3. [Use host labels](https://github.com/netdata/netdata/blob/master/docs/guides/using-host-labels.md) + + To organize systems, metrics, and alerts. diff --git a/docs/category-overview-pages/install-netdata-on-embedded-systems.md b/docs/category-overview-pages/install-netdata-on-embedded-systems.md new file mode 100644 index 00000000..dfaa4482 --- /dev/null +++ b/docs/category-overview-pages/install-netdata-on-embedded-systems.md @@ -0,0 +1,3 @@ +# Install Netdata on Embedded Systems Overview + +This section contains documentation for installation methods when it comes to Embedded Systems.
\ No newline at end of file diff --git a/docs/category-overview-pages/install-with-a-cicd-provisioning-system.md b/docs/category-overview-pages/install-with-a-cicd-provisioning-system.md new file mode 100644 index 00000000..30a5a706 --- /dev/null +++ b/docs/category-overview-pages/install-with-a-cicd-provisioning-system.md @@ -0,0 +1,3 @@ +# Install with a CI/CD Provisioning System Overview + +This section contains documentation on all the installation methods through a CI/CD system.
\ No newline at end of file diff --git a/docs/category-overview-pages/installation-overview.md b/docs/category-overview-pages/installation-overview.md new file mode 100644 index 00000000..e60dd442 --- /dev/null +++ b/docs/category-overview-pages/installation-overview.md @@ -0,0 +1,10 @@ +# Installation + +In this category you can find instructions on all the possible ways you can install Netdata on the +[supported platforms](https://github.com/netdata/netdata/blob/master/packaging/PLATFORM_SUPPORT.md). + +If this is your first time using Netdata, we recommend that you first start with the +[quick installation guide](https://github.com/netdata/netdata/edit/master/packaging/installer/README.md) and then +go into the more advanced options available to you. + + diff --git a/docs/category-overview-pages/integrations-overview.md b/docs/category-overview-pages/integrations-overview.md new file mode 100644 index 00000000..6fa2f50a --- /dev/null +++ b/docs/category-overview-pages/integrations-overview.md @@ -0,0 +1,31 @@ +<!-- +title: "Integrations" +sidebar_label: "Integrations" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/docs/category-overview-pages/integrations-overview.md" +description: "Available integrations in Netdata" +learn_status: "Published" +learn_rel_path: "Integrations" +sidebar_position: 60 +--> + +# Integrations + +Netdata's ability to monitor out of the box every potentially useful aspect of a node's operation is unparalleled. +But Netdata also provides out of the box, meaningful charts and alerts for hundreds of applications, with the ability +to be easily extended to monitor anything. See the full list of Netdata's capabilities and how you can extend them in the +[supported collectors list](https://github.com/netdata/netdata/blob/master/collectors/COLLECTORS.md). + +Our out of the box alerts were created by expert professionals and have been validated on the field, countless times. 
+Use them to trigger [alert notifications](https://github.com/netdata/netdata/blob/master/docs/monitor/enable-notifications.md) +either centrally, via the +[Cloud alert notifications](https://github.com/netdata/netdata/blob/master/docs/cloud/alerts-notifications/notifications.md) +, or by configuring individual +[agent notifications](https://github.com/netdata/netdata/blob/master/health/notifications/README.md). + +We designed Netdata with interoperability in mind. The Agent collects thousands of metrics every second, and then what +you do with them is up to you. You can +[store metrics in the database engine](https://github.com/netdata/netdata/blob/master/database/README.md), +or send them to another time series database for long-term storage or further analysis using +Netdata's [exporting engine](https://github.com/netdata/netdata/edit/master/exporting/README.md). + + diff --git a/docs/category-overview-pages/logs.md b/docs/category-overview-pages/logs.md new file mode 100644 index 00000000..fbaf8563 --- /dev/null +++ b/docs/category-overview-pages/logs.md @@ -0,0 +1,3 @@ +# Logs + +This section talks about ways Netdata collects and visualizes logs, while also providing useful guides on log centralization setups that can be used with Netdata. diff --git a/docs/category-overview-pages/machine-learning-and-assisted-troubleshooting.md b/docs/category-overview-pages/machine-learning-and-assisted-troubleshooting.md new file mode 100644 index 00000000..074051e3 --- /dev/null +++ b/docs/category-overview-pages/machine-learning-and-assisted-troubleshooting.md @@ -0,0 +1,3 @@ +# Machine Learning and Assisted Troubleshooting Overview + +This section contains documentation regarding Netdata's troubleshooting and machine learning features.
\ No newline at end of file diff --git a/docs/category-overview-pages/maintenance-operations-on-netdata-agents.md b/docs/category-overview-pages/maintenance-operations-on-netdata-agents.md new file mode 100644 index 00000000..207a0bd3 --- /dev/null +++ b/docs/category-overview-pages/maintenance-operations-on-netdata-agents.md @@ -0,0 +1,3 @@ +# Maintenance operations on Netdata Agents Overview + +This section provides information on various actions you can take when maintaining a Netdata Agent.
\ No newline at end of file diff --git a/docs/category-overview-pages/metrics-streaming-and-replication.md b/docs/category-overview-pages/metrics-streaming-and-replication.md new file mode 100644 index 00000000..f473105f --- /dev/null +++ b/docs/category-overview-pages/metrics-streaming-and-replication.md @@ -0,0 +1,175 @@ +# Netdata Parents (Streaming and Replication) + +## What are they and why do we need them? + +A “Parent” is a Netdata Agent, like the ones we install on all our systems, but is configured as a central node that receives, stores and processes metrics data from other Netdata “Child” nodes in our infrastructure. + +Netdata Parents are flexible. You can have one big active-active cluster of Netdata Parents, or you can spread a lot of independent Parents across the infrastructure. + +This “distributed still centralized” setup provides a lot of benefits. Let’s see them: + +## Infrastructure-Level Dashboards: All Nodes in One Dashboard + +A Parent node receives and aggregates metrics data from all child nodes that push metrics to it, presenting all of them on a single, centralized dashboard. + +Metrics streaming between Netdata nodes is real-time and low-latency, so that the Parent can provide the same resolution and detail its children provide. + +Each chart on the Parent’s dashboard is automatically turned into a multi-node chart, allowing instant aggregation of the data across the entire dashboard. This is transparent and automatic for all kinds of charts, even application-specific ones. For example, when you have 2 PostgreSQL servers in your infrastructure, the parent will present one set of charts for PostgreSQL and these charts will include data from both servers. + +## Increased Data Retention: Store More, Learn More + +Netdata’s database (`dbengine`), supports multiple tiers of variable resolution for storing metrics’ samples. Tier 0 is the high-resolution one and usually stores per second data. 
Tier 1 is the middle resolution one, downsampling data to per minute. Tier 2 is the low-resolution one, downsampling data to per hour. With this setup, a default Netdata setup is usually able to maintain 2-3 days of high resolution and up to a year of low-resolution data, all in less than 1 GB of disk space. + +In many cases, however, organizations require a lot more retention than this. A Netdata Parent can be configured to have weeks or even months of high-resolution data and several years of low-resolution data for all its Child nodes, by allowing the Netdata database to grow to hundreds of GiBs or even several TiBs. + +## Monitoring Ephemeral Nodes: No Node Left Behind + +Production systems are often ephemeral by nature. In containerized and orchestrated environments, like Kubernetes, nodes may come and go due to scaling policies, maintenance tasks, or as part of regular operations. + +Netdata Parents come to the rescue in such scenarios. They can continuously receive metrics from ephemeral nodes during their lifecycle. As these nodes are removed or replaced, the Parent retains their performance history, essentially archiving the life of each node. + +The Netdata dashboards on the Parents automatically bring into the charts data from archived nodes when users pan the dashboard to the time-window these nodes were alive. This means that no data is lost and visibility is maintained across the entire lifespan of every node, regardless of its ephemeral nature. + +## Unified Alerts Management: Silence the Noise + +Each Netdata Agent is able to run health checks, trigger alerts and send notifications on its own. However, in a large-scale infrastructure with numerous nodes, each capable of generating alerts, managing these notifications can quickly become a challenge. Duplicate alerts and non-centralized management can lead to unnecessary noise, causing alert fatigue and possibly overlooking critical warnings. + +Netdata Parents provide a solution to this problem. 
By configuring a Parent node to handle all alerts and health checks, and disabling health monitoring on the Child nodes, you centralize your alerts management, meaning that all alerts are now generated from a single place, reducing noise and ensuring that each unique issue only triggers a single notification. + +In addition to making alert management more straightforward, this setup also allows for more refined control over your alert configurations. Instead of managing alert settings across multiple nodes, you can handle all configurations in one place, ensuring consistency and ease of management. + +## Offloading Production Systems: Prioritize Performance + +In a production environment, every bit of system resources is crucial. Minimizing the overhead due to monitoring and observability is vital to ensure optimal system performance. Although the Netdata Agent is designed to be lightweight and efficient, using a Netdata Parent can allow the Netdata Agents on your production systems to focus on the absolutely necessary for collecting metrics and pushing them to their Parent. + +On your production systems, by configuring the Netdata Agents to use the `alloc` database mode with 5-10 minutes of retention time and disabling health monitoring and Machine Learning (ML) processing, you significantly reduce the system resources consumed by the monitoring system. + +Netdata, with the `alloc` database mode, doesn't touch the disk at all (apart from logging - which can also be disabled). This approach eliminates any potential disk I/O impact from Netdata on your production applications, which could be particularly beneficial in I/O-sensitive environments. + +## Fault Tolerance and Redundancy: Ensure Continuous Monitoring + +Netdata Agents stream metrics to one Netdata Parent at a time. But more than one Parent can be configured on each child. The first available at any given time is used. 
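
As a rough model of the "first available Parent" rule described above, the following sketch walks an ordered destination list and returns the first reachable entry. It is not Netdata's implementation: `pick_parent` and the `is_available` callback are hypothetical names, and reachability is simulated with a set instead of a real connection attempt.

```python
def pick_parent(destinations, is_available):
    """Return the first destination that is reachable, or None if none is."""
    for destination in destinations:
        if is_available(destination):
            return destination
    return None


# Ordered as they would appear in the child's destination list.
destinations = ["PARENT_1_IP_ADDRESS:19999", "PARENT_2_IP_ADDRESS:19999"]

# Simulate Parent 1 being down for maintenance.
down = {"PARENT_1_IP_ADDRESS:19999"}
chosen = pick_parent(destinations, lambda d: d not in down)
print(chosen)
```

In this model a result of `None` simply means the caller should try again later, since only one Parent is used at any given time and the list is consulted in order.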
+ +Similarly, Netdata Parents can be configured to stream/proxy the data they receive to another Netdata Parent. And they can support multiple Parents too, one of which will be used at any given time. + +Configuration allows setting up a circular streaming setup. Parent A streams to Parent B and Parent B streams to Parent A. Child nodes are configured to stream to any of Parents A and B and they will automatically fall back and switch parents as necessary. + +With the replication feature (enabled by default), all nodes replicate missing data on their Parent, before streaming live metrics, filling up any gap the Parent may have. + +The same setup can work for 2 or even more parents, to form an active-active multi-node cluster. Child nodes can connect to any of the parent nodes available and the parent nodes will automatically replicate and stream metrics to each other. + +The setup is optimized even for wide-area connections between child nodes and parents, or for cases where the bandwidth between child nodes and parents has a cost associated with it. At any given time each child node sends its data only once. The parents then replicate and stream this data to each other. + +## Security and Isolation: Protect Your Production Systems + +Parent nodes can be set up in your organization's Demilitarized Zone (DMZ), acting as a protective barrier or application firewall, shielding your production Netdata agents from the outside world. + +With Netdata Parents configured, the Netdata Agents running on your production systems need only one connection to these parents. They don’t need to run data queries, they will never send alert notifications, or even connect to Netdata Cloud. + +Especially for Netdata Cloud, when the Parent node is connected to Netdata Cloud, it registers its Child nodes to it and can serve all functions required by the Cloud on behalf of the Child nodes. 
So, although only the parent is connected to Netdata Cloud, there is no difference in the user features you enjoy on Netdata Cloud in regard to your production systems. They will all be there. + +## FAQ about Netdata Parents + +### How much can a Parent node scale? + +For about 1 million real-time metrics, with a default configuration: + +- collected and streamed to it per second, +- stored in 3 database tiers (high, mid, low resolution), +- with ML training and anomaly detection running, +- with health checks running for alerts and notifications + +and about 2 TiB of storage for metrics, you will need about 5-8 CPU cores and 32GiB of RAM. On such a setup you can have: + +- 15 days of high resolution metrics +- 3 months of mid resolution metrics +- 1 year of low resolution metrics + +For such a setup, we recommend a system with 16 CPU cores so that there is spare capacity for queries. More RAM and faster disks will give faster queries. + +So, depending on the number of metrics per node you have and the size of your Parents, you may be able to aggregate 200 to 500 nodes per Parent. + +### If I set up 2 active-active parents, will I be able to have more Child nodes stream to them? + +No. When you set up an active-active cluster, even if child nodes connect randomly to one or the other, all the parent nodes receive all the metrics of all the child nodes. So, all of them do all the work. + +We are currently working on a feature that lets Parent nodes detect that they receive ML information with the streamed metric data (they receive it already but ignore it), so that they do not train their own ML models and run anomaly detection again for the child node. This is not ready yet. + +### How much retention do the child nodes need? + +Child nodes need to have only the retention required in order to connect to another Parent if one fails or stops for maintenance. + +- If you have an active-active cluster of parents, 5 to 10 minutes in `alloc` mode is enough. 
+- If you have only 1 parent, it would be better to run the child nodes with `dbengine` so that they will have enough retention to backfill the parent node if it stops for a few hours for maintenance.
+
+### Does streaming between child nodes and parents support encryption?
+
+Yes. You can configure your parent nodes to enable TLS and configure the child nodes to connect with TLS to them. The streaming connection is also compressed with LZ4 and this works even on top of TLS.
+
+### Can I have an HTTP proxy between parent and child nodes?
+
+No. The streaming protocol works on the same port as the internal web server of Netdata Agents, but the protocol is not HTTP-friendly and cannot be understood by HTTP proxy servers.
+
+### Should I load balance the parents with a TCP load balancer?
+
+Although this can be done and for streaming between child and parent nodes it could work, we recommend not doing it. It can lead to several kinds of problems.
+
+It is better to configure all the parent nodes directly in the child nodes' `stream.conf`. The child nodes will do everything in their power to find a parent node to connect to and they will never give up.
+
+### When I have an active-active cluster of parents, will I receive alert notifications from both of them?
+
+If both are configured to run health checks and trigger alerts, yes.
+
+We recommend using Netdata Cloud to avoid receiving duplicate alert notifications. Netdata Cloud deduplicates alert notifications so that you will receive them only once. On top of that, you can control silencing and routing directly from the Netdata Cloud UI.
+
+### When I have only Parents connected to Netdata Cloud, will I be able to use the Functions feature on my child nodes?
+
+Yes.
+
+Functions is a feature that lets data collection plugins expose functions that can be run from the dashboard to view more detailed information about a data collection. For example, `apps.plugin` exposes the `processes` function that returns a list of all the processes running, together with information about their CPU utilization, memory consumption, disk I/O operations, bandwidth, and a lot more.
+
+When a parent receives a Function request, it forwards it to the plugin that exposes it. If the plugin is available over a streaming connection, the parent will forward the request to the socket it receives metrics from. This process will be repeated even if many parents are chained in order to reach the child.
+
+### If I have a set of 2 active-active parents and get one out for maintenance for a few hours, will it have missing data when it returns back online?
+
+There are 2 reasons you may have gaps in your data after you bring it back online:
+
+1. Replication does not replicate metrics that are not actively collected. So, when the parent comes back, if there are samples that this parent does not have, for metrics that are not currently being collected, these samples will not be propagated to that parent. [We are working to fix this issue](https://github.com/netdata/netdata/issues/15198).
+2. If the parent has been offline for a long time and the child nodes run in db mode `alloc`, you need to plan how you will bring this parent back online. Child nodes in this mode do not have enough retention to backfill the parent and if they connect to it before the other parent, you will end up with missing information on that parent.
+
+The simplest way to solve this is to block at the firewall all connections to port 19999 from child nodes, but allow connections from the other parent nodes. Once replication finishes for all nodes, you can unblock the connections from child nodes to it.
+
+### I got a parent out of maintenance but it replicates (backfills) missing data slowly. Can I speed it up?
+
+Yes, there is a setting in `netdata.conf` under section `[db]` called `replication threads`. The default value is 1.
+
+Usually, each thread is able to replicate about 2-5 million samples per second. We suggest setting this to 5 threads for all parents. In general, avoid using too many threads, because you risk congesting the disks and/or the CPU cores available. Keep in mind that this setting is needed on the sending parent.
+
+There is no need to increase this number on child nodes. Each node has one replication sender, so when hundreds of nodes are replicating to a parent, there are already a lot of senders pushing metrics to it.
+
+### I have multiple active-active parents. Which one is used by Netdata Cloud for queries?
+
+When you have multiple parents available, the one that is furthest away from the child node is used by Netdata Cloud, unless it does not have the data required.
+
+This works like this: The child has `hops = 0`. Each parent receiving metrics for this child increases the `hops` by 1. So the first parent will have `hops = 1`, the second parent will have `hops = 2` and so on.
+
+Netdata Cloud knows the retention of each parent. So, when it needs data from this child, it first checks the available retention each parent has for it and then it uses the parent with the highest `hops`. If no parent is available and the child node is directly connected to Netdata Cloud, it uses the child.
+
+### Is there a way to balance child nodes to the parent nodes of an active-active cluster?
+
+If you have 2 parent nodes A and B, you can configure them on half the child nodes as A, B, and on the other half as B, A. The child nodes will connect to the first available (left to right). If both A and B are online, half of the child nodes will connect to A and the other half to B.
+
+Keep in mind, however, that if you restart a parent, all the child nodes that were connected to it will automatically reconnect to the other parent. Once this happens, the child nodes will stay connected to it.
+
+### Is there a way to get notified when a child gets disconnected?
+
+There are 2 kinds of production nodes:
+1. **Permanent nodes**
+   These are nodes that should be available permanently and if they disconnect an alert should be triggered to notify you.
+   By default, all nodes are considered permanent (not ephemeral).
+2. **Ephemeral nodes**
+   These are nodes that are ephemeral by nature and they may shut down at any point in time without any impact on the services you run.
+
+To set the ephemeral flag on a node, edit its `netdata.conf` and in the `[health]` section set `is ephemeral = yes`. This setting is propagated to parent nodes and Netdata Cloud.
+
+When using Netdata Cloud (via a parent or directly) and a permanent node gets disconnected, Netdata Cloud sends node disconnection notifications.
diff --git a/docs/category-overview-pages/misc-overview.md b/docs/category-overview-pages/misc-overview.md
new file mode 100644
index 00000000..dbb11e9b
--- /dev/null
+++ b/docs/category-overview-pages/misc-overview.md
@@ -0,0 +1,3 @@
+# Miscellaneous material
+
+This section contains material that will be moved to new locations as we see fit. We keep it here to make it accessible while we make these changes.
\ No newline at end of file
diff --git a/docs/category-overview-pages/monitor-your-infrastructure.md b/docs/category-overview-pages/monitor-your-infrastructure.md
new file mode 100644
index 00000000..3582e88a
--- /dev/null
+++ b/docs/category-overview-pages/monitor-your-infrastructure.md
@@ -0,0 +1,3 @@
+# Monitor your Infrastructure Overview
+
+This section contains documentation on how you can use Netdata Cloud and its features to monitor your entire infrastructure.
\ No newline at end of file
diff --git a/docs/category-overview-pages/netdata-apis.md b/docs/category-overview-pages/netdata-apis.md
new file mode 100644
index 00000000..82d1c175
--- /dev/null
+++ b/docs/category-overview-pages/netdata-apis.md
@@ -0,0 +1,5 @@
+# Netdata APIs Overview
+
+This section contains information about Netdata's APIs.
+
+You can access the Netdata Agent's API through swagger UI [here](/api).
\ No newline at end of file
diff --git a/docs/category-overview-pages/netdata-architecture.md b/docs/category-overview-pages/netdata-architecture.md
new file mode 100644
index 00000000..70f12659
--- /dev/null
+++ b/docs/category-overview-pages/netdata-architecture.md
@@ -0,0 +1,3 @@
+# Netdata Architecture Overview
+
+This section's purpose is to explain the architecture of Netdata, the role of the Agent and the Cloud, and more.
\ No newline at end of file
diff --git a/docs/category-overview-pages/netdata-dashboards-and-visualizations.md b/docs/category-overview-pages/netdata-dashboards-and-visualizations.md
new file mode 100644
index 00000000..cc930436
--- /dev/null
+++ b/docs/category-overview-pages/netdata-dashboards-and-visualizations.md
@@ -0,0 +1,3 @@
+# Netdata Dashboards and Visualizations Overview
+
+This section provides documentation about all the visualization operations, features and insights that Netdata provides.
\ No newline at end of file
diff --git a/docs/category-overview-pages/optimizing-metrics-database.md b/docs/category-overview-pages/optimizing-metrics-database.md
new file mode 100644
index 00000000..fdbd3b69
--- /dev/null
+++ b/docs/category-overview-pages/optimizing-metrics-database.md
@@ -0,0 +1,3 @@
+# Optimizing Metrics Database Overview
+
+This section contains documentation to help you understand how the metrics DB works, understand the key features and configure them to suit your needs.
\ No newline at end of file
diff --git a/docs/category-overview-pages/reverse-proxies.md b/docs/category-overview-pages/reverse-proxies.md
new file mode 100644
index 00000000..07c8b9bd
--- /dev/null
+++ b/docs/category-overview-pages/reverse-proxies.md
@@ -0,0 +1,34 @@
+# Running Netdata behind a reverse proxy
+
+If you need to access a Netdata agent's user interface or API in a production environment we recommend you put Netdata behind
+another web server and secure access to the dashboard via SSL, user authentication and firewall rules.
+
+A dedicated web server also provides more robustness and capabilities than the Agent's [internal web server](https://github.com/netdata/netdata/blob/master/web/README.md).
+
+We have documented running behind
+[nginx](https://github.com/netdata/netdata/blob/master/docs/Running-behind-nginx.md),
+[Apache](https://github.com/netdata/netdata/blob/master/docs/Running-behind-apache.md),
+[HAProxy](https://github.com/netdata/netdata/blob/master/docs/Running-behind-haproxy.md),
+[Lighttpd](https://github.com/netdata/netdata/blob/master/docs/Running-behind-lighttpd.md),
+[Caddy](https://github.com/netdata/netdata/blob/master/docs/Running-behind-caddy.md),
+and [H2O](https://github.com/netdata/netdata/blob/master/docs/Running-behind-h2o.md).
+If you prefer a different web server, we suggest you follow the documentation for nginx and tell us how you did it
+ by adding your own "Running behind webserverX" document.
+
+When you run Netdata behind a reverse proxy, we recommend you firewall protect all your Netdata servers, so that only the web server IP will be allowed to directly access Netdata. To do this, run this on each of your servers (or use your firewall manager):
+
+```sh
+PROXY_IP="1.2.3.4"
+iptables -t filter -I INPUT -p tcp --dport 19999 \! -s "${PROXY_IP}" -m conntrack --ctstate NEW -j DROP
+```
+
+The above will prevent anyone except your web server from accessing a Netdata dashboard running on the host.
+
+You can also use `netdata.conf`:
+
+```
+[web]
+ allow connections from = localhost 1.2.3.4
+```
+
+Of course, you can add more IPs.
diff --git a/docs/category-overview-pages/secure-nodes.md b/docs/category-overview-pages/secure-nodes.md
new file mode 100644
index 00000000..33e205f0
--- /dev/null
+++ b/docs/category-overview-pages/secure-nodes.md
@@ -0,0 +1,177 @@
+# Secure your nodes
+
+Netdata is a monitoring system. It should be protected, the same way you protect all your admin apps. We assume Netdata
+will be installed privately, for your eyes only.
+
+Upon installation, the Netdata Agent serves the **local dashboard** at port `19999`. If the node is accessible to the
+internet at large, anyone can access the dashboard and your node's metrics at `http://NODE:19999`. We made this decision
+so that the local dashboard was immediately accessible to users, and so that we don't dictate how professionals set up
+and secure their infrastructures.
+
+Viewers will be able to get some information about the system Netdata is running on. This information is everything the dashboard
+provides. The dashboard includes a list of the services each system runs (the legends of the charts under the `Systemd Services`
+section), the applications running (the legends of the charts under the `Applications` section), the disks of the system and
+their names, the user accounts of the system that are running processes (the `Users` and `User Groups` section of the dashboard),
+the network interfaces and their names (not the IPs) and detailed information about the performance of the system and its applications.
+
+This information is not sensitive (meaning that it is not your business data), but **it is important for possible attackers**.
+It gives them clues on what to check and what to try and, in the case of a DDoS against your applications, tells them whether
+the attack is working.
+
+Also, viewers could use Netdata itself to stress your servers. Although the Netdata daemon runs unprivileged, with the minimum
+process priority (scheduling priority `idle` - lower than nice 19) and adjusts its OutOfMemory (OOM) score to 1000 (so that it
+will be first to be killed by the kernel if the system starves for memory), some pressure can be applied on your systems if
+someone attempts a DDoS against Netdata.
+
+Instead of dictating how to secure your infrastructure, we give you many options to establish security best practices
+that align with your goals and your organization's standards.
+
+- [Disable the local dashboard](#disable-the-local-dashboard): **Simplest and recommended method** for those who have
+ added nodes to Netdata Cloud and view dashboards and metrics there.
+
+- [Expose Netdata only in a private LAN](#expose-netdata-only-in-a-private-lan). Simplest and recommended method for those who do not use Netdata Cloud.
+
+- [Fine-grained access control](#fine-grained-access-control): Allow local dashboard access from
+ only certain IP addresses, such as a trusted static IP or connections from behind a management LAN. Full support for Netdata Cloud.
+
+- [Use a reverse proxy (authenticating web server in proxy mode)](#use-an-authenticating-web-server-in-proxy-mode): Password-protect
+ a local dashboard and enable TLS to secure it. Full support for Netdata Cloud.
+
+- [Use Netdata parents as Web Application Firewalls](#use-netdata-parents-as-web-application-firewalls)
+
+- [Other methods](#other-methods) lists some less common methods of protecting Netdata.
+
+## Disable the local dashboard
+
+This is the _recommended method for those who have connected their nodes to Netdata Cloud_ and prefer viewing real-time
+metrics using the War Room Overview, Nodes tab, and Cloud dashboards.
+
+You can disable the local dashboard (and API) but retain the encrypted Agent-Cloud link
+([ACLK](https://github.com/netdata/netdata/blob/master/aclk/README.md)) that
+allows you to stream metrics on demand from your nodes via the Netdata Cloud interface. This change mitigates all
+concerns about revealing metrics and system design to the internet at large, while keeping all the functionality you
+need to view metrics and troubleshoot issues with Netdata Cloud.
+
+Open `netdata.conf` with `./edit-config netdata.conf`. Scroll down to the `[web]` section, find the
+`mode = static-threaded` setting, and change it to `none`.
+
+```conf
+[web]
+ mode = none
+```
+
+Save and close the editor, then [restart your Agent](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md)
+using `sudo systemctl restart netdata`.
+If you try to visit the local dashboard at `http://NODE:19999` again, the connection will fail because
+that node no longer serves its local dashboard.
+
+> See the [configuration basics doc](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md) for details on how to find
+> `netdata.conf` and use
+> `edit-config`.
+
+## Expose Netdata only in a private LAN
+
+If your organisation has a private administration and management LAN, you can bind Netdata to this network interface on all your servers.
+This is done in `netdata.conf` with these settings:
+
+```
+[web]
+ bind to = 10.1.1.1:19999 localhost:19999
+```
+
+You can bind Netdata to multiple IPs and ports. If you use hostnames, Netdata will resolve them and use all the IPs
+(in the above example `localhost` usually resolves to both `127.0.0.1` and `::1`).
+
+**This is the best and the suggested way to protect Netdata**. Your systems **should** have a private administration and management
+LAN, so that all management tasks are performed without any possibility of them being exposed on the internet.
+
+For cloud based installations, if your cloud provider does not provide such a private LAN (or if you use multiple providers),
+you can create a virtual management and administration LAN with tools like `tincd` or `gvpe`. These tools create a mesh VPN
+allowing all servers to communicate securely and privately. Your administration stations join this mesh VPN to get access to
+management and administration tasks on all your cloud servers.
+
+For `gvpe` we have developed a [simple provisioning tool](https://github.com/netdata/netdata-demo-site/tree/master/gvpe) you
+may find handy (it includes statically compiled `gvpe` binaries for Linux and FreeBSD, and also a script to compile `gvpe`
+on your macOS system). We use this to create a management and administration LAN for all Netdata demo sites (spread all over
+the internet using multiple hosting providers).
+
+## Fine-grained access control
+
+If you want to keep using the local dashboard, but don't want it exposed to the internet, you can restrict access with
+[access lists](https://github.com/netdata/netdata/blob/master/web/server/README.md#access-lists). This method also fully
+retains the ability to stream metrics
+on-demand through Netdata Cloud.
+
+The `allow connections from` setting helps you allow only certain IP addresses or FQDN/hostnames, such as a trusted
+static IP, only `localhost`, or connections from behind a management LAN.
+
+By default, this setting is `localhost *`. This setting allows connections from `localhost` in addition to _all_
+connections, using the `*` wildcard. You can change this setting using Netdata's [simple
+patterns](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md).
+
+```conf
+[web]
+ # Allow only localhost connections
+ allow connections from = localhost
+
+ # Allow only from management LAN running on `10.X.X.X`
+ allow connections from = 10.*
+
+ # Allow connections only from a specific FQDN/hostname
+ allow connections from = example*
+```
+
+The `allow connections from` setting is global and restricts access to the dashboard, badges, streaming, API, and
+`netdata.conf`, but you can also set each of those access lists more granularly if you choose:
+
+```conf
+[web]
+ allow connections from = localhost *
+ allow dashboard from = localhost *
+ allow badges from = *
+ allow streaming from = *
+ allow netdata.conf from = localhost fd* 10.* 192.168.* 172.16.* 172.17.* 172.18.* 172.19.* 172.20.* 172.21.* 172.22.* 172.23.* 172.24.* 172.25.* 172.26.* 172.27.* 172.28.* 172.29.* 172.30.* 172.31.*
+ allow management from = localhost
+```
+
+See the [web server](https://github.com/netdata/netdata/blob/master/web/server/README.md#access-lists) docs for additional details
+about access lists. You can take
+access lists one step further by [enabling TLS](https://github.com/netdata/netdata/blob/master/web/server/README.md#enabling-tls-support) to encrypt data from the local
+dashboard in transit. The connection to Netdata Cloud is always secured with TLS.
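+
+As a minimal sketch of what enabling TLS on the Agent's web server can look like, assuming a key and certificate already
+exist at the hypothetical paths below (check the linked web server documentation for the exact options your version supports):
+
+```conf
+[web]
+ # Hypothetical paths; point these at your own key and certificate files
+ ssl key = /etc/netdata/ssl/key.pem
+ ssl certificate = /etc/netdata/ssl/cert.pem
+```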
+
+## Use an authenticating web server in proxy mode
+
+Use one web server to provide authentication in front of **all your Netdata servers**. You will then access all your Netdata Agents with
+URLs like `http://{HOST}/netdata/{NETDATA_HOSTNAME}/` and authentication will be shared among all of them (you sign in once for all your servers).
+Instructions are provided on how to set the proxy configuration to have Netdata run behind
+[nginx](https://github.com/netdata/netdata/blob/master/docs/Running-behind-nginx.md),
+[HAProxy](https://github.com/netdata/netdata/blob/master/docs/Running-behind-haproxy.md),
+[Apache](https://github.com/netdata/netdata/blob/master/docs/Running-behind-apache.md),
+[Lighttpd](https://github.com/netdata/netdata/blob/master/docs/Running-behind-lighttpd.md),
+[Caddy](https://github.com/netdata/netdata/blob/master/docs/Running-behind-caddy.md), and
+[H2O](https://github.com/netdata/netdata/blob/master/docs/Running-behind-h2o.md).
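+
+A minimal sketch of this idea for nginx, assuming a single Agent reachable at the hypothetical address `10.1.1.1`, a
+hypothetical hostname, and a password file you have created yourself (for example with `htpasswd`). It is an illustration only;
+the linked guides cover complete configurations:
+
+```nginx
+server {
+    listen 80;
+    server_name netdata.example.com;   # hypothetical hostname
+
+    location /netdata/node1/ {
+        # Ask for a username and password before proxying anything
+        auth_basic           "Netdata";
+        auth_basic_user_file /etc/nginx/netdata.htpasswd;   # hypothetical path
+
+        # The trailing slash makes nginx replace the /netdata/node1/ prefix with /
+        proxy_pass http://10.1.1.1:19999/;
+        proxy_set_header Host $host;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+    }
+}
+```
+
+In practice you would also enable TLS on this server, so that the credentials are not sent in clear text.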
+
+## Use Netdata parents as Web Application Firewalls
+
+The Netdata Agents you install on your production systems do not need direct access to the Internet. Even when you use
+Netdata Cloud, you can appoint one or more Netdata Parents to act as border gateways or application firewalls, isolating
+your production systems from the rest of the world. Netdata
+Parents receive metric data from Netdata Agents or other Netdata Parents on one side, and serve most queries using their own
+copy of the data to satisfy dashboard requests on the other side.
+
+For more information see [Streaming and replication](https://github.com/netdata/netdata/blob/master/docs/metrics-storage-management/enable-streaming.md).
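+
+As a rough sketch, assuming a parent reachable at the hypothetical address `10.1.1.1` and an API key you generate yourself
+(for example with `uuidgen`), the child side of `stream.conf` could look like this:
+
+```conf
+# stream.conf on each child (production system)
+[stream]
+ enabled = yes
+ # One or more parents; the address below is a placeholder
+ destination = 10.1.1.1:19999
+ # Placeholder key; use your own generated UUID
+ api key = 11111111-2222-3333-4444-555555555555
+```
+
+The parent then accepts that key in its own `stream.conf`:
+
+```conf
+# stream.conf on the parent
+[11111111-2222-3333-4444-555555555555]
+ enabled = yes
+```
+
+The linked streaming documentation describes the full set of options.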
+
+## Other methods
+
+Of course, there are many more methods you could use to protect Netdata:
+
+- Bind Netdata to localhost and use `ssh -L 19998:127.0.0.1:19999 remote.netdata.ip` to forward connections of local port 19998 to remote port 19999.
+This way you can ssh to a Netdata server and then use `http://127.0.0.1:19998/` on your computer to access the remote Netdata dashboard.
+
+- If you always connect from static IPs, you can use firewall rules (for example `iptables`) to allow direct access to your Netdata servers without authentication,
+from all your static IPs.
+
+- Install all your Netdata in **headless data collector** mode, forwarding all metrics in real-time to a parent
+ Netdata server, which will be protected with authentication using an nginx server running locally at the parent
+ Netdata server. This requires more resources (you will need a bigger parent Netdata server), but does not require
+ any firewall changes, since all the child Netdata servers will not be listening for incoming connections.
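+
+For the first method above, a minimal sketch of the `netdata.conf` side, assuming the Agent should listen only on the loopback
+interface so that it is reachable solely through the SSH tunnel:
+
+```conf
+[web]
+ bind to = localhost:19999
+```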
+
diff --git a/docs/category-overview-pages/troubleshooting-overview.md b/docs/category-overview-pages/troubleshooting-overview.md
new file mode 100644
index 00000000..60406edd
--- /dev/null
+++ b/docs/category-overview-pages/troubleshooting-overview.md
@@ -0,0 +1,5 @@
+# Troubleshooting and machine learning
+
+In this section you can learn about Netdata's advanced tools that can assist you in troubleshooting issues with
+your infrastructure, to facilitate the identification of a root cause.
+
diff --git a/docs/category-overview-pages/visualizations-overview.md b/docs/category-overview-pages/visualizations-overview.md
new file mode 100644
index 00000000..d07af062
--- /dev/null
+++ b/docs/category-overview-pages/visualizations-overview.md
@@ -0,0 +1,4 @@
+# Visualizations, charts and dashboards
+
+In this section you can learn about the various ways Netdata visualizes the collected metrics at an infrastructure level with Netdata Cloud
+and at a single node level, with the Netdata Agent Dashboard. |