Diffstat (limited to 'docs/observability-centralization-points/metrics-centralization-points')
6 files changed, 336 insertions, 0 deletions
diff --git a/docs/observability-centralization-points/metrics-centralization-points/README.md b/docs/observability-centralization-points/metrics-centralization-points/README.md
new file mode 100644
index 000000000..812b493d7
--- /dev/null
+++ b/docs/observability-centralization-points/metrics-centralization-points/README.md
@@ -0,0 +1,48 @@
+
+# Metrics Centralization Points (Netdata Parents)
+
+```mermaid
+flowchart BT
+    C1["Netdata Child 1"]
+    C2["Netdata Child 2"]
+    C3["Netdata Child N"]
+    P1["Netdata Parent 1"]
+    C1 -->|stream| P1
+    C2 -->|stream| P1
+    C3 -->|stream| P1
+```
+
+Netdata **Streaming and Replication** copies the recent past samples (replication) and in real-time all new samples collected (streaming) from production systems (Netdata Children) to metrics centralization points (Netdata Parents). The Netdata Parents then maintain the database for these metrics, according to their retention settings.
+
+Each production system (Netdata Child) can stream to **only one** Netdata Parent at a time. The configuration allows configuring multiple Netdata Parents for high availability, but only the first found working will be used.
+
+Netdata Parents receive metric samples **from multiple** production systems (Netdata Children) and have the option to re-stream them to another Netdata Parent. This allows building an infinite hierarchy of Netdata Parents. It also enables the configuration of Netdata Parents Clusters, for high availability.
+
+| Feature | Netdata Child (production system) | Netdata Parent (centralization point) |
+|:---------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------:|
+| Metrics Retention | Can be minimized, or switched to mode `ram` or `alloc` to save resources. Some retention is required in case network errors introduce disconnects. | Common retention settings for all systems aggregated to it. |
+| Machine Learning | Can be disabled (enabled by default). | Runs Anomaly Detection for all systems aggregated to it. |
+| Alerts & Notifications | Can be disabled (enabled by default). | Runs health checks and sends notifications for all systems aggregated to it. |
+| API and Dashboard | Can be disabled (enabled by default). | Serves the dashboard for all systems aggregated to it, using its own retention. |
+| Exporting Metrics | Not required (enabled by default). | Exports the samples of all metrics collected by the systems aggregated to it. |
+| Netdata Functions | Netdata Child must be online. | Forwards Functions requests to the Children connected to it. |
+| Connection to Netdata Cloud | Not required. | Each Netdata Parent registers to Netdata Cloud all systems aggregated to it. |
+
+## Supported Configurations
+
+For Netdata Children:
+
+1. **Full**: Full Netdata functionality is available at the Children. This means running machine learning, alerts, notifications, having the local dashboard available, and generally all Netdata features enabled. This is the default.
+2. **Thin**: The Children are only collecting and forwarding metrics to a Parent. Some local retention may exist to avoid missing samples in case of network issues or Parent maintenance, but everything else is disabled.
+
+For Netdata Parents:
+
+1. **Standalone**: The Parent is standalone, either the only Parent available in the infrastructure, or the top-most of an hierarchy of Parents.
+2. **Cluster**: The Parent is part of a cluster of Parents, all having the same data, from the same Children. A Cluster of Parents offers high-availability.
+3. **Proxy**: The Parent receives metrics and stores them locally, but it also forwards them to a Grand Parent.
+
+A Cluster is configured as a number of circular **Proxies**, ie. each of the nodes in a cluster has all the others configured as its Parents. So, if multiple levels of metrics centralization points (Netdata Parents) are required, only the top-most level can be a cluster.
+
+## Best Practices
+
+Refer to [Best Practices for Observability Centralization Points](/docs/observability-centralization-points/best-practices.md).
diff --git a/docs/observability-centralization-points/metrics-centralization-points/clustering-and-high-availability-of-netdata-parents.md b/docs/observability-centralization-points/metrics-centralization-points/clustering-and-high-availability-of-netdata-parents.md
new file mode 100644
index 000000000..17a10b02e
--- /dev/null
+++ b/docs/observability-centralization-points/metrics-centralization-points/clustering-and-high-availability-of-netdata-parents.md
@@ -0,0 +1,50 @@
+# Clustering and High Availability of Netdata Parents
+
+```mermaid
+flowchart BT
+    C1["Netdata Child 1"]
+    C2["Netdata Child 2"]
+    C3["Netdata Child N"]
+    P1["Netdata Parent 1"]
+    P2["Netdata Parent 2"]
+    C1 & C2 & C3 -->|stream| P1
+    P1 -->|stream| P2
+    C1 & C2 & C3 .->|failover| P2
+    P2 .->|failover| P1
+```
+
+Netdata supports building Parent clusters of 2+ nodes. Clustering and high availability works like this:
+
+1. All Netdata Children are configured to stream to all Netdata Parents. The first one found working will be used by each Netdata Child and the others will be automatically used if and when this connection is interrupted.
+2. The Netdata Parents are configured to stream to all other Netdata Parents. For each of them, the first found working will be used and the others will be automatically used if and when this connection is interrupted.
+
+All the Netdata Parents in such a cluster will receive all the metrics of all Netdata Children connected to any of them. They will also receive the metrics all the other Netdata Parents have.
+
+In case there is a failure on any of the Netdata Parents, the Netdata Children connected to it will automatically failover to another available Netdata Parent, which now will attempt to re-stream all the metrics it receives to the other available Netdata Parents.
+
+Netdata Cloud will receive registrations for all Netdata Children from all the Netdata Parents. As long as at least one of the Netdata Parents is connected to Netdata Cloud, all the Netdata Children will be available on Netdata Cloud.
+
+Netdata Children need to maintain a retention only for the time required to switch Netdata Parents. When Netdata Children connect to a Netdata Parent, they negotiate the available retention and any missing data on the Netdata Parent are replicated from the Netdata Children.
+
+## Restoring a Netdata Parent after maintenance
+
+Given the [replication limitations](/docs/observability-centralization-points/metrics-centralization-points/replication-of-past-samples.md#replication-limitations), special care is needed when restoring a Netdata Parent after some long maintenance work on it.
+
+If the Netdata Children do not have enough retention to replicate the missing data on this Netdata Parent, it is preferable to block access to this Netdata Parent from the Netdata Children, until it replicates the missing data from the other Netdata Parents.
+
+To block access from Netdata Children, and still allow access from other Netdata Parent siblings:
+
+1. Use `iptables` to block access to port 19999 from Netdata Children to the restored Netdata Parent, or
+2. Use separate streaming API keys (in `stream.conf`) for Netdata Children and Netdata Parents, and disable the API key used by Netdata Children, until the restored Netdata Parent has been synchronized.
+
+## Duplicating a Parent
+
+The easiest way is to `rsync` the directory `/var/cache/netdata` from the existing Netdata Parent to the new Netdata Parent.
+
+> Important: Starting the new Netdata Parent with default settings, may delete the new files in `/var/cache/netdata` to apply the default disk size constraints. Therefore it is important to set the right retention settings in the new Netdata Parent before starting it up with the copied files.
+
+To configure retention at the new Netdata Parent, set in `netdata.conf` the following to at least the values the old Netdata Parent has:
+
+- `[db].dbengine multihost disk space MB`, this is the max disk size for `tier0`. The default is 256MiB.
+- `[db].dbengine tier 1 multihost disk space MB`, this is the max disk space for `tier1`. The default is 50% of `tier0`.
+- `[db].dbengine tier 2 multihost disk space MB`, this is the max disk space for `tier2`. The default is 50% of `tier1`.
diff --git a/docs/observability-centralization-points/metrics-centralization-points/configuration.md b/docs/observability-centralization-points/metrics-centralization-points/configuration.md
new file mode 100644
index 000000000..bf2aa98db
--- /dev/null
+++ b/docs/observability-centralization-points/metrics-centralization-points/configuration.md
@@ -0,0 +1,105 @@
+# Configuring Metrics Centralization Points
+
+Metrics streaming configuration for both Netdata Children and Parents is done via `stream.conf`.
+
+`netdata.conf` and `stream.conf` have the same `ini` format, but `netdata.conf` is considered a non-sensitive file, while `stream.conf` contains API keys, IPs and other sensitive information that enable communication between Netdata agents.
+
+`stream.conf` has 2 main sections:
+
+- The `[stream]` section includes options for the **sending Netdata** (ie Netdata Children, or Netdata Parents that stream to Grand Parents, or to other sibling Netdata Parents in a cluster).
+- The rest includes multiple sections that define API keys for the **receiving Netdata** (ie. Netdata Parents).
+
+## Edit `stream.conf`
+
+To edit `stream.conf`, run this on your terminal:
+
+```bash
+cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
+sudo ./edit-config stream.conf
+```
+
+Your editor will open, with defaults and commented `stream.conf` options.
+
+## Configuring a Netdata Parent
+
+To enable the reception of metrics from Netdata Children, generate a random API key with this command:
+
+```bash
+uuidgen
+```
+
+Then, copy the UUID generated, [edit `stream.conf`](#edit-streamconf), find the section that reads like the following and replace `API_KEY` with the UUID you generated:
+
+```ini
+[API_KEY]
+    # Accept metrics streaming from other Agents with the specified API key
+    enabled = yes
+```
+
+Save the file and restart Netdata.
+
+## Configuring Netdata Children
+
+To enable streaming metrics to a Netdata Parent, [edit `stream.conf`](#edit-streamconf), and at the `[stream]` section at the top, set:
+
+```ini
+[stream]
+    # Stream metrics to another Netdata
+    enabled = yes
+    # The IP and PORT of the parent
+    destination = PARENT_IP_ADDRESS:19999
+    # The shared API key, generated by uuidgen
+    api key = API_KEY
+```
+
+Save the file and restart Netdata.
+
+## Enable TLS/SSL Communication
+
+While encrypting the connection between your parent and child nodes is recommended for security, it's not required to get started.
+
+This example uses self-signed certificates.
+
+> **Note**
+> This section assumes you have read the documentation on [how to edit the Netdata configuration files](/docs/netdata-agent/configuration/README.md).
+<!-- here we need link to the section that will contain the restarting instructions -->
+
+1. **Parent node**
+    To generate an SSL key and certificate using `openssl`, take a look at the related section around [Securing Netdata Agents](/src/web/server/README.md#enable-httpstls-support) in our Documentation.
+
+2. **Child node**
+    Update `stream.conf` to enable SSL/TLS and allow self-signed certificates. Append ':SSL' to the destination and uncomment 'ssl skip certificate verification'.
+
+    ```conf
+    [stream]
+        enabled = yes
+        destination = 203.0.113.0:SSL
+        ssl skip certificate verification = yes
+        api key = 11111111-2222-3333-4444-555555555555
+    ```
+
+3. Restart the Netdata Agent on both the parent and child nodes, to stream encrypted metrics using TLS/SSL.
+
+
+
+## Troubleshooting Streaming Connections
+
+You can find any issues related to streaming at Netdata logs.
+
+### From the UI
+
+Netdata logs to systemd-journald by default, and its logs are available at the `Logs` tab of the UI. At the `MESSAGE_ID` field look for `Netdata connection from child` and `Netdata connection to parent`.
+
+### From the terminal
+
+On the Parents:
+
+```bash
+journalctl -r --namespace=netdata MESSAGE_ID=ed4cdb8f1beb4ad3b57cb3cae2d162fa
+```
+
+On the Children:
+
+```bash
+journalctl -r --namespace=netdata MESSAGE_ID=6e2e3839067648968b646045dbf28d66
+```
diff --git a/docs/observability-centralization-points/metrics-centralization-points/faq.md b/docs/observability-centralization-points/metrics-centralization-points/faq.md
new file mode 100644
index 000000000..027dfc748
--- /dev/null
+++ b/docs/observability-centralization-points/metrics-centralization-points/faq.md
@@ -0,0 +1,70 @@
+# FAQ on Metrics Centralization Points
+
+## How much can a Netdata Parent node scale?
+
+Netdata Parents generally scale well. According [to our tests](https://blog.netdata.cloud/netdata-vs-prometheus-performance-analysis/) Netdata Parents scale better than Prometheus for the same workload: -35% CPU utilization, -49% Memory Consumption, -12% Network Bandwidth, -98% Disk I/O, -75% Disk footprint.
+
+For more information, Check [Sizing Netdata Parents](/docs/observability-centralization-points/metrics-centralization-points/sizing-netdata-parents.md).
+
+## If I set up a parents cluster, will I be able to have more Child nodes stream to them?
+
+No. When you set up an active-active cluster, even if child nodes connect randomly to one or the other, all the parent nodes receive all the metrics of all the child nodes. So, all of them do all the work.
+
+## How much retention do the child nodes need?
+
+Child nodes need to have only the retention required in order to connect to another Parent if one fails or stops for maintenance.
+
+- If you have a cluster of parents, 5 to 10 minutes in `alloc` mode is usually enough.
+- If you have only 1 parent, it would be better to run the child nodes with `dbengine` so that they will have enough retention to back-fill the parent node if it stops for maintenance.
+
+## Does streaming between child nodes and parents support encryption?
+
+Yes. You can configure your parent nodes to enable TLS at their web server and configure the child nodes to connect with TLS to it. The streaming connection is also compressed, on top of TLS.
+
+## Can I have an HTTP proxy between parent and child nodes?
+
+No. The streaming protocol works on the same port as the internal web server of Netdata Agents, but the protocol is not HTTP-friendly and cannot be understood by HTTP proxy servers.
+
+## Should I load balance multiple parents with a TCP load balancer?
+
+Although this can be done and for streaming between child and parent nodes it could work, we recommend not doing it. It can lead to several kinds of problems.
+
+It is better to configure all the parent nodes directly in the child nodes `stream.conf`. The child nodes will do everything in their power to find a parent node to connect and they will never give up.
+
+## When I have multiple parents for the same children, will I receive alert notifications from all of them?
+
+If all parents are configured to run health checks and trigger alerts, yes.
+
+We recommend using Netdata Cloud to avoid receiving duplicate alert notifications. Netdata Cloud deduplicates alert notifications so that you will receive them only once.
+
+## When I have only Parents connected to Netdata Cloud, will I be able to use the Functions feature on my child nodes?
+
+Yes. Function requests will be received by the Parents and forwarded to the Child via their streaming connection. Function requests are propagated between parents, so this will work even if multiple levels of Netdata Parents are involved.
+
+## If I have a cluster of parents and get one out for maintenance for a few hours, will it have missing data when it returns back online?
+
+Check [Restoring a Netdata Parent after maintenance](/docs/observability-centralization-points/metrics-centralization-points/clustering-and-high-availability-of-netdata-parents.md).
+
+## I have a cluster of parents. Which one is used by Netdata Cloud?
+
+When there are multiple data sources for the same node, Netdata Cloud follows this strategy:
+
+1. Netdata Cloud prefers Netdata agents having `live` data.
+2. For time-series queries, when multiple Netdata agents have the retention required to answer the query, Netdata Cloud prefers the one that is further away from production systems.
+3. For Functions, Netdata Cloud prefers Netdata agents that are closer to the production systems.
+
+## Is there a way to balance child nodes to the parent nodes of a cluster?
+
+Yes. When configuring the Parents at the Children `stream.conf`, configure them in different order. Children get connected to the first Parent they find available, so if the order given to them is different, they will spread the connections to the Parents available.
+
+## Is there a way to get notified when a child gets disconnected?
+
+It depends on the ephemerality setting of each Netdata Child.
+
+1. **Permanent nodes**: These are nodes that should be available permanently and if they disconnect an alert should be triggered to notify you. By default, all nodes are considered permanent (not ephemeral).
+
+2. **Ephemeral nodes**: These are nodes that are ephemeral by nature and they may shutdown at any point in time without any impact on the services you run.
+
+To set the ephemeral flag on a node, edit its netdata.conf and in the `[health]` section set `is ephemeral = yes`. This setting is propagated to parent nodes and Netdata Cloud.
+
+When using Netdata Cloud (via a parent or directly) and a permanent node gets disconnected, Netdata Cloud sends node disconnection notifications.
diff --git a/docs/observability-centralization-points/metrics-centralization-points/replication-of-past-samples.md b/docs/observability-centralization-points/metrics-centralization-points/replication-of-past-samples.md
new file mode 100644
index 000000000..5c776b860
--- /dev/null
+++ b/docs/observability-centralization-points/metrics-centralization-points/replication-of-past-samples.md
@@ -0,0 +1,60 @@
+# Replication of Past Samples
+
+Replication is triggered when a Netdata Child connects to a Netdata Parent. It replicates the latest samples of collected metrics a Netdata Parent may be missing. The goal of replication is to back-fill samples that were collected between disconnects and reconnects, so that the Netdata Parent does not have gaps on the charts for the time Netdata Children were disconnected.
+
+The same replication mechanism is used between Netdata Parents (the sending Netdata is treated as a Child and the receiving Netdata as a Parent).
+
+## Replication Limitations
+
+The current implementation is optimized to replicate small durations and have minimal impact during reconnects. As a result it has the following limitations:
+
+1. Replication can only append samples to metrics. Only missing samples at the end of each time-series are replicated.
+
+2. Only `tier0` samples are replicated. Samples of higher tiers in Netdata are derived from `tier0` samples, and therefore there is no mechanism for ingesting them directly. This means that the maximum retention that can be replicated across Netdata is limited by the samples available in `tier0` of the sending Netdata.
+
+3. Only samples of metrics that are currently being collected are replicated. Archived metrics (or even archived nodes) will be replicated when and if they are collected again. Netdata archives metrics 1 hour after they stop being collected, so Netdata Parents may miss data only if Netdata Children are disconnected for more than an hour from their Parents.
+
+When multiple Netdata Parents are available, the replication happens in sequence, like in the following diagram.
+
+```mermaid
+sequenceDiagram
+    Child-->>Parent1: Connect
+    Parent1-->>Child: OK
+    Parent1-->>Parent2: Connect
+    Parent2-->>Parent1: OK
+    Child-->>Parent1: Metric M1 with retention up to Now
+    Parent1-->>Child: M1 stopped at -60sec, replicate up to Now
+    Child-->>Parent1: replicate M1 samples -60sec to Now
+    Child-->>Parent1: streaming M1
+    Parent1-->>Parent2: Metric M1 with retention up to Now
+    Parent2-->>Parent1: M1 stopped at -63sec, replicate up to Now
+    Parent1-->>Parent2: replicate M1 samples -63sec to Now
+    Parent1-->>Parent2: streaming M1
+```
+
+As shown in the diagram:
+
+1. All connections are established immediately after a Netdata child connects to any of the Netdata Parents.
+2. Each pair of connections (Child->Parent1, Parent1->Parent2) complete replication on the receiving side and then initiate replication on the sending side.
+3. Replication pushes data up to Now, and the sending side immediately enters streaming mode, without leaving any gaps on the samples of the receiving side.
+4. On every pair of connections, replication negotiates the retention of the receiving party to back-fill as much data as necessary.
+
+## Configuration options for Replication
+
+The following `netdata.conf` configuration parameters affect replication.
+
+On the receiving side (Netdata Parent):
+
+- `[db].seconds to replicate` limits the maximum time to be replicated. The default is 1 day (86400 seconds). Keep in mind that replication is also limited by the `tier0` retention the sending side has.
+
+On the sending side (Netdata Children, or Netdata Parent when parents are clustered):
+
+- `[db].replication threads` controls how many concurrent threads will be replicating metrics. The default is 1. Usually the performance is about 2 million samples per second per thread, so increasing this number may allow replication to progress faster between Netdata Parents.
+
+- `[db].cleanup obsolete charts after secs` controls for how much time after metrics stop being collected will not be available for replication. The default is 1 hour (3600 seconds). If you plan to have scheduled maintenance on Netdata Parents of more than 1 hour, we recommend increasing this setting. Keep in mind however, that increasing this duration in highly ephemeral environments can have an impact on RAM utilization, since metrics will be considered as collected for longer durations.
+
+## Monitoring Replication Progress
+
+Inbound and outbound replication progress is reported at the dashboard using the Netdata Function `Streaming`, under the `Top` tab.
+
+The same information is exposed via the API endpoint `http://agent-ip:19999/api/v2/node_instances` of both Netdata Parents and Children.
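Reviewer note (not part of the patch): the replication options documented in `replication-of-past-samples.md` lend themselves to a quick back-of-the-envelope estimate. The sketch below is only an illustration under stated assumptions: it takes the "about 2 million samples per second per thread" figure from the text at face value, assumes throughput scales linearly with `[db].replication threads`, and assumes every metric is collected once per `update_every` seconds. The function name and its parameters are hypothetical and are not part of any Netdata API.

```python
# Rough estimate of how long replication may take after a disconnect.
# Assumption: ~2 million samples/second/thread, as quoted in the docs above.
SAMPLES_PER_SECOND_PER_THREAD = 2_000_000


def estimate_replication_seconds(
    metrics: int,
    gap_seconds: int,
    threads: int = 1,  # mirrors the default of `[db].replication threads`
    update_every: int = 1,  # assumed collection interval, in seconds
    max_replicated_seconds: int = 86_400,  # default of `[db].seconds to replicate`
) -> float:
    """Return a hypothetical replication duration in seconds.

    Assumes linear scaling with the number of threads, which the docs
    only suggest ("may allow replication to progress faster").
    """
    if metrics < 0 or gap_seconds < 0:
        raise ValueError("metrics and gap_seconds must be non-negative")
    if threads < 1 or update_every < 1:
        raise ValueError("threads and update_every must be at least 1")

    # The receiving side never asks for more than `seconds to replicate`.
    replicated_seconds = min(gap_seconds, max_replicated_seconds)
    samples = metrics * (replicated_seconds // update_every)
    return samples / (SAMPLES_PER_SECOND_PER_THREAD * threads)


if __name__ == "__main__":
    # Example: 1,000,000 metrics in total, Parent unreachable for 10 minutes.
    for thread_count in (1, 4):
        seconds = estimate_replication_seconds(1_000_000, 600, threads=thread_count)
        print(f"{thread_count} thread(s): ~{seconds:.0f} s")
```

Under these assumptions, back-filling a 10-minute gap for one million per-second metrics moves 600 million samples, so raising the thread count is the main lever documented here; real throughput depends on hardware and load, so treat the result as an order-of-magnitude guide only.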
diff --git a/docs/observability-centralization-points/metrics-centralization-points/sizing-netdata-parents.md b/docs/observability-centralization-points/metrics-centralization-points/sizing-netdata-parents.md
new file mode 100644
index 000000000..edfbabe93
--- /dev/null
+++ b/docs/observability-centralization-points/metrics-centralization-points/sizing-netdata-parents.md
@@ -0,0 +1,3 @@
+# Sizing Netdata Parents
+
+To estimate CPU, RAM, and disk requirements for your Netdata Parents, check [sizing Netdata agents](/docs/netdata-agent/sizing-netdata-agents/README.md).