summaryrefslogtreecommitdiffstats
path: root/docs/netdata-agent/sizing-netdata-agents/disk-requirements-and-retention.md
blob: d9e879cb62ba41dda30cedee1dc5cf5b62525f72 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
# Disk Requirements & Retention

## Database Modes and Tiers

Netdata comes with 3 database modes:

1. `dbengine`: the default high-performance multi-tier database of Netdata. Metric samples are cached in memory and are saved to disk in multiple tiers, with compression.
2. `ram`: metric samples are stored in ring buffers in memory, with increments of 1024 samples. Metric samples are not committed to disk. Kernel-Same-Page (KSM) can be used to deduplicate Netdata's memory.
3. `alloc`: metric samples are stored in ring buffers in memory, with flexible increments. Metric samples are not committed to disk.

## `ram` and `alloc`

Modes `ram` and `alloc` can help when Netdata should not introduce any disk I/O at all. In both of these modes, metric samples exist only in memory, and only while they are collected.

When Netdata is configured to stream its metrics to a Metrics Observability Centralization Point (a Netdata Parent), metric samples are forwarded in real-time to that Netdata Parent. The ring buffers available in these modes is used to cache the collected samples for some time, in case there are network issues, or the Netdata Parent is restarted for maintenance.

The memory required per sample in these modes, is 4 bytes:

- `ram` mode uses `mmap()` behind the scene, and can be incremented in steps of 1024 samples (4KiB). Mode `ram` allows the use of the Linux kernel memory dedupper (Kernel-Same-Page or KSM) to deduplicate Netdata ring buffers and save memory.
- `alloc` mode can be sized for any number of samples per metric. KSM cannot be used in this mode.

To configure database mode `ram` or `alloc`, in `netdata.conf`, set the following:

- `[db].mode` to either `ram` or `alloc`.
- `[db].retention` to the number of samples the ring buffers should maintain. For `ram` if the value set is not a multiple of 1024, the next multiple of 1024 will be used.

## `dbengine`

`dbengine` supports up to 5 tiers. By default, 3 tiers are used, like this:

|   Tier   |                                          Resolution                                          | Uncompressed Sample Size | Usually On Disk |
|:--------:|:--------------------------------------------------------------------------------------------:|:------------------------:|:---------------:|
| `tier0`  |            native resolution (metrics collected per-second as stored per-second)             |         4 bytes          |    0.6 bytes    |
| `tier1`  | 60 iterations of `tier0`, so when metrics are collected per-second, this tier is per-minute. |         16 bytes         |     6 bytes     |
| `tier2`  |  60 iterations of `tier1`, so when metrics are collected per second, this tier is per-hour.  |         16 bytes         |    18 bytes     |

Data are saved to disk compressed, so the actual size on disk varies depending on compression efficiency.

`dbegnine` tiers are overlapping, so higher tiers include a down-sampled version of the samples in lower tiers:

```mermaid
gantt
    dateFormat  YYYY-MM-DD
    tickInterval 1week
    axisFormat    
    todayMarker off
    tier0, 14d       :a1, 2023-12-24, 7d
    tier1, 60d       :a2, 2023-12-01, 30d
    tier2, 365d      :a3, 2023-11-02, 59d
```

## Disk Space and Metrics Retention

You can find information about the current disk utilization of a Netdata Parent, at <http://agent-ip:19999/api/v2/info>. The output of this endpoint is like this:

```json
{
  // more information about the agent
  // then, near the end:
  "db_size": [
    {
      "tier": 0,
      "metrics": 43070,
      "samples": 88078162001,
      "disk_used": 41156409552,
      "disk_max": 41943040000,
      "disk_percent": 98.1245269,
      "from": 1705033983,
      "to": 1708856640,
      "retention": 3822657,
      "expected_retention": 3895720,
      "currently_collected_metrics": 27424
    },
    {
      "tier": 1,
      "metrics": 72987,
      "samples": 5155155269,
      "disk_used": 20585157180,
      "disk_max": 20971520000,
      "disk_percent": 98.1576785,
      "from": 1698287340,
      "to": 1708856640,
      "retention": 10569300,
      "expected_retention": 10767675,
      "currently_collected_metrics": 27424
    },
    {
      "tier": 2,
      "metrics": 148234,
      "samples": 314919121,
      "disk_used": 5957346684,
      "disk_max": 10485760000,
      "disk_percent": 56.8136853,
      "from": 1667808000,
      "to": 1708856640,
      "retention": 41048640,
      "expected_retention": 72251324,
      "currently_collected_metrics": 27424
    }
  ]
}
```

In this example:

- `tier` is the database tier.
- `metrics` is the number of unique time-series in the database.
- `samples` is the number of samples in the database.
- `disk_used` is the currently used disk space in bytes.
- `disk_max` is the configured max disk space in bytes.
- `disk_percent` is the current disk space utilization for this tier.
- `from` is the first (oldest) timestamp in the database for this tier.
- `to` is the latest (newest) timestamp in the database for this tier.
- `retention` is the current retention of the database for this tier, in seconds (divide by 3600 for hours, divide by 86400 for days).
- `expected_retention` is the expected retention in seconds when `disk_percent` will be 100 (divide by 3600 for hours, divide by 86400 for days).
- `currently_collected_metrics` is the number of unique time-series currently being collected for this tier.

So, for our example above:

| Tier | # Of Metrics |  # Of Samples | Disk Used | Disk Free | Current Retention | Expected Retention | Sample Size |
|-----:|-------------:|--------------:|----------:|----------:|------------------:|-------------------:|------------:|
|    0 |        43.1K |  88.1 billion |    38.4Gi |     1.88% |         44.2 days |          45.0 days |      0.46 B |
|    1 |        73.0K |   5.2 billion |    19.2Gi |     1.84% |        122.3 days |         124.6 days |      3.99 B |
|    2 |       148.3K | 315.0 million |     5.6Gi |    43.19% |        475.1 days |         836.2 days |     18.91 B |

To configure retention, in `netdata.conf`, set the following:

- `[db].mode` to `dbengine`.
- `[db].dbengine multihost disk space MB`, this is the max disk size for `tier0`. The default is 256MiB.
- `[db].dbengine tier 1 multihost disk space MB`, this is the max disk space for `tier1`. The default is 50% of `tier0`.
- `[db].dbengine tier 2 multihost disk space MB`, this is the max disk space for `tier2`. The default is 50% of `tier1`.