<!--
title: "Netdata Longer Metrics Retention"
description: ""
custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/longer-metrics-storage.md
-->

# Netdata Longer Metrics Retention

Metrics retention affects 3 aspects of the operation of a Netdata Agent:

1. The disk space required to store the metrics.
2. The memory the Netdata Agent will require to have that retention available for queries.
3. The CPU resources that will be required to query longer time-frames.

As retention increases, the resources required to support that retention increase too.

Since Netdata Agents usually run at the edge, inside production systems, Netdata Agent **parents** should be considered. In a **parent - child** setup, the child (the Netdata Agent running on a production system) delegates all its functions, including longer metrics retention and querying, to the parent node, which can dedicate more resources to this task. A single Netdata Agent parent can centralize multiple Netdata Agent children (dozens, hundreds, or even thousands, depending on its available resources).
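
As a minimal sketch of such a setup, streaming is configured in `stream.conf` on both nodes. The destination address and the API key below are placeholders chosen for this example, not values required by this guide. On the child:

```
[stream]
    enabled = yes
    # hypothetical parent address
    destination = 10.0.0.1:19999
    # placeholder API key, agreed between child and parent
    api key = 11111111-2222-3333-4444-555555555555
```

On the parent, a section named after the same API key accepts the child's stream:

```
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```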


## Ephemerality of metrics

The ephemerality of metrics plays an important role in retention. In environments where metrics stop being collected and new metrics are constantly being generated, we are interested in 2 parameters (a short worked example follows the list):

1. The **expected concurrent number of metrics** as an average for the lifetime of the database.
   This affects mainly the storage requirements.

2. The **expected total number of unique metrics** for the lifetime of the database.
   This affects mainly the memory requirements for having all these metrics indexed and available to be queried.
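
For example (with hypothetical numbers, purely for illustration): a node that collects about 2,000 metrics at any given time, but runs short-lived containers so that roughly 500 new unique metrics appear every week, has an expected concurrent number of about 2,000 metrics (driving the disk space), while a year of retention implies about 2,000 + 500 x 52 ≈ 28,000 unique metrics (driving the memory needed for the index).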

## Granularity of metrics

The granularity of metrics (the frequency at which they are collected and stored, i.e. their resolution) significantly affects retention.

Lowering the granularity from per second to every two seconds will double the retention and halve the CPU requirements of the Netdata Agent, without affecting disk space or memory requirements.
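
The collection frequency is controlled by `[db].update every` in `netdata.conf`, the same setting that appears in the examples later in this guide. For example, to collect and store data every two seconds:

```
[db]
    # collect and store data every 2 seconds instead of every second
    update every = 2
```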

## Which database mode to use

Netdata Agents support multiple database modes.

The default mode `[db].mode = dbengine` has been designed to scale for longer retentions.

The other available database modes are designed to minimize resource utilization and should usually be considered only on the children of **parent - child** setups.

So,

* On a single node setup, use `[db].mode = dbengine` to increase retention.
* On a **parent - child** setup, use `[db].mode = dbengine` on the parent to increase retention, and a more resource-efficient mode (like `save`, `ram` or `none`) on the child to minimize resource utilization (see the child-side example below).

To use `dbengine`, set this in `netdata.conf` (it is the default):

```
[db]
    mode = dbengine
```
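
For the child side of a **parent - child** setup, a minimal sketch (assuming the child streams its metrics to a parent as described earlier) could be:

```
[db]
    # keep recent metrics in memory only; longer retention is handled by the parent
    mode = ram
```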

## Tiering

`dbengine` supports tiering. Tiering allows keeping up to 3 versions of the data, at different resolutions:

1. Tier 0 is the high resolution data.
2. Tier 1 is the first tier that samples data every 60 data collections of Tier 0.
3. Tier 2 is the second tier that samples data every 3600 data collections of Tier 0 (60 of Tier 1).

To enable tiering, set `[db].storage tiers` in `netdata.conf` (the default is 1, which enables only Tier 0):

```
[db]
    mode = dbengine
    storage tiers = 3
```

## Disk space requirements

Netdata Agents require about 1 byte on disk per database point on Tier 0, and about 4 times more per point on the higher tiers (Tier 1 and Tier 2). The higher tiers need more storage because for every point they store `min`, `max`, `sum`, `count` and `anomaly rate` (5 values, but only 4 times the storage, because `count` and `anomaly rate` are 16-bit integers). The `average` is calculated on the fly at query time using `sum / count`.
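
The per-tier arithmetic in the sections below follows a single pattern, which as a rule of thumb can be written as:

```
Tier 0 disk space ≈ 1 byte  x number of metrics x points collected per metric over the retention period
Tier N disk space ≈ 4 bytes x number of metrics x points stored per metric over the retention period
```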

### Tier 0 - per second for a week

For 2000 metrics, collected every second and retained for a week, Tier 0 needs: 1 byte x 2000 metrics x 3600 secs per hour x 24 hours per day x 7 days per week = 1100MB.

The setting to control this is in `netdata.conf`:

```
[db]
    mode = dbengine
    
    # per second data collection
    update every = 1
    
    # enable only Tier 0
    storage tiers = 1
    
    # Tier 0, per second data for a week
    dbengine multihost disk space MB = 1100
```

By setting it to `1100` and restarting the Netdata Agent, this node will start maintaining about a week of data. But pay attention to the number of metrics. If you have more than 2000 metrics on a node, or you need more than a week of high resolution metrics, you may need to adjust this setting accordingly.
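
For example, assuming a node with about 5,000 concurrently collected metrics (a hypothetical figure) and the same one-week retention, the estimate becomes 1 byte x 5000 metrics x 3600 secs per hour x 24 hours per day x 7 days per week ≈ 2900MB:

```
[db]
    # Tier 0, per second data for a week, for ~5000 metrics
    dbengine multihost disk space MB = 2900
```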

### Tier 1 - per minute for a month

Tier 1 is by default sampling the data every 60 points of Tier 0. If Tier 0 is per second, then Tier 1 is per minute.

Tier 1 needs 4 times more storage per point compared to Tier 0. So, for 2000 metrics, with per minute resolution, retained for a month, Tier 1 needs: 4 bytes x 2000 metrics x 60 minutes per hour x 24 hours per day x 30 days per month = 330MB.

Do this in `netdata.conf`:

```
[db]
    mode = dbengine
    
    # per second data collection
    update every = 1
    
    # enable only Tier 0 and Tier 1
    storage tiers = 2
    
    # Tier 0, per second data for a week
    dbengine multihost disk space MB = 1100
    
    # Tier 1, per minute data for a month
    dbengine tier 1 multihost disk space MB = 330
```

Once `netdata.conf` is edited, the Netdata Agent needs to be restarted for the changes to take effect.

### Tier 2 - per hour for a year

Tier 2 is by default sampling data every 3600 points of Tier 0 (60 of Tier 1). If Tier 0 is per second, then Tier 2 is per hour.

The storage requirements per point are the same as Tier 1.

For 2000 metrics, with per hour resolution, retained for a year, Tier 2 needs: 4 bytes x 2000 metrics x 24 hours per day x 365 days per year = 67MB.

Do this in `netdata.conf`:

```
[db]
    mode = dbengine
    
    # per second data collection
    update every = 1
    
    # enable Tier 0, Tier 1 and Tier 2
    storage tiers = 3
    
    # Tier 0, per second data for a week
    dbengine multihost disk space MB = 1100
    
    # Tier 1, per minute data for a month
    dbengine tier 1 multihost disk space MB = 330

    # Tier 2, per hour data for a year
    dbengine tier 2 multihost disk space MB = 67
```

Once `netdata.conf` is edited, the Netdata Agent needs to be restarted for the changes to take effect.