summaryrefslogtreecommitdiffstats
path: root/database/engine/README.md
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2021-02-07 11:49:00 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2021-02-07 12:42:05 +0000
commit2e85f9325a797977eea9dfea0a925775ddd211d9 (patch)
tree452c7f30d62fca5755f659b99e4e53c7b03afc21 /database/engine/README.md
parentReleasing debian version 1.19.0-4. (diff)
downloadnetdata-2e85f9325a797977eea9dfea0a925775ddd211d9.tar.xz
netdata-2e85f9325a797977eea9dfea0a925775ddd211d9.zip
Merging upstream version 1.29.0.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'database/engine/README.md')
-rw-r--r--database/engine/README.md202
1 files changed, 132 insertions, 70 deletions
diff --git a/database/engine/README.md b/database/engine/README.md
index 78f3b15ec..191366a46 100644
--- a/database/engine/README.md
+++ b/database/engine/README.md
@@ -1,93 +1,102 @@
-# Database engine
-
-The Database Engine works like a traditional database. There is some amount of RAM dedicated to data caching and
-indexing and the rest of the data reside compressed on disk. The number of history entries is not fixed in this case,
-but depends on the configured disk space and the effective compression ratio of the data stored. This is the **only
-mode** that supports changing the data collection update frequency (`update_every`) **without losing** the previously
-stored metrics.
-
-## Files
+<!--
+title: "Database engine"
+description: "Netdata's highly-efficient database engine use both RAM and disk for distributed, long-term storage of per-second metrics."
+custom_edit_url: https://github.com/netdata/netdata/edit/master/database/engine/README.md
+-->
-With the DB engine memory mode the metric data are stored in database files. These files are organized in pairs, the
-datafiles and their corresponding journalfiles, e.g.:
+# Database engine
-```sh
-datafile-1-0000000001.ndf
-journalfile-1-0000000001.njf
-datafile-1-0000000002.ndf
-journalfile-1-0000000002.njf
-datafile-1-0000000003.ndf
-journalfile-1-0000000003.njf
-...
-```
+The Database Engine works like a traditional database. It dedicates a certain amount of RAM to data caching and
+indexing, while the rest of the data resides compressed on disk. Unlike other [memory modes](/database/README.md), the
+amount of historical metrics stored is based on the amount of disk space you allocate and the effective compression
+ratio, not a fixed number of metrics collected.
-They are located under their host's cache directory in the directory `./dbengine` (e.g. for localhost the default
-location is `/var/cache/netdata/dbengine/*`). The higher numbered filenames contain more recent metric data. The user
-can safely delete some pairs of files when Netdata is stopped to manually free up some space.
+By using both RAM and disk space, the database engine allows for long-term storage of per-second metrics inside of the
+Agent itself.
-_Users should_ **back up** _their `./dbengine` folders if they consider this data to be important._
+In addition, the database engine is the only memory mode that supports changing the data collection update frequency
+(`update_every`) without losing the metrics your Agent already gathered and stored.
## Configuration
-There is one DB engine instance per Netdata host/node. That is, there is one `./dbengine` folder per node, and all
-charts of `dbengine` memory mode in such a host share the same storage space and DB engine instance memory state. You
-can select the memory mode for localhost by editing netdata.conf and setting:
+To use the database engine, open `netdata.conf` and set `memory mode` to `dbengine`.
```conf
[global]
memory mode = dbengine
```
-For setting the memory mode for the rest of the nodes you should look at
-[streaming](../../streaming/).
+To configure the database engine, look for the `page cache size` and `dbengine multihost disk space` settings in the
+`[global]` section of your `netdata.conf`. The Agent ignores the `history` setting when using the database engine.
-The `history` configuration option is meaningless for `memory mode = dbengine` and is ignored for any metrics being
-stored in the DB engine.
+```conf
+[global]
+ page cache size = 32
+ dbengine multihost disk space = 256
+```
+
+The above values are the default values for Page Cache size and DB engine disk space quota. Both numbers are
+in **MiB**.
+
+The `page cache size` option determines the amount of RAM in **MiB** dedicated to caching Netdata metric values. The
+actual page cache size will be slightly larger than this figure—see the [memory requirements](#memory-requirements)
+section for details.
-All DB engine instances, for localhost and all other streaming recipient nodes inherit their configuration from
-`netdata.conf`:
+The `dbengine multihost disk space` option determines the amount of disk space in **MiB** that is dedicated to storing
+Netdata metric values and all related metadata describing them. You can use the [**database engine
+calculator**](/docs/store/change-metrics-storage.md#calculate-the-system-resources-RAM-disk-space-needed-to-store-metrics)
+to correctly set `dbengine multihost disk space` based on your metrics retention policy. The calculator gives an
+accurate estimate based on how many child nodes you have, how many metrics your Agent collects, and more.
+
+### Legacy configuration
+
+The deprecated `dbengine disk space` option determines the amount of disk space in **MiB** that is dedicated to storing
+Netdata metric values per legacy database engine instance (see [details on the legacy mode](#legacy-mode) below).
```conf
[global]
- page cache size = 32
dbengine disk space = 256
```
-The above values are the default and minimum values for Page Cache size and DB engine disk space quota. Both numbers are
-in **MiB**. All DB engine instances will allocate the configured resources separately.
+### Streaming metrics to the database engine
-The `page cache size` option determines the amount of RAM in **MiB** that is dedicated to caching Netdata metric values
-themselves as far as queries are concerned. The total page cache size will be greater since data collection itself will
-consume additional memory as is described in the [Memory requirements](#memory-requirements) section.
+When using the multihost database engine, all parent and child nodes share the same `page cache size` and `dbengine
+multihost disk space` in a single dbengine instance. The [**database engine
+calculator**](/docs/store/change-metrics-storage.md#calculate-the-system-resources-RAM-disk-space-needed-to-store-metrics)
+helps you properly set `page cache size` and `dbengine multihost disk space` on your parent node to allocate enough
+resources based on your metrics retention policy and how many child nodes you have.
-The `dbengine disk space` option determines the amount of disk space in **MiB** that is dedicated to storing Netdata
-metric values and all related metadata describing them.
+#### Legacy mode
-## Operation
+_For Netdata Agents earlier than v1.23.2_, the Agent on the parent node uses one dbengine instance for itself, and
+another instance for every child node it receives metrics from. If you had four streaming nodes, you would have five
+instances in total (`1 parent + 4 child nodes = 5 instances`).
-The DB engine stores chart metric values in 4096-byte pages in memory. Each chart dimension gets its own page to store
-consecutive values generated from the data collectors. Those pages comprise the **Page Cache**.
+The Agent allocates resources for each instance separately using the `dbengine disk space` (**deprecated**) setting. If
+`dbengine disk space`(**deprecated**) is set to the default `256`, each instance is given 256 MiB in disk space, which
+means the total disk space required to store all instances is, roughly, `256 MiB * 1 parent * 4 child nodes = 1280 MiB`.
-When those pages fill up they are slowly compressed and flushed to disk. It can take `4096 / 4 = 1024 seconds = 17
-minutes`, for a chart dimension that is being collected every 1 second, to fill a page. Pages can be cut short when we
-stop Netdata or the DB engine instance so as to not lose the data. When we query the DB engine for data we trigger disk
-read I/O requests that fill the Page Cache with the requested pages and potentially evict cold (not recently used)
-pages.
+#### Backward compatibility
-When the disk quota is exceeded the oldest values are removed from the DB engine at real time, by automatically deleting
-the oldest datafile and journalfile pair. Any corresponding pages residing in the Page Cache will also be invalidated
-and removed. The DB engine logic will try to maintain between 10 and 20 file pairs at any point in time.
+All existing metrics belonging to child nodes are automatically converted to legacy dbengine instances and the localhost
+metrics are transferred to the multihost dbengine instance.
-The Database Engine uses direct I/O to avoid polluting the OS filesystem caches and does not generate excessive I/O
-traffic so as to create the minimum possible interference with other applications.
+All new child nodes are automatically transferred to the multihost dbengine instance and share its page cache and disk
+space. If you want to migrate a child node from its legacy dbengine instance to the multihost dbengine instance, you
+must delete the instance's directory, which is located in `/var/cache/netdata/MACHINE_GUID/dbengine`, after stopping the
+Agent.
+
+##### Information
+
+For more information about setting `memory mode` on your nodes, in addition to other streaming configurations, see
+[streaming](/streaming/README.md).
-## Memory requirements
+### Memory requirements
Using memory mode `dbengine` we can overcome most memory restrictions and store a dataset that is much larger than the
available memory.
-There are explicit memory requirements **per** DB engine **instance**, meaning **per** Netdata **node** (e.g. localhost
-and streaming recipient nodes):
+There are explicit memory requirements **per** DB engine **instance**:
- The total page cache memory footprint will be an additional `#dimensions-being-collected x 4096 x 2` bytes over what
the user configured with `page cache size`.
@@ -99,18 +108,30 @@ and streaming recipient nodes):
- for very highly compressible data (compression ratio > 90%) this RAM overhead is comparable to the disk space
footprint.
-An important observation is that RAM usage depends on both the `page cache size` and the `dbengine disk space` options.
+An important observation is that RAM usage depends on both the `page cache size` and the `dbengine multihost disk space`
+options.
-## File descriptor requirements
+You can use our [database engine
+calculator](/docs/store/change-metrics-storage.md#calculate-the-system-resources-RAM-disk-space-needed-to-store-metrics)
+to validate the memory requirements for your particular system(s) and configuration (**out-of-date**).
-The Database Engine may keep a **significant** amount of files open per instance (e.g. per streaming slave or master
-server). When configuring your system you should make sure there are at least 50 file descriptors available per
+### Disk space requirements
+
+There are explicit disk space requirements **per** DB engine **instance**:
+
+- The total disk space footprint will be the maximum between `#dimensions-being-collected x 4096 x 2` bytes or what
+ the user configured with `dbengine multihost disk space` or `dbengine disk space`.
+
+### File descriptor requirements
+
+The Database Engine may keep a **significant** amount of files open per instance (e.g. per streaming child or
+parent server). When configuring your system you should make sure there are at least 50 file descriptors available per
`dbengine` instance.
Netdata allocates 25% of the available file descriptors to its Database Engine instances. This means that only 25% of
the file descriptors that are available to the Netdata service are accessible by dbengine instances. You should take
that into account when configuring your service or system-wide file descriptor limits. You can roughly estimate that the
-Netdata service needs 2048 file descriptors for every 10 streaming slave hosts when streaming is configured to use
+Netdata service needs 2048 file descriptors for every 10 streaming child hosts when streaming is configured to use
`memory mode = dbengine`.
If for example one wants to allocate 65536 file descriptors to the Netdata service on a systemd system one needs to
@@ -143,12 +164,53 @@ kern.maxfiles=65536
You can apply the settings by running `sysctl -p` or by rebooting.
+## Files
+
+With the DB engine memory mode the metric data are stored in database files. These files are organized in pairs, the
+datafiles and their corresponding journalfiles, e.g.:
+
+```sh
+datafile-1-0000000001.ndf
+journalfile-1-0000000001.njf
+datafile-1-0000000002.ndf
+journalfile-1-0000000002.njf
+datafile-1-0000000003.ndf
+journalfile-1-0000000003.njf
+...
+```
+
+They are located under their host's cache directory in the directory `./dbengine` (e.g. for localhost the default
+location is `/var/cache/netdata/dbengine/*`). The higher numbered filenames contain more recent metric data. The user
+can safely delete some pairs of files when Netdata is stopped to manually free up some space.
+
+_Users should_ **back up** _their `./dbengine` folders if they consider this data to be important._ You can also set up
+one or more [exporting connectors](/exporting/README.md) to send your Netdata metrics to other databases for long-term
+storage at lower granularity.
+
+## Operation
+
+The DB engine stores chart metric values in 4096-byte pages in memory. Each chart dimension gets its own page to store
+consecutive values generated from the data collectors. Those pages comprise the **Page Cache**.
+
+When those pages fill up they are slowly compressed and flushed to disk. It can take `4096 / 4 = 1024 seconds = 17
+minutes`, for a chart dimension that is being collected every 1 second, to fill a page. Pages can be cut short when we
+stop Netdata or the DB engine instance so as to not lose the data. When we query the DB engine for data we trigger disk
+read I/O requests that fill the Page Cache with the requested pages and potentially evict cold (not recently used)
+pages.
+
+When the disk quota is exceeded the oldest values are removed from the DB engine at real time, by automatically deleting
+the oldest datafile and journalfile pair. Any corresponding pages residing in the Page Cache will also be invalidated
+and removed. The DB engine logic will try to maintain between 10 and 20 file pairs at any point in time.
+
+The Database Engine uses direct I/O to avoid polluting the OS filesystem caches and does not generate excessive I/O
+traffic so as to create the minimum possible interference with other applications.
+
## Evaluation
-We have evaluated the performance of the `dbengine` API that the netdata daemon uses internally. This is **not** the
-web API of netdata. Our benchmarks ran on a **single** `dbengine` instance, multiple of which can be running in a
-netdata master server. We used a server with an AMD Ryzen Threadripper 2950X 16-Core Processor and 2 disk drives, a
-Seagate Constellation ES.3 2TB magnetic HDD and a SAMSUNG MZQLB960HAJR-00007 960GB NAND Flash SSD.
+We have evaluated the performance of the `dbengine` API that the netdata daemon uses internally. This is **not** the web
+API of netdata. Our benchmarks ran on a **single** `dbengine` instance, multiple of which can be running in a Netdata
+parent node. We used a server with an AMD Ryzen Threadripper 2950X 16-Core Processor and 2 disk drives, a Seagate
+Constellation ES.3 2TB magnetic HDD and a SAMSUNG MZQLB960HAJR-00007 960GB NAND Flash SSD.
For our workload, we defined 32 charts with 128 metrics each, giving us a total of 4096 metrics. We defined 1 worker
thread per chart (32 threads) that generates new data points with a data generation interval of 1 second. The time axis
@@ -170,10 +232,10 @@ so as to avoid all disk bottlenecks.
The reported numbers are the following:
| device | page cache | dataset | reads/sec | writes/sec |
-| :---: | :---: | ---: | ---: | ---: |
-| HDD | 64 MiB | 4.1 GiB | 813K | 18.0M |
-| SSD | 64 MiB | 9.8 GiB | 1.7M | 43.0M |
-| N/A | 16 GiB | 6.8 GiB |118.2M | 30.2M |
+| :----: | :--------: | ------: | --------: | ---------: |
+| HDD | 64 MiB | 4.1 GiB | 813K | 18.0M |
+| SSD | 64 MiB | 9.8 GiB | 1.7M | 43.0M |
+| N/A | 16 GiB | 6.8 GiB | 118.2M | 30.2M |
where "reads/sec" is the number of metric data points being read from the database via its API per second and
"writes/sec" is the number of metric data points being written to the database per second.