<!--
title: "Database engine"
description: "Netdata's highly-efficient database engine uses both RAM and disk for distributed, long-term storage of per-second metrics."
custom_edit_url: "https://github.com/netdata/netdata/edit/master/database/engine/README.md"
sidebar_label: "Database engine"
learn_status: "Published"
learn_topic_type: "Concepts"
learn_rel_path: "Concepts"
-->
# DBENGINE

DBENGINE is the time-series database of Netdata.
## Design

### Data Points

**Data points** represent the collected values of metrics.

A **data point** has:
1. A **value**, the data collected for a metric. A special **value** indicates that the collector failed to collect a valid value, and thus the data point is a **gap**.
2. A **timestamp**, the time at which it was collected.
3. A **duration**, the time between this and the previous data collection.
4. A flag which is set when machine learning classifies the collected value as **anomalous** (an outlier based on the trained models).
Using the **timestamp** and **duration**, Netdata calculates for each point its **start time**, **end time** and **update every**.
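The derivation can be sketched like this (hypothetical field names, not Netdata's actual structures):

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    value: float
    timestamp: float  # collection time, marks the end of the interval (seconds)
    duration: float   # time since the previous collection (seconds)

def derive_times(p: DataPoint):
    """Derive start time, end time and update-every from a data point."""
    end_time = p.timestamp
    start_time = p.timestamp - p.duration
    update_every = p.duration
    return start_time, end_time, update_every

# a point collected at t=100 with 1 second between collections
print(derive_times(DataPoint(3.14, 100.0, 1.0)))  # (99.0, 100.0, 1.0)
```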
For incremental metrics (counters), Netdata interpolates the collected values to align them to the expected **end time** at the microsecond level, absorbing data collection micro-latencies.
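The alignment amounts to linear interpolation of the counter between the previous and the current collection, evaluated at the expected end time (an illustrative sketch, not Netdata's exact code):

```python
def align_to_end_time(v_prev: float, t_prev: float,
                      v_now: float, t_now: float,
                      t_expected: float) -> float:
    """Linearly interpolate an incremental (counter) value to the expected
    end time, absorbing micro-latencies in data collection."""
    slope = (v_now - v_prev) / (t_now - t_prev)
    return v_prev + slope * (t_expected - t_prev)

# collection arrived 0.2s late; estimate the counter at the expected second
print(align_to_end_time(100.0, 10.0, 112.0, 11.2, 11.0))  # ~110 (value at t=11.0)
```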
When data points are stored in higher tiers (time aggregations - see [Tiers](#tiers) below), each data point has:
1. The **sum** of the original values that have been aggregated,
2. The **count** of all the original values aggregated,
3. The **minimum** value among them,
4. The **maximum** value among them,
5. Their **anomaly rate**, i.e. the count of values that were detected as outliers based on the currently trained models for the metric,
6. A **timestamp**, which is equal to the **end time** of the last point aggregated,
7. A **duration**, which is the time from the **start time** of the first point aggregated to the **end time** of the last point aggregated.
This design allows Netdata to accurately know the **average**, **minimum**, **maximum** and **anomaly rate** values even when using higher tiers to satisfy a query.
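A sketch of this aggregation (illustrative only; the names and the tuple layout are assumptions):

```python
def aggregate_points(points):
    """Aggregate lower-tier points into one higher-tier point.
    Each input is (value, is_anomalous); gaps (value None) are skipped,
    which is why count may be smaller than the number of slots."""
    values = [v for v, _ in points if v is not None]
    return {
        "sum": sum(values),
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "anomaly_count": sum(1 for v, a in points if v is not None and a),
    }

agg = aggregate_points([(1.0, False), (3.0, True), (2.0, False), (None, False)])
print(agg["sum"] / agg["count"])  # the average, computed on the fly: 2.0
```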
### Pages

Data points are organized into **pages**, i.e. segments of contiguous data collections of the same metric.

Each page:

1. Contains contiguous **data points** of a single metric.
2. Contains **data points** having the same **update every**. If a metric changes **update every** on the fly, the page is flushed and a new one with the new **update every** is created. If a data collection is missed, a **gap point** is inserted into the page, so that the data points in a page remain contiguous.
3. Has a **start time**, which is equal to the **end time** of the first data point stored into it,
4. Has an **end time**, which is equal to the **end time** of the last data point stored into it,
5. Has an **update every**, common for all points in the page.

A **page** is a simple array of values. Each slot in the array has a **timestamp** implied by its position in the array, and each value stored represents the **data point** for that time, for the metric the page belongs to.

This simple fixed-step page design allows Netdata to collect several million points per second and pack all the values in a compact form with minimal metadata overhead.
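A minimal sketch of such a fixed-step page (hypothetical class, not Netdata's implementation):

```python
class Page:
    """A fixed-step page: timestamps are implied by array position,
    so only start_time and update_every need to be stored as metadata."""

    def __init__(self, start_time: int, update_every: int):
        self.start_time = start_time
        self.update_every = update_every
        self.values = []

    def append(self, value):
        # values are simply appended; no per-point timestamp is stored
        self.values.append(value)

    def timestamp_of(self, slot: int) -> int:
        # each slot's time is implied by its position in the array
        return self.start_time + slot * self.update_every

page = Page(start_time=1000, update_every=1)
for v in (0.1, 0.2, 0.3):
    page.append(v)
print(page.timestamp_of(2))  # 1002
```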

#### Hot Pages

While a metric is collected, there is one **hot page** in memory for each of the configured tiers. Values collected for a metric are appended to its **hot page** until that page becomes full.

#### Dirty Pages

Once a **hot page** is full, it becomes a **dirty page**, and it is scheduled for immediate **flushing** (saving) to disk.

#### Clean Pages

Flushed (saved) pages are **clean pages**, i.e. read-only pages that reside primarily on disk, and are loaded on demand to satisfy data queries.

#### Pages Configuration

Pages are configured like this:

| Attribute                                                                          |                 Tier0                 |                              Tier1                              |                              Tier2                              |
|------------------------------------------------------------------------------------|:-------------------------------------:|:---------------------------------------------------------------:|:---------------------------------------------------------------:|
| Point Size in Memory, in Bytes                                                     |                   4                   |                               16                                |                               16                                |
| Point Size on Disk, in Bytes<br/><small>after LZ4 compression, on average</small>  |                   1                   |                                4                                |                                4                                |
| Page Size in Bytes                                                                 | 4096<br/><small>2048 in 32bit</small> |              2048<br/><small>1024 in 32bit</small>              |               384<br/><small>192 in 32bit</small>               |
| Collections per Point                                                              |                   1                   | 60x Tier0<br/><small>configurable in<br/>`netdata.conf`</small> | 60x Tier1<br/><small>configurable in<br/>`netdata.conf`</small> |
| Points per Page                                                                    | 1024<br/><small>512 in 32bit</small>  |               128<br/><small>64 in 32bit</small>                |                24<br/><small>12 in 32bit</small>                |
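From the table above, the wall-clock time covered by one full page per tier (64-bit build, per-second tier 0) works out to:

```python
update_every = 1  # seconds, tier 0 collection interval

tiers = {
    # tier: (points per page, seconds per point)
    0: (1024, update_every),
    1: (128, 60 * update_every),
    2: (24, 3600 * update_every),
}

for tier, (points, secs) in tiers.items():
    coverage = points * secs
    print(f"tier {tier}: one full page covers {coverage} seconds")
# tier 0: 1024 s (~17 minutes), tier 1: 7680 s (~2.1 hours), tier 2: 86400 s (1 day)
```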

### Files

To minimize the amount of data written to disk, and the storage required for storing metrics, Netdata aggregates up to 64 **dirty pages** of independent metrics, packs them all together into one bigger buffer, compresses this buffer with LZ4 (about 75% savings on average) and commits a transaction to the disk files.

#### Extents

This collection of 64 pages that is packed and compressed together is called an **extent**. Netdata tries to store together, in the same **extent**, metrics that are meant to be "close". Dimensions of the same chart are a typical example: they are usually queried together, so it is beneficial to have them in the same **extent**, to read all of them at once at query time.
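The packing step can be sketched as follows. Netdata uses LZ4; the sketch substitutes Python's stdlib `zlib` so it runs anywhere, and the length-prefix framing is an assumption for illustration, not Netdata's on-disk format:

```python
import struct
import zlib  # stand-in for LZ4, which Netdata actually uses


def build_extent(pages: list, max_pages: int = 64) -> bytes:
    """Pack up to 64 dirty pages into one compressed extent (sketch)."""
    assert len(pages) <= max_pages
    # prefix each page with its length so the pages can be unpacked later
    buffer = b"".join(struct.pack("<I", len(p)) + p for p in pages)
    return zlib.compress(buffer)


# 64 zero-filled 4KiB pages, e.g. dimensions of the same chart
pages = [bytes(4096) for _ in range(64)]
extent = build_extent(pages)
print(len(extent) < sum(len(p) for p in pages))  # True: large savings
```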

#### Datafiles

Multiple **extents** are appended to **datafiles** (filename suffix `.ndf`), until these **datafiles** become full. The size of each **datafile** is determined automatically by Netdata. The minimum for each **datafile** is 4MB and the maximum 512MB. Depending on the amount of disk space configured for each tier, Netdata will choose a **datafile** size, trying to maintain about 50 datafiles for the whole database, within the limits mentioned (4MB min, 512MB max per file). The maximum number of datafiles supported is 65536, and therefore the maximum database size (per tier) that Netdata can support is 32TB.
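The sizing rule can be sketched as (an illustrative sketch of the stated rule, not Netdata's exact code):

```python
def datafile_size(tier_disk_space: int, target_files: int = 50,
                  min_size: int = 4 * 1024**2,
                  max_size: int = 512 * 1024**2) -> int:
    """Pick a datafile size aiming for ~50 files per tier,
    clamped to the 4MB..512MB range."""
    size = tier_disk_space // target_files
    return max(min_size, min(max_size, size))

print(datafile_size(256 * 1024**2) // 1024**2)  # 5   (256MB tier -> ~5MB files)
print(datafile_size(1024**4) // 1024**2)        # 512 (1TB tier -> capped at 512MB)
```

With the 65536-datafile limit, the 512MB cap is what yields the 32TB maximum per tier.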

#### Journal Files

Each **datafile** has two **journal files** with metadata related to the data stored in the **datafile**.

- **journal file v1**, with filename suffix `.njf`, holds information about the transactions in its **datafile** and provides the ability to recover as much data as possible, in case either the datafile or the journal files get corrupted. This journal file has a maximum transaction size of 4KB, so in case of data corruption on disk, at most 4KB per corrupted transaction is lost. Each transaction holds the metadata of one **extent** (this is why DBENGINE supports up to 64 pages per extent).

- **journal file v2**, with filename suffix `.njfv2`, is a disk-based index for all the **pages** and **extents**. This file is memory mapped at runtime and is consulted to find where the data of a metric are in the datafile. This journal file is automatically re-created from **journal file v1** if it is missing. It is safe to delete these files (when Netdata is not running); Netdata will re-create them on the next run. Journal files v2 are supported in Netdata Agents with version `netdata-1.37.0-115-nightly` and later. Older versions maintain the journal index in memory.

#### Database Rotation

Database rotation is achieved by deleting the oldest **datafile** (and its journals) and creating a new one (with its journals).

Data on disk are append-only. There is no way to delete, add, or update data in the middle of the database. If data are not useful for whatever reason, Netdata can be instructed to ignore them; they will eventually be deleted from disk when the database is rotated. New data are always appended.

#### Tiers

Tiers are supported in Netdata Agents with version `netdata-1.35.0.138.nightly` and later.

**datafiles** and **journal files** are organized in **tiers**. All tiers share the same metrics and the same collected values.

- **tier 0** is the high resolution tier that stores the collected data at the frequency they are collected.
- **tier 1** by default aggregates 60 values of **tier 0**.
- **tier 2** by default aggregates 60 values of **tier 1**, or 3600 values of **tier 0**.

Updating the higher **tiers** is automated, and it happens in real-time while data are being collected for **tier 0**.

When the Netdata Agent starts, during the first data collection of each metric, higher tiers are automatically **backfilled** with data from lower tiers, so that the aggregation they provide will be accurate.

3 tiers are enabled by default in Netdata, with the following configuration:

```
[db]
    # per second data collection
    update every = 1

    # number of tiers used (1 to 5, 3 being default)
    storage tiers = 3

    # Tier 0, per second data
    dbengine multihost disk space MB = 256

    # Tier 1, per minute data
    dbengine tier 1 multihost disk space MB = 128

    # Tier 2, per hour data
    dbengine tier 2 multihost disk space MB = 64
```

The exact retention that can be achieved by each tier depends on the number of metrics collected. The more the metrics, the smaller the retention that will fit in a given amount of disk space. The general rule is that Netdata needs about **1 byte per data point on disk for tier 0**, and **4 bytes per data point on disk for tier 1 and above**.

So, for 1000 metrics collected per second and 256 MB for tier 0, Netdata will store about:

```
256MB on disk / 1 byte per point / 1000 metrics => 256k points per metric / 86400 seconds per day = about 3 days
```

At tier 1 (per minute):

```
128MB on disk / 4 bytes per point / 1000 metrics => 32k points per metric / (24 hours * 60 minutes) = about 22 days
```

At tier 2 (per hour):

```
64MB on disk / 4 bytes per point / 1000 metrics => 16k points per metric / 24 hours per day = about 2 years
```

Of course, doubling the metrics halves the retention. More factors affect retention: the number of ephemeral metrics (i.e. metrics that are collected only for part of the time), the number of metrics that are usually constant over time (affecting compression efficiency), and the number of restarts a Netdata Agent goes through over time (because it has to break pages prematurely, increasing the metadata overhead). But the actual numbers should not deviate significantly from the above.
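The rule of thumb above can be wrapped in a small estimator (illustrative only; real retention varies with the factors just listed):

```python
def retention_days(disk_mb: int, bytes_per_point: float, metrics: int,
                   points_per_day: float) -> float:
    """Estimate tier retention in days from disk space and point size."""
    points_per_metric = disk_mb * 1024 * 1024 / bytes_per_point / metrics
    return points_per_metric / points_per_day

# the three worked examples above, for 1000 metrics:
print(round(retention_days(256, 1, 1000, 86400), 1))    # tier 0: ~3.1 days
print(round(retention_days(128, 4, 1000, 24 * 60), 1))  # tier 1: ~23.3 days
print(round(retention_days(64, 4, 1000, 24) / 365, 1))  # tier 2: ~1.9 years
```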

### Data Loss

Until **hot pages** and **dirty pages** are **flushed** to disk, they are at risk (e.g. due to a crash or power failure), as they are stored only in memory.

The supported way of ensuring high data availability is to use Netdata Parents to stream the data in real-time to multiple other Netdata Agents.

## Memory Requirements

DBENGINE memory is related to the number of metrics concurrently being collected, the retention of the metrics on disk in relation to the queries running, and the number of metrics for which retention is maintained.

### Memory for concurrently collected metrics

DBENGINE is automatically sized to use memory according to this equation:

```
memory in KiB = METRICS x (TIERS - 1) x 4KiB x 2 + 32768 KiB
```

Where:

- `METRICS`: the maximum number of concurrently collected metrics (dimensions) since the agent started.
- `TIERS`: the number of storage tiers configured, by default 3 (so `TIERS - 1 = 2` with the default).
- `x 2`, to accommodate room for flushing data to disk.
- `x 4KiB`, the data segment size of each metric.
- `+ 32768 KiB`, 32 MiB for operational caches.

So, for 2000 metrics (dimensions) in 3 storage tiers:

```
memory for 2k metrics = 2000 x (3 - 1) x 4 KiB x 2 + 32768 KiB = about 64 MiB
```

For 100k concurrently collected metrics in 3 storage tiers:

```
memory for 100k metrics = 100000 x (3 - 1) x 4 KiB x 2 + 32768 KiB = about 1.6 GiB
```
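The equation can be expressed as a small helper (illustrative only):

```python
def dbengine_memory_kib(metrics: int, tiers: int = 3) -> int:
    """Estimated DBENGINE memory (KiB) for concurrently collected metrics,
    per the equation above: METRICS x (TIERS - 1) x 4KiB x 2 + 32768 KiB."""
    return metrics * (tiers - 1) * 4 * 2 + 32768

print(dbengine_memory_kib(2000) // 1024)              # 63, i.e. the ~64 MiB above
print(round(dbengine_memory_kib(100_000) / 1024**2, 2))  # 1.56, the ~1.6 GiB above
```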

#### Exceptions

Netdata has several protection mechanisms to prevent the use of more memory than the above, by incrementally fetching data from disk and aggressively evicting old data to make room for new data. Still, memory may grow beyond the above limit under the following conditions:

1. The pages concurrently used in queries do not fit in the above size. This can happen when multiple queries over unreasonably long time-frames run on the lower tiers (which have the higher resolution). The Netdata query planner attempts to avoid such situations by gradually loading pages, but under extreme conditions the system may still use more memory to satisfy these queries.

2. The disks that host the Netdata files are too slow for the workload required by the database, so data cannot be flushed to disk quickly enough to free memory. Netdata will automatically spawn more flushing workers in an attempt to parallelize and speed up flushing, but if the disks still cannot write the data quickly enough, pages will remain in memory until they are written to disk.

### Caches

DBENGINE stores metric data to disk. To achieve high performance even under severe stress, it uses several layers of caches.

#### Main Cache

Stores page data. It is the primary storage of hot and dirty pages (before they are saved to disk), and its clean queue is the LRU cache for speeding up queries.

The entire DBENGINE is designed to use the hot queue size (the currently collected metrics) as the key for sizing all its memory consumption. We call this feature **memory ballooning**. More collected metrics, bigger main cache, and vice versa.

In the equation:

```
memory in KiB = METRICS x (TIERS - 1) x 4KiB x 2 + 32768 KiB
```

the part `METRICS x (TIERS - 1) x 4KiB` is an estimate of the maximum hot size of the main cache. Tier 0 pages are 4KiB, but tier 1 pages are 2KiB and tier 2 pages are 384 bytes, so a single metric in 3 tiers uses 4096 + 2048 + 384 = 6528 bytes. The equation estimates 8192 bytes per metric, which includes cache internal structures and leaves some spare.

Then `x 2` is the worst-case estimate for the dirty queue. If all collected metrics (hot) become available for saving at once, to avoid stopping data collection all their pages will become dirty and new hot pages will be created instantly. To save memory, when Netdata starts, DBENGINE allocates randomly smaller pages for metrics, to spread their completion evenly across time.

The memory saved this way is used to improve the LRU cache. So, although 32MiB is reserved for the LRU, in bigger setups (Netdata Parents) the LRU grows a lot more, within the limits of the equation.

In practice, the main cache sizes itself with `hot x 1.5` instead of `hot x 2`. The reason is that 5% of the main cache is reserved for expanding the open cache, 5% for expanding the extent cache, and room is needed for the extensive buffers that are allocated in these setups. When the main cache exceeds `hot x 1.5` it enters a mode of critical evictions, and aggressively frees pages from the LRU to maintain a healthy memory footprint within its design limits.

#### Open Cache

Stores metadata about on-disk pages: not the data itself, only information about the location of the data on disk.

Its primary use is to index information about the open datafile, the one that still accepts new pages. Once that datafile becomes full, all the hot pages of the open cache are indexed in journal v2 files.

The clean queue is an LRU for reducing the journal v2 scans during querying.

The open cache uses memory ballooning too, like the main cache, based on its own hot pages. The open cache hot size is mainly controlled by the size of the open datafile. This is why, in Netdata versions with journal files v2, the maximum datafile size was decreased from 1GB to 512MB and the target number of datafiles was increased from 20 to 50.

On bigger setups the open cache gets a bigger LRU, by automatically sizing itself (the whole open cache) to 5% of the size of the (whole) main cache.
-The storage requirements are the same to Tier 1.
+#### Extent Cache
+
+Caches compressed **extent** data, to avoid reading too repeatedly the same data from disks.
+
+
+### Shared Memory
+
+Journal v2 indexes are mapped into memory. Netdata attempts to minimize shared memory use by instructing the kernel about the use of these files, or even unmounting them when they are not needed.
+
+The time-ranges of the queries running control the amount of shared memory required.
+
+## Metrics Registry
+
+DBENGINE uses 150 bytes of memory for every metric for which retention is maintained but is not currently being collected.

---

--- OLD DOCS BELOW THIS POINT ---

---
## Legacy configuration
### v1.35.1 and prior
These versions of the Agent do not support [Tiers](#tiers). You could change the metric retention for the parent and
all of its children only with the `dbengine multihost disk space MB` setting. This setting accounts for the space allocation
for the parent node and all of its children.
### v1.23.2 and prior
_For Netdata Agents earlier than v1.23.2_, the Agent on the parent node uses one dbengine instance for itself, and another instance for every child node it receives metrics from. If you had four streaming nodes, you would have five instances in total (`1 parent + 4 child nodes = 5 instances`).
The Agent allocates resources for each instance separately, using the `dbengine disk space MB` (**deprecated**) setting. If `dbengine disk space MB` (**deprecated**) is set to the default `256`, each instance is given 256 MiB in disk space, which means the total disk space required to store all instances is, roughly, `256 MiB * (1 parent + 4 child nodes) = 1280 MiB`.
#### Backward compatibility
##### Information
For more information about setting `[db].mode` on your nodes, in addition to other streaming configurations, see
[streaming](https://github.com/netdata/netdata/blob/master/streaming/README.md).
## Requirements & limitations
An important observation is that RAM usage depends on both the `page cache size` and the `dbengine multihost disk space` options.
You can use our [database engine calculator](https://github.com/netdata/netdata/blob/master/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics) to validate the memory requirements for your particular system(s) and configuration (**out-of-date**).
### Disk space
You can apply the settings by running `sysctl -p` or by rebooting.
## Files
With the DB engine mode the metric data are stored in database files. These files are organized in pairs, the datafiles and their corresponding journalfiles, e.g.:
The default location is `/var/cache/netdata/dbengine/*`. The higher numbered filenames contain the more recent data. You can safely delete some pairs of files when Netdata is stopped to manually free up some space.
_Users should_ **back up** _their `./dbengine` folders if they consider this data to be important._ You can also set up
one or more [exporting connectors](https://github.com/netdata/netdata/blob/master/exporting/README.md) to send your Netdata metrics to other databases for long-term
storage at lower granularity.
## Operation
An interesting observation to make is that the reader threads in the disk I/O-bound run generate a read load of 1.7M/sec, whereas in the CPU-bound run (16 GiB page cache) the read load is 70 times higher, at 118M/sec.
Consequently, there is a significant degree of interference by the reader threads, slowing down the writer threads.
This is also possible because the interference effects are greater than the SSD impact on data generation throughput.