Diffstat (limited to 'database/engine/README.md')
-rw-r--r--  database/engine/README.md | 227
1 file changed, 135 insertions(+), 92 deletions(-)
diff --git a/database/engine/README.md b/database/engine/README.md
index 7defcce9..c67e400f 100644
--- a/database/engine/README.md
+++ b/database/engine/README.md
@@ -6,75 +6,114 @@ custom_edit_url: https://github.com/netdata/netdata/edit/master/database/engine/
# Database engine
-The Database Engine works like a traditional database. It dedicates a certain amount of RAM to data caching and
-indexing, while the rest of the data resides compressed on disk. Unlike other [memory modes](/database/README.md), the
-amount of historical metrics stored is based on the amount of disk space you allocate and the effective compression
+The Database Engine works like a traditional time series database. Unlike other [database modes](/database/README.md),
+the amount of historical metrics stored is based on the amount of disk space you allocate and the effective compression
ratio, not a fixed number of metrics collected.
-By using both RAM and disk space, the database engine allows for long-term storage of per-second metrics inside of the
-Agent itself.
+## Tiering
-In addition, the database engine is the only memory mode that supports changing the data collection update frequency
-(`update_every`) without losing the metrics your Agent already gathered and stored.
+Tiering is a mechanism for providing multiple tiers of data with
+different [granularity of metrics](/docs/store/distributed-data-architecture.md#granularity-of-metrics).
-## Configuration
+For Netdata Agents version `netdata-1.35.0.138.nightly` and later, `dbengine` supports Tiering, allowing almost
+unlimited retention of data.
-To use the database engine, open `netdata.conf` and set `memory mode` to `dbengine`.
-```conf
-[global]
- memory mode = dbengine
-```
+### Metric size
-To configure the database engine, look for the `page cache size` and `dbengine multihost disk space` settings in the
-`[global]` section of your `netdata.conf`. The Agent ignores the `history` setting when using the database engine.
+Every Tier downsamples the Tier directly below it (lower Tiers have greater resolution). You can have up to 5
+Tiers **[0..4]** of data (including Tier 0, which has the highest resolution).
-```conf
-[global]
- page cache size = 32
- dbengine multihost disk space = 256
-```
+Tier 0 is the default that was always available in `dbengine` mode. Tier 1 is the first level of aggregation, Tier 2 is
+the second, and so on.
-The above values are the default values for Page Cache size and DB engine disk space quota. Both numbers are
-in **MiB**.
+Metrics on all Tiers except _Tier 0_ also store the following five additional values for every point, for accurate
+representation:
-The `page cache size` option determines the amount of RAM in **MiB** dedicated to caching Netdata metric values. The
-actual page cache size will be slightly larger than this figure—see the [memory requirements](#memory-requirements)
-section for details.
+1. The `sum` of the points aggregated
+2. The `min` of the points aggregated
+3. The `max` of the points aggregated
+4. The `count` of the points aggregated (this could be constant, but it may vary due to gaps in data collection)
+5. The `anomaly_count` of the points aggregated (how many of the aggregated points were found to be anomalous)
-The `dbengine multihost disk space` option determines the amount of disk space in **MiB** that is dedicated to storing
-Netdata metric values and all related metadata describing them. You can use the [**database engine
-calculator**](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
-to correctly set `dbengine multihost disk space` based on your metrics retention policy. The calculator gives an
-accurate estimate based on how many child nodes you have, how many metrics your Agent collects, and more.
+Among `min`, `max` and `sum`, the correct value is chosen based on the user query. `average` is calculated on the fly at
+query time.
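To make this concrete, here is a minimal sketch (not Netdata's actual implementation) of rolling 60 per-second Tier 0
samples into one Tier 1 point using standard `awk`; the final line shows why `average` does not need to be stored:

```sh
# A sketch of Tier 1 aggregation, NOT Netdata's actual code: roll 60
# per-second samples (one value per line on stdin) into one per-minute
# point carrying sum/min/max/count; average is derived at query time.
seq 1 60 | awk '
NR == 1 { min = $1; max = $1 }
{
    sum += $1; count++
    if ($1 < min) min = $1
    if ($1 > max) max = $1
}
END {
    printf "sum=%d min=%d max=%d count=%d average=%.2f\n",
           sum, min, max, count, sum / count
}'
```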
-### Legacy configuration
+### Tiering in a nutshell
-The deprecated `dbengine disk space` option determines the amount of disk space in **MiB** that is dedicated to storing
-Netdata metric values per legacy database engine instance (see [details on the legacy mode](#legacy-mode) below).
+The `dbengine` is capable of retaining metrics for years. To further understand the `dbengine` tiering mechanism,
+let's explore the following configuration.
-```conf
-[global]
- dbengine disk space = 256
```
+[db]
+ mode = dbengine
+
+ # per second data collection
+ update every = 1
+
+ # enables Tier 1 and Tier 2, Tier 0 is always enabled in dbengine mode
+ storage tiers = 3
+
+ # Tier 0, per second data for a week
+ dbengine multihost disk space MB = 1100
+
+ # Tier 1, per minute data for a month
+ dbengine tier 1 multihost disk space MB = 330
+
+ # Tier 2, per hour data for a year
+ dbengine tier 2 multihost disk space MB = 67
+```
+
+For 2000 metrics, collected every second and retained for a week, Tier 0 needs: 1 byte x 2000 metrics x 3600 secs per
+hour x 24 hours per day x 7 days per week = 1,209,600,000 bytes, or roughly 1100 MiB.
+
+By setting `dbengine multihost disk space MB` to `1100`, this node will start maintaining about a week of data. But pay
+attention to the number of metrics. If you have more than 2000 metrics on a node, or you need more than a week of
+high-resolution metrics, you may need to adjust this setting accordingly.
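To adapt this estimate to your own node, the same arithmetic can be run in a shell; the ~1 byte per Tier 0 point is
the figure used above, and `METRICS` and `DAYS` are placeholders to change:

```sh
# Rough Tier 0 sizing, assuming ~1 byte per point as described above.
METRICS=2000   # metrics collected on this node
DAYS=7         # desired high-resolution retention
echo $(( 1 * METRICS * 3600 * 24 * DAYS / 1024 / 1024 )) MiB
# => 1153 MiB (rounded in the text to ~1100 MiB)
```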
+
+By default, Tier 1 aggregates the data every **60 points of Tier 0**. In our case Tier 0 is per second, so in terms of
+time the Tier 1 "resolution" is per minute.
+
+Tier 1 needs four times more storage per point compared to Tier 0. So, for 2000 metrics, with per-minute resolution,
+retained for a month, Tier 1 needs: 4 bytes x 2000 metrics x 60 minutes per hour x 24 hours per day x 30 days per
+month = 345,600,000 bytes, or roughly 330 MiB.
-### Streaming metrics to the database engine
+By default, Tier 2 aggregates data every 3600 points of Tier 0 (that is, every 60 points of Tier 1, the Tier directly
+below it). Again in terms of time (Tier 0 is per second), Tier 2 is per hour.
-When using the multihost database engine, all parent and child nodes share the same `page cache size` and `dbengine
-multihost disk space` in a single dbengine instance. The [**database engine
-calculator**](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
-helps you properly set `page cache size` and `dbengine multihost disk space` on your parent node to allocate enough
-resources based on your metrics retention policy and how many child nodes you have.
+The per-point storage requirements are the same as for Tier 1.
-#### Legacy mode
+For 2000 metrics, with per-hour resolution, retained for a year, Tier 2 needs: 4 bytes x 2000 metrics x 24 hours per
+day x 365 days per year = 70,080,000 bytes, or roughly 67 MiB.
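The same back-of-the-envelope arithmetic for Tier 1 and Tier 2, assuming the 4 bytes per aggregated point used above:

```sh
# Rough Tier 1 / Tier 2 sizing, assuming ~4 bytes per aggregated point.
METRICS=2000
echo "Tier 1: $(( 4 * METRICS * 60 * 24 * 30 / 1024 / 1024 )) MiB for a month"
echo "Tier 2: $(( 4 * METRICS * 24 * 365 / 1024 / 1024 )) MiB for a year"
# => Tier 1: 329 MiB, Tier 2: 66 MiB (rounded in the text to 330 and 67)
```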
+
+## Legacy configuration
+
+### v1.35.1 and prior
+
+These versions of the Agent do not support [Tiering](#tiering). You could change the metric retention for the parent
+and all of its children only with the `dbengine multihost disk space MB` setting. This setting accounts for the space
+allocated to the parent node and all of its children.
+
+To configure the database engine, look for the `dbengine page cache size MB` and `dbengine multihost disk space MB`
+settings in the `[db]` section of your `netdata.conf`.
+
+```conf
+[db]
+ dbengine page cache size MB = 32
+ dbengine multihost disk space MB = 256
+```
+
+### v1.23.2 and prior
_For Netdata Agents earlier than v1.23.2_, the Agent on the parent node uses one dbengine instance for itself, and
another instance for every child node it receives metrics from. If you had four streaming nodes, you would have five
instances in total (`1 parent + 4 child nodes = 5 instances`).
-The Agent allocates resources for each instance separately using the `dbengine disk space` (**deprecated**) setting. If
-`dbengine disk space`(**deprecated**) is set to the default `256`, each instance is given 256 MiB in disk space, which
-means the total disk space required to store all instances is, roughly, `256 MiB * 1 parent * 4 child nodes = 1280 MiB`.
+The Agent allocates resources for each instance separately using the `dbengine disk space MB` (**deprecated**) setting.
+If `dbengine disk space MB` (**deprecated**) is set to the default `256`, each instance is given 256 MiB in disk space,
+which means the total disk space required to store all instances is roughly
+`256 MiB x (1 parent + 4 child nodes) = 1280 MiB`.
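A quick check of that total, using the deprecated 256 MiB per-instance default described above:

```sh
# Disk footprint of the legacy per-instance allocation: one dbengine
# instance for the parent plus one per streaming child.
CHILDREN=4
echo $(( 256 * (1 + CHILDREN) )) MiB   # => 1280 MiB
```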
#### Backward compatibility
@@ -88,51 +127,54 @@ Agent.
##### Information
-For more information about setting `memory mode` on your nodes, in addition to other streaming configurations, see
+For more information about setting `[db].mode` on your nodes, in addition to other streaming configurations, see
[streaming](/streaming/README.md).
-### Memory requirements
+## Requirements & limitations
-Using memory mode `dbengine` we can overcome most memory restrictions and store a dataset that is much larger than the
+### Memory
+
+Using database mode `dbengine`, we can overcome most memory restrictions and store a dataset that is much larger than the
available memory.
There are explicit memory requirements **per** DB engine **instance**:
-- The total page cache memory footprint will be an additional `#dimensions-being-collected x 4096 x 2` bytes over what
- the user configured with `page cache size`.
+- The total page cache memory footprint will be an additional `#dimensions-being-collected x 4096 x 2` bytes over what
+ the user configured with `dbengine page cache size MB`.
+
-- an additional `#pages-on-disk x 4096 x 0.03` bytes of RAM are allocated for metadata.
+- an additional `#pages-on-disk x 4096 x 0.03` bytes of RAM are allocated for metadata.
- - roughly speaking this is 3% of the uncompressed disk space taken by the DB files.
+ - roughly speaking this is 3% of the uncompressed disk space taken by the DB files.
- - for very highly compressible data (compression ratio > 90%) this RAM overhead is comparable to the disk space
- footprint.
+ - for very highly compressible data (compression ratio > 90%) this RAM overhead is comparable to the disk space
+ footprint.
An important observation is that RAM usage depends on both the `page cache size` and the `dbengine multihost disk space`
options.
-You can use our [database engine
-calculator](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
+You can use
+our [database engine calculator](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
to validate the memory requirements for your particular system(s) and configuration (**out-of-date**).
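Putting the two formulas above together, a rough per-instance RAM estimate can be sketched in a shell; `DIMS` and
`PAGES_ON_DISK` below are illustrative assumptions, not measured values:

```sh
# Back-of-the-envelope dbengine RAM estimate from the formulas above.
DIMS=2000               # dimensions currently being collected (assumption)
PAGES_ON_DISK=500000    # pages stored in the DB files (assumption)
CACHE_MB=32             # dbengine page cache size MB

EXTRA_CACHE=$(( DIMS * 4096 * 2 / 1024 / 1024 ))              # per-dimension overhead
METADATA=$(( PAGES_ON_DISK * 4096 * 3 / 100 / 1024 / 1024 ))  # ~3% of uncompressed data
echo "approx RAM: $(( CACHE_MB + EXTRA_CACHE + METADATA )) MiB"   # => 105 MiB
```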
-### Disk space requirements
+### Disk space
There are explicit disk space requirements **per** DB engine **instance**:
-- The total disk space footprint will be the maximum between `#dimensions-being-collected x 4096 x 2` bytes or what
- the user configured with `dbengine multihost disk space` or `dbengine disk space`.
+- The total disk space footprint will be the maximum between `#dimensions-being-collected x 4096 x 2` bytes or what the
+ user configured with `dbengine multihost disk space` or `dbengine disk space`.
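The disk-space floor from this formula can be computed the same way (`DIMS` is again an illustrative assumption):

```sh
# Minimum disk footprint per dbengine instance, from the formula above.
DIMS=2000
echo "floor: $(( DIMS * 4096 * 2 / 1024 / 1024 )) MiB"
# actual footprint = max(floor, configured disk space)
```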
-### File descriptor requirements
+### File descriptor
-The Database Engine may keep a **significant** amount of files open per instance (e.g. per streaming child or
-parent server). When configuring your system you should make sure there are at least 50 file descriptors available per
+The Database Engine may keep a **significant** amount of files open per instance (e.g. per streaming child or parent
+server). When configuring your system you should make sure there are at least 50 file descriptors available per
`dbengine` instance.
Netdata allocates 25% of the available file descriptors to its Database Engine instances. This means that only 25% of
the file descriptors that are available to the Netdata service are accessible by dbengine instances. You should take
that into account when configuring your service or system-wide file descriptor limits. You can roughly estimate that the
Netdata service needs 2048 file descriptors for every 10 streaming child hosts when streaming is configured to use
-`memory mode = dbengine`.
+`[db].mode = dbengine`.
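Using these rules of thumb, you can sanity-check a planned limit; the numbers below are illustrative, not
recommendations:

```sh
# Sanity-check fd limits for a streaming parent, from the rules of thumb
# above: ~2048 fds per 10 children, 25% of the limit reserved for dbengine.
CHILDREN=40
SERVICE_NEEDS=$(( 2048 * CHILDREN / 10 ))        # fds the service needs
SERVICE_LIMIT=65536
DBENGINE_SHARE=$(( SERVICE_LIMIT * 25 / 100 ))   # fds available to dbengine
PER_INSTANCE=$(( DBENGINE_SHARE / (CHILDREN + 1) ))
echo "service needs ~${SERVICE_NEEDS}; each instance gets ~${PER_INSTANCE} fds (want >= 50)"
```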
If for example one wants to allocate 65536 file descriptors to the Netdata service on a systemd system one needs to
override the Netdata service by running `sudo systemctl edit netdata` and creating a file with contents:
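A typical override, assuming a standard systemd layout, can also be written non-interactively:

```sh
# A sketch, assuming standard systemd paths: create the drop-in directly
# instead of using the interactive `systemctl edit netdata`.
sudo mkdir -p /etc/systemd/system/netdata.service.d
sudo tee /etc/systemd/system/netdata.service.d/override.conf <<'EOF' >/dev/null
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload && sudo systemctl restart netdata
```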
@@ -149,7 +191,7 @@ ulimit -n 65536
```
at the beginning of the service file. Alternatively you can change the system-wide limits of the kernel by changing
- `/etc/sysctl.conf`. For linux that would be:
+`/etc/sysctl.conf`. For Linux that would be:
```conf
fs.file-max = 65536
@@ -166,8 +208,8 @@ You can apply the settings by running `sysctl -p` or by rebooting.
## Files
-With the DB engine memory mode the metric data are stored in database files. These files are organized in pairs, the
-datafiles and their corresponding journalfiles, e.g.:
+With the DB engine mode, the metric data are stored in database files. These files are organized in pairs, the
+datafiles and their corresponding journalfiles, e.g.:
```sh
datafile-1-0000000001.ndf
@@ -192,15 +234,16 @@ storage at lower granularity.
The DB engine stores chart metric values in 4096-byte pages in memory. Each chart dimension gets its own page to store
consecutive values generated from the data collectors. Those pages comprise the **Page Cache**.
-When those pages fill up they are slowly compressed and flushed to disk. It can take `4096 / 4 = 1024 seconds = 17
-minutes`, for a chart dimension that is being collected every 1 second, to fill a page. Pages can be cut short when we
-stop Netdata or the DB engine instance so as to not lose the data. When we query the DB engine for data we trigger disk
-read I/O requests that fill the Page Cache with the requested pages and potentially evict cold (not recently used)
-pages.
+When those pages fill up, they are slowly compressed and flushed to disk. It can
+take `4096 / 4 = 1024 seconds = 17 minutes`, for a chart dimension that is being collected every 1 second, to fill a
+page. Pages can be cut short when we stop Netdata or the DB engine instance so as to not lose the data. When we query
+the DB engine for data we trigger disk read I/O requests that fill the Page Cache with the requested pages and
+potentially evict cold (not recently used) pages.
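The fill time scales with the collection interval; the arithmetic, assuming the 4096-byte pages and 4-byte points
described above:

```sh
# Time to fill one 4096-byte page, at 4 bytes per collected point.
UPDATE_EVERY=1                    # data collection interval in seconds
POINTS_PER_PAGE=$(( 4096 / 4 ))   # 1024 points per page
echo "$(( POINTS_PER_PAGE * UPDATE_EVERY / 60 )) minutes to fill a page"
# => 17 minutes at per-second collection (~17 hours at update every = 60)
```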
When the disk quota is exceeded the oldest values are removed from the DB engine at real time, by automatically deleting
the oldest datafile and journalfile pair. Any corresponding pages residing in the Page Cache will also be invalidated
-and removed. The DB engine logic will try to maintain between 10 and 20 file pairs at any point in time.
+and removed. The DB engine logic will try to maintain between 10 and 20 file pairs at any point in time.
The Database Engine uses direct I/O to avoid polluting the OS filesystem caches and does not generate excessive I/O
traffic so as to create the minimum possible interference with other applications.
@@ -215,38 +258,38 @@ Constellation ES.3 2TB magnetic HDD and a SAMSUNG MZQLB960HAJR-00007 960GB NAND
For our workload, we defined 32 charts with 128 metrics each, giving us a total of 4096 metrics. We defined 1 worker
thread per chart (32 threads) that generates new data points with a data generation interval of 1 second. The time axis
of the time-series is emulated and accelerated so that the worker threads can generate as many data points as possible
-without delays.
+without delays.
-We also defined 32 worker threads that perform queries on random metrics with semi-random time ranges. The
-starting time of the query is randomly selected between the beginning of the time-series and the time of the latest data
-point. The ending time is randomly selected between 1 second and 1 hour after the starting time. The pseudo-random
-numbers are generated with a uniform distribution.
+We also defined 32 worker threads that perform queries on random metrics with semi-random time ranges. The starting time
+of the query is randomly selected between the beginning of the time-series and the time of the latest data point. The
+ending time is randomly selected between 1 second and 1 hour after the starting time. The pseudo-random numbers are
+generated with a uniform distribution.
The data are written to the database at the same time as they are read from it. This is a concurrent read/write mixed
-workload with a duration of 60 seconds. The faster `dbengine` runs, the bigger the dataset size becomes since more
-data points will be generated. We set a page cache size of 64MiB for the two disk-bound scenarios. This way, the dataset
-size of the metric data is much bigger than the RAM that is being used for caching so as to trigger I/O requests most
-of the time. In our final scenario, we set the page cache size to 16 GiB. That way, the dataset fits in the page cache
-so as to avoid all disk bottlenecks.
+workload with a duration of 60 seconds. The faster `dbengine` runs, the bigger the dataset size becomes since more data
+points will be generated. We set a page cache size of 64MiB for the two disk-bound scenarios. This way, the dataset size
+of the metric data is much bigger than the RAM that is being used for caching so as to trigger I/O requests most of the
+time. In our final scenario, we set the page cache size to 16 GiB. That way, the dataset fits in the page cache so as to
+avoid all disk bottlenecks.
The reported numbers are the following:
| device | page cache | dataset | reads/sec | writes/sec |
-| :----: | :--------: | ------: | --------: | ---------: |
-| HDD | 64 MiB | 4.1 GiB | 813K | 18.0M |
-| SSD | 64 MiB | 9.8 GiB | 1.7M | 43.0M |
-| N/A | 16 GiB | 6.8 GiB | 118.2M | 30.2M |
+|:------:|:----------:|--------:|----------:|-----------:|
+| HDD | 64 MiB | 4.1 GiB | 813K | 18.0M |
+| SSD | 64 MiB | 9.8 GiB | 1.7M | 43.0M |
+| N/A | 16 GiB | 6.8 GiB | 118.2M | 30.2M |
where "reads/sec" is the number of metric data points being read from the database via its API per second and
-"writes/sec" is the number of metric data points being written to the database per second.
+"writes/sec" is the number of metric data points being written to the database per second.
Notice that the HDD numbers are pretty high and not much slower than the SSD numbers. This is thanks to the database
engine design being optimized for rotating media. In the database engine disk I/O requests are:
-- asynchronous to mask the high I/O latency of HDDs.
-- mostly large to reduce the amount of HDD seeking time.
-- mostly sequential to reduce the amount of HDD seeking time.
-- compressed to reduce the amount of required throughput.
+- asynchronous to mask the high I/O latency of HDDs.
+- mostly large to reduce the amount of HDD seeking time.
+- mostly sequential to reduce the amount of HDD seeking time.
+- compressed to reduce the amount of required throughput.
As a result, the HDD is not thousands of times slower than the SSD, which is typical for other workloads.