Adding upstream version 1.37.1.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-04 14:31:17 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-04 14:31:17 +0000
commit: 8020f71afd34d7696d7933659df2d763ab05542f (patch)
tree: 2fdf1b5447ffd8bdd61e702ca183e814afdcb4fc /collectors/python.d.plugin/changefinder
parent: Initial commit. (diff)
download: netdata-8020f71afd34d7696d7933659df2d763ab05542f.tar.xz
netdata-8020f71afd34d7696d7933659df2d763ab05542f.zip
4 files changed, 489 insertions, 0 deletions
diff --git a/collectors/python.d.plugin/changefinder/Makefile.inc b/collectors/python.d.plugin/changefinder/Makefile.inc
new file mode 100644
index 0000000..01a9240
--- /dev/null
+++ b/collectors/python.d.plugin/changefinder/Makefile.inc
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+# THIS IS NOT A COMPLETE Makefile
+# IT IS INCLUDED BY ITS PARENT'S Makefile.am
+# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT
+
+# install these files
+dist_python_DATA       += changefinder/changefinder.chart.py
+dist_pythonconfig_DATA += changefinder/changefinder.conf
+
+# do not install these files, but include them in the distribution
+dist_noinst_DATA       += changefinder/README.md changefinder/Makefile.inc
+
diff --git a/collectors/python.d.plugin/changefinder/README.md b/collectors/python.d.plugin/changefinder/README.md
new file mode 100644
index 0000000..7ec3a25
--- /dev/null
+++ b/collectors/python.d.plugin/changefinder/README.md
@@ -0,0 +1,217 @@
+<!--
+title: "Online change point detection with Netdata"
+description: "Use ML-driven change point detection to narrow your focus and shorten root cause analysis."
+custom_edit_url: https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/changefinder/README.md
+-->
+
+# Online changepoint detection with Netdata
+
+This collector uses the Python [changefinder](https://github.com/shunsukeaihara/changefinder) library to
+perform [online](https://en.wikipedia.org/wiki/Online_machine_learning) [changepoint detection](https://en.wikipedia.org/wiki/Change_detection)
+on your Netdata charts and/or dimensions.
+
+Rather than just _collecting_ data, this collector also does some computation on the data it collects to return a
+changepoint score for each chart or dimension you configure it to work on. This is
+an [online](https://en.wikipedia.org/wiki/Online_machine_learning) machine learning algorithm, so there is no batch step
+to train the model; instead it evolves over time as more data arrives. That makes this particular algorithm quite cheap
+to compute at each step of data collection (see the notes section below for more details) and it should scale fairly
+well to work on lots of charts or hosts (if running on a parent node for example).
+
+> As this is a somewhat unique collector and involves often subjective concepts like changepoints and anomalies, we would love to hear any feedback on it from the community. Please let us know on the [community forum](https://community.netdata.cloud/t/changefinder-collector-feedback/972) or drop us a note at [analytics-ml-team@netdata.cloud](mailto:analytics-ml-team@netdata.cloud) for any and all feedback, both positive and negative. This sort of feedback is priceless to help us make complex features more useful.
+
+## Charts
+
+Two charts are available:
+
+### ChangeFinder Scores (`changefinder.scores`)
+
+This chart shows the percentile of the score that is output from the ChangeFinder library (it is turned off by default
+but available with `show_scores: true`).
+
+A high observed score is more likely to be a valid changepoint worth exploring, even more so when multiple charts or
+dimensions have high changepoint scores at the same time or very close together.
+
+### ChangeFinder Flags (`changefinder.flags`)
+
+This chart shows `1` if the latest score has a percentile value at or above `cf_threshold`, and `0` otherwise. By
+default, any score at or above the 99th percentile will raise a flag on this chart.
+
+The raw changefinder score itself can be a little noisy and so limiting ourselves to just periods where it surpasses
+the 99th percentile can help manage the "[signal to noise ratio](https://en.wikipedia.org/wiki/Signal-to-noise_ratio)"
+better.
+
+The `cf_threshold` parameter might be one you want to play around with to tune things specifically for the workloads on
+your node and the specific charts you want to monitor. For example, maybe the 95th percentile might work better for you
+than the 99th percentile.
+
+Below is an example of the chart produced by this collector. The first 3/4 of the period looks normal in that we see a
+few individual changes being picked up somewhat randomly over time. But then at around 14:59 towards the end of the
+chart we see two periods with 'spikes' of multiple changes for a small period of time. This is the sort of pattern that
+might be a sign that something on the system has changed enough to merit some investigation.
+
+![changepoint-collector](https://user-images.githubusercontent.com/2178292/108773528-665de980-7556-11eb-895d-798669bcd695.png)
+
+## Requirements
+
+- This collector will only work with Python 3 and requires the packages below be installed.
+
+```bash
+# become netdata user
+sudo su -s /bin/bash netdata
+# install required packages for the netdata user
+pip3 install --user numpy==1.19.5 changefinder==0.03 scipy==1.5.4
+```
+
+**Note**: if you need to tell Netdata to use Python 3 then you can add the option below to the python plugin section
+of your `netdata.conf` file.
+
+```yaml
+[ plugin:python.d ]
+  # update every = 1  
+  command options = -ppython3
+```
+
+## Configuration
+
+Install the Python requirements above, enable the collector and restart Netdata.
+
+```bash
+cd /etc/netdata/
+sudo ./edit-config python.d.conf
+# Set `changefinder: no` to `changefinder: yes`
+sudo systemctl restart netdata
+```
+
+The configuration for the changefinder collector defines how it will behave on your system and might take some
+experimentation over time to set it optimally for your node. Out of the box, the config comes with
+some [sane defaults](https://www.netdata.cloud/blog/redefining-monitoring-netdata/) to get you started that try to
+balance the flexibility and power of the ML models with the goal of being as cheap as possible in terms of cost on the
+node resources.
+
+_**Note**: If you are unsure about any of the below configuration options then it's best to just ignore all this and
+leave the `changefinder.conf` file alone to begin with. Then you can return to it later if you would like to tune things
+a bit more once the collector is running for a while and you have a feeling for its performance on your node._
+
+Edit the `python.d/changefinder.conf` configuration file using `edit-config` from your
+agent's [config directory](/docs/configure/nodes.md), which is usually at `/etc/netdata`.
+
+```bash
+cd /etc/netdata   # Replace this path with your Netdata config directory, if different
+sudo ./edit-config python.d/changefinder.conf
+```
+
+The default configuration should look something like this. Here you can see each parameter (with sane defaults) and some
+information about each one and what it does.
+
+```yaml
+# ----------------------------------------------------------------------
+# JOBS (data collection sources)
+
+# Pull data from local Netdata node.
+local:
+
+  # A friendly name for this job.
+  name: 'local'
+
+  # What host to pull data from.
+  host: '127.0.0.1:19999'
+
+  # What charts to pull data for - A regex like 'system\..*' or 'system\..*|apps.cpu|apps.mem' etc.
+  charts_regex: 'system\..*'
+
+  # Charts to exclude, useful if you would like to exclude some specific charts. 
+  # Note: should be a ',' separated string like 'chart.name,chart.name'.
+  charts_to_exclude: ''
+
+  # Get ChangeFinder scores 'per_dim' or 'per_chart'.
+  mode: 'per_chart'
+
+  # Default parameters that can be passed to the changefinder library.
+  cf_r: 0.5
+  cf_order: 1
+  cf_smooth: 15
+
+  # The percentile above which scores will be flagged.
+  cf_threshold: 99
+
+  # The number of recent scores to use when calculating the percentile of the changefinder score.
+  n_score_samples: 14400
+
+  # Set to true if you also want to chart the percentile scores in addition to the flags.
+  # Mainly useful for debugging or if you want to dive deeper on how the scores are evolving over time.
+  show_scores: false
+```
+
+## Troubleshooting
+
+To see any relevant log messages you can use a command like the one below.
+
+```bash
+grep 'changefinder' /var/log/netdata/error.log
+```
+
+To see more detail, you can log in as the `netdata` user and run the collector in debug mode.
+
+```bash
+# become netdata user
+sudo su -s /bin/bash netdata
+# run collector in debug using `nolock` option if netdata is already running the collector itself.
+/usr/libexec/netdata/plugins.d/python.d.plugin changefinder debug trace nolock
+```
+
+## Notes
+
+- It may take an hour or two (depending on your choice of `n_score_samples`) for the collector to 'settle' into its
+  typical behaviour in terms of the trained models and scores you will see in the normal running of your node. Mainly
+  this is because it can take a while to build up a proper distribution of previous scores in order to convert the raw
+  score returned by the ChangeFinder algorithm into a percentile based on the most recent `n_score_samples` that have
+  already been produced. So when you first turn the collector on, it will have a lot of flags in the beginning and then
+  should 'settle down' once it has built up enough history. This is a typical characteristic of online machine learning
+  approaches which need some initial window of time before they can be useful.
+- As this collector does most of the work in Python itself, you may want to try it out first on a test or development
+  system to get a sense of its performance characteristics on a node similar to where you would like to use it.
+- On a development n1-standard-2 (2 vCPUs, 7.5 GB memory) VM running Ubuntu 18.04 LTS and not doing any work, some of the
+  typical performance characteristics we saw from running this collector (with defaults) were:
+    - A runtime (`netdata.runtime_changefinder`) of ~30ms.
+    - Typically ~1% additional CPU usage.
+    - About 85 MB of RAM (`apps.mem`) being continually used by the `python.d.plugin` under default configuration.
+
+## Useful links and further reading
+
+- [PyPI changefinder](https://pypi.org/project/changefinder/) reference page.
+- [GitHub repo](https://github.com/shunsukeaihara/changefinder) for the changefinder library.
+- Relevant academic papers:
+    - Yamanishi K, Takeuchi J. A unifying framework for detecting outliers and change points from nonstationary time
+      series data. 8th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD02. 2002:
+      676. ([pdf](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3469&rep=rep1&type=pdf))
+    - Kawahara Y, Sugiyama M. Sequential Change-Point Detection Based on Direct Density-Ratio Estimation. SIAM
+      International Conference on Data Mining. 2009:
+      389–400. ([pdf](https://onlinelibrary.wiley.com/doi/epdf/10.1002/sam.10124))
+    - Liu S, Yamada M, Collier N, Sugiyama M. Change-point detection in time-series data by relative density-ratio
+      estimation. Neural Networks. Jul.2013 43:72–83. [PubMed: 23500502] ([pdf](https://arxiv.org/pdf/1203.0453.pdf))
+    - T. Iwata, K. Nakamura, Y. Tokusashi, and H. Matsutani, “Accelerating Online Change-Point Detection Algorithm using
+      10 GbE FPGA NIC,” Proc. International European Conference on Parallel and Distributed Computing (Euro-Par’18)
+      Workshops, vol.11339, pp.506–517, Aug.
+      2018 ([pdf](https://www.arc.ics.keio.ac.jp/~matutani/papers/iwata_heteropar2018.pdf))
+- The [ruptures](https://github.com/deepcharles/ruptures) python package is also a good place to learn more about
+  changepoint detection (mostly offline as opposed to online but deals with similar concepts).
+- A nice [blog post](https://techrando.com/2019/08/14/a-brief-introduction-to-change-point-detection-using-python/)
+  showing some of the other options and libraries for changepoint detection in Python.
+- [Bayesian changepoint detection](https://github.com/hildensia/bayesian_changepoint_detection) library - we may explore
+  implementing a collector for this or integrating this approach into this collector at a future date if there is
+  interest and it proves computationally feasible.
+- You might also find the
+  Netdata [anomalies collector](https://github.com/netdata/netdata/tree/master/collectors/python.d.plugin/anomalies)
+  interesting.
+- [Anomaly Detection](https://en.wikipedia.org/wiki/Anomaly_detection) Wikipedia page.
+- [Anomaly Detection YouTube playlist](https://www.youtube.com/playlist?list=PL6Zhl9mK2r0KxA6rB87oi4kWzoqGd5vp0)
+  maintained by [andrewm4894](https://github.com/andrewm4894/) from Netdata.
+- [awesome-TS-anomaly-detection](https://github.com/rob-med/awesome-TS-anomaly-detection) GitHub list of useful tools,
+  libraries and resources.
+- [Mendeley public group](https://www.mendeley.com/community/interesting-anomaly-detection-papers/) with some
+  interesting anomaly detection papers we have been reading.
+- Good [blog post](https://www.anodot.com/blog/what-is-anomaly-detection/) from Anodot on time series anomaly detection.
+  Anodot also has some great whitepapers in this space that some may find useful.
+- Novelty and outlier detection in
+  the [scikit-learn documentation](https://scikit-learn.org/stable/modules/outlier_detection.html).
+
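The flagging rule described in the README above (a score is flagged when its percentile among the most recent `n_score_samples` scores reaches `cf_threshold`) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the collector's code: `percentile_of_score` is a hypothetical, simplified stand-in for `scipy.stats.percentileofscore` (it only counts values less than or equal to the score), `PercentileFlagger` is an illustrative name, and the score values are made up.

```python
from collections import deque


def percentile_of_score(samples, score):
    """Percentage of samples that are <= score (simplified stand-in)."""
    if not samples:
        return 0.0
    return 100.0 * sum(1 for s in samples if s <= score) / len(samples)


class PercentileFlagger:
    """Keep a rolling window of raw scores and flag unusually high ones."""

    def __init__(self, n_score_samples=14400, cf_threshold=99):
        # deque(maxlen=...) drops the oldest score once the window is full
        self.samples = deque(maxlen=n_score_samples)
        self.cf_threshold = cf_threshold

    def update(self, raw_score):
        self.samples.append(raw_score)
        percentile = percentile_of_score(self.samples, raw_score)
        flag = 1 if percentile >= self.cf_threshold else 0
        return percentile, flag


# A small window is used here only to keep the example quick.
flagger = PercentileFlagger(n_score_samples=200, cf_threshold=99)

# The very first score is trivially the highest seen so far, so it is flagged.
first = flagger.update(1.0)

# Feed a made-up, repeating pattern of unremarkable scores to build history.
for i in range(150):
    flagger.update(1.0 + (i % 10) * 0.1)

typical = flagger.update(1.2)  # sits well inside the history, not flagged
spike = flagger.update(5.0)    # above everything seen so far, flagged
print(first, typical, spike)
```

The first call illustrates the warm-up effect mentioned in the notes: with almost no history, nearly every score ranks at the top of the window, so early flags are expected and become rarer as the window fills.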
diff --git a/collectors/python.d.plugin/changefinder/changefinder.chart.py b/collectors/python.d.plugin/changefinder/changefinder.chart.py
new file mode 100644
index 0000000..c18e560
--- /dev/null
+++ b/collectors/python.d.plugin/changefinder/changefinder.chart.py
@@ -0,0 +1,185 @@
+# -*- coding: utf-8 -*-
+# Description: changefinder netdata python.d module
+# Author: andrewm4894
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+from json import loads
+import re
+
+from bases.FrameworkServices.UrlService import UrlService
+
+import numpy as np
+import changefinder
+from scipy.stats import percentileofscore
+
+update_every = 5
+disabled_by_default = True
+
+ORDER = [
+    'scores',
+    'flags'
+]
+
+CHARTS = {
+    'scores': {
+        'options': [None, 'ChangeFinder', 'score', 'Scores', 'scores', 'line'],
+        'lines': []
+    },
+    'flags': {
+        'options': [None, 'ChangeFinder', 'flag', 'Flags', 'flags', 'stacked'],
+        'lines': []
+    }
+}
+
+DEFAULT_PROTOCOL = 'http'
+DEFAULT_HOST = '127.0.0.1:19999'
+DEFAULT_CHARTS_REGEX = r'system\..*'
+DEFAULT_MODE = 'per_chart'
+DEFAULT_CF_R = 0.5
+DEFAULT_CF_ORDER = 1
+DEFAULT_CF_SMOOTH = 15
+DEFAULT_CF_DIFF = False
+DEFAULT_CF_THRESHOLD = 99
+DEFAULT_N_SCORE_SAMPLES = 14400
+DEFAULT_SHOW_SCORES = False
+
+
+class Service(UrlService):
+    def __init__(self, configuration=None, name=None):
+        UrlService.__init__(self, configuration=configuration, name=name)
+        self.order = ORDER
+        self.definitions = CHARTS
+        self.protocol = self.configuration.get('protocol', DEFAULT_PROTOCOL)
+        self.host = self.configuration.get('host', DEFAULT_HOST)
+        self.url = '{}://{}/api/v1/allmetrics?format=json'.format(self.protocol, self.host)
+        self.charts_regex = re.compile(self.configuration.get('charts_regex', DEFAULT_CHARTS_REGEX))
+        self.charts_to_exclude = self.configuration.get('charts_to_exclude', '').split(',')
+        self.mode = self.configuration.get('mode', DEFAULT_MODE)
+        self.n_score_samples = int(self.configuration.get('n_score_samples', DEFAULT_N_SCORE_SAMPLES))
+        self.show_scores = int(self.configuration.get('show_scores', DEFAULT_SHOW_SCORES))
+        self.cf_r = float(self.configuration.get('cf_r', DEFAULT_CF_R))
+        self.cf_order = int(self.configuration.get('cf_order', DEFAULT_CF_ORDER))
+        self.cf_smooth = int(self.configuration.get('cf_smooth', DEFAULT_CF_SMOOTH))
+        self.cf_diff = bool(self.configuration.get('cf_diff', DEFAULT_CF_DIFF))
+        self.cf_threshold = float(self.configuration.get('cf_threshold', DEFAULT_CF_THRESHOLD))
+        self.collected_dims = {'scores': set(), 'flags': set()}
+        self.models = {}
+        self.x_latest = {}
+        self.scores_latest = {}
+        self.scores_samples = {}
+
+    def get_score(self, x, model):
+        """Update the score for the model based on most recent data, flag if its percentile passes self.cf_threshold.
+        """
+
+        # get score
+        if model not in self.models:
+            # initialise empty model if needed
+            self.models[model] = changefinder.ChangeFinder(r=self.cf_r, order=self.cf_order, smooth=self.cf_smooth)
+        # if the update for this step fails then just fallback to last known score
+        try:
+            score = self.models[model].update(x)
+            self.scores_latest[model] = score
+        except Exception:
+            score = self.scores_latest.get(model, 0)
+        score = 0 if np.isnan(score) else score
+
+        # update sample scores used to calculate percentiles
+        if model in self.scores_samples:
+            self.scores_samples[model].append(score)
+        else:
+            self.scores_samples[model] = [score]
+        self.scores_samples[model] = self.scores_samples[model][-self.n_score_samples:]
+
+        # convert score to percentile
+        score = percentileofscore(self.scores_samples[model], score)
+
+        # flag based on score percentile
+        flag = 1 if score >= self.cf_threshold else 0
+
+        return score, flag
+
+    def validate_charts(self, chart, data, algorithm='absolute', multiplier=1, divisor=1):
+        """If dimension not in chart then add it.
+        """
+        if not self.charts:
+            return
+
+        for dim in data:
+            if dim not in self.collected_dims[chart]:
+                self.collected_dims[chart].add(dim)
+                self.charts[chart].add_dimension([dim, dim, algorithm, multiplier, divisor])
+
+        for dim in list(self.collected_dims[chart]):
+            if dim not in data:
+                self.collected_dims[chart].remove(dim)
+                self.charts[chart].del_dimension(dim, hide=False)
+
+    def diff(self, x, model):
+        """Take difference of data.
+        """
+        x_diff = x - self.x_latest.get(model, 0)
+        self.x_latest[model] = x
+        x = x_diff
+        return x
+
+    def _get_data(self):
+
+        # pull data from self.url
+        raw_data = self._get_raw_data()
+        if raw_data is None:
+            return None
+
+        raw_data = loads(raw_data)
+
+        # filter to just the data for the charts specified
+        charts_in_scope = list(filter(self.charts_regex.match, raw_data.keys()))
+        charts_in_scope = [c for c in charts_in_scope if c not in self.charts_to_exclude]
+
+        data_score = {}
+        data_flag = {}
+
+        # process each chart
+        for chart in charts_in_scope:
+
+            if self.mode == 'per_chart':
+
+                # average dims on chart and run changefinder on that average
+                x = [raw_data[chart]['dimensions'][dim]['value'] for dim in raw_data[chart]['dimensions']]
+                x = [x for x in x if x is not None]
+
+                if len(x) > 0:
+
+                    x = sum(x) / len(x)
+                    x = self.diff(x, chart) if self.cf_diff else x
+
+                    score, flag = self.get_score(x, chart)
+                    if self.show_scores:
+                        data_score['{}_score'.format(chart)] = score * 100
+                    data_flag[chart] = flag
+
+            else:
+
+                # run changefinder on each individual dim
+                for dim in raw_data[chart]['dimensions']:
+
+                    chart_dim = '{}|{}'.format(chart, dim)
+
+                    x = raw_data[chart]['dimensions'][dim]['value']
+                    x = x if x else 0
+                    x = self.diff(x, chart_dim) if self.cf_diff else x
+
+                    score, flag = self.get_score(x, chart_dim)
+                    if self.show_scores:
+                        data_score['{}_score'.format(chart_dim)] = score * 100
+                    data_flag[chart_dim] = flag
+
+        self.validate_charts('flags', data_flag)
+
+        if self.show_scores and len(data_score) > 0:
+            data_score['average_score'] = sum(data_score.values()) / len(data_score)
+            self.validate_charts('scores', data_score, divisor=100)
+
+        data = {**data_score, **data_flag}
+
+        return data
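For readers following `_get_data` above, the two modes differ only in which series is handed to each model: `per_chart` averages the non-null dimension values of a chart into one series, while the other branch keeps one series per dimension under a `chart|dim` key. The sketch below mirrors that logic on a made-up payload with the same nesting the code reads (`dimensions`, then the dimension name, then `value`). `series_for_models` is a hypothetical helper written for illustration and is not part of the collector.

```python
def series_for_models(raw_data, mode='per_chart'):
    """Return {model_key: latest_value} in the shape the models would receive."""
    out = {}
    for chart, info in raw_data.items():
        dims = info['dimensions']
        if mode == 'per_chart':
            # average the dimensions that reported a value
            values = [d['value'] for d in dims.values() if d['value'] is not None]
            if values:
                out[chart] = sum(values) / len(values)
        else:
            # one series per dimension; missing values are treated as 0
            for dim, d in dims.items():
                out['{}|{}'.format(chart, dim)] = d['value'] if d['value'] else 0
    return out


# Made-up sample payload, for illustration only.
raw_data = {
    'system.cpu': {'dimensions': {
        'user': {'value': 10.0},
        'system': {'value': 30.0},
        'idle': {'value': None},
    }},
    'system.load': {'dimensions': {'load1': {'value': 0.5}}},
}

per_chart = series_for_models(raw_data, mode='per_chart')
per_dim = series_for_models(raw_data, mode='per_dim')
print(per_chart)
print(per_dim)
```

Note the different handling of missing values: the averaged series simply skips them, whereas the per-dimension series substitutes `0`, which can itself look like a change if a dimension briefly stops reporting.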
diff --git a/collectors/python.d.plugin/changefinder/changefinder.conf b/collectors/python.d.plugin/changefinder/changefinder.conf
new file mode 100644
index 0000000..56a681f
--- /dev/null
+++ b/collectors/python.d.plugin/changefinder/changefinder.conf
@@ -0,0 +1,74 @@
+# netdata python.d.plugin configuration for changefinder
+#
+# This file is in YAML format. Generally the format is:
+#
+# name: value
+#
+# There are 2 sections:
+#  - global variables
+#  - one or more JOBS
+#
+# JOBS allow you to collect values from multiple sources.
+# Each source will have its own set of charts.
+#
+# JOB parameters have to be indented (using spaces only, example below).
+
+# ----------------------------------------------------------------------
+# Global Variables
+# These variables set the defaults for all JOBs, however each JOB
+# may define its own, overriding the defaults.
+
+# update_every sets the default data collection frequency.
+# If unset, the python.d.plugin default is used.
+# update_every: 5
+
+# priority controls the order of charts at the netdata dashboard.
+# Lower numbers move the charts towards the top of the page.
+# If unset, the default for python.d.plugin is used.
+# priority: 60000
+
+# penalty indicates whether to apply penalty to update_every in case of failures.
+# Penalty will increase every 5 failed updates in a row. Maximum penalty is 10 minutes.
+# penalty: yes
+
+# autodetection_retry sets the job re-check interval in seconds.
+# The job is not deleted if check fails.
+# Attempts to start the job are made once every autodetection_retry.
+# This feature is disabled by default.
+# autodetection_retry: 0
+
+# ----------------------------------------------------------------------
+# JOBS (data collection sources)
+
+local:
+
+  # A friendly name for this job.
+  name: 'local'
+
+  # What host to pull data from.
+  host: '127.0.0.1:19999'
+
+  # What charts to pull data for - A regex like 'system\..*' or 'system\..*|apps.cpu|apps.mem' etc.
+  charts_regex: 'system\..*'
+
+  # Charts to exclude, useful if you would like to exclude some specific charts. 
+  # Note: should be a ',' separated string like 'chart.name,chart.name'.
+  charts_to_exclude: ''
+
+  # Get ChangeFinder scores 'per_dim' or 'per_chart'.
+  mode: 'per_chart'
+
+  # Default parameters that can be passed to the changefinder library.
+  cf_r: 0.5
+  cf_order: 1
+  cf_smooth: 15
+
+  # The percentile above which scores will be flagged.
+  cf_threshold: 99
+
+  # The number of recent scores to use when calculating the percentile of the changefinder score.
+  n_score_samples: 14400
+
+  # Set to true if you also want to chart the percentile scores in addition to the flags.
+  # Mainly useful for debugging or if you want to dive deeper on how the scores are evolving over time.
+  show_scores: false
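To get a feel for how `charts_regex` and `charts_to_exclude` interact, the selection step can be tried outside Netdata. The sketch below assumes the same approach the module takes (a compiled pattern applied with `match`, followed by an exact-name exclusion list split on commas). `charts_in_scope` is a hypothetical helper and the chart names are made up for the example.

```python
import re


def charts_in_scope(chart_names, charts_regex, charts_to_exclude=''):
    """Keep charts whose name matches the regex and is not explicitly excluded."""
    pattern = re.compile(charts_regex)
    excluded = charts_to_exclude.split(',')
    return [c for c in chart_names if pattern.match(c) and c not in excluded]


names = ['system.cpu', 'system.load', 'apps.cpu', 'apps.mem', 'disk.sda']

only_system = charts_in_scope(names, r'system\..*')
system_and_apps = charts_in_scope(names, r'system\..*|apps.cpu|apps.mem', 'system.load')

# Pitfall: a trailing '|' adds an empty alternative, which matches every name.
everything = charts_in_scope(names, r'system\..*|')

print(only_system)
print(system_and_apps)
print(everything)
```

Because `match` anchors only at the start of the name, a pattern such as `system\..*` selects every chart whose name begins with `system.`, and exclusions must be spelled out as full chart names.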