path: root/collectors/python.d.plugin/changefinder
Diffstat (limited to 'collectors/python.d.plugin/changefinder')
-rw-r--r--   collectors/python.d.plugin/changefinder/Makefile.inc             13
-rw-r--r--   collectors/python.d.plugin/changefinder/README.md                218
-rw-r--r--   collectors/python.d.plugin/changefinder/changefinder.chart.py    185
-rw-r--r--   collectors/python.d.plugin/changefinder/changefinder.conf         74
4 files changed, 490 insertions, 0 deletions
diff --git a/collectors/python.d.plugin/changefinder/Makefile.inc b/collectors/python.d.plugin/changefinder/Makefile.inc
new file mode 100644
index 000000000..01a92408b
--- /dev/null
+++ b/collectors/python.d.plugin/changefinder/Makefile.inc
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+# THIS IS NOT A COMPLETE Makefile
+# IT IS INCLUDED BY ITS PARENT'S Makefile.am
+# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT
+
+# install these files
+dist_python_DATA += changefinder/changefinder.chart.py
+dist_pythonconfig_DATA += changefinder/changefinder.conf
+
+# do not install these files, but include them in the distribution
+dist_noinst_DATA += changefinder/README.md changefinder/Makefile.inc
+
diff --git a/collectors/python.d.plugin/changefinder/README.md b/collectors/python.d.plugin/changefinder/README.md
new file mode 100644
index 000000000..e1c1d4ba4
--- /dev/null
+++ b/collectors/python.d.plugin/changefinder/README.md
@@ -0,0 +1,218 @@
+<!--
+title: "Online change point detection with Netdata"
+description: "Use ML-driven change point detection to narrow your focus and shorten root cause analysis."
+custom_edit_url: https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/changefinder/README.md
+-->
+
+# Online changepoint detection with Netdata
+
+This collector uses the Python [changefinder](https://github.com/shunsukeaihara/changefinder) library to
+perform [online](https://en.wikipedia.org/wiki/Online_machine_learning) [changepoint detection](https://en.wikipedia.org/wiki/Change_detection)
+on your Netdata charts and/or dimensions.
+
+Rather than just _collecting_ data, this collector also does some computation on the data it collects to return a
+changepoint score for each chart or dimension you configure it to work on. This is
+an [online](https://en.wikipedia.org/wiki/Online_machine_learning) machine learning algorithm, so there is no batch step
+to train the model; instead it evolves over time as more data arrives. That makes this particular algorithm quite cheap
+to compute at each step of data collection (see the notes section below for more details) and it should scale fairly
+well to work on lots of charts or hosts (if running on a parent node for example).
+
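+To make the "online" part concrete, the minimal sketch below (using an illustrative synthetic stream rather than real
+Netdata data) shows the update loop this collector builds on: each new value is fed to a model one at a time and a raw
+change score comes straight back, with no separate training phase.
+
+```python
+import numpy as np
+import changefinder
+
+# one model instance; the parameters mirror the collector defaults (cf_r, cf_order, cf_smooth)
+cf = changefinder.ChangeFinder(r=0.5, order=1, smooth=15)
+
+# an illustrative stream with an obvious level shift half way through
+stream = np.concatenate([np.random.normal(0, 1, 500), np.random.normal(10, 1, 500)])
+
+# feed observations one at a time - the model updates itself and returns a raw change score
+for x in stream:
+    score = cf.update(x)
+```
+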
+> As this is a somewhat unique collector and involves often subjective concepts like changepoints and anomalies, we would love to hear any feedback on it from the community. Please let us know on the [community forum](https://community.netdata.cloud/t/changefinder-collector-feedback/972) or drop us a note at [analytics-ml-team@netdata.cloud](mailto:analytics-ml-team@netdata.cloud) for any and all feedback, both positive and negative. This sort of feedback is priceless to help us make complex features more useful.
+
+## Charts
+
+Two charts are available:
+
+### ChangeFinder Scores (`changefinder.scores`)
+
+This chart shows the percentile of the score output by the ChangeFinder library (this chart is off by default
+but can be enabled with `show_scores: true`).
+
+A high observed score is more likely to be a valid changepoint worth exploring, even more so when multiple charts or
+dimensions have high changepoint scores at the same time or very close together.
+
+### ChangeFinder Flags (`changefinder.flags`)
+
+This chart shows `1` or `0` depending on whether the latest score's percentile value exceeds the `cf_threshold` threshold. By
+default, any scores that are in the 99th or above percentile will raise a flag on this chart.
+
+The raw changefinder score itself can be a little noisy and so limiting ourselves to just periods where it surpasses
+the 99th percentile can help manage the "[signal to noise ratio](https://en.wikipedia.org/wiki/Signal-to-noise_ratio)"
+better.
+
+The `cf_threshold` parameter might be one you want to play around with to tune things specifically for the workloads on
+your node and the specific charts you want to monitor. For example, maybe the 95th percentile might work better for you
+than the 99th percentile.
+
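+As a rough sketch of just this flagging step (with `recent_scores` standing in for the rolling window of the last
+`n_score_samples` raw scores), the logic amounts to:
+
+```python
+from scipy.stats import percentileofscore
+
+def flag(latest_score, recent_scores, cf_threshold=99):
+    # express the latest raw score as a percentile of the scores seen recently
+    pct = percentileofscore(recent_scores, latest_score)
+    # raise a flag only when that percentile meets or exceeds the configured threshold
+    return 1 if pct >= cf_threshold else 0
+
+flag(3.2, [0.1, 0.4, 0.2, 0.3, 3.0], cf_threshold=99)  # 1, since 3.2 sits at the 100th percentile here
+```
+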
+Below is an example of the chart produced by this collector. The first 3/4 of the period looks normal in that we see a
+few individual changes being picked up somewhat randomly over time. But then at around 14:59 towards the end of the
+chart we see two periods with 'spikes' of multiple changes for a small period of time. This is the sort of pattern that
+might be a sign that something on the system has changed sufficiently to merit some investigation.
+
+![changepoint-collector](https://user-images.githubusercontent.com/2178292/108773528-665de980-7556-11eb-895d-798669bcd695.png)
+
+## Requirements
+
+- This collector will only work with Python 3 and requires the packages below to be installed.
+
+```bash
+# become netdata user
+sudo su -s /bin/bash netdata
+# install required packages for the netdata user
+pip3 install --user numpy==1.19.5 changefinder==0.03 scipy==1.5.4
+```
+
+**Note**: if you need to tell Netdata to use Python 3 then you can add the below configuration to the python plugin
+section of your `netdata.conf` file.
+
+```yaml
+[ plugin:python.d ]
+ # update every = 1
+ command options = -ppython3
+```
+
+## Configuration
+
+Install the Python requirements above, enable the collector and restart Netdata.
+
+```bash
+cd /etc/netdata/
+sudo ./edit-config python.d.conf
+# Set `changefinder: no` to `changefinder: yes`
+sudo systemctl restart netdata
+```
+
+The configuration for the changefinder collector defines how it will behave on your system and might take some
+experimentation over time to set it optimally for your node. Out of the box, the config comes with
+some [sane defaults](https://www.netdata.cloud/blog/redefining-monitoring-netdata/) to get you started that try to
+balance the flexibility and power of the ML models with the goal of being as cheap as possible in terms of cost on the
+node's resources.
+
+_**Note**: If you are unsure about any of the below configuration options then it's best to just ignore all this and
+leave the `changefinder.conf` file alone to begin with. Then you can return to it later if you would like to tune things
+a bit more once the collector has been running for a while and you have a feeling for its performance on your node._
+
+Edit the `python.d/changefinder.conf` configuration file using `edit-config` from your
+agent's [config directory](/docs/configure/nodes.md), which is usually at `/etc/netdata`.
+
+```bash
+cd /etc/netdata # Replace this path with your Netdata config directory, if different
+sudo ./edit-config python.d/changefinder.conf
+```
+
+The default configuration should look something like this. Here you can see each parameter (with its sane default) and
+some information about what each one does.
+
+```yaml
+# ----------------------------------------------------------------------
+# JOBS (data collection sources)
+
+# Pull data from local Netdata node.
+local:
+
+ # A friendly name for this job.
+ name: 'local'
+
+ # What host to pull data from.
+ host: '127.0.0.1:19999'
+
+ # What charts to pull data for - A regex like 'system\..*|' or 'system\..*|apps.cpu|apps.mem' etc.
+ charts_regex: 'system\..*'
+
+ # Charts to exclude, useful if you would like to exclude some specific charts.
+ # Note: should be a ',' separated string like 'chart.name,chart.name'.
+ charts_to_exclude: ''
+
+ # Get ChangeFinder scores 'per_dim' or 'per_chart'.
+ mode: 'per_chart'
+
+ # Default parameters that can be passed to the changefinder library.
+ cf_r: 0.5
+ cf_order: 1
+ cf_smooth: 15
+
+ # The percentile above which scores will be flagged.
+ cf_threshold: 99
+
+ # The number of recent scores to use when calculating the percentile of the changefinder score.
+ n_score_samples: 14400
+
+ # Set to true if you also want to chart the percentile scores in addition to the flags.
+ # Mainly useful for debugging or if you want to dive deeper on how the scores are evolving over time.
+ show_scores: false
+```
+
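+For example, the `mode` parameter above controls whether one model is kept per chart or per dimension. With
+`per_chart`, the collector averages the dimensions of a chart into a single value before scoring; a minimal,
+illustrative sketch of that step (where `dimension_values` stands in for the latest value of each dimension on a
+chart) looks like this:
+
+```python
+# latest values of a chart's dimensions; None means a dimension had no value this collection step
+dimension_values = [0.5, None, 2.5]
+
+# drop missing values and average what is left into a single value for the chart
+values = [v for v in dimension_values if v is not None]
+if values:
+    x = sum(values) / len(values)  # this averaged value is what gets fed to that chart's model
+```
+
+With `per_dim`, each dimension instead gets its own model and its own score and flag.
+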
+## Troubleshooting
+
+To see any relevant log messages you can use a command like the one below.
+
+```bash
+grep 'changefinder' /var/log/netdata/error.log
+```
+
+If you would like to see more detail, you can log in as the `netdata` user and run the collector in debug mode.
+
+```bash
+# become netdata user
+sudo su -s /bin/bash netdata
+# run the collector in debug mode, using the `nolock` option if netdata is already running the collector itself.
+/usr/libexec/netdata/plugins.d/python.d.plugin changefinder debug trace nolock
+```
+
+## Notes
+
+- It may take an hour or two (depending on your choice of `n_score_samples`) for the collector to 'settle' into its
+  typical behaviour in terms of the trained models and scores you will see in the normal running of your node. Mainly
+  this is because it can take a while to build up a proper distribution of previous scores in order to convert the raw
+  score returned by the ChangeFinder algorithm into a percentile based on the most recent `n_score_samples` that have
+ already been produced. So when you first turn the collector on, it will have a lot of flags in the beginning and then
+ should 'settle down' once it has built up enough history. This is a typical characteristic of online machine learning
+ approaches which need some initial window of time before they can be useful.
+- As this collector does most of the work in Python itself, you may want to try it out first on a test or development
+ system to get a sense of its performance characteristics on a node similar to where you would like to use it.
+- On a development n1-standard-2 (2 vCPUs, 7.5 GB memory) VM running Ubuntu 18.04 LTS and not doing any work, some of
+  the typical performance characteristics we saw from running this collector (with defaults) were:
+ - A runtime (`netdata.runtime_changefinder`) of ~30ms.
+    - Typically ~1% additional CPU usage.
+    - About ~85 MB of RAM (`apps.mem`) being continually used by the `python.d.plugin` under default configuration.
+
+## Useful links and further reading
+
+- [PyPi changefinder](https://pypi.org/project/changefinder/) reference page.
+- [GitHub repo](https://github.com/shunsukeaihara/changefinder) for the changefinder library.
+- Relevant academic papers:
+ - Yamanishi K, Takeuchi J. A unifying framework for detecting outliers and change points from nonstationary time
+ series data. 8th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD02. 2002:
+ 676. ([pdf](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3469&rep=rep1&type=pdf))
+ - Kawahara Y, Sugiyama M. Sequential Change-Point Detection Based on Direct Density-Ratio Estimation. SIAM
+ International Conference on Data Mining. 2009:
+ 389–400. ([pdf](https://onlinelibrary.wiley.com/doi/epdf/10.1002/sam.10124))
+ - Liu S, Yamada M, Collier N, Sugiyama M. Change-point detection in time-series data by relative density-ratio
+ estimation. Neural Networks. Jul.2013 43:72–83. [PubMed: 23500502] ([pdf](https://arxiv.org/pdf/1203.0453.pdf))
+ - T. Iwata, K. Nakamura, Y. Tokusashi, and H. Matsutani, “Accelerating Online Change-Point Detection Algorithm using
+ 10 GbE FPGA NIC,” Proc. International European Conference on Parallel and Distributed Computing (Euro-Par’18)
+ Workshops, vol.11339, pp.506–517, Aug.
+ 2018 ([pdf](https://www.arc.ics.keio.ac.jp/~matutani/papers/iwata_heteropar2018.pdf))
+- The [ruptures](https://github.com/deepcharles/ruptures) python package is also a good place to learn more about
+ changepoint detection (mostly offline as opposed to online but deals with similar concepts).
+- A nice [blog post](https://techrando.com/2019/08/14/a-brief-introduction-to-change-point-detection-using-python/)
+ showing some of the other options and libraries for changepoint detection in Python.
+- [Bayesian changepoint detection](https://github.com/hildensia/bayesian_changepoint_detection) library - we may explore
+ implementing a collector for this or integrating this approach into this collector at a future date if there is
+  interest and it proves computationally feasible.
+- You might also find the
+ Netdata [anomalies collector](https://github.com/netdata/netdata/tree/master/collectors/python.d.plugin/anomalies)
+ interesting.
+- [Anomaly Detection](https://en.wikipedia.org/wiki/Anomaly_detection) wikipedia page.
+- [Anomaly Detection YouTube playlist](https://www.youtube.com/playlist?list=PL6Zhl9mK2r0KxA6rB87oi4kWzoqGd5vp0)
+ maintained by [andrewm4894](https://github.com/andrewm4894/) from Netdata.
+- [awesome-TS-anomaly-detection](https://github.com/rob-med/awesome-TS-anomaly-detection) GitHub list of useful tools,
+ libraries and resources.
+- [Mendeley public group](https://www.mendeley.com/community/interesting-anomaly-detection-papers/) with some
+ interesting anomaly detection papers we have been reading.
+- Good [blog post](https://www.anodot.com/blog/what-is-anomaly-detection/) from Anodot on time series anomaly detection.
+  Anodot also has some great whitepapers in this space that some may find useful.
+- Novelty and outlier detection in
+ the [scikit-learn documentation](https://scikit-learn.org/stable/modules/outlier_detection.html).
+
+[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fcollectors%2Fpython.d.plugin%2Fchangefinder%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]()
diff --git a/collectors/python.d.plugin/changefinder/changefinder.chart.py b/collectors/python.d.plugin/changefinder/changefinder.chart.py
new file mode 100644
index 000000000..c18e5600a
--- /dev/null
+++ b/collectors/python.d.plugin/changefinder/changefinder.chart.py
@@ -0,0 +1,185 @@
+# -*- coding: utf-8 -*-
+# Description: changefinder netdata python.d module
+# Author: andrewm4894
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+from json import loads
+import re
+
+from bases.FrameworkServices.UrlService import UrlService
+
+import numpy as np
+import changefinder
+from scipy.stats import percentileofscore
+
+update_every = 5
+disabled_by_default = True
+
+ORDER = [
+ 'scores',
+ 'flags'
+]
+
+CHARTS = {
+ 'scores': {
+ 'options': [None, 'ChangeFinder', 'score', 'Scores', 'scores', 'line'],
+ 'lines': []
+ },
+ 'flags': {
+ 'options': [None, 'ChangeFinder', 'flag', 'Flags', 'flags', 'stacked'],
+ 'lines': []
+ }
+}
+
+DEFAULT_PROTOCOL = 'http'
+DEFAULT_HOST = '127.0.0.1:19999'
+DEFAULT_CHARTS_REGEX = 'system.*'
+DEFAULT_MODE = 'per_chart'
+DEFAULT_CF_R = 0.5
+DEFAULT_CF_ORDER = 1
+DEFAULT_CF_SMOOTH = 15
+DEFAULT_CF_DIFF = False
+DEFAULT_CF_THRESHOLD = 99
+DEFAULT_N_SCORE_SAMPLES = 14400
+DEFAULT_SHOW_SCORES = False
+
+
+class Service(UrlService):
+ def __init__(self, configuration=None, name=None):
+ UrlService.__init__(self, configuration=configuration, name=name)
+ self.order = ORDER
+ self.definitions = CHARTS
+ self.protocol = self.configuration.get('protocol', DEFAULT_PROTOCOL)
+ self.host = self.configuration.get('host', DEFAULT_HOST)
+ self.url = '{}://{}/api/v1/allmetrics?format=json'.format(self.protocol, self.host)
+ self.charts_regex = re.compile(self.configuration.get('charts_regex', DEFAULT_CHARTS_REGEX))
+ self.charts_to_exclude = self.configuration.get('charts_to_exclude', '').split(',')
+ self.mode = self.configuration.get('mode', DEFAULT_MODE)
+ self.n_score_samples = int(self.configuration.get('n_score_samples', DEFAULT_N_SCORE_SAMPLES))
+ self.show_scores = int(self.configuration.get('show_scores', DEFAULT_SHOW_SCORES))
+ self.cf_r = float(self.configuration.get('cf_r', DEFAULT_CF_R))
+ self.cf_order = int(self.configuration.get('cf_order', DEFAULT_CF_ORDER))
+ self.cf_smooth = int(self.configuration.get('cf_smooth', DEFAULT_CF_SMOOTH))
+ self.cf_diff = bool(self.configuration.get('cf_diff', DEFAULT_CF_DIFF))
+ self.cf_threshold = float(self.configuration.get('cf_threshold', DEFAULT_CF_THRESHOLD))
+ self.collected_dims = {'scores': set(), 'flags': set()}
+ self.models = {}
+ self.x_latest = {}
+ self.scores_latest = {}
+ self.scores_samples = {}
+
+ def get_score(self, x, model):
+ """Update the score for the model based on most recent data, flag if it's percentile passes self.cf_threshold.
+ """
+
+ # get score
+ if model not in self.models:
+ # initialise empty model if needed
+ self.models[model] = changefinder.ChangeFinder(r=self.cf_r, order=self.cf_order, smooth=self.cf_smooth)
+ # if the update for this step fails then just fallback to last known score
+ try:
+ score = self.models[model].update(x)
+ self.scores_latest[model] = score
+ except Exception as _:
+ score = self.scores_latest.get(model, 0)
+ score = 0 if np.isnan(score) else score
+
+ # update sample scores used to calculate percentiles
+ if model in self.scores_samples:
+ self.scores_samples[model].append(score)
+ else:
+ self.scores_samples[model] = [score]
+ self.scores_samples[model] = self.scores_samples[model][-self.n_score_samples:]
+
+ # convert score to percentile
+ score = percentileofscore(self.scores_samples[model], score)
+
+ # flag based on score percentile
+ flag = 1 if score >= self.cf_threshold else 0
+
+ return score, flag
+
+ def validate_charts(self, chart, data, algorithm='absolute', multiplier=1, divisor=1):
+ """If dimension not in chart then add it.
+ """
+ if not self.charts:
+ return
+
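+        # add any dimensions present in the data but not yet on the chart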
+ for dim in data:
+ if dim not in self.collected_dims[chart]:
+ self.collected_dims[chart].add(dim)
+ self.charts[chart].add_dimension([dim, dim, algorithm, multiplier, divisor])
+
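+        # remove dimensions that are no longer present in the incoming data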
+ for dim in list(self.collected_dims[chart]):
+ if dim not in data:
+ self.collected_dims[chart].remove(dim)
+ self.charts[chart].del_dimension(dim, hide=False)
+
+ def diff(self, x, model):
+ """Take difference of data.
+ """
+ x_diff = x - self.x_latest.get(model, 0)
+ self.x_latest[model] = x
+ x = x_diff
+ return x
+
+ def _get_data(self):
+
+ # pull data from self.url
+ raw_data = self._get_raw_data()
+ if raw_data is None:
+ return None
+
+ raw_data = loads(raw_data)
+
+ # filter to just the data for the charts specified
+ charts_in_scope = list(filter(self.charts_regex.match, raw_data.keys()))
+ charts_in_scope = [c for c in charts_in_scope if c not in self.charts_to_exclude]
+
+ data_score = {}
+ data_flag = {}
+
+ # process each chart
+ for chart in charts_in_scope:
+
+ if self.mode == 'per_chart':
+
+ # average dims on chart and run changefinder on that average
+ x = [raw_data[chart]['dimensions'][dim]['value'] for dim in raw_data[chart]['dimensions']]
+ x = [x for x in x if x is not None]
+
+ if len(x) > 0:
+
+ x = sum(x) / len(x)
+ x = self.diff(x, chart) if self.cf_diff else x
+
+ score, flag = self.get_score(x, chart)
+ if self.show_scores:
+ data_score['{}_score'.format(chart)] = score * 100
+ data_flag[chart] = flag
+
+ else:
+
+ # run changefinder on each individual dim
+ for dim in raw_data[chart]['dimensions']:
+
+ chart_dim = '{}|{}'.format(chart, dim)
+
+ x = raw_data[chart]['dimensions'][dim]['value']
+ x = x if x else 0
+ x = self.diff(x, chart_dim) if self.cf_diff else x
+
+ score, flag = self.get_score(x, chart_dim)
+ if self.show_scores:
+ data_score['{}_score'.format(chart_dim)] = score * 100
+ data_flag[chart_dim] = flag
+
+ self.validate_charts('flags', data_flag)
+
+        if self.show_scores and len(data_score) > 0:
+ data_score['average_score'] = sum(data_score.values()) / len(data_score)
+ self.validate_charts('scores', data_score, divisor=100)
+
+ data = {**data_score, **data_flag}
+
+ return data
diff --git a/collectors/python.d.plugin/changefinder/changefinder.conf b/collectors/python.d.plugin/changefinder/changefinder.conf
new file mode 100644
index 000000000..56a681f1e
--- /dev/null
+++ b/collectors/python.d.plugin/changefinder/changefinder.conf
@@ -0,0 +1,74 @@
+# netdata python.d.plugin configuration for changefinder
+#
+# This file is in YAML format. Generally the format is:
+#
+# name: value
+#
+# There are 2 sections:
+# - global variables
+# - one or more JOBS
+#
+# JOBS allow you to collect values from multiple sources.
+# Each source will have its own set of charts.
+#
+# JOB parameters have to be indented (using spaces only, example below).
+
+# ----------------------------------------------------------------------
+# Global Variables
+# These variables set the defaults for all JOBs, however each JOB
+# may define its own, overriding the defaults.
+
+# update_every sets the default data collection frequency.
+# If unset, the python.d.plugin default is used.
+# update_every: 5
+
+# priority controls the order of charts at the netdata dashboard.
+# Lower numbers move the charts towards the top of the page.
+# If unset, the default for python.d.plugin is used.
+# priority: 60000
+
+# penalty indicates whether to apply penalty to update_every in case of failures.
+# Penalty will increase every 5 failed updates in a row. Maximum penalty is 10 minutes.
+# penalty: yes
+
+# autodetection_retry sets the job re-check interval in seconds.
+# The job is not deleted if check fails.
+# Attempts to start the job are made once every autodetection_retry.
+# This feature is disabled by default.
+# autodetection_retry: 0
+
+# ----------------------------------------------------------------------
+# JOBS (data collection sources)
+
+local:
+
+ # A friendly name for this job.
+ name: 'local'
+
+ # What host to pull data from.
+ host: '127.0.0.1:19999'
+
+ # What charts to pull data for - A regex like 'system\..*|' or 'system\..*|apps.cpu|apps.mem' etc.
+ charts_regex: 'system\..*'
+
+ # Charts to exclude, useful if you would like to exclude some specific charts.
+ # Note: should be a ',' separated string like 'chart.name,chart.name'.
+ charts_to_exclude: ''
+
+ # Get ChangeFinder scores 'per_dim' or 'per_chart'.
+ mode: 'per_chart'
+
+ # Default parameters that can be passed to the changefinder library.
+ cf_r: 0.5
+ cf_order: 1
+ cf_smooth: 15
+
+ # The percentile above which scores will be flagged.
+ cf_threshold: 99
+
+ # The number of recent scores to use when calculating the percentile of the changefinder score.
+ n_score_samples: 14400
+
+ # Set to true if you also want to chart the percentile scores in addition to the flags.
+ # Mainly useful for debugging or if you want to dive deeper on how the scores are evolving over time.
+ show_scores: false