diff options
Diffstat (limited to 'collectors/python.d.plugin/changefinder/README.md')
l---------[-rw-r--r--] | collectors/python.d.plugin/changefinder/README.md | 242 |
1 files changed, 1 insertions, 241 deletions
diff --git a/collectors/python.d.plugin/changefinder/README.md b/collectors/python.d.plugin/changefinder/README.md index 0e9bab88..0ca704eb 100644..120000 --- a/collectors/python.d.plugin/changefinder/README.md +++ b/collectors/python.d.plugin/changefinder/README.md @@ -1,241 +1 @@ -<!-- -title: "Online change point detection with Netdata" -description: "Use ML-driven change point detection to narrow your focus and shorten root cause analysis." -custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/changefinder/README.md" -sidebar_label: "changefinder" -learn_status: "Published" -learn_topic_type: "References" -learn_rel_path: "Integrations/Monitor/QoS" ---> - -# Online change point detection with Netdata - -This collector uses the Python [changefinder](https://github.com/shunsukeaihara/changefinder) library to -perform [online](https://en.wikipedia.org/wiki/Online_machine_learning) [changepoint detection](https://en.wikipedia.org/wiki/Change_detection) -on your Netdata charts and/or dimensions. - -Instead of this collector just _collecting_ data, it also does some computation on the data it collects to return a -changepoint score for each chart or dimension you configure it to work on. This is -an [online](https://en.wikipedia.org/wiki/Online_machine_learning) machine learning algorithm so there is no batch step -to train the model, instead it evolves over time as more data arrives. That makes this particular algorithm quite cheap -to compute at each step of data collection (see the notes section below for more details) and it should scale fairly -well to work on lots of charts or hosts (if running on a parent node for example). - -> As this is a somewhat unique collector and involves often subjective concepts like changepoints and anomalies, we would love to hear any feedback on it from the community. Please let us know on the [community forum](https://community.netdata.cloud/t/changefinder-collector-feedback/972) or drop us a note at [analytics-ml-team@netdata.cloud](mailto:analytics-ml-team@netdata.cloud) for any and all feedback, both positive and negative. This sort of feedback is priceless to help us make complex features more useful. - -## Charts - -Two charts are available: - -### ChangeFinder Scores (`changefinder.scores`) - -This chart shows the percentile of the score that is output from the ChangeFinder library (it is turned off by default -but available with `show_scores: true`). - -A high observed score is more likely to be a valid changepoint worth exploring, even more so when multiple charts or -dimensions have high changepoint scores at the same time or very close together. - -### ChangeFinder Flags (`changefinder.flags`) - -This chart shows `1` or `0` if the latest score has a percentile value that exceeds the `cf_threshold` threshold. By -default, any scores that are in the 99th or above percentile will raise a flag on this chart. - -The raw changefinder score itself can be a little noisy and so limiting ourselves to just periods where it surpasses -the 99th percentile can help manage the "[signal to noise ratio](https://en.wikipedia.org/wiki/Signal-to-noise_ratio)" -better. - -The `cf_threshold` parameter might be one you want to play around with to tune things specifically for the workloads on -your node and the specific charts you want to monitor. For example, maybe the 95th percentile might work better for you -than the 99th percentile. - -Below is an example of the chart produced by this collector. The first 3/4 of the period looks normal in that we see a -few individual changes being picked up somewhat randomly over time. But then at around 14:59 towards the end of the -chart we see two periods with 'spikes' of multiple changes for a small period of time. This is the sort of pattern that -might be a sign something on the system that has changed sufficiently enough to merit some investigation. - -![changepoint-collector](https://user-images.githubusercontent.com/2178292/108773528-665de980-7556-11eb-895d-798669bcd695.png) - -## Requirements - -- This collector will only work with Python 3 and requires the packages below be installed. - -```bash -# become netdata user -sudo su -s /bin/bash netdata -# install required packages for the netdata user -pip3 install --user numpy==1.19.5 changefinder==0.03 scipy==1.5.4 -``` - -**Note**: if you need to tell Netdata to use Python 3 then you can pass the below command in the python plugin section -of your `netdata.conf` file. - -```yaml -[ plugin:python.d ] - # update every = 1 - command options = -ppython3 -``` - -## Configuration - -Install the Python requirements above, enable the collector and restart Netdata. - -```bash -cd /etc/netdata/ -sudo ./edit-config python.d.conf -# Set `changefinder: no` to `changefinder: yes` -sudo systemctl restart netdata -``` - -The configuration for the changefinder collector defines how it will behave on your system and might take some -experimentation with over time to set it optimally for your node. Out of the box, the config comes with -some [sane defaults](https://www.netdata.cloud/blog/redefining-monitoring-netdata/) to get you started that try to -balance the flexibility and power of the ML models with the goal of being as cheap as possible in term of cost on the -node resources. - -_**Note**: If you are unsure about any of the below configuration options then it's best to just ignore all this and -leave the `changefinder.conf` file alone to begin with. Then you can return to it later if you would like to tune things -a bit more once the collector is running for a while and you have a feeling for its performance on your node._ - -Edit the `python.d/changefinder.conf` configuration file using `edit-config` from the your -agent's [config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), which is usually at `/etc/netdata`. - -```bash -cd /etc/netdata # Replace this path with your Netdata config directory, if different -sudo ./edit-config python.d/changefinder.conf -``` - -The default configuration should look something like this. Here you can see each parameter (with sane defaults) and some -information about each one and what it does. - -```yaml -# - -# JOBS (data collection sources) - -# Pull data from local Netdata node. -local: - - # A friendly name for this job. - name: 'local' - - # What host to pull data from. - host: '127.0.0.1:19999' - - # What charts to pull data for - A regex like 'system\..*|' or 'system\..*|apps.cpu|apps.mem' etc. - charts_regex: 'system\..*' - - # Charts to exclude, useful if you would like to exclude some specific charts. - # Note: should be a ',' separated string like 'chart.name,chart.name'. - charts_to_exclude: '' - - # Get ChangeFinder scores 'per_dim' or 'per_chart'. - mode: 'per_chart' - - # Default parameters that can be passed to the changefinder library. - cf_r: 0.5 - cf_order: 1 - cf_smooth: 15 - - # The percentile above which scores will be flagged. - cf_threshold: 99 - - # The number of recent scores to use when calculating the percentile of the changefinder score. - n_score_samples: 14400 - - # Set to true if you also want to chart the percentile scores in addition to the flags. - # Mainly useful for debugging or if you want to dive deeper on how the scores are evolving over time. - show_scores: false -``` - -## Troubleshooting - -To see any relevant log messages you can use a command like below. - -```bash -grep 'changefinder' /var/log/netdata/error.log -``` - -If you would like to log in as `netdata` user and run the collector in debug mode to see more detail. - -```bash -# become netdata user -sudo su -s /bin/bash netdata -# run collector in debug using `nolock` option if netdata is already running the collector itself. -/usr/libexec/netdata/plugins.d/python.d.plugin changefinder debug trace nolock -``` - -## Notes - -- It may take an hour or two (depending on your choice of `n_score_samples`) for the collector to 'settle' into it's - typical behaviour in terms of the trained models and scores you will see in the normal running of your node. Mainly - this is because it can take a while to build up a proper distribution of previous scores in over to convert the raw - score returned by the ChangeFinder algorithm into a percentile based on the most recent `n_score_samples` that have - already been produced. So when you first turn the collector on, it will have a lot of flags in the beginning and then - should 'settle down' once it has built up enough history. This is a typical characteristic of online machine learning - approaches which need some initial window of time before they can be useful. -- As this collector does most of the work in Python itself, you may want to try it out first on a test or development - system to get a sense of its performance characteristics on a node similar to where you would like to use it. -- On a development n1-standard-2 (2 vCPUs, 7.5 GB memory) vm running Ubuntu 18.04 LTS and not doing any work some of the - typical performance characteristics we saw from running this collector (with defaults) were: - - A runtime (`netdata.runtime_changefinder`) of ~30ms. - - Typically ~1% additional cpu usage. - - About ~85mb of ram (`apps.mem`) being continually used by the `python.d.plugin` under default configuration. - -## Useful links and further reading - -- [PyPi changefinder](https://pypi.org/project/changefinder/) reference page. -- [GitHub repo](https://github.com/shunsukeaihara/changefinder) for the changefinder library. -- Relevant academic papers: - - Yamanishi K, Takeuchi J. A unifying framework for detecting outliers and change points from nonstationary time - series data. 8th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD02. 2002: - 676. ([pdf](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3469&rep=rep1&type=pdf)) - - Kawahara Y, Sugiyama M. Sequential Change-Point Detection Based on Direct Density-Ratio Estimation. SIAM - International Conference on Data Mining. 2009: - 389–400. ([pdf](https://onlinelibrary.wiley.com/doi/epdf/10.1002/sam.10124)) - - Liu S, Yamada M, Collier N, Sugiyama M. Change-point detection in time-series data by relative density-ratio - estimation. Neural Networks. Jul.2013 43:72–83. [PubMed: 23500502] ([pdf](https://arxiv.org/pdf/1203.0453.pdf)) - - T. Iwata, K. Nakamura, Y. Tokusashi, and H. Matsutani, “Accelerating Online Change-Point Detection Algorithm using - 10 GbE FPGA NIC,” Proc. International European Conference on Parallel and Distributed Computing (Euro-Par’18) - Workshops, vol.11339, pp.506–517, Aug. - 2018 ([pdf](https://www.arc.ics.keio.ac.jp/~matutani/papers/iwata_heteropar2018.pdf)) -- The [ruptures](https://github.com/deepcharles/ruptures) python package is also a good place to learn more about - changepoint detection (mostly offline as opposed to online but deals with similar concepts). -- A nice [blog post](https://techrando.com/2019/08/14/a-brief-introduction-to-change-point-detection-using-python/) - showing some of the other options and libraries for changepoint detection in Python. -- [Bayesian changepoint detection](https://github.com/hildensia/bayesian_changepoint_detection) library - we may explore - implementing a collector for this or integrating this approach into this collector at a future date if there is - interest and it proves computationaly feasible. -- You might also find the - Netdata [anomalies collector](https://github.com/netdata/netdata/tree/master/collectors/python.d.plugin/anomalies) - interesting. -- [Anomaly Detection](https://en.wikipedia.org/wiki/Anomaly_detection) wikipedia page. -- [Anomaly Detection YouTube playlist](https://www.youtube.com/playlist?list=PL6Zhl9mK2r0KxA6rB87oi4kWzoqGd5vp0) - maintained by [andrewm4894](https://github.com/andrewm4894/) from Netdata. -- [awesome-TS-anomaly-detection](https://github.com/rob-med/awesome-TS-anomaly-detection) Github list of useful tools, - libraries and resources. -- [Mendeley public group](https://www.mendeley.com/community/interesting-anomaly-detection-papers/) with some - interesting anomaly detection papers we have been reading. -- Good [blog post](https://www.anodot.com/blog/what-is-anomaly-detection/) from Anodot on time series anomaly detection. - Anodot also have some great whitepapers in this space too that some may find useful. -- Novelty and outlier detection in - the [scikit-learn documentation](https://scikit-learn.org/stable/modules/outlier_detection.html). - -### Troubleshooting - -To troubleshoot issues with the `changefinder` module, run the `python.d.plugin` with the debug option enabled. The -output will give you the output of the data collection job or error messages on why the collector isn't working. - -First, navigate to your plugins directory, usually they are located under `/usr/libexec/netdata/plugins.d/`. If that's -not the case on your system, open `netdata.conf` and look for the setting `plugins directory`. Once you're in the -plugin's directory, switch to the `netdata` user. - -```bash -cd /usr/libexec/netdata/plugins.d/ -sudo su -s /bin/bash netdata -``` - -Now you can manually run the `changefinder` module in debug mode: - -```bash -./python.d.plugin changefinder debug trace -``` - +integrations/python.d_changefinder.md
\ No newline at end of file |