diff options
Diffstat (limited to 'collectors/python.d.plugin/anomalies')
-rw-r--r-- | collectors/python.d.plugin/anomalies/Makefile.inc | 13 | ||||
-rw-r--r-- | collectors/python.d.plugin/anomalies/README.md | 248 | ||||
-rw-r--r-- | collectors/python.d.plugin/anomalies/anomalies.chart.py | 425 | ||||
-rw-r--r-- | collectors/python.d.plugin/anomalies/anomalies.conf | 184 | ||||
-rw-r--r-- | collectors/python.d.plugin/anomalies/metadata.yaml | 87 |
5 files changed, 957 insertions, 0 deletions
diff --git a/collectors/python.d.plugin/anomalies/Makefile.inc b/collectors/python.d.plugin/anomalies/Makefile.inc new file mode 100644 index 00000000..94937b36 --- /dev/null +++ b/collectors/python.d.plugin/anomalies/Makefile.inc @@ -0,0 +1,13 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_python_DATA += anomalies/anomalies.chart.py +dist_pythonconfig_DATA += anomalies/anomalies.conf + +# do not install these files, but include them in the distribution +dist_noinst_DATA += anomalies/README.md anomalies/Makefile.inc + diff --git a/collectors/python.d.plugin/anomalies/README.md b/collectors/python.d.plugin/anomalies/README.md new file mode 100644 index 00000000..80f50537 --- /dev/null +++ b/collectors/python.d.plugin/anomalies/README.md @@ -0,0 +1,248 @@ +<!-- +title: "Anomaly detection with Netdata" +description: "Use ML-driven anomaly detection to narrow your focus to only affected metrics and services/processes on your node to shorten root cause analysis." +custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/anomalies/README.md" +sidebar_url: "Anomalies" +sidebar_label: "anomalies" +learn_status: "Published" +learn_rel_path: "Integrations/Monitor/Anything" +--> + +# Anomaly detection with Netdata + +**Note**: Check out the [Netdata Anomaly Advisor](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/anomaly-advisor.md) for a more native anomaly detection experience within Netdata. + +This collector uses the Python [PyOD](https://pyod.readthedocs.io/en/latest/index.html) library to perform unsupervised [anomaly detection](https://en.wikipedia.org/wiki/Anomaly_detection) on your Netdata charts and/or dimensions. + +Instead of this collector just _collecting_ data, it also does some computation on the data it collects to return an anomaly probability and anomaly flag for each chart or custom model you define. This computation consists of a **train** function that runs every `train_n_secs` to train the ML models to learn what 'normal' typically looks like on your node. At each iteration there is also a **predict** function that uses the latest trained models and most recent metrics to produce an anomaly probability and anomaly flag for each chart or custom model you define. + +> As this is a somewhat unique collector and involves often subjective concepts like anomalies and anomaly probabilities, we would love to hear any feedback on it from the community. Please let us know on the [community forum](https://community.netdata.cloud/t/anomalies-collector-feedback-megathread/767) or drop us a note at [analytics-ml-team@netdata.cloud](mailto:analytics-ml-team@netdata.cloud) for any and all feedback, both positive and negative. This sort of feedback is priceless to help us make complex features more useful. + +## Charts + +Two charts are produced: + +- **Anomaly Probability** (`anomalies.probability`): This chart shows the probability that the latest observed data is anomalous based on the trained model for that chart (using the [`predict_proba()`](https://pyod.readthedocs.io/en/latest/api_cc.html#pyod.models.base.BaseDetector.predict_proba) method of the trained PyOD model). +- **Anomaly** (`anomalies.anomaly`): This chart shows `1` or `0` predictions of if the latest observed data is considered anomalous or not based on the trained model (using the [`predict()`](https://pyod.readthedocs.io/en/latest/api_cc.html#pyod.models.base.BaseDetector.predict) method of the trained PyOD model). + +Below is an example of the charts produced by this collector and how they might look when things are 'normal' on the node. The anomaly probabilities tend to bounce randomly around a typically low probability range, one or two might randomly jump or drift outside of this range every now and then and show up as anomalies on the anomaly chart. + +![netdata-anomalies-collector-normal](https://user-images.githubusercontent.com/2178292/100663699-99755000-334e-11eb-922f-0c41a0176484.jpg) + +If we then go onto the system and run a command like `stress-ng --all 2` to create some [stress](https://wiki.ubuntu.com/Kernel/Reference/stress-ng), we see some charts begin to have anomaly probabilities that jump outside the typical range. When the anomaly probabilities change enough, we will start seeing anomalies being flagged on the `anomalies.anomaly` chart. The idea is that these charts are the most anomalous right now so could be a good place to start your troubleshooting. + +![netdata-anomalies-collector-abnormal](https://user-images.githubusercontent.com/2178292/100663710-9bd7aa00-334e-11eb-9d14-76fda73bc309.jpg) + +Then, as the issue passes, the anomaly probabilities should settle back down into their 'normal' range again. + +![netdata-anomalies-collector-normal-again](https://user-images.githubusercontent.com/2178292/100666681-481a9000-3351-11eb-9979-64728ee2dfb6.jpg) + +## Requirements + +- This collector will only work with Python 3 and requires the packages below be installed. +- Typically you will not need to do this, but, if needed, to ensure Python 3 is used you can add the below line to the `[plugin:python.d]` section of `netdata.conf` + +```conf +[plugin:python.d] + # update every = 1 + command options = -ppython3 +``` + +Install the required python libraries. + +```bash +# become netdata user +sudo su -s /bin/bash netdata +# install required packages for the netdata user +pip3 install --user netdata-pandas==0.0.38 numba==0.50.1 scikit-learn==0.23.2 pyod==0.8.3 +``` + +## Configuration + +Install the Python requirements above, enable the collector and restart Netdata. + +```bash +cd /etc/netdata/ +sudo ./edit-config python.d.conf +# Set `anomalies: no` to `anomalies: yes` +sudo systemctl restart netdata +``` + +The configuration for the anomalies collector defines how it will behave on your system and might take some experimentation with over time to set it optimally for your node. Out of the box, the config comes with some [sane defaults](https://www.netdata.cloud/blog/redefining-monitoring-netdata/) to get you started that try to balance the flexibility and power of the ML models with the goal of being as cheap as possible in term of cost on the node resources. + +_**Note**: If you are unsure about any of the below configuration options then it's best to just ignore all this and leave the `anomalies.conf` file alone to begin with. Then you can return to it later if you would like to tune things a bit more once the collector is running for a while and you have a feeling for its performance on your node._ + +Edit the `python.d/anomalies.conf` configuration file using `edit-config` from the your agent's [config +directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), which is usually at `/etc/netdata`. + +```bash +cd /etc/netdata # Replace this path with your Netdata config directory, if different +sudo ./edit-config python.d/anomalies.conf +``` + +The default configuration should look something like this. Here you can see each parameter (with sane defaults) and some information about each one and what it does. + +```conf +# - +# JOBS (data collection sources) + +# Pull data from local Netdata node. +anomalies: + name: 'Anomalies' + + # Host to pull data from. + host: '127.0.0.1:19999' + + # Username and Password for Netdata if using basic auth. + # username: '???' + # password: '???' + + # Use http or https to pull data + protocol: 'http' + + # SSL verify parameter for requests.get() calls + tls_verify: true + + # What charts to pull data for - A regex like 'system\..*|' or 'system\..*|apps.cpu|apps.mem' etc. + charts_regex: 'system\..*' + + # Charts to exclude, useful if you would like to exclude some specific charts. + # Note: should be a ',' separated string like 'chart.name,chart.name'. + charts_to_exclude: 'system.uptime,system.entropy' + + # What model to use - can be one of 'pca', 'hbos', 'iforest', 'cblof', 'loda', 'copod' or 'feature_bagging'. + # More details here: https://pyod.readthedocs.io/en/latest/pyod.models.html. + model: 'pca' + + # Max number of observations to train on, to help cap compute cost of training model if you set a very large train_n_secs. + train_max_n: 100000 + + # How often to re-train the model (assuming update_every=1 then train_every_n=1800 represents (re)training every 30 minutes). + # Note: If you want to turn off re-training set train_every_n=0 and after initial training the models will not be retrained. + train_every_n: 1800 + + # The length of the window of data to train on (14400 = last 4 hours). + train_n_secs: 14400 + + # How many prediction steps after a train event to just use previous prediction value for. + # Used to reduce possibility of the training step itself appearing as an anomaly on the charts. + train_no_prediction_n: 10 + + # If you would like to train the model for the first time on a specific window then you can define it using the below two variables. + # Start of training data for initial model. + # initial_train_data_after: 1604578857 + + # End of training data for initial model. + # initial_train_data_before: 1604593257 + + # If you would like to ignore recent data in training then you can offset it by offset_n_secs. + offset_n_secs: 0 + + # How many lagged values of each dimension to include in the 'feature vector' each model is trained on. + lags_n: 5 + + # How much smoothing to apply to each dimension in the 'feature vector' each model is trained on. + smooth_n: 3 + + # How many differences to take in preprocessing your data. + # More info on differencing here: https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average#Differencing + # diffs_n=0 would mean training models on the raw values of each dimension. + # diffs_n=1 means everything is done in terms of differences. + diffs_n: 1 + + # What is the typical proportion of anomalies in your data on average? + # This parameter can control the sensitivity of your models to anomalies. + # Some discussion here: https://github.com/yzhao062/pyod/issues/144 + contamination: 0.001 + + # Set to true to include an "average_prob" dimension on anomalies probability chart which is + # just the average of all anomaly probabilities at each time step + include_average_prob: true + + # Define any custom models you would like to create anomaly probabilities for, some examples below to show how. + # For example below example creates two custom models, one to run anomaly detection user and system cpu for our demo servers + # and one on the cpu and mem apps metrics for the python.d.plugin. + # custom_models: + # - name: 'demos_cpu' + # dimensions: 'london.my-netdata.io::system.cpu|user,london.my-netdata.io::system.cpu|system,newyork.my-netdata.io::system.cpu|user,newyork.my-netdata.io::system.cpu|system' + # - name: 'apps_python_d_plugin' + # dimensions: 'apps.cpu|python.d.plugin,apps.mem|python.d.plugin' + + # Set to true to normalize, using min-max standardization, features used for the custom models. + # Useful if your custom models contain dimensions on very different scales an model you use does + # not internally do its own normalization. Usually best to leave as false. + # custom_models_normalize: false +``` + +## Custom models + +In the `anomalies.conf` file you can also define some "custom models" which you can use to group one or more metrics into a single model much like is done by default for the charts you specify. This is useful if you have a handful of metrics that exist in different charts but perhaps are related to the same underlying thing you would like to perform anomaly detection on, for example a specific app or user. + +To define a custom model you would include configuration like below in `anomalies.conf`. By default there should already be some commented out examples in there. + +`name` is a name you give your custom model, this is what will appear alongside any other specified charts in the `anomalies.probability` and `anomalies.anomaly` charts. `dimensions` is a string of metrics you want to include in your custom model. By default the [netdata-pandas](https://github.com/netdata/netdata-pandas) library used to pull the data from Netdata uses a "chart.a|dim.1" type of naming convention in the pandas columns it returns, hence the `dimensions` string should look like "chart.name|dimension.name,chart.name|dimension.name". The examples below hopefully make this clear. + +```yaml +custom_models: + # a model for anomaly detection on the netdata user in terms of cpu, mem, threads, processes and sockets. + - name: 'user_netdata' + dimensions: 'users.cpu|netdata,users.mem|netdata,users.threads|netdata,users.processes|netdata,users.sockets|netdata' + # a model for anomaly detection on the netdata python.d.plugin app in terms of cpu, mem, threads, processes and sockets. + - name: 'apps_python_d_plugin' + dimensions: 'apps.cpu|python.d.plugin,apps.mem|python.d.plugin,apps.threads|python.d.plugin,apps.processes|python.d.plugin,apps.sockets|python.d.plugin' + +custom_models_normalize: false +``` + +## Troubleshooting + +To see any relevant log messages you can use a command like below. + +```bash +`grep 'anomalies' /var/log/netdata/error.log` +``` + +If you would like to log in as `netdata` user and run the collector in debug mode to see more detail. + +```bash +# become netdata user +sudo su -s /bin/bash netdata +# run collector in debug using `nolock` option if netdata is already running the collector itself. +/usr/libexec/netdata/plugins.d/python.d.plugin anomalies debug trace nolock +``` + +## Deepdive tutorial + +If you would like to go deeper on what exactly the anomalies collector is doing under the hood then check out this [deepdive tutorial](https://github.com/netdata/community/blob/main/netdata-agent-api/netdata-pandas/anomalies_collector_deepdive.ipynb) in our community repo where you can play around with some data from our demo servers (or your own if its accessible to you) and work through the calculations step by step. + +(Note: as its a Jupyter Notebook it might render a little prettier on [nbviewer](https://nbviewer.jupyter.org/github/netdata/community/blob/main/netdata-agent-api/netdata-pandas/anomalies_collector_deepdive.ipynb)) + +## Notes + +- Python 3 is required as the [`netdata-pandas`](https://github.com/netdata/netdata-pandas) package uses Python async libraries ([asks](https://pypi.org/project/asks/) and [trio](https://pypi.org/project/trio/)) to make asynchronous calls to the [Netdata REST API](https://github.com/netdata/netdata/blob/master/web/api/README.md) to get the required data for each chart. +- Python 3 is also required for the underlying ML libraries of [numba](https://pypi.org/project/numba/), [scikit-learn](https://pypi.org/project/scikit-learn/), and [PyOD](https://pypi.org/project/pyod/). +- It may take a few hours or so (depending on your choice of `train_secs_n`) for the collector to 'settle' into it's typical behaviour in terms of the trained models and probabilities you will see in the normal running of your node. +- As this collector does most of the work in Python itself, with [PyOD](https://pyod.readthedocs.io/en/latest/) leveraging [numba](https://numba.pydata.org/) under the hood, you may want to try it out first on a test or development system to get a sense of its performance characteristics on a node similar to where you would like to use it. +- `lags_n`, `smooth_n`, and `diffs_n` together define the preprocessing done to the raw data before models are trained and before each prediction. This essentially creates a [feature vector](https://en.wikipedia.org/wiki/Feature_(machine_learning)#:~:text=In%20pattern%20recognition%20and%20machine,features%20that%20represent%20some%20object.&text=Feature%20vectors%20are%20often%20combined,score%20for%20making%20a%20prediction.) for each chart model (or each custom model). The default settings for these parameters aim to create a rolling matrix of recent smoothed [differenced](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average#Differencing) values for each chart. The aim of the model then is to score how unusual this 'matrix' of features is for each chart based on what it has learned as 'normal' from the training data. So as opposed to just looking at the single most recent value of a dimension and considering how strange it is, this approach looks at a recent smoothed window of all dimensions for a chart (or dimensions in a custom model) and asks how unusual the data as a whole looks. This should be more flexible in capturing a wider range of [anomaly types](https://andrewm4894.com/2020/10/19/different-types-of-time-series-anomalies/) and be somewhat more robust to temporary 'spikes' in the data that tend to always be happening somewhere in your metrics but often are not the most important type of anomaly (this is all covered in a lot more detail in the [deepdive tutorial](https://nbviewer.jupyter.org/github/netdata/community/blob/main/netdata-agent-api/netdata-pandas/anomalies_collector_deepdive.ipynb)). +- You can see how long model training is taking by looking in the logs for the collector `grep 'anomalies' /var/log/netdata/error.log | grep 'training'` and you should see lines like `2020-12-01 22:02:14: python.d INFO: anomalies[local] : training complete in 2.81 seconds (runs_counter=2700, model=pca, train_n_secs=14400, models=26, n_fit_success=26, n_fit_fails=0, after=1606845731, before=1606860131).`. + - This also gives counts of the number of models, if any, that failed to fit and so had to default back to the DefaultModel (which is currently [HBOS](https://pyod.readthedocs.io/en/latest/_modules/pyod/models/hbos.html)). + - `after` and `before` here refer to the start and end of the training data used to train the models. +- On a development n1-standard-2 (2 vCPUs, 7.5 GB memory) vm running Ubuntu 18.04 LTS and not doing any work some of the typical performance characteristics we saw from running this collector (with defaults) were: + - A runtime (`netdata.runtime_anomalies`) of ~80ms when doing scoring and ~3 seconds when training or retraining the models. + - Typically ~3%-3.5% additional cpu usage from scoring, jumping to ~60% for a couple of seconds during model training. + - About ~150mb of ram (`apps.mem`) being continually used by the `python.d.plugin`. +- If you activate this collector on a fresh node, it might take a little while to build up enough data to calculate a realistic and useful model. +- Some models like `iforest` can be comparatively expensive (on same n1-standard-2 system above ~2s runtime during predict, ~40s training time, ~50% cpu on both train and predict) so if you would like to use it you might be advised to set a relatively high `update_every` maybe 10, 15 or 30 in `anomalies.conf`. +- Setting a higher `train_every_n` and `update_every` is an easy way to devote less resources on the node to anomaly detection. Specifying less charts and a lower `train_n_secs` will also help reduce resources at the expense of covering less charts and maybe a more noisy model if you set `train_n_secs` to be too small for how your node tends to behave. +- If you would like to enable this on a Raspberry Pi, then check out [this guide](https://github.com/netdata/netdata/blob/master/docs/guides/monitor/raspberry-pi-anomaly-detection.md) which will guide you through first installing LLVM. + +## Useful links and further reading + +- [PyOD documentation](https://pyod.readthedocs.io/en/latest/), [PyOD Github](https://github.com/yzhao062/pyod). +- [Anomaly Detection](https://en.wikipedia.org/wiki/Anomaly_detection) wikipedia page. +- [Anomaly Detection YouTube playlist](https://www.youtube.com/playlist?list=PL6Zhl9mK2r0KxA6rB87oi4kWzoqGd5vp0) maintained by [andrewm4894](https://github.com/andrewm4894/) from Netdata. +- [awesome-TS-anomaly-detection](https://github.com/rob-med/awesome-TS-anomaly-detection) Github list of useful tools, libraries and resources. +- [Mendeley public group](https://www.mendeley.com/community/interesting-anomaly-detection-papers/) with some interesting anomaly detection papers we have been reading. +- Good [blog post](https://www.anodot.com/blog/what-is-anomaly-detection/) from Anodot on time series anomaly detection. Anodot also have some great whitepapers in this space too that some may find useful. +- Novelty and outlier detection in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/outlier_detection.html). + diff --git a/collectors/python.d.plugin/anomalies/anomalies.chart.py b/collectors/python.d.plugin/anomalies/anomalies.chart.py new file mode 100644 index 00000000..24e84cc1 --- /dev/null +++ b/collectors/python.d.plugin/anomalies/anomalies.chart.py @@ -0,0 +1,425 @@ +# -*- coding: utf-8 -*- +# Description: anomalies netdata python.d module +# Author: andrewm4894 +# SPDX-License-Identifier: GPL-3.0-or-later + +import sys +import time +from datetime import datetime +import re +import warnings + +import requests +import numpy as np +import pandas as pd +from netdata_pandas.data import get_data, get_allmetrics_async +from pyod.models.hbos import HBOS +from pyod.models.pca import PCA +from pyod.models.loda import LODA +from pyod.models.iforest import IForest +from pyod.models.cblof import CBLOF +from pyod.models.feature_bagging import FeatureBagging +from pyod.models.copod import COPOD +from sklearn.preprocessing import MinMaxScaler + +from bases.FrameworkServices.SimpleService import SimpleService + +# ignore some sklearn/numpy warnings that are ok +warnings.filterwarnings('ignore', r'All-NaN slice encountered') +warnings.filterwarnings('ignore', r'invalid value encountered in true_divide') +warnings.filterwarnings('ignore', r'divide by zero encountered in true_divide') +warnings.filterwarnings('ignore', r'invalid value encountered in subtract') + +disabled_by_default = True + +ORDER = ['probability', 'anomaly'] + +CHARTS = { + 'probability': { + 'options': ['probability', 'Anomaly Probability', 'probability', 'anomalies', 'anomalies.probability', 'line'], + 'lines': [] + }, + 'anomaly': { + 'options': ['anomaly', 'Anomaly', 'count', 'anomalies', 'anomalies.anomaly', 'stacked'], + 'lines': [] + }, +} + + +class Service(SimpleService): + def __init__(self, configuration=None, name=None): + SimpleService.__init__(self, configuration=configuration, name=name) + self.basic_init() + self.charts_init() + self.custom_models_init() + self.data_init() + self.model_params_init() + self.models_init() + self.collected_dims = {'probability': set(), 'anomaly': set()} + + def check(self): + if not (sys.version_info[0] >= 3 and sys.version_info[1] >= 6): + self.error("anomalies collector only works with Python>=3.6") + if len(self.host_charts_dict[self.host]) > 0: + _ = get_allmetrics_async(host_charts_dict=self.host_charts_dict, protocol=self.protocol, user=self.username, pwd=self.password) + return True + + def basic_init(self): + """Perform some basic initialization. + """ + self.order = ORDER + self.definitions = CHARTS + self.protocol = self.configuration.get('protocol', 'http') + self.host = self.configuration.get('host', '127.0.0.1:19999') + self.username = self.configuration.get('username', None) + self.password = self.configuration.get('password', None) + self.tls_verify = self.configuration.get('tls_verify', True) + self.fitted_at = {} + self.df_allmetrics = pd.DataFrame() + self.last_train_at = 0 + self.include_average_prob = bool(self.configuration.get('include_average_prob', True)) + self.reinitialize_at_every_step = bool(self.configuration.get('reinitialize_at_every_step', False)) + + def charts_init(self): + """Do some initialisation of charts in scope related variables. + """ + self.charts_regex = re.compile(self.configuration.get('charts_regex','None')) + self.charts_available = [c for c in list(requests.get(f'{self.protocol}://{self.host}/api/v1/charts', verify=self.tls_verify).json().get('charts', {}).keys())] + self.charts_in_scope = list(filter(self.charts_regex.match, self.charts_available)) + self.charts_to_exclude = self.configuration.get('charts_to_exclude', '').split(',') + if len(self.charts_to_exclude) > 0: + self.charts_in_scope = [c for c in self.charts_in_scope if c not in self.charts_to_exclude] + + def custom_models_init(self): + """Perform initialization steps related to custom models. + """ + self.custom_models = self.configuration.get('custom_models', None) + self.custom_models_normalize = bool(self.configuration.get('custom_models_normalize', False)) + if self.custom_models: + self.custom_models_names = [model['name'] for model in self.custom_models] + self.custom_models_dims = [i for s in [model['dimensions'].split(',') for model in self.custom_models] for i in s] + self.custom_models_dims = [dim if '::' in dim else f'{self.host}::{dim}' for dim in self.custom_models_dims] + self.custom_models_charts = list(set([dim.split('|')[0].split('::')[1] for dim in self.custom_models_dims])) + self.custom_models_hosts = list(set([dim.split('::')[0] for dim in self.custom_models_dims])) + self.custom_models_host_charts_dict = {} + for host in self.custom_models_hosts: + self.custom_models_host_charts_dict[host] = list(set([dim.split('::')[1].split('|')[0] for dim in self.custom_models_dims if dim.startswith(host)])) + self.custom_models_dims_renamed = [f"{model['name']}|{dim}" for model in self.custom_models for dim in model['dimensions'].split(',')] + self.models_in_scope = list(set([f'{self.host}::{c}' for c in self.charts_in_scope] + self.custom_models_names)) + self.charts_in_scope = list(set(self.charts_in_scope + self.custom_models_charts)) + self.host_charts_dict = {self.host: self.charts_in_scope} + for host in self.custom_models_host_charts_dict: + if host not in self.host_charts_dict: + self.host_charts_dict[host] = self.custom_models_host_charts_dict[host] + else: + for chart in self.custom_models_host_charts_dict[host]: + if chart not in self.host_charts_dict[host]: + self.host_charts_dict[host].extend(chart) + else: + self.models_in_scope = [f'{self.host}::{c}' for c in self.charts_in_scope] + self.host_charts_dict = {self.host: self.charts_in_scope} + self.model_display_names = {model: model.split('::')[1] if '::' in model else model for model in self.models_in_scope} + #self.info(f'self.host_charts_dict (len={len(self.host_charts_dict[self.host])}): {self.host_charts_dict}') + + def data_init(self): + """Initialize some empty data objects. + """ + self.data_probability_latest = {f'{m}_prob': 0 for m in self.charts_in_scope} + self.data_anomaly_latest = {f'{m}_anomaly': 0 for m in self.charts_in_scope} + self.data_latest = {**self.data_probability_latest, **self.data_anomaly_latest} + + def model_params_init(self): + """Model parameters initialisation. + """ + self.train_max_n = self.configuration.get('train_max_n', 100000) + self.train_n_secs = self.configuration.get('train_n_secs', 14400) + self.offset_n_secs = self.configuration.get('offset_n_secs', 0) + self.train_every_n = self.configuration.get('train_every_n', 1800) + self.train_no_prediction_n = self.configuration.get('train_no_prediction_n', 10) + self.initial_train_data_after = self.configuration.get('initial_train_data_after', 0) + self.initial_train_data_before = self.configuration.get('initial_train_data_before', 0) + self.contamination = self.configuration.get('contamination', 0.001) + self.lags_n = {model: self.configuration.get('lags_n', 5) for model in self.models_in_scope} + self.smooth_n = {model: self.configuration.get('smooth_n', 5) for model in self.models_in_scope} + self.diffs_n = {model: self.configuration.get('diffs_n', 5) for model in self.models_in_scope} + + def models_init(self): + """Models initialisation. + """ + self.model = self.configuration.get('model', 'pca') + if self.model == 'pca': + self.models = {model: PCA(contamination=self.contamination) for model in self.models_in_scope} + elif self.model == 'loda': + self.models = {model: LODA(contamination=self.contamination) for model in self.models_in_scope} + elif self.model == 'iforest': + self.models = {model: IForest(n_estimators=50, bootstrap=True, behaviour='new', contamination=self.contamination) for model in self.models_in_scope} + elif self.model == 'cblof': + self.models = {model: CBLOF(n_clusters=3, contamination=self.contamination) for model in self.models_in_scope} + elif self.model == 'feature_bagging': + self.models = {model: FeatureBagging(base_estimator=PCA(contamination=self.contamination), contamination=self.contamination) for model in self.models_in_scope} + elif self.model == 'copod': + self.models = {model: COPOD(contamination=self.contamination) for model in self.models_in_scope} + elif self.model == 'hbos': + self.models = {model: HBOS(contamination=self.contamination) for model in self.models_in_scope} + else: + self.models = {model: HBOS(contamination=self.contamination) for model in self.models_in_scope} + self.custom_model_scalers = {model: MinMaxScaler() for model in self.models_in_scope} + + def model_init(self, model): + """Model initialisation of a single model. + """ + if self.model == 'pca': + self.models[model] = PCA(contamination=self.contamination) + elif self.model == 'loda': + self.models[model] = LODA(contamination=self.contamination) + elif self.model == 'iforest': + self.models[model] = IForest(n_estimators=50, bootstrap=True, behaviour='new', contamination=self.contamination) + elif self.model == 'cblof': + self.models[model] = CBLOF(n_clusters=3, contamination=self.contamination) + elif self.model == 'feature_bagging': + self.models[model] = FeatureBagging(base_estimator=PCA(contamination=self.contamination), contamination=self.contamination) + elif self.model == 'copod': + self.models[model] = COPOD(contamination=self.contamination) + elif self.model == 'hbos': + self.models[model] = HBOS(contamination=self.contamination) + else: + self.models[model] = HBOS(contamination=self.contamination) + self.custom_model_scalers[model] = MinMaxScaler() + + def reinitialize(self): + """Reinitialize charts, models and data to a beginning state. + """ + self.charts_init() + self.custom_models_init() + self.data_init() + self.model_params_init() + self.models_init() + + def save_data_latest(self, data, data_probability, data_anomaly): + """Save the most recent data objects to be used if needed in the future. + """ + self.data_latest = data + self.data_probability_latest = data_probability + self.data_anomaly_latest = data_anomaly + + def validate_charts(self, chart, data, algorithm='absolute', multiplier=1, divisor=1): + """If dimension not in chart then add it. + """ + for dim in data: + if dim not in self.collected_dims[chart]: + self.collected_dims[chart].add(dim) + self.charts[chart].add_dimension([dim, dim, algorithm, multiplier, divisor]) + + for dim in list(self.collected_dims[chart]): + if dim not in data: + self.collected_dims[chart].remove(dim) + self.charts[chart].del_dimension(dim, hide=False) + + def add_custom_models_dims(self, df): + """Given a df, select columns used by custom models, add custom model name as prefix, and append to df. + + :param df <pd.DataFrame>: dataframe to append new renamed columns to. + :return: <pd.DataFrame> dataframe with additional columns added relating to the specified custom models. + """ + df_custom = df[self.custom_models_dims].copy() + df_custom.columns = self.custom_models_dims_renamed + df = df.join(df_custom) + + return df + + def make_features(self, arr, train=False, model=None): + """Take in numpy array and preprocess accordingly by taking diffs, smoothing and adding lags. + + :param arr <np.ndarray>: numpy array we want to make features from. + :param train <bool>: True if making features for training, in which case need to fit_transform scaler and maybe sample train_max_n. + :param model <str>: model to make features for. + :return: <np.ndarray> transformed numpy array. + """ + + def lag(arr, n): + res = np.empty_like(arr) + res[:n] = np.nan + res[n:] = arr[:-n] + + return res + + arr = np.nan_to_num(arr) + + diffs_n = self.diffs_n[model] + smooth_n = self.smooth_n[model] + lags_n = self.lags_n[model] + + if self.custom_models_normalize and model in self.custom_models_names: + if train: + arr = self.custom_model_scalers[model].fit_transform(arr) + else: + arr = self.custom_model_scalers[model].transform(arr) + + if diffs_n > 0: + arr = np.diff(arr, diffs_n, axis=0) + arr = arr[~np.isnan(arr).any(axis=1)] + + if smooth_n > 1: + arr = np.cumsum(arr, axis=0, dtype=float) + arr[smooth_n:] = arr[smooth_n:] - arr[:-smooth_n] + arr = arr[smooth_n - 1:] / smooth_n + arr = arr[~np.isnan(arr).any(axis=1)] + + if lags_n > 0: + arr_orig = np.copy(arr) + for lag_n in range(1, lags_n + 1): + arr = np.concatenate((arr, lag(arr_orig, lag_n)), axis=1) + arr = arr[~np.isnan(arr).any(axis=1)] + + if train: + if len(arr) > self.train_max_n: + arr = arr[np.random.randint(arr.shape[0], size=self.train_max_n), :] + + arr = np.nan_to_num(arr) + + return arr + + def train(self, models_to_train=None, train_data_after=0, train_data_before=0): + """Pull required training data and train a model for each specified model. + + :param models_to_train <list>: list of models to train on. + :param train_data_after <int>: integer timestamp for start of train data. + :param train_data_before <int>: integer timestamp for end of train data. + """ + now = datetime.now().timestamp() + if train_data_after > 0 and train_data_before > 0: + before = train_data_before + after = train_data_after + else: + before = int(now) - self.offset_n_secs + after = before - self.train_n_secs + + # get training data + df_train = get_data( + host_charts_dict=self.host_charts_dict, host_prefix=True, host_sep='::', after=after, before=before, + sort_cols=True, numeric_only=True, protocol=self.protocol, float_size='float32', user=self.username, pwd=self.password, + verify=self.tls_verify + ).ffill() + if self.custom_models: + df_train = self.add_custom_models_dims(df_train) + + # train model + self.try_fit(df_train, models_to_train=models_to_train) + self.info(f'training complete in {round(time.time() - now, 2)} seconds (runs_counter={self.runs_counter}, model={self.model}, train_n_secs={self.train_n_secs}, models={len(self.fitted_at)}, n_fit_success={self.n_fit_success}, n_fit_fails={self.n_fit_fail}, after={after}, before={before}).') + self.last_train_at = self.runs_counter + + def try_fit(self, df_train, models_to_train=None): + """Try fit each model and try to fallback to a default model if fit fails for any reason. + + :param df_train <pd.DataFrame>: data to train on. + :param models_to_train <list>: list of models to train. + """ + if models_to_train is None: + models_to_train = list(self.models.keys()) + self.n_fit_fail, self.n_fit_success = 0, 0 + for model in models_to_train: + if model not in self.models: + self.model_init(model) + X_train = self.make_features( + df_train[df_train.columns[df_train.columns.str.startswith(f'{model}|')]].values, + train=True, model=model) + try: + self.models[model].fit(X_train) + self.n_fit_success += 1 + except Exception as e: + self.n_fit_fail += 1 + self.info(e) + self.info(f'training failed for {model} at run_counter {self.runs_counter}, defaulting to hbos model.') + self.models[model] = HBOS(contamination=self.contamination) + self.models[model].fit(X_train) + self.fitted_at[model] = self.runs_counter + + def predict(self): + """Get latest data, make it into a feature vector, and get predictions for each available model. + + :return: (<dict>,<dict>) tuple of dictionaries, one for probability scores and the other for anomaly predictions. + """ + # get recent data to predict on + df_allmetrics = get_allmetrics_async( + host_charts_dict=self.host_charts_dict, host_prefix=True, host_sep='::', wide=True, sort_cols=True, + protocol=self.protocol, numeric_only=True, float_size='float32', user=self.username, pwd=self.password + ) + if self.custom_models: + df_allmetrics = self.add_custom_models_dims(df_allmetrics) + self.df_allmetrics = self.df_allmetrics.append(df_allmetrics).ffill().tail((max(self.lags_n.values()) + max(self.smooth_n.values()) + max(self.diffs_n.values())) * 2) + + # get predictions + data_probability, data_anomaly = self.try_predict() + + return data_probability, data_anomaly + + def try_predict(self): + """Try make prediction and fall back to last known prediction if fails. + + :return: (<dict>,<dict>) tuple of dictionaries, one for probability scores and the other for anomaly predictions. + """ + data_probability, data_anomaly = {}, {} + for model in self.fitted_at.keys(): + model_display_name = self.model_display_names[model] + try: + X_model = np.nan_to_num( + self.make_features( + self.df_allmetrics[self.df_allmetrics.columns[self.df_allmetrics.columns.str.startswith(f'{model}|')]].values, + model=model + )[-1,:].reshape(1, -1) + ) + data_probability[model_display_name + '_prob'] = np.nan_to_num(self.models[model].predict_proba(X_model)[-1][1]) * 10000 + data_anomaly[model_display_name + '_anomaly'] = self.models[model].predict(X_model)[-1] + except Exception as _: + #self.info(e) + if model_display_name + '_prob' in self.data_latest: + #self.info(f'prediction failed for {model} at run_counter {self.runs_counter}, using last prediction instead.') + data_probability[model_display_name + '_prob'] = self.data_latest[model_display_name + '_prob'] + data_anomaly[model_display_name + '_anomaly'] = self.data_latest[model_display_name + '_anomaly'] + else: + #self.info(f'prediction failed for {model} at run_counter {self.runs_counter}, skipping as no previous prediction.') + continue + + return data_probability, data_anomaly + + def get_data(self): + + # initialize to what's available right now + if self.reinitialize_at_every_step or len(self.host_charts_dict[self.host]) == 0: + self.charts_init() + self.custom_models_init() + self.model_params_init() + + # if not all models have been trained then train those we need to + if len(self.fitted_at) < len(self.models_in_scope): + self.train( + models_to_train=[m for m in self.models_in_scope if m not in self.fitted_at], + train_data_after=self.initial_train_data_after, + train_data_before=self.initial_train_data_before + ) + # retrain all models as per schedule from config + elif self.train_every_n > 0 and self.runs_counter % self.train_every_n == 0: + self.reinitialize() + self.train() + + # roll forward previous predictions around a training step to avoid the possibility of having the training itself trigger an anomaly + if (self.runs_counter - self.last_train_at) <= self.train_no_prediction_n: + data_probability = self.data_probability_latest + data_anomaly = self.data_anomaly_latest + else: + data_probability, data_anomaly = self.predict() + if self.include_average_prob: + average_prob = np.mean(list(data_probability.values())) + data_probability['average_prob'] = 0 if np.isnan(average_prob) else average_prob + + data = {**data_probability, **data_anomaly} + + self.validate_charts('probability', data_probability, divisor=100) + self.validate_charts('anomaly', data_anomaly) + + self.save_data_latest(data, data_probability, data_anomaly) + + #self.info(f'len(data)={len(data)}') + #self.info(f'data') + + return data diff --git a/collectors/python.d.plugin/anomalies/anomalies.conf b/collectors/python.d.plugin/anomalies/anomalies.conf new file mode 100644 index 00000000..ef867709 --- /dev/null +++ b/collectors/python.d.plugin/anomalies/anomalies.conf @@ -0,0 +1,184 @@ +# netdata python.d.plugin configuration for anomalies +# +# This file is in YaML format. Generally the format is: +# +# name: value +# +# There are 2 sections: +# - global variables +# - one or more JOBS +# +# JOBS allow you to collect values from multiple sources. +# Each source will have its own set of charts. +# +# JOB parameters have to be indented (using spaces only, example below). + +# ---------------------------------------------------------------------- +# Global Variables +# These variables set the defaults for all JOBs, however each JOB +# may define its own, overriding the defaults. + +# update_every sets the default data collection frequency. +# If unset, the python.d.plugin default is used. +# update_every: 2 + +# priority controls the order of charts at the netdata dashboard. +# Lower numbers move the charts towards the top of the page. +# If unset, the default for python.d.plugin is used. +# priority: 60000 + +# ---------------------------------------------------------------------- +# JOBS (data collection sources) + +# Pull data from local Netdata node. +anomalies: + name: 'Anomalies' + + # Host to pull data from. + host: '127.0.0.1:19999' + + # Username and Password for Netdata if using basic auth. + # username: '???' + # password: '???' + + # Use http or https to pull data + protocol: 'http' + + # SSL verify parameter for requests.get() calls + tls_verify: true + + # What charts to pull data for - A regex like 'system\..*|' or 'system\..*|apps.cpu|apps.mem' etc. + charts_regex: 'system\..*' + + # Charts to exclude, useful if you would like to exclude some specific charts. + # Note: should be a ',' separated string like 'chart.name,chart.name'. + charts_to_exclude: 'system.uptime,system.entropy' + + # What model to use - can be one of 'pca', 'hbos', 'iforest', 'cblof', 'loda', 'copod' or 'feature_bagging'. + # More details here: https://pyod.readthedocs.io/en/latest/pyod.models.html. + model: 'pca' + + # Max number of observations to train on, to help cap compute cost of training model if you set a very large train_n_secs. + train_max_n: 100000 + + # How often to re-train the model (assuming update_every=1 then train_every_n=1800 represents (re)training every 30 minutes). + # Note: If you want to turn off re-training set train_every_n=0 and after initial training the models will not be retrained. + train_every_n: 1800 + + # The length of the window of data to train on (14400 = last 4 hours). + train_n_secs: 14400 + + # How many prediction steps after a train event to just use previous prediction value for. + # Used to reduce possibility of the training step itself appearing as an anomaly on the charts. + train_no_prediction_n: 10 + + # If you would like to train the model for the first time on a specific window then you can define it using the below two variables. + # Start of training data for initial model. + # initial_train_data_after: 1604578857 + + # End of training data for initial model. + # initial_train_data_before: 1604593257 + + # If you would like to ignore recent data in training then you can offset it by offset_n_secs. + offset_n_secs: 0 + + # How many lagged values of each dimension to include in the 'feature vector' each model is trained on. + lags_n: 5 + + # How much smoothing to apply to each dimension in the 'feature vector' each model is trained on. + smooth_n: 3 + + # How many differences to take in preprocessing your data. + # More info on differencing here: https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average#Differencing + # diffs_n=0 would mean training models on the raw values of each dimension. + # diffs_n=1 means everything is done in terms of differences. + diffs_n: 1 + + # What is the typical proportion of anomalies in your data on average? + # This parameter can control the sensitivity of your models to anomalies. + # Some discussion here: https://github.com/yzhao062/pyod/issues/144 + contamination: 0.001 + + # Set to true to include an "average_prob" dimension on anomalies probability chart which is + # just the average of all anomaly probabilities at each time step + include_average_prob: true + + # Define any custom models you would like to create anomaly probabilities for, some examples below to show how. + # For example below example creates two custom models, one to run anomaly detection user and system cpu for our demo servers + # and one on the cpu and mem apps metrics for the python.d.plugin. + # custom_models: + # - name: 'demos_cpu' + # dimensions: 'london.my-netdata.io::system.cpu|user,london.my-netdata.io::system.cpu|system,newyork.my-netdata.io::system.cpu|user,newyork.my-netdata.io::system.cpu|system' + # - name: 'apps_python_d_plugin' + # dimensions: 'apps.cpu|python.d.plugin,apps.mem|python.d.plugin' + + # Set to true to normalize, using min-max standardization, features used for the custom models. + # Useful if your custom models contain dimensions on very different scales an model you use does + # not internally do its own normalization. Usually best to leave as false. + # custom_models_normalize: false + +# Standalone Custom models example as an additional collector job. +# custom: +# name: 'custom' +# host: '127.0.0.1:19999' +# protocol: 'http' +# charts_regex: 'None' +# charts_to_exclude: 'None' +# model: 'pca' +# train_max_n: 100000 +# train_every_n: 1800 +# train_n_secs: 14400 +# offset_n_secs: 0 +# lags_n: 5 +# smooth_n: 3 +# diffs_n: 1 +# contamination: 0.001 +# custom_models: +# - name: 'user_netdata' +# dimensions: 'users.cpu|netdata,users.mem|netdata,users.threads|netdata,users.processes|netdata,users.sockets|netdata' +# - name: 'apps_python_d_plugin' +# dimensions: 'apps.cpu|python.d.plugin,apps.mem|python.d.plugin,apps.threads|python.d.plugin,apps.processes|python.d.plugin,apps.sockets|python.d.plugin' + +# Pull data from some demo nodes for cross node custom models. +# demos: +# name: 'demos' +# host: '127.0.0.1:19999' +# protocol: 'http' +# charts_regex: 'None' +# charts_to_exclude: 'None' +# model: 'pca' +# train_max_n: 100000 +# train_every_n: 1800 +# train_n_secs: 14400 +# offset_n_secs: 0 +# lags_n: 5 +# smooth_n: 3 +# diffs_n: 1 +# contamination: 0.001 +# custom_models: +# - name: 'system.cpu' +# dimensions: 'london.my-netdata.io::system.cpu|user,london.my-netdata.io::system.cpu|system,newyork.my-netdata.io::system.cpu|user,newyork.my-netdata.io::system.cpu|system' +# - name: 'system.ip' +# dimensions: 'london.my-netdata.io::system.ip|received,london.my-netdata.io::system.ip|sent,newyork.my-netdata.io::system.ip|received,newyork.my-netdata.io::system.ip|sent' +# - name: 'system.net' +# dimensions: 'london.my-netdata.io::system.net|received,london.my-netdata.io::system.net|sent,newyork.my-netdata.io::system.net|received,newyork.my-netdata.io::system.net|sent' +# - name: 'system.io' +# dimensions: 'london.my-netdata.io::system.io|in,london.my-netdata.io::system.io|out,newyork.my-netdata.io::system.io|in,newyork.my-netdata.io::system.io|out' + +# Example additional job if you want to also pull data from a child streaming to your +# local parent or even a remote node so long as the Netdata REST API is accessible. +# mychildnode1: +# name: 'mychildnode1' +# host: '127.0.0.1:19999/host/mychildnode1' +# protocol: 'http' +# charts_regex: 'system\..*' +# charts_to_exclude: 'None' +# model: 'pca' +# train_max_n: 100000 +# train_every_n: 1800 +# train_n_secs: 14400 +# offset_n_secs: 0 +# lags_n: 5 +# smooth_n: 3 +# diffs_n: 1 +# contamination: 0.001 diff --git a/collectors/python.d.plugin/anomalies/metadata.yaml b/collectors/python.d.plugin/anomalies/metadata.yaml new file mode 100644 index 00000000..d138cf5d --- /dev/null +++ b/collectors/python.d.plugin/anomalies/metadata.yaml @@ -0,0 +1,87 @@ +# NOTE: this file is commented out as users are reccomended to use the +# native anomaly detection capabilities on the agent instead. +# meta: +# plugin_name: python.d.plugin +# module_name: anomalies +# monitored_instance: +# name: python.d anomalies +# link: "" +# categories: [] +# icon_filename: "" +# related_resources: +# integrations: +# list: [] +# info_provided_to_referring_integrations: +# description: "" +# keywords: [] +# most_popular: false +# overview: +# data_collection: +# metrics_description: "" +# method_description: "" +# supported_platforms: +# include: [] +# exclude: [] +# multi_instance: true +# additional_permissions: +# description: "" +# default_behavior: +# auto_detection: +# description: "" +# limits: +# description: "" +# performance_impact: +# description: "" +# setup: +# prerequisites: +# list: [] +# configuration: +# file: +# name: "" +# description: "" +# options: +# description: "" +# folding: +# title: "" +# enabled: true +# list: [] +# examples: +# folding: +# enabled: true +# title: "" +# list: [] +# troubleshooting: +# problems: +# list: [] +# alerts: +# - name: anomalies_anomaly_probabilities +# link: https://github.com/netdata/netdata/blob/master/health/health.d/anomalies.conf +# metric: anomalies.probability +# info: average anomaly probability over the last 2 minutes +# - name: anomalies_anomaly_flags +# link: https://github.com/netdata/netdata/blob/master/health/health.d/anomalies.conf +# metric: anomalies.anomaly +# info: number of anomalies in the last 2 minutes +# metrics: +# folding: +# title: Metrics +# enabled: false +# description: "" +# availability: [] +# scopes: +# - name: global +# description: "" +# labels: [] +# metrics: +# - name: anomalies.probability +# description: Anomaly Probability +# unit: "probability" +# chart_type: line +# dimensions: +# - name: a dimension per probability +# - name: anomalies.anomaly +# description: Anomaly +# unit: "count" +# chart_type: stacked +# dimensions: +# - name: a dimension per anomaly |