diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-05-08 16:27:04 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-05-08 16:27:04 +0000 |
commit | a836a244a3d2bdd4da1ee2641e3e957850668cea (patch) | |
tree | cb87c75b3677fab7144f868435243f864048a1e6 /docs/cloud/insights | |
parent | Adding upstream version 1.38.1. (diff) | |
download | netdata-a836a244a3d2bdd4da1ee2641e3e957850668cea.tar.xz netdata-a836a244a3d2bdd4da1ee2641e3e957850668cea.zip |
Adding upstream version 1.39.0.upstream/1.39.0
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'docs/cloud/insights')
-rw-r--r-- | docs/cloud/insights/anomaly-advisor.md (renamed from docs/cloud/insights/anomaly-advisor.mdx) | 19 | ||||
-rw-r--r-- | docs/cloud/insights/events-feed.md | 79 | ||||
-rw-r--r-- | docs/cloud/insights/metric-correlations.md | 16 |
3 files changed, 96 insertions, 18 deletions
diff --git a/docs/cloud/insights/anomaly-advisor.mdx b/docs/cloud/insights/anomaly-advisor.md index 98a28d92c..4804dbc16 100644 --- a/docs/cloud/insights/anomaly-advisor.mdx +++ b/docs/cloud/insights/anomaly-advisor.md @@ -1,12 +1,14 @@ ---- +<!-- title: "Anomaly Advisor" description: "Quickly find anomalous metrics anywhere in your infrastructure." -custom_edit_url: "https://github.com/netdata/netdata/blob/master/docs/cloud/insights/anomaly-advisor.mdx" +custom_edit_url: "https://github.com/netdata/netdata/blob/master/docs/cloud/insights/anomaly-advisor.md" sidebar_label: "Anomaly Advisor" learn_status: "Published" learn_topic_type: "Tasks" learn_rel_path: "Operations" ---- +--> + +# Anomaly Advisor import ReactPlayer from 'react-player' @@ -19,7 +21,7 @@ interest. If you are running a Netdata version higher than `v1.35.0-29-nightly` you will be able to use the Anomaly Advisor out of the box with zero configuration. If you are on an earlier Netdata version you will need to first enable ML on your nodes by following the steps below. -To enable the Anomaly Advisor you must first enable ML on your nodes via a small config change in `netdata.conf`. Once the anomaly detection models have trained on the Agent (with default settings this takes a couple of hours until enough data has been seen to train the models) you will then be able to enable the Anomaly Advisor feature in Netdata Cloud. +To enable the Anomaly Advisor you must first enable ML on your nodes via a small config change in `netdata.conf`. Once the anomaly detection models have trained on the Agent (with default settings this takes a couple of hours until enough data has been seen to train the models) you will then be able to enable the Anomaly Advisor feature in Netdata Cloud. ### Enable ML on Netdata Agent @@ -30,9 +32,7 @@ To enable ML on your Netdata Agent, you need to edit the `[ml]` section in your enabled = yes ``` -At a minimum you just need to set `enabled = yes` to enable ML with default params. More details about configuration can be found in the [Netdata Agent ML docs](https://learn.netdata.cloud/docs/agent/ml#configuration). - -**Note**: Follow [this guide](https://github.com/netdata/netdata/blob/master/docs/guides/step-by-step/step-04.md) if you are unfamiliar with making configuration changes in Netdata. +At a minimum you just need to set `enabled = yes` to enable ML with default params. More details about configuration can be found in the [Netdata Agent ML docs](https://github.com/netdata/netdata/blob/master/ml/README.md#configuration). When you have finished your configuration, restart Netdata with a command like `sudo systemctl restart netdata` for the config changes to take effect. You can find more info on restarting Netdata [here](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md). @@ -44,7 +44,7 @@ Once this line flattens out all configured metrics should have models trained an ## Using Anomaly Advisor -To use the Anomaly Advisor, go to the "anomalies" tab. Once you highlight a particular timeframe of interest, a selection of the most anomalous dimensions will appear below. +To use the Anomaly Advisor, go to the "anomalies" tab. Once you highlight a particular timeframe of interest, a selection of the most anomalous dimensions will appear below. The aim here is to surface the most anomalous metrics in the space or room for the highlighted window to try and cut down on the amount of manual searching required to get to the root cause of your issues. @@ -68,7 +68,7 @@ You can expand any sparkline chart to see the underlying raw data to see how it ![image](https://user-images.githubusercontent.com/2178292/164430105-f747d1e0-f3cb-4495-a5f7-b7bbb71039ae.png) -On the upper right hand side of the page you can select which nodes to filter on if you wish to do so. The ML training status of each node is also displayed. +On the upper right hand side of the page you can select which nodes to filter on if you wish to do so. The ML training status of each node is also displayed. On the lower right hand side of the page an index of anomaly rates is displayed for the highlighted timeline of interest. The index is sorted from most anomalous metric (highest anomaly rate) to least (lowest anomaly rate). Clicking on an entry in the index will scroll the rest of the page to the corresponding anomaly rate sparkline for that metric. @@ -80,6 +80,7 @@ On the lower right hand side of the page an index of anomaly rates is displayed You can read more detail on how anomaly detection in the Netdata Agent works in our [Agent docs](https://github.com/netdata/netdata/blob/master/ml/README.md). 🚧 **Note**: This functionality is still **under active development** and considered experimental. We dogfood it internally and among early adopters within the Netdata community to build the feature. If you would like to get involved and help us with feedback, you can reach us through any of the following channels: + - Email us at analytics-ml-team@netdata.cloud - Comment on the [beta launch post](https://community.netdata.cloud/t/anomaly-advisor-beta-launch/2717) in the Netdata community - Join us in the [🤖-ml-powered-monitoring](https://discord.gg/4eRSEUpJnc) channel of the Netdata discord. diff --git a/docs/cloud/insights/events-feed.md b/docs/cloud/insights/events-feed.md new file mode 100644 index 000000000..0e297ba81 --- /dev/null +++ b/docs/cloud/insights/events-feed.md @@ -0,0 +1,79 @@ +<!-- +title: "Events feed" +sidebar_label: "Events feed" +custom_edit_url: "https://github.com/netdata/netdata/blob/master/docs/cloud/insights/events-feed.md" +sidebar_position: "2800" +learn_status: "Published" +learn_topic_type: "Concepts" +learn_rel_path: "Concepts" +learn_docs_purpose: "Present the Netdata Events feed." +--> + +# Events feed + +Netdata Cloud provides the Events feed which is a powerful feature that tracks events that happen on your infrastructure, or in your Space. The feed lets you investigate events that occurred in the past, which is invaluable for troubleshooting. Common use cases are ones like when a node goes offline, and you want to understand what events happened before that. A detailed event history can also assist in attributing sudden pattern changes in a time series to specific changes in your environment. + +## What are the available events? + +At a high-level view, these are the domains from which the Events feed will provide visibility into. + +> ⚠️ Based on your space's plan, different allowances are defined to query past data. + +| **Domains of events** | **Community** | **Pro** | **Business** | +| :-- | :-- | :-- | :-- | +| **Auditing events** - COMING SOON<br/>Events related to actions done on your Space, e.g. invite user, change user role or change plan.| 4 hours | 7 days | 90 days | +| **[Topology events](#topology-events)**<br/>Node state transition events, e.g. live or offline.| 4 hours | 7 days | 14 days | +| **[Alert events](#alert-events)**<br/>Alert state transition events, can be seen as an alert history log.| 4 hours | 7 days | 90 days | + +### Topology events + +| **Event name** | **Description** | **Example** | +| :-- | :-- | :-- | +| Node Became Live | The node is collecting and streaming metrics to Cloud.| Node `netdata-k8s-state-xyz` was **live** | +| Node Became Stale | The node is offline and not streaming metrics to Cloud. It can show historical data from a parent node. | Node `ip-xyz.ec2.internal` was **stale** | +| Node Became Offline | The node is offline, not streaming metrics to Cloud and not available in any parent node.| Node `ip-xyz.ec2.internal` was **offline** | +| Node Created | The node is created but it is still `Unseen` on Cloud, didn't establish a successful connection yet.| Node `ip-xyz.ec2.internal` was **created** | +| Node Removed |The node was removed from the Space, for example by using the `Delete` action on the node. This is a soft delete in that the node gets marked as deleted, but retains the association with this space. If it becomes live again, it will be restored (see `Node Restored` below) and reappear in this space as before. | Node `ip-xyz.ec2.internal` was **deleted (soft)** | +| Node Restored | The node was restored. See `Node Removed` above. | Node `ip-xyz.ec2.internal` was **restored** | +| Node Deleted | The node was deleted from the Space. This is a hard delete and no information on the node is retained. | Node `ip-xyz.ec2.internal` was **deleted (hard)** | +| Agent Connected | The agent connected to the Cloud MQTT server (Agent-Cloud Link established).<br/>These events can only be seen on _All nodes_ War Room. | Agent with claim ID `7d87bqs9-cv42-4823-8sd4-3614548850c7` has connected to Cloud. | +| Agent Disconnected | The agent disconnected from the Cloud MQTT server (Agent-Cloud Link severed).<br/>These events can only be seen on _All nodes_ War Room. | Agent with claim ID `7d87bqs9-cv42-4823-8sd4-3614548850c7` has disconnected from Cloud: **Connection Timeout**. | +| Space Statistics | Daily snapshot of space node statistics.<br/>These events can only be seen on _All nodes_ War Room. | Space statistics. Nodes: **22 live**, **21 stale**, **18 removed**, **61 total**. | + + +### Alert events + +| **Event name** | **Description** | **Example** | +| :-- | :-- | :-- | +| Node Alert State Changed | These are node alert state transition events and can be seen as an alert history log. You will be able to see transitions to or from any of these states: Cleared, Warning, Critical, Removed, Error or Unknown | Transition to Cleared:<br/>`httpcheck_web_service_bad_status` for `httpcheck_netdata_cloud.request_status` on `netdata-parent-xyz` recovered with value **8.33%**<br/><br/>Transition from Cleared to Warning or Critical:<br/>`httpcheck_web_service_bad_status` for `httpcheck_netdata_cloud.request_status` on `netdata-parent-xyz` was raised to **WARNING** with value **10%**<br/><br/>Transition from Warning to Critical:<br/>`httpcheck_web_service_bad_status` for `httpcheck_netdata_cloud.request_status` on `netdata-parent-xyz` escalated to **CRITICAL** with value **25%**<br/><br/>Transition from Critical to Warning:<br/>`httpcheck_web_service_bad_status` for `httpcheck_netdata_cloud.request_status` on `netdata-parent-xyz` was demoted to **WARNING** with value **10%**<br/><br/>Transition to Removed:<br/>Alert `httpcheck_web_service_bad_status` for `httpcheck_netdata_cloud.request_status` on `netdata-parent-xyz` is no longer available, state can't be assessed.<br/><br/>Transition to Error:<br/>For this alert `httpcheck_web_service_bad_status` related to `httpcheck_netdata_cloud.request_status` on `netdata-parent-xyz` we couldn't calculate the current value ⓘ| + +## Who can access the events? + +All users will be able to see events from the Topology and Alerts domain but Auditing events, once these are added, only be accessible to administrators. For more details checkout [Netdata Role-Based Access model](https://github.com/netdata/netdata/blob/master/docs/cloud/manage/role-based-access.md). + +## How to use the events feed + +1. Click on the **Events** tab (located near the top of your screen) +1. You will be presented with a table listing the events that occurred from the timeframe defined on the date time picker +1. You can use the filtering capabilities available on right-hand bar to slice through the results provided. See more details on event types and filters + +Note: When you try to query a longer period than what your space allows you will see an error message highlighting that you are querying data outside of your plan. + +### Event types and filters + +| Event type | Tags | Nodes | Alert Status | Alert Names | Chart Names | +| :-- | :-- | :-- | :-- | :-- | :-- | +| Node Became Live | node, lifecycle | Node name | - | - | - | +| Node Became Stale | node, lifecycle | Node name | - | - | - | +| Node Became Offline | node, lifecycle | Node name | - | - | - | +| Node Created | node, lifecycle | Node name | - | - | - | +| Node Removed | node, lifecycle | Node name | - | - | - | +| Node Restored | node, lifecycle | Node name | - | - | - | +| Node Deleted | node, lifecycle | Node name | - | - | - | +| Agent Claimed | agent | - | - | - | - | +| Agent Connected | agent | - | - | - | - | +| Agent Disconnected | agent | - | - | - | - | +| Agent Authenticated | agent | - | - | - | - | +| Agent Authentication Failed | agent | - | - | - | - | +| Space Statistics | space, node, statistics | Node name | - | - | - | +| Node Alert State Changed | alert, node | Node name | Cleared, Warning, Critical, Removed, Error or Unknown | Alert name | Chart name | diff --git a/docs/cloud/insights/metric-correlations.md b/docs/cloud/insights/metric-correlations.md index ce8835d34..c8ead9be3 100644 --- a/docs/cloud/insights/metric-correlations.md +++ b/docs/cloud/insights/metric-correlations.md @@ -1,4 +1,4 @@ ---- +<!-- title: "Metric Correlations" description: "Quickly find metrics and charts closely related to a particular timeframe of interest anywhere in your infrastructure to discover the root cause faster." custom_edit_url: "https://github.com/netdata/netdata/blob/master/docs/cloud/insights/metric-correlations.md" @@ -6,7 +6,9 @@ sidebar_label: "Metric Correlations" learn_status: "Published" learn_topic_type: "Tasks" learn_rel_path: "Operations" ---- +--> + +# Metric Correlations The Metric Correlations (MC) feature lets you quickly find metrics and charts related to a particular window of interest that you want to explore further. By displaying the standard Netdata dashboard, filtered to show only charts that are relevant to the window of interest, you can get to the root cause sooner. @@ -51,9 +53,9 @@ Behind the scenes, Netdata will aggregate the raw data as needed such that arbit ### Data -Netdata is different from typical observability agents since, in addition to just collecting raw metric values, it will by default also assign an "[Anomaly Bit](/docs/agent/ml#anomaly-bit)" related to each collected metric each second. This bit will be 0 for "normal" and 1 for "anomalous". This means that each metric also natively has an "[Anomaly Rate](/docs/agent/ml#anomaly-rate)" associated with it and, as such, MC can be run against the raw metric values or their corresponding anomaly rates. +Netdata is different from typical observability agents since, in addition to just collecting raw metric values, it will by default also assign an "[Anomaly Bit](https://github.com/netdata/netdata/tree/master/ml#anomaly-bit---100--anomalous-0--normal)" related to each collected metric each second. This bit will be 0 for "normal" and 1 for "anomalous". This means that each metric also natively has an "[Anomaly Rate](https://github.com/netdata/netdata/tree/master/ml#anomaly-rate---averageanomaly-bit)" associated with it and, as such, MC can be run against the raw metric values or their corresponding anomaly rates. -**Note**: Read more [here](https://github.com/netdata/netdata/blob/master/docs/guides/monitor/anomaly-detection.md) to learn more about the native anomaly detection features within netdata. +**Note**: Read more [here](https://github.com/netdata/netdata/blob/master/ml/README.md) to learn more about the native anomaly detection features within netdata. - `Metrics` - Run MC on the raw metric values. - `Anomaly Rate` - Run MC on the corresponding anomaly rate for each metric. @@ -72,7 +74,7 @@ Should you still want to, disabling nodes for Metric Correlation on the agent is ## Usage tips! -- When running Metric Correlations from the [Overview tab](https://learn.netdata.cloud/docs/cloud/visualize/overview#overview) across multiple nodes, you might find better results if you iterate on the initial results by grouping by node to then filter to nodes of interest and run the Metric Correlations again. So a typical workflow in this case would be to: +- When running Metric Correlations from the [Overview tab](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/overview.md#overview-and-single-node-view) across multiple nodes, you might find better results if you iterate on the initial results by grouping by node to then filter to nodes of interest and run the Metric Correlations again. So a typical workflow in this case would be to: - If unsure which nodes you are interested in then run MC on all nodes. - Within the initial results returned group the most interesting chart by node to see if the changes are across all nodes or a subset of nodes. - If you see a subset of nodes clearly jump out when you group by node, then filter for just those nodes of interest and run the MC again. This will result in less aggregation needing to be done by Netdata and so should help give clearer results as you interact with the slider. @@ -81,7 +83,3 @@ Should you still want to, disabling nodes for Metric Correlation on the agent is - `Volume` might favour picking up more sparse metrics that were relatively flat and then came to life with some spikes (or vice versa). This is because for such metrics that just don't have that many different values in them, it is impossible to construct a cumulative distribution that can then be compared. So `Volume` might be useful in spotting examples of metrics turning on or off. ![example where volume captured network traffic turning on](https://user-images.githubusercontent.com/2178292/182336924-d02fd3d3-7f09-41da-9cfc-809d01396d9d.png) - `KS2` since it relies on the full distribution might be better at highlighting more complex changes that `Volume` is unable to capture. For example a change in the variation of a metric might be picked up easily by `KS2` but missed (or just much lower scored) by `Volume` since the averages might remain not all that different between baseline and highlight even if their variance has changed a lot. ![example where KS2 captured a change in entropy distribution that volume alone might not have picked up](https://user-images.githubusercontent.com/2178292/182338289-59b61e6b-089d-431c-bc8e-bd19ba6ad5a5.png) - Use `Volume` and `Anomaly Rate` together to ask what metrics have turned most anomalous from baseline to highlighted window. You can expand the embedded anomaly rate chart once you have results to see this more clearly. ![example where Volume and Anomaly Rate together help show what dimensions where most anomalous](https://user-images.githubusercontent.com/2178292/182338666-6d19fa92-89d3-4d61-804c-8f10982114f5.png) - -## What's next? - -You can read more about all the ML powered capabilities of Netdata [here](https://github.com/netdata/netdata/blob/master/docs/guides/monitor/anomaly-detection.md). If you aren't yet familiar with the power of Netdata Cloud's visualization features, check out the [Nodes view](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/nodes.md) and learn how to [build new dashboards](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/dashboards.md). |