diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2021-03-31 12:59:21 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2021-03-31 12:59:21 +0000 |
commit | bb8713bbc1c4594366fc735c04910edbf4c61aab (patch) | |
tree | d7da56c0b89aa371dd8ad986995dd145fdf6670a /docs/guides/monitor | |
parent | Releasing debian version 1.29.3-4. (diff) | |
download | netdata-bb8713bbc1c4594366fc735c04910edbf4c61aab.tar.xz netdata-bb8713bbc1c4594366fc735c04910edbf4c61aab.zip |
Merging upstream version 1.30.0.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'docs/guides/monitor')
-rw-r--r-- | docs/guides/monitor/anomaly-detection.md | 26 | ||||
-rw-r--r-- | docs/guides/monitor/kubernetes-k8s-netdata.md | 374 | ||||
-rw-r--r-- | docs/guides/monitor/lamp-stack.md | 249 | ||||
-rw-r--r-- | docs/guides/monitor/pi-hole-raspberry-pi.md | 8 | ||||
-rw-r--r-- | docs/guides/monitor/process.md | 10 | ||||
-rw-r--r-- | docs/guides/monitor/raspberry-pi-anomaly-detection.md | 127 | ||||
-rw-r--r-- | docs/guides/monitor/statsd.md | 297 | ||||
-rw-r--r-- | docs/guides/monitor/visualize-monitor-anomalies.md | 3 |
8 files changed, 871 insertions, 223 deletions
diff --git a/docs/guides/monitor/anomaly-detection.md b/docs/guides/monitor/anomaly-detection.md index bb9dbc829..2fa4896c6 100644 --- a/docs/guides/monitor/anomaly-detection.md +++ b/docs/guides/monitor/anomaly-detection.md @@ -79,9 +79,10 @@ yourself if it doesn't already exist. Either way, the final result should look l anomalies: yes ``` -[Restart the Agent](/docs/configure/start-stop-restart.md) with `sudo systemctl restart netdata` to start up the -anomalies collector. By default, the model training process runs every 30 minutes, and uses the previous 4 hours of -metrics to establish a baseline for health and performance across the default included charts. +[Restart the Agent](/docs/configure/start-stop-restart.md) with `sudo systemctl restart netdata`, or the [appropriate +method](/docs/configure/start-stop-restart.md) for your system, to start up the anomalies collector. By default, the +model training process runs every 30 minutes, and uses the previous 4 hours of metrics to establish a baseline for +health and performance across the default included charts. > 💡 The anomaly collector may need 30-60 seconds to finish its initial training and have enough data to start > generating anomaly scores. You may need to refresh your browser tab for the **Anomalies** section to appear in menus @@ -106,7 +107,7 @@ involve tweaking the behavior of the ML training itself. doesn't have historical metrics going back that far, consider [changing the metrics retention policy](/docs/store/change-metrics-storage.md) or reducing this window. - `custom_models`: A way to define custom models that you want anomaly probabilities for, including multi-node or - streaming setups. More on custom models in part 3 of this guide series. + streaming setups. > ⚠️ Setting `charts_regex` with many charts or `train_n_secs` to a very large number will have an impact on the > resources and time required to train a model for every chart. The actual performance implications depend on the @@ -172,20 +173,19 @@ example, it's time to apply that knowledge to other mission-critical parts of yo what to monitor next, check out our list of [collectors](/collectors/COLLECTORS.md) to see what kind of metrics Netdata can collect from your systems, containers, and applications. -For a more user-friendly anomaly detection experience, try out the [Metric -Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations) feature in Netdata Cloud. Metric -Correlations runs only at your requests, removing unrelated charts from the dashboard to help you focus on root cause -analysis. +Keep on moving to [part 2](/docs/guides/monitor/visualize-monitor-anomalies.md), which covers the charts and alarms +Netdata creates for unsupervised anomaly detection. -Stay tuned for the next two parts of this guide, which provide more real-world context for the anomalies collector. -First, maximize the immediate value you get from anomaly detection by tracking preconfigured alarms, visualizing -anomalies in charts, and building a new dashboard tailored to your applications. Then, learn about creating custom ML -models, which help you holistically monitor an application or service by monitoring anomalies across a _cluster of -charts_. +For a different troubleshooting experience, try out the [Metric +Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations) feature in Netdata Cloud. Metric +Correlations helps you perform faster root cause analysis by narrowing a dashboard to only the charts most likely to be +related to an anomaly. ### Related reference documentation - [Netdata Agent · Anomalies collector](/collectors/python.d.plugin/anomalies/README.md) +- [Netdata Agent · Nginx collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/nginx) +- [Netdata Agent · web log collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog) - [Netdata Cloud · Metric Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations) [![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor%2Fanomaly-detectionl&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/monitor/kubernetes-k8s-netdata.md b/docs/guides/monitor/kubernetes-k8s-netdata.md index 40af0e94e..c5cb2c1bc 100644 --- a/docs/guides/monitor/kubernetes-k8s-netdata.md +++ b/docs/guides/monitor/kubernetes-k8s-netdata.md @@ -1,11 +1,25 @@ <!-- -title: "Monitor a Kubernetes (k8s) cluster with Netdata" -description: "Use Netdata's helmchart, service discovery plugin, and Kubelet/kube-proxy collectors for real-time visibility into your Kubernetes cluster." +title: "Kubernetes monitoring with Netdata: Overview and visualizations" +description: "Learn how to navigate Netdata's Kubernetes monitoring features for visualizing the health and performance of a Kubernetes cluster with per-second granulrity." image: /img/seo/guides/monitor/kubernetes-k8s-netdata.png +author: "Joel Hans" +author_title: "Editorial Director, Technical & Educational Resources" +author_img: "/img/authors/joel-hans.jpg" custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/monitor/kubernetes-k8s-netdata.md --> -# Monitor a Kubernetes cluster with Netdata +# Kubernetes monitoring with Netdata: Overview and visualizations + +At Netdata, we've built Kubernetes monitoring tools that add visibility without complexity while also helping you +actively troubleshoot anomalies or outages. This guide walks you through each of the visualizations and offers best +practices on how to use them to start Kubernetes monitoring in a matter of minutes, not hours or days. + +Netdata's Kubernetes monitoring solution uses a handful of [complementary tools and +collectors](#related-reference-documentation) for peeling back the many complex layers of a Kubernetes cluster, +_entirely for free_. These methods work together to give you every metric you need to troubleshoot performance or +availability issues across your Kubernetes infrastructure. + +## Challenge While Kubernetes (k8s) might simplify the way you deploy, scale, and load-balance your applications, not all clusters come with "batteries included" when it comes to monitoring. Doubly so for a monitoring stack that helps you actively @@ -18,261 +32,223 @@ customization, or integration with your preferred alerting methods. Without this visibility, it's like you built an entire house and _then_ smashed your way through the finished walls to add windows. -At Netdata, we're working to build Kubernetes monitoring tools that add visibility without complexity while also helping -you actively troubleshoot anomalies or outages. Better yet, this toolkit includes a few complementary collectors that -let you monitor the many layers of a Kubernetes cluster entirely for free. - -We already have a few complementary tools and collectors for monitoring the many layers of a Kubernetes cluster, -_entirely for free_. These methods work together to help you troubleshoot performance or availability issues across -your k8s infrastructure. - -- A [Helm chart](https://github.com/netdata/helmchart), which bootstraps a Netdata Agent pod on every node in your - cluster, plus an additional parent pod for storing metrics and managing alarm notifications. -- A [service discovery plugin](https://github.com/netdata/agent-service-discovery), which discovers and creates - configuration files for [compatible - applications](https://github.com/netdata/helmchart#service-discovery-and-supported-services) and any endpoints - covered by our [generic Prometheus - collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/prometheus). With these - configuration files, Netdata collects metrics from any compatible applications as they run _inside_ of a pod. - Service discovery happens without manual intervention as pods are created, destroyed, or moved between nodes. -- A [Kubelet collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubelet), which runs - on each node in a k8s cluster to monitor the number of pods/containers, the volume of operations on each container, - and more. -- A [kube-proxy collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubeproxy), which - also runs on each node and monitors latency and the volume of HTTP requests to the proxy. -- A [cgroups collector](/collectors/cgroups.plugin/README.md), which collects CPU, memory, and bandwidth metrics for - each container running on your k8s cluster. - -By following this guide, you'll learn how to discover, explore, and take away insights from each of these layers in your -Kubernetes cluster. Let's get started. - -## Prerequisites - -To follow this guide, you need: - -- A working cluster running Kubernetes v1.9 or newer. -- The [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command line tool, within [one minor version +## Solution + +In this tutorial, you'll learn how to navigate Netdata's Kubernetes monitoring features, using +[robot-shop](https://github.com/instana/robot-shop) as an example deployment. Deploying robot-shop is purely optional. +You can also follow along with your own Kubernetes deployment if you choose. While the metrics might be different, the +navigation and best practices are the same for every cluster. + +## What you need to get started + +To follow this tutorial, you need: + +- A free Netdata Cloud account. [Sign up](https://app.netdata.cloud/sign-up?cloudRoute=/spaces) if you don't have one + already. +- A working cluster running Kubernetes v1.9 or newer, with a Netdata deployment and claimed parent/child nodes. See + our [Kubernetes deployment process](/packaging/installer/methods/kubernetes.md) for details on deployment and + claiming. +- The [`kubectl`](https://kubernetes.io/docs/reference/kubectl/overview/) command line tool, within [one minor version difference](https://kubernetes.io/docs/tasks/tools/install-kubectl/#before-you-begin) of your cluster, on an administrative system. - The [Helm package manager](https://helm.sh/) v3.0.0 or newer on the same administrative system. -**You need to install the Netdata Helm chart on your cluster** before you proceed. See our [Kubernetes installation -process](/packaging/installer/methods/kubernetes.md) for details. +### Install the `robot-shop` demo (optional) -This guide uses a 3-node cluster, running on Digital Ocean, as an example. This cluster runs CockroachDB, Redis, and -Apache, which we'll use as examples of how to monitor a Kubernetes cluster with Netdata. +Begin by downloading the robot-shop code and using `helm` to create a new deployment. ```bash -kubectl get nodes -NAME STATUS ROLES AGE VERSION -pool-0z7557lfb-3fnbf Ready <none> 51m v1.17.5 -pool-0z7557lfb-3fnbx Ready <none> 51m v1.17.5 -pool-0z7557lfb-3fnby Ready <none> 51m v1.17.5 - -kubectl get pods -NAME READY STATUS RESTARTS AGE -cockroachdb-0 1/1 Running 0 44h -cockroachdb-1 1/1 Running 0 44h -cockroachdb-2 1/1 Running 1 44h -cockroachdb-init-q7mp6 0/1 Completed 0 44h -httpd-6f6cb96d77-4zlc9 1/1 Running 0 2m47s -httpd-6f6cb96d77-d9gs6 1/1 Running 0 2m47s -httpd-6f6cb96d77-xtpwn 1/1 Running 0 11m -netdata-child-5p2m9 2/2 Running 0 42h -netdata-child-92qvf 2/2 Running 0 42h -netdata-child-djc6w 2/2 Running 0 42h -netdata-parent-0 1/1 Running 0 42h -redis-6bb94d4689-6nn6v 1/1 Running 0 73s -redis-6bb94d4689-c2fk2 1/1 Running 0 73s -redis-6bb94d4689-tjcz5 1/1 Running 0 88s +git clone git@github.com:instana/robot-shop.git +cd robot-shop/K8s/helm +kubectl create ns robot-shop +helm install robot-shop --namespace robot-shop . ``` -## Explore Netdata's Kubernetes charts +Running `kubectl get pods` shows both the Netdata and robot-shop deployments. -The Helm chart installs and enables everything you need for visibility into your k8s cluster, including the service -discovery plugin, Kubelet collector, kube-proxy collector, and cgroups collector. - -To get started, open your browser and navigate to your cluster's Netdata dashboard. See our [Kubernetes installation -instructions](/packaging/installer/methods/kubernetes.md) for how to access the dashboard based on your cluster's -configuration. - -You'll see metrics from the parent pod as soon as you navigate to the dashboard: - -![The Netdata dashboard when monitoring a Kubernetes -cluster](https://user-images.githubusercontent.com/1153921/85343043-c6206400-b4a0-11ea-8de6-cf2c6837c456.png) - -Remember that the parent pod is responsible for storing metrics from all the child pods and sending alarms. +```bash +kubectl get pods --all-namespaces +NAMESPACE NAME READY STATUS RESTARTS AGE +default netdata-child-29f9c 2/2 Running 0 10m +default netdata-child-8xphf 2/2 Running 0 10m +default netdata-child-jdvds 2/2 Running 0 11m +default netdata-parent-554c755b7d-qzrx4 1/1 Running 0 11m +kube-system aws-node-jnjv8 1/1 Running 0 17m +kube-system aws-node-svzdb 1/1 Running 0 17m +kube-system aws-node-ts6n2 1/1 Running 0 17m +kube-system coredns-559b5db75d-f58hp 1/1 Running 0 22h +kube-system coredns-559b5db75d-tkzj2 1/1 Running 0 22h +kube-system kube-proxy-9p9cd 1/1 Running 0 17m +kube-system kube-proxy-lt9ss 1/1 Running 0 17m +kube-system kube-proxy-n75t9 1/1 Running 0 17m +robot-shop cart-b4bbc8fff-t57js 1/1 Running 0 14m +robot-shop catalogue-8b5f66c98-mr85z 1/1 Running 0 14m +robot-shop dispatch-67d955c7d8-lnr44 1/1 Running 0 14m +robot-shop mongodb-7f65d86c-dsslc 1/1 Running 0 14m +robot-shop mysql-764c4c5fc7-kkbnf 1/1 Running 0 14m +robot-shop payment-67c87cb7d-5krxv 1/1 Running 0 14m +robot-shop rabbitmq-5bb66bb6c9-6xr5b 1/1 Running 0 14m +robot-shop ratings-94fd9c75b-42wvh 1/1 Running 0 14m +robot-shop redis-0 0/1 Pending 0 14m +robot-shop shipping-7d69cb88b-w7hpj 1/1 Running 0 14m +robot-shop user-79c445b44b-hwnm9 1/1 Running 0 14m +robot-shop web-8bb887476-lkcjx 1/1 Running 0 14m +``` -Take note of the **Replicated Nodes** menu, which shows not only the parent pod, but also the three child pods. This -example cluster has three child pods, but the number of child pods depends entirely on the number of nodes in your -cluster. +## Explore Netdata's Kubernetes monitoring charts -You'll use the links in the **Replicated Nodes** menu to navigate between the various pods in your cluster. Let's do -that now to explore the pod-level Kubernetes monitoring Netdata delivers. +The Netdata Helm chart deploys and enables everything you need for monitoring Kubernetes on every layer. Once you deploy +Netdata and claim your cluster's nodes, you're ready to check out the visualizations **with zero configuration**. -### Pods +To get started, [sign in](https://app.netdata.cloud/sign-in?cloudRoute=/spaces) to your Netdata Cloud account. Head over +to the War Room you claimed your cluster to, if not **General**. -Click on any of the nodes under **netdata-parent-0**. Netdata redirects you to a separate instance of the Netdata -dashboard, run by the Netdata child pod, which visualizes thousands of metrics from that node. +Netdata Cloud is already visualizing your Kubernetes metrics, streamed in real-time from each node, in the +[Overview](https://learn.netdata.cloud/docs/cloud/visualize/overview): -![The Netdata dashboard monitoring a pod in a Kubernetes -cluster](https://user-images.githubusercontent.com/1153921/85348461-85c8e200-b4b0-11ea-85fa-e88046e94719.png) +![Netdata's Kubernetes monitoring +dashboard](https://user-images.githubusercontent.com/1153921/109037415-eafc5500-7687-11eb-8773-9b95941e3328.png) -From this dashboard, you can see all the familiar charts showing the health and performance of an individual node, just -like you would if you installed Netdata on a single physical system. Explore CPU, memory, bandwidth, networking, and -more. +Let's walk through monitoring each layer of a Kubernetes cluster using the Overview as our framework. -You can use the menus on the right-hand side of the dashboard to navigate between different sections of charts and -metrics. +## Cluster and node metrics -For example, click on the **Applications** section to view per-application metrics, collected by -[apps.plugin](/collectors/apps.plugin/README.md). The first chart you see is **Apps CPU Time (100% = 1 core) -(apps.cpu)**, which shows the CPU utilization of various applications running on the node. You shouldn't be surprised to -find Netdata processes (`netdata`, `sd-agent`, and more) alongside Kubernetes processes (`kubelet`, `kube-proxy`, and -`containers`). +The gauges and time-series charts you see right away in the Overview show aggregated metrics from every node in your +cluster. -![Per-application monitoring on a Kubernetes -cluster](https://user-images.githubusercontent.com/1153921/85348852-ad6c7a00-b4b1-11ea-95b4-5952bd0e9d98.png) +For example, the `apps.cpu` chart (in the **Applications** menu item), visualizes the CPU utilization of various +applications/services running on each of the nodes in your cluster. The **X Nodes** dropdown shows which nodes +contribute to the chart and links to jump a single-node dashboard for further investigation. -Beneath the **Applications** section, you'll begin to see sections for **k8s kubelet**, **k8s kubeproxy**, and long -strings that start with **k8s**, which are sections for metrics collected by -[`cgroups.plugin`](/collectors/cgroups.plugin/README.md). Let's skip over those for now and head further down to see -Netdata's service discovery in action. +![Per-application monitoring in a Kubernetes +cluster](https://user-images.githubusercontent.com/1153921/109042169-19c8fa00-768d-11eb-91a7-1a7afc41fea2.png) -### Service discovery (services running inside of pods) +For example, the chart above shows a spike in the CPU utilization from `rabbitmq` every minute or so, along with a +baseline CPU utilization of 10-15% across the cluster. -Thanks to Netdata's service discovery feature, you monitor containerized applications running in k8s pods with zero -configuration or manual intervention. Service discovery is like a watchdog for created or deleted pods, recognizing the -service they run based on the image name and port and immediately attempting to apply a logical default configuration. +Read about the [Overview](https://learn.netdata.cloud/docs/cloud/visualize/overview) and some best practices on [viewing +an overview of your infrastructure](/docs/visualize/overview-infrastructure.md) for details on using composite charts to +drill down into per-node performance metrics. -Service configuration supports [popular -applications](https://github.com/netdata/helmchart#service-discovery-and-supported-services), plus any endpoints covered -by our [generic Prometheus collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/prometheus), -which are automatically added or removed from Netdata as soon as the pods are created or destroyed. +## Pod and container metrics -You can find these service discovery sections near the bottom of the menu. The names for these sections follow a -pattern: the name of the detected service, followed by a string of the module name, pod TUID, service type, port -protocol, and port number. See the graphic below to help you identify service discovery sections. +Click on the **Kubernetes xxxxxxx...** section to jump down to Netdata Cloud's unique Kubernetes visualizations for view +real-time resource utilization metrics from your Kubernetes pods and containers. -![Showing the difference between cgroups and service discovery -sections](https://user-images.githubusercontent.com/1153921/85443711-73998300-b546-11ea-9b3b-2dddfe00bdf8.png) +![Navigating to the Kubernetes monitoring +visualizations](https://user-images.githubusercontent.com/1153921/109049195-349f6c80-7695-11eb-8902-52a029dca77f.png) -For example, the first service discovery section shows metrics for a pod running an Apache web server running on port 80 -in a pod named `httpd-6f6cb96d77-xtpwn`. +### Health map -> If you don't see any service discovery sections, it's either because your services are not compatible with service -> discovery or you changed their default configuration, such as the listening port. See the [list of supported -> services](https://github.com/netdata/helmchart#service-discovery-and-supported-services) for details about whether -> your installed services are compatible with service discovery, or read the [configuration -> instructions](/packaging/installer/methods/kubernetes.md#configure-service-discovery) to change how it discovers the -> supported services. +The first visualization is the [health map](https://learn.netdata.cloud/docs/cloud/visualize/kubernetes#health-map), +which places each container into its own box, then varies the intensity of their color to visualize the resource +utilization. By default, the health map shows the **average CPU utilization as a percentage of the configured limit** +for every container in your cluster. -Click on any of these service discovery sections to see metrics from that particular service. For example, click on the -**Apache apache-default httpd-6f6cb96d77-xtpwn httpd tcp 80** section brings you to a series of charts populated by the -[Apache collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/apache) itself. +![The Kubernetes health map in Netdata +Cloud](https://user-images.githubusercontent.com/1153921/109050085-3f0e3600-7696-11eb-988f-52cb187f53ea.png) -With service discovery, you can now see valuable metrics like requests, bandwidth, workers, and more for this pod. +Let's explore the most colorful box by hovering over it. -![Apache metrics collected via service -discovery](https://user-images.githubusercontent.com/1153921/85443905-a5aae500-b546-11ea-99f0-be20ba796feb.png) +![Hovering over a +container](https://user-images.githubusercontent.com/1153921/109049544-a8417980-7695-11eb-80a7-109b4a645a27.png) -The same goes for metrics coming from the CockroachDB pod running on this same node. +The **Context** tab shows `rabbitmq-5bb66bb6c9-6xr5b` as the container's image name, which means this container is +running a [RabbitMQ](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/rabbitmq) workload. -![CockroachDB metrics collected via service -discovery](https://user-images.githubusercontent.com/1153921/85444316-0e925d00-b547-11ea-83ba-b834275cb419.png) +Click the **Metrics** tab to see real-time metrics from that container. Unsurprisingly, it shows a spike in CPU +utilization at regular intervals. -Service discovery helps you monitor the health of specific applications running on your Kubernetes cluster, which in -turn gives you a complete resource when troubleshooting your infrastructure's health and performance. +![Viewing real-time container +metrics](https://user-images.githubusercontent.com/1153921/109050482-aa580800-7696-11eb-9e3e-d3bdf0f3eff7.png) -### Kubelet +### Time-series charts -Let's head back up the menu to the **k8s kubelet** section. Kubelet is an agent that runs on every node in a cluster. It -receives a set of PodSpecs from the Kubernetes Control Plane and ensures the pods described there are both running and -healthy. Think of it as a manager for the various pods on that node. +Beneath the health map is a variety of time-series charts that help you visualize resource utilization over time, which +is useful for targeted troubleshooting. -Monitoring each node's Kubelet can be invaluable when diagnosing issues with your Kubernetes cluster. For example, you -can see when the volume of running containers/pods has dropped. +The default is to display metrics grouped by the `k8s_namespace` label, which shows resource utilization based on your +different namespaces. -![Charts showing pod and container removal during a scale -down](https://user-images.githubusercontent.com/1153921/85598613-9ab48b00-b600-11ea-827e-d9ec7779e2d4.png) +![Time-series Kubernetes monitoring in Netdata +Cloud](https://user-images.githubusercontent.com/1153921/109075210-126a1680-76b6-11eb-918d-5acdcdac152d.png) -This drop might signal a fault or crash in a particular Kubernetes service or deployment (see `kubectl get services` or -`kubectl get deployments` for more details). If the number of pods increases, it may be because of something more -benign, like another member of your team scaling up a service with `kubectl scale`. +Each composite chart has a [definition bar](https://learn.netdata.cloud/docs/cloud/visualize/overview#definition-bar) +for complete customization. For example, grouping the top chart by `k8s_container_name` reveals new information. -You can also view charts for the Kubelet API server, the volume of runtime/Docker operations by type, -configuration-related errors, and the actual vs. desired numbers of volumes, plus a lot more. +![Changing time-series charts](https://user-images.githubusercontent.com/1153921/109075212-139b4380-76b6-11eb-836f-939482ae55fc.png) -Kubelet metrics are collected and visualized thanks to the [kubelet -collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubelet), which is enabled with -zero configuration on most Kubernetes clusters with standard configurations. +## Service metrics -### kube-proxy +Netdata has a [service discovery plugin](https://github.com/netdata/agent-service-discovery), which discovers and +creates configuration files for [compatible +services](https://github.com/netdata/helmchart#service-discovery-and-supported-services) and any endpoints covered by +our [generic Prometheus collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/prometheus). +Netdata uses these files to collect metrics from any compatible application as they run _inside_ of a pod. Service +discovery happens without manual intervention as pods are created, destroyed, or moved between nodes. -Scroll down into the **k8s kubeproxy** section to see metrics about the network proxy that runs on each node in your -Kubernetes cluster. kube-proxy allows for pods to communicate with each other and accept sessions from outside your -cluster. +Service metrics show up on the Overview as well, beneath the **Kubernetes** section, and are labeled according to the +service in question. For example, the **RabbitMQ** section has numerous charts from the [`rabbitmq` +collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/rabbitmq): -With Netdata, you can monitor how often your k8s proxies are syncing proxy rules between nodes. Dramatic changes in -these figures could indicate an anomaly in your cluster that's worthy of further investigation. +![Finding service discovery +metrics](https://user-images.githubusercontent.com/1153921/109054511-2eac8a00-769b-11eb-97f1-da93acb4b5fe.png) -kube-proxy metrics are collected and visualized thanks to the [kube-proxy -collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubeproxy), which is enabled with -zero configuration on most Kubernetes clusters with standard configurations. +> The robot-shop cluster has more supported services, such as MySQL, which are not visible with zero configuration. This +> is usually because of services running on non-default ports, using non-default names, or required passwords. Read up +> on [configuring service discovery](/packaging/installer/methods/kubernetes.md#configure-service-discovery) to collect +> more service metrics. -### Containers +Service metrics are essential to infrastructure monitoring, as they're the best indicator of the end-user experience, +and key signals for troubleshooting anomalies or issues. -We can finally talk about the final piece of Kubernetes monitoring: containers. Each Kubernetes pod is a set of one or -more cooperating containers, sharing the same namespace, all of which are resourced and tracked by the cgroups feature -of the Linux kernel. Netdata automatically detects and monitors each running container by interfacing with the cgroups -feature itself. +## Kubernetes components -You can find these sections beneath **Users**, **k8s kubelet**, and **k8s kubeproxy**. Below, a number of containers -devoted to running services like CockroachDB, Apache, Redis, and more. +Netdata also automatically collects metrics from two essential Kubernetes processes. -![A number of sections devoted to -containers](https://user-images.githubusercontent.com/1153921/85480217-74e1a480-b574-11ea-9da7-dd975e0fde0c.png) +### kubelet -Let's look at the section devoted to the container that runs the Apache pod named `httpd-6f6cb96d77-xtpwn`, as described -in the previous part on [service discovery](#service-discovery-services-running-inside-of-pods). +The **k8s kubelet** section visualizes metrics from the Kubernetes agent responsible for managing every pod on a given +node. This also happens without any configuration thanks to the [kubelet +collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubelet). -![cgroups metrics for an Apache -container/pod](https://user-images.githubusercontent.com/1153921/85480516-03562600-b575-11ea-92ae-dd605bf04106.png) +Monitoring each node's kubelet can be invaluable when diagnosing issues with your Kubernetes cluster. For example, you +can see if the number of running containers/pods has dropped, which could signal a fault or crash in a particular +Kubernetes service or deployment (see `kubectl get services` or `kubectl get deployments` for more details). If the +number of pods increases, it may be because of something more benign, like another team member scaling up a +service with `kubectl scale`. -At first glance, these sections might seem redundant. You might ask, "Why do I need both a service discovery section -_and_ a container section? It's just one pod, after all!" +You can also view charts for the Kubelet API server, the volume of runtime/Docker operations by type, +configuration-related errors, and the actual vs. desired numbers of volumes, plus a lot more. -The difference is that while the service discovery section shows _Apache_ metrics, the equivalent cgroups section shows -that container's CPU, memory, and bandwidth usage. You can use the two sections in conjunction to monitor the health and -performance of your pods and the services they run. +### kube-proxy -For example, let's say you get an alarm notification from `netdata-parent-0` saying the -`ea287694-0f22-4f39-80aa-2ca066caf45a` container (also known as the `httpd-6f6cb96d77-xtpwn` pod) is using 99% of its -available RAM. You can then hop over to the **Apache apache-default httpd-6f6cb96d77-xtpwn httpd tcp 80** section to -further investigate why Apache is using an unexpected amount of RAM. +The **k8s kube-proxy** section displays metrics about the network proxy that runs on each node in your Kubernetes +cluster. kube-proxy lets pods communicate with each other and accept sessions from outside your cluster. Its metrics are +collected by the [kube-proxy +collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubeproxy). -All container metrics, whether they're managed by Kubernetes or the Docker service directly, are collected by the -[cgroups collector](/collectors/cgroups.plugin/README.md). Because this collector integrates with the cgroups Linux -kernel feature itself, monitoring containers requires zero configuration on most Kubernetes clusters. +With Netdata, you can monitor how often your k8s proxies are syncing proxy rules between nodes. Dramatic changes in +these figures could indicate an anomaly in your cluster that's worthy of further investigation. ## What's next? -After following this guide, you should have a more comprehensive understanding of how to monitor your Kubernetes cluster -with Netdata. With this setup, you can monitor the health and performance of all your nodes, pods, services, and k8s -agents. Pre-configured alarms will tell you when something goes awry, and this setup gives you every per-second metric -you need to make informed decisions about your cluster. +After reading this guide, you should now be able to monitor any Kubernetes cluster with Netdata, including nodes, pods, +containers, services, and more. -The best part of monitoring a Kubernetes cluster with Netdata is that you don't have to worry about constantly running -complex `kubectl` commands to see hundreds of highly granular metrics from your nodes. And forget about using `kubectl -exec -it pod bash` to start up a shell on a pod to find and diagnose an issue with any given pod on your cluster. +With the health map, time-series charts, and the ability to drill down into individual nodes, you can see hundreds of +per-second metrics with zero configuration and less time remembering all the `kubectl` options. Netdata moves with your +cluster, automatically picking up new nodes or services as your infrastructure scales. And it's entirely free for +clusters of all sizes. -And with service discovery, all your compatible pods will automatically appear and disappear as they scale up, move, or -scale down across your cluster. +### Related reference documentation -To monitor your Kubernetes cluster with Netdata, start by [installing the Helm -chart](/packaging/installer/methods/kubernetes.md) if you haven't already. The Netdata Agent is open source and entirely -free for every cluster and every organization, whether you have 10 or 10,000 pods. A few minutes and one `helm install` -later and you'll have started on the path of building an effective platform for troubleshooting the next performance or -availability issue on your Kubernetes cluster. +- [Netdata Helm chart](https://github.com/netdata/helmchart) +- [Netdata service discovery](https://github.com/netdata/agent-service-discovery) +- [Netdata Agent · `kubelet` + collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubelet) +- [Netdata Agent · `kube-proxy` + collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubeproxy) +- [Netdata Agent · `cgroups.plugin`](/collectors/cgroups.plugin/README.md) [![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor%2Fkubernetes-k8s-netdata.md&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/monitor/lamp-stack.md b/docs/guides/monitor/lamp-stack.md new file mode 100644 index 000000000..f11dfe5bd --- /dev/null +++ b/docs/guides/monitor/lamp-stack.md @@ -0,0 +1,249 @@ +<!-- +title: "LAMP stack monitoring (Linux, Apache, MySQL, PHP) with Netdata" +description: "Set up robust LAMP stack monitoring (Linux, Apache, MySQL, PHP) in just a few minutes using a free, open-source monitoring tool that collects metrics every second." +image: /img/seo/guides/monitor/lamp-stack.png +author: "Joel Hans" +author_title: "Editorial Director, Technical & Educational Resources" +author_img: "/img/authors/joel-hans.jpg" +custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/monitor/lamp-stack.md +--> + +# LAMP stack monitoring (Linux, Apache, MySQL, PHP) with Netdata + +The LAMP stack is the "hello world" for deploying dynamic web applications. It's fast, flexible, and reliable, which +means a developer or sysadmin won't go far in their career without interacting with the stack and its services. + +_LAMP_ is an acronym of the core services that make up the web application: **L**inux, **A**pache, **M**ySQL, and +**P**HP. + +- [Linux](https://en.wikipedia.org/wiki/Linux) is the operating system running the whole stack. +- [Apache](https://httpd.apache.org/) is a web server that responds to HTTP requests from users and returns web pages. +- [MySQL](https://www.mysql.com/) is a database that stores and returns information based on queries from the web + application. +- [PHP](https://www.php.net/) is a scripting language used to query the MySQL database and build new pages. + +LAMP stacks are the foundation for tons of end-user applications, with [Wordpress](https://wordpress.org/) being the +most popular. + +## Challenge + +You've already deployed a LAMP stack, either in testing or production. You want to monitor every service's performance +and availability to ensure the best possible experience for your end-users. You might also be particularly interested in +using a free, open-source monitoring tool. + +Depending on your monitoring experience, you may not even know what metrics you're looking for, much less how to build +dashboards using a query language. You need a robust monitoring experience that has the metrics you need without a ton +of required setup. + +## Solution + +In this tutorial, you'll set up robust LAMP stack monitoring with Netdata in just a few minutes. When you're done, +you'll have one dashboard to monitor every part of your web application, including each essential LAMP stack service. + +This dashboard updates every second with new metrics, and pairs those metrics up with preconfigured alarms to keep you +informed of any errors or odd behavior. + +## What you need to get started + +To follow this tutorial, you need: + +- A physical or virtual Linux system, which we'll call a _node_. +- A functional LAMP stack. There's plenty of tutorials for installing a LAMP stack, like [this + one](https://www.digitalocean.com/community/tutorials/how-to-install-linux-apache-mysql-php-lamp-stack-ubuntu-18-04) + from Digital Ocean. +- Optionally, a [Netdata Cloud](https://app.netdata.cloud/sign-up?cloudRoute=/spaces) account, which you can use to view + metrics from multiple nodes in one dashboard, and a whole lot more, for free. + +## Install the Netdata Agent + +If you don't have the free, open-source [Netdata Agent](/docs/get/README.md) installed on your node yet, get started +with a [single kickstart command](/packaging/installer/methods/kickstart.md): + +```bash +bash <(curl -Ss https://my-netdata.io/kickstart.sh) +``` + +The Netdata Agent is now collecting metrics from your node every second. You don't need to jump into the dashboard yet, +but if you're curious, open your favorite browser and navigate to `http://localhost:19999` or `http://NODE:19999`, +replacing `NODE` with the hostname or IP address of your system. + +## Enable hardware and Linux system monitoring + +There's nothing you need to do to enable [system monitoring](/docs/collect/system-metrics.md) and Linux monitoring with +the Netdata Agent, which autodetects metrics from CPUs, memory, disks, networking devices, and Linux processes like +systemd without any configuration. If you're using containers, Netdata automatically collects resource utilization +metrics from each using the [cgroups data collector](/collectors/cgroups.plugin/README.md). + +## Enable Apache monitoring + +Let's begin by configuring Apache to work with Netdata's [Apache data +collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/apache). + +Actually, there's nothing for you to do to enable Apache monitoring with Netdata. + +Apache comes with `mod_status` enabled by default these days, and Netdata is smart enough to look for metrics at that +endpoint without you configuring it. Netdata is already collecting [`mod_status` +metrics](https://httpd.apache.org/docs/2.4/mod/mod_status.html), which is just _part_ of your web server monitoring. + +## Enable web log monitoring + +The Netdata Agent also comes with a [web log +collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog), which reads Apache's access +log file, procesess each line, and converts them into per-second metrics. On Debian systems, it reads the file at +`/var/log/apache2/access.log`. + +At installation, the Netdata Agent adds itself to the [`adm` +group](https://wiki.debian.org/SystemGroups#Groups_without_an_associated_user), which gives the `netdata` process the +right privileges to read Apache's log files. In other words, you don't need to do anything to enable Apache web log +monitoring. + +## Enable MySQL monitoring + +Because your MySQL database is password-protected, you do need to tell MySQL to allow the `netdata` user to connect to +without a password. Netdata's [MySQL data +collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/mysql) collects metrics in _read-only_ +mode, without being able to alter or affect operations in any way. + +First, log into the MySQL shell. Then, run the following three commands, one at a time: + +```mysql +CREATE USER 'netdata'@'localhost'; +GRANT USAGE, REPLICATION CLIENT, PROCESS ON *.* TO 'netdata'@'localhost'; +FLUSH PRIVILEGES; +``` + +Run `sudo systemctl restart netdata`, or the [appropriate alternative for your +system](/docs/configure/start-stop-restart.md), to collect dozens of metrics every second for robust MySQL monitoring. + +## Enable PHP monitoring + +Unlike Apache or MySQL, PHP isn't a service that you can monitor directly, unless you instrument a PHP-based application +with [StatsD](/collectors/statsd.plugin/README.md). + +However, if you use [PHP-FPM](https://php-fpm.org/) in your LAMP stack, you can monitor that process with our [PHP-FPM +data collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/phpfpm). + +Open your PHP-FPM configuration for editing, replacing `7.4` with your version of PHP: + +```bash +sudo nano /etc/php/7.4/fpm/pool.d/www.conf +``` + +> Not sure what version of PHP you're using? Run `php -v`. + +Find the line that reads `;pm.status_path = /status` and remove the `;` so it looks like this: + +```conf +pm.status_path = /status +``` + +Next, add a new `/status` endpoint to Apache. Open the Apache configuration file you're using for your LAMP stack. + +```bash +sudo nano /etc/apache2/sites-available/your_lamp_stack.conf +``` + +Add the following to the end of the file, again replacing `7.4` with your version of PHP: + +```apache +ProxyPass "/status" "unix:/run/php/php7.4-fpm.sock|fcgi://localhost" +``` + +Save and close the file. Finally, restart the PHP-FPM, Apache, and Netdata processes. + +```bash +sudo systemctl restart php7.4-fpm.service +sudo systemctl restart apache2 +sudo systemctl restart netdata +``` + +As the Netdata Agent starts up again, it automatically connects to the new `127.0.0.1/status` page and collects +per-second PHP-FPM metrics to get you started with PHP monitoring. + +## View LAMP stack metrics + +If the Netdata Agent isn't already open in your browser, open a new tab and navigate to `http://localhost:19999` or +`http://NODE:19999`, replacing `NODE` with the hostname or IP address of your system. + +> If you [signed up](https://app.netdata.cloud/sign-up?cloudRoute=/spaces) for Netdata Cloud earlier, you can also view +> the exact same LAMP stack metrics there, plus additional features, like drag-and-drop custom dashboards. Be sure to +> [claim your node](/docs/get/README.md#claim-your-node-to-netdata-cloud) to start streaming metrics to your browser +> through Netdata Cloud. + +Netdata automatically organizes all metrics and charts onto a single page for easy navigation. Peek at gauges to see +overall system performance, then scroll down to see more. Click-and-drag with your mouse to pan _all_ charts back and +forth through different time intervals, or hold `SHIFT` and use the scrollwheel (or two-finger scroll) to zoom in and +out. Check out our doc on [interacting with charts](/docs/visualize/interact-dashboards-charts.md) for all the details. + +![The Netdata +dashboard](https://user-images.githubusercontent.com/1153921/109520555-98e17800-7a69-11eb-86ec-16f689da4527.png) + +The **System Overview** section, which you can also see in the right-hand menu, contains key hardware monitoring charts, +including CPU utilization, memory page faults, network monitoring, and much more. The **Applications** section shows you +exactly which Linux processes are using the most system resources. + +Next, let's check out LAMP-specific metrics. You should see four relevant sections: **Apache local**, **MySQL local**, +**PHP-FPM local**, and **web log apache**. Click on any of these to see metrics from each service in your LAMP stack. + +![LAMP stack monitoring in +Netdata](https://user-images.githubusercontent.com/1153921/109516332-49994880-7a65-11eb-807c-3cba045582e6.png) + +### Key LAMP stack monitoring charts + +Here's a quick reference for what charts you might want to focus on after setting up Netdata. + +| Chart name / context | Type | Why? | +|-------------------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| System Load Average (`system.load`) | Hardware monitoring | A good baseline load average is `0.7`, while `1` (on a 1-core system, `2` on a 2-core system, and so on) means resources are "perfectly" utilized. Higher load indicates a bottleneck somewhere in your system. | +| System RAM (`system.ram`) | Hardware monitoring | Look at the `free` dimension. If that drops to `0`, your system will use swap memory and slow down. | +| Uptime (`apache_local.uptime`) | Apache monitoring | This chart should always be "climbing," indicating a continuous uptime. Investigate any drops back to `0`. | +| Requests By Type (`web_log_apache.requests_by_type`) | Apache monitoring | Check for increases in the `error` or `bad` dimensions, which could indicate users arriving at broken pages or PHP returning errors. | +| Queries (`mysql_local.queries`) | MySQL monitoring | Queries is the total number of queries (queries per second, QPS). Check this chart for sudden spikes or drops, which indicate either increases in traffic/demand or bottlenecks in hardware performance. | +| Active Connections (`mysql_local.connections_active`) | MySQL monitoring | If the `active` dimension nears the `limit`, your MySQL database will bottleneck responses. | +| Performance (phpfpm_local.performance) | PHP monitoring | The `slow requests` dimension lets you know if any requests exceed the configured `request_slowlog_timeout`. If so, users might be having a less-than-ideal experience. | + +## Get alarms for LAMP stack errors + +The Netdata Agent comes with hundreds of pre-configured alarms to help you keep tabs on your system, including 19 alarms +designed for smarter LAMP stack monitoring. + +Click the 🔔 icon in the top navigation to [see active alarms](/docs/monitor/view-active-alarms.md). The **Active** tabs +shows any alarms currently triggered, while the **All** tab displays a list of _every_ pre-configured alarm. The + +![An example of LAMP stack +alarms](https://user-images.githubusercontent.com/1153921/109524120-5883f900-7a6d-11eb-830e-0e7baaa28163.png) + +[Tweak alarms](/docs/monitor/configure-alarms.md) based on your infrastructure monitoring needs, and to see these alarms +in other places, like your inbox or a Slack channel, [enable a notification +method](/docs/monitor/enable-notifications.md). + +## What's next? + +You've now set up robust monitoring for your entire LAMP stack: Linux, Apache, MySQL, and PHP (-FPM, to be exact). These +metrics will help you keep tabs on the performance and availability of your web application and all its essential +services. The per-second metrics granularity means you have the most accurate information possible for troubleshooting +any LAMP-related issues. + +Another powerful way to monitor the availability of a LAMP stack is the [`httpcheck` +collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/httpcheck), which pings a web server at +a regular interval and tells you whether if and how quickly it's responding. The `response_match` option also lets you +monitor when the web server's response isn't what you expect it to be, which might happen if PHP-FPM crashes, for +example. + +The best way to use the `httpcheck` collector is from a separate node from the one running your LAMP stack, which is why +we're not covering it here, but it _does_ work in a single-node setup. Just don't expect it to tell you if your whole +node crashed. + +If you're planning on managing more than one node, or want to take advantage of advanced features, like finding the +source of issues faster with [Metric Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations), +[sign up](https://app.netdata.cloud/sign-up?cloudRoute=/spaces) for a free Netdata Cloud account. + +### Related reference documentation + +- [Netdata Agent · Get Netdata](/docs/get/README.md) +- [Netdata Agent · Apache data collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/apache) +- [Netdata Agent · Web log collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog) +- [Netdata Agent · MySQL data collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/mysql) +- [Netdata Agent · PHP-FPM data collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/phpfpm) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor%2Flamp-stack&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>)
\ No newline at end of file diff --git a/docs/guides/monitor/pi-hole-raspberry-pi.md b/docs/guides/monitor/pi-hole-raspberry-pi.md index a180466fb..dc5e0b314 100644 --- a/docs/guides/monitor/pi-hole-raspberry-pi.md +++ b/docs/guides/monitor/pi-hole-raspberry-pi.md @@ -83,9 +83,9 @@ As far as configuring Netdata to monitor Pi-hole metrics, there's nothing you ac collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/pihole) will autodetect the new service running on your Raspberry Pi and immediately start collecting metrics every second. -Restart Netdata with `sudo service netdata restart` to start Netdata, which will then recognize that Pi-hole is running -and start a per-second collection job. When you refresh your Netdata dashboard or load it up again in a new tab, you'll -see a new entry in the menu for **Pi-hole** metrics. +Restart Netdata with `sudo systemctl restart netdata`, which will then recognize that Pi-hole is running and start a +per-second collection job. When you refresh your Netdata dashboard or load it up again in a new tab, you'll see a new +entry in the menu for **Pi-hole** metrics. ## Use Netdata to explore and monitor your Raspberry Pi and Pi-hole @@ -119,7 +119,7 @@ cd /etc/netdata sudo ./edit-config charts.d.conf ``` -Uncomment the `sensors=force` line and save the file. Restart Netdata with `sudo service netdata restart` to enable +Uncomment the `sensors=force` line and save the file. Restart Netdata with `sudo systemctl restart netdata` to enable Raspberry Pi temperature sensor monitoring. ### Storing historical metrics on your Raspberry Pi diff --git a/docs/guides/monitor/process.md b/docs/guides/monitor/process.md index 893e6b704..0f7c6861a 100644 --- a/docs/guides/monitor/process.md +++ b/docs/guides/monitor/process.md @@ -169,8 +169,9 @@ postgres: postgres* sql: mariad* postmaster* oracle_* ora_* sqlservr ``` -Restart Netdata with `service netdata restart`, or the appropriate method for your system, to start collecting -utilization metrics from your application. Time to [visualize your process metrics](#visualize-process-metrics). +Restart Netdata with `sudo systemctl restart netdata`, or the [appropriate +method](/docs/configure/start-stop-restart.md) for your system, to start collecting utilization metrics from your +application. Time to [visualize your process metrics](#visualize-process-metrics). ### Custom applications @@ -194,8 +195,9 @@ custom-app: custom-app ... ``` -Restart Netdata with `service netdata restart`, or the appropriate method for your system, to start collecting -utilization metrics from your application. +Restart Netdata with `sudo systemctl restart netdata`, or the [appropriate +method](/docs/configure/start-stop-restart.md) for your system, to start collecting utilization metrics from your +application. ## Visualize process metrics diff --git a/docs/guides/monitor/raspberry-pi-anomaly-detection.md b/docs/guides/monitor/raspberry-pi-anomaly-detection.md new file mode 100644 index 000000000..f5587a89b --- /dev/null +++ b/docs/guides/monitor/raspberry-pi-anomaly-detection.md @@ -0,0 +1,127 @@ +<!-- +title: "Unsupervised anomaly detection for Raspberry Pi monitoring" +description: "Use a low-overhead machine learning algorithm and an open-source monitoring tool to detect anomalous metrics on a Raspberry Pi." +image: /img/seo/guides/monitor/raspberry-pi-anomaly-detection.png +author: "Andy Maguire" +author_title: "Senior Machine Learning Engineer" +author_img: "/img/authors/andy-maguire.jpg" +custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/monitor/raspberry-pi-anomaly-detection.md +--> + +# Unsupervised anomaly detection for Raspberry Pi monitoring + +We love IoT and edge at Netdata, we also love machine learning. Even better if we can combine the two to ease the pain +of monitoring increasingly complex systems. + +We recently explored what might be involved in enabling our Python-based [anomalies +collector](/collectors/python.d.plugin/anomalies/README.md) on a Raspberry Pi. To our delight, it's actually quite +straightforward! + +Read on to learn all the steps and enable unsupervised anomaly detection on your on Raspberry Pi(s). + +> Spoiler: It's just a couple of extra commands that will make you feel like a pro. + +## What you need to get started + +- A Raspberry Pi running Raspbian, which we'll call a _node_. +- The [open-source Netdata Agent](https://github.com/netdata/netdata). If you don't have it installed on your node yet, + [get it now](/docs/get/README.md). + +## Install dependencies + +First make sure Netdata is using Python 3 when it runs Python-based data collectors. + +Next, open `netdata.conf` using [`edit-config`](/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) +from within the [Netdata config directory](/docs/configure/nodes.md#the-netdata-config-directory). Scroll down to the +`[plugin:python.d]` section to pass in the `-ppython3` command option. + +```conf +[plugin:python.d] + # update every = 1 + command options = -ppython3 +``` + +Next, install some of the underlying libraries used by the Python packages the collector depends upon. + +```bash +sudo apt install llvm-9 libatlas3-base libgfortran5 libatlas-base-dev +``` + +Now you're ready to install the Python packages used by the collector itself. First, become the `netdata` user. + +```bash +sudo su -s /bin/bash netdata +``` + +Then pass in the location to find `llvm` as an environment variable for `pip3`. + +```bash +LLVM_CONFIG=llvm-config-9 pip3 install --user llvmlite numpy==1.20.1 netdata-pandas==0.0.32 numba==0.50.1 scikit-learn==0.23.2 pyod==0.8.3 +``` + +## Enable the anomalies collector + +Now you're ready to enable the collector and [restart Netdata](/docs/configure/start-stop-restart.md). + +```bash +sudo ./edit-config python.d.conf +# set `anomalies: no` to `anomalies: yes` + +# restart netdata +sudo systemctl restart netdata +``` + +And that should be it! Wait a minute or two, refresh your Netdata dashboard, you should see the default anomalies +charts under the **Anomalies** section in the dashboard's menu. + +![Anomaly detection on the Raspberry +Pi](https://user-images.githubusercontent.com/1153921/110149717-9d749c00-7d9b-11eb-853c-e041a36f0a41.png) + +## Overhead on system + +Of course one of the most important considerations when trying to do anomaly detection at the edge (as opposed to in a +centralized cloud somewhere) is the resource utilization impact of running a monitoring tool. + +With the default configuration, the anomalies collector uses about 6.5% of CPU at each run. During the retraining step, +CPU utilization jumps to between 20-30% for a few seconds, but you can [configure +retraining](/collectors/python.d.plugin/anomalies/README.md#configuration) to happen less often if you wish. + +![CPU utilization of anomaly detection on the Raspberry +Pi](https://user-images.githubusercontent.com/1153921/110149718-9d749c00-7d9b-11eb-9af8-46e2032cd1d0.png) + +In terms of the runtime of the collector, it was averaging around 250ms during each prediction step, jumping to about +8-10 seconds during a retraining step. This jump equates only to a small gap in the anomaly charts for a few seconds. + +![Execution time of anomaly detection on the Raspberry +Pi](https://user-images.githubusercontent.com/1153921/110149715-9cdc0580-7d9b-11eb-826d-faf6f620621a.png) + +The last consideration then is the amount of RAM the collector needs to store both the models and some of the data +during training. By default, the anomalies collector, along with all other running Python-based collectors, uses about +100MB of system memory. + +![RAM utilization of anomaly detection on the Raspberry +Pi](https://user-images.githubusercontent.com/1153921/110149720-9e0d3280-7d9b-11eb-883d-b1d4d9b9b5e1.png) + +## What's next? + +So, all in all, with a small little bit of extra set up and a small overhead on the Pi itself, the anomalies collector +looks like a potentially useful addition to enable unsupervised anomaly detection on your Pi. + +See our two-part guide series for a more complete picture of configuring the anomalies collector, plus some best +practices on using the charts it automatically generates: + +- [_Detect anomalies in systems and applications_](/docs/guides/monitor/anomaly-detection.md) +- [_Monitor and visualize anomalies with Netdata_](/docs/guides/monitor/visualize-monitor-anomalies.md) + +If you're using your Raspberry Pi for other purposes, like blocking ads/trackers with Pi-hole, check out our companions +Pi guide: [_Monitor Pi-hole (and a Raspberry Pi) with Netdata_](/docs/guides/monitor/pi-hole-raspberry-pi.md). + +Once you've had a chance to give unsupervised anomaly detection a go, share your use cases and let us know of any +feedback on our [community forum](https://community.netdata.cloud/t/anomalies-collector-feedback-megathread/767). + +### Related reference documentation + +- [Netdata Agent · Get Netdata](/docs/get/README.md) +- [Netdata Agent · Anomalies collector](/collectors/python.d.plugin/anomalies/README.md) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor%2Fraspberry-pi-anomaly-detection&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/monitor/statsd.md b/docs/guides/monitor/statsd.md new file mode 100644 index 000000000..9b1de3047 --- /dev/null +++ b/docs/guides/monitor/statsd.md @@ -0,0 +1,297 @@ +<!-- +title: How to use any StatsD data source with Netdata +description: "Learn how to monitor any custom application instrumented with StatsD with per-second metrics and fully customizable, interactive charts." +image: /img/seo/guides/monitor/statsd.png +author: "Odysseas Lamtzidis" +author_title: "Developer Advocate" +author_img: "/img/authors/odysseas-lamtzidis.jpg" +custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/monitor/statsd.md +--> + +# StatsD Guide + +StatsD is a protocol and server implementation, first introduced at Etsy, to aggregate and summarize application metrics. With StatsD, applications are instrumented by developers using the libraries that already exist for the language, without caring about managing the data. The StatsD server is in charge of receiving the metrics, performing some simple processing on them, and then pushing them to the time-series database (TSDB) for long-term storage and visualization. + +Netdata is a fully-functional StatsD server and TSDB implementation, so you can instantly visualize metrics by simply sending them to Netdata using the built-in StatsD server. + +In this guide, we'll go through a scenario of visualizing our data in Netdata in a matter of seconds using [k6](https://k6.io), an open-source tool for automating load testing that outputs metrics to the StatsD format. + +Although we'll use k6 as the use-case, the same principles can be applied to every application that supports the StatsD protocol. Simply enable the StatsD output and point it to the node that runs Netdata, which is `localhost` in this case. + +In general, the process for creating a StatsD collector can be summarized in 2 steps: + +- Run an experiment by sending StatsD metrics to Netdata, without any prior configuration. This will create a chart per metric (called private charts) and will help you verify that everything works as expected from the application side of things. + - Make sure to reload the dashboard tab **after** you start sending data to Netdata. +- Create a configuration file for your app using [edit-config](https://learn.netdata.cloud/guides/step-by-step/step-04): `sudo ./edit-config statsd.d/myapp.conf` + - Each app will have it's own section in the right-hand menu. + +Now, let's see the above process in detail. + +## Prerequisites + +- A node with the [Netdata Agent](https://learn.netdata.cloud/docs/get#install-the-netdata-agent) installed. +- An application to instrument. For this guide, that will be [k6](https://k6.io/docs/getting-started/installation). + +## Understanding the metrics + +The real in instrumenting an application with StatsD for you is to decide what metrics you want to visualize and how you want them grouped. In other words, you need decide which metrics will be grouped in the same charts and how the charts will be grouped on Netdata's dashboard. + +Start with documentation for the particular application that you want to monitor (or the technological stack that you are using). In our case, the [k6 documentation](https://k6.io/docs/using-k6/metrics/) has a whole page dedicated to the metrics output by k6, along with descriptions. + +If you are using StatsD to monitor an existing application, you don't have much control over these metrics. For example, k6 has a type called `trend`, which is identical to timers and histograms. Thus, _k6 is clearly dictating_ which metrics can be used as histograms and simple gauges. + +On the other hand, if you are instrumenting your own code, you will need to not only decide what are the "things" that you want to measure, but also decide which StatsD metric type is the appropriate for each. + +## Use private charts to see all available metrics + +In Netdata, every metric will receive its own chart, called a `private chart`. Although in the final implementation this is something that we will disable, since it can create considerable noise (imagine having 100s of metrics), it’s very handy while building the configuration file. + +You can get a quick visual representation of the metrics and their type (e.g it’s a gauge, a timer, etc.). + +An important thing to notice is that StatsD has different types of metrics, as illustrated in the [Netdata documentation](https://learn.netdata.cloud/docs/agent/collectors/statsd.plugin#metrics-supported-by-netdata). Histograms and timers support mathematical operations to be performed on top of the baseline metric, like reporting the `average` of the value. + +Here are some examples of default private charts. You can see that the histogram private charts will visualize all the available operations. + +**Gauge private chart** + +![Gauge metric example](https://i.imgur.com/Sr5nJEV.png) + +**Histogram private chart** + +![Timer metric example](https://i.imgur.com/P4p0hvq.png) + +## Create a new StatsD configuration file + +Start by creating a new configuration file under the `statsd.d/` folder in the [Netdata config directory](/docs/configure/nodes.md#the-netdata-config-directory). Use [`edit-config`](/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) to create a new file called `k6.conf`. + +```bash= +sudo ./edit-config statsd.d/k6.conf +``` + +Copy the following configuration into your file as a starting point. + +```conf +[app] + name = k6 + metrics = k6* + private charts = yes + gaps when not collected = no + memory mode = dbengine +``` + +Next, you need is to understand how to organize metrics in Netdata’s StatsD. + +### Synthetic charts + +Netdata lets you group the metrics exposed by your instrumented application with _synthetic charts_. + +First, create a `[dictionary]` section to transform the names of the metrics into human-readable equivalents. `http_req_blocked`, `http_req_connecting`, `http_req_receiving`, and `http_reqs` are all metrics exposed by k6. + +``` +[dictionary] + http_req_blocked = Blocked HTTP Requests + http_req_connecting = Connecting HTTP Requests + http_req_receiving = Receiving HTTP Requests + http_reqs = Total HTTP requests +``` + +Continue this dictionary process with any other metrics you want to collect with Netdata. + +### Families and context + +Families and context are additional ways to group metrics. Families control the submenu at right-hand menu and it's a subcategory of the section. Given the metrics given by K6, we are organizing them in 2 major groups, or `families`: `k6 native metrics` and `http metrics`. + +Context is a second way to group metrics, when the metrics are of the same nature but different origin. In our case, if we ran several different load testing experiments side-by-side, we could define the same app, but different context (e.g `http_requests.experiment1`, `http_requests.experiment2`). + +Find more details about family and context in our [documentation](/web/README.md#families). + +### Dimension + +Now, having decided on how we are going to group the charts, we need to define how we are going to group metrics into different charts. This is particularly important, since we decide: + +- What metrics **not** to show, since they are not useful for our use-case. +- What metrics to consolidate into the same charts, so as to reduce noice and increase visual correlation. + +The dimension option has this syntax: `dimension = [pattern] METRIC NAME TYPE MULTIPLIER DIVIDER OPTIONS` + +- **pattern**: A keyword that tells the StatsD server the `METRIC` string is actually a [simple pattern].(/libnetdata/simple_pattern/README.md). We don't simple patterns in the example, but if we wanted to visualize all the `http_req` metrics, we could have a single dimension: `dimension = pattern 'k6.http_req*' last 1 1`. Find detailed examples with patterns in our [documentation](/collectors/statsd.plugin/README.md#dimension-patterns). +- **METRIC** The id of the metric as it comes from the client. You can easily find this in the private charts above, for example: `k6.http_req_connecting`. +- **NAME**: The name of the dimension. You can use the dictionary to expand this to something more human-readable. +- **TYPE**: + - For all charts: + - `events`: The number of events (data points) received by the StatsD server + - `last`: The last value that the server received + - For histograms and timers: + - `min`, `max`, `sum`, `average`, `percentile`, `median`, `stddev`: This is helpful if you want to see different representations of the same value. You can find an example at the `[iteration_duration]` above. Note that the baseline `metric` is the same, but the `name` of the dimension is different, since we use the baseline, but we perform a computation on it, creating a different final metric for visualization(dimension). +- **MULTIPLIER DIVIDER**: Handy if you want to convert Kilobytes to Megabytes or you want to give negative value. The second is handy for better visualization of send/receive. You can find an example at the **packets** submenu of the **IPv4 Networking Section**. + +> ❕ If you define a chart, run Netdata to visualize metrics, and then add or remove a dimension from that chart, this will result in a new chart with the same name, confusing Netdata. If you change the dimensions of the chart, please make sure to also change the `name` of that chart, since it serves as the `id` of that chart in Netdata's storage. (e.g http_req --> http_req_1). + +### Finalize your StatsD configuration file + +It's time to assemble all the pieces together and create the synthetic charts that will consist our application dashboard in Netdata. We can do it in a few simple steps: + +- Decide which metrics we want to use (we have viewed all of them as private charts). For example, we want to use `k6.http_requests`, `k6.vus`, etc. +- Decide how we want organize them in different synthetic charts. For example, we want `k6.http_requests`, `k6.vus` on their own, but `k6.http_req_blocked` and `k6.http_req_connecting` on the same chart. +- For each synthetic chart, we define a **unique** name and a human readable title. +- We decide at which `family` (submenu section) we want each synthetic chart to belong to. For example, here we have defined 2 families: `http requests`, `k6_metrics`. +- If we have multiple instances of the same metric, we can define different contexts, (Optional). +- We define a dimension according to the syntax we highlighted above. +- We define a type for each synthetic chart (line, area, stacked) +- We define the units for each synthetic chart. + +Following the above steps, we append to the `k6.conf` that we defined above, the following configuration: + +``` +[http_req_total] + name = http_req_total + title = Total HTTP Requests + family = http requests + context = k6.http_requests + dimension = k6.http_reqs http_reqs last 1 1 sum + type = line + units = requests/s + +[vus] + name = vus + title = Virtual Active Users + family = k6_metrics + dimension = k6.vus vus last 1 1 + dimension = k6.vus_max vus_max last 1 1 + type = line + unit = vus + +[iteration_duration] + name = iteration_duration_2 + title = Iteration duration + family = k6_metrics + dimension = k6.iteration_duration iteration_duration last 1 1 + dimension = k6.iteration_duration iteration_duration_max max 1 1 + dimension = k6.iteration_duration iteration_duration_min min 1 1 + dimension = k6.iteration_duration iteration_duration_avg avg 1 1 + type = line + unit = s + +[dropped_iterations] + name = dropped_iterations + title = Dropped Iterations + family = k6_metrics + dimension = k6.dropped_iterations dropped_iterations last 1 1 + units = iterations + type = line + +[data] + name = data + title = K6 Data + family = k6_metrics + dimension = k6.data_received data_received last 1 1 + dimension = k6.data_sent data_sent last -1 1 + units = kb/s + type = area + +[http_req_status] + name = http_req_status + title = HTTP Requests Status + family = http requests + dimension = k6.http_req_blocked http_req_blocked last 1 1 + dimension = k6.http_req_connecting http_req_connecting last 1 1 + units = ms + type = line + +[http_req_duration] + name = http_req_duration + title = HTTP requests duration + family = http requests + dimension = k6.http_req_sending http_req_sending last 1 1 + dimension = k6.http_req_waiting http_req_waiting last 1 1 + dimension = k6.http_req_receiving http_req_receiving last 1 1 + units = ms + type = stacked +``` + +> Take note that Netdata will report the rate for metrics and counters, even if k6 or another application sends an _absolute_ number. For example, k6 sends absolute HTTP requests with `http_reqs`, but Netdat visualizes that in `requests/second`. + +To enable this StatsD configuration, [restart Netdata](/docs/configure/start-stop-restart.md). + +## Final touches + +At this point, you have used StatsD to gather metrics for k6, creating a whole new section in your Netdata dashboard in the process. Uil can further customize the icon of the particular section, as well as the description for each chart. + +To edit the section, please follow the Netdata [documentation](https://learn.netdata.cloud/docs/agent/web/gui#customizing-the-local-dashboard). + +While the following configuration will be placed in a new file, as the documentation suggests, it is instructing to use `dashboard_info.js` as a template. Open the file and see how the rest of sections and collectors have been defined. + +```javascript= +netdataDashboard.menu = { + 'k6': { + title: 'K6 Load Testing', + icon: '<i class="fas fa-cogs"></i>', + info: 'k6 is an open-source load testing tool and cloud service providing the best developer experience for API performance testing.' + }, + . + . + . +``` + +We can then add a description for each chart. Simply find the following section in `dashboard_info.js` to understand how a chart definitions are used: + +```javascript= +netdataDashboard.context = { + 'system.cpu': { + info: function (os) { + void (os); + return 'Total CPU utilization (all cores). 100% here means there is no CPU idle time at all. You can get per core usage at the <a href="#menu_cpu">CPUs</a> section and per application usage at the <a href="#menu_apps">Applications Monitoring</a> section.' + + netdataDashboard.sparkline('<br/>Keep an eye on <b>iowait</b> ', 'system.cpu', 'iowait', '%', '. If it is constantly high, your disks are a bottleneck and they slow your system down.') + + netdataDashboard.sparkline('<br/>An important metric worth monitoring, is <b>softirq</b> ', 'system.cpu', 'softirq', '%', '. A constantly high percentage of softirq may indicate network driver issues.'); + }, + valueRange: "[0, 100]" + }, +``` + +Afterwards, you can open your `custom_dashboard_info.js`, as suggested in the documentation linked above, and add something like the following example: + +```javascript= +netdataDashboard.context = { + 'k6.http_req_duration': { + info: "Total time for the request. It's equal to http_req_sending + http_req_waiting + http_req_receiving (i.e. how long did the remote server take to process the request and respond, without the initial DNS lookup/connection times)" + }, + +``` +The chart is identified as ``<section_name>.<chart_name>``. + +These descriptions can greatly help the Netdata user who is monitoring your application in the midst of an incident. + +The `info` field supports `html`, embedding useful links and instructions in the description. + +## Vendoring a new collector + +After all this hussle, not only did we illustrate how to visualize any data source in Netdata using the StatsD protocol, but we have also created a new collector in the process. + +While using the same underlying collector-StatsD-every new `myapp.conf` file will in essence create a new data source and dashboard section for Netdata. While Netdata will load all the configuration files by default, it will **not** create dashboard sections or charts, unless it start receiving data for that particular data source. This means that we can now share our collector with the rest of the Netdata community. + +If you want to contribute or you need any help in developing your collector, we have a whole [Forum Category](https://community.netdata.cloud/c/agent-development/9) dedicated to contributing to the Netdata Agent. + +### Making a PR to the netdata/netdata repository + +- Make sure you follow the contributing guide and read our Code of Conduct +- Fork the netdata/netdata repository +- Place the configuration file inside `netdata/collectors/statsd.plugin` +- Add a reference in `netdata/collectors/statsd.plugin/Makefile.am`. For example, if we contribute the `k6.conf` file: +```Makefile +dist_statsdconfig_DATA = \ + example.conf \ + k6.conf \ + $(NULL) +``` + +## What's next? + +In this tutorial, you learned how to monitor an application using Netdata's StatsD implementation. + +Netdata allows you easily visualize any StatsD metric without any configuration, since it creates a private metric per chart by default. But to make your implementation more robust, you also learned how to group metrics by family and context, and create multiple dimensions. With these tools, you can quickly instrument any application with StatsD to monitor its performance and availability with per-second metrics. + +### Related reference documentation + +- [Netdata Agent · StatsD](/collectors/statsd.plugin/README.md) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor%2Fstatsdr&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/monitor/visualize-monitor-anomalies.md b/docs/guides/monitor/visualize-monitor-anomalies.md index f37dadc62..681ba8390 100644 --- a/docs/guides/monitor/visualize-monitor-anomalies.md +++ b/docs/guides/monitor/visualize-monitor-anomalies.md @@ -136,9 +136,6 @@ unsupervised anomaly detection, or would like to see something added to it. You that works well for monitoring some other popular application, like MySQL, PostgreSQL, Redis, or anything else we [support through collectors](/collectors/COLLECTORS.md). -In part 3 of this series on unsupervised anomaly detection using Netdata, we'll create a custom model to apply -unsupervised anomaly detection to an entire mission-critical application. Stay tuned! - ### Related reference documentation - [Netdata Agent · Anomalies collector](/collectors/python.d.plugin/anomalies/README.md) |