Diffstat (limited to 'collectors')
54 files changed, 2729 insertions, 664 deletions
diff --git a/collectors/README.md b/collectors/README.md index b7fc7328..83c92d9d 100644 --- a/collectors/README.md +++ b/collectors/README.md @@ -15,9 +15,9 @@ To minimize the number of processes spawn for data collection, netdata also supp data collection modules with the minimum of code. Currently netdata provides plugin orchestrators - BASH v4+ [charts.d.plugin](charts.d.plugin), - node.js [node.d.plugin](node.d.plugin) and - python v2+ (including v3) [python.d.plugin](python.d.plugin). + BASH v4+ [charts.d.plugin](charts.d.plugin/), + node.js [node.d.plugin](node.d.plugin/) and + python v2+ (including v3) [python.d.plugin](python.d.plugin/). ## Netdata Plugins diff --git a/collectors/apps.plugin/README.md b/collectors/apps.plugin/README.md index 05680efe..d1ca8114 100644 --- a/collectors/apps.plugin/README.md +++ b/collectors/apps.plugin/README.md @@ -188,21 +188,21 @@ Here is an example for the process group `sql` at `https://registry.my-netdata.i Netdata is able give you a lot more badges for your app. Examples below for process group `sql`: -- CPU usage: ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.cpu&dimensions=sql&value_color=green=0%7Corange%3C50%7Cred) -- Disk Physical Reads ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.preads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) -- Disk Physical Writes ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) -- Disk Logical Reads ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lreads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) -- Disk Logical Writes ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) -- Open Files ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.files&dimensions=sql&value_color=green%3E30%7Cred) -- Real Memory ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.mem&dimensions=sql&value_color=green%3C100%7Corange%3C200%7Cred) -- Virtual Memory ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.vmem&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) -- Swap Memory ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.swap&dimensions=sql&value_color=green=0%7Cred) -- Minor Page Faults ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.minor_faults&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) -- Processes ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.processes&dimensions=sql&value_color=green%3E0%7Cred) -- Threads ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.threads&dimensions=sql&value_color=green%3E=28%7Cred) -- Major Faults (swap activity) ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.major_faults&dimensions=sql&value_color=green=0%7Cred) -- Open Pipes ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pipes&dimensions=sql&value_color=green=0%7Cred) -- Open Sockets ![image](http://registry.my-netdata.io/api/v1/badge.svg?chart=apps.sockets&dimensions=sql&value_color=green%3E=3%7Cred) +- CPU usage: ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.cpu&dimensions=sql&value_color=green=0%7Corange%3C50%7Cred) +- Disk Physical Reads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.preads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) 
+- Disk Physical Writes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) +- Disk Logical Reads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lreads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) +- Disk Logical Writes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) +- Open Files ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.files&dimensions=sql&value_color=green%3E30%7Cred) +- Real Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.mem&dimensions=sql&value_color=green%3C100%7Corange%3C200%7Cred) +- Virtual Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.vmem&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) +- Swap Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.swap&dimensions=sql&value_color=green=0%7Cred) +- Minor Page Faults ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.minor_faults&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred) +- Processes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.processes&dimensions=sql&value_color=green%3E0%7Cred) +- Threads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.threads&dimensions=sql&value_color=green%3E=28%7Cred) +- Major Faults (swap activity) ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.major_faults&dimensions=sql&value_color=green=0%7Cred) +- Open Pipes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pipes&dimensions=sql&value_color=green=0%7Cred) +- Open Sockets ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.sockets&dimensions=sql&value_color=green%3E=3%7Cred) For more information about badges check [Generating Badges](../../web/api/badges) diff --git a/collectors/cgroups.plugin/README.md b/collectors/cgroups.plugin/README.md index e78aa044..47eeebc5 100644 --- a/collectors/cgroups.plugin/README.md +++ b/collectors/cgroups.plugin/README.md @@ -54,7 +54,7 @@ To provide a sane default for this setting, netdata uses the following pattern l search for cgroups in subpaths matching = !*/init.scope !*-qemu !/init.scope !/system !/systemd !/user !/user.slice * ``` -So, we disable checking for **child cgroups** in systemd internal cgroups ([systemd services are monitored by netdata](https://github.com/netdata/netdata/wiki/monitoring-systemd-services)), user cgroups (normally used for desktop and remote user sessions), qemu virtual machines (child cgroups of virtual machines) and `init.scope`. All others are enabled. +So, we disable checking for **child cgroups** in systemd internal cgroups ([systemd services are monitored by netdata](#monitoring-systemd-services)), user cgroups (normally used for desktop and remote user sessions), qemu virtual machines (child cgroups of virtual machines) and `init.scope`. All others are enabled. ### enabled cgroups @@ -87,7 +87,7 @@ For this mapping netdata provides 2 configuration options: The whole point for the additional pattern list, is to limit the number of times the script will be called. Without this pattern list, the script might be called thousands of times, depending on the number of cgroups available in the system. -The above pattern list is matched against the path of the cgroup. 
For matched cgroups, netdata calls the script [cgroup-name.sh](https://github.com/netdata/netdata/blob/master/collectors/cgroups.plugin/cgroup-name.sh.in) to get its name. This script queries `docker`, or applies heuristics to find give a name for the cgroup. +The above pattern list is matched against the path of the cgroup. For matched cgroups, netdata calls the script [cgroup-name.sh](cgroup-name.sh.in) to get its name. This script queries `docker`, or applies heuristics to find give a name for the cgroup. ## Monitoring systemd services diff --git a/collectors/checks.plugin/Makefile.am b/collectors/checks.plugin/Makefile.am index babdcf0d..19554bed 100644 --- a/collectors/checks.plugin/Makefile.am +++ b/collectors/checks.plugin/Makefile.am @@ -2,3 +2,7 @@ AUTOMAKE_OPTIONS = subdir-objects MAINTAINERCLEANFILES = $(srcdir)/Makefile.in + +dist_noinst_DATA = \ + README.md \ + $(NULL) diff --git a/collectors/checks.plugin/Makefile.in b/collectors/checks.plugin/Makefile.in index 63212546..faadbe58 100644 --- a/collectors/checks.plugin/Makefile.in +++ b/collectors/checks.plugin/Makefile.in @@ -15,6 +15,7 @@ @SET_MAKE@ # SPDX-License-Identifier: GPL-3.0-or-later + VPATH = @srcdir@ am__is_gnu_make = test -n '$(MAKEFILE_LIST)' && test -n '$(MAKELEVEL)' am__make_running_with_option = \ @@ -80,7 +81,8 @@ POST_UNINSTALL = : build_triplet = @build@ host_triplet = @host@ subdir = collectors/checks.plugin -DIST_COMMON = $(srcdir)/Makefile.in $(srcdir)/Makefile.am +DIST_COMMON = $(srcdir)/Makefile.in $(srcdir)/Makefile.am \ + $(dist_noinst_DATA) ACLOCAL_M4 = $(top_srcdir)/aclocal.m4 am__aclocal_m4_deps = $(top_srcdir)/build/m4/ax_c___atomic.m4 \ $(top_srcdir)/build/m4/ax_c__generic.m4 \ @@ -117,6 +119,7 @@ am__can_run_installinfo = \ n|no|NO) false;; \ *) (install-info --version) >/dev/null 2>&1;; \ esac +DATA = $(dist_noinst_DATA) am__tagged_files = $(HEADERS) $(SOURCES) $(TAGS_FILES) $(LISP) DISTFILES = $(DIST_COMMON) $(DIST_SOURCES) $(TEXINFOS) $(EXTRA_DIST) ACLOCAL = @ACLOCAL@ @@ -267,6 +270,10 @@ varlibdir = @varlibdir@ webdir = @webdir@ AUTOMAKE_OPTIONS = subdir-objects MAINTAINERCLEANFILES = $(srcdir)/Makefile.in +dist_noinst_DATA = \ + README.md \ + $(NULL) + all: all-am .SUFFIXES: @@ -339,7 +346,7 @@ distdir: $(DISTFILES) done check-am: all-am check: check-am -all-am: Makefile +all-am: Makefile $(DATA) installdirs: install: install-am install-exec: install-exec-am diff --git a/collectors/checks.plugin/README.md b/collectors/checks.plugin/README.md new file mode 100644 index 00000000..503b96ad --- /dev/null +++ b/collectors/checks.plugin/README.md @@ -0,0 +1,3 @@ +# Netdata internal checks + +A debugging plugin (by default it is disabled) diff --git a/collectors/diskspace.plugin/README.md b/collectors/diskspace.plugin/README.md index 74d6cde3..f7d0e7b4 100644 --- a/collectors/diskspace.plugin/README.md +++ b/collectors/diskspace.plugin/README.md @@ -1,5 +1,6 @@ -> for disks performance monitoring, see the `proc` plugin, [here](../proc.plugin/#monitoring-disks-performance-with-netdata) - # diskspace.plugin This plugin monitors the disk space usage of mounted disks, under Linux. 
+ +> for disks performance monitoring, see the `proc` plugin, [here](../proc.plugin/#monitoring-disks) + diff --git a/collectors/fping.plugin/README.md b/collectors/fping.plugin/README.md index 0554a7ed..a83b7912 100644 --- a/collectors/fping.plugin/README.md +++ b/collectors/fping.plugin/README.md @@ -37,7 +37,7 @@ fping_opts="-R -b 56 -i 1 -r 0 -t 5000" ## alarms netdata will automatically attach a few alarms for each host. -Check the [latest versions of the fping alarms](https://github.com/netdata/netdata/blob/master/health/health.d/fping.conf) +Check the [latest versions of the fping alarms](../../health/health.d/fping.conf) ## Additional Tips diff --git a/collectors/freebsd.plugin/Makefile.am b/collectors/freebsd.plugin/Makefile.am index e80ec702..ca4d4ddd 100644 --- a/collectors/freebsd.plugin/Makefile.am +++ b/collectors/freebsd.plugin/Makefile.am @@ -3,3 +3,7 @@ AUTOMAKE_OPTIONS = subdir-objects MAINTAINERCLEANFILES = $(srcdir)/Makefile.in + +dist_noinst_DATA = \ + README.md \ + $(NULL) diff --git a/collectors/freebsd.plugin/Makefile.in b/collectors/freebsd.plugin/Makefile.in index c88b3d75..d3332677 100644 --- a/collectors/freebsd.plugin/Makefile.in +++ b/collectors/freebsd.plugin/Makefile.in @@ -15,6 +15,7 @@ @SET_MAKE@ # SPDX-License-Identifier: GPL-3.0-or-later + VPATH = @srcdir@ am__is_gnu_make = test -n '$(MAKEFILE_LIST)' && test -n '$(MAKELEVEL)' am__make_running_with_option = \ @@ -80,7 +81,8 @@ POST_UNINSTALL = : build_triplet = @build@ host_triplet = @host@ subdir = collectors/freebsd.plugin -DIST_COMMON = $(srcdir)/Makefile.in $(srcdir)/Makefile.am +DIST_COMMON = $(srcdir)/Makefile.in $(srcdir)/Makefile.am \ + $(dist_noinst_DATA) ACLOCAL_M4 = $(top_srcdir)/aclocal.m4 am__aclocal_m4_deps = $(top_srcdir)/build/m4/ax_c___atomic.m4 \ $(top_srcdir)/build/m4/ax_c__generic.m4 \ @@ -117,6 +119,7 @@ am__can_run_installinfo = \ n|no|NO) false;; \ *) (install-info --version) >/dev/null 2>&1;; \ esac +DATA = $(dist_noinst_DATA) am__tagged_files = $(HEADERS) $(SOURCES) $(TAGS_FILES) $(LISP) DISTFILES = $(DIST_COMMON) $(DIST_SOURCES) $(TEXINFOS) $(EXTRA_DIST) ACLOCAL = @ACLOCAL@ @@ -267,6 +270,10 @@ varlibdir = @varlibdir@ webdir = @webdir@ AUTOMAKE_OPTIONS = subdir-objects MAINTAINERCLEANFILES = $(srcdir)/Makefile.in +dist_noinst_DATA = \ + README.md \ + $(NULL) + all: all-am .SUFFIXES: @@ -339,7 +346,7 @@ distdir: $(DISTFILES) done check-am: all-am check: check-am -all-am: Makefile +all-am: Makefile $(DATA) installdirs: install: install-am install-exec: install-exec-am diff --git a/collectors/freebsd.plugin/README.md b/collectors/freebsd.plugin/README.md new file mode 100644 index 00000000..e6302f42 --- /dev/null +++ b/collectors/freebsd.plugin/README.md @@ -0,0 +1,3 @@ +# freebsd + +Collects resource usage and performance data on FreeBSD systems diff --git a/collectors/freeipmi.plugin/README.md b/collectors/freeipmi.plugin/README.md index f7c5cc14..6d4ad186 100644 --- a/collectors/freeipmi.plugin/README.md +++ b/collectors/freeipmi.plugin/README.md @@ -87,7 +87,7 @@ The plugin supports a few options. 
To see them, run: options ipmi_si kipmid_max_busy_us=10 For more information: - https://github.com/ktsaou/netdata/tree/master/plugins/freeipmi.plugin + https://github.com/netdata/netdata/tree/master/collectors/freeipmi.plugin ``` diff --git a/collectors/freeipmi.plugin/freeipmi_plugin.c b/collectors/freeipmi.plugin/freeipmi_plugin.c index a1cff3af..7fc012d3 100644 --- a/collectors/freeipmi.plugin/freeipmi_plugin.c +++ b/collectors/freeipmi.plugin/freeipmi_plugin.c @@ -1624,7 +1624,7 @@ int main (int argc, char **argv) { " options ipmi_si kipmid_max_busy_us=10\n" "\n" " For more information:\n" - " https://github.com/ktsaou/netdata/tree/master/plugins/freeipmi.plugin\n" + " https://github.com/netdata/netdata/tree/master/collectors/freeipmi.plugin\n" "\n" , VERSION , netdata_update_every diff --git a/collectors/macos.plugin/Makefile.am b/collectors/macos.plugin/Makefile.am index babdcf0d..19554bed 100644 --- a/collectors/macos.plugin/Makefile.am +++ b/collectors/macos.plugin/Makefile.am @@ -2,3 +2,7 @@ AUTOMAKE_OPTIONS = subdir-objects MAINTAINERCLEANFILES = $(srcdir)/Makefile.in + +dist_noinst_DATA = \ + README.md \ + $(NULL) diff --git a/collectors/macos.plugin/Makefile.in b/collectors/macos.plugin/Makefile.in index 6247dda7..d5979211 100644 --- a/collectors/macos.plugin/Makefile.in +++ b/collectors/macos.plugin/Makefile.in @@ -15,6 +15,7 @@ @SET_MAKE@ # SPDX-License-Identifier: GPL-3.0-or-later + VPATH = @srcdir@ am__is_gnu_make = test -n '$(MAKEFILE_LIST)' && test -n '$(MAKELEVEL)' am__make_running_with_option = \ @@ -80,7 +81,8 @@ POST_UNINSTALL = : build_triplet = @build@ host_triplet = @host@ subdir = collectors/macos.plugin -DIST_COMMON = $(srcdir)/Makefile.in $(srcdir)/Makefile.am +DIST_COMMON = $(srcdir)/Makefile.in $(srcdir)/Makefile.am \ + $(dist_noinst_DATA) ACLOCAL_M4 = $(top_srcdir)/aclocal.m4 am__aclocal_m4_deps = $(top_srcdir)/build/m4/ax_c___atomic.m4 \ $(top_srcdir)/build/m4/ax_c__generic.m4 \ @@ -117,6 +119,7 @@ am__can_run_installinfo = \ n|no|NO) false;; \ *) (install-info --version) >/dev/null 2>&1;; \ esac +DATA = $(dist_noinst_DATA) am__tagged_files = $(HEADERS) $(SOURCES) $(TAGS_FILES) $(LISP) DISTFILES = $(DIST_COMMON) $(DIST_SOURCES) $(TEXINFOS) $(EXTRA_DIST) ACLOCAL = @ACLOCAL@ @@ -267,6 +270,10 @@ varlibdir = @varlibdir@ webdir = @webdir@ AUTOMAKE_OPTIONS = subdir-objects MAINTAINERCLEANFILES = $(srcdir)/Makefile.in +dist_noinst_DATA = \ + README.md \ + $(NULL) + all: all-am .SUFFIXES: @@ -339,7 +346,7 @@ distdir: $(DISTFILES) done check-am: all-am check: check-am -all-am: Makefile +all-am: Makefile $(DATA) installdirs: install: install-am install-exec: install-exec-am diff --git a/collectors/macos.plugin/README.md b/collectors/macos.plugin/README.md new file mode 100644 index 00000000..ddbcc8f9 --- /dev/null +++ b/collectors/macos.plugin/README.md @@ -0,0 +1,3 @@ +# macos + +Collects resource usage and performance data on MacOS systems diff --git a/collectors/node.d.plugin/README.md b/collectors/node.d.plugin/README.md index dd977017..af8708c7 100644 --- a/collectors/node.d.plugin/README.md +++ b/collectors/node.d.plugin/README.md @@ -9,7 +9,21 @@ 5. Allows each **module** to have one or more data collection **jobs** 6. Each **job** is collecting one or more metrics from a single data source -# Motivation +## Pull Request Checklist for Node.js Plugins + +This is a generic checklist for submitting a new Node.js plugin for Netdata. It is by no means comprehensive. 
+ +At minimum, to be buildable and testable, the PR needs to include: + +* The module itself, following proper naming conventions: `node.d/<module_dir>/<module_name>.node.js` +* A README.md file for the plugin. +* The configuration file for the module +* A basic configuration for the plugin in the appropriate global config file: `conf.d/node.d.conf`, which is also in JSON format. If the module should be enabled by default, add a section for it in the `modules` dictionary. +* A line for the plugin in the appropriate `Makefile.am` file: `node.d/Makefile.am` under `dist_node_DATA`. +* A line for the plugin configuration file in `conf.d/Makefile.am`: under `dist_nodeconfig_DATA` +* Optionally, chart information in `web/dashboard_info.js`. This generally involves specifying a name and icon for the section, and may include descriptions for the section or individual charts. + +## Motivation Node.js is perfect for asynchronous operations. It is very fast and quite common (actually the whole web is based on it). Since data collection is not a CPU intensive task, node.js is an ideal solution for it. diff --git a/collectors/plugins.d/README.md b/collectors/plugins.d/README.md index d3aa5b5b..c5981803 100644 --- a/collectors/plugins.d/README.md +++ b/collectors/plugins.d/README.md @@ -374,23 +374,23 @@ or do not output the line at all. ## Modular Plugins -1. **python**, use `python.d.plugin`, there are many examples in the [python.d directory](../python.d.plugin) +1. **python**, use `python.d.plugin`, there are many examples in the [python.d directory](../python.d.plugin/) python is ideal for netdata plugins. It is a simple, yet powerful way to collect data, it has a very small memory footprint, although it is not the most CPU efficient way to do it. -2. **node.js**, use `node.d.plugin`, there are a few examples in the [node.d directory](../node.d.plugin) +2. **node.js**, use `node.d.plugin`, there are a few examples in the [node.d directory](../node.d.plugin/) node.js is the fastest scripting language for collecting data. If your plugin needs to do a lot of work, compute values, etc, node.js is probably the best choice before moving to compiled code. Keep in mind though that node.js is not memory efficient; it will probably need more RAM compared to python. -3. **BASH**, use `charts.d.plugin`, there are many examples in the [charts.d directory](../charts.d.plugin) +3. **BASH**, use `charts.d.plugin`, there are many examples in the [charts.d directory](../charts.d.plugin/) BASH is the simplest scripting language for collecting values. It is the less efficient though in terms of CPU resources. You can use it to collect data quickly, but extensive use of it might use a lot of system resources. 4. **C** Of course, C is the most efficient way of collecting data. This is why netdata itself is written in C. - ---- + +## Properly Writing Plugins ## Writing Plugins Properly @@ -470,3 +470,4 @@ There are a few rules for writing plugins properly: 3. If you are not sure of memory leaks, exit every one hour. Netdata will re-start your process. 4. If possible, try to autodetect if your plugin should be enabled, without any configuration. 
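To make these rules concrete, below is a minimal sketch of an external plugin written in C. It assumes the `CHART`/`DIMENSION`/`BEGIN`/`SET`/`END` text protocol described earlier in this file; the chart `example.random` and its single dimension are hypothetical names used only for illustration.

```
/*
 * A minimal sketch of a netdata external plugin, assuming the
 * CHART/DIMENSION/BEGIN/SET/END text protocol described in this README.
 * The chart "example.random" and its dimension are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
    /* netdata passes the update frequency (in seconds) as the first argument */
    int update_every = (argc > 1) ? atoi(argv[1]) : 1;
    if(update_every < 1) update_every = 1;

    srand((unsigned)time(NULL));

    /* define the chart and its dimension once, at startup */
    printf("CHART example.random '' 'A random number' 'value' random random line 90000 %d\n", update_every);
    printf("DIMENSION value '' absolute 1 1\n");
    fflush(stdout);

    /* then keep sending values; netdata restarts the plugin if it exits */
    for(;;) {
        printf("BEGIN example.random\n");
        printf("SET value = %d\n", rand() % 100);
        printf("END\n");
        fflush(stdout);
        sleep(update_every);
    }

    return 0;
}
```

If the compiled binary is installed in the netdata plugins directory (typically with a name ending in `.plugin`), `plugins.d` runs it and, per rule 3 above, restarts it whenever it exits.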
+ diff --git a/collectors/proc.plugin/README.md b/collectors/proc.plugin/README.md index 9d444f3d..12306565 100644..100755 --- a/collectors/proc.plugin/README.md +++ b/collectors/proc.plugin/README.md @@ -1,4 +1,3 @@ - # proc.plugin - `/proc/net/dev` (all network interfaces for all their values) @@ -9,7 +8,7 @@ - `/proc/net/stat/nf_conntrack` (connection tracking performance) - `/proc/net/stat/synproxy` (synproxy performance) - `/proc/net/ip_vs/stats` (IPVS connection statistics) - - `/proc/stat` (CPU utilization) + - `/proc/stat` (CPU utilization and attributes) - `/proc/meminfo` (memory information) - `/proc/vmstat` (system performance) - `/proc/net/rpc/nfsd` (NFS server statistics for both v3 and v4 NFS servers) @@ -25,7 +24,7 @@ --- -# Monitoring Disks +## Monitoring Disks > Live demo of disk monitoring at: **[http://london.netdata.rocks](https://registry.my-netdata.io/#menu_disk)** @@ -33,75 +32,45 @@ Performance monitoring for Linux disks is quite complicated. The main reason is Hopefully, the Linux kernel provides many metrics that can provide deep insights of what our disks our doing. The kernel measures all these metrics on all layers of storage: **virtual disks**, **physical disks** and **partitions of disks**. -Let's see the list of metrics provided by netdata for each of the above: - -### I/O bandwidth/s (kb/s) - -The amount of data transferred from and to the disk. - -### I/O operations/s - -The number of I/O operations completed. - -### Queued I/O operations - -The number of currently queued I/O operations. For traditional disks that execute commands one after another, one of them is being run by the disk and the rest are just waiting in a queue. - -### Backlog size (time in ms) - -The expected duration of the currently queued I/O operations. - -### Utilization (time percentage) - -The percentage of time the disk was busy with something. This is a very interesting metric, since for most disks, that execute commands sequentially, **this is the key indication of congestion**. A sequential disk that is 100% of the available time busy, has no time to do anything more, so even if the bandwidth or the number of operations executed by the disk is low, its capacity has been reached. - -Of course, for newer disk technologies (like fusion cards) that are capable to execute multiple commands in parallel, this metric is just meaningless. - -### Average I/O operation time (ms) - -The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. - -### Average I/O operation size (kb) - -The average amount of data of the completed I/O operations. - -### Average Service Time (ms) - -The average service time for completed I/O operations. This metric is calculated using the total busy time of the disk and the number of completed operations. If the disk is able to execute multiple parallel operations the reporting average service time will be misleading. - -### Merged I/O operations/s - -The Linux kernel is capable of merging I/O operations. So, if two requests to read data from the disk are adjacent, the Linux kernel may merge them to one before giving them to disk. This metric measures the number of operations that have been merged by the Linux kernel. - -### Total I/O time - -The sum of the duration of all completed I/O operations. This number can exceed the interval if the disk is able to execute multiple I/O operations in parallel. 
- -### Space usage - -For mounted disks, netdata will provide a chart for their space, with 3 dimensions: - -1. free -2. used -3. reserved for root - -### inode usage - -For mounted disks, netdata will provide a chart for their inodes (number of file and directories), with 3 dimensions: - -1. free -2. used -3. reserved for root - ---- - -## disk names +### Monitored disk metrics + +- I/O bandwidth/s (kb/s) + The amount of data transferred from and to the disk. +- I/O operations/s + The number of I/O operations completed. +- Queued I/O operations + The number of currently queued I/O operations. For traditional disks that execute commands one after another, one of them is being run by the disk and the rest are just waiting in a queue. +- Backlog size (time in ms) + The expected duration of the currently queued I/O operations. +- Utilization (time percentage) + The percentage of time the disk was busy with something. This is a very interesting metric, since for most disks, that execute commands sequentially, **this is the key indication of congestion**. A sequential disk that is 100% of the available time busy, has no time to do anything more, so even if the bandwidth or the number of operations executed by the disk is low, its capacity has been reached. + Of course, for newer disk technologies (like fusion cards) that are capable to execute multiple commands in parallel, this metric is just meaningless. +- Average I/O operation time (ms) + The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. +- Average I/O operation size (kb) + The average amount of data of the completed I/O operations. +- Average Service Time (ms) + The average service time for completed I/O operations. This metric is calculated using the total busy time of the disk and the number of completed operations. If the disk is able to execute multiple parallel operations the reporting average service time will be misleading. +- Merged I/O operations/s + The Linux kernel is capable of merging I/O operations. So, if two requests to read data from the disk are adjacent, the Linux kernel may merge them to one before giving them to disk. This metric measures the number of operations that have been merged by the Linux kernel. +- Total I/O time + The sum of the duration of all completed I/O operations. This number can exceed the interval if the disk is able to execute multiple I/O operations in parallel. +- Space usage + For mounted disks, netdata will provide a chart for their space, with 3 dimensions: + 1. free + 2. used + 3. reserved for root +- inode usage + For mounted disks, netdata will provide a chart for their inodes (number of file and directories), with 3 dimensions: + 1. free + 2. used + 3. reserved for root + +### disk names netdata will automatically set the name of disks on the dashboard, from the mount point they are mounted, of course only when they are mounted. Changes in mount points are not currently detected (you will have to restart netdata to change the name of the disk). ---- - -## performance metrics +### performance metrics By default netdata will enable monitoring metrics only when they are not zero. If they are constantly zero they are ignored. Metrics that will start having values, after netdata is started, will be detected and charts will be automatically added to the dashboard (a refresh of the dashboard is needed for them to appear though). 
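To illustrate the arithmetic behind two of the metrics listed above (utilization and average service time), the sketch below samples the kernel's `/proc/diskstats` counters twice and computes both values from the deltas. This is only an illustration of the calculation, not netdata's implementation; the device name `sda` is a placeholder.

```
/*
 * Illustrative sketch: derive disk utilization and average service time
 * from two snapshots of /proc/diskstats. Not netdata's implementation.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct disk_sample {
    unsigned long long reads, writes, busy_ms;
};

static int read_disk(const char *dev, struct disk_sample *s) {
    FILE *fp = fopen("/proc/diskstats", "r");
    if(!fp) return -1;

    char line[512], name[64];
    int found = -1;

    while(fgets(line, sizeof(line), fp)) {
        unsigned int major, minor;
        unsigned long long rd, rd_merged, rd_sectors, rd_ms;
        unsigned long long wr, wr_merged, wr_sectors, wr_ms;
        unsigned long long inflight, busy_ms, weighted_ms;

        /* first 14 fields: major minor name + the classic iostats counters */
        if(sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                  &major, &minor, name, &rd, &rd_merged, &rd_sectors, &rd_ms,
                  &wr, &wr_merged, &wr_sectors, &wr_ms,
                  &inflight, &busy_ms, &weighted_ms) < 14)
            continue;

        if(strcmp(name, dev) == 0) {
            s->reads = rd;
            s->writes = wr;
            s->busy_ms = busy_ms;   /* time spent doing I/Os, in ms */
            found = 0;
            break;
        }
    }

    fclose(fp);
    return found;
}

int main(void) {
    struct disk_sample a, b;
    const int interval = 1;   /* seconds between the two snapshots */

    if(read_disk("sda", &a)) return 1;
    sleep(interval);
    if(read_disk("sda", &b)) return 1;

    unsigned long long ops  = (b.reads - a.reads) + (b.writes - a.writes);
    unsigned long long busy = b.busy_ms - a.busy_ms;

    /* utilization: busy time as a percentage of the sampling interval */
    printf("utilization: %.1f%%\n", 100.0 * (double)busy / (interval * 1000.0));

    /* average service time: busy time divided by completed operations */
    if(ops)
        printf("avg service time: %.2f ms\n", (double)busy / (double)ops);

    return 0;
}
```

As noted above, the average service time computed this way becomes misleading for devices that serve many operations in parallel, since the busy time no longer grows proportionally to the work done.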
@@ -198,3 +167,76 @@ So, to disable performance metrics for all loop devices you could add `performan performance metrics for disks with major 7 = no ``` +## Monitoring CPUs + +The `/proc/stat` module monitors CPU utilization, interrupts, context switches, processes started/running, thermal throttling, frequency, and idle states. It gathers this information from multiple files. + +If more than 50 cores are present in a system then CPU thermal throttling, frequency, and idle state charts are disabled. + +#### configuration + +`keep per core files open` option in the `[plugin:proc:/proc/stat]` configuration section allows reducing the number of file operations on multiple files. + +### CPU frequency + +The module shows the current CPU frequency as set by the `cpufreq` kernel +module. + +**Requirement:** +You need to have `CONFIG_CPU_FREQ` and (optionally) `CONFIG_CPU_FREQ_STAT` +enabled in your kernel. + +`cpufreq` interface provides two different ways of getting the information through `/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq` and `/sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state` files. The latter is more accurate so it is preferred in the module. `scaling_cur_freq` represents only the current CPU frequency, and doesn't account for any state changes which happen between updates. The module switches back and forth between these two methods if governor is changed. + +It produces one chart with multiple lines (one line per core). + +#### configuration + +`scaling_cur_freq filename to monitor` and `time_in_state filename to monitor` in the `[plugin:proc:/proc/stat]` configuration section + +### CPU idle states + +The module monitors the usage of CPU idle states. + +**Requirement:** +Your kernel needs to have `CONFIG_CPU_IDLE` enabled. + +It produces one stacked chart per CPU, showing the percentage of time spent in +each state. + +#### configuration + +`schedstat filename to monitor`, `cpuidle name filename to monitor`, and `cpuidle time filename to monitor` in the `[plugin:proc:/proc/stat]` configuration section + +## Linux Anti-DDoS + +![image6](https://cloud.githubusercontent.com/assets/2662304/14253733/53550b16-fa95-11e5-8d9d-4ed171df4735.gif) + +--- +SYNPROXY is a TCP SYN packets proxy. It can be used to protect any TCP server (like a web server) from SYN floods and similar DDos attacks. + +SYNPROXY is a netfilter module, in the Linux kernel (since version 3.12). It is optimized to handle millions of packets per second utilizing all CPUs available without any concurrency locking between the connections. + +The net effect of this, is that the real servers will not notice any change during the attack. The valid TCP connections will pass through and served, while the attack will be stopped at the firewall. + +To use SYNPROXY on your firewall, please follow our setup guides: + + - **[Working with SYNPROXY](https://github.com/firehol/firehol/wiki/Working-with-SYNPROXY)** + - **[Working with SYNPROXY and traps](https://github.com/firehol/firehol/wiki/Working-with-SYNPROXY-and-traps)** + +### Real-time monitoring of Linux Anti-DDoS + +netdata is able to monitor in real-time (per second updates) the operation of the Linux Anti-DDoS protection. + +It visualizes 4 charts: + +1. TCP SYN Packets received on ports operated by SYNPROXY +2. TCP Cookies (valid, invalid, retransmits) +3. Connections Reopened +4. 
Entries used + +Example image: + +![ddos](https://cloud.githubusercontent.com/assets/2662304/14398891/6016e3fc-fdf0-11e5-942b-55de6a52cb66.gif) + +See Linux Anti-DDoS in action at: **[netdata demo site (with SYNPROXY enabled)](https://registry.my-netdata.io/#menu_netfilter_submenu_synproxy)** diff --git a/collectors/proc.plugin/proc_net_dev.c b/collectors/proc.plugin/proc_net_dev.c index 97cbc060..1e426e97 100644 --- a/collectors/proc.plugin/proc_net_dev.c +++ b/collectors/proc.plugin/proc_net_dev.c @@ -66,7 +66,7 @@ static struct netdev { kernel_uint_t tcollisions; kernel_uint_t tcarrier; kernel_uint_t tcompressed; - kernel_uint_t speed_max; + kernel_uint_t speed; // charts RRDSET *st_bandwidth; @@ -96,6 +96,10 @@ static struct netdev { RRDDIM *rd_tcarrier; RRDDIM *rd_tcompressed; + usec_t speed_last_collected_usec; + char *filename_speed; + RRDSETVAR *chart_var_speed; + struct netdev *next; } *netdev_root = NULL, *netdev_last_used = NULL; @@ -139,7 +143,7 @@ static void netdev_charts_release(struct netdev *d) { d->rd_tcompressed = NULL; } -static void netdev_free_strings(struct netdev *d) { +static void netdev_free_chart_strings(struct netdev *d) { freez((void *)d->chart_type_net_bytes); freez((void *)d->chart_type_net_compressed); freez((void *)d->chart_type_net_drops); @@ -161,9 +165,10 @@ static void netdev_free_strings(struct netdev *d) { static void netdev_free(struct netdev *d) { netdev_charts_release(d); - netdev_free_strings(d); + netdev_free_chart_strings(d); freez((void *)d->name); + freez((void *)d->filename_speed); freez((void *)d); netdev_added--; } @@ -265,7 +270,7 @@ static inline void netdev_rename_cgroup(struct netdev *d, struct netdev_rename * info("CGROUP: renaming network interface '%s' as '%s' under '%s'", r->host_device, r->container_device, r->container_name); netdev_charts_release(d); - netdev_free_strings(d); + netdev_free_chart_strings(d); char buffer[RRD_ID_LENGTH_MAX + 1]; @@ -435,15 +440,21 @@ int do_proc_net_dev(int update_every, usec_t dt) { static procfile *ff = NULL; static int enable_new_interfaces = -1; static int do_bandwidth = -1, do_packets = -1, do_errors = -1, do_drops = -1, do_fifo = -1, do_compressed = -1, do_events = -1; - static char *path_to_sys_devices_virtual_net = NULL; - static char *path_to_sys_net_speed = NULL; + static char *path_to_sys_devices_virtual_net = NULL, *path_to_sys_class_net_speed = NULL, *proc_net_dev_filename = NULL; + static long long int dt_to_refresh_speed = 0; if(unlikely(enable_new_interfaces == -1)) { char filename[FILENAME_MAX + 1]; + snprintfz(filename, FILENAME_MAX, "%s%s", netdata_configured_host_prefix, (*netdata_configured_host_prefix)?"/proc/1/net/dev":"/proc/net/dev"); + proc_net_dev_filename = config_get(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "filename to monitor", filename); + snprintfz(filename, FILENAME_MAX, "%s%s", netdata_configured_host_prefix, "/sys/devices/virtual/net/%s"); path_to_sys_devices_virtual_net = config_get(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "path to get virtual interfaces", filename); + snprintfz(filename, FILENAME_MAX, "%s%s", netdata_configured_host_prefix, "/sys/class/net/%s/speed"); + path_to_sys_class_net_speed = config_get(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "path to get net device speed", filename); + enable_new_interfaces = config_get_boolean_ondemand(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "enable new interfaces detected at runtime", CONFIG_BOOLEAN_AUTO); do_bandwidth = config_get_boolean_ondemand(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "bandwidth for all interfaces", CONFIG_BOOLEAN_AUTO); @@ 
-455,12 +466,13 @@ int do_proc_net_dev(int update_every, usec_t dt) { do_events = config_get_boolean_ondemand(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "frames, collisions, carrier counters for all interfaces", CONFIG_BOOLEAN_AUTO); disabled_list = simple_pattern_create(config_get(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "disable by default interfaces matching", "lo fireqos* *-ifb"), NULL, SIMPLE_PATTERN_EXACT); + + dt_to_refresh_speed = config_get_number(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "refresh interface speed every seconds", 10) * USEC_PER_SEC; + if(dt_to_refresh_speed < 0) dt_to_refresh_speed = 0; } if(unlikely(!ff)) { - char filename[FILENAME_MAX + 1]; - snprintfz(filename, FILENAME_MAX, "%s%s", netdata_configured_host_prefix, (*netdata_configured_host_prefix)?"/proc/1/net/dev":"/proc/net/dev"); - ff = procfile_open(config_get(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "filename to monitor", filename), " \t,:|", PROCFILE_FLAG_DEFAULT); + ff = procfile_open(proc_net_dev_filename, " \t,|", PROCFILE_FLAG_DEFAULT); if(unlikely(!ff)) return 1; } @@ -481,7 +493,11 @@ int do_proc_net_dev(int update_every, usec_t dt) { // require 17 words on each line if(unlikely(procfile_linewords(ff, l) < 17)) continue; - struct netdev *d = get_netdev(procfile_lineword(ff, l, 0)); + char *name = procfile_lineword(ff, l, 0); + size_t len = strlen(name); + if(name[len - 1] == ':') name[len - 1] = '\0'; + + struct netdev *d = get_netdev(name); d->updated = 1; netdev_found++; @@ -505,12 +521,10 @@ int do_proc_net_dev(int update_every, usec_t dt) { else d->virtual = 0; - // set nic speed if present if(likely(!d->virtual)) { - snprintfz(buffer, FILENAME_MAX, "%s/sys/class/net/%s/speed", netdata_configured_host_prefix, d->name); - path_to_sys_net_speed = config_get(CONFIG_SECTION_PLUGIN_PROC_NETDEV, "path to get net device speed", buffer); - int ret = read_single_number_file(path_to_sys_net_speed, (unsigned long long*)&d->speed_max); - if(ret) error("Cannot read '%s'.", path_to_sys_net_speed); + // set the filename to get the interface speed + snprintfz(buffer, FILENAME_MAX, path_to_sys_class_net_speed, d->name); + d->filename_speed = strdupz(buffer); } snprintfz(buffer, FILENAME_MAX, "plugin:proc:/proc/net/dev:%s", d->name); @@ -574,6 +588,17 @@ int do_proc_net_dev(int update_every, usec_t dt) { d->tcarrier = str2kernel_uint_t(procfile_lineword(ff, l, 15)); } + //info("PROC_NET_DEV: %s speed %zu, bytes %zu/%zu, packets %zu/%zu/%zu, errors %zu/%zu, drops %zu/%zu, fifo %zu/%zu, compressed %zu/%zu, rframe %zu, tcollisions %zu, tcarrier %zu" + // , d->name, d->speed + // , d->rbytes, d->tbytes + // , d->rpackets, d->tpackets, d->rmulticast + // , d->rerrors, d->terrors + // , d->rdrops, d->tdrops + // , d->rfifo, d->tfifo + // , d->rcompressed, d->tcompressed + // , d->rframe, d->tcollisions, d->tcarrier + // ); + // -------------------------------------------------------------------- if(unlikely((d->do_bandwidth == CONFIG_BOOLEAN_AUTO && (d->rbytes || d->tbytes)))) @@ -597,9 +622,6 @@ int do_proc_net_dev(int update_every, usec_t dt) { , RRDSET_TYPE_AREA ); - RRDSETVAR *nic_speed_max = rrdsetvar_custom_chart_variable_create(d->st_bandwidth, "nic_speed_max"); - if(nic_speed_max) rrdsetvar_custom_chart_variable_set(nic_speed_max, (calculated_number)d->speed_max); - d->rd_rbytes = rrddim_add(d->st_bandwidth, "received", NULL, 8, BITS_IN_A_KILOBIT, RRD_ALGORITHM_INCREMENTAL); d->rd_tbytes = rrddim_add(d->st_bandwidth, "sent", NULL, -8, BITS_IN_A_KILOBIT, RRD_ALGORITHM_INCREMENTAL); @@ -616,6 +638,35 @@ int do_proc_net_dev(int update_every, 
usec_t dt) { rrddim_set_by_pointer(d->st_bandwidth, d->rd_rbytes, (collected_number)d->rbytes); rrddim_set_by_pointer(d->st_bandwidth, d->rd_tbytes, (collected_number)d->tbytes); rrdset_done(d->st_bandwidth); + + // update the interface speed + if(d->filename_speed) { + d->speed_last_collected_usec += dt; + + if(unlikely(d->speed_last_collected_usec >= (usec_t)dt_to_refresh_speed)) { + + if(unlikely(!d->chart_var_speed)) { + d->chart_var_speed = rrdsetvar_custom_chart_variable_create(d->st_bandwidth, "nic_speed_max"); + if(!d->chart_var_speed) { + error("Cannot create interface %s chart variable 'nic_speed_max'. Will not update its speed anymore.", d->name); + freez(d->filename_speed); + d->filename_speed = NULL; + } + } + + if(d->filename_speed && d->chart_var_speed) { + if(read_single_number_file(d->filename_speed, (unsigned long long *) &d->speed)) { + error("Cannot refresh interface %s speed by reading '%s'. Will not update its speed anymore.", d->name, d->filename_speed); + freez(d->filename_speed); + d->filename_speed = NULL; + } + else { + rrdsetvar_custom_chart_variable_set(d->chart_var_speed, (calculated_number) d->speed); + d->speed_last_collected_usec = 0; + } + } + } + } } // -------------------------------------------------------------------- diff --git a/collectors/proc.plugin/proc_net_stat_conntrack.c b/collectors/proc.plugin/proc_net_stat_conntrack.c index f5257c0a..642e33f8 100644 --- a/collectors/proc.plugin/proc_net_stat_conntrack.c +++ b/collectors/proc.plugin/proc_net_stat_conntrack.c @@ -50,7 +50,7 @@ int do_proc_net_stat_conntrack(int update_every, usec_t dt) { if(!do_sockets && !read_full) return 1; - rrdvar_max = rrdvar_custom_host_variable_create(localhost, "netfilter.conntrack.max"); + rrdvar_max = rrdvar_custom_host_variable_create(localhost, "netfilter_conntrack_max"); } if(likely(read_full)) { diff --git a/collectors/proc.plugin/proc_stat.c b/collectors/proc.plugin/proc_stat.c index fb77df64..931b415a 100644..100755 --- a/collectors/proc.plugin/proc_stat.c +++ b/collectors/proc.plugin/proc_stat.c @@ -12,9 +12,23 @@ struct per_core_single_number_file { RRDDIM *rd; }; +struct last_ticks { + collected_number frequency; + collected_number ticks; +}; + +// This is an extension of struct per_core_single_number_file at CPU_FREQ_INDEX. +// Either scaling_cur_freq or time_in_state file is used at one time. 
+struct per_core_time_in_state_file { + const char *filename; + procfile *ff; + size_t last_ticks_len; + struct last_ticks *last_ticks; +}; + #define CORE_THROTTLE_COUNT_INDEX 0 #define PACKAGE_THROTTLE_COUNT_INDEX 1 -#define SCALING_CUR_FREQ_INDEX 2 +#define CPU_FREQ_INDEX 2 #define PER_CORE_FILES 3 struct cpu_chart { @@ -33,6 +47,8 @@ struct cpu_chart { RRDDIM *rd_guest_nice; struct per_core_single_number_file files[PER_CORE_FILES]; + + struct per_core_time_in_state_file time_in_state_files; }; static int keep_per_core_fds_open = CONFIG_BOOLEAN_YES; @@ -87,7 +103,6 @@ static int read_per_core_files(struct cpu_chart *all_cpu_charts, size_t len, siz f->found = 1; f->value = str2ll(buf, NULL); - // info("read '%s', parsed as " COLLECTED_NUMBER_FORMAT, buf, f->value); if(likely(f->value != 0)) files_nonzero++; } @@ -101,6 +116,112 @@ static int read_per_core_files(struct cpu_chart *all_cpu_charts, size_t len, siz return (int)files_nonzero; } +static int read_per_core_time_in_state_files(struct cpu_chart *all_cpu_charts, size_t len, size_t index) { + size_t x, files_read = 0, files_nonzero = 0; + + for(x = 0; x < len ; x++) { + struct per_core_single_number_file *f = &all_cpu_charts[x].files[index]; + struct per_core_time_in_state_file *tsf = &all_cpu_charts[x].time_in_state_files; + + f->found = 0; + + if(unlikely(!tsf->filename)) + continue; + + if(unlikely(!tsf->ff)) { + tsf->ff = procfile_open(tsf->filename, " \t:", PROCFILE_FLAG_DEFAULT); + if(unlikely(!tsf->ff)) + { + error("Cannot open file '%s'", tsf->filename); + continue; + } + } + + tsf->ff = procfile_readall(tsf->ff); + if(unlikely(!tsf->ff)) { + error("Cannot read file '%s'", tsf->filename); + procfile_close(tsf->ff); + tsf->ff = NULL; + continue; + } + else { + // successful read + + size_t lines = procfile_lines(tsf->ff), l; + size_t words; + unsigned long long total_ticks_since_last = 0, avg_freq = 0; + + // Check if there is at least one frequency in time_in_state + if (procfile_word(tsf->ff, 0)[0] == '\0') { + if(unlikely(keep_per_core_fds_open != CONFIG_BOOLEAN_YES)) { + procfile_close(tsf->ff); + tsf->ff = NULL; + } + // TODO: Is there a better way to avoid spikes than calculating the average over + // the whole period under schedutil governor? + // freez(tsf->last_ticks); + // tsf->last_ticks = NULL; + // tsf->last_ticks_len = 0; + continue; + } + + if (unlikely(tsf->last_ticks_len < lines || tsf->last_ticks == NULL)) { + tsf->last_ticks = reallocz(tsf->last_ticks, sizeof(struct last_ticks) * lines); + memset(tsf->last_ticks, 0, sizeof(struct last_ticks) * lines); + tsf->last_ticks_len = lines; + } + + f->value = 0; + + for(l = 0; l < lines - 1 ;l++) { + unsigned long long frequency = 0, ticks = 0, ticks_since_last = 0; + + words = procfile_linewords(tsf->ff, l); + if(unlikely(words < 2)) { + error("Cannot read time_in_state line. 
Expected 2 params, read %zu.", words); + continue; + } + frequency = str2ull(procfile_lineword(tsf->ff, l, 0)); + ticks = str2ull(procfile_lineword(tsf->ff, l, 1)); + + // It is assumed that frequencies are static and sorted + ticks_since_last = ticks - tsf->last_ticks[l].ticks; + tsf->last_ticks[l].frequency = frequency; + tsf->last_ticks[l].ticks = ticks; + + total_ticks_since_last += ticks_since_last; + avg_freq += frequency * ticks_since_last; + + } + + if (likely(total_ticks_since_last)) { + avg_freq /= total_ticks_since_last; + f->value = avg_freq; + } + + if(unlikely(keep_per_core_fds_open != CONFIG_BOOLEAN_YES)) { + procfile_close(tsf->ff); + tsf->ff = NULL; + } + } + + files_read++; + + f->found = 1; + + if(likely(f->value != 0)) + files_nonzero++; + } + + if(unlikely(files_read == 0)) + return -1; + + if(unlikely(files_nonzero == 0)) + return 0; + + return (int)files_nonzero; +} + static void chart_per_core_files(struct cpu_chart *all_cpu_charts, size_t len, size_t index, RRDSET *st, collected_number multiplier, collected_number divisor, RRD_ALGORITHM algorithm) { size_t x; for(x = 0; x < len ; x++) { @@ -122,10 +243,11 @@ int do_proc_stat(int update_every, usec_t dt) { static struct cpu_chart *all_cpu_charts = NULL; static size_t all_cpu_charts_size = 0; static procfile *ff = NULL; - static int do_cpu = -1, do_cpu_cores = -1, do_interrupts = -1, do_context = -1, do_forks = -1, do_processes = -1, do_core_throttle_count = -1, do_package_throttle_count = -1, do_scaling_cur_freq = -1; + static int do_cpu = -1, do_cpu_cores = -1, do_interrupts = -1, do_context = -1, do_forks = -1, do_processes = -1, do_core_throttle_count = -1, do_package_throttle_count = -1, do_cpu_freq = -1; static uint32_t hash_intr, hash_ctxt, hash_processes, hash_procs_running, hash_procs_blocked; - static char *core_throttle_count_filename = NULL, *package_throttle_count_filename = NULL, *scaling_cur_freq_filename = NULL; + static char *core_throttle_count_filename = NULL, *package_throttle_count_filename = NULL, *scaling_cur_freq_filename = NULL, *time_in_state_filename = NULL; static RRDVAR *cpus_var = NULL; + static int accurate_freq_avail = 0, accurate_freq_is_used = 0; size_t cores_found = (size_t)processors; if(unlikely(do_cpu == -1)) { @@ -137,25 +259,25 @@ int do_proc_stat(int update_every, usec_t dt) { do_processes = config_get_boolean("plugin:proc:/proc/stat", "processes running", CONFIG_BOOLEAN_YES); // give sane defaults based on the number of processors - if(processors > 50) { + if(unlikely(processors > 50)) { // the system has too many processors keep_per_core_fds_open = CONFIG_BOOLEAN_NO; do_core_throttle_count = CONFIG_BOOLEAN_NO; do_package_throttle_count = CONFIG_BOOLEAN_NO; - do_scaling_cur_freq = CONFIG_BOOLEAN_NO; + do_cpu_freq = CONFIG_BOOLEAN_NO; } else { // the system has a reasonable number of processors keep_per_core_fds_open = CONFIG_BOOLEAN_YES; do_core_throttle_count = CONFIG_BOOLEAN_AUTO; do_package_throttle_count = CONFIG_BOOLEAN_NO; - do_scaling_cur_freq = CONFIG_BOOLEAN_NO; + do_cpu_freq = CONFIG_BOOLEAN_YES; } keep_per_core_fds_open = config_get_boolean("plugin:proc:/proc/stat", "keep per core files open", keep_per_core_fds_open); do_core_throttle_count = config_get_boolean_ondemand("plugin:proc:/proc/stat", "core_throttle_count", do_core_throttle_count); do_package_throttle_count = config_get_boolean_ondemand("plugin:proc:/proc/stat", "package_throttle_count", do_package_throttle_count); - do_scaling_cur_freq = config_get_boolean_ondemand("plugin:proc:/proc/stat", 
"scaling_cur_freq", do_scaling_cur_freq); + do_cpu_freq = config_get_boolean_ondemand("plugin:proc:/proc/stat", "cpu frequency", do_cpu_freq); hash_intr = simple_hash("intr"); hash_ctxt = simple_hash("ctxt"); @@ -172,6 +294,9 @@ int do_proc_stat(int update_every, usec_t dt) { snprintfz(filename, FILENAME_MAX, "%s%s", netdata_configured_host_prefix, "/sys/devices/system/cpu/%s/cpufreq/scaling_cur_freq"); scaling_cur_freq_filename = config_get("plugin:proc:/proc/stat", "scaling_cur_freq filename to monitor", filename); + + snprintfz(filename, FILENAME_MAX, "%s%s", netdata_configured_host_prefix, "/sys/devices/system/cpu/%s/cpufreq/stats/time_in_state"); + time_in_state_filename = config_get("plugin:proc:/proc/stat", "time_in_state filename to monitor", filename); } if(unlikely(!ff)) { @@ -202,7 +327,7 @@ int do_proc_stat(int update_every, usec_t dt) { } size_t core = (row_key[3] == '\0') ? 0 : str2ul(&row_key[3]) + 1; - if(core > 0) cores_found = core; + if(likely(core > 0)) cores_found = core; if(likely((core == 0 && do_cpu) || (core > 0 && do_cpu_cores))) { char *id; @@ -227,7 +352,7 @@ int do_proc_stat(int update_every, usec_t dt) { char *title, *type, *context, *family; long priority; - if(core >= all_cpu_charts_size) { + if(unlikely(core >= all_cpu_charts_size)) { size_t old_cpu_charts_size = all_cpu_charts_size; all_cpu_charts_size = core + 1; all_cpu_charts = reallocz(all_cpu_charts, sizeof(struct cpu_chart) * all_cpu_charts_size); @@ -238,7 +363,7 @@ int do_proc_stat(int update_every, usec_t dt) { if(unlikely(!cpu_chart->st)) { cpu_chart->id = strdupz(id); - if(core == 0) { + if(unlikely(core == 0)) { title = "Total CPU utilization"; type = "system"; context = "system.cpu"; @@ -252,9 +377,6 @@ int do_proc_stat(int update_every, usec_t dt) { family = "utilization"; priority = NETDATA_CHART_PRIO_CPU_PER_CORE; - // TODO: check for /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq - // TODO: check for /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state - char filename[FILENAME_MAX + 1]; struct stat stbuf; @@ -276,12 +398,23 @@ int do_proc_stat(int update_every, usec_t dt) { } } - if(do_scaling_cur_freq != CONFIG_BOOLEAN_NO) { + if(do_cpu_freq != CONFIG_BOOLEAN_NO) { + snprintfz(filename, FILENAME_MAX, scaling_cur_freq_filename, id); + + if (stat(filename, &stbuf) == 0) { + cpu_chart->files[CPU_FREQ_INDEX].filename = strdupz(filename); + cpu_chart->files[CPU_FREQ_INDEX].fd = -1; + do_cpu_freq = CONFIG_BOOLEAN_YES; + } + + snprintfz(filename, FILENAME_MAX, time_in_state_filename, id); + if (stat(filename, &stbuf) == 0) { - cpu_chart->files[SCALING_CUR_FREQ_INDEX].filename = strdupz(filename); - cpu_chart->files[SCALING_CUR_FREQ_INDEX].fd = -1; - do_scaling_cur_freq = CONFIG_BOOLEAN_YES; + cpu_chart->time_in_state_files.filename = strdupz(filename); + cpu_chart->time_in_state_files.ff = NULL; + do_cpu_freq = CONFIG_BOOLEAN_YES; + accurate_freq_avail = 1; } } } @@ -532,21 +665,40 @@ int do_proc_stat(int update_every, usec_t dt) { } } - if(likely(do_scaling_cur_freq != CONFIG_BOOLEAN_NO)) { - int r = read_per_core_files(&all_cpu_charts[1], all_cpu_charts_size - 1, SCALING_CUR_FREQ_INDEX); - if(likely(r != -1 && (do_scaling_cur_freq == CONFIG_BOOLEAN_YES || r > 0))) { - do_scaling_cur_freq = CONFIG_BOOLEAN_YES; + if(likely(do_cpu_freq != CONFIG_BOOLEAN_NO)) { + char filename[FILENAME_MAX + 1]; + int r = 0; + + if (accurate_freq_avail) { + r = read_per_core_time_in_state_files(&all_cpu_charts[1], all_cpu_charts_size - 1, CPU_FREQ_INDEX); + if(r > 0 && !accurate_freq_is_used) { + 
accurate_freq_is_used = 1; + snprintfz(filename, FILENAME_MAX, time_in_state_filename, "cpu*"); + info("cpufreq is using %s", filename); + } + } + if (r < 1) { + r = read_per_core_files(&all_cpu_charts[1], all_cpu_charts_size - 1, CPU_FREQ_INDEX); + if(accurate_freq_is_used) { + accurate_freq_is_used = 0; + snprintfz(filename, FILENAME_MAX, scaling_cur_freq_filename, "cpu*"); + info("cpufreq fell back to %s", filename); + } + } + + if(likely(r != -1 && (do_cpu_freq == CONFIG_BOOLEAN_YES || r > 0))) { + do_cpu_freq = CONFIG_BOOLEAN_YES; static RRDSET *st_scaling_cur_freq = NULL; if(unlikely(!st_scaling_cur_freq)) st_scaling_cur_freq = rrdset_create_localhost( "cpu" - , "scaling_cur_freq" + , "cpufreq" , NULL , "cpufreq" - , "cpu.scaling_cur_freq" - , "Per CPU Core, Current CPU Scaling Frequency" + , "cpufreq.cpufreq" + , "Current CPU Frequency" , "MHz" , PLUGIN_PROC_NAME , PLUGIN_PROC_MODULE_STAT_NAME @@ -557,7 +709,7 @@ int do_proc_stat(int update_every, usec_t dt) { else rrdset_next(st_scaling_cur_freq); - chart_per_core_files(&all_cpu_charts[1], all_cpu_charts_size - 1, SCALING_CUR_FREQ_INDEX, st_scaling_cur_freq, 1, 1000, RRD_ALGORITHM_ABSOLUTE); + chart_per_core_files(&all_cpu_charts[1], all_cpu_charts_size - 1, CPU_FREQ_INDEX, st_scaling_cur_freq, 1, 1000, RRD_ALGORITHM_ABSOLUTE); rrdset_done(st_scaling_cur_freq); } } diff --git a/collectors/python.d.plugin/Makefile.am b/collectors/python.d.plugin/Makefile.am index 5f214e43..984050c4 100644 --- a/collectors/python.d.plugin/Makefile.am +++ b/collectors/python.d.plugin/Makefile.am @@ -74,9 +74,11 @@ include monit/Makefile.inc include mysql/Makefile.inc include nginx/Makefile.inc include nginx_plus/Makefile.inc +include nvidia_smi/Makefile.inc include nsd/Makefile.inc include ntpd/Makefile.inc include ovpn_status_log/Makefile.inc +include openldap/Makefile.inc include phpfpm/Makefile.inc include portcheck/Makefile.inc include postfix/Makefile.inc @@ -95,6 +97,7 @@ include spigotmc/Makefile.inc include springboot/Makefile.inc include squid/Makefile.inc include tomcat/Makefile.inc +include tor/Makefile.inc include traefik/Makefile.inc include unbound/Makefile.inc include uwsgi/Makefile.inc diff --git a/collectors/python.d.plugin/Makefile.in b/collectors/python.d.plugin/Makefile.in index ca2743d5..49560689 100644 --- a/collectors/python.d.plugin/Makefile.in +++ b/collectors/python.d.plugin/Makefile.in @@ -400,6 +400,24 @@ # IT IS INCLUDED BY ITS PARENT'S Makefile.am # IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + VPATH = @srcdir@ am__is_gnu_make = test -n '$(MAKEFILE_LIST)' && test -n '$(MAKELEVEL)' @@ -489,10 +507,12 @@ DIST_COMMON = $(top_srcdir)/build/subst.inc \ $(srcdir)/memcached/Makefile.inc \ $(srcdir)/mongodb/Makefile.inc $(srcdir)/monit/Makefile.inc \ $(srcdir)/mysql/Makefile.inc $(srcdir)/nginx/Makefile.inc \ - $(srcdir)/nginx_plus/Makefile.inc $(srcdir)/nsd/Makefile.inc \ + $(srcdir)/nginx_plus/Makefile.inc \ + 
$(srcdir)/nvidia_smi/Makefile.inc $(srcdir)/nsd/Makefile.inc \ $(srcdir)/ntpd/Makefile.inc \ $(srcdir)/ovpn_status_log/Makefile.inc \ - $(srcdir)/phpfpm/Makefile.inc $(srcdir)/portcheck/Makefile.inc \ + $(srcdir)/openldap/Makefile.inc $(srcdir)/phpfpm/Makefile.inc \ + $(srcdir)/portcheck/Makefile.inc \ $(srcdir)/postfix/Makefile.inc $(srcdir)/postgres/Makefile.inc \ $(srcdir)/powerdns/Makefile.inc \ $(srcdir)/proxysql/Makefile.inc $(srcdir)/puppet/Makefile.inc \ @@ -503,14 +523,14 @@ DIST_COMMON = $(top_srcdir)/build/subst.inc \ $(srcdir)/smartd_log/Makefile.inc \ $(srcdir)/spigotmc/Makefile.inc \ $(srcdir)/springboot/Makefile.inc $(srcdir)/squid/Makefile.inc \ - $(srcdir)/tomcat/Makefile.inc $(srcdir)/traefik/Makefile.inc \ - $(srcdir)/unbound/Makefile.inc $(srcdir)/uwsgi/Makefile.inc \ - $(srcdir)/varnish/Makefile.inc $(srcdir)/w1sensor/Makefile.inc \ - $(srcdir)/web_log/Makefile.inc $(srcdir)/Makefile.in \ - $(srcdir)/Makefile.am $(dist_plugins_SCRIPTS) \ - $(dist_python_SCRIPTS) $(dist_bases_DATA) \ - $(dist_bases_framework_services_DATA) $(dist_libconfig_DATA) \ - $(dist_noinst_DATA) $(dist_python_DATA) \ + $(srcdir)/tomcat/Makefile.inc $(srcdir)/tor/Makefile.inc \ + $(srcdir)/traefik/Makefile.inc $(srcdir)/unbound/Makefile.inc \ + $(srcdir)/uwsgi/Makefile.inc $(srcdir)/varnish/Makefile.inc \ + $(srcdir)/w1sensor/Makefile.inc $(srcdir)/web_log/Makefile.inc \ + $(srcdir)/Makefile.in $(srcdir)/Makefile.am \ + $(dist_plugins_SCRIPTS) $(dist_python_SCRIPTS) \ + $(dist_bases_DATA) $(dist_bases_framework_services_DATA) \ + $(dist_libconfig_DATA) $(dist_noinst_DATA) $(dist_python_DATA) \ $(dist_python_urllib3_DATA) \ $(dist_python_urllib3_backports_DATA) \ $(dist_python_urllib3_contrib_DATA) \ @@ -903,6 +923,12 @@ dist_plugins_SCRIPTS = \ # do not install these files, but include them in the distribution # do not install these files, but include them in the distribution + +# do not install these files, but include them in the distribution + +# do not install these files, but include them in the distribution + +# do not install these files, but include them in the distribution dist_noinst_DATA = python.d.plugin.in README.md $(NULL) \ adaptec_raid/README.md adaptec_raid/Makefile.inc \ apache/README.md apache/Makefile.inc beanstalk/README.md \ @@ -931,26 +957,29 @@ dist_noinst_DATA = python.d.plugin.in README.md $(NULL) \ memcached/Makefile.inc mongodb/README.md mongodb/Makefile.inc \ monit/README.md monit/Makefile.inc mysql/README.md \ mysql/Makefile.inc nginx/README.md nginx/Makefile.inc \ - nginx_plus/README.md nginx_plus/Makefile.inc nsd/README.md \ + nginx_plus/README.md nginx_plus/Makefile.inc \ + nvidia_smi/README.md nvidia_smi/Makefile.inc nsd/README.md \ nsd/Makefile.inc ntpd/README.md ntpd/Makefile.inc \ ovpn_status_log/README.md ovpn_status_log/Makefile.inc \ - phpfpm/README.md phpfpm/Makefile.inc portcheck/README.md \ - portcheck/Makefile.inc postfix/README.md postfix/Makefile.inc \ - postgres/README.md postgres/Makefile.inc powerdns/README.md \ - powerdns/Makefile.inc proxysql/README.md proxysql/Makefile.inc \ - puppet/README.md puppet/Makefile.inc rabbitmq/README.md \ - rabbitmq/Makefile.inc redis/README.md redis/Makefile.inc \ - rethinkdbs/README.md rethinkdbs/Makefile.inc \ - retroshare/README.md retroshare/Makefile.inc samba/README.md \ - samba/Makefile.inc sensors/README.md sensors/Makefile.inc \ - smartd_log/README.md smartd_log/Makefile.inc \ - spigotmc/README.md spigotmc/Makefile.inc springboot/README.md \ + openldap/README.md openldap/Makefile.inc phpfpm/README.md \ + 
phpfpm/Makefile.inc portcheck/README.md portcheck/Makefile.inc \ + postfix/README.md postfix/Makefile.inc postgres/README.md \ + postgres/Makefile.inc powerdns/README.md powerdns/Makefile.inc \ + proxysql/README.md proxysql/Makefile.inc puppet/README.md \ + puppet/Makefile.inc rabbitmq/README.md rabbitmq/Makefile.inc \ + redis/README.md redis/Makefile.inc rethinkdbs/README.md \ + rethinkdbs/Makefile.inc retroshare/README.md \ + retroshare/Makefile.inc samba/README.md samba/Makefile.inc \ + sensors/README.md sensors/Makefile.inc smartd_log/README.md \ + smartd_log/Makefile.inc spigotmc/README.md \ + spigotmc/Makefile.inc springboot/README.md \ springboot/Makefile.inc squid/README.md squid/Makefile.inc \ - tomcat/README.md tomcat/Makefile.inc traefik/README.md \ - traefik/Makefile.inc unbound/README.md unbound/Makefile.inc \ - uwsgi/README.md uwsgi/Makefile.inc varnish/README.md \ - varnish/Makefile.inc w1sensor/README.md w1sensor/Makefile.inc \ - web_log/README.md web_log/Makefile.inc + tomcat/README.md tomcat/Makefile.inc tor/README.md \ + tor/Makefile.inc traefik/README.md traefik/Makefile.inc \ + unbound/README.md unbound/Makefile.inc uwsgi/README.md \ + uwsgi/Makefile.inc varnish/README.md varnish/Makefile.inc \ + w1sensor/README.md w1sensor/Makefile.inc web_log/README.md \ + web_log/Makefile.inc dist_python_SCRIPTS = \ $(NULL) @@ -1082,6 +1111,12 @@ dist_python_SCRIPTS = \ # install these files # install these files + +# install these files + +# install these files + +# install these files dist_python_DATA = $(NULL) adaptec_raid/adaptec_raid.chart.py \ apache/apache.chart.py beanstalk/beanstalk.chart.py \ bind_rndc/bind_rndc.chart.py boinc/boinc.chart.py \ @@ -1101,17 +1136,19 @@ dist_python_DATA = $(NULL) adaptec_raid/adaptec_raid.chart.py \ mdstat/mdstat.chart.py megacli/megacli.chart.py \ memcached/memcached.chart.py mongodb/mongodb.chart.py \ monit/monit.chart.py mysql/mysql.chart.py nginx/nginx.chart.py \ - nginx_plus/nginx_plus.chart.py nsd/nsd.chart.py \ - ntpd/ntpd.chart.py ovpn_status_log/ovpn_status_log.chart.py \ - phpfpm/phpfpm.chart.py portcheck/portcheck.chart.py \ - postfix/postfix.chart.py postgres/postgres.chart.py \ - powerdns/powerdns.chart.py proxysql/proxysql.chart.py \ - puppet/puppet.chart.py rabbitmq/rabbitmq.chart.py \ - redis/redis.chart.py rethinkdbs/rethinkdbs.chart.py \ - retroshare/retroshare.chart.py samba/samba.chart.py \ - sensors/sensors.chart.py smartd_log/smartd_log.chart.py \ - spigotmc/spigotmc.chart.py springboot/springboot.chart.py \ - squid/squid.chart.py tomcat/tomcat.chart.py \ + nginx_plus/nginx_plus.chart.py nvidia_smi/nvidia_smi.chart.py \ + nsd/nsd.chart.py ntpd/ntpd.chart.py \ + ovpn_status_log/ovpn_status_log.chart.py \ + openldap/openldap.chart.py phpfpm/phpfpm.chart.py \ + portcheck/portcheck.chart.py postfix/postfix.chart.py \ + postgres/postgres.chart.py powerdns/powerdns.chart.py \ + proxysql/proxysql.chart.py puppet/puppet.chart.py \ + rabbitmq/rabbitmq.chart.py redis/redis.chart.py \ + rethinkdbs/rethinkdbs.chart.py retroshare/retroshare.chart.py \ + samba/samba.chart.py sensors/sensors.chart.py \ + smartd_log/smartd_log.chart.py spigotmc/spigotmc.chart.py \ + springboot/springboot.chart.py squid/squid.chart.py \ + tomcat/tomcat.chart.py tor/tor.chart.py \ traefik/traefik.chart.py unbound/unbound.chart.py \ uwsgi/uwsgi.chart.py varnish/varnish.chart.py \ w1sensor/w1sensor.chart.py web_log/web_log.chart.py @@ -1138,8 +1175,9 @@ dist_pythonconfig_DATA = $(top_srcdir)/installer/.keep $(NULL) \ litespeed/litespeed.conf 
logind/logind.conf mdstat/mdstat.conf \ megacli/megacli.conf memcached/memcached.conf \ mongodb/mongodb.conf monit/monit.conf mysql/mysql.conf \ - nginx/nginx.conf nginx_plus/nginx_plus.conf nsd/nsd.conf \ - ntpd/ntpd.conf ovpn_status_log/ovpn_status_log.conf \ + nginx/nginx.conf nginx_plus/nginx_plus.conf \ + nvidia_smi/nvidia_smi.conf nsd/nsd.conf ntpd/ntpd.conf \ + ovpn_status_log/ovpn_status_log.conf openldap/openldap.conf \ phpfpm/phpfpm.conf portcheck/portcheck.conf \ postfix/postfix.conf postgres/postgres.conf \ powerdns/powerdns.conf proxysql/proxysql.conf \ @@ -1148,8 +1186,8 @@ dist_pythonconfig_DATA = $(top_srcdir)/installer/.keep $(NULL) \ samba/samba.conf sensors/sensors.conf \ smartd_log/smartd_log.conf spigotmc/spigotmc.conf \ springboot/springboot.conf squid/squid.conf tomcat/tomcat.conf \ - traefik/traefik.conf unbound/unbound.conf uwsgi/uwsgi.conf \ - varnish/varnish.conf w1sensor/w1sensor.conf \ + tor/tor.conf traefik/traefik.conf unbound/unbound.conf \ + uwsgi/uwsgi.conf varnish/varnish.conf w1sensor/w1sensor.conf \ web_log/web_log.conf pythonmodulesdir = $(pythondir)/python_modules dist_pythonmodules_DATA = \ @@ -1296,7 +1334,7 @@ all: all-am .SUFFIXES: .SUFFIXES: .in -$(srcdir)/Makefile.in: @MAINTAINER_MODE_TRUE@ $(srcdir)/Makefile.am $(top_srcdir)/build/subst.inc $(srcdir)/adaptec_raid/Makefile.inc $(srcdir)/apache/Makefile.inc $(srcdir)/beanstalk/Makefile.inc $(srcdir)/bind_rndc/Makefile.inc $(srcdir)/boinc/Makefile.inc $(srcdir)/ceph/Makefile.inc $(srcdir)/chrony/Makefile.inc $(srcdir)/couchdb/Makefile.inc $(srcdir)/cpufreq/Makefile.inc $(srcdir)/cpuidle/Makefile.inc $(srcdir)/dnsdist/Makefile.inc $(srcdir)/dns_query_time/Makefile.inc $(srcdir)/dockerd/Makefile.inc $(srcdir)/dovecot/Makefile.inc $(srcdir)/elasticsearch/Makefile.inc $(srcdir)/example/Makefile.inc $(srcdir)/exim/Makefile.inc $(srcdir)/fail2ban/Makefile.inc $(srcdir)/freeradius/Makefile.inc $(srcdir)/go_expvar/Makefile.inc $(srcdir)/haproxy/Makefile.inc $(srcdir)/hddtemp/Makefile.inc $(srcdir)/httpcheck/Makefile.inc $(srcdir)/icecast/Makefile.inc $(srcdir)/ipfs/Makefile.inc $(srcdir)/isc_dhcpd/Makefile.inc $(srcdir)/linux_power_supply/Makefile.inc $(srcdir)/litespeed/Makefile.inc $(srcdir)/logind/Makefile.inc $(srcdir)/mdstat/Makefile.inc $(srcdir)/megacli/Makefile.inc $(srcdir)/memcached/Makefile.inc $(srcdir)/mongodb/Makefile.inc $(srcdir)/monit/Makefile.inc $(srcdir)/mysql/Makefile.inc $(srcdir)/nginx/Makefile.inc $(srcdir)/nginx_plus/Makefile.inc $(srcdir)/nsd/Makefile.inc $(srcdir)/ntpd/Makefile.inc $(srcdir)/ovpn_status_log/Makefile.inc $(srcdir)/phpfpm/Makefile.inc $(srcdir)/portcheck/Makefile.inc $(srcdir)/postfix/Makefile.inc $(srcdir)/postgres/Makefile.inc $(srcdir)/powerdns/Makefile.inc $(srcdir)/proxysql/Makefile.inc $(srcdir)/puppet/Makefile.inc $(srcdir)/rabbitmq/Makefile.inc $(srcdir)/redis/Makefile.inc $(srcdir)/rethinkdbs/Makefile.inc $(srcdir)/retroshare/Makefile.inc $(srcdir)/samba/Makefile.inc $(srcdir)/sensors/Makefile.inc $(srcdir)/smartd_log/Makefile.inc $(srcdir)/spigotmc/Makefile.inc $(srcdir)/springboot/Makefile.inc $(srcdir)/squid/Makefile.inc $(srcdir)/tomcat/Makefile.inc $(srcdir)/traefik/Makefile.inc $(srcdir)/unbound/Makefile.inc $(srcdir)/uwsgi/Makefile.inc $(srcdir)/varnish/Makefile.inc $(srcdir)/w1sensor/Makefile.inc $(srcdir)/web_log/Makefile.inc $(am__configure_deps) +$(srcdir)/Makefile.in: @MAINTAINER_MODE_TRUE@ $(srcdir)/Makefile.am $(top_srcdir)/build/subst.inc $(srcdir)/adaptec_raid/Makefile.inc $(srcdir)/apache/Makefile.inc $(srcdir)/beanstalk/Makefile.inc 
$(srcdir)/bind_rndc/Makefile.inc $(srcdir)/boinc/Makefile.inc $(srcdir)/ceph/Makefile.inc $(srcdir)/chrony/Makefile.inc $(srcdir)/couchdb/Makefile.inc $(srcdir)/cpufreq/Makefile.inc $(srcdir)/cpuidle/Makefile.inc $(srcdir)/dnsdist/Makefile.inc $(srcdir)/dns_query_time/Makefile.inc $(srcdir)/dockerd/Makefile.inc $(srcdir)/dovecot/Makefile.inc $(srcdir)/elasticsearch/Makefile.inc $(srcdir)/example/Makefile.inc $(srcdir)/exim/Makefile.inc $(srcdir)/fail2ban/Makefile.inc $(srcdir)/freeradius/Makefile.inc $(srcdir)/go_expvar/Makefile.inc $(srcdir)/haproxy/Makefile.inc $(srcdir)/hddtemp/Makefile.inc $(srcdir)/httpcheck/Makefile.inc $(srcdir)/icecast/Makefile.inc $(srcdir)/ipfs/Makefile.inc $(srcdir)/isc_dhcpd/Makefile.inc $(srcdir)/linux_power_supply/Makefile.inc $(srcdir)/litespeed/Makefile.inc $(srcdir)/logind/Makefile.inc $(srcdir)/mdstat/Makefile.inc $(srcdir)/megacli/Makefile.inc $(srcdir)/memcached/Makefile.inc $(srcdir)/mongodb/Makefile.inc $(srcdir)/monit/Makefile.inc $(srcdir)/mysql/Makefile.inc $(srcdir)/nginx/Makefile.inc $(srcdir)/nginx_plus/Makefile.inc $(srcdir)/nvidia_smi/Makefile.inc $(srcdir)/nsd/Makefile.inc $(srcdir)/ntpd/Makefile.inc $(srcdir)/ovpn_status_log/Makefile.inc $(srcdir)/openldap/Makefile.inc $(srcdir)/phpfpm/Makefile.inc $(srcdir)/portcheck/Makefile.inc $(srcdir)/postfix/Makefile.inc $(srcdir)/postgres/Makefile.inc $(srcdir)/powerdns/Makefile.inc $(srcdir)/proxysql/Makefile.inc $(srcdir)/puppet/Makefile.inc $(srcdir)/rabbitmq/Makefile.inc $(srcdir)/redis/Makefile.inc $(srcdir)/rethinkdbs/Makefile.inc $(srcdir)/retroshare/Makefile.inc $(srcdir)/samba/Makefile.inc $(srcdir)/sensors/Makefile.inc $(srcdir)/smartd_log/Makefile.inc $(srcdir)/spigotmc/Makefile.inc $(srcdir)/springboot/Makefile.inc $(srcdir)/squid/Makefile.inc $(srcdir)/tomcat/Makefile.inc $(srcdir)/tor/Makefile.inc $(srcdir)/traefik/Makefile.inc $(srcdir)/unbound/Makefile.inc $(srcdir)/uwsgi/Makefile.inc $(srcdir)/varnish/Makefile.inc $(srcdir)/w1sensor/Makefile.inc $(srcdir)/web_log/Makefile.inc $(am__configure_deps) @for dep in $?; do \ case '$(am__configure_deps)' in \ *$$dep*) \ @@ -1317,7 +1355,7 @@ Makefile: $(srcdir)/Makefile.in $(top_builddir)/config.status echo ' cd $(top_builddir) && $(SHELL) ./config.status $(subdir)/$@ $(am__depfiles_maybe)'; \ cd $(top_builddir) && $(SHELL) ./config.status $(subdir)/$@ $(am__depfiles_maybe);; \ esac; -$(top_srcdir)/build/subst.inc $(srcdir)/adaptec_raid/Makefile.inc $(srcdir)/apache/Makefile.inc $(srcdir)/beanstalk/Makefile.inc $(srcdir)/bind_rndc/Makefile.inc $(srcdir)/boinc/Makefile.inc $(srcdir)/ceph/Makefile.inc $(srcdir)/chrony/Makefile.inc $(srcdir)/couchdb/Makefile.inc $(srcdir)/cpufreq/Makefile.inc $(srcdir)/cpuidle/Makefile.inc $(srcdir)/dnsdist/Makefile.inc $(srcdir)/dns_query_time/Makefile.inc $(srcdir)/dockerd/Makefile.inc $(srcdir)/dovecot/Makefile.inc $(srcdir)/elasticsearch/Makefile.inc $(srcdir)/example/Makefile.inc $(srcdir)/exim/Makefile.inc $(srcdir)/fail2ban/Makefile.inc $(srcdir)/freeradius/Makefile.inc $(srcdir)/go_expvar/Makefile.inc $(srcdir)/haproxy/Makefile.inc $(srcdir)/hddtemp/Makefile.inc $(srcdir)/httpcheck/Makefile.inc $(srcdir)/icecast/Makefile.inc $(srcdir)/ipfs/Makefile.inc $(srcdir)/isc_dhcpd/Makefile.inc $(srcdir)/linux_power_supply/Makefile.inc $(srcdir)/litespeed/Makefile.inc $(srcdir)/logind/Makefile.inc $(srcdir)/mdstat/Makefile.inc $(srcdir)/megacli/Makefile.inc $(srcdir)/memcached/Makefile.inc $(srcdir)/mongodb/Makefile.inc $(srcdir)/monit/Makefile.inc $(srcdir)/mysql/Makefile.inc $(srcdir)/nginx/Makefile.inc 
$(srcdir)/nginx_plus/Makefile.inc $(srcdir)/nsd/Makefile.inc $(srcdir)/ntpd/Makefile.inc $(srcdir)/ovpn_status_log/Makefile.inc $(srcdir)/phpfpm/Makefile.inc $(srcdir)/portcheck/Makefile.inc $(srcdir)/postfix/Makefile.inc $(srcdir)/postgres/Makefile.inc $(srcdir)/powerdns/Makefile.inc $(srcdir)/proxysql/Makefile.inc $(srcdir)/puppet/Makefile.inc $(srcdir)/rabbitmq/Makefile.inc $(srcdir)/redis/Makefile.inc $(srcdir)/rethinkdbs/Makefile.inc $(srcdir)/retroshare/Makefile.inc $(srcdir)/samba/Makefile.inc $(srcdir)/sensors/Makefile.inc $(srcdir)/smartd_log/Makefile.inc $(srcdir)/spigotmc/Makefile.inc $(srcdir)/springboot/Makefile.inc $(srcdir)/squid/Makefile.inc $(srcdir)/tomcat/Makefile.inc $(srcdir)/traefik/Makefile.inc $(srcdir)/unbound/Makefile.inc $(srcdir)/uwsgi/Makefile.inc $(srcdir)/varnish/Makefile.inc $(srcdir)/w1sensor/Makefile.inc $(srcdir)/web_log/Makefile.inc: +$(top_srcdir)/build/subst.inc $(srcdir)/adaptec_raid/Makefile.inc $(srcdir)/apache/Makefile.inc $(srcdir)/beanstalk/Makefile.inc $(srcdir)/bind_rndc/Makefile.inc $(srcdir)/boinc/Makefile.inc $(srcdir)/ceph/Makefile.inc $(srcdir)/chrony/Makefile.inc $(srcdir)/couchdb/Makefile.inc $(srcdir)/cpufreq/Makefile.inc $(srcdir)/cpuidle/Makefile.inc $(srcdir)/dnsdist/Makefile.inc $(srcdir)/dns_query_time/Makefile.inc $(srcdir)/dockerd/Makefile.inc $(srcdir)/dovecot/Makefile.inc $(srcdir)/elasticsearch/Makefile.inc $(srcdir)/example/Makefile.inc $(srcdir)/exim/Makefile.inc $(srcdir)/fail2ban/Makefile.inc $(srcdir)/freeradius/Makefile.inc $(srcdir)/go_expvar/Makefile.inc $(srcdir)/haproxy/Makefile.inc $(srcdir)/hddtemp/Makefile.inc $(srcdir)/httpcheck/Makefile.inc $(srcdir)/icecast/Makefile.inc $(srcdir)/ipfs/Makefile.inc $(srcdir)/isc_dhcpd/Makefile.inc $(srcdir)/linux_power_supply/Makefile.inc $(srcdir)/litespeed/Makefile.inc $(srcdir)/logind/Makefile.inc $(srcdir)/mdstat/Makefile.inc $(srcdir)/megacli/Makefile.inc $(srcdir)/memcached/Makefile.inc $(srcdir)/mongodb/Makefile.inc $(srcdir)/monit/Makefile.inc $(srcdir)/mysql/Makefile.inc $(srcdir)/nginx/Makefile.inc $(srcdir)/nginx_plus/Makefile.inc $(srcdir)/nvidia_smi/Makefile.inc $(srcdir)/nsd/Makefile.inc $(srcdir)/ntpd/Makefile.inc $(srcdir)/ovpn_status_log/Makefile.inc $(srcdir)/openldap/Makefile.inc $(srcdir)/phpfpm/Makefile.inc $(srcdir)/portcheck/Makefile.inc $(srcdir)/postfix/Makefile.inc $(srcdir)/postgres/Makefile.inc $(srcdir)/powerdns/Makefile.inc $(srcdir)/proxysql/Makefile.inc $(srcdir)/puppet/Makefile.inc $(srcdir)/rabbitmq/Makefile.inc $(srcdir)/redis/Makefile.inc $(srcdir)/rethinkdbs/Makefile.inc $(srcdir)/retroshare/Makefile.inc $(srcdir)/samba/Makefile.inc $(srcdir)/sensors/Makefile.inc $(srcdir)/smartd_log/Makefile.inc $(srcdir)/spigotmc/Makefile.inc $(srcdir)/springboot/Makefile.inc $(srcdir)/squid/Makefile.inc $(srcdir)/tomcat/Makefile.inc $(srcdir)/tor/Makefile.inc $(srcdir)/traefik/Makefile.inc $(srcdir)/unbound/Makefile.inc $(srcdir)/uwsgi/Makefile.inc $(srcdir)/varnish/Makefile.inc $(srcdir)/w1sensor/Makefile.inc $(srcdir)/web_log/Makefile.inc: $(top_builddir)/config.status: $(top_srcdir)/configure $(CONFIG_STATUS_DEPENDENCIES) cd $(top_builddir) && $(MAKE) $(AM_MAKEFLAGS) am--refresh diff --git a/collectors/python.d.plugin/README.md b/collectors/python.d.plugin/README.md index df24cd18..673fc2c9 100644 --- a/collectors/python.d.plugin/README.md +++ b/collectors/python.d.plugin/README.md @@ -9,6 +9,20 @@ 5. Allows each **module** to have one or more data collection **jobs** 6. 
Each **job** is collecting one or more metrics from a single data source +## Pull Request Checklist for Python Plugins + +This is a generic checklist for submitting a new Python plugin for Netdata. It is by no means comprehensive. + +At minimum, to be buildable and testable, the PR needs to include: + +* The module itself, following proper naming conventions: `python.d/<module_dir>/<module_name>.chart.py` +* A README.md file for the plugin under `python.d/<module_dir>`. +* The configuration file for the module: `conf.d/python.d/<module_name>.conf`. Python config files are in YAML format, and should include comments describing what options are present. These instructions should also appear in the configuration section of the README.md. +* A basic configuration for the plugin in the appropriate global config file: `conf.d/python.d.conf`, which is also in YAML format. Either add a line that reads `# <module_name>: yes` if the module is to be enabled by default, or one that reads `<module_name>: no` if it is to be disabled by default. +* A line for the plugin in `python.d/Makefile.am` under `dist_python_DATA`. +* A line for the plugin configuration file in `conf.d/Makefile.am`, under `dist_pythonconfig_DATA`. +* Optionally, chart information in `web/dashboard_info.js`. This generally involves specifying a name and icon for the section, and may include descriptions for the section or individual charts. + ## Disclaimer @@ -60,7 +74,7 @@ Writing new python module is simple. You just need to remember to include 5 major things: - **_get_data** method - all code needs to be compatible with Python 2 (**≥ 2.7**) *and* 3 (**≥ 3.1**) -If you plan to submit the module in a PR, make sure and go through the [PR checklist for new modules](https://github.com/netdata/netdata/wiki/New-Module-PR-Checklist) beforehand to make sure you have updated all the files you need to. +If you plan to submit the module in a PR, go through the [PR checklist for new modules](#pull-request-checklist-for-python-plugins) beforehand to make sure you have updated all the files you need to. ### Global variables `ORDER` and `CHART` @@ -195,4 +209,4 @@ Sockets are accessed in non-blocking mode with 15 second timeout. After every execution of `_get_raw_data` socket is closed, to prevent this module needs to set `_keep_alive` variable to `True` and implement custom `_check_raw_data` method. -`_check_raw_data` should take raw data and return `True` if all data is received otherwise it should return `False`. Also it should do it in fast and efficient way.
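To illustrate the `_keep_alive` / `_check_raw_data` contract described above, here is a minimal sketch of a socket-based module. The service it talks to is hypothetical: the port, request string and response terminator are placeholders, not taken from any existing collector.

```python
# Minimal sketch only: the port, request and terminator below are
# hypothetical placeholders, not an existing netdata collector.
from bases.FrameworkServices.SocketService import SocketService

ORDER = ['example_chart']

CHARTS = {
    'example_chart': {
        'options': [None, 'Example', 'units', 'example', 'example.chart', 'line'],
        'lines': [
            ['example_metric', 'size', 'absolute'],
        ]
    },
}


class Service(SocketService):
    def __init__(self, configuration=None, name=None):
        SocketService.__init__(self, configuration=configuration, name=name)
        self.order = ORDER
        self.definitions = CHARTS
        self.host = configuration.get('host', '127.0.0.1')
        self.port = configuration.get('port', 12345)  # placeholder port
        self.request = 'stats\r\n'                    # placeholder request
        self._keep_alive = True  # reuse the socket between collections

    def _check_raw_data(self, data):
        # Cheap completeness check: assume the server terminates a full
        # response with a blank line.
        return data.endswith('\r\n\r\n')

    def _get_data(self):
        raw = self._get_raw_data()
        if raw is None:
            return None
        # Real modules parse `raw` here; this sketch only reports its size.
        return {'example_metric': len(raw)}
```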
\ No newline at end of file +`_check_raw_data` should take raw data and return `True` if all data is received otherwise it should return `False`. Also it should do it in fast and efficient way. diff --git a/collectors/python.d.plugin/beanstalk/beanstalk.conf b/collectors/python.d.plugin/beanstalk/beanstalk.conf index 94080187..3b11d919 100644 --- a/collectors/python.d.plugin/beanstalk/beanstalk.conf +++ b/collectors/python.d.plugin/beanstalk/beanstalk.conf @@ -72,7 +72,7 @@ # autodetection_retry: 0 # the JOB's re-check interval in seconds # chart_cleanup: 10 # the JOB's chart cleanup interval in iterations # -# Additionally to the above, apache also supports the following: +# Additionally to the above, beanstalk also supports the following: # # host: 'host' # Server ip address or hostname. Default: 127.0.0.1 # port: port # Beanstalkd port. Default: diff --git a/collectors/python.d.plugin/elasticsearch/README.md b/collectors/python.d.plugin/elasticsearch/README.md index 75e17015..7ce6c0b7 100644 --- a/collectors/python.d.plugin/elasticsearch/README.md +++ b/collectors/python.d.plugin/elasticsearch/README.md @@ -49,8 +49,8 @@ Sample: ```yaml local: - host : 'ipaddress' # Server ip address or hostname - port : 'password' # Port on which elasticsearch listed + host : 'ipaddress' # Elasticsearch server ip address or hostname + port : 'port' # Port on which elasticsearch listens cluster_health : True/False # Calls to cluster health elasticsearch API. Enabled by default. cluster_stats : True/False # Calls to cluster stats elasticsearch API. Enabled by default. ``` diff --git a/collectors/python.d.plugin/go_expvar/README.md b/collectors/python.d.plugin/go_expvar/README.md index 6309c195..e3356e1f 100644 --- a/collectors/python.d.plugin/go_expvar/README.md +++ b/collectors/python.d.plugin/go_expvar/README.md @@ -192,7 +192,7 @@ See [this issue](https://github.com/netdata/netdata/pull/1902#issuecomment-28449 Please see these two links to the official netdata documentation for more information about the values: - [External plugins - charts](../../plugins.d/#chart) -- [Chart variables](https://github.com/netdata/netdata/wiki/How-to-write-new-module#global-variables-order-and-chart) +- [Chart variables](../#global-variables-order-and-chart) **Line definitions** diff --git a/collectors/python.d.plugin/go_expvar/go_expvar.conf b/collectors/python.d.plugin/go_expvar/go_expvar.conf index ba8922d2..af89158a 100644 --- a/collectors/python.d.plugin/go_expvar/go_expvar.conf +++ b/collectors/python.d.plugin/go_expvar/go_expvar.conf @@ -76,7 +76,7 @@ # # Please visit the module wiki page for more information on how to use the extra_charts variable: # -# https://github.com/netdata/netdata/wiki/Monitoring-Go-Applications#monitoring-custom-vars-with-go_expvar +# https://github.com/netdata/netdata/tree/master/collectors/python.d.plugin/go_expvar # # Configuration example # --------------------- diff --git a/collectors/python.d.plugin/nvidia_smi/Makefile.inc b/collectors/python.d.plugin/nvidia_smi/Makefile.inc new file mode 100644 index 00000000..c23bd251 --- /dev/null +++ b/collectors/python.d.plugin/nvidia_smi/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_python_DATA += nvidia_smi/nvidia_smi.chart.py +dist_pythonconfig_DATA += nvidia_smi/nvidia_smi.conf + +# do not install these files, but include them in the 
distribution +dist_noinst_DATA += nvidia_smi/README.md nvidia_smi/Makefile.inc diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md new file mode 100644 index 00000000..06acfc29 --- /dev/null +++ b/collectors/python.d.plugin/nvidia_smi/README.md @@ -0,0 +1,39 @@ +# nvidia_smi + +This module monitors GPUs via the `nvidia-smi` CLI tool. + +**Requirements and Notes:** + + * You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support it. Support is mostly limited to the newer high-end models used for AI/ML and crypto workloads, or the professional range; read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface). + + * You must enable this plugin, as it is disabled by default due to minor performance issues. + + * On some systems the `nvidia-smi` tool unloads when the GPU is idle, which adds latency the next time it is queried. If you are running your GPUs under constant workload, this is unlikely to be an issue. + + * Currently the `nvidia-smi` tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve the unload latency issue. See the discussion here: https://github.com/netdata/netdata/pull/4357 + + * Contributions are welcome. + + * Make sure the `netdata` user can execute `/usr/bin/nvidia-smi`, or wherever your binary is installed. + + * `poll_seconds` is an integer setting how often, in seconds, the tool is polled. + +It produces: + +1. Per GPU + * GPU utilization + * memory allocation + * memory utilization + * fan speed + * power usage + * temperature + * clock speed + * PCI bandwidth + +### configuration + +Sample: + +```yaml +poll_seconds: 1 +```
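Under the hood the module shells out to `nvidia-smi -x -q` and reads the per-GPU values from the XML report it prints. A rough standalone sketch of that approach (outside the netdata framework, shown only to clarify where the numbers come from):

```python
# Standalone sketch: run `nvidia-smi -x -q` once and print per-GPU
# utilization from the XML report, mirroring what the module parses.
import subprocess
import xml.etree.ElementTree as et

raw = subprocess.check_output(['nvidia-smi', '-x', '-q'])
root = et.fromstring(raw)

for idx, gpu in enumerate(root.findall('gpu')):
    name = gpu.find('product_name').text
    util = gpu.find('utilization').find('gpu_util').text  # e.g. "42 %"
    print('gpu{0} ({1}): {2}'.format(idx, name, util))
```

The module itself keeps a background poller running `nvidia-smi -x -q -l <poll_seconds>` instead of spawning the tool on every collection, so the start-up cost is paid only once.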
\ No newline at end of file diff --git a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py new file mode 100644 index 00000000..c3fff621 --- /dev/null +++ b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py @@ -0,0 +1,361 @@ +# -*- coding: utf-8 -*- +# Description: nvidia-smi netdata python.d module +# Original Author: Steven Noonan (tycho) +# Author: Ilya Mashchenko (l2isbad) + +import subprocess +import threading +import xml.etree.ElementTree as et + +from bases.collection import find_binary +from bases.FrameworkServices.SimpleService import SimpleService + +disabled_by_default = True + + +NVIDIA_SMI = 'nvidia-smi' + +EMPTY_ROW = '' +EMPTY_ROW_LIMIT = 500 +POLLER_BREAK_ROW = '</nvidia_smi_log>' + +PCI_BANDWIDTH = 'pci_bandwidth' +FAN_SPEED = 'fan_speed' +GPU_UTIL = 'gpu_utilization' +MEM_UTIL = 'mem_utilization' +ENCODER_UTIL = 'encoder_utilization' +MEM_ALLOCATED = 'mem_allocated' +TEMPERATURE = 'temperature' +CLOCKS = 'clocks' +POWER = 'power' + +ORDER = [ + PCI_BANDWIDTH, + FAN_SPEED, + GPU_UTIL, + MEM_UTIL, + ENCODER_UTIL, + MEM_ALLOCATED, + TEMPERATURE, + CLOCKS, + POWER, +] + + +def gpu_charts(gpu): + fam = gpu.full_name() + + charts = { + PCI_BANDWIDTH: { + 'options': [None, 'PCI Express Bandwidth Utilization', 'KB/s', fam, 'nvidia_smi.pci_bandwidth', 'area'], + 'lines': [ + ['rx_util', 'rx', 'absolute', 1, 1], + ['tx_util', 'tx', 'absolute', 1, -1], + ] + }, + FAN_SPEED: { + 'options': [None, 'Fan Speed', '%', fam, 'nvidia_smi.fan_speed', 'line'], + 'lines': [ + ['fan_speed', 'speed'], + ] + }, + GPU_UTIL: { + 'options': [None, 'GPU Utilization', '%', fam, 'nvidia_smi.gpu_utilization', 'line'], + 'lines': [ + ['gpu_util', 'utilization'], + ] + }, + MEM_UTIL: { + 'options': [None, 'Memory Bandwidth Utilization', '%', fam, 'nvidia_smi.mem_utilization', 'line'], + 'lines': [ + ['memory_util', 'utilization'], + ] + }, + ENCODER_UTIL: { + 'options': [None, 'Encoder/Decoder Utilization', '%', fam, 'nvidia_smi.encoder_utilization', 'line'], + 'lines': [ + ['encoder_util', 'encoder'], + ['decoder_util', 'decoder'], + ] + }, + MEM_ALLOCATED: { + 'options': [None, 'Memory Allocated', 'MB', fam, 'nvidia_smi.memory_allocated', 'line'], + 'lines': [ + ['fb_memory_usage', 'used'], + ] + }, + TEMPERATURE: { + 'options': [None, 'Temperature', 'celsius', fam, 'nvidia_smi.temperature', 'line'], + 'lines': [ + ['gpu_temp', 'temp'], + ] + }, + CLOCKS: { + 'options': [None, 'Clock Frequencies', 'MHz', fam, 'nvidia_smi.clocks', 'line'], + 'lines': [ + ['graphics_clock', 'graphics'], + ['video_clock', 'video'], + ['sm_clock', 'sm'], + ['mem_clock', 'mem'], + ] + }, + POWER: { + 'options': [None, 'Power Utilization', 'Watts', fam, 'nvidia_smi.power', 'line'], + 'lines': [ + ['power_draw', 'power', 1, 100], + ] + }, + } + + idx = gpu.num + + order = ['gpu{0}_{1}'.format(idx, v) for v in ORDER] + charts = dict(('gpu{0}_{1}'.format(idx, k), v) for k, v in charts.items()) + + for chart in charts.values(): + for line in chart['lines']: + line[0] = 'gpu{0}_{1}'.format(idx, line[0]) + + return order, charts + + +class NvidiaSMI: + def __init__(self): + self.command = find_binary(NVIDIA_SMI) + self.active_proc = None + + def run_once(self): + proc = subprocess.Popen([self.command, '-x', '-q'], stdout=subprocess.PIPE) + stdout, _ = proc.communicate() + return stdout + + def run_loop(self, interval): + if self.active_proc: + self.kill() + proc = subprocess.Popen([self.command, '-x', '-q', '-l', str(interval)], stdout=subprocess.PIPE) + 
self.active_proc = proc + return proc.stdout + + def kill(self): + if self.active_proc: + self.active_proc.kill() + self.active_proc = None + + +class NvidiaSMIPoller(threading.Thread): + def __init__(self, poll_interval): + threading.Thread.__init__(self) + self.daemon = True + + self.smi = NvidiaSMI() + self.interval = poll_interval + + self.lock = threading.RLock() + self.last_data = str() + self.exit = False + self.empty_rows = 0 + self.rows = list() + + def has_smi(self): + return bool(self.smi.command) + + def run_once(self): + return self.smi.run_once() + + def run(self): + out = self.smi.run_loop(self.interval) + + for row in out: + if self.exit or self.empty_rows > EMPTY_ROW_LIMIT: + break + self.process_row(row) + self.smi.kill() + + def process_row(self, row): + row = row.decode() + self.empty_rows += (row == EMPTY_ROW) + self.rows.append(row) + + if POLLER_BREAK_ROW in row: + self.lock.acquire() + self.last_data = '\n'.join(self.rows) + self.lock.release() + + self.rows = list() + self.empty_rows = 0 + + def is_started(self): + return self.ident is not None + + def shutdown(self): + self.exit = True + + def data(self): + self.lock.acquire() + data = self.last_data + self.lock.release() + return data + + +def handle_attr_error(method): + def on_call(*args, **kwargs): + try: + return method(*args, **kwargs) + except AttributeError: + return None + return on_call + + +class GPU: + def __init__(self, num, root): + self.num = num + self.root = root + + def id(self): + return self.root.get('id') + + def name(self): + return self.root.find('product_name').text + + def full_name(self): + return 'gpu{0} {1}'.format(self.num, self.name()) + + @handle_attr_error + def rx_util(self): + return self.root.find('pci').find('rx_util').text.split()[0] + + @handle_attr_error + def tx_util(self): + return self.root.find('pci').find('tx_util').text.split()[0] + + @handle_attr_error + def fan_speed(self): + return self.root.find('fan_speed').text.split()[0] + + @handle_attr_error + def gpu_util(self): + return self.root.find('utilization').find('gpu_util').text.split()[0] + + @handle_attr_error + def memory_util(self): + return self.root.find('utilization').find('memory_util').text.split()[0] + + @handle_attr_error + def encoder_util(self): + return self.root.find('utilization').find('encoder_util').text.split()[0] + + @handle_attr_error + def decoder_util(self): + return self.root.find('utilization').find('decoder_util').text.split()[0] + + @handle_attr_error + def fb_memory_usage(self): + return self.root.find('fb_memory_usage').find('used').text.split()[0] + + @handle_attr_error + def temperature(self): + return self.root.find('temperature').find('gpu_temp').text.split()[0] + + @handle_attr_error + def graphics_clock(self): + return self.root.find('clocks').find('graphics_clock').text.split()[0] + + @handle_attr_error + def video_clock(self): + return self.root.find('clocks').find('video_clock').text.split()[0] + + @handle_attr_error + def sm_clock(self): + return self.root.find('clocks').find('sm_clock').text.split()[0] + + @handle_attr_error + def mem_clock(self): + return self.root.find('clocks').find('mem_clock').text.split()[0] + + @handle_attr_error + def power_draw(self): + return float(self.root.find('power_readings').find('power_draw').text.split()[0]) * 100 + + def data(self): + data = { + 'rx_util': self.rx_util(), + 'tx_util': self.tx_util(), + 'fan_speed': self.fan_speed(), + 'gpu_util': self.gpu_util(), + 'memory_util': self.memory_util(), + 'encoder_util': self.encoder_util(), + 
'decoder_util': self.decoder_util(), + 'fb_memory_usage': self.fb_memory_usage(), + 'gpu_temp': self.temperature(), + 'graphics_clock': self.graphics_clock(), + 'video_clock': self.video_clock(), + 'sm_clock': self.sm_clock(), + 'mem_clock': self.mem_clock(), + 'power_draw': self.power_draw(), + } + + return dict(('gpu{0}_{1}'.format(self.num, k), v) for k, v in data.items() if v is not None) + + +class Service(SimpleService): + def __init__(self, configuration=None, name=None): + super(Service, self).__init__(configuration=configuration, name=name) + self.order = list() + self.definitions = dict() + + poll = int(configuration.get('poll_seconds', 1)) + self.poller = NvidiaSMIPoller(poll) + + def get_data(self): + if not self.poller.is_alive(): + self.debug('poller is off') + return None + + last_data = self.poller.data() + + parsed = self.parse_xml(last_data) + if parsed is None: + return None + + data = dict() + for idx, root in enumerate(parsed.findall('gpu')): + data.update(GPU(idx, root).data()) + + return data or None + + def check(self): + if not self.poller.has_smi(): + self.error("couldn't find '{0}' binary".format(NVIDIA_SMI)) + return False + + raw_data = self.poller.run_once() + if not raw_data: + self.error("failed to invoke '{0}' binary".format(NVIDIA_SMI)) + return False + + parsed = self.parse_xml(raw_data) + if parsed is None: + return False + + gpus = parsed.findall('gpu') + if not gpus: + return False + + self.create_charts(gpus) + self.poller.start() + + return True + + def parse_xml(self, data): + try: + return et.fromstring(data) + except et.ParseError as error: + self.error(error) + + return None + + def create_charts(self, gpus): + for idx, root in enumerate(gpus): + order, charts = gpu_charts(GPU(idx, root)) + self.order.extend(order) + self.definitions.update(charts) diff --git a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf new file mode 100644 index 00000000..e1bcf3fa --- /dev/null +++ b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf @@ -0,0 +1,68 @@ +# netdata python.d.plugin configuration for nvidia_smi +# +# This file is in YaML format. Generally the format is: +# +# name: value +# +# There are 2 sections: +# - global variables +# - one or more JOBS +# +# JOBS allow you to collect values from multiple sources. +# Each source will have its own set of charts. +# +# JOB parameters have to be indented (using spaces only, example below). + +# ---------------------------------------------------------------------- +# Global Variables +# These variables set the defaults for all JOBs, however each JOB +# may define its own, overriding the defaults. + +# update_every sets the default data collection frequency. +# If unset, the python.d.plugin default is used. +# update_every: 1 + +# priority controls the order of charts at the netdata dashboard. +# Lower numbers move the charts towards the top of the page. +# If unset, the default for python.d.plugin is used. +# priority: 60000 + +# retries sets the number of retries to be made in case of failures. +# If unset, the default for python.d.plugin is used. +# Attempts to restore the service are made once every update_every +# and only if the module has collected values in the past. +# retries: 60 + +# autodetection_retry sets the job re-check interval in seconds. +# The job is not deleted if check fails. +# Attempts to start the job are made once every autodetection_retry. +# This feature is disabled by default. 
+# autodetection_retry: 0 + +# ---------------------------------------------------------------------- +# JOBS (data collection sources) +# +# The default JOBS share the same *name*. JOBS with the same name +# are mutually exclusive. Only one of them will be allowed running at +# any time. This allows autodetection to try several alternatives and +# pick the one that works. +# +# Any number of jobs is supported. +# +# All python.d.plugin JOBS (for all its modules) support a set of +# predefined parameters. These are: +# +# job_name: +# name: myname # the JOB's name as it will appear at the +# # dashboard (by default is the job_name) +# # JOBs sharing a name are mutually exclusive +# update_every: 1 # the JOB's data collection frequency +# priority: 60000 # the JOB's order on the dashboard +# retries: 60 # the JOB's number of restoration attempts +# autodetection_retry: 0 # the JOB's re-check interval in seconds +# +# Additionally to the above, example also supports the following: +# +# poll_seconds: SECONDS # default is 1. Sets the frequency of seconds the nvidia-smi tool is polled. +# +# ---------------------------------------------------------------------- diff --git a/collectors/python.d.plugin/openldap/Makefile.inc b/collectors/python.d.plugin/openldap/Makefile.inc new file mode 100644 index 00000000..dc947e21 --- /dev/null +++ b/collectors/python.d.plugin/openldap/Makefile.inc @@ -0,0 +1,13 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_python_DATA += openldap/openldap.chart.py +dist_pythonconfig_DATA += openldap/openldap.conf + +# do not install these files, but include them in the distribution +dist_noinst_DATA += openldap/README.md openldap/Makefile.inc + diff --git a/collectors/python.d.plugin/openldap/README.md b/collectors/python.d.plugin/openldap/README.md new file mode 100644 index 00000000..938535bc --- /dev/null +++ b/collectors/python.d.plugin/openldap/README.md @@ -0,0 +1,57 @@ +# openldap + +This module provides statistics information from openldap (slapd) server. +Statistics are taken from LDAP monitoring interface. Manual page, slapd-monitor(5) is available. + +**Requirement:** +* Follow instructions from https://www.openldap.org/doc/admin24/monitoringslapd.html to activate monitoring interface. +* Install python ldap module `pip install ldap` or `yum install python-ldap` +* Modify openldap.conf with your credentials + +### Module gives information with following charts: + +1. **connections** + * total connections number + +2. **Bytes** + * sent + +3. **operations** + * completed + * initiated + +4. **referrals** + * sent + +5. **entries** + * sent + +6. **ldap operations** + * bind + * search + * unbind + * add + * delete + * modify + * compare + +7. 
**waiters** + * read + * write + + + +### configuration + +Sample: + +```yaml +openldap: + name : 'local' + username : "cn=monitor,dc=superb,dc=eu" + password : "testpass" + server : 'localhost' + port : 389 +``` + +--- diff --git a/collectors/python.d.plugin/openldap/openldap.chart.py b/collectors/python.d.plugin/openldap/openldap.chart.py new file mode 100644 index 00000000..6342d386 --- /dev/null +++ b/collectors/python.d.plugin/openldap/openldap.chart.py @@ -0,0 +1,204 @@ +# -*- coding: utf-8 -*- +# Description: openldap netdata python.d module +# Author: Manolis Kartsonakis (ekartsonakis) +# SPDX-License-Identifier: GPL-3.0+ + +try: + import ldap + HAS_LDAP = True +except ImportError: + HAS_LDAP = False + +from bases.FrameworkServices.SimpleService import SimpleService + +# default module values (can be overridden per job in `config`) +priority = 60000 + +DEFAULT_SERVER = 'localhost' +DEFAULT_PORT = '389' +DEFAULT_TIMEOUT = 1 + +ORDER = [ + 'total_connections', + 'bytes_sent', + 'operations', + 'referrals_sent', + 'entries_sent', + 'ldap_operations', + 'waiters' +] + +CHARTS = { + 'total_connections': { + 'options': [None, 'Total Connections', 'connections/s', 'ldap', 'openldap.total_connections', 'line'], + 'lines': [ + ['total_connections', 'connections', 'incremental'] + ] + }, + 'bytes_sent': { + 'options': [None, 'Traffic', 'KB/s', 'ldap', 'openldap.traffic_stats', 'line'], + 'lines': [ + ['bytes_sent', 'sent', 'incremental', 1, 1024] + ] + }, + 'operations': { + 'options': [None, 'Operations Status', 'ops/s', 'ldap', 'openldap.operations_status', 'line'], + 'lines': [ + ['completed_operations', 'completed', 'incremental'], + ['initiated_operations', 'initiated', 'incremental'] + ] + }, + 'referrals_sent': { + 'options': [None, 'Referrals', 'referals/s', 'ldap', 'openldap.referrals', 'line'], + 'lines': [ + ['referrals_sent', 'sent', 'incremental'] + ] + }, + 'entries_sent': { + 'options': [None, 'Entries', 'entries/s', 'ldap', 'openldap.entries', 'line'], + 'lines': [ + ['entries_sent', 'sent', 'incremental'] + ] + }, + 'ldap_operations': { + 'options': [None, 'Operations', 'ops/s', 'ldap', 'openldap.ldap_operations', 'line'], + 'lines': [ + ['bind_operations', 'bind', 'incremental'], + ['search_operations', 'search', 'incremental'], + ['unbind_operations', 'unbind', 'incremental'], + ['add_operations', 'add', 'incremental'], + ['delete_operations', 'delete', 'incremental'], + ['modify_operations', 'modify', 'incremental'], + ['compare_operations', 'compare', 'incremental'] + ] + }, + 'waiters': { + 'options': [None, 'Waiters', 'waiters/s', 'ldap', 'openldap.waiters', 'line'], + 'lines': [ + ['write_waiters', 'write', 'incremental'], + ['read_waiters', 'read', 'incremental'] + ] + }, +} + +# Stuff to gather - make tuples of DN dn and attrib to get +SEARCH_LIST = { + 'total_connections': ( + 'cn=Total,cn=Connections,cn=Monitor', 'monitorCounter', + ), + 'bytes_sent': ( + 'cn=Bytes,cn=Statistics,cn=Monitor', 'monitorCounter', + ), + 'completed_operations': ( + 'cn=Operations,cn=Monitor', 'monitorOpCompleted', + ), + 'initiated_operations': ( + 'cn=Operations,cn=Monitor', 'monitorOpInitiated', + ), + 'referrals_sent': ( + 'cn=Referrals,cn=Statistics,cn=Monitor', 'monitorCounter', + ), + 'entries_sent': ( + 'cn=Entries,cn=Statistics,cn=Monitor', 'monitorCounter', + ), + 'bind_operations': ( + 'cn=Bind,cn=Operations,cn=Monitor', 'monitorOpCompleted', + ), + 'unbind_operations': ( + 'cn=Unbind,cn=Operations,cn=Monitor', 'monitorOpCompleted', + ), + 'add_operations': ( + 
'cn=Add,cn=Operations,cn=Monitor', 'monitorOpInitiated', + ), + 'delete_operations': ( + 'cn=Delete,cn=Operations,cn=Monitor', 'monitorOpCompleted', + ), + 'modify_operations': ( + 'cn=Modify,cn=Operations,cn=Monitor', 'monitorOpCompleted', + ), + 'compare_operations': ( + 'cn=Compare,cn=Operations,cn=Monitor', 'monitorOpCompleted', + ), + 'search_operations': ( + 'cn=Search,cn=Operations,cn=Monitor', 'monitorOpCompleted', + ), + 'write_waiters': ( + 'cn=Write,cn=Waiters,cn=Monitor', 'monitorCounter', + ), + 'read_waiters': ( + 'cn=Read,cn=Waiters,cn=Monitor', 'monitorCounter', + ), +} + + +class Service(SimpleService): + def __init__(self, configuration=None, name=None): + SimpleService.__init__(self, configuration=configuration, name=name) + self.order = ORDER + self.definitions = CHARTS + + self.server = configuration.get('server', DEFAULT_SERVER) + self.port = configuration.get('port', DEFAULT_PORT) + self.username = configuration.get('username') + self.password = configuration.get('password') + self.timeout = configuration.get('timeout', DEFAULT_TIMEOUT) + + self.alive = False + self.conn = None + + def disconnect(self): + if self.conn: + self.conn.unbind() + self.conn = None + self.alive = False + + def connect(self): + try: + self.conn = ldap.initialize('ldap://%s:%s' % (self.server, self.port)) + self.conn.set_option(ldap.OPT_NETWORK_TIMEOUT, self.timeout) + if self.username and self.password: + self.conn.simple_bind(self.username, self.password) + except ldap.LDAPError as error: + self.error(error) + return False + + self.alive = True + return True + + def reconnect(self): + self.disconnect() + return self.connect() + + def check(self): + if not HAS_LDAP: + self.error("'python-ldap' package is needed") + return None + + return self.connect() and self.get_data() + + def get_data(self): + if not self.alive and not self.reconnect(): + return None + + data = dict() + for key in SEARCH_LIST: + dn = SEARCH_LIST[key][0] + attr = SEARCH_LIST[key][1] + try: + num = self.conn.search(dn, ldap.SCOPE_BASE, 'objectClass=*', [attr, ]) + result_type, result_data = self.conn.result(num, 1) + except ldap.LDAPError as error: + self.error("Empty result. Check bind username/password. Message: ",error) + self.alive = False + return None + + try: + if result_type == 101: + val = int(result_data[0][1].values()[0][0]) + except (ValueError, IndexError) as error: + self.debug(error) + continue + + data[key] = val + + return data diff --git a/collectors/python.d.plugin/openldap/openldap.conf b/collectors/python.d.plugin/openldap/openldap.conf new file mode 100644 index 00000000..662cc58c --- /dev/null +++ b/collectors/python.d.plugin/openldap/openldap.conf @@ -0,0 +1,74 @@ +# netdata python.d.plugin configuration for openldap +# +# This file is in YaML format. Generally the format is: +# +# name: value +# +# There are 2 sections: +# - global variables +# - one or more JOBS +# +# JOBS allow you to collect values from multiple sources. +# Each source will have its own set of charts. +# +# JOB parameters have to be indented (using spaces only, example below). + +# ---------------------------------------------------------------------- +# Global Variables +# These variables set the defaults for all JOBs, however each JOB +# may define its own, overriding the defaults. + +# update_every sets the default data collection frequency. +# If unset, the python.d.plugin default is used. +# postfix is slow, so once every 10 seconds +update_every: 10 + +# priority controls the order of charts at the netdata dashboard. 
+# Lower numbers move the charts towards the top of the page. +# If unset, the default for python.d.plugin is used. +# priority: 60000 + +# retries sets the number of retries to be made in case of failures. +# If unset, the default for python.d.plugin is used. +# Attempts to restore the service are made once every update_every +# and only if the module has collected values in the past. +# retries: 60 + +# autodetection_retry sets the job re-check interval in seconds. +# The job is not deleted if check fails. +# Attempts to start the job are made once every autodetection_retry. +# This feature is disabled by default. +# autodetection_retry: 0 + +# ---------------------------------------------------------------------- +# JOBS (data collection sources) +# +# The default JOBS share the same *name*. JOBS with the same name +# are mutually exclusive. Only one of them will be allowed running at +# any time. This allows autodetection to try several alternatives and +# pick the one that works. +# +# Any number of jobs is supported. +# +# All python.d.plugin JOBS (for all its modules) support a set of +# predefined parameters. These are: +# +# job_name: +# name: myname # the JOB's name as it will appear at the +# # dashboard (by default is the job_name) +# # JOBs sharing a name are mutually exclusive +# update_every: 1 # the JOB's data collection frequency +# priority: 60000 # the JOB's order on the dashboard +# retries: 60 # the JOB's number of restoration attempts +# autodetection_retry: 0 # the JOB's re-check interval in seconds +# +# ---------------------------------------------------------------------- +# OPENLDAP EXTRA PARAMETERS + +# Set here your LDAP connection settings + +#username : "cn=admin,dc=example,dc=com" # The bind user with right to access monitor statistics +#password : "yourpass" # The password for the binded user +#server : 'localhost' # The listening address of the LDAP server +#port : 389 # The listening port of the LDAP server +#timeout : 1 # Seconds to timeout if no connection exists
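For reference, the statistics above come from base-scope searches against the `cn=Monitor` subtree, one DN per metric. A minimal standalone sketch of such a query, with placeholder bind credentials:

```python
# Standalone sketch of one cn=Monitor query; the bind DN and password
# are placeholders, adjust them to your own monitor ACL.
import ldap

conn = ldap.initialize('ldap://localhost:389')
conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 1)
conn.simple_bind('cn=monitor,dc=example,dc=com', 'yourpass')

# Total connections counter, one of the DNs the module reads.
msgid = conn.search('cn=Total,cn=Connections,cn=Monitor',
                    ldap.SCOPE_BASE, 'objectClass=*', ['monitorCounter'])
result_type, result_data = conn.result(msgid, 1)
print(result_data)  # list of (dn, attributes) tuples
conn.unbind()
```

If this prints an empty result or raises an LDAP error, check that the monitor backend is enabled and that the bind DN has read access to `cn=Monitor`.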
\ No newline at end of file diff --git a/collectors/python.d.plugin/python.d.conf b/collectors/python.d.plugin/python.d.conf index 97f4cb8d..40c8c033 100644 --- a/collectors/python.d.plugin/python.d.conf +++ b/collectors/python.d.plugin/python.d.conf @@ -67,11 +67,13 @@ logind: no # mysql: yes # nginx: yes # nginx_plus: yes +# nvidia_smi: yes # nginx_log has been replaced by web_log nginx_log: no # nsd: yes # ntpd: yes +# openldap: yes # ovpn_status_log: yes # phpfpm: yes # postfix: yes @@ -90,8 +92,9 @@ nginx_log: no # springboot: yes # squid: yes # tomcat: yes +# tor: yes unbound: no # uwsgi: yes # varnish: yes # w1sensor: yes -# web_log: yes +# web_log: yes
\ No newline at end of file diff --git a/collectors/python.d.plugin/python.d.plugin b/collectors/python.d.plugin/python.d.plugin index 264c3383..efff2273 100644 --- a/collectors/python.d.plugin/python.d.plugin +++ b/collectors/python.d.plugin/python.d.plugin @@ -56,7 +56,7 @@ BASE_CONFIG = {'update_every': os.getenv('NETDATA_UPDATE_EVERY', 1), MODULE_EXTENSION = '.chart.py' -OBSOLETE_MODULES = ['apache_cache', 'gunicorn_log', 'nginx_log'] +OBSOLETE_MODULES = ['apache_cache', 'gunicorn_log', 'nginx_log', 'cpufreq'] def module_ok(m): diff --git a/collectors/python.d.plugin/python.d.plugin.in b/collectors/python.d.plugin/python.d.plugin.in index 7ac03fd9..8b55ad41 100755 --- a/collectors/python.d.plugin/python.d.plugin.in +++ b/collectors/python.d.plugin/python.d.plugin.in @@ -56,7 +56,7 @@ BASE_CONFIG = {'update_every': os.getenv('NETDATA_UPDATE_EVERY', 1), MODULE_EXTENSION = '.chart.py' -OBSOLETE_MODULES = ['apache_cache', 'gunicorn_log', 'nginx_log'] +OBSOLETE_MODULES = ['apache_cache', 'gunicorn_log', 'nginx_log', 'cpufreq'] def module_ok(m): diff --git a/collectors/python.d.plugin/python_modules/third_party/lm_sensors.py b/collectors/python.d.plugin/python_modules/third_party/lm_sensors.py index f10cd620..f873eac8 100644 --- a/collectors/python.d.plugin/python_modules/third_party/lm_sensors.py +++ b/collectors/python.d.plugin/python_modules/third_party/lm_sensors.py @@ -17,11 +17,79 @@ import ctypes.util _libc = cdll.LoadLibrary(ctypes.util.find_library("c")) # see https://github.com/paroj/sensors.py/issues/1 _libc.free.argtypes = [c_void_p] + _hdl = cdll.LoadLibrary(ctypes.util.find_library("sensors")) version = c_char_p.in_dll(_hdl, "libsensors_version").value.decode("ascii") +class SensorsError(Exception): + pass + + +class ErrorWildcards(SensorsError): + pass + + +class ErrorNoEntry(SensorsError): + pass + + +class ErrorAccessRead(SensorsError, OSError): + pass + + +class ErrorKernel(SensorsError, OSError): + pass + + +class ErrorDivZero(SensorsError, ZeroDivisionError): + pass + + +class ErrorChipName(SensorsError): + pass + + +class ErrorBusName(SensorsError): + pass + + +class ErrorParse(SensorsError): + pass + + +class ErrorAccessWrite(SensorsError, OSError): + pass + + +class ErrorIO(SensorsError, IOError): + pass + + +class ErrorRecursion(SensorsError): + pass + + +_ERR_MAP = { + 1: ErrorWildcards, + 2: ErrorNoEntry, + 3: ErrorAccessRead, + 4: ErrorKernel, + 5: ErrorDivZero, + 6: ErrorChipName, + 7: ErrorBusName, + 8: ErrorParse, + 9: ErrorAccessWrite, + 10: ErrorIO, + 11: ErrorRecursion +} + + +def raise_sensor_error(errno, message=''): + raise _ERR_MAP[abs(errno)](message) + + class bus_id(Structure): _fields_ = [("type", c_short), ("nr", c_short)] @@ -65,8 +133,8 @@ class subfeature(Structure): _hdl.sensors_get_detected_chips.restype = POINTER(chip_name) _hdl.sensors_get_features.restype = POINTER(feature) _hdl.sensors_get_all_subfeatures.restype = POINTER(subfeature) -_hdl.sensors_get_label.restype = c_void_p # return pointer instead of str so we can free it -_hdl.sensors_get_adapter_name.restype = c_char_p # docs do not say whether to free this or not +_hdl.sensors_get_label.restype = c_void_p # return pointer instead of str so we can free it +_hdl.sensors_get_adapter_name.restype = c_char_p # docs do not say whether to free this or not _hdl.sensors_strerror.restype = c_char_p ### RAW API ### @@ -78,8 +146,9 @@ COMPUTE_MAPPING = 4 def init(cfg_file=None): file = _libc.fopen(cfg_file.encode("utf-8"), "r") if cfg_file is not None else None - if _hdl.sensors_init(file) != 0: 
- raise Exception("sensors_init failed") + result = _hdl.sensors_init(file) + if result != 0: + raise_sensor_error(result, "sensors_init failed") if file is not None: _libc.fclose(file) @@ -94,7 +163,7 @@ def parse_chip_name(orig_name): err = _hdl.sensors_parse_chip_name(orig_name.encode("utf-8"), byref(ret)) if err < 0: - raise Exception(strerror(err)) + raise_sensor_error(err, strerror(err)) return ret @@ -129,7 +198,7 @@ def chip_snprintf_name(chip, buffer_size=200): err = _hdl.sensors_snprintf_chip_name(ret, buffer_size, byref(chip)) if err < 0: - raise Exception(strerror(err)) + raise_sensor_error(err, strerror(err)) return ret.value.decode("utf-8") @@ -140,7 +209,7 @@ def do_chip_sets(chip): """ err = _hdl.sensors_do_chip_sets(byref(chip)) if err < 0: - raise Exception(strerror(err)) + raise_sensor_error(err, strerror(err)) def get_adapter_name(bus): @@ -178,7 +247,7 @@ def get_value(chip, subfeature_nr): val = c_double() err = _hdl.sensors_get_value(byref(chip), subfeature_nr, byref(val)) if err < 0: - raise Exception(strerror(err)) + raise_sensor_error(err, strerror(err)) return val.value @@ -189,7 +258,7 @@ def set_value(chip, subfeature_nr, value): val = c_double(value) err = _hdl.sensors_set_value(byref(chip), subfeature_nr, byref(val)) if err < 0: - raise Exception(strerror(err)) + raise_sensor_error(err, strerror(err)) ### Convenience API ### @@ -213,7 +282,7 @@ class ChipIterator: if self.match is not None: free_chip_name(self.match) - def next(self): # python2 compability + def next(self): # python2 compability return self.__next__() @@ -233,7 +302,7 @@ class FeatureIterator: return feature - def next(self): # python2 compability + def next(self): # python2 compability return self.__next__() @@ -254,5 +323,5 @@ class SubFeatureIterator: return subfeature - def next(self): # python2 compability + def next(self): # python2 compability return self.__next__() diff --git a/collectors/python.d.plugin/sensors/sensors.chart.py b/collectors/python.d.plugin/sensors/sensors.chart.py index 69d2bfe9..d70af3b0 100644 --- a/collectors/python.d.plugin/sensors/sensors.chart.py +++ b/collectors/python.d.plugin/sensors/sensors.chart.py @@ -3,13 +3,22 @@ # Author: Pawel Krupa (paulfantom) # SPDX-License-Identifier: GPL-3.0-or-later -from bases.FrameworkServices.SimpleService import SimpleService from third_party import lm_sensors as sensors +from bases.FrameworkServices.SimpleService import SimpleService + # default module values (can be overridden per job in `config`) # update_every = 2 -ORDER = ['temperature', 'fan', 'voltage', 'current', 'power', 'energy', 'humidity'] +ORDER = [ + 'temperature', + 'fan', + 'voltage', + 'current', + 'power', + 'energy', + 'humidity', +] # This is a prototype of chart definition which is used to dynamically create self.definitions CHARTS = { @@ -94,16 +103,22 @@ class Service(SimpleService): prefix = sensors.chip_snprintf_name(chip) for feature in sensors.FeatureIterator(chip): sfi = sensors.SubFeatureIterator(chip, feature) + val = None for sf in sfi: - val = sensors.get_value(chip, sf.number) - break + try: + val = sensors.get_value(chip, sf.number) + break + except sensors.SensorsError: + continue + if val is None: + continue type_name = TYPE_MAP[feature.type] if type_name in LIMITS: limit = LIMITS[type_name] if val < limit[0] or val > limit[1]: continue data[prefix + '_' + str(feature.name.decode())] = int(val * 1000) - except Exception as error: + except sensors.SensorsError as error: self.error(error) return None @@ -117,8 +132,14 @@ class 
Service(SimpleService): continue for feature in sensors.FeatureIterator(chip): sfi = sensors.SubFeatureIterator(chip, feature) - vals = [sensors.get_value(chip, sf.number) for sf in sfi] - if vals[0] == 0: + vals = list() + for sf in sfi: + try: + vals.append(sensors.get_value(chip, sf.number)) + except sensors.SensorsError as error: + self.error('{0}: {1}'.format(sf.name, error)) + continue + if not vals or vals[0] == 0: continue if TYPE_MAP[feature.type] == sensor: # create chart @@ -137,7 +158,7 @@ class Service(SimpleService): def check(self): try: sensors.init() - except Exception as error: + except sensors.SensorsError as error: self.error(error) return False diff --git a/collectors/python.d.plugin/smartd_log/README.md b/collectors/python.d.plugin/smartd_log/README.md index 121a6357..a31ad0c7 100644 --- a/collectors/python.d.plugin/smartd_log/README.md +++ b/collectors/python.d.plugin/smartd_log/README.md @@ -2,29 +2,92 @@ Module monitor `smartd` log files to collect HDD/SSD S.M.A.R.T attributes. -It produces following charts (you can add additional attributes in the module configuration file): +**Requirements:** +* `smartmontools` -1. **Read Error Rate** attribute 1 +It produces following charts for SCSI devices: -2. **Start/Stop Count** attribute 4 +1. **Read Error Corrected** -3. **Reallocated Sectors Count** attribute 5 +2. **Read Error Uncorrected** -4. **Seek Error Rate** attribute 7 +3. **Write Error Corrected** -5. **Power-On Hours Count** attribute 9 +4. **Write Error Uncorrected** -6. **Power Cycle Count** attribute 12 +5. **Verify Error Corrected** -7. **Load/Unload Cycles** attribute 193 +6. **Verify Error Uncorrected** -8. **Temperature** attribute 194 +7. **Temperature** -9. **Current Pending Sectors** attribute 197 -10. **Off-Line Uncorrectable** attribute 198 +For ATA devices: +1. **Read Error Rate** -11. **Write Error Rate** attribute 200 +2. **Seek Error Rate** + +3. **Soft Read Error Rate** + +4. **Write Error Rate** + +5. **SATA Interface Downshift** + +6. **UDMA CRC Error Count** + +7. **Throughput Performance** + +8. **Seek Time Performance** + +9. **Start/Stop Count** + +10. **Power-On Hours Count** + +11. **Power Cycle Count** + +12. **Unexpected Power Loss** + +13. **Spin-Up Time** + +14. **Spin-up Retries** + +15. **Calibration Retries** + +16. **Temperature** + +17. **Reallocated Sectors Count** + +18. **Reserved Block Count** + +19. **Program Fail Count** + +20. **Erase Fail Count** + +21. **Wear Leveller Worst Case Erase Count** + +22. **Unused Reserved NAND Blocks** + +23. **Reallocation Event Count** + +24. **Current Pending Sector Count** + +25. **Offline Uncorrectable Sector Count** + +26. **Percent Lifetime Used** + +### prerequisite +`smartd` must be running with `-A` option to write smartd attribute information to files. + +For this you need to set `smartd_opts` (or `SMARTD_ARGS`, check _smartd.service_ content) in `/etc/default/smartmontools`: + + +``` +# dump smartd attrs info every 600 seconds +smartd_opts="-A /var/log/smartd/ -i 600" +``` + + +`smartd` appends logs at every run. It's strongly recommended to use `logrotate` for smartd files. ### configuration @@ -33,6 +96,6 @@ local: log_path : '/var/log/smartd/' ``` -If no configuration is given, module will attempt to read log files in /var/log/smartd/ directory. +If no configuration is given, module will attempt to read log files in `/var/log/smartd/` directory. 
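With `-A` enabled, `smartd` appends a CSV line for every check to a per-device file under the configured `log_path`, and the module reads only the last line of each file, extracting `attribute id;normalized value;raw value` triplets with a regular expression (the `RE_ATA` pattern added below). A sketch of that parsing step, using an invented sample line:

```python
# Sketch of the per-line parsing; the sample line is invented for
# illustration and is not copied from a real smartd log.
import re

RE_ATA = re.compile(r'(\d+);(\d+);(\d+)')  # attribute id; normalized; raw

last_line = '9;98;1324;\t194;62;38;'  # hypothetical: power-on hours, temperature
for attr_id, normalized, raw in RE_ATA.findall(last_line):
    print('attribute {0}: normalized={1}, raw={2}'.format(attr_id, normalized, raw))
```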
--- diff --git a/collectors/python.d.plugin/smartd_log/smartd_log.chart.py b/collectors/python.d.plugin/smartd_log/smartd_log.chart.py index 21dbccec..13762fab 100644 --- a/collectors/python.d.plugin/smartd_log/smartd_log.chart.py +++ b/collectors/python.d.plugin/smartd_log/smartd_log.chart.py @@ -6,182 +6,537 @@ import os import re -from collections import namedtuple +from copy import deepcopy from time import time from bases.collection import read_last_line from bases.FrameworkServices.SimpleService import SimpleService -# charts order (can be overridden if you want less charts, or different order) -ORDER = ['1', '4', '5', '7', '9', '12', '193', '194', '197', '198', '200'] - -SMART_ATTR = { - '1': 'Read Error Rate', - '2': 'Throughput Performance', - '3': 'Spin-Up Time', - '4': 'Start/Stop Count', - '5': 'Reallocated Sectors Count', - '6': 'Read Channel Margin', - '7': 'Seek Error Rate', - '8': 'Seek Time Performance', - '9': 'Power-On Hours Count', - '10': 'Spin-up Retries', - '11': 'Calibration Retries', - '12': 'Power Cycle Count', - '13': 'Soft Read Error Rate', - '100': 'Erase/Program Cycles', - '103': 'Translation Table Rebuild', - '108': 'Unknown (108)', - '170': 'Reserved Block Count', - '171': 'Program Fail Count', - '172': 'Erase Fail Count', - '173': 'Wear Leveller Worst Case Erase Count', - '174': 'Unexpected Power Loss', - '175': 'Program Fail Count', - '176': 'Erase Fail Count', - '177': 'Wear Leveling Count', - '178': 'Used Reserved Block Count', - '179': 'Used Reserved Block Count', - '180': 'Unused Reserved Block Count', - '181': 'Program Fail Count', - '182': 'Erase Fail Count', - '183': 'SATA Downshifts', - '184': 'End-to-End error', - '185': 'Head Stability', - '186': 'Induced Op-Vibration Detection', - '187': 'Reported Uncorrectable Errors', - '188': 'Command Timeout', - '189': 'High Fly Writes', - '190': 'Temperature', - '191': 'G-Sense Errors', - '192': 'Power-Off Retract Cycles', - '193': 'Load/Unload Cycles', - '194': 'Temperature', - '195': 'Hardware ECC Recovered', - '196': 'Reallocation Events', - '197': 'Current Pending Sectors', - '198': 'Off-line Uncorrectable', - '199': 'UDMA CRC Error Rate', - '200': 'Write Error Rate', - '201': 'Soft Read Errors', - '202': 'Data Address Mark Errors', - '203': 'Run Out Cancel', - '204': 'Soft ECC Corrections', - '205': 'Thermal Asperity Rate', - '206': 'Flying Height', - '207': 'Spin High Current', - '209': 'Offline Seek Performance', - '220': 'Disk Shift', - '221': 'G-Sense Error Rate', - '222': 'Loaded Hours', - '223': 'Load/Unload Retries', - '224': 'Load Friction', - '225': 'Load/Unload Cycles', - '226': 'Load-in Time', - '227': 'Torque Amplification Count', - '228': 'Power-Off Retracts', - '230': 'GMR Head Amplitude', - '231': 'Temperature', - '232': 'Available Reserved Space', - '233': 'Media Wearout Indicator', - '240': 'Head Flying Hours', - '241': 'Total LBAs Written', - '242': 'Total LBAs Read', - '250': 'Read Error Retry Rate' -} - -LIMIT = namedtuple('LIMIT', ['min', 'max']) - -LIMITS = { - '194': LIMIT(0, 200) -} -RESCAN_INTERVAL = 60 - -REGEX = re.compile( +INCREMENTAL = 'incremental' +ABSOLUTE = 'absolute' + +ATA = 'ata' +SCSI = 'scsi' +CSV = '.csv' + +DEF_RESCAN_INTERVAL = 60 +DEF_AGE = 30 +DEF_PATH = '/var/log/smartd' + +ATTR1 = '1' +ATTR2 = '2' +ATTR3 = '3' +ATTR4 = '4' +ATTR5 = '5' +ATTR7 = '7' +ATTR8 = '8' +ATTR9 = '9' +ATTR10 = '10' +ATTR11 = '11' +ATTR12 = '12' +ATTR13 = '13' +ATTR170 = '170' +ATTR171 = '171' +ATTR172 = '172' +ATTR173 = '173' +ATTR174 = '174' +ATTR180 = '180' +ATTR183 = '183' 
+ATTR190 = '190' +ATTR194 = '194' +ATTR196 = '196' +ATTR197 = '197' +ATTR198 = '198' +ATTR199 = '199' +ATTR202 = '202' +ATTR206 = '206' +ATTR_READ_ERR_COR = 'read-total-err-corrected' +ATTR_READ_ERR_UNC = 'read-total-unc-errors' +ATTR_WRITE_ERR_COR = 'write-total-err-corrected' +ATTR_WRITE_ERR_UNC = 'write-total-unc-errors' +ATTR_VERIFY_ERR_COR = 'verify-total-err-corrected' +ATTR_VERIFY_ERR_UNC = 'verify-total-unc-errors' +ATTR_TEMPERATURE = 'temperature' + + +RE_ATA = re.compile( '(\d+);' # attribute '(\d+);' # normalized value '(\d+)', # raw value re.X ) +RE_SCSI = re.compile( + '([a-z-]+);' # attribute + '([0-9.]+)', # raw value + re.X +) -def chart_template(chart_name): - units, attr_id = chart_name.split('_')[-2:] - title = '{value_type} {description}'.format(value_type=units.capitalize(), - description=SMART_ATTR[attr_id]) - family = SMART_ATTR[attr_id].lower() - - return { - chart_name: { - 'options': [None, title, units, family, 'smartd_log.' + chart_name, 'line'], - 'lines': [] - } +ORDER = [ + # errors + 'read_error_rate', + 'seek_error_rate', + 'soft_read_error_rate', + 'write_error_rate', + 'read_total_err_corrected', + 'read_total_unc_errors', + 'write_total_err_corrected', + 'write_total_unc_errors', + 'verify_total_err_corrected', + 'verify_total_unc_errors', + # external failure + 'sata_interface_downshift', + 'udma_crc_error_count', + # performance + 'throughput_performance', + 'seek_time_performance', + # power + 'start_stop_count', + 'power_on_hours_count', + 'power_cycle_count', + 'unexpected_power_loss', + # spin + 'spin_up_time', + 'spin_up_retries', + 'calibration_retries', + # temperature + 'airflow_temperature_celsius', + 'temperature_celsius', + # wear + 'reallocated_sectors_count', + 'reserved_block_count', + 'program_fail_count', + 'erase_fail_count', + 'wear_leveller_worst_case_erase_count', + 'unused_reserved_nand_blocks', + 'reallocation_event_count', + 'current_pending_sector_count', + 'offline_uncorrectable_sector_count', + 'percent_lifetime_used', +] + +CHARTS = { + 'read_error_rate': { + 'options': [None, 'Read Error Rate', 'value', 'errors', 'smartd_log.read_error_rate', 'line'], + 'lines': [], + 'attrs': [ATTR1], + 'algo': ABSOLUTE, + }, + 'seek_error_rate': { + 'options': [None, 'Seek Error Rate', 'value', 'errors', 'smartd_log.seek_error_rate', 'line'], + 'lines': [], + 'attrs': [ATTR7], + 'algo': ABSOLUTE, + }, + 'soft_read_error_rate': { + 'options': [None, 'Soft Read Error Rate', 'errors', 'errors', 'smartd_log.soft_read_error_rate', 'line'], + 'lines': [], + 'attrs': [ATTR13], + 'algo': INCREMENTAL, + }, + 'write_error_rate': { + 'options': [None, 'Write Error Rate', 'value', 'errors', 'smartd_log.write_error_rate', 'line'], + 'lines': [], + 'attrs': [ATTR206], + 'algo': ABSOLUTE, + }, + 'read_total_err_corrected': { + 'options': [None, 'Read Error Corrected', 'errors', 'errors', 'smartd_log.read_total_err_corrected', 'line'], + 'lines': [], + 'attrs': [ATTR_READ_ERR_COR], + 'algo': INCREMENTAL, + }, + 'read_total_unc_errors': { + 'options': [None, 'Read Error Uncorrected', 'errors', 'errors', 'smartd_log.read_total_unc_errors', 'line'], + 'lines': [], + 'attrs': [ATTR_READ_ERR_UNC], + 'algo': INCREMENTAL, + }, + 'write_total_err_corrected': { + 'options': [None, 'Write Error Corrected', 'errors', 'errors', 'smartd_log.read_total_err_corrected', 'line'], + 'lines': [], + 'attrs': [ATTR_WRITE_ERR_COR], + 'algo': INCREMENTAL, + }, + 'write_total_unc_errors': { + 'options': [None, 'Write Error Uncorrected', 'errors', 'errors', 
'smartd_log.write_total_unc_errors', 'line'], + 'lines': [], + 'attrs': [ATTR_WRITE_ERR_UNC], + 'algo': INCREMENTAL, + }, + 'verify_total_err_corrected': { + 'options': [None, 'Verify Error Corrected', 'errors', 'errors', 'smartd_log.verify_total_err_corrected', + 'line'], + 'lines': [], + 'attrs': [ATTR_VERIFY_ERR_COR], + 'algo': INCREMENTAL, + }, + 'verify_total_unc_errors': { + 'options': [None, 'Verify Error Uncorrected', 'errors', 'errors', 'smartd_log.verify_total_unc_errors', 'line'], + 'lines': [], + 'attrs': [ATTR_VERIFY_ERR_UNC], + 'algo': INCREMENTAL, + }, + 'sata_interface_downshift': { + 'options': [None, 'SATA Interface Downshift', 'events', 'external failure', + 'smartd_log.sata_interface_downshift', 'line'], + 'lines': [], + 'attrs': [ATTR183], + 'algo': INCREMENTAL, + }, + 'udma_crc_error_count': { + 'options': [None, 'UDMA CRC Error Count', 'errors', 'external failure', 'smartd_log.udma_crc_error_count', + 'line'], + 'lines': [], + 'attrs': [ATTR199], + 'algo': INCREMENTAL, + }, + 'throughput_performance': { + 'options': [None, 'Throughput Performance', 'value', 'performance', 'smartd_log.throughput_performance', + 'line'], + 'lines': [], + 'attrs': [ATTR2], + 'algo': ABSOLUTE, + }, + 'seek_time_performance': { + 'options': [None, 'Seek Time Performance', 'value', 'performance', 'smartd_log.seek_time_performance', 'line'], + 'lines': [], + 'attrs': [ATTR8], + 'algo': ABSOLUTE, + }, + 'start_stop_count': { + 'options': [None, 'Start/Stop Count', 'events', 'power', 'smartd_log.start_stop_count', 'line'], + 'lines': [], + 'attrs': [ATTR4], + 'algo': ABSOLUTE, + }, + 'power_on_hours_count': { + 'options': [None, 'Power-On Hours Count', 'hours', 'power', 'smartd_log.power_on_hours_count', 'line'], + 'lines': [], + 'attrs': [ATTR9], + 'algo': ABSOLUTE, + }, + 'power_cycle_count': { + 'options': [None, 'Power Cycle Count', 'events', 'power', 'smartd_log.power_cycle_count', 'line'], + 'lines': [], + 'attrs': [ATTR12], + 'algo': ABSOLUTE, + }, + 'unexpected_power_loss': { + 'options': [None, 'Unexpected Power Loss', 'events', 'power', 'smartd_log.unexpected_power_loss', 'line'], + 'lines': [], + 'attrs': [ATTR174], + 'algo': ABSOLUTE, + }, + 'spin_up_time': { + 'options': [None, 'Spin-Up Time', 'ms', 'spin', 'smartd_log.spin_up_time', 'line'], + 'lines': [], + 'attrs': [ATTR3], + 'algo': ABSOLUTE, + }, + 'spin_up_retries': { + 'options': [None, 'Spin-up Retries', 'retries', 'spin', 'smartd_log.spin_up_retries', 'line'], + 'lines': [], + 'attrs': [ATTR10], + 'algo': INCREMENTAL, + }, + 'calibration_retries': { + 'options': [None, 'Calibration Retries', 'retries', 'spin', 'smartd_log.calibration_retries', 'line'], + 'lines': [], + 'attrs': [ATTR11], + 'algo': INCREMENTAL, + }, + 'airflow_temperature_celsius': { + 'options': [None, 'Airflow Temperature Celsius', 'celsius', 'temperature', + 'smartd_log.airflow_temperature_celsius', 'line'], + 'lines': [], + 'attrs': [ATTR190], + 'algo': ABSOLUTE, + }, + 'temperature_celsius': { + 'options': [None, 'Temperature', 'celsius', 'temperature', 'smartd_log.temperature_celsius', 'line'], + 'lines': [], + 'attrs': [ATTR194, ATTR_TEMPERATURE], + 'algo': ABSOLUTE, + }, + 'reallocated_sectors_count': { + 'options': [None, 'Reallocated Sectors Count', 'sectors', 'wear', 'smartd_log.reallocated_sectors_count', + 'line'], + 'lines': [], + 'attrs': [ATTR5], + 'algo': INCREMENTAL, + }, + 'reserved_block_count': { + 'options': [None, 'Reserved Block Count', '%', 'wear', 'smartd_log.reserved_block_count', 'line'], + 'lines': [], + 'attrs': [ATTR170], + 
'algo': ABSOLUTE, + }, + 'program_fail_count': { + 'options': [None, 'Program Fail Count', 'errors', 'wear', 'smartd_log.program_fail_count', 'line'], + 'lines': [], + 'attrs': [ATTR171], + 'algo': INCREMENTAL, + }, + 'erase_fail_count': { + 'options': [None, 'Erase Fail Count', 'failures', 'wear', 'smartd_log.erase_fail_count', 'line'], + 'lines': [], + 'attrs': [ATTR172], + 'algo': INCREMENTAL, + }, + 'wear_leveller_worst_case_erase_count': { + 'options': [None, 'Wear Leveller Worst Case Erase Count', 'erases', 'wear', + 'smartd_log.wear_leveller_worst_case_erase_count', 'line'], + 'lines': [], + 'attrs': [ATTR173], + 'algo': ABSOLUTE, + }, + 'unused_reserved_nand_blocks': { + 'options': [None, 'Unused Reserved NAND Blocks', 'blocks', 'wear', 'smartd_log.unused_reserved_nand_blocks', + 'line'], + 'lines': [], + 'attrs': [ATTR180], + 'algo': ABSOLUTE, + }, + 'reallocation_event_count': { + 'options': [None, 'Reallocation Event Count', 'events', 'wear', 'smartd_log.reallocation_event_count', 'line'], + 'lines': [], + 'attrs': [ATTR196], + 'algo': INCREMENTAL, + }, + 'current_pending_sector_count': { + 'options': [None, 'Current Pending Sector Count', 'sectors', 'wear', 'smartd_log.current_pending_sector_count', + 'line'], + 'lines': [], + 'attrs': [ATTR197], + 'algo': ABSOLUTE, + }, + 'offline_uncorrectable_sector_count': { + 'options': [None, 'Offline Uncorrectable Sector Count', 'sectors', 'wear', + 'smartd_log.offline_uncorrectable_sector_count', 'line'], + 'lines': [], + 'attrs': [ATTR198], + 'algo': ABSOLUTE, + + }, + 'percent_lifetime_used': { + 'options': [None, 'Percent Lifetime Used', '%', 'wear', 'smartd_log.percent_lifetime_used', 'line'], + 'lines': [], + 'attrs': [ATTR202], + 'algo': ABSOLUTE, } +} + +# NOTE: 'parse_temp' decodes ATA 194 raw value. Not heavily tested. Written by @Ferroin +# C code: +# https://github.com/smartmontools/smartmontools/blob/master/smartmontools/atacmds.cpp#L2051 +# +# Calling 'parse_temp' on the raw value will return a 4-tuple, containing +# * temperature +# * minimum +# * maximum +# * over-temperature count +# substituting None for values it can't decode. 
+# +# Example: +# >>> parse_temp(42952491042) +# >>> (34, 10, 43, None) +# +# +# def check_temp_word(i): +# if i <= 0x7F: +# return 0x11 +# elif i <= 0xFF: +# return 0x01 +# elif 0xFF80 <= i: +# return 0x10 +# return 0x00 +# +# +# def check_temp_range(t, b0, b1): +# if b0 > b1: +# t0, t1 = b1, b0 +# else: +# t0, t1 = b0, b1 +# +# if all([ +# -60 <= t0, +# t0 <= t, +# t <= t1, +# t1 <= 120, +# not (t0 == -1 and t1 <= 0) +# ]): +# return t0, t1 +# return None, None +# +# +# def parse_temp(raw): +# byte = list() +# word = list() +# for i in range(0, 6): +# byte.append(0xFF & (raw >> (i * 8))) +# for i in range(0, 3): +# word.append(0xFFFF & (raw >> (i * 16))) +# +# ctwd = check_temp_word(word[0]) +# +# if not word[2]: +# if ctwd and not word[1]: +# # byte[0] is temp, no other data +# return byte[0], None, None, None +# +# if ctwd and all(check_temp_range(byte[0], byte[2], byte[3])): +# # byte[0] is temp, byte[2] is max or min, byte[3] is min or max +# trange = check_temp_range(byte[0], byte[2], byte[3]) +# return byte[0], trange[0], trange[1], None +# +# if ctwd and all(check_temp_range(byte[0], byte[1], byte[2])): +# # byte[0] is temp, byte[1] is max or min, byte[2] is min or max +# trange = check_temp_range(byte[0], byte[1], byte[2]) +# return byte[0], trange[0], trange[1], None +# +# return None, None, None, None +# +# if ctwd: +# if all( +# [ +# ctwd & check_temp_word(word[1]) & check_temp_word(word[2]) != 0x00, +# all(check_temp_range(byte[0], byte[2], byte[4])), +# ] +# ): +# # byte[0] is temp, byte[2] is max or min, byte[4] is min or max +# trange = check_temp_range(byte[0], byte[2], byte[4]) +# return byte[0], trange[0], trange[1], None +# else: +# trange = check_temp_range(byte[0], byte[2], byte[3]) +# if word[2] < 0x7FFF and all(trange) and trange[1] >= 40: +# # byte[0] is temp, byte[2] is max or min, byte[3] is min or max, word[2] is overtemp count +# return byte[0], trange[0], trange[1], word[2] +# # no data +# return None, None, None, None + + +CHARTED_ATTRS = dict((attr, k) for k, v in CHARTS.items() for attr in v['attrs']) + + +class BaseAtaSmartAttribute: + def __init__(self, name, normalized_value, raw_value): + self.name = name + self.normalized_value = normalized_value + self.raw_value = raw_value + + def value(self): + raise NotImplementedError + + +class AtaRaw(BaseAtaSmartAttribute): + def value(self): + return self.raw_value + + +class AtaNormalized(BaseAtaSmartAttribute): + def value(self): + return self.normalized_value + + +class Ata9(BaseAtaSmartAttribute): + def value(self): + value = int(self.raw_value) + if value > 1e6: + return value & 0xFFFF + return value + + +class Ata190(BaseAtaSmartAttribute): + def value(self): + return 100 - int(self.normalized_value) + + +class BaseSCSISmartAttribute: + def __init__(self, name, raw_value): + self.name = name + self.raw_value = raw_value + + def value(self): + raise NotImplementedError + + +class SCSIRaw(BaseSCSISmartAttribute): + def value(self): + return self.raw_value + + +def ata_attribute_factory(value): + name = value[0] + + if name == ATTR9: + return Ata9(*value) + elif name == ATTR190: + return Ata190(*value) + elif name in [ + ATTR1, + ATTR7, + ATTR194, + ATTR202, + ATTR206, + ]: + return AtaNormalized(*value) + + return AtaRaw(*value) -def handle_os_error(method): - def on_call(*args): - try: - return method(*args) - except OSError: - return None - return on_call +def scsi_attribute_factory(value): + return SCSIRaw(*value) -class SmartAttribute(object): - def __init__(self, idx, normalized, raw): - self.id = idx 
- self.normalized = normalized - self._raw = raw +def attribute_factory(value): + name = value[0] + if name.isdigit(): + return ata_attribute_factory(value) + return scsi_attribute_factory(value) - @property - def raw(self): - if self.id in LIMITS: - limit = LIMITS[self.id] - if limit.min <= int(self._raw) <= limit.max: - return self._raw - return None - return self._raw - @raw.setter - def raw(self, value): - self._raw = value +def handle_error(*errors): + def on_method(method): + def on_call(*args): + try: + return method(*args) + except errors: + return None + return on_call + return on_method class DiskLogFile: - def __init__(self, path): - self.path = path - self.size = os.path.getsize(path) + def __init__(self, full_path): + self.path = full_path + self.size = os.path.getsize(full_path) - @handle_os_error + @handle_error(OSError) def is_changed(self): - new_size = os.path.getsize(self.path) - old_size, self.size = self.size, new_size - - return new_size != old_size and new_size + return self.size != os.path.getsize(self.path) - @staticmethod - @handle_os_error - def is_valid(log_file, exclude): - return all([log_file.endswith('.csv'), - not [p for p in exclude if p in log_file], - os.access(log_file, os.R_OK), - os.path.getsize(log_file)]) + @handle_error(OSError) + def is_active(self, current_time, limit): + return (current_time - os.path.getmtime(self.path)) / 60 < limit + @handle_error(OSError) + def read(self): + self.size = os.path.getsize(self.path) + return read_last_line(self.path) -class Disk: - def __init__(self, full_path, age): - self.log_file = DiskLogFile(full_path) - self.name = os.path.basename(full_path).split('.')[-3] - self.age = int(age) - self.status = True - self.attributes = dict() - self.get_attributes() +class BaseDisk: + def __init__(self, name, log_file): + self.name = re.sub(r'_+', '_', name) + self.log_file = log_file + self.attrs = list() + self.alive = True + self.charted = False def __eq__(self, other): - if isinstance(other, Disk): + if isinstance(other, BaseDisk): return self.name == other.name return self.name == other @@ -191,163 +546,179 @@ class Disk: def __hash__(self): return hash(repr(self)) - @handle_os_error - def is_active(self): - return (time() - os.path.getmtime(self.log_file.path)) / 60 < self.age + def parser(self, data): + raise NotImplementedError - @handle_os_error - def get_attributes(self): - last_line = read_last_line(self.log_file.path) - self.attributes = dict((attr, SmartAttribute(attr, normalized, raw)) for attr, normalized, raw - in REGEX.findall(last_line)) - return True + @handle_error(TypeError) + def populate_attrs(self): + self.attrs = list() + line = self.log_file.read() + for value in self.parser(line): + self.attrs.append(attribute_factory(value)) + + return len(self.attrs) def data(self): data = dict() - for attr in self.attributes.values(): - data['_'.join([self.name, 'normalized', attr.id])] = attr.normalized - if attr.raw is not None: - data['_'.join([self.name, 'raw', attr.id])] = attr.raw + for attr in self.attrs: + data['{0}_{1}'.format(self.name, attr.name)] = attr.value() return data -class Service(SimpleService): - def __init__(self, configuration=None, name=None): - SimpleService.__init__(self, configuration=configuration, name=name) - self.log_path = self.configuration.get('log_path', '/var/log/smartd') - self.raw = self.configuration.get('raw_values', True) - self.exclude = self.configuration.get('exclude_disks', str()).split() - self.age = self.configuration.get('age', 30) +class ATADisk(BaseDisk): + 
def parser(self, data): + return RE_ATA.findall(data) - self.runs = 0 - self.disks = list() - self.order = list() - self.definitions = dict() - def check(self): - self.disks = self.scan() +class SCSIDisk(BaseDisk): + def parser(self, data): + return RE_SCSI.findall(data) - if not self.disks: - return None - user_defined_sa = self.configuration.get('smart_attributes') +class Service(SimpleService): + def __init__(self, configuration=None, name=None): + SimpleService.__init__(self, configuration=configuration, name=name) + self.order = ORDER + self.definitions = deepcopy(CHARTS) - if user_defined_sa: - order = user_defined_sa.split() or ORDER - else: - order = ORDER + self.log_path = configuration.get('log_path', DEF_PATH) + self.age = configuration.get('age', DEF_AGE) + self.exclude = configuration.get('exclude_disks', str()).split() - self.create_charts(order) + self.disks = list() + self.runs = 0 - return True + def check(self): + return self.scan() > 0 def get_data(self): self.runs += 1 - if self.runs % RESCAN_INTERVAL == 0: - self.cleanup_and_rescan() + if self.runs % DEF_RESCAN_INTERVAL == 0: + self.cleanup() + self.scan() data = dict() for disk in self.disks: - - if not disk.status: + if not disk.alive: continue + if not disk.charted: + self.add_disk_to_charts(disk) + changed = disk.log_file.is_changed() - # True = changed, False = unchanged, None = Exception if changed is None: - disk.status = False + disk.alive = False continue - if changed: - success = disk.get_attributes() - if not success: - disk.status = False - continue + if changed and disk.populate_attrs() is None: + disk.alive = False + continue data.update(disk.data()) - return data or None - - def create_charts(self, order): - for attr in order: - raw_name, normalized_name = 'attr_id_raw_' + attr, 'attr_id_normalized_' + attr - raw, normalized = chart_template(raw_name), chart_template(normalized_name) - self.order.extend([normalized_name, raw_name]) - self.definitions.update(raw) - self.definitions.update(normalized) - - for disk in self.disks: - if attr not in disk.attributes: - self.debug("'{disk}' has no attribute '{attr_id}'".format(disk=disk.name, - attr_id=attr)) - continue - normalized[normalized_name]['lines'].append(['_'.join([disk.name, 'normalized', attr]), disk.name]) - - if not self.raw: - continue - - if disk.attributes[attr].raw is not None: - raw[raw_name]['lines'].append(['_'.join([disk.name, 'raw', attr]), disk.name]) - continue - self.debug("'{disk}' attribute '{attr_id}' value not in {limits}".format(disk=disk.name, - attr_id=attr, - limits=LIMITS[attr])) - - def cleanup_and_rescan(self): - self.cleanup() - new_disks = self.scan(only_new=True) - - for disk in new_disks: - valid = False - - for chart in self.charts: - value_type, idx = chart.id.split('_')[2:] - - if idx in disk.attributes: - valid = True - dimension_id = '_'.join([disk.name, value_type, idx]) - - if dimension_id in chart: - chart.hide_dimension(dimension_id=dimension_id, reverse=True) - else: - chart.add_dimension([dimension_id, disk.name]) - if valid: - self.disks.append(disk) + return data def cleanup(self): - for disk in self.disks: + current_time = time() + for disk in self.disks[:]: + if any( + [ + not disk.alive, + not disk.log_file.is_active(current_time, self.age), + ] + ): + self.disks.remove(disk.name) + self.remove_disk_from_charts(disk) + + def scan(self): + self.debug('scanning {0}'.format(self.log_path)) + current_time = time() + + for full_name in os.listdir(self.log_path): + disk = self.create_disk_from_file(full_name, 
current_time) + if not disk: + continue + self.disks.append(disk) + + return len(self.disks) + + def create_disk_from_file(self, full_name, current_time): + name = os.path.basename(full_name).split('.')[-3] + path = os.path.join(self.log_path, full_name) + + if name in self.disks: + return None + + if [p for p in self.exclude if p in name]: + return None + + if not full_name.endswith(CSV): + self.debug('skipping {0}: not a csv file'.format(full_name)) + return None + + if not os.access(path, os.R_OK): + self.debug('skipping {0}: not readable'.format(full_name)) + return None + + if os.path.getsize(path) == 0: + self.debug('skipping {0}: zero size'.format(full_name)) + return None + + if (current_time - os.path.getmtime(path)) / 60 > self.age: + self.debug('skipping {0}: haven\'t been updated for last {1} minutes'.format(full_name, self.age)) + return None + + if ATA in full_name: + disk = ATADisk(name, DiskLogFile(path)) + elif SCSI in full_name: + disk = SCSIDisk(name, DiskLogFile(path)) + else: + self.debug('skipping {0}: unknown type'.format(full_name)) + return None + + disk.populate_attrs() + if not disk.attrs: + self.error('skipping {0}: parsing failed'.format(full_name)) + return None + + self.debug('added {0}'.format(full_name)) + return disk + + def add_disk_to_charts(self, disk): + if len(self.charts) == 0 or disk.charted: + return + disk.charted = True + + for attr in disk.attrs: + chart_id = CHARTED_ATTRS.get(attr.name) + + if not chart_id or chart_id not in self.charts: + continue + + chart = self.charts[chart_id] + dim = [ + '{0}_{1}'.format(disk.name, attr.name), + disk.name, + CHARTS[chart_id]['algo'], + ] + + if dim[0] in self.charts[chart_id].dimensions: + chart.hide_dimension(dim[0], reverse=True) + else: + chart.add_dimension(dim) + + def remove_disk_from_charts(self, disk): + if len(self.charts) == 0 or not disk.charted: + return + + for attr in disk.attrs: + chart_id = CHARTED_ATTRS.get(attr.name) + + if not chart_id or chart_id not in self.charts: + continue - if not disk.is_active(): - disk.status = False - if not disk.status: - for chart in self.charts: - dimension_id = '_'.join([disk.name, chart.id[8:]]) - chart.hide_dimension(dimension_id=dimension_id) - - self.disks = [disk for disk in self.disks if disk.status] - - def scan(self, only_new=None): - new_disks = list() - for f in os.listdir(self.log_path): - full_path = os.path.join(self.log_path, f) - - if DiskLogFile.is_valid(full_path, self.exclude): - disk = Disk(full_path, self.age) - - active = disk.is_active() - if active is None: - continue - if active: - if not only_new: - new_disks.append(disk) - else: - if disk not in self.disks: - new_disks.append(disk) - else: - if not only_new: - self.debug("'{disk}' not updated in the last {age} minutes, " - "skipping it.".format(disk=disk.name, age=self.age)) - return new_disks + # TODO: can't delete dimension + self.charts[chart_id].hide_dimension('{0}_{1}'.format(disk.name, attr.name)) diff --git a/collectors/python.d.plugin/smartd_log/smartd_log.conf b/collectors/python.d.plugin/smartd_log/smartd_log.conf index 3fab3f1c..ab7f45b0 100644 --- a/collectors/python.d.plugin/smartd_log/smartd_log.conf +++ b/collectors/python.d.plugin/smartd_log/smartd_log.conf @@ -63,28 +63,7 @@ # # Additionally to the above, smartd_log also supports the following: # -# log_path: '/path/to/smartdlogs' # path to smartd log files. Default is /var/log/smartd -# raw_values: yes # enable/disable raw values charts. Enabled by default. 
-# smart_attributes: '1 2 3 4 44' # smart attributes charts. Default are ['1', '4', '5', '7', '9', '12', '193', '194', '197', '198', '200']. -# exclude_disks: 'PATTERN1 PATTERN2' # space separated patterns. If the pattern is in the drive name, the module will not collect data for it. -# -# ---------------------------------------------------------------------- -# Additional information -# Plugin reads smartd log files (-A option). -# You need to add (man smartd) to /etc/default/smartmontools '-i 600 -A /var/log/smartd/' to pass additional options to smartd on startup -# Then restart smartd service and check /path/log/smartdlogs -# ls /var/log/smartd/ -# CDC_WD10EZEX_00BN5A0-WD_WCC3F7FLVZS9.ata.csv WDC_WD10EZEX_00BN5A0-WD_WCC3F7FLVZS9.ata.csv ZDC_WD10EZEX_00BN5A0-WD_WCC3F7FLVZS9.ata.csv -# -# Smartd APPEND logs at every run. Its NOT RECOMMENDED to set '-i' option below 60 sec. -# STRONGLY RECOMMENDED to create smartd conf file for logrotate -# -# RAW vs NORMALIZED values -# "Normalized value", commonly referred to as just "value". This is a most universal measurement, on the scale from 0 (bad) to some maximum (good) value. -# Maximum values are typically 100, 200 or 253. Rule of thumb is: high values are good, low values are bad. -# -# "Raw value" - the value of the attribute as it is tracked by the device, before any normalization takes place. -# Some raw numbers provide valuable insight when properly interpreted. These cases will be discussed later on. -# Raw values are typically listed in hexadecimal numbers. The raw value has different structure for different vendors and is often not meaningful as a decimal number. +# log_path: '/path/to/smartd_logs' # path to smartd log files. Default is /var/log/smartd +# exclude_disks: 'PATTERN1 PATTERN2' # space separated patterns. If the pattern is in the drive name, the module will not collect data for it. # # ---------------------------------------------------------------------- diff --git a/collectors/python.d.plugin/springboot/README.md b/collectors/python.d.plugin/springboot/README.md index 008436a4..a1817cc2 100644 --- a/collectors/python.d.plugin/springboot/README.md +++ b/collectors/python.d.plugin/springboot/README.md @@ -1,40 +1,10 @@ # springboot This module will monitor one or more Java Spring-boot applications depending on configuration. - -It produces following charts: - -1. **Response Codes** in requests/s - * 1xx - * 2xx - * 3xx - * 4xx - * 5xx - * others - -2. **Threads** - * daemon - * total - -3. **GC Time** in milliseconds and **GC Operations** in operations/s - * Copy - * MarkSweep - * ... - -4. **Heap Mmeory Usage** in KB - * used - * committed - -### configuration - -Please see the [Monitoring Java Spring Boot Applications](https://github.com/netdata/netdata/wiki/Monitoring-Java-Spring-Boot-Applications) page for detailed info about module configuration. - ---- - -# Monitoring Java Spring Boot Applications - Netdata can be used to monitor running Java [Spring Boot](https://spring.io/) applications that expose their metrics with the use of the **Spring Boot Actuator** included in Spring Boot library. 
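To make the data flow concrete, here is a minimal, hypothetical sketch (not the shipped springboot module) of polling a Spring Boot 1.x actuator endpoint from Python. The endpoint URL and the `heap.used`/`heap.committed` metric names are assumptions taken from the heap chart described in this README; hedge accordingly if your actuator exposes different keys.

```python
# Hypothetical illustration only - not the shipped springboot module.
# A Spring Boot 1.x actuator /metrics endpoint returns a flat JSON map of
# metric names to values; the URL and the heap.* keys below are assumptions.
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2


def fetch_heap_metrics(url='http://127.0.0.1:8080/metrics'):
    raw = urlopen(url).read().decode('utf-8')
    metrics = json.loads(raw)
    # pick the two dimensions charted as "Heap Memory Usage"
    return {
        'heap_used': metrics.get('heap.used'),
        'heap_committed': metrics.get('heap.committed'),
    }


if __name__ == '__main__':
    print(fetch_heap_metrics())
```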
+## Configuration + The Spring Boot Actuator exposes these metrics over HTTP and is very easy to use: * add `org.springframework.boot:spring-boot-starter-actuator` to your application dependencies * set `endpoints.metrics.sensitive=false` in your `application.properties` @@ -93,7 +63,30 @@ public class HeapPoolMetrics implements PublicMetrics { Please refer [Spring Boot Actuator: Production-ready features](https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready.html) and [81. Actuator - Part IX. ‘How-to’ guides](https://docs.spring.io/spring-boot/docs/current/reference/html/howto-actuator.html) for more information. -## Using netdata springboot module +## Charts + +1. **Response Codes** in requests/s + * 1xx + * 2xx + * 3xx + * 4xx + * 5xx + * others + +2. **Threads** + * daemon + * total + +3. **GC Time** in milliseconds and **GC Operations** in operations/s + * Copy + * MarkSweep + * ... + +4. **Heap Mmeory Usage** in KB + * used + * committed + +## Usage The springboot module is enabled by default. It looks up `http://localhost:8080/metrics` and `http://127.0.0.1:8080/metrics` to detect Spring Boot application by default. You can change it by editing `/etc/netdata/python.d/springboot.conf` (to edit it on your system run `/etc/netdata/edit-config python.d/springboot.conf`). @@ -126,4 +119,4 @@ You can disable the default charts by set `defaults.<chart-id>: false`. The dimension name of extras charts should replace `.` to `_`. -Please check [springboot.conf](springboot.conf) for more examples.
\ No newline at end of file +Please check [springboot.conf](springboot.conf) for more examples. diff --git a/collectors/python.d.plugin/tor/Makefile.inc b/collectors/python.d.plugin/tor/Makefile.inc new file mode 100644 index 00000000..5a45f9b7 --- /dev/null +++ b/collectors/python.d.plugin/tor/Makefile.inc @@ -0,0 +1,13 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_python_DATA += tor/tor.chart.py +dist_pythonconfig_DATA += tor/tor.conf + +# do not install these files, but include them in the distribution +dist_noinst_DATA += tor/README.md tor/Makefile.inc + diff --git a/collectors/python.d.plugin/tor/README.md b/collectors/python.d.plugin/tor/README.md new file mode 100644 index 00000000..4a883373 --- /dev/null +++ b/collectors/python.d.plugin/tor/README.md @@ -0,0 +1,46 @@ +# tor + +Module connects to tor control port to collect traffic statistics. + +**Requirements:** +* `tor` program +* `stem` python package + +It produces only one chart: + +1. **Traffic** + * read + * write + +### configuration + +Needs only `control_port` + +Here is an example for local server: + +```yaml +update_every : 1 +priority : 60000 + +local_tcp: + name: 'local' + control_port: 9051 + +local_socket: + name: 'local' + control_port: '/var/run/tor/control' +``` + +### prerequisite + +Add to `/etc/tor/torrc`: + +``` +ControlPort 9051 +``` + +For more options please read the manual. + +Without configuration, module attempts to connect to `127.0.0.1:9051`. + +--- diff --git a/collectors/python.d.plugin/tor/tor.chart.py b/collectors/python.d.plugin/tor/tor.chart.py new file mode 100644 index 00000000..b77632bd --- /dev/null +++ b/collectors/python.d.plugin/tor/tor.chart.py @@ -0,0 +1,108 @@ +# -*- coding: utf-8 -*- +# Description: adaptec_raid netdata python.d module +# Author: Federico Ceratto <federico.ceratto@gmail.com> +# Author: Ilya Mashchenko (l2isbad) +# SPDX-License-Identifier: GPL-3.0-or-later + + +from bases.FrameworkServices.SimpleService import SimpleService + +try: + import stem + import stem.connection + import stem.control + STEM_AVAILABLE = True +except ImportError: + STEM_AVAILABLE = False + + +DEF_PORT = 'default' + +ORDER = [ + 'traffic', +] + +CHARTS = { + 'traffic': { + 'options': [None, 'Tor Traffic', 'KB/s', 'traffic', 'tor.traffic', 'area'], + 'lines': [ + ['read', 'read', 'incremental', 1, 1024], + ['write', 'write', 'incremental', 1, -1024], + ] + } +} + + +class Service(SimpleService): + """Provide netdata service for Tor""" + def __init__(self, configuration=None, name=None): + super(Service, self).__init__(configuration=configuration, name=name) + self.order = ORDER + self.definitions = CHARTS + + self.port = self.configuration.get('control_port', DEF_PORT) + self.password = self.configuration.get('password') + + self.use_socket = isinstance(self.port, str) and self.port != DEF_PORT and not self.port.isdigit() + self.conn = None + self.alive = False + + def check(self): + if not STEM_AVAILABLE: + self.error('the stem library is missing') + return False + + return self.connect() + + def get_data(self): + if not self.alive and not self.reconnect(): + return None + + data = dict() + + try: + data['read'] = self.conn.get_info('traffic/read') + data['write'] = self.conn.get_info('traffic/written') + except stem.ControllerError as error: + self.debug(error) + self.alive = False + + return data or None + + def 
authenticate(self): + try: + self.conn.authenticate(password=self.password) + except stem.connection.AuthenticationFailure as error: + self.error('authentication error: {0}'.format(error)) + return False + return True + + def connect_via_port(self): + try: + self.conn = stem.control.Controller.from_port(port=self.port) + except (stem.SocketError, ValueError) as error: + self.error(error) + + def connect_via_socket(self): + try: + self.conn = stem.control.Controller.from_socket_file(path=self.port) + except (stem.SocketError, ValueError) as error: + self.error(error) + + def connect(self): + if self.conn: + self.conn.close() + self.conn = None + + if self.use_socket: + self.connect_via_socket() + else: + self.connect_via_port() + + if self.conn and self.authenticate(): + self.alive = True + + return self.alive + + def reconnect(self): + return self.connect() diff --git a/collectors/python.d.plugin/tor/tor.conf b/collectors/python.d.plugin/tor/tor.conf new file mode 100644 index 00000000..8245414f --- /dev/null +++ b/collectors/python.d.plugin/tor/tor.conf @@ -0,0 +1,79 @@ +# netdata python.d.plugin configuration for tor +# +# This file is in YaML format. Generally the format is: +# +# name: value +# +# There are 2 sections: +# - global variables +# - one or more JOBS +# +# JOBS allow you to collect values from multiple sources. +# Each source will have its own set of charts. +# +# JOB parameters have to be indented (using spaces only, example below). + +# ---------------------------------------------------------------------- +# Global Variables +# These variables set the defaults for all JOBs, however each JOB +# may define its own, overriding the defaults. + +# update_every sets the default data collection frequency. +# If unset, the python.d.plugin default is used. +# update_every: 1 + +# priority controls the order of charts at the netdata dashboard. +# Lower numbers move the charts towards the top of the page. +# If unset, the default for python.d.plugin is used. +# priority: 60000 + +# retries sets the number of retries to be made in case of failures. +# If unset, the default for python.d.plugin is used. +# Attempts to restore the service are made once every update_every +# and only if the module has collected values in the past. +# retries: 60 + +# autodetection_retry sets the job re-check interval in seconds. +# The job is not deleted if check fails. +# Attempts to start the job are made once every autodetection_retry. +# This feature is disabled by default. +# autodetection_retry: 0 + +# ---------------------------------------------------------------------- +# JOBS (data collection sources) +# +# The default JOBS share the same *name*. JOBS with the same name +# are mutually exclusive. Only one of them will be allowed running at +# any time. This allows autodetection to try several alternatives and +# pick the one that works. +# +# Any number of jobs is supported. +# +# All python.d.plugin JOBS (for all its modules) support a set of +# predefined parameters. 
These are: +# +# job_name: +# name: myname # the JOB's name as it will appear at the +# # dashboard (by default is the job_name) +# # JOBs sharing a name are mutually exclusive +# update_every: 1 # the JOB's data collection frequency +# priority: 60000 # the JOB's order on the dashboard +# retries: 10 # the JOB's number of restoration attempts +# autodetection_retry: 0 # the JOB's re-check interval in seconds +# +# Additionally to the above, tor plugin also supports the following: +# +# control_port: 'port' # tor control port +# password: 'password' # tor control password +# +# ---------------------------------------------------------------------- +# AUTO-DETECTION JOBS +# only one of them will run (they have the same name) +# +# local_tcp: +# name: 'local' +# control_port: 9051 +# +# local_socket: +# name: 'local' +# control_port: '/var/run/tor/control' diff --git a/collectors/python.d.plugin/web_log/README.md b/collectors/python.d.plugin/web_log/README.md index 6e8ea1dd..e25a03fb 100644 --- a/collectors/python.d.plugin/web_log/README.md +++ b/collectors/python.d.plugin/web_log/README.md @@ -1,17 +1,86 @@ # web_log -Tails the apache/nginx/lighttpd/gunicorn log files to collect real-time web-server statistics. +## Motivation -It produces following charts: +Web server log files exist for more than 20 years. All web servers of all kinds, from all vendors, [since the time NCSA httpd was powering the web](https://en.wikipedia.org/wiki/NCSA_HTTPd), produce log files, saving in real-time all accesses to web sites and APIs. -1. **Response by type** requests/s +Yet, after the appearance of google analytics and similar services, and the recent rise of APM (Application Performance Monitoring) with sophisticated time-series databases that collect and analyze metrics at the application level, all these web server log files are mostly just filling our disks, rotated every night without any use whatsoever. + +netdata turns this "useless" log file, into a powerful performance and health monitoring tool, capable of detecting, **in real-time**, most common web server problems, such as: + +- too many redirects (i.e. **oops!** *this should not redirect clients to itself*) +- too many bad requests (i.e. **oops!** *a few files were not uploaded*) +- too many internal server errors (i.e. **oops!** *this release crashes too much*) +- unreasonably too many requests (i.e. **oops!** *we are under attack*) +- unreasonably few requests (i.e. **oops!** *call the network guys*) +- unreasonably slow responses (i.e. **oops!** *the database is slow again*) +- too few successful responses (i.e. **oops!** *help us God!*) + +## Usage + +If netdata is installed on a system running a web server, it will detect it and it will automatically present a series of charts, with information obtained from the web server API, like these (*these do not come from the web server log file*): + +![image](https://cloud.githubusercontent.com/assets/2662304/22900686/e283f636-f237-11e6-93d2-cbdf63de150c.png) +*[**netdata**](https://my-netdata.io/) charts based on metrics collected by querying the `nginx` API (i.e. `/stab_status`).* + +> [**netdata**](https://my-netdata.io/) supports `apache`, `nginx`, `lighttpd` and `tomcat`. To obtain real-time information from a web server API, the web server needs to expose it. For directions on configuring your web server, check the config files for each web server. There is a directory with a config file for each web server under [`/etc/netdata/python.d/`](../). 
+ +## Configuration + +[**netdata**](https://my-netdata.io/) has a powerful `web_log` plugin, capable of incrementally parsing any number of web server log files. This plugin is automatically started with [**netdata**](https://my-netdata.io/) and comes, pre-configured, for finding web server log files on popular distributions. Its configuration is at [`/etc/netdata/python.d/web_log.conf`](web_log.conf), like this: + + +```yaml +nginx_log: + name : 'nginx_log' + path : '/var/log/nginx/access.log' + +apache_log: + name : 'apache_log' + path : '/var/log/apache/other_vhosts_access.log' + categories: + cacti : 'cacti.*' + observium : 'observium' +``` + +Theodule has preconfigured jobs for nginx, apache and gunicorn on various distros. +You can add one such section, for each of your web server log files. + +> **Important**<br/>Keep in mind [**netdata**](https://my-netdata.io/) runs as user `netdata`. So, make sure user `netdata` has access to the logs directory and can read the log file. + +## Charts + +Once you have all log files configured and [**netdata**](https://my-netdata.io/) restarted, **for each log file** you will get a section at the [**netdata**](https://my-netdata.io/) dashboard, with the following charts. + +### responses by status + +In this chart we tried to provide a meaningful status for all responses. So: + +- `success` counts all the valid responses (i.e. `1xx` informational, `2xx` successful and `304` not modified). +- `error` are `5xx` internal server errors. These are very bad, they mean your web site or API is facing difficulties. +- `redirect` are `3xx` responses, except `304`. All `3xx` are redirects, but `304` means "not modified" - it tells the browsers the content they already have is still valid and can be used as-is. So, we decided to account it as a successful response. +- `bad` are bad requests that cannot be served. +- `other` as all the other, non-standard, types of responses. + +![image](https://cloud.githubusercontent.com/assets/2662304/22902194/ea0affc6-f23c-11e6-85f1-a4951dd4bb40.png) + +### Responses by type + +Then, we group all responses by code family, without interpreting their meaning. +**Response by type** requests/s * success (1xx, 2xx, 304) * error (5xx) * redirect (3xx except 304) * bad (4xx) * other (all other responses) -2. **Response by code family** requests/s +![image](https://cloud.githubusercontent.com/assets/2662304/22901883/dea7d33a-f23b-11e6-960d-00a913b58936.png) + +### Responses by code family + +Here we show all the response codes in detail. + +**Response by code family** requests/s * 1xx (informational) * 2xx (successful) * 3xx (redirect) @@ -19,46 +88,114 @@ It produces following charts: * 5xx (internal server errors) * other (non-standart responses) * unmatched (the lines in the log file that are not matched) + + +![image](https://cloud.githubusercontent.com/assets/2662304/22901965/1a5d84ba-f23c-11e6-9d38-3deebcc8b879.png) + +>**Important**<br/>If your application is using hundreds of non-standard response codes, your browser may become slow while viewing this chart, so we have added a configuration [option to disable this chart](https://github.com/netdata/netdata/blob/419cd0a237275e5eeef3f92dcded84e735ee6c58/conf.d/python.d/web_log.conf#L63). + +### Detailed Response Codes -3. **Detailed Response Codes** requests/s (number of responses for each response code family individually) +Number of responses for each response code family individually (requests/s) -4. 
**Bandwidth** KB/s +### bandwidth + +This is a nice view of the traffic the web server is receiving and is sending. + +What is important to know for this chart, is that the bandwidth used for each request and response is accounted at the time the log is written. Since [**netdata**](https://my-netdata.io/) refreshes this chart every single second, you may have unrealistic spikes is the size of the requests or responses is too big. The reason is simple: a response may have needed 1 minute to be completed, but all the bandwidth used during that minute for the specific response will be accounted at the second the log line is written. + +As the legend on the chart suggests, you can use FireQoS to setup QoS on the web server ports and IPs to accurately measure the bandwidth the web server is using. Actually, [there may be a few more reasons to install QoS on your servers](../../tc.plugin/#tcplugin)... + +**Bandwidth** KB/s * received (bandwidth of requests) * send (bandwidth of responses) + +![image](https://cloud.githubusercontent.com/assets/2662304/22902266/245141d6-f23d-11e6-90f9-98729733e0da.png) + +> **Important**<br/>Most web servers do not log the request size by default.<br/>So, [unless you have configured your web server to log the size of requests](https://github.com/netdata/netdata/blob/419cd0a237275e5eeef3f92dcded84e735ee6c58/conf.d/python.d/web_log.conf#L76-L89), the `received` dimension will be always zero. -5. **Timings** ms (request processing time) +### timings + +[**netdata**](https://my-netdata.io/) will also render the `minimum`, `average` and `maximum` time the web server needed to respond to requests. + +Keep in mind most web servers timings start at the reception of the full request, until the dispatch of the last byte of the response. So, they include network latencies of responses, but they do not include network latencies of requests. + +**Timings** ms (request processing time) * min (bandwidth of requests) * max (bandwidth of responses) * average (bandwidth of responses) + +![image](https://cloud.githubusercontent.com/assets/2662304/22902283/369e3f92-f23d-11e6-9359-53e5d4ecb18e.png) -6. **Request per url** requests/s (configured by user) +> **Important**<br/>Most web servers do not log timing information by default.<br/>So, [unless you have configured your web server to also log timings](https://github.com/netdata/netdata/blob/419cd0a237275e5eeef3f92dcded84e735ee6c58/conf.d/python.d/web_log.conf#L76-L89), this chart will not exist. -7. **Http Methods** requests/s (requests per http method) +### URL patterns -8. **Http Versions** requests/s (requests per http version) +This is a very interesting chart. It is configured entirely by you. -9. **IP protocols** requests/s (requests per ip protocol version) +[**netdata**](https://my-netdata.io/) can map the URLs found in the log file into categories. You can define these categories, by providing names and regular expressions in `web_log.conf`. -10. **Current Poll Unique Client IPs** unique ips/s (unique client IPs per data collection iteration) +So, this configuration: -11. 
**All Time Unique Client IPs** unique ips/s (unique client IPs since the last restart of netdata) +```yaml +nginx_netdata: # name the charts + path: '/var/log/nginx/access.log' # web server log file + categories: + badges : '^/api/v1/badge\.svg' + charts : '^/api/v1/(data|chart|charts)' + registry : '^/api/v1/registry' + alarms : '^/api/v1/alarm' + allmetrics : '^/api/v1/allmetrics' + api_other : '^/api/' + netdata_conf: '^/netdata.conf' + api_old : '^/(data|datasource|graph|list|all\.json)' +``` +Produces the following chart. The `categories` section is matched in the order given. So, pay attention to the order you give your patterns. -### configuration +![image](https://cloud.githubusercontent.com/assets/2662304/22902302/4d25bf06-f23d-11e6-844d-18c0876bdc3d.png) -```yaml -nginx_log: - name : 'nginx_log' - path : '/var/log/nginx/access.log' +### HTTP versions -apache_log: - name : 'apache_log' - path : '/var/log/apache/other_vhosts_access.log' - categories: - cacti : 'cacti.*' - observium : 'observium' -``` +This chart breaks down requests by HTTP version used. + +![image](https://cloud.githubusercontent.com/assets/2662304/22902323/5ee376d4-f23d-11e6-8457-157d3f438843.png) + +### IP versions + +This one provides requests per IP version used by the clients (`IPv4`, `IPv6`). + +![image](https://cloud.githubusercontent.com/assets/2662304/22902370/7091a770-f23d-11e6-8cd2-74e9a67b1397.png) + +### Unique clients + +The last charts are about the unique IPs accessing your web server. + +**Current Poll Unique Client IPs** unique ips/s. This one counts the unique IPs for each data collection iteration (i.e. **unique clients per second**). + +![image](https://cloud.githubusercontent.com/assets/2662304/22902384/835aa168-f23d-11e6-914f-cfc3f06eaff8.png) + +**All Time Unique Client IPs** unique ips/s. Counts the unique IPs, since the last [**netdata**](https://my-netdata.io/) restart. + +![image](https://cloud.githubusercontent.com/assets/2662304/22902407/92dd27e6-f23d-11e6-900d-eede7bc08e64.png) + +>**Important**<br/>To provide this information `web_log` plugin keeps in memory all the IPs seen by the web server. Although this does not require so much memory, if you have a web server with several million unique client IPs, we suggest to [disable this chart](https://github.com/netdata/netdata/blob/419cd0a237275e5eeef3f92dcded84e735ee6c58/conf.d/python.d/web_log.conf#L64). + + +## Alarms + +The magic of [**netdata**](https://my-netdata.io/) is that all metrics are collected per second, and all metrics can be used or correlated to provide real-time alarms. Out of the box, [**netdata**](https://my-netdata.io/) automatically attaches the [following alarms](../../../health/health.d/web_log.conf) to all `web_log` charts (i.e. to all log files configured, individually): + +alarm|description|minimum<br/>requests|warning|critical +:-------|-------|:------:|:-----:|:------: +`1m_redirects`|The ratio of HTTP redirects (3xx except 304) over all the requests, during the last minute.<br/> <br/>*Detects if the site or the web API is suffering from too many or circular redirects.*<br/> <br/>(i.e. **oops!** *this should not redirect clients to itself*)|120/min|> 20%|> 30% +`1m_bad_requests`|The ratio of HTTP bad requests (4xx) over all the requests, during the last minute.<br/> <br/>*Detects if the site or the web API is receiving too many bad requests, including `404`, not found.*<br/> <br/>(i.e. 
**oops!** *a few files were not uploaded*)|120/min|> 30%|> 50% +`1m_internal_errors`|The ratio of HTTP internal server errors (5xx), over all the requests, during the last minute.<br/> <br/>*Detects if the site is facing difficulties to serve requests.*<br/> <br/>(i.e. **oops!** *this release crashes too much*)|120/min|> 2%|> 5% +`5m_requests_ratio`|The percentage of successful web requests of the last 5 minutes, compared with the previous 5 minutes.<br/> <br/>*Detects if the site or the web API is suddenly getting too many or too few requests.*<br/> <br/>(i.e. too many = **oops!** *we are under attack*)<br/>(i.e. too few = **oops!** *call the network guys*)|120/5min|> double or < half|> 4x or < 1/4x +`web_slow`|The average time to respond to requests, over the last 1 minute, compared to the average of last 10 minutes.<br/> <br/>*Detects if the site or the web API is suddenly a lot slower.*<br/> <br/>(i.e. **oops!** *the database is slow again*)|120/min|> 2x|> 4x +`1m_successful`|The ratio of successful HTTP responses (1xx, 2xx, 304) over all the requests, during the last minute.<br/> <br/>*Detects if the site or the web API is performing within limits.*<br/> <br/>(i.e. **oops!** *help us God!*)|120/min|< 85%|< 75% + +The column `minimum requests` state the minimum number of requests required for the alarm to be evaluated. We found that when the site is receiving requests above this rate, these alarms are pretty accurate (i.e. no false-positives). -Module has preconfigured jobs for nginx, apache and gunicorn on various distros. +[**netdata**](https://my-netdata.io/) alarms are user configurable. Sample config files can be found under directory `health/health.d` of the netdata github repository. So, even [`web_log` alarms can be adapted to your needs](../../../health/health.d/web_log.conf). ---- diff --git a/collectors/tc.plugin/README.md b/collectors/tc.plugin/README.md index 6670c491..a8b151de 100644 --- a/collectors/tc.plugin/README.md +++ b/collectors/tc.plugin/README.md @@ -18,75 +18,69 @@ dynamically creates. ## Motivation -One category of metrics missing in Linux monitoring, is bandwidth consumption for each open -socket (inbound and outbound traffic). So, you cannot tell how much bandwidth your web server, -your database server, your backup, your ssh sessions, etc are using. +One category of metrics missing in Linux monitoring, is bandwidth consumption for each open socket (inbound and outbound traffic). So, you cannot tell how much bandwidth your web server, your database server, your backup, your ssh sessions, etc are using. -To solve this problem, the most *adventurous* Linux monitoring tools install kernel modules to -capture all traffic, analyze it and provide reports per application. A lot of work, CPU intensive -and with a great degree of risk (due to the kernel modules involved which might affect the -stability of the whole system). Not to mention that such solutions are probably better suited -for a core linux router in your network. +To solve this problem, the most *adventurous* Linux monitoring tools install kernel modules to capture all traffic, analyze it and provide reports per application. A lot of work, CPU intensive and with a great degree of risk (due to the kernel modules involved which might affect the stability of the whole system). Not to mention that such solutions are probably better suited for a core linux router in your network. -Others use NFACCT, the netfilter accounting module which is already part of the Linux firewall. 
-However, this would require configuring a firewall on every system you want to measure bandwidth. +Others use NFACCT, the netfilter accounting module which is already part of the Linux firewall. However, this would require configuring a firewall on every system you want to measure bandwidth (just FYI, I do install a firewall on every server - and I strongly advise you to do so too - but configuring accounting on all servers seems overkill when you don't really need it for billing purposes). -QoS monitoring attempts to solve this in a much cleaner way. +**There is however a much simpler approach**. -## Introduction to QoS +## QoS -One of the features the Linux kernel has, but it is rarely used, is its ability to -**apply QoS on traffic**. Even most interesting is that it can apply QoS to **both inbound and -outbound traffic**. +One of the features the Linux kernel has, but it is rarely used, is its ability to **apply QoS on traffic**. Even most interesting is that it can apply QoS to **both inbound and outbound traffic**. QoS is about 2 features: 1. **Classify traffic** - Classification is the process of organizing traffic in groups, called **classes**. - Classification can evaluate every aspect of network packets, like source and destination ports, - source and destination IPs, netfilter marks, etc. + Classification is the process of organizing traffic in groups, called **classes**. Classification can evaluate every aspect of network packets, like source and destination ports, source and destination IPs, netfilter marks, etc. - When you classify traffic, you just assign a label to it. For example **I call `web server` - traffic, the traffic from my server's tcp/80, tcp/443 and to my server's tcp/80, tcp/443, - while I call `web surfing` all other tcp/80 and tcp/443 traffic**. You can use any combinations - you like. There is no limit. + When you classify traffic, you just assign a label to it. Of course classes have some properties themselves (like queuing mechanisms), but let's say it is that simple: **a label**. For example **I call `web server` traffic, the traffic from my server's tcp/80, tcp/443 and to my server's tcp/80, tcp/443, while I call `web surfing` all other tcp/80 and tcp/443 traffic**. You can use any combinations you like. There is no limit. 2. **Apply traffic shaping rules to these classes** - Traffic shaping is used to control how network interface bandwidth should be shared among the - classes. Of course we are not interested for this feature to just monitor the traffic. - Classification will be enough for monitoring everything. + Traffic shaping is used to control how network interface bandwidth should be shared among the classes. Normally, you need to do this, when there is not enough bandwidth to satisfy all the demand, or when you want to control the supply of bandwidth to certain services. Of course classification is sufficient for monitoring traffic, but traffic shaping is also quite important, as we will explain in the next section. -The key reasons of applying QoS on all servers (even cloud ones) are: +## Why you want QoS - - **ensure administrative tasks (like ssh, dns, etc) will always have a small but guaranteed - bandwidth.** QoS can guarantee that services like ssh, dns, ntp, etc will always have a small - supply of bandwidth. So, no matter what happens, you will be able to ssh to your server and - DNS will always work. +1. 
**Monitoring the bandwidth used by services**
 
-  - **ensure other administrative tasks will not monopolize all the available bandwidth.**
-    Services like backups, file copies, database dumps, etc can easily monopolize all the
-    available bandwidth. It is common for example a nightly backup, or a huge file transfer
-    to negatively influence the end-user experience. QoS can fix that.
+   netdata provides wonderful real-time charts, like this one (wait to see the orange `rsync` part):
 
-  - **ensure each end-user connection will get a fair cut of the available bandwidth.**
-    Several QoS queuing disciplines in Linux do this automatically, without any configuration from you.
-    The result is that new sockets are favored over older ones, so that users will get a snappier
-    experience, while others are transferring large amounts of traffic.
-
-  - **protect the servers from DDoS attacks.**
-    When your system is under a DDoS attack, it will get a lot more bandwidth compared to the one it
-    can handle and probably your applications will crash. Setting a limit on the inbound traffic using
-    QoS, will protect your servers (throttle the requests) and depending on the size of the attack may
-    allow your legitimate users to access the server, while the attack is taking place.
+   ![qos3](https://cloud.githubusercontent.com/assets/2662304/14474189/713ede84-0104-11e6-8c9c-8dca5c2abd63.gif)
+2. **Ensure sensitive administrative tasks will not starve for bandwidth**
-Once **traffic classification** is applied, netdata can visualize the bandwidth consumption per
-class in real-time (no configuration is needed for netdata - it will figure it out).
+   Have you tried to ssh to a server when the network is congested? If you have, you already know it does not work very well. QoS can guarantee that services like ssh, dns, ntp, etc will always have a small supply of bandwidth. So, no matter what happens, you will be able to ssh to your server and DNS will always work.
-QoS, is extremely light. You will configure it once, and this is it. It will not bother you again
-and it will not use any noticeable CPU resources, especially on application and database servers.
+3. **Ensure administrative tasks will not monopolize all the bandwidth**
+
+   Services like backups, file copies, database dumps, etc can easily monopolize all the available bandwidth. It is common for example a nightly backup, or a huge file transfer to negatively influence the end-user experience. QoS can fix that.
+
+4. **Ensure each end-user connection will get a fair cut of the available bandwidth.**
+
+   Several QoS queuing disciplines in Linux do this automatically, without any configuration from you. The result is that new sockets are favored over older ones, so that users will get a snappier experience, while others are transferring large amounts of traffic.
+
+5. **Protect the servers from DDoS attacks.**
+
+   When your system is under a DDoS attack, it will get a lot more bandwidth compared to the one it can handle and probably your applications will crash. Setting a limit on the inbound traffic using QoS, will protect your servers (throttle the requests) and depending on the size of the attack may allow your legitimate users to access the server, while the attack is taking place.
+
+   Using QoS together with a [SYNPROXY](../proc.plugin/README.md#linux-anti-ddos) will provide a great degree of protection against most DDoS attacks. Actually when I wrote that article, a few folks tried to DDoS the netdata demo site to see in real-time the SYNPROXY operation. They did not do it right, but anyway a great deal of requests reached the netdata server. What saved netdata was QoS. The netdata demo server has QoS installed, so the requests were throttled and the server did not even reach the point of resource starvation. Read about it [here](../proc.plugin/README.md#linux-anti-ddos).
+
+On top of all these, QoS is extremely light. You will configure it once, and this is it. It will not bother you again and it will not use any noticeable CPU resources, especially on application and database servers.
+
+ - ensure administrative tasks (like ssh, dns, etc) will always have a small but guaranteed bandwidth. So, no matter what happens, I will be able to ssh to my server and DNS will work.
+
+ - ensure other administrative tasks will not monopolize all the available bandwidth. So, my nightly backup will not hurt my users, a developer that is copying files over the net will not get all the available bandwidth, etc.
+
+ - ensure each end-user connection will get a fair cut of the available bandwidth.
+
+Once **traffic classification** is applied, we can use **[netdata](https://github.com/netdata/netdata)** to visualize the bandwidth consumption per class in real-time (no configuration is needed for netdata - it will figure it out).
+
+QoS, is extremely light. You will configure it once, and this is it. It will not bother you again and it will not use any noticeable CPU resources, especially on application and database servers.
+
+---
 
 ## QoS in Linux? Have you lost your mind?
@@ -94,28 +88,35 @@ Yes I know... but no, I have not!
 
 Of course, `tc` is probably **the most undocumented, complicated and unfriendly** command in Linux.
 
-For example, for matching a simple port range in `tc`, e.g. all the high ports, from 1025 to 65535
-inclusive, you have to match these:
+For example, do you know that for matching a simple port range in `tc`, e.g. all the high ports, from 1025 to 65535 inclusive, you have to match these:
 
 ```
-1025/0xffff 1026/0xfffe 1028/0xfffc 1032/0xfff8 1040/0xfff0
-1056/0xffe0 1088/0xffc0 1152/0xff80 1280/0xff00 1536/0xfe00
-2048/0xf800 4096/0xf000 8192/0xe000 16384/0xc000 32768/0x8000
+1025/0xffff
+1026/0xfffe
+1028/0xfffc
+1032/0xfff8
+1040/0xfff0
+1056/0xffe0
+1088/0xffc0
+1152/0xff80
+1280/0xff00
+1536/0xfe00
+2048/0xf800
+4096/0xf000
+8192/0xe000
+16384/0xc000
+32768/0x8000
 ```
 
 I know what you are thinking right now! **And I agree!**
 
-This is why I wrote **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**, a tool to
-simplify QoS management in Linux.
+This is why I wrote **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**, a tool to simplify QoS management in Linux.
 
-The **[FireHOL](https://firehol.org/)** package already distributes **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**.
-Check the **[FireQOS tutorial](https://firehol.org/tutorial/fireqos-new-user/)**
-to learn how to write your own QoS configuration.
+The **[FireHOL](https://firehol.org/)** package already distributes **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**. Check the **[FireQOS tutorial](https://firehol.org/tutorial/fireqos-new-user/)** to learn how to write your own QoS configuration.
 
-With **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**, it is **really simple for everyone
-to use QoS in Linux**. Just install the package `firehol`. It should already be available for your
-distribution. If not, check the **[FireHOL Installation Guide](https://firehol.org/installing/)**.
-After that, you will have the `fireqos` command.
+With **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**, it is **really simple for everyone to use QoS in Linux**. Just install the package `firehol`. It should already be available for your distribution. If not, check the **[FireHOL Installation Guide](https://firehol.org/installing/)**. After that, you will have the `fireqos` command which uses a configuration like the following:
+
+## QoS Configuration
 
 This is the file `/etc/firehol/fireqos.conf` we use at the netdata demo site:
 
@@ -157,14 +158,9 @@ This is the file `/etc/firehol/fireqos.conf` we use at the netdata demo site:
         match input src 10.2.3.5
 ```
 
-Nothing more is needed. You just run `fireqos start` to apply this configuration, restart netdata
-and you have real-time visualization of the bandwidth consumption of your applications. FireQOS is
-not a daemon. It will just convert the configuration to `tc` commands. It will run them and it will
-exit.
+Nothing more is needed. You just run `fireqos start` to apply this configuration, restart netdata and you have real-time visualization of the bandwidth consumption of your applications. FireQOS is not a daemon. It will just convert the configuration to `tc` commands. It will run them and it will exit.
 
-**IMPORTANT**: If you copy this configuration to apply it to your system, please adapt the
-speeds - experiment in non-production environments to learn the tool, before applying it on
-your servers.
+**IMPORTANT**: If you copy this configuration to apply it to your system, please adapt the speeds - experiment in non-production environments to learn the tool, before applying it on your servers.
 
 And this is what you are going to get:
 
@@ -174,10 +170,11 @@ And this is what you are going to get:
 
 ## More examples:
 
-This is QoS from a linux router. Check these features:
+This is QoS from my home linux router. Check these features:
 
 1. It is real-time (per second updates)
 2. QoS really works in Linux - check that the `background` traffic is squeezed when `surfing` needs it.
 
 ![test2](https://cloud.githubusercontent.com/assets/2662304/14093004/68966020-f553-11e5-98fe-ffee2086fafd.gif)
+
diff --git a/collectors/tc.plugin/tc-qos-helper.sh b/collectors/tc.plugin/tc-qos-helper.sh
index b49d1f50..a1a2b914 100644
--- a/collectors/tc.plugin/tc-qos-helper.sh
+++ b/collectors/tc.plugin/tc-qos-helper.sh
@@ -100,7 +100,7 @@ if [ ! -d "${fireqos_run_dir}" ]
 		warning "Although FireQoS is installed on this system as '${fireqos}', I cannot find/read its installation configuration at '${fireqos_exec_dir}/install.config'."
 	fi
 else
-	warning "FireQoS is not installed on this system. Use FireQoS to apply traffic QoS and expose the class names to netdata. Check https://github.com/netdata/netdata/wiki/You-should-install-QoS-on-all-your-servers"
+	warning "FireQoS is not installed on this system. Use FireQoS to apply traffic QoS and expose the class names to netdata. Check https://github.com/netdata/netdata/tree/master/collectors/tc.plugin#tcplugin"
 fi
 fi
diff --git a/collectors/tc.plugin/tc-qos-helper.sh.in b/collectors/tc.plugin/tc-qos-helper.sh.in
index 6f6b0a59..a15eab89 100755
--- a/collectors/tc.plugin/tc-qos-helper.sh.in
+++ b/collectors/tc.plugin/tc-qos-helper.sh.in
@@ -100,7 +100,7 @@ if [ ! -d "${fireqos_run_dir}" ]
 		warning "Although FireQoS is installed on this system as '${fireqos}', I cannot find/read its installation configuration at '${fireqos_exec_dir}/install.config'."
 	fi
 else
-	warning "FireQoS is not installed on this system. Use FireQoS to apply traffic QoS and expose the class names to netdata. Check https://github.com/netdata/netdata/wiki/You-should-install-QoS-on-all-your-servers"
+	warning "FireQoS is not installed on this system. Use FireQoS to apply traffic QoS and expose the class names to netdata. Check https://github.com/netdata/netdata/tree/master/collectors/tc.plugin#tcplugin"
 fi
 fi
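
To make the `tc` complexity mentioned in the README hunk above concrete: the fifteen `port/mask` pairs are what the `u32` classifier needs in order to cover destination ports 1025 to 65535. The sketch below is illustration only (it is not part of the patch and not how FireQOS itself generates rules); the device `eth0`, the qdisc handle `1:` and the class `1:20` are hypothetical and assumed to exist already.

```sh
#!/bin/sh
# Raw tc equivalent of the README's port/mask list (illustration only).
# Assumes an HTB qdisc "1:" with a class "1:20" is already configured on DEV.
DEV="eth0"

for pair in \
    1025/0xffff 1026/0xfffe 1028/0xfffc 1032/0xfff8 1040/0xfff0 \
    1056/0xffe0 1088/0xffc0 1152/0xff80 1280/0xff00 1536/0xfe00 \
    2048/0xf800 4096/0xf000 8192/0xe000 16384/0xc000 32768/0x8000
do
    port="${pair%/*}"   # value before the slash
    mask="${pair#*/}"   # mask after the slash
    # each filter matches packets where (dport & mask) == port
    tc filter add dev "${DEV}" parent 1:0 protocol ip prio 10 \
        u32 match ip dport "${port}" "${mask}" flowid 1:20
done
```

Fifteen filters for one port range is exactly the kind of boilerplate FireQOS hides behind a single human-readable rule, which is the point the README is making.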
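The `@@ -157,14 +158,9 @@` hunk shows only one context line of the demo site's `/etc/firehol/fireqos.conf` (`match input src 10.2.3.5`), so the full file cannot be reproduced here. For orientation, a minimal configuration in the spirit of the FireQOS tutorial linked in the README might look like the sketch below; the interface name, rate, class names and percentages are assumptions, not the demo site's real settings.

```sh
# /etc/firehol/fireqos.conf - minimal sketch (adapt the speeds to your link)
interface eth0 world bidirectional ethernet balanced rate 100Mbit

   class ssh commit 10%
      match tcp port 22

   class dns commit 5%
      match udp port 53

   class surfing commit 30%
      match tcp ports 80,443

   class default
```

Applying it follows the README's own workflow: run `fireqos start`, then restart netdata so it picks up the class names (for example `systemctl restart netdata` on systemd distributions, or however netdata is restarted on your system).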