Diffstat:
 collectors/python.d.plugin/nvidia_smi/Makefile.inc        |  13 +
 collectors/python.d.plugin/nvidia_smi/README.md           | 157 +
 collectors/python.d.plugin/nvidia_smi/metadata.yaml       | 166 +
 collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py | 651 +
 collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf     |  68 +
 5 files changed, 1055 insertions(+), 0 deletions(-)
diff --git a/collectors/python.d.plugin/nvidia_smi/Makefile.inc b/collectors/python.d.plugin/nvidia_smi/Makefile.inc
new file mode 100644
index 00000000..52fb25a6
--- /dev/null
+++ b/collectors/python.d.plugin/nvidia_smi/Makefile.inc
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+# THIS IS NOT A COMPLETE Makefile
+# IT IS INCLUDED BY ITS PARENT'S Makefile.am
+# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT
+
+# install these files
+dist_python_DATA += nvidia_smi/nvidia_smi.chart.py
+dist_pythonconfig_DATA += nvidia_smi/nvidia_smi.conf
+
+# do not install these files, but include them in the distribution
+dist_noinst_DATA += nvidia_smi/README.md nvidia_smi/Makefile.inc
+
diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md
new file mode 100644
index 00000000..7d45289a
--- /dev/null
+++ b/collectors/python.d.plugin/nvidia_smi/README.md
@@ -0,0 +1,157 @@
+<!--
+title: "Nvidia GPU monitoring with Netdata"
+custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/nvidia_smi/README.md"
+sidebar_label: "nvidia_smi-python.d.plugin"
+learn_status: "Published"
+learn_topic_type: "References"
+learn_rel_path: "Integrations/Monitor/Devices"
+-->
+
+# Nvidia GPU collector
+
+Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.
+
+## Requirements and Notes
+
+- You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support the tool (mostly the newer high-end models used for AI/ML and crypto, or the Pro range); read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
+- You must enable this plugin, as it's disabled by default due to minor performance issues:
+  ```bash
+  cd /etc/netdata # Replace this path with your Netdata config directory, if different
+  sudo ./edit-config python.d.conf
+  ```
+  Remove the '#' before `nvidia_smi` so it reads: `nvidia_smi: yes`.
+
+- On some systems, when the GPU is idle, the `nvidia-smi` tool unloads, so there is added latency when it is next queried. If you are running GPUs under constant workload this isn't likely to be an issue.
+- Currently the `nvidia-smi` tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See the discussion here: <https://github.com/netdata/netdata/pull/4357>
+- Contributions are welcome.
+- Make sure the `netdata` user can execute `/usr/bin/nvidia-smi`, or wherever your binary is.
+- If the `nvidia-smi` process [is not killed after netdata restart](https://github.com/netdata/netdata/issues/7143), you need to turn off `loop_mode`.
+- `poll_seconds` is an integer that sets how often, in seconds, the tool is polled.
+
+## Charts
+
+It produces the following charts:
+
+- PCI Express Bandwidth Utilization in `KiB/s`
+- Fan Speed in `percentage`
+- GPU Utilization in `percentage`
+- Memory Bandwidth Utilization in `percentage`
+- Encoder/Decoder Utilization in `percentage`
+- Memory Usage in `MiB`
+- Temperature in `celsius`
+- Clock Frequencies in `MHz`
+- Power Utilization in `Watts`
+- Memory Used by Each Process in `MiB`
+- Memory Used by Each User in `MiB`
+- Number of Users on GPU in `num`
+
+## Configuration
+
+Edit the `python.d/nvidia_smi.conf` configuration file using `edit-config` from the Netdata [config
+directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), which is typically at `/etc/netdata`.
+
+```bash
+cd /etc/netdata # Replace this path with your Netdata config directory, if different
+sudo ./edit-config python.d/nvidia_smi.conf
+```
+
+Sample:
+
+```yaml
+loop_mode    : yes
+poll_seconds : 1
+exclude_zero_memory_users : yes
+```
+
+
+### Troubleshooting
+
+To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The
+output will show you the data collection job results or the error messages on why the collector isn't working.
+
+First, navigate to your plugins directory; it is usually located under `/usr/libexec/netdata/plugins.d/`. If that's
+not the case on your system, open `netdata.conf` and look for the setting `plugins directory`. Once you're in the
+plugin's directory, switch to the `netdata` user.
+
+```bash
+cd /usr/libexec/netdata/plugins.d/
+sudo su -s /bin/bash netdata
+```
+
+Now you can manually run the `nvidia_smi` module in debug mode:
+
+```bash
+./python.d.plugin nvidia_smi debug trace
+```
+
+## Docker
+
+GPU monitoring in a Docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.
+
+Sample `docker-compose.yml`
+```yaml
+version: '3'
+services:
+  netdata:
+    image: netdata/netdata
+    container_name: netdata
+    hostname: example.com # set to fqdn of host
+    ports:
+      - 19999:19999
+    restart: unless-stopped
+    cap_add:
+      - SYS_PTRACE
+    security_opt:
+      - apparmor:unconfined
+    environment:
+      - NETDATA_EXTRA_APK_PACKAGES=gcompat
+    volumes:
+      - netdataconfig:/etc/netdata
+      - netdatalib:/var/lib/netdata
+      - netdatacache:/var/cache/netdata
+      - /etc/passwd:/host/etc/passwd:ro
+      - /etc/group:/host/etc/group:ro
+      - /proc:/host/proc:ro
+      - /sys:/host/sys:ro
+      - /etc/os-release:/host/etc/os-release:ro
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+
+volumes:
+  netdataconfig:
+  netdatalib:
+  netdatacache:
+```
+
+Sample `docker run`
+```bash
+docker run -d --name=netdata \
+  -p 19999:19999 \
+  -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
+  -v netdataconfig:/etc/netdata \
+  -v netdatalib:/var/lib/netdata \
+  -v netdatacache:/var/cache/netdata \
+  -v /etc/passwd:/host/etc/passwd:ro \
+  -v /etc/group:/host/etc/group:ro \
+  -v /proc:/host/proc:ro \
+  -v /sys:/host/sys:ro \
+  -v /etc/os-release:/host/etc/os-release:ro \
+  --restart unless-stopped \
+  --cap-add SYS_PTRACE \
+  --security-opt apparmor=unconfined \
+  --gpus all \
+  netdata/netdata
+```
+
+### Docker Troubleshooting
+To troubleshoot `nvidia-smi` in a Docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the Docker container. If `nvidia-smi` is functioning both inside and outside of the container, confirm that `nvidia-smi: yes` is uncommented in `python.d.conf`.
+```bash +docker exec -it netdata bash +cd /etc/netdata +./edit-config python.d.conf +``` diff --git a/collectors/python.d.plugin/nvidia_smi/metadata.yaml b/collectors/python.d.plugin/nvidia_smi/metadata.yaml new file mode 100644 index 00000000..9bf1e6ca --- /dev/null +++ b/collectors/python.d.plugin/nvidia_smi/metadata.yaml @@ -0,0 +1,166 @@ +# This collector will not appear in documentation, as the go version is preferred, +# https://github.com/netdata/go.d.plugin/blob/master/modules/nvidia_smi/README.md +# +# meta: +# plugin_name: python.d.plugin +# module_name: nvidia_smi +# monitored_instance: +# name: python.d nvidia_smi +# link: '' +# categories: [] +# icon_filename: '' +# related_resources: +# integrations: +# list: [] +# info_provided_to_referring_integrations: +# description: '' +# keywords: [] +# most_popular: false +# overview: +# data_collection: +# metrics_description: '' +# method_description: '' +# supported_platforms: +# include: [] +# exclude: [] +# multi_instance: true +# additional_permissions: +# description: '' +# default_behavior: +# auto_detection: +# description: '' +# limits: +# description: '' +# performance_impact: +# description: '' +# setup: +# prerequisites: +# list: [] +# configuration: +# file: +# name: '' +# description: '' +# options: +# description: '' +# folding: +# title: '' +# enabled: true +# list: [] +# examples: +# folding: +# enabled: true +# title: '' +# list: [] +# troubleshooting: +# problems: +# list: [] +# alerts: [] +# metrics: +# folding: +# title: Metrics +# enabled: false +# description: "" +# availability: [] +# scopes: +# - name: GPU +# description: "" +# labels: [] +# metrics: +# - name: nvidia_smi.pci_bandwidth +# description: PCI Express Bandwidth Utilization +# unit: "KiB/s" +# chart_type: area +# dimensions: +# - name: rx +# - name: tx +# - name: nvidia_smi.pci_bandwidth_percent +# description: PCI Express Bandwidth Percent +# unit: "percentage" +# chart_type: area +# dimensions: +# - name: rx_percent +# - name: tx_percent +# - name: nvidia_smi.fan_speed +# description: Fan Speed +# unit: "percentage" +# chart_type: line +# dimensions: +# - name: speed +# - name: nvidia_smi.gpu_utilization +# description: GPU Utilization +# unit: "percentage" +# chart_type: line +# dimensions: +# - name: utilization +# - name: nvidia_smi.mem_utilization +# description: Memory Bandwidth Utilization +# unit: "percentage" +# chart_type: line +# dimensions: +# - name: utilization +# - name: nvidia_smi.encoder_utilization +# description: Encoder/Decoder Utilization +# unit: "percentage" +# chart_type: line +# dimensions: +# - name: encoder +# - name: decoder +# - name: nvidia_smi.memory_allocated +# description: Memory Usage +# unit: "MiB" +# chart_type: stacked +# dimensions: +# - name: free +# - name: used +# - name: nvidia_smi.bar1_memory_usage +# description: Bar1 Memory Usage +# unit: "MiB" +# chart_type: stacked +# dimensions: +# - name: free +# - name: used +# - name: nvidia_smi.temperature +# description: Temperature +# unit: "celsius" +# chart_type: line +# dimensions: +# - name: temp +# - name: nvidia_smi.clocks +# description: Clock Frequencies +# unit: "MHz" +# chart_type: line +# dimensions: +# - name: graphics +# - name: video +# - name: sm +# - name: mem +# - name: nvidia_smi.power +# description: Power Utilization +# unit: "Watts" +# chart_type: line +# dimensions: +# - name: power +# - name: nvidia_smi.power_state +# description: Power State +# unit: "state" +# chart_type: line +# dimensions: +# - name: a dimension per {power_state} +# 
- name: nvidia_smi.processes_mem +# description: Memory Used by Each Process +# unit: "MiB" +# chart_type: stacked +# dimensions: +# - name: a dimension per process +# - name: nvidia_smi.user_mem +# description: Memory Used by Each User +# unit: "MiB" +# chart_type: stacked +# dimensions: +# - name: a dimension per user +# - name: nvidia_smi.user_num +# description: Number of User on GPU +# unit: "num" +# chart_type: line +# dimensions: +# - name: users diff --git a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py new file mode 100644 index 00000000..556a6143 --- /dev/null +++ b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py @@ -0,0 +1,651 @@ +# -*- coding: utf-8 -*- +# Description: nvidia-smi netdata python.d module +# Original Author: Steven Noonan (tycho) +# Author: Ilya Mashchenko (ilyam8) +# User Memory Stat Author: Guido Scatena (scatenag) + +import os +import pwd +import subprocess +import threading +import xml.etree.ElementTree as et + +from bases.FrameworkServices.SimpleService import SimpleService +from bases.collection import find_binary + +disabled_by_default = True + +NVIDIA_SMI = 'nvidia-smi' + +NOT_AVAILABLE = 'N/A' + +EMPTY_ROW = '' +EMPTY_ROW_LIMIT = 500 +POLLER_BREAK_ROW = '</nvidia_smi_log>' + +PCI_BANDWIDTH = 'pci_bandwidth' +PCI_BANDWIDTH_PERCENT = 'pci_bandwidth_percent' +FAN_SPEED = 'fan_speed' +GPU_UTIL = 'gpu_utilization' +MEM_UTIL = 'mem_utilization' +ENCODER_UTIL = 'encoder_utilization' +MEM_USAGE = 'mem_usage' +BAR_USAGE = 'bar1_mem_usage' +TEMPERATURE = 'temperature' +CLOCKS = 'clocks' +POWER = 'power' +POWER_STATE = 'power_state' +PROCESSES_MEM = 'processes_mem' +USER_MEM = 'user_mem' +USER_NUM = 'user_num' + +ORDER = [ + PCI_BANDWIDTH, + PCI_BANDWIDTH_PERCENT, + FAN_SPEED, + GPU_UTIL, + MEM_UTIL, + ENCODER_UTIL, + MEM_USAGE, + BAR_USAGE, + TEMPERATURE, + CLOCKS, + POWER, + POWER_STATE, + PROCESSES_MEM, + USER_MEM, + USER_NUM, +] + +# https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__gpupstate.html +POWER_STATES = ['P' + str(i) for i in range(0, 16)] + +# PCI Transfer data rate in gigabits per second (Gb/s) per generation +PCI_SPEED = { + "1": 2.5, + "2": 5, + "3": 8, + "4": 16, + "5": 32 +} +# PCI encoding per generation +PCI_ENCODING = { + "1": 2 / 10, + "2": 2 / 10, + "3": 2 / 130, + "4": 2 / 130, + "5": 2 / 130 +} + + +def gpu_charts(gpu): + fam = gpu.full_name() + + charts = { + PCI_BANDWIDTH: { + 'options': [None, 'PCI Express Bandwidth Utilization', 'KiB/s', fam, 'nvidia_smi.pci_bandwidth', 'area'], + 'lines': [ + ['rx_util', 'rx', 'absolute', 1, 1], + ['tx_util', 'tx', 'absolute', 1, -1], + ] + }, + PCI_BANDWIDTH_PERCENT: { + 'options': [None, 'PCI Express Bandwidth Percent', 'percentage', fam, 'nvidia_smi.pci_bandwidth_percent', + 'area'], + 'lines': [ + ['rx_util_percent', 'rx_percent'], + ['tx_util_percent', 'tx_percent'], + ] + }, + FAN_SPEED: { + 'options': [None, 'Fan Speed', 'percentage', fam, 'nvidia_smi.fan_speed', 'line'], + 'lines': [ + ['fan_speed', 'speed'], + ] + }, + GPU_UTIL: { + 'options': [None, 'GPU Utilization', 'percentage', fam, 'nvidia_smi.gpu_utilization', 'line'], + 'lines': [ + ['gpu_util', 'utilization'], + ] + }, + MEM_UTIL: { + 'options': [None, 'Memory Bandwidth Utilization', 'percentage', fam, 'nvidia_smi.mem_utilization', 'line'], + 'lines': [ + ['memory_util', 'utilization'], + ] + }, + ENCODER_UTIL: { + 'options': [None, 'Encoder/Decoder Utilization', 'percentage', fam, 'nvidia_smi.encoder_utilization', + 
'line'], + 'lines': [ + ['encoder_util', 'encoder'], + ['decoder_util', 'decoder'], + ] + }, + MEM_USAGE: { + 'options': [None, 'Memory Usage', 'MiB', fam, 'nvidia_smi.memory_allocated', 'stacked'], + 'lines': [ + ['fb_memory_free', 'free'], + ['fb_memory_used', 'used'], + ] + }, + BAR_USAGE: { + 'options': [None, 'Bar1 Memory Usage', 'MiB', fam, 'nvidia_smi.bar1_memory_usage', 'stacked'], + 'lines': [ + ['bar1_memory_free', 'free'], + ['bar1_memory_used', 'used'], + ] + }, + TEMPERATURE: { + 'options': [None, 'Temperature', 'celsius', fam, 'nvidia_smi.temperature', 'line'], + 'lines': [ + ['gpu_temp', 'temp'], + ] + }, + CLOCKS: { + 'options': [None, 'Clock Frequencies', 'MHz', fam, 'nvidia_smi.clocks', 'line'], + 'lines': [ + ['graphics_clock', 'graphics'], + ['video_clock', 'video'], + ['sm_clock', 'sm'], + ['mem_clock', 'mem'], + ] + }, + POWER: { + 'options': [None, 'Power Utilization', 'Watts', fam, 'nvidia_smi.power', 'line'], + 'lines': [ + ['power_draw', 'power', 'absolute', 1, 100], + ] + }, + POWER_STATE: { + 'options': [None, 'Power State', 'state', fam, 'nvidia_smi.power_state', 'line'], + 'lines': [['power_state_' + v.lower(), v, 'absolute'] for v in POWER_STATES] + }, + PROCESSES_MEM: { + 'options': [None, 'Memory Used by Each Process', 'MiB', fam, 'nvidia_smi.processes_mem', 'stacked'], + 'lines': [] + }, + USER_MEM: { + 'options': [None, 'Memory Used by Each User', 'MiB', fam, 'nvidia_smi.user_mem', 'stacked'], + 'lines': [] + }, + USER_NUM: { + 'options': [None, 'Number of User on GPU', 'num', fam, 'nvidia_smi.user_num', 'line'], + 'lines': [ + ['user_num', 'users'], + ] + }, + } + + idx = gpu.num + + order = ['gpu{0}_{1}'.format(idx, v) for v in ORDER] + charts = dict(('gpu{0}_{1}'.format(idx, k), v) for k, v in charts.items()) + + for chart in charts.values(): + for line in chart['lines']: + line[0] = 'gpu{0}_{1}'.format(idx, line[0]) + + return order, charts + + +class NvidiaSMI: + def __init__(self): + self.command = find_binary(NVIDIA_SMI) + self.active_proc = None + + def run_once(self): + proc = subprocess.Popen([self.command, '-x', '-q'], stdout=subprocess.PIPE) + stdout, _ = proc.communicate() + return stdout + + def run_loop(self, interval): + if self.active_proc: + self.kill() + proc = subprocess.Popen([self.command, '-x', '-q', '-l', str(interval)], stdout=subprocess.PIPE) + self.active_proc = proc + return proc.stdout + + def kill(self): + if self.active_proc: + self.active_proc.kill() + self.active_proc = None + + +class NvidiaSMIPoller(threading.Thread): + def __init__(self, poll_interval): + threading.Thread.__init__(self) + self.daemon = True + + self.smi = NvidiaSMI() + self.interval = poll_interval + + self.lock = threading.RLock() + self.last_data = str() + self.exit = False + self.empty_rows = 0 + self.rows = list() + + def has_smi(self): + return bool(self.smi.command) + + def run_once(self): + return self.smi.run_once() + + def run(self): + out = self.smi.run_loop(self.interval) + + for row in out: + if self.exit or self.empty_rows > EMPTY_ROW_LIMIT: + break + self.process_row(row) + self.smi.kill() + + def process_row(self, row): + row = row.decode() + self.empty_rows += (row == EMPTY_ROW) + self.rows.append(row) + + if POLLER_BREAK_ROW in row: + self.lock.acquire() + self.last_data = '\n'.join(self.rows) + self.lock.release() + + self.rows = list() + self.empty_rows = 0 + + def is_started(self): + return self.ident is not None + + def shutdown(self): + self.exit = True + + def data(self): + self.lock.acquire() + data = self.last_data + 
self.lock.release() + return data + + +def handle_attr_error(method): + def on_call(*args, **kwargs): + try: + return method(*args, **kwargs) + except AttributeError: + return None + + return on_call + + +def handle_value_error(method): + def on_call(*args, **kwargs): + try: + return method(*args, **kwargs) + except ValueError: + return None + + return on_call + + +HOST_PREFIX = os.getenv('NETDATA_HOST_PREFIX') +ETC_PASSWD_PATH = '/etc/passwd' +PROC_PATH = '/proc' + +IS_INSIDE_DOCKER = False + +if HOST_PREFIX: + ETC_PASSWD_PATH = os.path.join(HOST_PREFIX, ETC_PASSWD_PATH[1:]) + PROC_PATH = os.path.join(HOST_PREFIX, PROC_PATH[1:]) + IS_INSIDE_DOCKER = True + + +def read_passwd_file(): + data = dict() + with open(ETC_PASSWD_PATH, 'r') as f: + for line in f: + line = line.strip() + if line.startswith("#"): + continue + fields = line.split(":") + # name, passwd, uid, gid, comment, home_dir, shell + if len(fields) != 7: + continue + # uid, guid + fields[2], fields[3] = int(fields[2]), int(fields[3]) + data[fields[2]] = fields + return data + + +def read_passwd_file_safe(): + try: + if IS_INSIDE_DOCKER: + return read_passwd_file() + return dict((k[2], k) for k in pwd.getpwall()) + except (OSError, IOError): + return dict() + + +def get_username_by_pid_safe(pid, passwd_file): + path = os.path.join(PROC_PATH, pid) + try: + uid = os.stat(path).st_uid + except (OSError, IOError): + return '' + try: + if IS_INSIDE_DOCKER: + return passwd_file[uid][0] + return pwd.getpwuid(uid)[0] + except KeyError: + return str(uid) + + +class GPU: + def __init__(self, num, root, exclude_zero_memory_users=False): + self.num = num + self.root = root + self.exclude_zero_memory_users = exclude_zero_memory_users + + def id(self): + return self.root.get('id') + + def name(self): + return self.root.find('product_name').text + + def full_name(self): + return 'gpu{0} {1}'.format(self.num, self.name()) + + @handle_attr_error + def pci_link_gen(self): + return self.root.find('pci').find('pci_gpu_link_info').find('pcie_gen').find('max_link_gen').text + + @handle_attr_error + def pci_link_width(self): + info = self.root.find('pci').find('pci_gpu_link_info') + return info.find('link_widths').find('max_link_width').text.split('x')[0] + + def pci_bw_max(self): + link_gen = self.pci_link_gen() + link_width = int(self.pci_link_width()) + if link_gen not in PCI_SPEED or link_gen not in PCI_ENCODING or not link_width: + return None + # Maximum PCIe Bandwidth = SPEED * WIDTH * (1 - ENCODING) - 1Gb/s. 
+ # see details https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance + # return max bandwidth in kilobytes per second (kB/s) + return (PCI_SPEED[link_gen] * link_width * (1 - PCI_ENCODING[link_gen]) - 1) * 1000 * 1000 / 8 + + @handle_attr_error + def rx_util(self): + return self.root.find('pci').find('rx_util').text.split()[0] + + @handle_attr_error + def tx_util(self): + return self.root.find('pci').find('tx_util').text.split()[0] + + @handle_attr_error + def fan_speed(self): + return self.root.find('fan_speed').text.split()[0] + + @handle_attr_error + def gpu_util(self): + return self.root.find('utilization').find('gpu_util').text.split()[0] + + @handle_attr_error + def memory_util(self): + return self.root.find('utilization').find('memory_util').text.split()[0] + + @handle_attr_error + def encoder_util(self): + return self.root.find('utilization').find('encoder_util').text.split()[0] + + @handle_attr_error + def decoder_util(self): + return self.root.find('utilization').find('decoder_util').text.split()[0] + + @handle_attr_error + def fb_memory_used(self): + return self.root.find('fb_memory_usage').find('used').text.split()[0] + + @handle_attr_error + def fb_memory_free(self): + return self.root.find('fb_memory_usage').find('free').text.split()[0] + + @handle_attr_error + def bar1_memory_used(self): + return self.root.find('bar1_memory_usage').find('used').text.split()[0] + + @handle_attr_error + def bar1_memory_free(self): + return self.root.find('bar1_memory_usage').find('free').text.split()[0] + + @handle_attr_error + def temperature(self): + return self.root.find('temperature').find('gpu_temp').text.split()[0] + + @handle_attr_error + def graphics_clock(self): + return self.root.find('clocks').find('graphics_clock').text.split()[0] + + @handle_attr_error + def video_clock(self): + return self.root.find('clocks').find('video_clock').text.split()[0] + + @handle_attr_error + def sm_clock(self): + return self.root.find('clocks').find('sm_clock').text.split()[0] + + @handle_attr_error + def mem_clock(self): + return self.root.find('clocks').find('mem_clock').text.split()[0] + + @handle_attr_error + def power_readings(self): + elem = self.root.find('power_readings') + return elem if elem else self.root.find('gpu_power_readings') + + @handle_attr_error + def power_state(self): + return str(self.power_readings().find('power_state').text.split()[0]) + + @handle_value_error + @handle_attr_error + def power_draw(self): + return float(self.power_readings().find('power_draw').text.split()[0]) * 100 + + @handle_attr_error + def processes(self): + processes_info = self.root.find('processes').findall('process_info') + if not processes_info: + return list() + + passwd_file = read_passwd_file_safe() + processes = list() + + for info in processes_info: + pid = info.find('pid').text + processes.append({ + 'pid': int(pid), + 'process_name': info.find('process_name').text, + 'used_memory': int(info.find('used_memory').text.split()[0]), + 'username': get_username_by_pid_safe(pid, passwd_file), + }) + return processes + + def data(self): + data = { + 'rx_util': self.rx_util(), + 'tx_util': self.tx_util(), + 'fan_speed': self.fan_speed(), + 'gpu_util': self.gpu_util(), + 'memory_util': self.memory_util(), + 'encoder_util': self.encoder_util(), + 'decoder_util': self.decoder_util(), + 'fb_memory_used': self.fb_memory_used(), + 'fb_memory_free': self.fb_memory_free(), + 'bar1_memory_used': self.bar1_memory_used(), + 'bar1_memory_free': 
self.bar1_memory_free(), + 'gpu_temp': self.temperature(), + 'graphics_clock': self.graphics_clock(), + 'video_clock': self.video_clock(), + 'sm_clock': self.sm_clock(), + 'mem_clock': self.mem_clock(), + 'power_draw': self.power_draw(), + } + + if self.rx_util() != NOT_AVAILABLE and self.tx_util() != NOT_AVAILABLE: + pci_bw_max = self.pci_bw_max() + if not pci_bw_max: + data['rx_util_percent'] = 0 + data['tx_util_percent'] = 0 + else: + data['rx_util_percent'] = str(int(int(self.rx_util()) * 100 / self.pci_bw_max())) + data['tx_util_percent'] = str(int(int(self.tx_util()) * 100 / self.pci_bw_max())) + + for v in POWER_STATES: + data['power_state_' + v.lower()] = 0 + p_state = self.power_state() + if p_state: + data['power_state_' + p_state.lower()] = 1 + + processes = self.processes() or [] + users = set() + for p in processes: + data['process_mem_{0}'.format(p['pid'])] = p['used_memory'] + if p['username']: + if self.exclude_zero_memory_users and p['used_memory'] == 0: + continue + users.add(p['username']) + key = 'user_mem_{0}'.format(p['username']) + if key in data: + data[key] += p['used_memory'] + else: + data[key] = p['used_memory'] + data['user_num'] = len(users) + + return dict(('gpu{0}_{1}'.format(self.num, k), v) for k, v in data.items()) + + +class Service(SimpleService): + def __init__(self, configuration=None, name=None): + super(Service, self).__init__(configuration=configuration, name=name) + self.order = list() + self.definitions = dict() + self.loop_mode = configuration.get('loop_mode', True) + poll = int(configuration.get('poll_seconds', self.get_update_every())) + self.exclude_zero_memory_users = configuration.get('exclude_zero_memory_users', False) + self.poller = NvidiaSMIPoller(poll) + + def get_data_loop_mode(self): + if not self.poller.is_started(): + self.poller.start() + + if not self.poller.is_alive(): + self.debug('poller is off') + return None + + return self.poller.data() + + def get_data_normal_mode(self): + return self.poller.run_once() + + def get_data(self): + if self.loop_mode: + last_data = self.get_data_loop_mode() + else: + last_data = self.get_data_normal_mode() + + if not last_data: + return None + + parsed = self.parse_xml(last_data) + if parsed is None: + return None + + data = dict() + for idx, root in enumerate(parsed.findall('gpu')): + gpu = GPU(idx, root, self.exclude_zero_memory_users) + gpu_data = gpu.data() + # self.debug(gpu_data) + gpu_data = dict((k, v) for k, v in gpu_data.items() if is_gpu_data_value_valid(v)) + data.update(gpu_data) + self.update_processes_mem_chart(gpu) + self.update_processes_user_mem_chart(gpu) + + return data or None + + def update_processes_mem_chart(self, gpu): + ps = gpu.processes() + if not ps: + return + chart = self.charts['gpu{0}_{1}'.format(gpu.num, PROCESSES_MEM)] + active_dim_ids = [] + for p in ps: + dim_id = 'gpu{0}_process_mem_{1}'.format(gpu.num, p['pid']) + active_dim_ids.append(dim_id) + if dim_id not in chart: + chart.add_dimension([dim_id, '{0} {1}'.format(p['pid'], p['process_name'])]) + for dim in chart: + if dim.id not in active_dim_ids: + chart.del_dimension(dim.id, hide=False) + + def update_processes_user_mem_chart(self, gpu): + ps = gpu.processes() + if not ps: + return + chart = self.charts['gpu{0}_{1}'.format(gpu.num, USER_MEM)] + active_dim_ids = [] + for p in ps: + if not p.get('username'): + continue + dim_id = 'gpu{0}_user_mem_{1}'.format(gpu.num, p['username']) + active_dim_ids.append(dim_id) + if dim_id not in chart: + chart.add_dimension([dim_id, '{0}'.format(p['username'])]) + + 
for dim in chart: + if dim.id not in active_dim_ids: + chart.del_dimension(dim.id, hide=False) + + def check(self): + if not self.poller.has_smi(): + self.error("couldn't find '{0}' binary".format(NVIDIA_SMI)) + return False + + raw_data = self.poller.run_once() + if not raw_data: + self.error("failed to invoke '{0}' binary".format(NVIDIA_SMI)) + return False + + parsed = self.parse_xml(raw_data) + if parsed is None: + return False + + gpus = parsed.findall('gpu') + if not gpus: + return False + + self.create_charts(gpus) + + return True + + def parse_xml(self, data): + try: + return et.fromstring(data) + except et.ParseError as error: + self.error('xml parse failed: "{0}", error: {1}'.format(data, error)) + + return None + + def create_charts(self, gpus): + for idx, root in enumerate(gpus): + order, charts = gpu_charts(GPU(idx, root)) + self.order.extend(order) + self.definitions.update(charts) + + +def is_gpu_data_value_valid(value): + try: + int(value) + except (TypeError, ValueError): + return False + return True diff --git a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf new file mode 100644 index 00000000..3d2a30d4 --- /dev/null +++ b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf @@ -0,0 +1,68 @@ +# netdata python.d.plugin configuration for nvidia_smi +# +# This file is in YaML format. Generally the format is: +# +# name: value +# +# There are 2 sections: +# - global variables +# - one or more JOBS +# +# JOBS allow you to collect values from multiple sources. +# Each source will have its own set of charts. +# +# JOB parameters have to be indented (using spaces only, example below). + +# ---------------------------------------------------------------------- +# Global Variables +# These variables set the defaults for all JOBs, however each JOB +# may define its own, overriding the defaults. + +# update_every sets the default data collection frequency. +# If unset, the python.d.plugin default is used. +# update_every: 1 + +# priority controls the order of charts at the netdata dashboard. +# Lower numbers move the charts towards the top of the page. +# If unset, the default for python.d.plugin is used. +# priority: 60000 + +# penalty indicates whether to apply penalty to update_every in case of failures. +# Penalty will increase every 5 failed updates in a row. Maximum penalty is 10 minutes. +# penalty: yes + +# autodetection_retry sets the job re-check interval in seconds. +# The job is not deleted if check fails. +# Attempts to start the job are made once every autodetection_retry. +# This feature is disabled by default. +# autodetection_retry: 0 + +# ---------------------------------------------------------------------- +# JOBS (data collection sources) +# +# The default JOBS share the same *name*. JOBS with the same name +# are mutually exclusive. Only one of them will be allowed running at +# any time. This allows autodetection to try several alternatives and +# pick the one that works. +# +# Any number of jobs is supported. +# +# All python.d.plugin JOBS (for all its modules) support a set of +# predefined parameters. 
These are: +# +# job_name: +# name: myname # the JOB's name as it will appear at the +# # dashboard (by default is the job_name) +# # JOBs sharing a name are mutually exclusive +# update_every: 1 # the JOB's data collection frequency +# priority: 60000 # the JOB's order on the dashboard +# penalty: yes # the JOB's penalty +# autodetection_retry: 0 # the JOB's re-check interval in seconds +# +# Additionally to the above, example also supports the following: +# +# loop_mode: yes/no # default is yes. If set to yes `nvidia-smi` is executed in a separate thread using `-l` option. +# poll_seconds: SECONDS # default is 1. Sets the frequency of seconds the nvidia-smi tool is polled in loop mode. +# exclude_zero_memory_users: yes/no # default is no. Whether to collect users metrics with 0Mb memory allocation. +# +# ---------------------------------------------------------------------- |
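
The collector above gathers everything by running `nvidia-smi -x -q` and walking the resulting XML with `xml.etree.ElementTree`, as `nvidia_smi.chart.py` does. For readers who want to inspect that data path outside of Netdata, here is a minimal standalone sketch of the same approach. It is not part of this patch: `query_gpus` is a hypothetical helper, it assumes `nvidia-smi` is on the `PATH`, and the XML element names are the ones referenced in the collector code.

```python
# Standalone sketch (not part of this patch): query GPUs the same way the
# collector does, i.e. `nvidia-smi -x -q` parsed with ElementTree.
import subprocess
import xml.etree.ElementTree as et


def query_gpus():
    # -x: XML output, -q: query GPU attributes (same flags the collector uses)
    xml_out = subprocess.check_output(['nvidia-smi', '-x', '-q'])
    root = et.fromstring(xml_out)
    gpus = []
    for idx, gpu in enumerate(root.findall('gpu')):
        gpus.append({
            'num': idx,
            'name': gpu.find('product_name').text,
            # values look like "42 %" / "123 MiB"; keep the number, as the collector does
            'gpu_util': gpu.find('utilization').find('gpu_util').text.split()[0],
            'fb_memory_used': gpu.find('fb_memory_usage').find('used').text.split()[0],
        })
    return gpus


if __name__ == '__main__':
    for gpu in query_gpus():
        print(gpu)
```

On a host where `nvidia-smi` works, this prints one dictionary per GPU, which is a quick way to check the XML fields the charts rely on before enabling the module.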