author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-07-24 09:54:23 +0000
---|---|---
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-07-24 09:54:44 +0000
commit | 836b47cb7e99a977c5a23b059ca1d0b5065d310e (patch) |
tree | 1604da8f482d02effa033c94a84be42bc0c848c3 /collectors/python.d.plugin/nvidia_smi |
parent | Releasing debian version 1.44.3-2. (diff) |
download | netdata-836b47cb7e99a977c5a23b059ca1d0b5065d310e.tar.xz netdata-836b47cb7e99a977c5a23b059ca1d0b5065d310e.zip |
Merging upstream version 1.46.3.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'collectors/python.d.plugin/nvidia_smi')
-rw-r--r-- | collectors/python.d.plugin/nvidia_smi/Makefile.inc | 13
-rw-r--r-- | collectors/python.d.plugin/nvidia_smi/README.md | 157
-rw-r--r-- | collectors/python.d.plugin/nvidia_smi/metadata.yaml | 166
-rw-r--r-- | collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py | 651
-rw-r--r-- | collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf | 68
5 files changed, 0 insertions, 1055 deletions
diff --git a/collectors/python.d.plugin/nvidia_smi/Makefile.inc b/collectors/python.d.plugin/nvidia_smi/Makefile.inc deleted file mode 100644 index 52fb25a68..000000000 --- a/collectors/python.d.plugin/nvidia_smi/Makefile.inc +++ /dev/null @@ -1,13 +0,0 @@ -# SPDX-License-Identifier: GPL-3.0-or-later - -# THIS IS NOT A COMPLETE Makefile -# IT IS INCLUDED BY ITS PARENT'S Makefile.am -# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT - -# install these files -dist_python_DATA += nvidia_smi/nvidia_smi.chart.py -dist_pythonconfig_DATA += nvidia_smi/nvidia_smi.conf - -# do not install these files, but include them in the distribution -dist_noinst_DATA += nvidia_smi/README.md nvidia_smi/Makefile.inc - diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md deleted file mode 100644 index 7d45289a4..000000000 --- a/collectors/python.d.plugin/nvidia_smi/README.md +++ /dev/null @@ -1,157 +0,0 @@ -<!-- -title: "Nvidia GPU monitoring with Netdata" -custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/nvidia_smi/README.md" -sidebar_label: "nvidia_smi-python.d.plugin" -learn_status: "Published" -learn_topic_type: "References" -learn_rel_path: "Integrations/Monitor/Devices" ---> - -# Nvidia GPU collector - -Monitors performance metrics (memory usage, fan speed, pcie bandwidth utilization, temperature, etc.) using `nvidia-smi` cli tool. - -## Requirements and Notes - -- You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support the tool. Mostly the newer high end models used for AI / ML and Crypto or Pro range, read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface). -- You must enable this plugin, as its disabled by default due to minor performance issues: - ```bash - cd /etc/netdata # Replace this path with your Netdata config directory, if different - sudo ./edit-config python.d.conf - ``` - Remove the '#' before nvidia_smi so it reads: `nvidia_smi: yes`. - -- On some systems when the GPU is idle the `nvidia-smi` tool unloads and there is added latency again when it is next queried. If you are running GPUs under constant workload this isn't likely to be an issue. -- Currently the `nvidia-smi` tool is being queried via cli. Updating the plugin to use the nvidia c/c++ API directly should resolve this issue. See discussion here: <https://github.com/netdata/netdata/pull/4357> -- Contributions are welcome. -- Make sure `netdata` user can execute `/usr/bin/nvidia-smi` or wherever your binary is. -- If `nvidia-smi` process [is not killed after netdata restart](https://github.com/netdata/netdata/issues/7143) you need to off `loop_mode`. -- `poll_seconds` is how often in seconds the tool is polled for as an integer. - -## Charts - -It produces the following charts: - -- PCI Express Bandwidth Utilization in `KiB/s` -- Fan Speed in `percentage` -- GPU Utilization in `percentage` -- Memory Bandwidth Utilization in `percentage` -- Encoder/Decoder Utilization in `percentage` -- Memory Usage in `MiB` -- Temperature in `celsius` -- Clock Frequencies in `MHz` -- Power Utilization in `Watts` -- Memory Used by Each Process in `MiB` -- Memory Used by Each User in `MiB` -- Number of User on GPU in `num` - -## Configuration - -Edit the `python.d/nvidia_smi.conf` configuration file using `edit-config` from the Netdata [config -directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), which is typically at `/etc/netdata`. 
- -```bash -cd /etc/netdata # Replace this path with your Netdata config directory, if different -sudo ./edit-config python.d/nvidia_smi.conf -``` - -Sample: - -```yaml -loop_mode : yes -poll_seconds : 1 -exclude_zero_memory_users : yes -``` - - -### Troubleshooting - -To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The -output will give you the output of the data collection job or error messages on why the collector isn't working. - -First, navigate to your plugins directory, usually they are located under `/usr/libexec/netdata/plugins.d/`. If that's -not the case on your system, open `netdata.conf` and look for the setting `plugins directory`. Once you're in the -plugin's directory, switch to the `netdata` user. - -```bash -cd /usr/libexec/netdata/plugins.d/ -sudo su -s /bin/bash netdata -``` - -Now you can manually run the `nvidia_smi` module in debug mode: - -```bash -./python.d.plugin nvidia_smi debug trace -``` - -## Docker - -GPU monitoring in a docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable. - -Sample `docker-compose.yml` -```yaml -version: '3' -services: - netdata: - image: netdata/netdata - container_name: netdata - hostname: example.com # set to fqdn of host - ports: - - 19999:19999 - restart: unless-stopped - cap_add: - - SYS_PTRACE - security_opt: - - apparmor:unconfined - environment: - - NETDATA_EXTRA_APK_PACKAGES=gcompat - volumes: - - netdataconfig:/etc/netdata - - netdatalib:/var/lib/netdata - - netdatacache:/var/cache/netdata - - /etc/passwd:/host/etc/passwd:ro - - /etc/group:/host/etc/group:ro - - /proc:/host/proc:ro - - /sys:/host/sys:ro - - /etc/os-release:/host/etc/os-release:ro - deploy: - resources: - reservations: - devices: - - driver: nvidia - count: all - capabilities: [gpu] - -volumes: - netdataconfig: - netdatalib: - netdatacache: -``` - -Sample `docker run` -```yaml -docker run -d --name=netdata \ - -p 19999:19999 \ - -e NETDATA_EXTRA_APK_PACKAGES=gcompat \ - -v netdataconfig:/etc/netdata \ - -v netdatalib:/var/lib/netdata \ - -v netdatacache:/var/cache/netdata \ - -v /etc/passwd:/host/etc/passwd:ro \ - -v /etc/group:/host/etc/group:ro \ - -v /proc:/host/proc:ro \ - -v /sys:/host/sys:ro \ - -v /etc/os-release:/host/etc/os-release:ro \ - --restart unless-stopped \ - --cap-add SYS_PTRACE \ - --security-opt apparmor=unconfined \ - --gpus all \ - netdata/netdata -``` - -### Docker Troubleshooting -To troubleshoot `nvidia-smi` in a docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the docker container. If `nvidia-smi` is fuctioning both inside and outside of the container, confirm that `nvidia-smi: yes` is uncommented in `python.d.conf`. 
-```bash -docker exec -it netdata bash -cd /etc/netdata -./edit-config python.d.conf -``` diff --git a/collectors/python.d.plugin/nvidia_smi/metadata.yaml b/collectors/python.d.plugin/nvidia_smi/metadata.yaml deleted file mode 100644 index 9bf1e6ca7..000000000 --- a/collectors/python.d.plugin/nvidia_smi/metadata.yaml +++ /dev/null @@ -1,166 +0,0 @@ -# This collector will not appear in documentation, as the go version is preferred, -# https://github.com/netdata/go.d.plugin/blob/master/modules/nvidia_smi/README.md -# -# meta: -# plugin_name: python.d.plugin -# module_name: nvidia_smi -# monitored_instance: -# name: python.d nvidia_smi -# link: '' -# categories: [] -# icon_filename: '' -# related_resources: -# integrations: -# list: [] -# info_provided_to_referring_integrations: -# description: '' -# keywords: [] -# most_popular: false -# overview: -# data_collection: -# metrics_description: '' -# method_description: '' -# supported_platforms: -# include: [] -# exclude: [] -# multi_instance: true -# additional_permissions: -# description: '' -# default_behavior: -# auto_detection: -# description: '' -# limits: -# description: '' -# performance_impact: -# description: '' -# setup: -# prerequisites: -# list: [] -# configuration: -# file: -# name: '' -# description: '' -# options: -# description: '' -# folding: -# title: '' -# enabled: true -# list: [] -# examples: -# folding: -# enabled: true -# title: '' -# list: [] -# troubleshooting: -# problems: -# list: [] -# alerts: [] -# metrics: -# folding: -# title: Metrics -# enabled: false -# description: "" -# availability: [] -# scopes: -# - name: GPU -# description: "" -# labels: [] -# metrics: -# - name: nvidia_smi.pci_bandwidth -# description: PCI Express Bandwidth Utilization -# unit: "KiB/s" -# chart_type: area -# dimensions: -# - name: rx -# - name: tx -# - name: nvidia_smi.pci_bandwidth_percent -# description: PCI Express Bandwidth Percent -# unit: "percentage" -# chart_type: area -# dimensions: -# - name: rx_percent -# - name: tx_percent -# - name: nvidia_smi.fan_speed -# description: Fan Speed -# unit: "percentage" -# chart_type: line -# dimensions: -# - name: speed -# - name: nvidia_smi.gpu_utilization -# description: GPU Utilization -# unit: "percentage" -# chart_type: line -# dimensions: -# - name: utilization -# - name: nvidia_smi.mem_utilization -# description: Memory Bandwidth Utilization -# unit: "percentage" -# chart_type: line -# dimensions: -# - name: utilization -# - name: nvidia_smi.encoder_utilization -# description: Encoder/Decoder Utilization -# unit: "percentage" -# chart_type: line -# dimensions: -# - name: encoder -# - name: decoder -# - name: nvidia_smi.memory_allocated -# description: Memory Usage -# unit: "MiB" -# chart_type: stacked -# dimensions: -# - name: free -# - name: used -# - name: nvidia_smi.bar1_memory_usage -# description: Bar1 Memory Usage -# unit: "MiB" -# chart_type: stacked -# dimensions: -# - name: free -# - name: used -# - name: nvidia_smi.temperature -# description: Temperature -# unit: "celsius" -# chart_type: line -# dimensions: -# - name: temp -# - name: nvidia_smi.clocks -# description: Clock Frequencies -# unit: "MHz" -# chart_type: line -# dimensions: -# - name: graphics -# - name: video -# - name: sm -# - name: mem -# - name: nvidia_smi.power -# description: Power Utilization -# unit: "Watts" -# chart_type: line -# dimensions: -# - name: power -# - name: nvidia_smi.power_state -# description: Power State -# unit: "state" -# chart_type: line -# dimensions: -# - name: a dimension per 
{power_state} -# - name: nvidia_smi.processes_mem -# description: Memory Used by Each Process -# unit: "MiB" -# chart_type: stacked -# dimensions: -# - name: a dimension per process -# - name: nvidia_smi.user_mem -# description: Memory Used by Each User -# unit: "MiB" -# chart_type: stacked -# dimensions: -# - name: a dimension per user -# - name: nvidia_smi.user_num -# description: Number of User on GPU -# unit: "num" -# chart_type: line -# dimensions: -# - name: users diff --git a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py deleted file mode 100644 index 556a61435..000000000 --- a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py +++ /dev/null @@ -1,651 +0,0 @@ -# -*- coding: utf-8 -*- -# Description: nvidia-smi netdata python.d module -# Original Author: Steven Noonan (tycho) -# Author: Ilya Mashchenko (ilyam8) -# User Memory Stat Author: Guido Scatena (scatenag) - -import os -import pwd -import subprocess -import threading -import xml.etree.ElementTree as et - -from bases.FrameworkServices.SimpleService import SimpleService -from bases.collection import find_binary - -disabled_by_default = True - -NVIDIA_SMI = 'nvidia-smi' - -NOT_AVAILABLE = 'N/A' - -EMPTY_ROW = '' -EMPTY_ROW_LIMIT = 500 -POLLER_BREAK_ROW = '</nvidia_smi_log>' - -PCI_BANDWIDTH = 'pci_bandwidth' -PCI_BANDWIDTH_PERCENT = 'pci_bandwidth_percent' -FAN_SPEED = 'fan_speed' -GPU_UTIL = 'gpu_utilization' -MEM_UTIL = 'mem_utilization' -ENCODER_UTIL = 'encoder_utilization' -MEM_USAGE = 'mem_usage' -BAR_USAGE = 'bar1_mem_usage' -TEMPERATURE = 'temperature' -CLOCKS = 'clocks' -POWER = 'power' -POWER_STATE = 'power_state' -PROCESSES_MEM = 'processes_mem' -USER_MEM = 'user_mem' -USER_NUM = 'user_num' - -ORDER = [ - PCI_BANDWIDTH, - PCI_BANDWIDTH_PERCENT, - FAN_SPEED, - GPU_UTIL, - MEM_UTIL, - ENCODER_UTIL, - MEM_USAGE, - BAR_USAGE, - TEMPERATURE, - CLOCKS, - POWER, - POWER_STATE, - PROCESSES_MEM, - USER_MEM, - USER_NUM, -] - -# https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__gpupstate.html -POWER_STATES = ['P' + str(i) for i in range(0, 16)] - -# PCI Transfer data rate in gigabits per second (Gb/s) per generation -PCI_SPEED = { - "1": 2.5, - "2": 5, - "3": 8, - "4": 16, - "5": 32 -} -# PCI encoding per generation -PCI_ENCODING = { - "1": 2 / 10, - "2": 2 / 10, - "3": 2 / 130, - "4": 2 / 130, - "5": 2 / 130 -} - - -def gpu_charts(gpu): - fam = gpu.full_name() - - charts = { - PCI_BANDWIDTH: { - 'options': [None, 'PCI Express Bandwidth Utilization', 'KiB/s', fam, 'nvidia_smi.pci_bandwidth', 'area'], - 'lines': [ - ['rx_util', 'rx', 'absolute', 1, 1], - ['tx_util', 'tx', 'absolute', 1, -1], - ] - }, - PCI_BANDWIDTH_PERCENT: { - 'options': [None, 'PCI Express Bandwidth Percent', 'percentage', fam, 'nvidia_smi.pci_bandwidth_percent', - 'area'], - 'lines': [ - ['rx_util_percent', 'rx_percent'], - ['tx_util_percent', 'tx_percent'], - ] - }, - FAN_SPEED: { - 'options': [None, 'Fan Speed', 'percentage', fam, 'nvidia_smi.fan_speed', 'line'], - 'lines': [ - ['fan_speed', 'speed'], - ] - }, - GPU_UTIL: { - 'options': [None, 'GPU Utilization', 'percentage', fam, 'nvidia_smi.gpu_utilization', 'line'], - 'lines': [ - ['gpu_util', 'utilization'], - ] - }, - MEM_UTIL: { - 'options': [None, 'Memory Bandwidth Utilization', 'percentage', fam, 'nvidia_smi.mem_utilization', 'line'], - 'lines': [ - ['memory_util', 'utilization'], - ] - }, - ENCODER_UTIL: { - 'options': [None, 'Encoder/Decoder Utilization', 'percentage', fam, 
'nvidia_smi.encoder_utilization', - 'line'], - 'lines': [ - ['encoder_util', 'encoder'], - ['decoder_util', 'decoder'], - ] - }, - MEM_USAGE: { - 'options': [None, 'Memory Usage', 'MiB', fam, 'nvidia_smi.memory_allocated', 'stacked'], - 'lines': [ - ['fb_memory_free', 'free'], - ['fb_memory_used', 'used'], - ] - }, - BAR_USAGE: { - 'options': [None, 'Bar1 Memory Usage', 'MiB', fam, 'nvidia_smi.bar1_memory_usage', 'stacked'], - 'lines': [ - ['bar1_memory_free', 'free'], - ['bar1_memory_used', 'used'], - ] - }, - TEMPERATURE: { - 'options': [None, 'Temperature', 'celsius', fam, 'nvidia_smi.temperature', 'line'], - 'lines': [ - ['gpu_temp', 'temp'], - ] - }, - CLOCKS: { - 'options': [None, 'Clock Frequencies', 'MHz', fam, 'nvidia_smi.clocks', 'line'], - 'lines': [ - ['graphics_clock', 'graphics'], - ['video_clock', 'video'], - ['sm_clock', 'sm'], - ['mem_clock', 'mem'], - ] - }, - POWER: { - 'options': [None, 'Power Utilization', 'Watts', fam, 'nvidia_smi.power', 'line'], - 'lines': [ - ['power_draw', 'power', 'absolute', 1, 100], - ] - }, - POWER_STATE: { - 'options': [None, 'Power State', 'state', fam, 'nvidia_smi.power_state', 'line'], - 'lines': [['power_state_' + v.lower(), v, 'absolute'] for v in POWER_STATES] - }, - PROCESSES_MEM: { - 'options': [None, 'Memory Used by Each Process', 'MiB', fam, 'nvidia_smi.processes_mem', 'stacked'], - 'lines': [] - }, - USER_MEM: { - 'options': [None, 'Memory Used by Each User', 'MiB', fam, 'nvidia_smi.user_mem', 'stacked'], - 'lines': [] - }, - USER_NUM: { - 'options': [None, 'Number of User on GPU', 'num', fam, 'nvidia_smi.user_num', 'line'], - 'lines': [ - ['user_num', 'users'], - ] - }, - } - - idx = gpu.num - - order = ['gpu{0}_{1}'.format(idx, v) for v in ORDER] - charts = dict(('gpu{0}_{1}'.format(idx, k), v) for k, v in charts.items()) - - for chart in charts.values(): - for line in chart['lines']: - line[0] = 'gpu{0}_{1}'.format(idx, line[0]) - - return order, charts - - -class NvidiaSMI: - def __init__(self): - self.command = find_binary(NVIDIA_SMI) - self.active_proc = None - - def run_once(self): - proc = subprocess.Popen([self.command, '-x', '-q'], stdout=subprocess.PIPE) - stdout, _ = proc.communicate() - return stdout - - def run_loop(self, interval): - if self.active_proc: - self.kill() - proc = subprocess.Popen([self.command, '-x', '-q', '-l', str(interval)], stdout=subprocess.PIPE) - self.active_proc = proc - return proc.stdout - - def kill(self): - if self.active_proc: - self.active_proc.kill() - self.active_proc = None - - -class NvidiaSMIPoller(threading.Thread): - def __init__(self, poll_interval): - threading.Thread.__init__(self) - self.daemon = True - - self.smi = NvidiaSMI() - self.interval = poll_interval - - self.lock = threading.RLock() - self.last_data = str() - self.exit = False - self.empty_rows = 0 - self.rows = list() - - def has_smi(self): - return bool(self.smi.command) - - def run_once(self): - return self.smi.run_once() - - def run(self): - out = self.smi.run_loop(self.interval) - - for row in out: - if self.exit or self.empty_rows > EMPTY_ROW_LIMIT: - break - self.process_row(row) - self.smi.kill() - - def process_row(self, row): - row = row.decode() - self.empty_rows += (row == EMPTY_ROW) - self.rows.append(row) - - if POLLER_BREAK_ROW in row: - self.lock.acquire() - self.last_data = '\n'.join(self.rows) - self.lock.release() - - self.rows = list() - self.empty_rows = 0 - - def is_started(self): - return self.ident is not None - - def shutdown(self): - self.exit = True - - def data(self): - self.lock.acquire() - 
data = self.last_data - self.lock.release() - return data - - -def handle_attr_error(method): - def on_call(*args, **kwargs): - try: - return method(*args, **kwargs) - except AttributeError: - return None - - return on_call - - -def handle_value_error(method): - def on_call(*args, **kwargs): - try: - return method(*args, **kwargs) - except ValueError: - return None - - return on_call - - -HOST_PREFIX = os.getenv('NETDATA_HOST_PREFIX') -ETC_PASSWD_PATH = '/etc/passwd' -PROC_PATH = '/proc' - -IS_INSIDE_DOCKER = False - -if HOST_PREFIX: - ETC_PASSWD_PATH = os.path.join(HOST_PREFIX, ETC_PASSWD_PATH[1:]) - PROC_PATH = os.path.join(HOST_PREFIX, PROC_PATH[1:]) - IS_INSIDE_DOCKER = True - - -def read_passwd_file(): - data = dict() - with open(ETC_PASSWD_PATH, 'r') as f: - for line in f: - line = line.strip() - if line.startswith("#"): - continue - fields = line.split(":") - # name, passwd, uid, gid, comment, home_dir, shell - if len(fields) != 7: - continue - # uid, guid - fields[2], fields[3] = int(fields[2]), int(fields[3]) - data[fields[2]] = fields - return data - - -def read_passwd_file_safe(): - try: - if IS_INSIDE_DOCKER: - return read_passwd_file() - return dict((k[2], k) for k in pwd.getpwall()) - except (OSError, IOError): - return dict() - - -def get_username_by_pid_safe(pid, passwd_file): - path = os.path.join(PROC_PATH, pid) - try: - uid = os.stat(path).st_uid - except (OSError, IOError): - return '' - try: - if IS_INSIDE_DOCKER: - return passwd_file[uid][0] - return pwd.getpwuid(uid)[0] - except KeyError: - return str(uid) - - -class GPU: - def __init__(self, num, root, exclude_zero_memory_users=False): - self.num = num - self.root = root - self.exclude_zero_memory_users = exclude_zero_memory_users - - def id(self): - return self.root.get('id') - - def name(self): - return self.root.find('product_name').text - - def full_name(self): - return 'gpu{0} {1}'.format(self.num, self.name()) - - @handle_attr_error - def pci_link_gen(self): - return self.root.find('pci').find('pci_gpu_link_info').find('pcie_gen').find('max_link_gen').text - - @handle_attr_error - def pci_link_width(self): - info = self.root.find('pci').find('pci_gpu_link_info') - return info.find('link_widths').find('max_link_width').text.split('x')[0] - - def pci_bw_max(self): - link_gen = self.pci_link_gen() - link_width = int(self.pci_link_width()) - if link_gen not in PCI_SPEED or link_gen not in PCI_ENCODING or not link_width: - return None - # Maximum PCIe Bandwidth = SPEED * WIDTH * (1 - ENCODING) - 1Gb/s. 
- # see details https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance - # return max bandwidth in kilobytes per second (kB/s) - return (PCI_SPEED[link_gen] * link_width * (1 - PCI_ENCODING[link_gen]) - 1) * 1000 * 1000 / 8 - - @handle_attr_error - def rx_util(self): - return self.root.find('pci').find('rx_util').text.split()[0] - - @handle_attr_error - def tx_util(self): - return self.root.find('pci').find('tx_util').text.split()[0] - - @handle_attr_error - def fan_speed(self): - return self.root.find('fan_speed').text.split()[0] - - @handle_attr_error - def gpu_util(self): - return self.root.find('utilization').find('gpu_util').text.split()[0] - - @handle_attr_error - def memory_util(self): - return self.root.find('utilization').find('memory_util').text.split()[0] - - @handle_attr_error - def encoder_util(self): - return self.root.find('utilization').find('encoder_util').text.split()[0] - - @handle_attr_error - def decoder_util(self): - return self.root.find('utilization').find('decoder_util').text.split()[0] - - @handle_attr_error - def fb_memory_used(self): - return self.root.find('fb_memory_usage').find('used').text.split()[0] - - @handle_attr_error - def fb_memory_free(self): - return self.root.find('fb_memory_usage').find('free').text.split()[0] - - @handle_attr_error - def bar1_memory_used(self): - return self.root.find('bar1_memory_usage').find('used').text.split()[0] - - @handle_attr_error - def bar1_memory_free(self): - return self.root.find('bar1_memory_usage').find('free').text.split()[0] - - @handle_attr_error - def temperature(self): - return self.root.find('temperature').find('gpu_temp').text.split()[0] - - @handle_attr_error - def graphics_clock(self): - return self.root.find('clocks').find('graphics_clock').text.split()[0] - - @handle_attr_error - def video_clock(self): - return self.root.find('clocks').find('video_clock').text.split()[0] - - @handle_attr_error - def sm_clock(self): - return self.root.find('clocks').find('sm_clock').text.split()[0] - - @handle_attr_error - def mem_clock(self): - return self.root.find('clocks').find('mem_clock').text.split()[0] - - @handle_attr_error - def power_readings(self): - elem = self.root.find('power_readings') - return elem if elem else self.root.find('gpu_power_readings') - - @handle_attr_error - def power_state(self): - return str(self.power_readings().find('power_state').text.split()[0]) - - @handle_value_error - @handle_attr_error - def power_draw(self): - return float(self.power_readings().find('power_draw').text.split()[0]) * 100 - - @handle_attr_error - def processes(self): - processes_info = self.root.find('processes').findall('process_info') - if not processes_info: - return list() - - passwd_file = read_passwd_file_safe() - processes = list() - - for info in processes_info: - pid = info.find('pid').text - processes.append({ - 'pid': int(pid), - 'process_name': info.find('process_name').text, - 'used_memory': int(info.find('used_memory').text.split()[0]), - 'username': get_username_by_pid_safe(pid, passwd_file), - }) - return processes - - def data(self): - data = { - 'rx_util': self.rx_util(), - 'tx_util': self.tx_util(), - 'fan_speed': self.fan_speed(), - 'gpu_util': self.gpu_util(), - 'memory_util': self.memory_util(), - 'encoder_util': self.encoder_util(), - 'decoder_util': self.decoder_util(), - 'fb_memory_used': self.fb_memory_used(), - 'fb_memory_free': self.fb_memory_free(), - 'bar1_memory_used': self.bar1_memory_used(), - 'bar1_memory_free': 
self.bar1_memory_free(), - 'gpu_temp': self.temperature(), - 'graphics_clock': self.graphics_clock(), - 'video_clock': self.video_clock(), - 'sm_clock': self.sm_clock(), - 'mem_clock': self.mem_clock(), - 'power_draw': self.power_draw(), - } - - if self.rx_util() != NOT_AVAILABLE and self.tx_util() != NOT_AVAILABLE: - pci_bw_max = self.pci_bw_max() - if not pci_bw_max: - data['rx_util_percent'] = 0 - data['tx_util_percent'] = 0 - else: - data['rx_util_percent'] = str(int(int(self.rx_util()) * 100 / self.pci_bw_max())) - data['tx_util_percent'] = str(int(int(self.tx_util()) * 100 / self.pci_bw_max())) - - for v in POWER_STATES: - data['power_state_' + v.lower()] = 0 - p_state = self.power_state() - if p_state: - data['power_state_' + p_state.lower()] = 1 - - processes = self.processes() or [] - users = set() - for p in processes: - data['process_mem_{0}'.format(p['pid'])] = p['used_memory'] - if p['username']: - if self.exclude_zero_memory_users and p['used_memory'] == 0: - continue - users.add(p['username']) - key = 'user_mem_{0}'.format(p['username']) - if key in data: - data[key] += p['used_memory'] - else: - data[key] = p['used_memory'] - data['user_num'] = len(users) - - return dict(('gpu{0}_{1}'.format(self.num, k), v) for k, v in data.items()) - - -class Service(SimpleService): - def __init__(self, configuration=None, name=None): - super(Service, self).__init__(configuration=configuration, name=name) - self.order = list() - self.definitions = dict() - self.loop_mode = configuration.get('loop_mode', True) - poll = int(configuration.get('poll_seconds', self.get_update_every())) - self.exclude_zero_memory_users = configuration.get('exclude_zero_memory_users', False) - self.poller = NvidiaSMIPoller(poll) - - def get_data_loop_mode(self): - if not self.poller.is_started(): - self.poller.start() - - if not self.poller.is_alive(): - self.debug('poller is off') - return None - - return self.poller.data() - - def get_data_normal_mode(self): - return self.poller.run_once() - - def get_data(self): - if self.loop_mode: - last_data = self.get_data_loop_mode() - else: - last_data = self.get_data_normal_mode() - - if not last_data: - return None - - parsed = self.parse_xml(last_data) - if parsed is None: - return None - - data = dict() - for idx, root in enumerate(parsed.findall('gpu')): - gpu = GPU(idx, root, self.exclude_zero_memory_users) - gpu_data = gpu.data() - # self.debug(gpu_data) - gpu_data = dict((k, v) for k, v in gpu_data.items() if is_gpu_data_value_valid(v)) - data.update(gpu_data) - self.update_processes_mem_chart(gpu) - self.update_processes_user_mem_chart(gpu) - - return data or None - - def update_processes_mem_chart(self, gpu): - ps = gpu.processes() - if not ps: - return - chart = self.charts['gpu{0}_{1}'.format(gpu.num, PROCESSES_MEM)] - active_dim_ids = [] - for p in ps: - dim_id = 'gpu{0}_process_mem_{1}'.format(gpu.num, p['pid']) - active_dim_ids.append(dim_id) - if dim_id not in chart: - chart.add_dimension([dim_id, '{0} {1}'.format(p['pid'], p['process_name'])]) - for dim in chart: - if dim.id not in active_dim_ids: - chart.del_dimension(dim.id, hide=False) - - def update_processes_user_mem_chart(self, gpu): - ps = gpu.processes() - if not ps: - return - chart = self.charts['gpu{0}_{1}'.format(gpu.num, USER_MEM)] - active_dim_ids = [] - for p in ps: - if not p.get('username'): - continue - dim_id = 'gpu{0}_user_mem_{1}'.format(gpu.num, p['username']) - active_dim_ids.append(dim_id) - if dim_id not in chart: - chart.add_dimension([dim_id, '{0}'.format(p['username'])]) - - 
for dim in chart: - if dim.id not in active_dim_ids: - chart.del_dimension(dim.id, hide=False) - - def check(self): - if not self.poller.has_smi(): - self.error("couldn't find '{0}' binary".format(NVIDIA_SMI)) - return False - - raw_data = self.poller.run_once() - if not raw_data: - self.error("failed to invoke '{0}' binary".format(NVIDIA_SMI)) - return False - - parsed = self.parse_xml(raw_data) - if parsed is None: - return False - - gpus = parsed.findall('gpu') - if not gpus: - return False - - self.create_charts(gpus) - - return True - - def parse_xml(self, data): - try: - return et.fromstring(data) - except et.ParseError as error: - self.error('xml parse failed: "{0}", error: {1}'.format(data, error)) - - return None - - def create_charts(self, gpus): - for idx, root in enumerate(gpus): - order, charts = gpu_charts(GPU(idx, root)) - self.order.extend(order) - self.definitions.update(charts) - - -def is_gpu_data_value_valid(value): - try: - int(value) - except (TypeError, ValueError): - return False - return True diff --git a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf deleted file mode 100644 index 3d2a30d41..000000000 --- a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf +++ /dev/null @@ -1,68 +0,0 @@ -# netdata python.d.plugin configuration for nvidia_smi -# -# This file is in YaML format. Generally the format is: -# -# name: value -# -# There are 2 sections: -# - global variables -# - one or more JOBS -# -# JOBS allow you to collect values from multiple sources. -# Each source will have its own set of charts. -# -# JOB parameters have to be indented (using spaces only, example below). - -# ---------------------------------------------------------------------- -# Global Variables -# These variables set the defaults for all JOBs, however each JOB -# may define its own, overriding the defaults. - -# update_every sets the default data collection frequency. -# If unset, the python.d.plugin default is used. -# update_every: 1 - -# priority controls the order of charts at the netdata dashboard. -# Lower numbers move the charts towards the top of the page. -# If unset, the default for python.d.plugin is used. -# priority: 60000 - -# penalty indicates whether to apply penalty to update_every in case of failures. -# Penalty will increase every 5 failed updates in a row. Maximum penalty is 10 minutes. -# penalty: yes - -# autodetection_retry sets the job re-check interval in seconds. -# The job is not deleted if check fails. -# Attempts to start the job are made once every autodetection_retry. -# This feature is disabled by default. -# autodetection_retry: 0 - -# ---------------------------------------------------------------------- -# JOBS (data collection sources) -# -# The default JOBS share the same *name*. JOBS with the same name -# are mutually exclusive. Only one of them will be allowed running at -# any time. This allows autodetection to try several alternatives and -# pick the one that works. -# -# Any number of jobs is supported. -# -# All python.d.plugin JOBS (for all its modules) support a set of -# predefined parameters. 
These are:
-#
-# job_name:
-#     name: myname            # the JOB's name as it will appear on the
-#                             # dashboard (by default it is the job_name)
-#                             # JOBs sharing a name are mutually exclusive
-#     update_every: 1         # the JOB's data collection frequency
-#     priority: 60000         # the JOB's order on the dashboard
-#     penalty: yes            # the JOB's penalty
-#     autodetection_retry: 0  # the JOB's re-check interval in seconds
-#
-# Additionally to the above, nvidia_smi also supports the following:
-#
-#     loop_mode: yes/no                  # default is yes. If set to yes, `nvidia-smi` is executed in a separate thread using the `-l` option.
-#     poll_seconds: SECONDS              # default is 1. How often, in seconds, the `nvidia-smi` tool is polled in loop mode.
-#     exclude_zero_memory_users: yes/no  # default is no. Whether to collect metrics for users with 0 MiB of memory allocated.
-#
-# ----------------------------------------------------------------------
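
For reference, the deleted `nvidia_smi.chart.py` above derives the ceiling for its `pci_bandwidth_percent` chart from the documented formula `SPEED * WIDTH * (1 - ENCODING) - 1 Gb/s`, converted to kB/s. The following is a minimal, self-contained sketch of that calculation, not part of this commit; the Gen3 x16 link values are assumed purely for illustration.

```python
# Minimal sketch of the PCIe bandwidth ceiling computed by the deleted
# pci_bw_max() method. Constants are copied from nvidia_smi.chart.py above.

# PCI transfer rate in gigabits per second (Gb/s) per PCIe generation
PCI_SPEED = {"1": 2.5, "2": 5, "3": 8, "4": 16, "5": 32}
# Encoding overhead per generation (8b/10b for Gen1/2, 128b/130b from Gen3 on)
PCI_ENCODING = {"1": 2 / 10, "2": 2 / 10, "3": 2 / 130, "4": 2 / 130, "5": 2 / 130}


def pci_bw_max(link_gen, link_width):
    """Maximum PCIe bandwidth = SPEED * WIDTH * (1 - ENCODING) - 1 Gb/s,
    returned in kilobytes per second, as in the deleted collector."""
    if link_gen not in PCI_SPEED or link_gen not in PCI_ENCODING or not link_width:
        return None
    gbps = PCI_SPEED[link_gen] * link_width * (1 - PCI_ENCODING[link_gen]) - 1
    return gbps * 1000 * 1000 / 8


if __name__ == '__main__':
    # Assumed example: a PCIe Gen3 x16 link -> roughly 15.6 GB/s usable.
    print(pci_bw_max("3", 16))  # ~15628846 kB/s
```

The collector reports `rx_util_percent` and `tx_util_percent` as the raw `rx_util`/`tx_util` counters divided by this ceiling and scaled to a percentage.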