author     Daniel Baumann <daniel.baumann@progress-linux.org>  2024-07-24 09:54:23 +0000
committer  Daniel Baumann <daniel.baumann@progress-linux.org>  2024-07-24 09:54:44 +0000
commit     836b47cb7e99a977c5a23b059ca1d0b5065d310e (patch)
tree       1604da8f482d02effa033c94a84be42bc0c848c3 /collectors/python.d.plugin/nvidia_smi
parent     Releasing debian version 1.44.3-2. (diff)
Merging upstream version 1.46.3.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'collectors/python.d.plugin/nvidia_smi')
-rw-r--r--  collectors/python.d.plugin/nvidia_smi/Makefile.inc          13
-rw-r--r--  collectors/python.d.plugin/nvidia_smi/README.md            157
-rw-r--r--  collectors/python.d.plugin/nvidia_smi/metadata.yaml        166
-rw-r--r--  collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py  651
-rw-r--r--  collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf       68
5 files changed, 0 insertions, 1055 deletions
diff --git a/collectors/python.d.plugin/nvidia_smi/Makefile.inc b/collectors/python.d.plugin/nvidia_smi/Makefile.inc
deleted file mode 100644
index 52fb25a68..000000000
--- a/collectors/python.d.plugin/nvidia_smi/Makefile.inc
+++ /dev/null
@@ -1,13 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-
-# THIS IS NOT A COMPLETE Makefile
-# IT IS INCLUDED BY ITS PARENT'S Makefile.am
-# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT
-
-# install these files
-dist_python_DATA += nvidia_smi/nvidia_smi.chart.py
-dist_pythonconfig_DATA += nvidia_smi/nvidia_smi.conf
-
-# do not install these files, but include them in the distribution
-dist_noinst_DATA += nvidia_smi/README.md nvidia_smi/Makefile.inc
-
diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md
deleted file mode 100644
index 7d45289a4..000000000
--- a/collectors/python.d.plugin/nvidia_smi/README.md
+++ /dev/null
@@ -1,157 +0,0 @@
-<!--
-title: "Nvidia GPU monitoring with Netdata"
-custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/nvidia_smi/README.md"
-sidebar_label: "nvidia_smi-python.d.plugin"
-learn_status: "Published"
-learn_topic_type: "References"
-learn_rel_path: "Integrations/Monitor/Devices"
--->
-
-# Nvidia GPU collector
-
-Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.
-
-## Requirements and Notes
-
-- You must have the `nvidia-smi` tool installed, and your NVIDIA GPU(s) must support it (mostly the newer high-end models used for AI/ML and crypto, or the professional range). Read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
-- You must enable this plugin, as it's disabled by default due to minor performance issues:
- ```bash
- cd /etc/netdata # Replace this path with your Netdata config directory, if different
- sudo ./edit-config python.d.conf
- ```
- Remove the '#' before nvidia_smi so it reads: `nvidia_smi: yes`.
-
-- On some systems, when the GPU is idle, the `nvidia-smi` tool unloads, which adds latency when it is next queried. If you are running GPUs under constant workload, this is unlikely to be an issue.
-- Currently the `nvidia-smi` tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See the discussion here: <https://github.com/netdata/netdata/pull/4357>
-- Contributions are welcome.
-- Make sure the `netdata` user can execute `/usr/bin/nvidia-smi`, or wherever your binary is (a quick check is shown after this list).
-- If the `nvidia-smi` process [is not killed after a Netdata restart](https://github.com/netdata/netdata/issues/7143), you need to turn off `loop_mode` (see the configuration sketch after this list).
-- `poll_seconds` is an integer that sets how often, in seconds, the tool is polled.
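-
-A minimal sketch of the relevant `python.d/nvidia_smi.conf` options for the notes above (illustrative values, not defaults):
-
-```yaml
-loop_mode: no     # avoids a lingering nvidia-smi process across restarts
-poll_seconds: 2
-```
-
-To quickly verify that the `netdata` user can run the binary (adjust the path if yours differs):
-
-```bash
-sudo -u netdata /usr/bin/nvidia-smi -x -q | head
-```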
-
-## Charts
-
-It produces the following charts:
-
-- PCI Express Bandwidth Utilization in `KiB/s`
-- Fan Speed in `percentage`
-- GPU Utilization in `percentage`
-- Memory Bandwidth Utilization in `percentage`
-- Encoder/Decoder Utilization in `percentage`
-- Memory Usage in `MiB`
-- Temperature in `celsius`
-- Clock Frequencies in `MHz`
-- Power Utilization in `Watts`
-- Memory Used by Each Process in `MiB`
-- Memory Used by Each User in `MiB`
-- Number of Users on GPU in `num`
-
-## Configuration
-
-Edit the `python.d/nvidia_smi.conf` configuration file using `edit-config` from the Netdata [config
-directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), which is typically at `/etc/netdata`.
-
-```bash
-cd /etc/netdata # Replace this path with your Netdata config directory, if different
-sudo ./edit-config python.d/nvidia_smi.conf
-```
-
-Sample:
-
-```yaml
-loop_mode : yes
-poll_seconds : 1
-exclude_zero_memory_users : yes
-```
-
-
-### Troubleshooting
-
-To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The
-output will show you the results of the data collection job or error messages explaining why the collector isn't working.
-
-First, navigate to your plugins directory, which is usually located at `/usr/libexec/netdata/plugins.d/`. If that's
-not the case on your system, open `netdata.conf` and look for the `plugins directory` setting. Once you're in the
-plugins directory, switch to the `netdata` user.
-
-```bash
-cd /usr/libexec/netdata/plugins.d/
-sudo su -s /bin/bash netdata
-```
-
-Now you can manually run the `nvidia_smi` module in debug mode:
-
-```bash
-./python.d.plugin nvidia_smi debug trace
-```
-
-## Docker
-
-GPU monitoring in a docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.
-
-Sample `docker-compose.yml`
-```yaml
-version: '3'
-services:
- netdata:
- image: netdata/netdata
- container_name: netdata
- hostname: example.com # set to fqdn of host
- ports:
- - 19999:19999
- restart: unless-stopped
- cap_add:
- - SYS_PTRACE
- security_opt:
- - apparmor:unconfined
- environment:
- - NETDATA_EXTRA_APK_PACKAGES=gcompat
- volumes:
- - netdataconfig:/etc/netdata
- - netdatalib:/var/lib/netdata
- - netdatacache:/var/cache/netdata
- - /etc/passwd:/host/etc/passwd:ro
- - /etc/group:/host/etc/group:ro
- - /proc:/host/proc:ro
- - /sys:/host/sys:ro
- - /etc/os-release:/host/etc/os-release:ro
- deploy:
- resources:
- reservations:
- devices:
- - driver: nvidia
- count: all
- capabilities: [gpu]
-
-volumes:
- netdataconfig:
- netdatalib:
- netdatacache:
-```
-
-Sample `docker run`
-```bash
-docker run -d --name=netdata \
- -p 19999:19999 \
- -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
- -v netdataconfig:/etc/netdata \
- -v netdatalib:/var/lib/netdata \
- -v netdatacache:/var/cache/netdata \
- -v /etc/passwd:/host/etc/passwd:ro \
- -v /etc/group:/host/etc/group:ro \
- -v /proc:/host/proc:ro \
- -v /sys:/host/sys:ro \
- -v /etc/os-release:/host/etc/os-release:ro \
- --restart unless-stopped \
- --cap-add SYS_PTRACE \
- --security-opt apparmor=unconfined \
- --gpus all \
- netdata/netdata
-```
-
-### Docker Troubleshooting
-To troubleshoot `nvidia-smi` in a Docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the Docker container. If `nvidia-smi` is functioning both inside and outside of the container, confirm that `nvidia_smi: yes` is uncommented in `python.d.conf`.
-```bash
-docker exec -it netdata bash
-cd /etc/netdata
-./edit-config python.d.conf
-```
diff --git a/collectors/python.d.plugin/nvidia_smi/metadata.yaml b/collectors/python.d.plugin/nvidia_smi/metadata.yaml
deleted file mode 100644
index 9bf1e6ca7..000000000
--- a/collectors/python.d.plugin/nvidia_smi/metadata.yaml
+++ /dev/null
@@ -1,166 +0,0 @@
-# This collector will not appear in documentation, as the go version is preferred,
-# https://github.com/netdata/go.d.plugin/blob/master/modules/nvidia_smi/README.md
-#
-# meta:
-# plugin_name: python.d.plugin
-# module_name: nvidia_smi
-# monitored_instance:
-# name: python.d nvidia_smi
-# link: ''
-# categories: []
-# icon_filename: ''
-# related_resources:
-# integrations:
-# list: []
-# info_provided_to_referring_integrations:
-# description: ''
-# keywords: []
-# most_popular: false
-# overview:
-# data_collection:
-# metrics_description: ''
-# method_description: ''
-# supported_platforms:
-# include: []
-# exclude: []
-# multi_instance: true
-# additional_permissions:
-# description: ''
-# default_behavior:
-# auto_detection:
-# description: ''
-# limits:
-# description: ''
-# performance_impact:
-# description: ''
-# setup:
-# prerequisites:
-# list: []
-# configuration:
-# file:
-# name: ''
-# description: ''
-# options:
-# description: ''
-# folding:
-# title: ''
-# enabled: true
-# list: []
-# examples:
-# folding:
-# enabled: true
-# title: ''
-# list: []
-# troubleshooting:
-# problems:
-# list: []
-# alerts: []
-# metrics:
-# folding:
-# title: Metrics
-# enabled: false
-# description: ""
-# availability: []
-# scopes:
-# - name: GPU
-# description: ""
-# labels: []
-# metrics:
-# - name: nvidia_smi.pci_bandwidth
-# description: PCI Express Bandwidth Utilization
-# unit: "KiB/s"
-# chart_type: area
-# dimensions:
-# - name: rx
-# - name: tx
-# - name: nvidia_smi.pci_bandwidth_percent
-# description: PCI Express Bandwidth Percent
-# unit: "percentage"
-# chart_type: area
-# dimensions:
-# - name: rx_percent
-# - name: tx_percent
-# - name: nvidia_smi.fan_speed
-# description: Fan Speed
-# unit: "percentage"
-# chart_type: line
-# dimensions:
-# - name: speed
-# - name: nvidia_smi.gpu_utilization
-# description: GPU Utilization
-# unit: "percentage"
-# chart_type: line
-# dimensions:
-# - name: utilization
-# - name: nvidia_smi.mem_utilization
-# description: Memory Bandwidth Utilization
-# unit: "percentage"
-# chart_type: line
-# dimensions:
-# - name: utilization
-# - name: nvidia_smi.encoder_utilization
-# description: Encoder/Decoder Utilization
-# unit: "percentage"
-# chart_type: line
-# dimensions:
-# - name: encoder
-# - name: decoder
-# - name: nvidia_smi.memory_allocated
-# description: Memory Usage
-# unit: "MiB"
-# chart_type: stacked
-# dimensions:
-# - name: free
-# - name: used
-# - name: nvidia_smi.bar1_memory_usage
-# description: Bar1 Memory Usage
-# unit: "MiB"
-# chart_type: stacked
-# dimensions:
-# - name: free
-# - name: used
-# - name: nvidia_smi.temperature
-# description: Temperature
-# unit: "celsius"
-# chart_type: line
-# dimensions:
-# - name: temp
-# - name: nvidia_smi.clocks
-# description: Clock Frequencies
-# unit: "MHz"
-# chart_type: line
-# dimensions:
-# - name: graphics
-# - name: video
-# - name: sm
-# - name: mem
-# - name: nvidia_smi.power
-# description: Power Utilization
-# unit: "Watts"
-# chart_type: line
-# dimensions:
-# - name: power
-# - name: nvidia_smi.power_state
-# description: Power State
-# unit: "state"
-# chart_type: line
-# dimensions:
-# - name: a dimension per {power_state}
-# - name: nvidia_smi.processes_mem
-# description: Memory Used by Each Process
-# unit: "MiB"
-# chart_type: stacked
-# dimensions:
-# - name: a dimension per process
-# - name: nvidia_smi.user_mem
-# description: Memory Used by Each User
-# unit: "MiB"
-# chart_type: stacked
-# dimensions:
-# - name: a dimension per user
-# - name: nvidia_smi.user_num
-# description: Number of User on GPU
-# unit: "num"
-# chart_type: line
-# dimensions:
-# - name: users
diff --git a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py
deleted file mode 100644
index 556a61435..000000000
--- a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py
+++ /dev/null
@@ -1,651 +0,0 @@
-# -*- coding: utf-8 -*-
-# Description: nvidia-smi netdata python.d module
-# Original Author: Steven Noonan (tycho)
-# Author: Ilya Mashchenko (ilyam8)
-# User Memory Stat Author: Guido Scatena (scatenag)
-
-import os
-import pwd
-import subprocess
-import threading
-import xml.etree.ElementTree as et
-
-from bases.FrameworkServices.SimpleService import SimpleService
-from bases.collection import find_binary
-
-disabled_by_default = True
-
-NVIDIA_SMI = 'nvidia-smi'
-
-NOT_AVAILABLE = 'N/A'
-
-EMPTY_ROW = ''
-EMPTY_ROW_LIMIT = 500
-POLLER_BREAK_ROW = '</nvidia_smi_log>'
-
-PCI_BANDWIDTH = 'pci_bandwidth'
-PCI_BANDWIDTH_PERCENT = 'pci_bandwidth_percent'
-FAN_SPEED = 'fan_speed'
-GPU_UTIL = 'gpu_utilization'
-MEM_UTIL = 'mem_utilization'
-ENCODER_UTIL = 'encoder_utilization'
-MEM_USAGE = 'mem_usage'
-BAR_USAGE = 'bar1_mem_usage'
-TEMPERATURE = 'temperature'
-CLOCKS = 'clocks'
-POWER = 'power'
-POWER_STATE = 'power_state'
-PROCESSES_MEM = 'processes_mem'
-USER_MEM = 'user_mem'
-USER_NUM = 'user_num'
-
-ORDER = [
- PCI_BANDWIDTH,
- PCI_BANDWIDTH_PERCENT,
- FAN_SPEED,
- GPU_UTIL,
- MEM_UTIL,
- ENCODER_UTIL,
- MEM_USAGE,
- BAR_USAGE,
- TEMPERATURE,
- CLOCKS,
- POWER,
- POWER_STATE,
- PROCESSES_MEM,
- USER_MEM,
- USER_NUM,
-]
-
-# https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__gpupstate.html
-POWER_STATES = ['P' + str(i) for i in range(0, 16)]
-
-# PCI Transfer data rate in gigabits per second (Gb/s) per generation
-PCI_SPEED = {
- "1": 2.5,
- "2": 5,
- "3": 8,
- "4": 16,
- "5": 32
-}
-# PCI encoding per generation
-PCI_ENCODING = {
- "1": 2 / 10,
- "2": 2 / 10,
- "3": 2 / 130,
- "4": 2 / 130,
- "5": 2 / 130
-}
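-# (Gen1/2 links use 8b/10b encoding, hence the 2/10 overhead; Gen3+ use 128b/130b, hence 2/130)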
-
-
-def gpu_charts(gpu):
- fam = gpu.full_name()
-
- charts = {
- PCI_BANDWIDTH: {
- 'options': [None, 'PCI Express Bandwidth Utilization', 'KiB/s', fam, 'nvidia_smi.pci_bandwidth', 'area'],
- 'lines': [
- ['rx_util', 'rx', 'absolute', 1, 1],
- ['tx_util', 'tx', 'absolute', 1, -1],
- ]
- },
- PCI_BANDWIDTH_PERCENT: {
- 'options': [None, 'PCI Express Bandwidth Percent', 'percentage', fam, 'nvidia_smi.pci_bandwidth_percent',
- 'area'],
- 'lines': [
- ['rx_util_percent', 'rx_percent'],
- ['tx_util_percent', 'tx_percent'],
- ]
- },
- FAN_SPEED: {
- 'options': [None, 'Fan Speed', 'percentage', fam, 'nvidia_smi.fan_speed', 'line'],
- 'lines': [
- ['fan_speed', 'speed'],
- ]
- },
- GPU_UTIL: {
- 'options': [None, 'GPU Utilization', 'percentage', fam, 'nvidia_smi.gpu_utilization', 'line'],
- 'lines': [
- ['gpu_util', 'utilization'],
- ]
- },
- MEM_UTIL: {
- 'options': [None, 'Memory Bandwidth Utilization', 'percentage', fam, 'nvidia_smi.mem_utilization', 'line'],
- 'lines': [
- ['memory_util', 'utilization'],
- ]
- },
- ENCODER_UTIL: {
- 'options': [None, 'Encoder/Decoder Utilization', 'percentage', fam, 'nvidia_smi.encoder_utilization',
- 'line'],
- 'lines': [
- ['encoder_util', 'encoder'],
- ['decoder_util', 'decoder'],
- ]
- },
- MEM_USAGE: {
- 'options': [None, 'Memory Usage', 'MiB', fam, 'nvidia_smi.memory_allocated', 'stacked'],
- 'lines': [
- ['fb_memory_free', 'free'],
- ['fb_memory_used', 'used'],
- ]
- },
- BAR_USAGE: {
- 'options': [None, 'Bar1 Memory Usage', 'MiB', fam, 'nvidia_smi.bar1_memory_usage', 'stacked'],
- 'lines': [
- ['bar1_memory_free', 'free'],
- ['bar1_memory_used', 'used'],
- ]
- },
- TEMPERATURE: {
- 'options': [None, 'Temperature', 'celsius', fam, 'nvidia_smi.temperature', 'line'],
- 'lines': [
- ['gpu_temp', 'temp'],
- ]
- },
- CLOCKS: {
- 'options': [None, 'Clock Frequencies', 'MHz', fam, 'nvidia_smi.clocks', 'line'],
- 'lines': [
- ['graphics_clock', 'graphics'],
- ['video_clock', 'video'],
- ['sm_clock', 'sm'],
- ['mem_clock', 'mem'],
- ]
- },
- POWER: {
- 'options': [None, 'Power Utilization', 'Watts', fam, 'nvidia_smi.power', 'line'],
- 'lines': [
- ['power_draw', 'power', 'absolute', 1, 100],
- ]
- },
- POWER_STATE: {
- 'options': [None, 'Power State', 'state', fam, 'nvidia_smi.power_state', 'line'],
- 'lines': [['power_state_' + v.lower(), v, 'absolute'] for v in POWER_STATES]
- },
- PROCESSES_MEM: {
- 'options': [None, 'Memory Used by Each Process', 'MiB', fam, 'nvidia_smi.processes_mem', 'stacked'],
- 'lines': []
- },
- USER_MEM: {
- 'options': [None, 'Memory Used by Each User', 'MiB', fam, 'nvidia_smi.user_mem', 'stacked'],
- 'lines': []
- },
- USER_NUM: {
- 'options': [None, 'Number of User on GPU', 'num', fam, 'nvidia_smi.user_num', 'line'],
- 'lines': [
- ['user_num', 'users'],
- ]
- },
- }
-
- idx = gpu.num
-
- order = ['gpu{0}_{1}'.format(idx, v) for v in ORDER]
- charts = dict(('gpu{0}_{1}'.format(idx, k), v) for k, v in charts.items())
-
- for chart in charts.values():
- for line in chart['lines']:
- line[0] = 'gpu{0}_{1}'.format(idx, line[0])
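-            # e.g. for the first GPU, the 'rx_util' dimension id becomes 'gpu0_rx_util'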
-
- return order, charts
-
-
-class NvidiaSMI:
- def __init__(self):
- self.command = find_binary(NVIDIA_SMI)
- self.active_proc = None
-
- def run_once(self):
- proc = subprocess.Popen([self.command, '-x', '-q'], stdout=subprocess.PIPE)
- stdout, _ = proc.communicate()
- return stdout
-
- def run_loop(self, interval):
- if self.active_proc:
- self.kill()
- proc = subprocess.Popen([self.command, '-x', '-q', '-l', str(interval)], stdout=subprocess.PIPE)
- self.active_proc = proc
- return proc.stdout
-
- def kill(self):
- if self.active_proc:
- self.active_proc.kill()
- self.active_proc = None
-
-
-class NvidiaSMIPoller(threading.Thread):
- def __init__(self, poll_interval):
- threading.Thread.__init__(self)
- self.daemon = True
-
- self.smi = NvidiaSMI()
- self.interval = poll_interval
-
- self.lock = threading.RLock()
- self.last_data = str()
- self.exit = False
- self.empty_rows = 0
- self.rows = list()
-
- def has_smi(self):
- return bool(self.smi.command)
-
- def run_once(self):
- return self.smi.run_once()
-
- def run(self):
- out = self.smi.run_loop(self.interval)
-
- for row in out:
- if self.exit or self.empty_rows > EMPTY_ROW_LIMIT:
- break
- self.process_row(row)
- self.smi.kill()
-
- def process_row(self, row):
- row = row.decode()
- self.empty_rows += (row == EMPTY_ROW)
- self.rows.append(row)
-
- if POLLER_BREAK_ROW in row:
- self.lock.acquire()
- self.last_data = '\n'.join(self.rows)
- self.lock.release()
-
- self.rows = list()
- self.empty_rows = 0
-
- def is_started(self):
- return self.ident is not None
-
- def shutdown(self):
- self.exit = True
-
- def data(self):
- self.lock.acquire()
- data = self.last_data
- self.lock.release()
- return data
-
-
-def handle_attr_error(method):
- def on_call(*args, **kwargs):
- try:
- return method(*args, **kwargs)
- except AttributeError:
- return None
-
- return on_call
-
-
-def handle_value_error(method):
- def on_call(*args, **kwargs):
- try:
- return method(*args, **kwargs)
- except ValueError:
- return None
-
- return on_call
-
-
-HOST_PREFIX = os.getenv('NETDATA_HOST_PREFIX')
-ETC_PASSWD_PATH = '/etc/passwd'
-PROC_PATH = '/proc'
-
-IS_INSIDE_DOCKER = False
-
-if HOST_PREFIX:
- ETC_PASSWD_PATH = os.path.join(HOST_PREFIX, ETC_PASSWD_PATH[1:])
- PROC_PATH = os.path.join(HOST_PREFIX, PROC_PATH[1:])
- IS_INSIDE_DOCKER = True
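-    # e.g. with NETDATA_HOST_PREFIX='/host', ETC_PASSWD_PATH becomes '/host/etc/passwd'
-    # and PROC_PATH becomes '/host/proc' (the host's paths as mounted inside the container)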
-
-
-def read_passwd_file():
- data = dict()
- with open(ETC_PASSWD_PATH, 'r') as f:
- for line in f:
- line = line.strip()
- if line.startswith("#"):
- continue
- fields = line.split(":")
- # name, passwd, uid, gid, comment, home_dir, shell
- if len(fields) != 7:
- continue
-            # uid, gid
- fields[2], fields[3] = int(fields[2]), int(fields[3])
- data[fields[2]] = fields
- return data
-
-
-def read_passwd_file_safe():
- try:
- if IS_INSIDE_DOCKER:
- return read_passwd_file()
- return dict((k[2], k) for k in pwd.getpwall())
- except (OSError, IOError):
- return dict()
-
-
-def get_username_by_pid_safe(pid, passwd_file):
- path = os.path.join(PROC_PATH, pid)
- try:
- uid = os.stat(path).st_uid
- except (OSError, IOError):
- return ''
- try:
- if IS_INSIDE_DOCKER:
- return passwd_file[uid][0]
- return pwd.getpwuid(uid)[0]
- except KeyError:
- return str(uid)
-
-
-class GPU:
- def __init__(self, num, root, exclude_zero_memory_users=False):
- self.num = num
- self.root = root
- self.exclude_zero_memory_users = exclude_zero_memory_users
-
- def id(self):
- return self.root.get('id')
-
- def name(self):
- return self.root.find('product_name').text
-
- def full_name(self):
- return 'gpu{0} {1}'.format(self.num, self.name())
-
- @handle_attr_error
- def pci_link_gen(self):
- return self.root.find('pci').find('pci_gpu_link_info').find('pcie_gen').find('max_link_gen').text
-
- @handle_attr_error
- def pci_link_width(self):
- info = self.root.find('pci').find('pci_gpu_link_info')
- return info.find('link_widths').find('max_link_width').text.split('x')[0]
-
- def pci_bw_max(self):
- link_gen = self.pci_link_gen()
-        link_width = self.pci_link_width()
- if link_gen not in PCI_SPEED or link_gen not in PCI_ENCODING or not link_width:
- return None
- # Maximum PCIe Bandwidth = SPEED * WIDTH * (1 - ENCODING) - 1Gb/s.
- # see details https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance
- # return max bandwidth in kilobytes per second (kB/s)
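-        # e.g. for a Gen3 x16 link: 8 * 16 * (1 - 2/130) - 1 ~= 125.03 Gb/s,
-        # i.e. roughly 15,628,846 kB/s (~15.6 GB/s) returned by this method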
-        return (PCI_SPEED[link_gen] * int(link_width) * (1 - PCI_ENCODING[link_gen]) - 1) * 1000 * 1000 / 8
-
- @handle_attr_error
- def rx_util(self):
- return self.root.find('pci').find('rx_util').text.split()[0]
-
- @handle_attr_error
- def tx_util(self):
- return self.root.find('pci').find('tx_util').text.split()[0]
-
- @handle_attr_error
- def fan_speed(self):
- return self.root.find('fan_speed').text.split()[0]
-
- @handle_attr_error
- def gpu_util(self):
- return self.root.find('utilization').find('gpu_util').text.split()[0]
-
- @handle_attr_error
- def memory_util(self):
- return self.root.find('utilization').find('memory_util').text.split()[0]
-
- @handle_attr_error
- def encoder_util(self):
- return self.root.find('utilization').find('encoder_util').text.split()[0]
-
- @handle_attr_error
- def decoder_util(self):
- return self.root.find('utilization').find('decoder_util').text.split()[0]
-
- @handle_attr_error
- def fb_memory_used(self):
- return self.root.find('fb_memory_usage').find('used').text.split()[0]
-
- @handle_attr_error
- def fb_memory_free(self):
- return self.root.find('fb_memory_usage').find('free').text.split()[0]
-
- @handle_attr_error
- def bar1_memory_used(self):
- return self.root.find('bar1_memory_usage').find('used').text.split()[0]
-
- @handle_attr_error
- def bar1_memory_free(self):
- return self.root.find('bar1_memory_usage').find('free').text.split()[0]
-
- @handle_attr_error
- def temperature(self):
- return self.root.find('temperature').find('gpu_temp').text.split()[0]
-
- @handle_attr_error
- def graphics_clock(self):
- return self.root.find('clocks').find('graphics_clock').text.split()[0]
-
- @handle_attr_error
- def video_clock(self):
- return self.root.find('clocks').find('video_clock').text.split()[0]
-
- @handle_attr_error
- def sm_clock(self):
- return self.root.find('clocks').find('sm_clock').text.split()[0]
-
- @handle_attr_error
- def mem_clock(self):
- return self.root.find('clocks').find('mem_clock').text.split()[0]
-
- @handle_attr_error
- def power_readings(self):
- elem = self.root.find('power_readings')
-        return elem if elem is not None else self.root.find('gpu_power_readings')
-
- @handle_attr_error
- def power_state(self):
- return str(self.power_readings().find('power_state').text.split()[0])
-
- @handle_value_error
- @handle_attr_error
- def power_draw(self):
- return float(self.power_readings().find('power_draw').text.split()[0]) * 100
-
- @handle_attr_error
- def processes(self):
- processes_info = self.root.find('processes').findall('process_info')
- if not processes_info:
- return list()
-
- passwd_file = read_passwd_file_safe()
- processes = list()
-
- for info in processes_info:
- pid = info.find('pid').text
- processes.append({
- 'pid': int(pid),
- 'process_name': info.find('process_name').text,
- 'used_memory': int(info.find('used_memory').text.split()[0]),
- 'username': get_username_by_pid_safe(pid, passwd_file),
- })
- return processes
-
- def data(self):
- data = {
- 'rx_util': self.rx_util(),
- 'tx_util': self.tx_util(),
- 'fan_speed': self.fan_speed(),
- 'gpu_util': self.gpu_util(),
- 'memory_util': self.memory_util(),
- 'encoder_util': self.encoder_util(),
- 'decoder_util': self.decoder_util(),
- 'fb_memory_used': self.fb_memory_used(),
- 'fb_memory_free': self.fb_memory_free(),
- 'bar1_memory_used': self.bar1_memory_used(),
- 'bar1_memory_free': self.bar1_memory_free(),
- 'gpu_temp': self.temperature(),
- 'graphics_clock': self.graphics_clock(),
- 'video_clock': self.video_clock(),
- 'sm_clock': self.sm_clock(),
- 'mem_clock': self.mem_clock(),
- 'power_draw': self.power_draw(),
- }
-
- if self.rx_util() != NOT_AVAILABLE and self.tx_util() != NOT_AVAILABLE:
- pci_bw_max = self.pci_bw_max()
- if not pci_bw_max:
- data['rx_util_percent'] = 0
- data['tx_util_percent'] = 0
- else:
-                data['rx_util_percent'] = str(int(int(self.rx_util()) * 100 / pci_bw_max))
-                data['tx_util_percent'] = str(int(int(self.tx_util()) * 100 / pci_bw_max))
-
- for v in POWER_STATES:
- data['power_state_' + v.lower()] = 0
- p_state = self.power_state()
- if p_state:
- data['power_state_' + p_state.lower()] = 1
-
- processes = self.processes() or []
- users = set()
- for p in processes:
- data['process_mem_{0}'.format(p['pid'])] = p['used_memory']
- if p['username']:
- if self.exclude_zero_memory_users and p['used_memory'] == 0:
- continue
- users.add(p['username'])
- key = 'user_mem_{0}'.format(p['username'])
- if key in data:
- data[key] += p['used_memory']
- else:
- data[key] = p['used_memory']
- data['user_num'] = len(users)
-
- return dict(('gpu{0}_{1}'.format(self.num, k), v) for k, v in data.items())
-
-
-class Service(SimpleService):
- def __init__(self, configuration=None, name=None):
- super(Service, self).__init__(configuration=configuration, name=name)
- self.order = list()
- self.definitions = dict()
- self.loop_mode = configuration.get('loop_mode', True)
- poll = int(configuration.get('poll_seconds', self.get_update_every()))
- self.exclude_zero_memory_users = configuration.get('exclude_zero_memory_users', False)
- self.poller = NvidiaSMIPoller(poll)
-
- def get_data_loop_mode(self):
- if not self.poller.is_started():
- self.poller.start()
-
- if not self.poller.is_alive():
- self.debug('poller is off')
- return None
-
- return self.poller.data()
-
- def get_data_normal_mode(self):
- return self.poller.run_once()
-
- def get_data(self):
- if self.loop_mode:
- last_data = self.get_data_loop_mode()
- else:
- last_data = self.get_data_normal_mode()
-
- if not last_data:
- return None
-
- parsed = self.parse_xml(last_data)
- if parsed is None:
- return None
-
- data = dict()
- for idx, root in enumerate(parsed.findall('gpu')):
- gpu = GPU(idx, root, self.exclude_zero_memory_users)
- gpu_data = gpu.data()
- # self.debug(gpu_data)
- gpu_data = dict((k, v) for k, v in gpu_data.items() if is_gpu_data_value_valid(v))
- data.update(gpu_data)
- self.update_processes_mem_chart(gpu)
- self.update_processes_user_mem_chart(gpu)
-
- return data or None
-
- def update_processes_mem_chart(self, gpu):
- ps = gpu.processes()
- if not ps:
- return
- chart = self.charts['gpu{0}_{1}'.format(gpu.num, PROCESSES_MEM)]
- active_dim_ids = []
- for p in ps:
- dim_id = 'gpu{0}_process_mem_{1}'.format(gpu.num, p['pid'])
- active_dim_ids.append(dim_id)
- if dim_id not in chart:
- chart.add_dimension([dim_id, '{0} {1}'.format(p['pid'], p['process_name'])])
- for dim in chart:
- if dim.id not in active_dim_ids:
- chart.del_dimension(dim.id, hide=False)
-
- def update_processes_user_mem_chart(self, gpu):
- ps = gpu.processes()
- if not ps:
- return
- chart = self.charts['gpu{0}_{1}'.format(gpu.num, USER_MEM)]
- active_dim_ids = []
- for p in ps:
- if not p.get('username'):
- continue
- dim_id = 'gpu{0}_user_mem_{1}'.format(gpu.num, p['username'])
- active_dim_ids.append(dim_id)
- if dim_id not in chart:
- chart.add_dimension([dim_id, '{0}'.format(p['username'])])
-
- for dim in chart:
- if dim.id not in active_dim_ids:
- chart.del_dimension(dim.id, hide=False)
-
- def check(self):
- if not self.poller.has_smi():
- self.error("couldn't find '{0}' binary".format(NVIDIA_SMI))
- return False
-
- raw_data = self.poller.run_once()
- if not raw_data:
- self.error("failed to invoke '{0}' binary".format(NVIDIA_SMI))
- return False
-
- parsed = self.parse_xml(raw_data)
- if parsed is None:
- return False
-
- gpus = parsed.findall('gpu')
- if not gpus:
- return False
-
- self.create_charts(gpus)
-
- return True
-
- def parse_xml(self, data):
- try:
- return et.fromstring(data)
- except et.ParseError as error:
- self.error('xml parse failed: "{0}", error: {1}'.format(data, error))
-
- return None
-
- def create_charts(self, gpus):
- for idx, root in enumerate(gpus):
- order, charts = gpu_charts(GPU(idx, root))
- self.order.extend(order)
- self.definitions.update(charts)
-
-
-def is_gpu_data_value_valid(value):
- try:
- int(value)
- except (TypeError, ValueError):
- return False
- return True
diff --git a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf b/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf
deleted file mode 100644
index 3d2a30d41..000000000
--- a/collectors/python.d.plugin/nvidia_smi/nvidia_smi.conf
+++ /dev/null
@@ -1,68 +0,0 @@
-# netdata python.d.plugin configuration for nvidia_smi
-#
-# This file is in YAML format. Generally, the format is:
-#
-# name: value
-#
-# There are 2 sections:
-# - global variables
-# - one or more JOBS
-#
-# JOBS allow you to collect values from multiple sources.
-# Each source will have its own set of charts.
-#
-# JOB parameters have to be indented (using spaces only, example below).
-
-# ----------------------------------------------------------------------
-# Global Variables
-# These variables set the defaults for all JOBs, however each JOB
-# may define its own, overriding the defaults.
-
-# update_every sets the default data collection frequency.
-# If unset, the python.d.plugin default is used.
-# update_every: 1
-
-# priority controls the order of charts at the netdata dashboard.
-# Lower numbers move the charts towards the top of the page.
-# If unset, the default for python.d.plugin is used.
-# priority: 60000
-
-# penalty indicates whether to apply penalty to update_every in case of failures.
-# Penalty will increase every 5 failed updates in a row. Maximum penalty is 10 minutes.
-# penalty: yes
-
-# autodetection_retry sets the job re-check interval in seconds.
-# The job is not deleted if check fails.
-# Attempts to start the job are made once every autodetection_retry.
-# This feature is disabled by default.
-# autodetection_retry: 0
-
-# ----------------------------------------------------------------------
-# JOBS (data collection sources)
-#
-# The default JOBS share the same *name*. JOBS with the same name
-# are mutually exclusive. Only one of them will be allowed running at
-# any time. This allows autodetection to try several alternatives and
-# pick the one that works.
-#
-# Any number of jobs is supported.
-#
-# All python.d.plugin JOBS (for all its modules) support a set of
-# predefined parameters. These are:
-#
-# job_name:
-# name: myname # the JOB's name as it will appear at the
-# # dashboard (by default is the job_name)
-# # JOBs sharing a name are mutually exclusive
-# update_every: 1 # the JOB's data collection frequency
-# priority: 60000 # the JOB's order on the dashboard
-# penalty: yes # the JOB's penalty
-# autodetection_retry: 0 # the JOB's re-check interval in seconds
-#
-# In addition to the above, this module also supports the following:
-#
-# loop_mode: yes/no # default is yes. If set to yes, `nvidia-smi` is executed in a separate thread using the `-l` option.
-# poll_seconds: SECONDS # default is 1. Sets how often, in seconds, the nvidia-smi tool is polled in loop mode.
-# exclude_zero_memory_users: yes/no # default is no. Whether to collect metrics for users with 0 MiB of memory allocated.
-#
-# ----------------------------------------------------------------------
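-#
-# An illustrative job definition using the options above (the job name is an
-# example; the values shown are the documented defaults):
-#
-# local:
-#   name: 'local'
-#   loop_mode: yes
-#   poll_seconds: 1
-#   exclude_zero_memory_users: no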