author    Daniel Baumann <daniel.baumann@progress-linux.org>  2023-05-08 16:27:04 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org>  2023-05-08 16:27:04 +0000
commit    a836a244a3d2bdd4da1ee2641e3e957850668cea (patch)
tree      cb87c75b3677fab7144f868435243f864048a1e6 /collectors/python.d.plugin/nvidia_smi
parent    Adding upstream version 1.38.1. (diff)
Adding upstream version 1.39.0. (upstream/1.39.0)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'collectors/python.d.plugin/nvidia_smi')
-rw-r--r--  collectors/python.d.plugin/nvidia_smi/README.md    | 98
-rw-r--r--  collectors/python.d.plugin/nvidia_smi/metrics.csv  | 16
2 files changed, 109 insertions, 5 deletions
diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md
index ce5473c26..7d45289a4 100644
--- a/collectors/python.d.plugin/nvidia_smi/README.md
+++ b/collectors/python.d.plugin/nvidia_smi/README.md
@@ -4,16 +4,13 @@ custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/pyth
sidebar_label: "nvidia_smi-python.d.plugin"
learn_status: "Published"
learn_topic_type: "References"
-learn_rel_path: "References/Collectors references/Devices"
+learn_rel_path: "Integrations/Monitor/Devices"
-->
-# Nvidia GPU monitoring with Netdata
+# Nvidia GPU collector
Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.
-> **Warning**: this collector does not work when the Netdata Agent is [running in a container](https://github.com/netdata/netdata/blob/master/packaging/docker/README.md).
-
-
## Requirements and Notes
- You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support it. This mostly applies to newer high-end models used for AI/ML, crypto mining, or the Pro range; read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
@@ -67,3 +64,94 @@ exclude_zero_memory_users : yes
```
+### Troubleshooting
+
+To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The
+output will show you the data collection job's results or error messages explaining why the collector isn't working.
+
+First, navigate to your plugins directory; it is usually located at `/usr/libexec/netdata/plugins.d/`. If that's
+not the case on your system, open `netdata.conf` and look for the `plugins directory` setting. Once you're in the
+plugins directory, switch to the `netdata` user.
+
+```bash
+cd /usr/libexec/netdata/plugins.d/
+sudo su -s /bin/bash netdata
+```
+
+Now you can manually run the `nvidia_smi` module in debug mode:
+
+```bash
+./python.d.plugin nvidia_smi debug trace
+```
+
+## Docker
+
+GPU monitoring in a Docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.
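+
+If the toolkit is installed correctly, Docker can expose the GPU to any container. A quick host-side check (a minimal sketch; the CUDA image tag is only illustrative, any available tag will do):
+
+```bash
+# verify that Docker can pass the GPU through via the NVIDIA runtime
+docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
+```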
+
+Sample `docker-compose.yml`
+```yaml
+version: '3'
+services:
+ netdata:
+ image: netdata/netdata
+ container_name: netdata
+ hostname: example.com # set to fqdn of host
+ ports:
+ - 19999:19999
+ restart: unless-stopped
+ cap_add:
+ - SYS_PTRACE
+ security_opt:
+ - apparmor:unconfined
+ environment:
+ - NETDATA_EXTRA_APK_PACKAGES=gcompat
+ volumes:
+ - netdataconfig:/etc/netdata
+ - netdatalib:/var/lib/netdata
+ - netdatacache:/var/cache/netdata
+ - /etc/passwd:/host/etc/passwd:ro
+ - /etc/group:/host/etc/group:ro
+ - /proc:/host/proc:ro
+ - /sys:/host/sys:ro
+ - /etc/os-release:/host/etc/os-release:ro
+ deploy:
+ resources:
+ reservations:
+ devices:
+ - driver: nvidia
+ count: all
+ capabilities: [gpu]
+
+volumes:
+ netdataconfig:
+ netdatalib:
+ netdatacache:
+```
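+
+Assuming the compose file above is saved as `docker-compose.yml` and the container keeps the sample name `netdata`, a minimal way to bring the stack up and confirm the GPU is visible inside the container:
+
+```bash
+# start the stack defined above (use `docker-compose` on older installs)
+docker compose up -d
+# confirm the GPU is exposed inside the Netdata container
+docker exec -it netdata nvidia-smi
+```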
+
+Sample `docker run`
+```bash
+docker run -d --name=netdata \
+ -p 19999:19999 \
+ -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
+ -v netdataconfig:/etc/netdata \
+ -v netdatalib:/var/lib/netdata \
+ -v netdatacache:/var/cache/netdata \
+ -v /etc/passwd:/host/etc/passwd:ro \
+ -v /etc/group:/host/etc/group:ro \
+ -v /proc:/host/proc:ro \
+ -v /sys:/host/sys:ro \
+ -v /etc/os-release:/host/etc/os-release:ro \
+ --restart unless-stopped \
+ --cap-add SYS_PTRACE \
+ --security-opt apparmor=unconfined \
+ --gpus all \
+ netdata/netdata
+```
+
+### Docker Troubleshooting
+
+To troubleshoot `nvidia-smi` in a Docker container, first confirm that `nvidia-smi` works on the host system. If it does, run `docker exec -it netdata nvidia-smi` to confirm it also works within the Docker container. If `nvidia-smi` is functioning both inside and outside of the container, confirm that `nvidia-smi: yes` is uncommented in `python.d.conf`.
+```bash
+docker exec -it netdata bash
+cd /etc/netdata
+./edit-config python.d.conf
+```
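+
+For reference, the verification steps above look like this on the command line (assuming the container is named `netdata`, as in the samples above):
+
+```bash
+# on the host: confirm the driver and nvidia-smi work
+nvidia-smi
+# inside the container: confirm the GPU is visible to Netdata
+docker exec -it netdata nvidia-smi
+```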
diff --git a/collectors/python.d.plugin/nvidia_smi/metrics.csv b/collectors/python.d.plugin/nvidia_smi/metrics.csv
new file mode 100644
index 000000000..683ea5650
--- /dev/null
+++ b/collectors/python.d.plugin/nvidia_smi/metrics.csv
@@ -0,0 +1,16 @@
+metric,scope,dimensions,unit,description,chart_type,labels,plugin,module
+nvidia_smi.pci_bandwidth,GPU,"rx, tx",KiB/s,PCI Express Bandwidth Utilization,area,,python.d.plugin,nvidia_smi
+nvidia_smi.pci_bandwidth_percent,GPU,"rx_percent, tx_percent",percentage,PCI Express Bandwidth Percent,area,,python.d.plugin,nvidia_smi
+nvidia_smi.fan_speed,GPU,speed,percentage,Fan Speed,line,,python.d.plugin,nvidia_smi
+nvidia_smi.gpu_utilization,GPU,utilization,percentage,GPU Utilization,line,,python.d.plugin,nvidia_smi
+nvidia_smi.mem_utilization,GPU,utilization,percentage,Memory Bandwidth Utilization,line,,python.d.plugin,nvidia_smi
+nvidia_smi.encoder_utilization,GPU,"encoder, decoder",percentage,Encoder/Decoder Utilization,line,,python.d.plugin,nvidia_smi
+nvidia_smi.memory_allocated,GPU,"free, used",MiB,Memory Usage,stacked,,python.d.plugin,nvidia_smi
+nvidia_smi.bar1_memory_usage,GPU,"free, used",MiB,Bar1 Memory Usage,stacked,,python.d.plugin,nvidia_smi
+nvidia_smi.temperature,GPU,temp,celsius,Temperature,line,,python.d.plugin,nvidia_smi
+nvidia_smi.clocks,GPU,"graphics, video, sm, mem",MHz,Clock Frequencies,line,,python.d.plugin,nvidia_smi
+nvidia_smi.power,GPU,power,Watts,Power Utilization,line,,python.d.plugin,nvidia_smi
+nvidia_smi.power_state,GPU,a dimension per {power_state},state,Power State,line,,python.d.plugin,nvidia_smi
+nvidia_smi.processes_mem,GPU,a dimension per process,MiB,Memory Used by Each Process,stacked,,python.d.plugin,nvidia_smi
+nvidia_smi.user_mem,GPU,a dimension per user,MiB,Memory Used by Each User,stacked,,python.d.plugin,nvidia_smi
+nvidia_smi.user_num,GPU,users,num,Number of Users on GPU,line,,python.d.plugin,nvidia_smi