| field | value | date |
|---|---|---|
| author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-05-08 16:27:04 +0000 |
| committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-05-08 16:27:04 +0000 |
| commit | a836a244a3d2bdd4da1ee2641e3e957850668cea (patch) | |
| tree | cb87c75b3677fab7144f868435243f864048a1e6 /collectors/python.d.plugin/nvidia_smi | |
| parent | Adding upstream version 1.38.1. (diff) | |
Adding upstream version 1.39.0. (upstream/1.39.0)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'collectors/python.d.plugin/nvidia_smi')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | collectors/python.d.plugin/nvidia_smi/README.md | 98 |
| -rw-r--r-- | collectors/python.d.plugin/nvidia_smi/metrics.csv | 16 |
2 files changed, 109 insertions, 5 deletions
diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md
index ce5473c26..7d45289a4 100644
--- a/collectors/python.d.plugin/nvidia_smi/README.md
+++ b/collectors/python.d.plugin/nvidia_smi/README.md
@@ -4,16 +4,13 @@ custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/pyth
 sidebar_label: "nvidia_smi-python.d.plugin"
 learn_status: "Published"
 learn_topic_type: "References"
-learn_rel_path: "References/Collectors references/Devices"
+learn_rel_path: "Integrations/Monitor/Devices"
 -->

-# Nvidia GPU monitoring with Netdata
+# Nvidia GPU collector

 Monitors performance metrics (memory usage, fan speed, pcie bandwidth utilization, temperature, etc.) using `nvidia-smi` cli tool.

-> **Warning**: this collector does not work when the Netdata Agent is [running in a container](https://github.com/netdata/netdata/blob/master/packaging/docker/README.md).
-
-
 ## Requirements and Notes

 - You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support the tool. Mostly the newer high end models used for AI / ML and Crypto or Pro range, read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
@@ -67,3 +64,94 @@
 exclude_zero_memory_users : yes
 ```

+### Troubleshooting
+
+To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The
+output will give you the output of the data collection job or error messages on why the collector isn't working.
+
+First, navigate to your plugins directory, usually they are located under `/usr/libexec/netdata/plugins.d/`. If that's
+not the case on your system, open `netdata.conf` and look for the setting `plugins directory`. Once you're in the
+plugin's directory, switch to the `netdata` user.
+
+```bash
+cd /usr/libexec/netdata/plugins.d/
+sudo su -s /bin/bash netdata
+```
+
+Now you can manually run the `nvidia_smi` module in debug mode:
+
+```bash
+./python.d.plugin nvidia_smi debug trace
+```
+
+## Docker
+
+GPU monitoring in a docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.
+
+Sample `docker-compose.yml`
+```yaml
+version: '3'
+services:
+  netdata:
+    image: netdata/netdata
+    container_name: netdata
+    hostname: example.com # set to fqdn of host
+    ports:
+      - 19999:19999
+    restart: unless-stopped
+    cap_add:
+      - SYS_PTRACE
+    security_opt:
+      - apparmor:unconfined
+    environment:
+      - NETDATA_EXTRA_APK_PACKAGES=gcompat
+    volumes:
+      - netdataconfig:/etc/netdata
+      - netdatalib:/var/lib/netdata
+      - netdatacache:/var/cache/netdata
+      - /etc/passwd:/host/etc/passwd:ro
+      - /etc/group:/host/etc/group:ro
+      - /proc:/host/proc:ro
+      - /sys:/host/sys:ro
+      - /etc/os-release:/host/etc/os-release:ro
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+
+volumes:
+  netdataconfig:
+  netdatalib:
+  netdatacache:
+```
+
+Sample `docker run`
+```bash
+docker run -d --name=netdata \
+  -p 19999:19999 \
+  -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
+  -v netdataconfig:/etc/netdata \
+  -v netdatalib:/var/lib/netdata \
+  -v netdatacache:/var/cache/netdata \
+  -v /etc/passwd:/host/etc/passwd:ro \
+  -v /etc/group:/host/etc/group:ro \
+  -v /proc:/host/proc:ro \
+  -v /sys:/host/sys:ro \
+  -v /etc/os-release:/host/etc/os-release:ro \
+  --restart unless-stopped \
+  --cap-add SYS_PTRACE \
+  --security-opt apparmor=unconfined \
+  --gpus all \
+  netdata/netdata
+```
+
+### Docker Troubleshooting
+To troubleshoot `nvidia-smi` in a docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the docker container. If `nvidia-smi` is functioning both inside and outside of the container, confirm that `nvidia_smi: yes` is uncommented in `python.d.conf`.
+```bash
+docker exec -it netdata bash
+cd /etc/netdata
+./edit-config python.d.conf
+```
diff --git a/collectors/python.d.plugin/nvidia_smi/metrics.csv b/collectors/python.d.plugin/nvidia_smi/metrics.csv
new file mode 100644
index 000000000..683ea5650
--- /dev/null
+++ b/collectors/python.d.plugin/nvidia_smi/metrics.csv
@@ -0,0 +1,16 @@
+metric,scope,dimensions,unit,description,chart_type,labels,plugin,module
+nvidia_smi.pci_bandwidth,GPU,"rx, tx",KiB/s,PCI Express Bandwidth Utilization,area,,python.d.plugin,nvidia_smi
+nvidia_smi.pci_bandwidth_percent,GPU,"rx_percent, tx_percent",percentage,PCI Express Bandwidth Percent,area,,python.d.plugin,nvidia_smi
+nvidia_smi.fan_speed,GPU,speed,percentage,Fan Speed,line,,python.d.plugin,nvidia_smi
+nvidia_smi.gpu_utilization,GPU,utilization,percentage,GPU Utilization,line,,python.d.plugin,nvidia_smi
+nvidia_smi.mem_utilization,GPU,utilization,percentage,Memory Bandwidth Utilization,line,,python.d.plugin,nvidia_smi
+nvidia_smi.encoder_utilization,GPU,"encoder, decoder",percentage,Encoder/Decoder Utilization,line,,python.d.plugin,nvidia_smi
+nvidia_smi.memory_allocated,GPU,"free, used",MiB,Memory Usage,stacked,,python.d.plugin,nvidia_smi
+nvidia_smi.bar1_memory_usage,GPU,"free, used",MiB,Bar1 Memory Usage,stacked,,python.d.plugin,nvidia_smi
+nvidia_smi.temperature,GPU,temp,celsius,Temperature,line,,python.d.plugin,nvidia_smi
+nvidia_smi.clocks,GPU,"graphics, video, sm, mem",MHz,Clock Frequencies,line,,python.d.plugin,nvidia_smi
+nvidia_smi.power,GPU,power,Watts,Power Utilization,line,,python.d.plugin,nvidia_smi
+nvidia_smi.power_state,GPU,a dimension per {power_state},state,Power State,line,,python.d.plugin,nvidia_smi
+nvidia_smi.processes_mem,GPU,a dimension per process,MiB,Memory Used by Each Process,stacked,,python.d.plugin,nvidia_smi
+nvidia_smi.user_mem,GPU,a dimension per user,MiB,Memory Used by Each User,stacked,,python.d.plugin,nvidia_smi
+nvidia_smi.user_num,GPU,users,num,Number of Users on GPU,line,,python.d.plugin,nvidia_smi
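
Regarding the host-side prerequisite in the `## Docker` section of the patched README: a quick way to confirm that the NVIDIA Container Toolkit can actually hand the GPU to containers is to run `nvidia-smi` inside a throwaway container before starting netdata. A minimal sketch, assuming the driver and toolkit are already installed on the host (the plain `ubuntu` image and the `--gpus all` flag follow the toolkit's install guide; adjust to your environment):

```bash
# Sketch: host-side check that GPU passthrough works before starting the netdata container.
# Assumes the NVIDIA driver and nvidia-container-toolkit are installed and Docker has been restarted.
docker run --rm --gpus all ubuntu nvidia-smi
```

If this prints the usual `nvidia-smi` table, the `docker-compose.yml` and `docker run` examples in the diff should be able to see the GPU as well.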
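
Likewise, for the `python.d.conf` check in `### Docker Troubleshooting`: the file opened by `./edit-config python.d.conf` is YAML that toggles each python.d module by name. A minimal sketch of the relevant entry, assuming the stock layout (only the `nvidia_smi` key is shown; the real file lists every module):

```yaml
# /etc/netdata/python.d.conf (excerpt)
# A module is enabled with "<module>: yes"; make sure this line is present and not commented out.
nvidia_smi: yes
```

After saving, restart the container (for the examples above, `docker restart netdata`) so `python.d.plugin` reloads its configuration.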