diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-05-08 16:27:04 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-05-08 16:27:04 +0000 |
commit | a836a244a3d2bdd4da1ee2641e3e957850668cea (patch) | |
tree | cb87c75b3677fab7144f868435243f864048a1e6 /collectors/python.d.plugin/nvidia_smi/README.md | |
parent | Adding upstream version 1.38.1. (diff) | |
download | netdata-a836a244a3d2bdd4da1ee2641e3e957850668cea.tar.xz netdata-a836a244a3d2bdd4da1ee2641e3e957850668cea.zip |
Adding upstream version 1.39.0.upstream/1.39.0
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'collectors/python.d.plugin/nvidia_smi/README.md')
-rw-r--r-- | collectors/python.d.plugin/nvidia_smi/README.md | 98 |
1 files changed, 93 insertions, 5 deletions
diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md index ce5473c26..7d45289a4 100644 --- a/collectors/python.d.plugin/nvidia_smi/README.md +++ b/collectors/python.d.plugin/nvidia_smi/README.md @@ -4,16 +4,13 @@ custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/pyth sidebar_label: "nvidia_smi-python.d.plugin" learn_status: "Published" learn_topic_type: "References" -learn_rel_path: "References/Collectors references/Devices" +learn_rel_path: "Integrations/Monitor/Devices" --> -# Nvidia GPU monitoring with Netdata +# Nvidia GPU collector Monitors performance metrics (memory usage, fan speed, pcie bandwidth utilization, temperature, etc.) using `nvidia-smi` cli tool. -> **Warning**: this collector does not work when the Netdata Agent is [running in a container](https://github.com/netdata/netdata/blob/master/packaging/docker/README.md). - - ## Requirements and Notes - You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support the tool. Mostly the newer high end models used for AI / ML and Crypto or Pro range, read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface). @@ -67,3 +64,94 @@ exclude_zero_memory_users : yes ``` +### Troubleshooting + +To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The +output will give you the output of the data collection job or error messages on why the collector isn't working. + +First, navigate to your plugins directory, usually they are located under `/usr/libexec/netdata/plugins.d/`. If that's +not the case on your system, open `netdata.conf` and look for the setting `plugins directory`. Once you're in the +plugin's directory, switch to the `netdata` user. + +```bash +cd /usr/libexec/netdata/plugins.d/ +sudo su -s /bin/bash netdata +``` + +Now you can manually run the `nvidia_smi` module in debug mode: + +```bash +./python.d.plugin nvidia_smi debug trace +``` + +## Docker + +GPU monitoring in a docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable. + +Sample `docker-compose.yml` +```yaml +version: '3' +services: + netdata: + image: netdata/netdata + container_name: netdata + hostname: example.com # set to fqdn of host + ports: + - 19999:19999 + restart: unless-stopped + cap_add: + - SYS_PTRACE + security_opt: + - apparmor:unconfined + environment: + - NETDATA_EXTRA_APK_PACKAGES=gcompat + volumes: + - netdataconfig:/etc/netdata + - netdatalib:/var/lib/netdata + - netdatacache:/var/cache/netdata + - /etc/passwd:/host/etc/passwd:ro + - /etc/group:/host/etc/group:ro + - /proc:/host/proc:ro + - /sys:/host/sys:ro + - /etc/os-release:/host/etc/os-release:ro + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: all + capabilities: [gpu] + +volumes: + netdataconfig: + netdatalib: + netdatacache: +``` + +Sample `docker run` +```yaml +docker run -d --name=netdata \ + -p 19999:19999 \ + -e NETDATA_EXTRA_APK_PACKAGES=gcompat \ + -v netdataconfig:/etc/netdata \ + -v netdatalib:/var/lib/netdata \ + -v netdatacache:/var/cache/netdata \ + -v /etc/passwd:/host/etc/passwd:ro \ + -v /etc/group:/host/etc/group:ro \ + -v /proc:/host/proc:ro \ + -v /sys:/host/sys:ro \ + -v /etc/os-release:/host/etc/os-release:ro \ + --restart unless-stopped \ + --cap-add SYS_PTRACE \ + --security-opt apparmor=unconfined \ + --gpus all \ + netdata/netdata +``` + +### Docker Troubleshooting +To troubleshoot `nvidia-smi` in a docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the docker container. If `nvidia-smi` is fuctioning both inside and outside of the container, confirm that `nvidia-smi: yes` is uncommented in `python.d.conf`. +```bash +docker exec -it netdata bash +cd /etc/netdata +./edit-config python.d.conf +``` |