From be1c7e50e1e8809ea56f2c9d472eccd8ffd73a97 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Fri, 19 Apr 2024 04:57:58 +0200
Subject: Adding upstream version 1.44.3.

Signed-off-by: Daniel Baumann
---
 collectors/python.d.plugin/nvidia_smi/README.md | 157 ++++++++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 collectors/python.d.plugin/nvidia_smi/README.md

diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md
new file mode 100644
index 00000000..7d45289a
--- /dev/null
+++ b/collectors/python.d.plugin/nvidia_smi/README.md
@@ -0,0 +1,157 @@

# Nvidia GPU collector

Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.

## Requirements and Notes

- You must have the `nvidia-smi` tool installed, and your NVIDIA GPU(s) must support it. Support is mostly limited to the newer high-end models used for AI/ML and crypto workloads, and the professional range; read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
- You must enable this plugin, as it is disabled by default due to minor performance issues:
  ```bash
  cd /etc/netdata # Replace this path with your Netdata config directory, if different
  sudo ./edit-config python.d.conf
  ```
  Remove the '#' before `nvidia_smi` so it reads: `nvidia_smi: yes`.

- On some systems, the `nvidia-smi` tool unloads when the GPU is idle, which adds latency the next time it is queried. If you run your GPUs under a constant workload, this is unlikely to be an issue.
- Currently the `nvidia-smi` tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See discussion here:
- Contributions are welcome.
- Make sure the `netdata` user can execute `/usr/bin/nvidia-smi`, or wherever your binary is installed (see the check after the configuration sample below).
- If the `nvidia-smi` process [is not killed after a Netdata restart](https://github.com/netdata/netdata/issues/7143), you need to turn off `loop_mode`.
- `poll_seconds` is an integer specifying how often, in seconds, the tool is polled.

## Charts

It produces the following charts:

- PCI Express Bandwidth Utilization in `KiB/s`
- Fan Speed in `percentage`
- GPU Utilization in `percentage`
- Memory Bandwidth Utilization in `percentage`
- Encoder/Decoder Utilization in `percentage`
- Memory Usage in `MiB`
- Temperature in `celsius`
- Clock Frequencies in `MHz`
- Power Utilization in `Watts`
- Memory Used by Each Process in `MiB`
- Memory Used by Each User in `MiB`
- Number of Users on GPU in `num`

## Configuration

Edit the `python.d/nvidia_smi.conf` configuration file using `edit-config` from the Netdata [config
directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), which is typically at `/etc/netdata`.

```bash
cd /etc/netdata # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/nvidia_smi.conf
```

Sample:

```yaml
loop_mode : yes
poll_seconds : 1
exclude_zero_memory_users : yes
```
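A quick way to confirm the prerequisites above are in place is to run `nvidia-smi` as the `netdata` user and then restart the Agent so it picks up the new configuration. This is a minimal sketch, assuming the binary lives at `/usr/bin/nvidia-smi` and the host uses systemd; adjust the path and service command for your system:

```bash
# Check that the netdata user can execute nvidia-smi and query the GPU.
sudo -u netdata /usr/bin/nvidia-smi --query-gpu=name,memory.used --format=csv

# Restart the Netdata Agent so the enabled nvidia_smi module is loaded.
sudo systemctl restart netdata
```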
### Troubleshooting

To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The
output will give you the output of the data collection job or error messages on why the collector isn't working.

First, navigate to your plugins directory, usually located under `/usr/libexec/netdata/plugins.d/`. If that's
not the case on your system, open `netdata.conf` and look for the `plugins directory` setting. Once you're in the
plugins directory, switch to the `netdata` user.

```bash
cd /usr/libexec/netdata/plugins.d/
sudo su -s /bin/bash netdata
```

Now you can manually run the `nvidia_smi` module in debug mode:

```bash
./python.d.plugin nvidia_smi debug trace
```

## Docker

GPU monitoring in a Docker container is possible with the [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and with `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.

Sample `docker-compose.yml`:

```yaml
version: '3'
services:
  netdata:
    image: netdata/netdata
    container_name: netdata
    hostname: example.com # set to fqdn of host
    ports:
      - 19999:19999
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    environment:
      - NETDATA_EXTRA_APK_PACKAGES=gcompat
    volumes:
      - netdataconfig:/etc/netdata
      - netdatalib:/var/lib/netdata
      - netdatacache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  netdataconfig:
  netdatalib:
  netdatacache:
```

Sample `docker run`:

```bash
docker run -d --name=netdata \
  -p 19999:19999 \
  -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
  -v netdataconfig:/etc/netdata \
  -v netdatalib:/var/lib/netdata \
  -v netdatacache:/var/cache/netdata \
  -v /etc/passwd:/host/etc/passwd:ro \
  -v /etc/group:/host/etc/group:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /etc/os-release:/host/etc/os-release:ro \
  --restart unless-stopped \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  --gpus all \
  netdata/netdata
```

### Docker Troubleshooting

To troubleshoot `nvidia-smi` in a Docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the Docker container. If `nvidia-smi` is functioning both inside and outside of the container, confirm that `nvidia_smi: yes` is uncommented in `python.d.conf`.

```bash
docker exec -it netdata bash
cd /etc/netdata
./edit-config python.d.conf
```
-- cgit v1.2.3