author     Daniel Baumann <daniel.baumann@progress-linux.org>  2024-05-05 12:08:03 +0000
committer  Daniel Baumann <daniel.baumann@progress-linux.org>  2024-05-05 12:08:18 +0000
commit     5da14042f70711ea5cf66e034699730335462f66 (patch)
tree       0f6354ccac934ed87a2d555f45be4c831cf92f4a /collectors/python.d.plugin/nvidia_smi/README.md
parent     Releasing debian version 1.44.3-2. (diff)
download   netdata-5da14042f70711ea5cf66e034699730335462f66.tar.xz
           netdata-5da14042f70711ea5cf66e034699730335462f66.zip

Merging upstream version 1.45.3+dfsg.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>

Diffstat (limited to 'collectors/python.d.plugin/nvidia_smi/README.md')
 -rw-r--r--  collectors/python.d.plugin/nvidia_smi/README.md | 157
 1 files changed, 0 insertions, 157 deletions
diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md
deleted file mode 100644
index 7d45289a4..000000000
--- a/collectors/python.d.plugin/nvidia_smi/README.md
+++ /dev/null
@@ -1,157 +0,0 @@
-<!--
-title: "Nvidia GPU monitoring with Netdata"
-custom_edit_url: "https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/nvidia_smi/README.md"
-sidebar_label: "nvidia_smi-python.d.plugin"
-learn_status: "Published"
-learn_topic_type: "References"
-learn_rel_path: "Integrations/Monitor/Devices"
--->
-
-# Nvidia GPU collector
-
-Monitors performance metrics (memory usage, fan speed, PCI Express bandwidth utilization, temperature, and more) using the `nvidia-smi` CLI tool.
-
-## Requirements and Notes
-
-- You must have the `nvidia-smi` tool installed, and your NVIDIA GPU(s) must support it. This mostly means the newer high-end models used for AI/ML and crypto workloads, or the professional range; read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
-- You must enable this plugin, as it is disabled by default due to minor performance issues:
- ```bash
- cd /etc/netdata # Replace this path with your Netdata config directory, if different
- sudo ./edit-config python.d.conf
- ```
- Remove the '#' before `nvidia_smi` so the line reads `nvidia_smi: yes`.
-
-- On some systems, when the GPU is idle, the `nvidia-smi` tool unloads, adding latency the next time it is queried. If your GPUs are under constant workload, this is unlikely to be an issue.
-- Currently the `nvidia-smi` tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See the discussion here: <https://github.com/netdata/netdata/pull/4357>
-- Contributions are welcome.
-- Make sure the `netdata` user can execute `/usr/bin/nvidia-smi` (or wherever your binary is located); see the verification sketch after this list.
-- If the `nvidia-smi` process [is not killed after a Netdata restart](https://github.com/netdata/netdata/issues/7143), you need to turn off `loop_mode`.
-- `poll_seconds` is an integer controlling how often, in seconds, the tool is polled.
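-
-As a quick check of the notes above, the following minimal sketch assumes the default binary path `/usr/bin/nvidia-smi` and the default Netdata config directory `/etc/netdata`; adjust both to match your installation:
-
-```bash
-# Confirm the netdata user can execute the nvidia-smi binary
-sudo -u netdata /usr/bin/nvidia-smi
-
-# Confirm the module is enabled; the line should read "nvidia_smi: yes"
-grep nvidia_smi /etc/netdata/python.d.conf
-```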
-
-## Charts
-
-It produces the following charts:
-
-- PCI Express Bandwidth Utilization in `KiB/s`
-- Fan Speed in `percentage`
-- GPU Utilization in `percentage`
-- Memory Bandwidth Utilization in `percentage`
-- Encoder/Decoder Utilization in `percentage`
-- Memory Usage in `MiB`
-- Temperature in `celsius`
-- Clock Frequencies in `MHz`
-- Power Utilization in `Watts`
-- Memory Used by Each Process in `MiB`
-- Memory Used by Each User in `MiB`
-- Number of Users on GPU in `num`
-
-## Configuration
-
-Edit the `python.d/nvidia_smi.conf` configuration file using `edit-config` from the Netdata [config
-directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), which is typically at `/etc/netdata`.
-
-```bash
-cd /etc/netdata # Replace this path with your Netdata config directory, if different
-sudo ./edit-config python.d/nvidia_smi.conf
-```
-
-Sample:
-
-```yaml
-loop_mode : yes
-poll_seconds : 1
-exclude_zero_memory_users : yes
-```
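-
-A lightly annotated version of the same sample may help; the comments are our reading of each option, based on the notes in the Requirements section, and are meant as a sketch rather than authoritative documentation:
-
-```yaml
-# keep one nvidia-smi process running and read from it continuously;
-# set to "no" if the nvidia-smi process is not killed after a Netdata restart
-loop_mode : yes
-# how often, in seconds, nvidia-smi is polled (integer)
-poll_seconds : 1
-# skip per-user memory charts for users currently using no GPU memory
-exclude_zero_memory_users : yes
-```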
-
-
-### Troubleshooting
-
-To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The
-output will show the results of the data collection job, or error messages explaining why the collector isn't working.
-
-First, navigate to your plugins directory, usually located under `/usr/libexec/netdata/plugins.d/`. If that's
-not the case on your system, open `netdata.conf` and look for the setting `plugins directory`. Once you're in the
-plugins directory, switch to the `netdata` user.
-
-```bash
-cd /usr/libexec/netdata/plugins.d/
-sudo su -s /bin/bash netdata
-```
-
-Now you can manually run the `nvidia_smi` module in debug mode:
-
-```bash
-./python.d.plugin nvidia_smi debug trace
-```
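-
-Alternatively, assuming `sudo` is available, the same debug run can be done as a one-off command from the plugins directory without switching shells:
-
-```bash
-# run the collector once in debug mode as the netdata user
-sudo -u netdata ./python.d.plugin nvidia_smi debug trace
-```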
-
-## Docker
-
-GPU monitoring in a Docker container is possible with the [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.
-
-Sample `docker-compose.yml`
-```yaml
-version: '3'
-services:
- netdata:
- image: netdata/netdata
- container_name: netdata
- hostname: example.com # set to fqdn of host
- ports:
- - 19999:19999
- restart: unless-stopped
- cap_add:
- - SYS_PTRACE
- security_opt:
- - apparmor:unconfined
- environment:
- - NETDATA_EXTRA_APK_PACKAGES=gcompat
- volumes:
- - netdataconfig:/etc/netdata
- - netdatalib:/var/lib/netdata
- - netdatacache:/var/cache/netdata
- - /etc/passwd:/host/etc/passwd:ro
- - /etc/group:/host/etc/group:ro
- - /proc:/host/proc:ro
- - /sys:/host/sys:ro
- - /etc/os-release:/host/etc/os-release:ro
- deploy:
- resources:
- reservations:
- devices:
- - driver: nvidia
- count: all
- capabilities: [gpu]
-
-volumes:
- netdataconfig:
- netdatalib:
- netdatacache:
-```
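-
-Assuming the file above is saved as `docker-compose.yml` in the current directory and the Docker Compose plugin is installed, the stack can be started with:
-
-```bash
-docker compose up -d
-```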
-
-Sample `docker run`
-```bash
-docker run -d --name=netdata \
- -p 19999:19999 \
- -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
- -v netdataconfig:/etc/netdata \
- -v netdatalib:/var/lib/netdata \
- -v netdatacache:/var/cache/netdata \
- -v /etc/passwd:/host/etc/passwd:ro \
- -v /etc/group:/host/etc/group:ro \
- -v /proc:/host/proc:ro \
- -v /sys:/host/sys:ro \
- -v /etc/os-release:/host/etc/os-release:ro \
- --restart unless-stopped \
- --cap-add SYS_PTRACE \
- --security-opt apparmor=unconfined \
- --gpus all \
- netdata/netdata
-```
-
-### Docker Troubleshooting
-
-To troubleshoot `nvidia-smi` in a Docker container, first confirm that `nvidia-smi` is working on the host system. If it is, run `docker exec -it netdata nvidia-smi` to confirm it also works inside the container. If `nvidia-smi` is functioning both inside and outside of the container, confirm that `nvidia_smi: yes` is uncommented in `python.d.conf`:
-```bash
-docker exec -it netdata bash
-cd /etc/netdata
-./edit-config python.d.conf
-```
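-
-After uncommenting `nvidia_smi: yes`, restart the container so the collector picks up the change (this assumes the container name `netdata` used in the samples above):
-
-```bash
-docker restart netdata
-```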