From be1c7e50e1e8809ea56f2c9d472eccd8ffd73a97 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Fri, 19 Apr 2024 04:57:58 +0200
Subject: Adding upstream version 1.44.3.

Signed-off-by: Daniel Baumann
---
 collectors/python.d.plugin/nvidia_smi/README.md | 157 ++++++++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 collectors/python.d.plugin/nvidia_smi/README.md

diff --git a/collectors/python.d.plugin/nvidia_smi/README.md b/collectors/python.d.plugin/nvidia_smi/README.md
new file mode 100644
index 00000000..7d45289a
--- /dev/null
+++ b/collectors/python.d.plugin/nvidia_smi/README.md
@@ -0,0 +1,157 @@

# Nvidia GPU collector

Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.

## Requirements and Notes

- You must have the `nvidia-smi` tool installed, and your NVIDIA GPU(s) must support it. Support is mostly limited to the newer high-end models used for AI/ML and crypto workloads, and the professional range; read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
- You must enable this plugin, as it is disabled by default due to minor performance issues:
  ```bash
  cd /etc/netdata # Replace this path with your Netdata config directory, if different
  sudo ./edit-config python.d.conf
  ```
  Remove the '#' before `nvidia_smi` so it reads: `nvidia_smi: yes`.

- On some systems, the `nvidia-smi` tool unloads when the GPU is idle, which adds latency the next time it is queried. If you run your GPUs under a constant workload, this is unlikely to be an issue.
- Currently the `nvidia-smi` tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See discussion here:
- Contributions are welcome.
- Make sure the `netdata` user can execute `/usr/bin/nvidia-smi`, or wherever your binary is installed (see the check after the configuration sample below).
- If the `nvidia-smi` process [is not killed after a Netdata restart](https://github.com/netdata/netdata/issues/7143), you need to turn off `loop_mode`.
- `poll_seconds` is an integer specifying how often, in seconds, the tool is polled.

## Charts

It produces the following charts:

- PCI Express Bandwidth Utilization in `KiB/s`
- Fan Speed in `percentage`
- GPU Utilization in `percentage`
- Memory Bandwidth Utilization in `percentage`
- Encoder/Decoder Utilization in `percentage`
- Memory Usage in `MiB`
- Temperature in `celsius`
- Clock Frequencies in `MHz`
- Power Utilization in `Watts`
- Memory Used by Each Process in `MiB`
- Memory Used by Each User in `MiB`
- Number of Users on GPU in `num`

## Configuration

Edit the `python.d/nvidia_smi.conf` configuration file using `edit-config` from the Netdata [config
directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), which is typically at `/etc/netdata`.

```bash
cd /etc/netdata # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/nvidia_smi.conf
```

Sample:

```yaml
loop_mode : yes
poll_seconds : 1
exclude_zero_memory_users : yes
```
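A quick way to confirm the prerequisites above are in place is to run `nvidia-smi` as the `netdata` user and then restart the Agent so it picks up the new configuration. This is a minimal sketch, assuming the binary lives at `/usr/bin/nvidia-smi` and the host uses systemd; adjust the path and service command for your system:

```bash
# Check that the netdata user can execute nvidia-smi and query the GPU.
sudo -u netdata /usr/bin/nvidia-smi --query-gpu=name,memory.used --format=csv

# Restart the Netdata Agent so the enabled nvidia_smi module is loaded.
sudo systemctl restart netdata
```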
### Troubleshooting

To troubleshoot issues with the `nvidia_smi` module, run the `python.d.plugin` with the debug option enabled. The
output will give you the output of the data collection job or error messages on why the collector isn't working.

First, navigate to your plugins directory, usually located under `/usr/libexec/netdata/plugins.d/`. If that's
not the case on your system, open `netdata.conf` and look for the `plugins directory` setting. Once you're in the
plugins directory, switch to the `netdata` user.

```bash
cd /usr/libexec/netdata/plugins.d/
sudo su -s /bin/bash netdata
```

Now you can manually run the `nvidia_smi` module in debug mode:

```bash
./python.d.plugin nvidia_smi debug trace
```

## Docker

GPU monitoring in a Docker container is possible with the [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and with `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.

Sample `docker-compose.yml`:

```yaml
version: '3'
services:
  netdata:
    image: netdata/netdata
    container_name: netdata
    hostname: example.com # set to fqdn of host
    ports:
      - 19999:19999
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    environment:
      - NETDATA_EXTRA_APK_PACKAGES=gcompat
    volumes:
      - netdataconfig:/etc/netdata
      - netdatalib:/var/lib/netdata
      - netdatacache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  netdataconfig:
  netdatalib:
  netdatacache:
```

Sample `docker run`:

```bash
docker run -d --name=netdata \
  -p 19999:19999 \
  -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
  -v netdataconfig:/etc/netdata \
  -v netdatalib:/var/lib/netdata \
  -v netdatacache:/var/cache/netdata \
  -v /etc/passwd:/host/etc/passwd:ro \
  -v /etc/group:/host/etc/group:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /etc/os-release:/host/etc/os-release:ro \
  --restart unless-stopped \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  --gpus all \
  netdata/netdata
```

### Docker Troubleshooting

To troubleshoot `nvidia-smi` in a Docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the Docker container. If `nvidia-smi` is functioning both inside and outside of the container, confirm that `nvidia_smi: yes` is uncommented in `python.d.conf`.

```bash
docker exec -it netdata bash
cd /etc/netdata
./edit-config python.d.conf
```
-- cgit v1.2.3