path: root/src/go/collectors/go.d.plugin/modules/nvidia_smi/integrations/nvidia_gpu.md
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-08-26 08:15:20 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-08-26 08:15:20 +0000
commit: 87d772a7d708fec12f48cd8adc0dedff6e1025da
tree: 1fee344c64cc3f43074a01981e21126c8482a522 /src/go/collectors/go.d.plugin/modules/nvidia_smi/integrations/nvidia_gpu.md
parent: Adding upstream version 1.46.3.
Adding upstream version 1.47.0. (refs: upstream/1.47.0, upstream)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r-- src/go/plugin/go.d/modules/nvidia_smi/integrations/nvidia_gpu.md (renamed from src/go/collectors/go.d.plugin/modules/nvidia_smi/integrations/nvidia_gpu.md) | 107
1 files changed, 61 insertions, 46 deletions
diff --git a/src/go/collectors/go.d.plugin/modules/nvidia_smi/integrations/nvidia_gpu.md b/src/go/plugin/go.d/modules/nvidia_smi/integrations/nvidia_gpu.md
index 28016cfbd..620c09639 100644
--- a/src/go/collectors/go.d.plugin/modules/nvidia_smi/integrations/nvidia_gpu.md
+++ b/src/go/plugin/go.d/modules/nvidia_smi/integrations/nvidia_gpu.md
@@ -1,6 +1,6 @@
<!--startmeta
-custom_edit_url: "https://github.com/netdata/netdata/edit/master/src/go/collectors/go.d.plugin/modules/nvidia_smi/README.md"
-meta_yaml: "https://github.com/netdata/netdata/edit/master/src/go/collectors/go.d.plugin/modules/nvidia_smi/metadata.yaml"
+custom_edit_url: "https://github.com/netdata/netdata/edit/master/src/go/plugin/go.d/modules/nvidia_smi/README.md"
+meta_yaml: "https://github.com/netdata/netdata/edit/master/src/go/plugin/go.d/modules/nvidia_smi/metadata.yaml"
sidebar_label: "Nvidia GPU"
learn_status: "Published"
learn_rel_path: "Collecting Metrics/Hardware Devices and Sensors"
@@ -24,8 +24,6 @@ Module: nvidia_smi
This collector monitors GPUs performance metrics using
the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) CLI tool.
-> **Warning**: under development, [loop mode](https://github.com/netdata/netdata/issues/14522) not implemented yet.
-
@@ -70,24 +68,24 @@ Labels:
Metrics:
-| Metric | Dimensions | Unit | XML | CSV |
-|:------|:----------|:----|:---:|:---:|
-| nvidia_smi.gpu_pcie_bandwidth_usage | rx, tx | B/s | • | |
-| nvidia_smi.gpu_pcie_bandwidth_utilization | rx, tx | % | • | |
-| nvidia_smi.gpu_fan_speed_perc | fan_speed | % | • | • |
-| nvidia_smi.gpu_utilization | gpu | % | • | • |
-| nvidia_smi.gpu_memory_utilization | memory | % | • | • |
-| nvidia_smi.gpu_decoder_utilization | decoder | % | • | |
-| nvidia_smi.gpu_encoder_utilization | encoder | % | • | |
-| nvidia_smi.gpu_frame_buffer_memory_usage | free, used, reserved | B | • | • |
-| nvidia_smi.gpu_bar1_memory_usage | free, used | B | • | |
-| nvidia_smi.gpu_temperature | temperature | Celsius | • | • |
-| nvidia_smi.gpu_voltage | voltage | V | • | |
-| nvidia_smi.gpu_clock_freq | graphics, video, sm, mem | MHz | • | • |
-| nvidia_smi.gpu_power_draw | power_draw | Watts | • | • |
-| nvidia_smi.gpu_performance_state | P0-P15 | state | • | • |
-| nvidia_smi.gpu_mig_mode_current_status | enabled, disabled | status | • | |
-| nvidia_smi.gpu_mig_devices_count | mig | devices | • | |
+| Metric | Dimensions | Unit |
+|:------|:----------|:----|
+| nvidia_smi.gpu_pcie_bandwidth_usage | rx, tx | B/s |
+| nvidia_smi.gpu_pcie_bandwidth_utilization | rx, tx | % |
+| nvidia_smi.gpu_fan_speed_perc | fan_speed | % |
+| nvidia_smi.gpu_utilization | gpu | % |
+| nvidia_smi.gpu_memory_utilization | memory | % |
+| nvidia_smi.gpu_decoder_utilization | decoder | % |
+| nvidia_smi.gpu_encoder_utilization | encoder | % |
+| nvidia_smi.gpu_frame_buffer_memory_usage | free, used, reserved | B |
+| nvidia_smi.gpu_bar1_memory_usage | free, used | B |
+| nvidia_smi.gpu_temperature | temperature | Celsius |
+| nvidia_smi.gpu_voltage | voltage | V |
+| nvidia_smi.gpu_clock_freq | graphics, video, sm, mem | MHz |
+| nvidia_smi.gpu_power_draw | power_draw | Watts |
+| nvidia_smi.gpu_performance_state | P0-P15 | state |
+| nvidia_smi.gpu_mig_mode_current_status | enabled, disabled | status |
+| nvidia_smi.gpu_mig_devices_count | mig | devices |
### Per mig
@@ -103,10 +101,10 @@ Labels:
Metrics:
-| Metric | Dimensions | Unit | XML | CSV |
-|:------|:----------|:----|:---:|:---:|
-| nvidia_smi.gpu_mig_frame_buffer_memory_usage | free, used, reserved | B | • | |
-| nvidia_smi.gpu_mig_bar1_memory_usage | free, used | B | • | |
+| Metric | Dimensions | Unit |
+|:------|:----------|:----|
+| nvidia_smi.gpu_mig_frame_buffer_memory_usage | free, used, reserved | B |
+| nvidia_smi.gpu_mig_bar1_memory_usage | free, used | B |
@@ -119,11 +117,7 @@ There are no alerts configured by default for this integration.
### Prerequisites
-#### Enable in go.d.conf.
-
-This collector is disabled by default. You need to explicitly enable it in the `go.d.conf` file.
-
-
+No action required.
### Configuration
@@ -152,26 +146,12 @@ The following options can be defined globally: update_every, autodetection_retry
| autodetection_retry | Recheck interval in seconds. Zero means no recheck will be scheduled. | 0 | no |
| binary_path | Path to nvidia_smi binary. The default is "nvidia_smi" and the executable is looked for in the directories specified in the PATH environment variable. | nvidia_smi | no |
| timeout | nvidia_smi binary execution timeout. | 2 | no |
-| use_csv_format | Used format when requesting GPU information. XML is used if set to 'no'. | no | no |
+| loop_mode | When enabled, `nvidia-smi` is executed continuously in a separate thread using the `-l` option. | yes | no |
</details>
#### Examples
-##### CSV format
-
-Use CSV format when requesting GPU information.
-
-<details open><summary>Config</summary>
-
-```yaml
-jobs:
- - name: nvidia_smi
- use_csv_format: yes
-
-```
-</details>
-
##### Custom binary path
The executable is not in the directories specified in the PATH environment variable.
@@ -192,6 +172,8 @@ jobs:
### Debug Mode
+**Important**: Debug mode is not supported for data collection jobs created via the UI using the Dyncfg feature.
+
To troubleshoot issues with the `nvidia_smi` collector, run the `go.d.plugin` with the debug option enabled. The output
should give you clues as to why the collector isn't working.
@@ -214,4 +196,37 @@ should give you clues as to why the collector isn't working.
./go.d.plugin -d -m nvidia_smi
```
+### Getting Logs
+
+If you're encountering problems with the `nvidia_smi` collector, follow these steps to retrieve logs and identify potential issues:
+
+- **Run the command** specific to your system (systemd, non-systemd, or Docker container).
+- **Examine the output** for any warnings or error messages that might indicate issues. These messages should provide clues about the root cause of the problem.
+
+#### System with systemd
+
+Use the following command to view logs generated since the last Netdata service restart:
+
+```bash
+journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep nvidia_smi
+```
+
+#### System without systemd
+
+Locate the collector log file, typically at `/var/log/netdata/collector.log`, and use `grep` to filter for collector's name:
+
+```bash
+grep nvidia_smi /var/log/netdata/collector.log
+```
+
+**Note**: This method shows logs from all restarts. Focus on the **latest entries** for troubleshooting current issues.
+
+#### Docker Container
+
+If your Netdata runs in a Docker container named "netdata" (replace if different), use this command:
+
+```bash
+docker logs netdata 2>&1 | grep nvidia_smi
+```
+
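
As an illustration of the `loop_mode` option that this diff adds to the options table, a job that turns it off might look like the sketch below. It is not part of the commit. The job layout mirrors the `jobs:` examples in the documented file, and the value `no` is chosen only for illustration (the documented default is `yes`).

```yaml
# Hypothetical job definition, not taken from the commit.
jobs:
  - name: nvidia_smi
    # Documented default is "yes": nvidia-smi runs continuously
    # in a separate thread using the -l option.
    loop_mode: no
```

With `loop_mode: no`, the collector presumably invokes the binary on each collection cycle rather than keeping it running. The diff does not spell this behavior out, so treat it as an assumption.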