<!--startmeta
custom_edit_url: "https://github.com/netdata/netdata/edit/master/src/go/collectors/go.d.plugin/modules/nvidia_smi/README.md"
meta_yaml: "https://github.com/netdata/netdata/edit/master/src/go/collectors/go.d.plugin/modules/nvidia_smi/metadata.yaml"
sidebar_label: "Nvidia GPU"
learn_status: "Published"
learn_rel_path: "Collecting Metrics/Hardware Devices and Sensors"
most_popular: False
message: "DO NOT EDIT THIS FILE DIRECTLY, IT IS GENERATED BY THE COLLECTOR'S metadata.yaml FILE"
endmeta-->

# Nvidia GPU


<img src="https://netdata.cloud/img/nvidia.svg" width="150"/>


Plugin: go.d.plugin
Module: nvidia_smi

<img src="https://img.shields.io/badge/maintained%20by-Netdata-%2300ab44" />

## Overview

This collector monitors GPU performance metrics using
the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) CLI tool.

> **Warning**: This collector is under development; [loop mode](https://github.com/netdata/netdata/issues/14522) is not implemented yet.


This collector is supported on all platforms.

This collector supports collecting metrics from multiple instances of this integration, including remote instances.


### Default Behavior

#### Auto-Detection

This integration doesn't support auto-detection.

#### Limits

The default configuration for this integration does not impose any limits on data collection.

#### Performance Impact

The default configuration for this integration is not expected to impose a significant performance impact on the system.


## Metrics

Metrics are grouped by *scope*.

The scope defines the instance that the metric belongs to.
An instance is uniquely identified by a set of labels.


### Per gpu

These metrics refer to the GPU.

Labels:

| Label | Description |
|:-----------|:----------------|
| uuid | GPU id (e.g. 00000000:00:04.0) |
| product_name | GPU product name (e.g. NVIDIA A100-SXM4-40GB) |

Metrics:

| Metric | Dimensions | Unit | XML | CSV |
|:------|:----------|:----|:---:|:---:|
| nvidia_smi.gpu_pcie_bandwidth_usage | rx, tx | B/s | • | |
| nvidia_smi.gpu_pcie_bandwidth_utilization | rx, tx | % | • | |
| nvidia_smi.gpu_fan_speed_perc | fan_speed | % | • | • |
| nvidia_smi.gpu_utilization | gpu | % | • | • |
| nvidia_smi.gpu_memory_utilization | memory | % | • | • |
| nvidia_smi.gpu_decoder_utilization | decoder | % | • | |
| nvidia_smi.gpu_encoder_utilization | encoder | % | • | |
| nvidia_smi.gpu_frame_buffer_memory_usage | free, used, reserved | B | • | • |
| nvidia_smi.gpu_bar1_memory_usage | free, used | B | • | |
| nvidia_smi.gpu_temperature | temperature | Celsius | • | • |
| nvidia_smi.gpu_voltage | voltage | V | • | |
| nvidia_smi.gpu_clock_freq | graphics, video, sm, mem | MHz | • | • |
| nvidia_smi.gpu_power_draw | power_draw | Watts | • | • |
| nvidia_smi.gpu_performance_state | P0-P15 | state | • | • |
| nvidia_smi.gpu_mig_mode_current_status | enabled, disabled | status | • | |
| nvidia_smi.gpu_mig_devices_count | mig | devices | • | |

### Per mig

These metrics refer to the Multi-Instance GPU (MIG).

Labels:

| Label | Description |
|:-----------|:----------------|
| uuid | GPU id (e.g. 00000000:00:04.0) |
| product_name | GPU product name (e.g. NVIDIA A100-SXM4-40GB) |
| gpu_instance_id | GPU instance id (e.g. 1) |

Metrics:

| Metric | Dimensions | Unit | XML | CSV |
|:------|:----------|:----|:---:|:---:|
| nvidia_smi.gpu_mig_frame_buffer_memory_usage | free, used, reserved | B | • | |
| nvidia_smi.gpu_mig_bar1_memory_usage | free, used | B | • | |


## Alerts

There are no alerts configured by default for this integration.


## Setup

### Prerequisites

#### Enable in `go.d.conf`

This collector is disabled by default. You need to explicitly enable it in the `go.d.conf` file.


### Configuration

#### File

The configuration file name for this integration is `go.d/nvidia_smi.conf`.

You can edit the configuration file using the `edit-config` script from the
Netdata [config directory](https://github.com/netdata/netdata/blob/master/docs/netdata-agent/configuration.md#the-netdata-config-directory).

```bash
cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/nvidia_smi.conf
```

#### Options

The following options can be defined globally: update_every, autodetection_retry.


<details><summary>Config options</summary>

| Name | Description | Default | Required |
|:----|:-----------|:-------|:--------:|
| update_every | Data collection frequency. | 10 | no |
| autodetection_retry | Recheck interval in seconds. Zero means no recheck will be scheduled. | 0 | no |
| binary_path | Path to the nvidia_smi binary. The default is "nvidia_smi", and the executable is searched for in the directories listed in the PATH environment variable. | nvidia_smi | no |
| timeout | nvidia_smi binary execution timeout. | 2 | no |
| use_csv_format | Format used when requesting GPU information. XML is used if set to 'no'. | yes | no |

</details>

#### Examples

##### XML format

Use the XML format when requesting GPU information.
<details><summary>Config</summary>

```yaml
jobs:
  - name: nvidia_smi
    use_csv_format: no

```
</details>

##### Custom binary path

The executable is not in the directories specified in the PATH environment variable.

<details><summary>Config</summary>

```yaml
jobs:
  - name: nvidia_smi
    binary_path: /usr/local/sbin/nvidia_smi

```
</details>


## Troubleshooting

### Debug Mode

To troubleshoot issues with the `nvidia_smi` collector, run the `go.d.plugin` with the debug option enabled. The output
should give you clues as to why the collector isn't working.

- Navigate to the `plugins.d` directory, usually at `/usr/libexec/netdata/plugins.d/`. If that's not the case on
  your system, open `netdata.conf` and look for the `plugins` setting under `[directories]`.

  ```bash
  cd /usr/libexec/netdata/plugins.d/
  ```

- Switch to the `netdata` user.

  ```bash
  sudo -u netdata -s
  ```

- Run the `go.d.plugin` to debug the collector:

  ```bash
  ./go.d.plugin -d -m nvidia_smi
  ```
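
### Execution timeout

If the debug output shows `nvidia-smi` timing out rather than failing outright, note that the collector kills the binary after the `timeout` option's default of 2 seconds, and querying the tool can take longer than that on some systems. As a sketch, a hypothetical job config raising the timeout (the value 10 is an arbitrary example, not a recommended setting):

```yaml
jobs:
  - name: nvidia_smi
    timeout: 10

```

You can also gauge a reasonable value by timing a manual query (e.g. `time nvidia-smi -q -x`) as the `netdata` user.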