Technical Notes: Monitoring Advanced GPU Metrics Using DCGM Exporter

10.10.2024 — AI, MLOps — 2 min read

Overview

When working with GPU-intensive applications like machine learning, it's critical to monitor GPU metrics to ensure performance and resource optimization. NVIDIA’s dcgm-exporter offers a robust solution for GPU monitoring. However, its default configuration provides only basic metrics, such as GPU utilization and memory usage, which are insufficient for monitoring advanced features like Tensor Core activity or CUDA Core activity.

In this guide, I'll walk you through configuring dcgm-exporter to enable advanced profiling metrics, including Tensor Core activity, using Docker Compose and Prometheus.

Problem with Default Metrics

By default, dcgm-exporter reads its metric list from a file (e.g., default-counters.csv) that includes basic metrics like the following (shown here without their DCGM_FI_DEV_ prefix):

Clocks: SM_CLOCK, MEM_CLOCK
Temperature: GPU_TEMP, MEMORY_TEMP
Power Usage: POWER_USAGE, TOTAL_ENERGY_CONSUMPTION
Utilization: GPU_UTIL, MEM_COPY_UTIL, ENC_UTIL, DEC_UTIL

These metrics are sufficient for basic GPU monitoring but lack deeper insights into Tensor Core and CUDA Core activity. Profiling metrics like Tensor Core cycles or SM (CUDA Core) active percentage are not included.

Solution: Enable Profiling Metrics in DCGM Exporter

To unlock advanced metrics, such as PIPE_TENSOR_ACTIVE and SM_ACTIVE, you need to:

Enable profiling collectors in dcgm-exporter.
Provide a custom metrics configuration file that includes these advanced metrics.

Steps to Monitor Tensor Core and CUDA Core Activity

Step 1: Download and Modify the Metrics File

Download the example dcp-metrics-included.csv file from the official NVIDIA GitHub repository:

wget https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv

Modify the file to include only the metrics you need or to fix formatting issues:

Add trailing commas to comment lines (e.g., # comment,,).
Remove unsupported labels, such as DCGM_FI_DRIVER_VERSION.

For Tensor Core and CUDA Core activity, ensure these metrics are included:

# Tensor Core and CUDA Core Activity,,
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_SM_ACTIVE, gauge, Ratio of time the Streaming Multiprocessor is active.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the memory interface is active sending/receiving data.

Save the file to a local path, e.g., /home/user/dcp-metrics-included.csv.
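Before mounting the file, it can help to check it mechanically. The following is a minimal Python sketch, assuming each metric line has the three comma-separated fields shown above (field name, Prometheus type, help text). The function name `filter_metrics`, the `WANTED` set, and the deliberately broken sample line are illustrative and not part of dcgm-exporter.

```python
# Metrics this guide cares about; adjust the set to your needs.
WANTED = {
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
    "DCGM_FI_PROF_SM_ACTIVE",
    "DCGM_FI_PROF_DRAM_ACTIVE",
}


def filter_metrics(csv_text, wanted=WANTED):
    """Return (kept_lines, problems) for a metrics file in the format above."""
    kept, problems = [], []
    for lineno, line in enumerate(csv_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments carry no metric
        # Split on the first two commas only: the help text may contain commas.
        fields = [field.strip() for field in stripped.split(",", 2)]
        if len(fields) != 3:
            problems.append(f"line {lineno}: expected 3 fields, got {len(fields)}")
            continue
        if fields[0] in wanted:
            kept.append(", ".join(fields))
    return kept, problems


# Sample input; the last line is intentionally malformed (hypothetical name).
sample = """\
# Tensor Core and CUDA Core Activity,,
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_SM_ACTIVE, gauge, Ratio of time the Streaming Multiprocessor is active.
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_BROKEN_LINE
"""

kept, problems = filter_metrics(sample)
print("\n".join(kept))
print(problems)
```

To run it against the real file, read /home/user/dcp-metrics-included.csv into a string and pass that to `filter_metrics` instead of `sample`.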

Step 2: Update Docker Compose Configuration

Here’s the final docker-compose.yml for dcgm-exporter with profiling enabled:

version: '3.8'

services:
  nvidia-dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DCGM_EXPORTER_COLLECTORS=/workspace/dcp-metrics-included.csv
    ports:
      - "9400:9400"  # Prometheus scraping port
    cap_add:
      - SYS_ADMIN  # Required for profiling metrics
    volumes:
      - /home/user/dcp-metrics-included.csv:/workspace/dcp-metrics-included.csv
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

Explanation:

DCGM_EXPORTER_COLLECTORS: Points to the custom metrics file.
SYS_ADMIN capability: Required for profiling features.
Volume mounting: Ensures the container can access the custom metrics file.

Step 3: Verify the Metrics

Start the container:

docker-compose up -d nvidia-dcgm-exporter

Access the metrics endpoint:
```
curl http://<your-server-ip>:9400/metrics
```
Look for profiling metrics like:
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: Tensor Core activity.
- DCGM_FI_PROF_SM_ACTIVE: CUDA Core activity.
- DCGM_FI_PROF_DRAM_ACTIVE: Memory interface activity.

Example Output

Here’s a sample output for advanced GPU metrics:

# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.012345

# HELP DCGM_FI_PROF_SM_ACTIVE Ratio of time the Streaming Multiprocessor is active.
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.234567

# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the memory interface is active sending/receiving data.
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.045678
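The check can also be scripted rather than done by eye. Below is a rough Python sketch that parses text in the format shown above and reports which profiling metrics are missing. It is a simplified parser for illustration, not a full implementation of the Prometheus text format, and the names `parse_metrics`, `missing_metrics`, and `REQUIRED` are my own. The sample deliberately omits DCGM_FI_PROF_DRAM_ACTIVE to show the "missing" case.

```python
import re

SAMPLE = '''\
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.012345
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.234567
'''

REQUIRED = (
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
    "DCGM_FI_PROF_SM_ACTIVE",
    "DCGM_FI_PROF_DRAM_ACTIVE",
)

# name{label="value",...} value [timestamp]
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Yield (name, labels, value) per sample line, skipping # HELP / # TYPE."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def missing_metrics(text, required=REQUIRED):
    """Return the required metric names that have no sample in the text."""
    seen = {name for name, _, _ in parse_metrics(text)}
    return [name for name in required if name not in seen]


for name, labels, value in parse_metrics(SAMPLE):
    # Values are ratios in [0, 1]; multiply by 100 for a percentage.
    print(f"{name} gpu={labels['gpu']} {value * 100:.1f}%")
print("missing:", missing_metrics(SAMPLE))
```

In practice the text would come from the exporter endpoint (for example via `urllib.request.urlopen`) instead of the embedded `SAMPLE` string. A non-empty "missing" list usually means the custom metrics file was not picked up or profiling is not permitted in the container.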

Step 4: Visualize in Grafana

To monitor the metrics:

Add Prometheus as a Data Source in Grafana.

Create a new dashboard and add panels with PromQL queries:

Tensor Core Activity:

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

CUDA Core Activity:

DCGM_FI_PROF_SM_ACTIVE

Memory Interface Activity:

DCGM_FI_PROF_DRAM_ACTIVE
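The same queries can be tested outside Grafana through the Prometheus HTTP API (`/api/v1/query`). Here is a minimal Python sketch, assuming Prometheus is reachable at http://localhost:9090 and already scrapes the exporter. The address and the helper names `instant_query_url` and `run_instant_query` are assumptions for illustration, and the `avg by (gpu)` aggregation is just one example of a PromQL expression you might use in a panel.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: adjust to wherever your Prometheus server listens.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def run_instant_query(expr, base_url=PROMETHEUS_URL, timeout=5):
    """Run the query and return the result list from the JSON response."""
    with urlopen(instant_query_url(expr, base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Only builds and prints the URL; no network access happens here.
print(instant_query_url("avg by (gpu) (DCGM_FI_PROF_SM_ACTIVE)"))
```

Calling `run_instant_query("DCGM_FI_PROF_PIPE_TENSOR_ACTIVE")` against a live server should return one entry per GPU. An empty list suggests Prometheus is not scraping the exporter yet.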

Lessons Learned

CSV Formatting: Ensure comments end with commas, and unsupported fields are removed.
Profiling Permissions: SYS_ADMIN capability is mandatory for advanced metrics.
Customization: Use a custom metrics file for fine-grained control over exported data.

Conclusion

With these steps, you can monitor advanced GPU metrics, including Tensor Core and CUDA Core activity. This setup provides a comprehensive view of GPU performance, essential for optimizing machine learning workloads and other GPU-intensive applications.