When working with GPU-intensive applications such as machine learning, it's critical to monitor GPU metrics to ensure performance and resource optimization. NVIDIA's dcgm-exporter offers a robust solution for GPU monitoring. However, its default configuration provides only basic metrics, such as GPU utilization and memory usage, which are insufficient for monitoring advanced features like Tensor Core or CUDA Core activity.

In this guide, I'll walk you through configuring dcgm-exporter to enable advanced profiling metrics, including Tensor Core activity, using Docker Compose and Prometheus.
By default, dcgm-exporter uses a metrics file (e.g., `default-counters.csv`) that includes basic metrics like:

- `SM_CLOCK`, `MEM_CLOCK`
- `GPU_TEMP`, `MEMORY_TEMP`
- `POWER_USAGE`, `TOTAL_ENERGY_CONSUMPTION`
- `GPU_UTIL`, `MEM_COPY_UTIL`, `ENC_UTIL`, `DEC_UTIL`
These metrics are sufficient for basic GPU monitoring but lack deeper insights into Tensor Core and CUDA Core activity. Profiling metrics like Tensor Core cycles or SM (CUDA Core) active percentage are not included.
To unlock advanced metrics, such as `PIPE_TENSOR_ACTIVE` and `SM_ACTIVE`, you need to provide a custom metrics file to dcgm-exporter. Download the example `dcp-metrics-included.csv` file from the official NVIDIA GitHub repository:
```bash
wget https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv
```
Modify the file to include only the metrics you need, and fix any formatting issues (for example, malformed comment lines such as `# comment,,` or stray entries like `DCGM_FI_DRIVER_VERSION`). For Tensor Core and CUDA Core activity, ensure these metrics are included:
```csv
# Tensor Core and CUDA Core Activity,,
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_SM_ACTIVE, gauge, Ratio of time the Streaming Multiprocessor is active.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the memory interface is active sending/receiving data.
```
Save the file to a local path, e.g., `/home/user/dcp-metrics-included.csv`.
Here's the final `docker-compose.yml` for dcgm-exporter with profiling enabled:
```yaml
version: '3.8'

services:
  nvidia-dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DCGM_EXPORTER_COLLECTORS=/workspace/dcp-metrics-included.csv
    ports:
      - "9400:9400" # Prometheus scraping port
    cap_add:
      - SYS_ADMIN # Required for profiling metrics
    volumes:
      - /home/user/dcp-metrics-included.csv:/workspace/dcp-metrics-included.csv
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge
```
- `DCGM_EXPORTER_COLLECTORS`: points to the custom metrics file.
- `SYS_ADMIN` capability: required for profiling features.

Start the container:
```bash
docker-compose up -d nvidia-dcgm-exporter
```
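With the exporter running, Prometheus needs a scrape job pointing at it. A minimal sketch for `prometheus.yml` might look like this (the job name, scrape interval, and target are placeholders; adjust them to your environment):

```yaml
scrape_configs:
  - job_name: 'dcgm-exporter'
    scrape_interval: 15s  # assumption: tune to your needs
    static_configs:
      - targets: ['<your-server-ip>:9400']  # the port published in docker-compose.yml
```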
Access the metrics endpoint:
```bash
curl http://<your-server-ip>:9400/metrics
```
Look for profiling metrics like:

- `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`: Tensor Core activity.
- `DCGM_FI_PROF_SM_ACTIVE`: CUDA Core activity.
- `DCGM_FI_PROF_DRAM_ACTIVE`: Memory interface activity.

Here's a sample output for advanced GPU metrics:
```text
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.012345

# HELP DCGM_FI_PROF_SM_ACTIVE Ratio of time the Streaming Multiprocessor is active.
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.234567

# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the memory interface is active.
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.045678
```
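If you want to sanity-check scraped values programmatically, the exposition format above is straightforward to parse. Here's a minimal sketch in Python; the `parse_dcgm_gauges` helper and the embedded sample payload are illustrative, not part of dcgm-exporter:

```python
# Minimal sketch: extract DCGM profiling gauges from a scraped
# Prometheus text payload (sample lines from the output above).
def parse_dcgm_gauges(payload: str) -> dict:
    """Map metric name -> float value, skipping HELP/TYPE comment lines."""
    values = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        name_part, _, value = line.rpartition(" ")
        metric = name_part.split("{", 1)[0]  # drop the {label="..."} block
        values[metric] = float(value)
    return values

sample = """\
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.012345
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.234567
"""

gauges = parse_dcgm_gauges(sample)
print(gauges["DCGM_FI_PROF_PIPE_TENSOR_ACTIVE"])  # 0.012345
```

In a real setup you would feed it the body of the `/metrics` response instead of the hard-coded sample; note this simple parser assumes label values contain no spaces, which holds for dcgm-exporter's output.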
To monitor the metrics in Prometheus, query them by name:

```promql
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_SM_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
```
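Beyond raw queries, these gauges work well in alerting rules. As one hedged example, the following Prometheus rule fires when SMs are busy but the tensor pipe is nearly idle, hinting that a workload isn't using Tensor Cores; the thresholds and names are illustrative assumptions, not recommendations:

```yaml
groups:
  - name: gpu-profiling  # hypothetical rule group
    rules:
      - alert: TensorCoresUnderutilized
        # Thresholds below are illustrative; tune them for your workloads.
        expr: DCGM_FI_PROF_SM_ACTIVE > 0.5 and DCGM_FI_PROF_PIPE_TENSOR_ACTIVE < 0.05
        for: 10m
        annotations:
          summary: "GPU {{ $labels.gpu }} is busy but Tensor Cores are mostly idle"
```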
Note that the `SYS_ADMIN` capability is mandatory for advanced metrics. With these steps, you can monitor advanced GPU metrics, including Tensor Core and CUDA Core activity. This setup provides a comprehensive view of GPU performance, essential for optimizing machine learning workloads and other GPU-intensive applications.