When working with GPU-intensive applications like machine learning, it's critical to monitor GPU metrics to ensure performance and resource optimization. NVIDIA’s dcgm-exporter offers a robust solution for GPU monitoring. However, its default configuration provides only basic metrics, such as GPU utilization and memory usage, which are insufficient for monitoring advanced features like Tensor Core activity or CUDA Core activity.
In this guide, I'll walk you through configuring dcgm-exporter to enable advanced profiling metrics, including Tensor Core activity, using Docker Compose and Prometheus.
By default, dcgm-exporter uses a file (e.g., default-counters.csv) that includes basic metrics like:
- SM_CLOCK, MEM_CLOCK
- GPU_TEMP, MEMORY_TEMP
- POWER_USAGE, TOTAL_ENERGY_CONSUMPTION
- GPU_UTIL, MEM_COPY_UTIL, ENC_UTIL, DEC_UTIL

These metrics are sufficient for basic GPU monitoring but lack deeper insights into Tensor Core and CUDA Core activity. Profiling metrics like Tensor Core cycles or SM (CUDA Core) active percentage are not included.
To unlock advanced metrics, such as PIPE_TENSOR_ACTIVE and SM_ACTIVE, you need to:
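1. Provide a custom metrics file that lists the profiling fields.
2. Run the container with the privileges that profiling requires.
3. Pass the custom file to dcgm-exporter.

Download the example dcp-metrics-included.csv file from the official NVIDIA GitHub repository: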
    wget https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv

Modify the file to include only the metrics you need or to fix formatting issues:
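- Keep comment lines in CSV form, with trailing commas for the empty columns (e.g., # comment,,).
- Review entries that may not apply to your setup, such as DCGM_FI_DRIVER_VERSION.

For Tensor Core and CUDA Core activity, ensure these metrics are included: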
    # Tensor Core and CUDA Core Activity,,
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_SM_ACTIVE, gauge, Ratio of time the Streaming Multiprocessor is active.
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the memory interface is active sending/receiving data.

Save the file to a local path, e.g., /home/user/dcp-metrics-included.csv.
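Before mounting the edited file, a quick sanity check can catch malformed rows or a profiling field that was deleted by accident. The following is a minimal sketch, not part of dcgm-exporter: the function name `check_metrics_csv` is hypothetical, and the three-column layout (field, type, help text) is an assumption inferred from the example rows in this guide.

```python
# Profiling fields this guide relies on.
REQUIRED_FIELDS = {
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
    "DCGM_FI_PROF_SM_ACTIVE",
    "DCGM_FI_PROF_DRAM_ACTIVE",
}

# Sample content matching the rows shown in this guide.
SAMPLE = """\
# Tensor Core and CUDA Core Activity,,
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_SM_ACTIVE, gauge, Ratio of time the Streaming Multiprocessor is active.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the memory interface is active sending/receiving data.
"""


def check_metrics_csv(text):
    """Return (fields, problems) for the contents of a metrics CSV.

    Hypothetical helper: assumes each metric row is 'field, type, help text'.
    """
    fields = []
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines carry no metric
        # Split at most twice so commas inside the help text are preserved.
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3 or not all(parts):
            problems.append(f"line {lineno}: expected 'field, type, help text'")
            continue
        fields.append(parts[0])
    missing = sorted(REQUIRED_FIELDS - set(fields))
    problems.extend(f"missing field: {name}" for name in missing)
    return fields, problems


if __name__ == "__main__":
    found, issues = check_metrics_csv(SAMPLE)
    print("fields:", found)
    print("problems:", issues or "none")
```

To check a real file, read its contents (for example, from /home/user/dcp-metrics-included.csv) and pass the text to `check_metrics_csv` in place of `SAMPLE`.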
Here’s the final docker-compose.yml for dcgm-exporter with profiling enabled:
    version: '3.8'

    services:
      nvidia-dcgm-exporter:
        image: nvidia/dcgm-exporter:latest
        deploy:
          resources:
            reservations:
              devices:
                - capabilities: [gpu]
        environment:
          - NVIDIA_VISIBLE_DEVICES=all
          - DCGM_EXPORTER_COLLECTORS=/workspace/dcp-metrics-included.csv
        ports:
          - "9400:9400" # Prometheus scraping port
        cap_add:
          - SYS_ADMIN # Required for profiling metrics
        volumes:
          - /home/user/dcp-metrics-included.csv:/workspace/dcp-metrics-included.csv
        networks:
          - monitoring

    networks:
      monitoring:
        driver: bridge

- DCGM_EXPORTER_COLLECTORS: Points to the custom metrics file.
- SYS_ADMIN capability: Required for profiling features.

Start the container:
    docker-compose up -d nvidia-dcgm-exporter

Access the metrics endpoint:
    curl http://<your-server-ip>:9400/metrics

Look for profiling metrics like:
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: Tensor Core activity.
- DCGM_FI_PROF_SM_ACTIVE: CUDA Core activity.
- DCGM_FI_PROF_DRAM_ACTIVE: Memory interface activity.

Here’s a sample output for advanced GPU metrics:
    # HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.
    # TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.012345

    # HELP DCGM_FI_PROF_SM_ACTIVE Ratio of time the Streaming Multiprocessor is active.
    # TYPE DCGM_FI_PROF_SM_ACTIVE gauge
    DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.234567

    # HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the memory interface is active.
    # TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
    DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.045678

To monitor the metrics:
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
    DCGM_FI_PROF_SM_ACTIVE
    DCGM_FI_PROF_DRAM_ACTIVE

The SYS_ADMIN capability is mandatory for advanced metrics.

With these steps, you can monitor advanced GPU metrics, including Tensor Core and CUDA Core activity. This setup provides a comprehensive view of GPU performance, essential for optimizing machine learning workloads and other GPU-intensive applications.
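As a scripted alternative to reading the curl output by eye, the verification step can be automated. This is a rough sketch under stated assumptions: `profiling_samples` and `fetch_metrics` are hypothetical helper names, the regular expressions only handle simple label values (no escaped quotes or braces), and the example text mirrors the sample output in this guide rather than real exporter output.

```python
import re
import urllib.request

# Matches one sample line such as: NAME{label="x",...} 0.123
SAMPLE_LINE = re.compile(r'^(DCGM_FI_PROF_\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Illustrative text copied from the sample output in this guide.
EXAMPLE = """\
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.012345
# HELP DCGM_FI_PROF_SM_ACTIVE Ratio of time the Streaming Multiprocessor is active.
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-abc123",device="nvidia0"} 0.234567
"""


def fetch_metrics(url, timeout=5):
    """Download the exposition text from a metrics endpoint (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def profiling_samples(text):
    """Extract (metric, gpu, value) tuples for DCGM_FI_PROF_* sample lines."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue  # skips HELP/TYPE comments and non-profiling metrics
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels))
        samples.append((name, labels.get("gpu"), float(value)))
    return samples


if __name__ == "__main__":
    # Parse the embedded example; swap in fetch_metrics(...) for a live check.
    for metric, gpu, value in profiling_samples(EXAMPLE):
        print(f"gpu={gpu} {metric}={value}")
```

For a live check, pass the result of `fetch_metrics("http://<your-server-ip>:9400/metrics")` to `profiling_samples`; an empty list suggests the profiling fields are not being exported.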