Recently, we deployed a microservice system for a Machine Learning application to an on-premises server. During the early stages of deployment, we quickly realized that monitoring was crucial to track system stability and performance metrics, especially for real-time applications. After researching various tools and solutions, we decided to deploy a Prometheus + Grafana stack to collect metrics and visualize the system's health.
Here are my detailed notes from the deployment process. These notes are quite specific and serve as a technical reference for anyone looking to implement a similar monitoring setup.
We chose this combination for the following reasons:
The entire stack is orchestrated using Docker Compose. Below is the docker-compose.yml
file we used:
1version: '3.8'2
3services:4 cadvisor:5 image: google/cadvisor:latest6 ports:7 - "8088:8080"8 volumes:9 - /:/rootfs:ro10 - /var/run:/var/run:rw11 - /sys:/sys:ro12 - /var/lib/docker/:/var/lib/docker:ro13 networks:14 - monitoring15
16 prometheus:17 image: prom/prometheus:latest18 volumes:19 - ./prometheus.yml:/etc/prometheus/prometheus.yml20 command:21 - '--config.file=/etc/prometheus/prometheus.yml'22 ports:23 - "9090:9090"24 networks:25 - monitoring26
27 grafana:28 image: grafana/grafana:latest29 ports:30 - "3000:3000"31 volumes:32 - grafana-storage:/var/lib/grafana33 networks:34 - monitoring35
36 nvidia-dcgm-exporter:37 image: nvidia/dcgm-exporter:latest38 deploy:39 resources:40 reservations:41 devices:42 - capabilities: [gpu]43 environment:44 - NVIDIA_VISIBLE_DEVICES=all45 ports:46 - "9400:9400"47 networks:48 - monitoring49
50networks:51 monitoring:52 driver: bridge53
54volumes:55 grafana-storage:
Prometheus is configured to scrape metrics from cAdvisor and NVIDIA DCGM exporter. Here is the prometheus.yml
file:
1global:2 scrape_interval: 5s3
4scrape_configs:5 - job_name: 'cadvisor'6 static_configs:7 - targets: ['cadvisor:8080']8
9 - job_name: 'nvidia-gpu'10 static_configs:11 - targets: ['nvidia-dcgm-exporter:9400']
Once Prometheus is set up, Grafana can query its data to visualize metrics. Below are some of the key panels we created:
Docker Container Memory Usage (MB):
sum(container_memory_usage_bytes{image!=""}) by (name) / 1024 / 1024
GPU Utilization:
DCGM_FI_DEV_GPU_UTIL
{{modelName}} - GPU {{gpu}}
GPU Memory Usage (MB):
DCGM_FI_DEV_FB_USED / 1024
Docker Container CPU Usage (Cores):
sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (name)
Prepare Files: Save the docker-compose.yml
and prometheus.yml
files in the same directory.
Start the Services: Run the following command to start the monitoring stack:
1docker-compose up -d
Access Web Interfaces:
http://<server-ip>:9090
to view raw metrics.http://<server-ip>:3000
to create and view dashboards.Import Dashboards: Use the JSON model of your dashboard in Grafana to visualize metrics.