Add Prometheus exporter to report process-level GPU utilization and memory usage

Currently, [alnair-profiler](https://github.com/CentaurusInfra/alnair/tree/main/alnair-profiler) uses [NVIDIA DCGM-exporter](https://github.com/NVIDIA/dcgm-exporter) to collect and view GPU metrics.
Problem: DCGM's resolution is at the GPU level (per card).
Because Alnair can run multiple jobs on one GPU, process-level metrics are needed to monitor each job's resource utilization.
Plan: use the NVML library's per-process GPU utilization and memory usage functions ([Python example](https://github.com/Fizzbb/GPUNotes/tree/main/nvml-python)) and add a [custom collector and exporter](https://prometheus.io/docs/instrumenting/writing_exporters/) to expose them.
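
As a rough illustration of the plan, here is a minimal sketch of a custom collector, assuming the `pynvml` bindings and the `prometheus_client` package are installed. The metric names, the `gpu` and `pid` labels, port 9101, and the helpers `ProcSample`, `merge_samples`, `read_nvml_samples`, and `ProcessGpuCollector` are hypothetical choices for this example, not part of alnair-profiler.

```python
import time
from typing import Callable, Dict, List, NamedTuple, Optional


class ProcSample(NamedTuple):
    gpu: int                # GPU index on the node
    pid: int                # host process ID
    sm_util: float          # SM utilization in percent
    mem_used_bytes: float   # GPU memory used by the process


def merge_samples(gpu: int,
                  mem_by_pid: Dict[int, Optional[int]],
                  util_by_pid: Dict[int, int]) -> List[ProcSample]:
    """Join per-process memory and utilization readings for one GPU."""
    pids = sorted(set(mem_by_pid) | set(util_by_pid))
    return [
        ProcSample(gpu, pid,
                   float(util_by_pid.get(pid, 0)),
                   # NVML may report None when memory usage is unavailable
                   float(mem_by_pid.get(pid) or 0))
        for pid in pids
    ]


def read_nvml_samples() -> List[ProcSample]:
    """Query NVML for every GPU; needs an NVIDIA driver and pynvml."""
    import pynvml  # imported lazily so the module loads without a GPU

    pynvml.nvmlInit()
    try:
        samples: List[ProcSample] = []
        for gpu in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(gpu)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            mem = {p.pid: p.usedGpuMemory for p in procs}
            try:
                # Timestamp 0 requests all buffered samples; if a pid
                # appears more than once, the last entry wins here.
                utils = pynvml.nvmlDeviceGetProcessUtilization(handle, 0)
                util = {u.pid: u.smUtil for u in utils}
            except pynvml.NVMLError:
                util = {}  # e.g. no utilization samples recorded yet
            samples.extend(merge_samples(gpu, mem, util))
        return samples
    finally:
        pynvml.nvmlShutdown()


class ProcessGpuCollector:
    """Custom collector: NVML is queried on every Prometheus scrape."""

    def __init__(self, reader: Callable[[], List[ProcSample]] = read_nvml_samples):
        self._reader = reader

    def collect(self):
        from prometheus_client.core import GaugeMetricFamily

        util = GaugeMetricFamily(
            "gpu_process_sm_utilization_percent",
            "Per-process GPU SM utilization", labels=["gpu", "pid"])
        mem = GaugeMetricFamily(
            "gpu_process_memory_used_bytes",
            "Per-process GPU memory used", labels=["gpu", "pid"])
        for s in self._reader():
            labels = [str(s.gpu), str(s.pid)]
            util.add_metric(labels, s.sm_util)
            mem.add_metric(labels, s.mem_used_bytes)
        yield util
        yield mem


if __name__ == "__main__":
    from prometheus_client import start_http_server
    from prometheus_client.core import REGISTRY

    REGISTRY.register(ProcessGpuCollector())
    start_http_server(9101)  # hypothetical port
    while True:
        time.sleep(60)
```

Reading NVML inside `collect()` means each scrape sees fresh values and no background polling thread is needed. Mapping host PIDs back to pods or jobs is left out of this sketch.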

Add prometheus export to report process-level GPU utilization and memory used size #131
