Add Prometheus exporter to report process-level GPU utilization and memory usage #131

@Fizzbb

Currently, alnair-profiler uses NVIDIA DCGM-exporter to collect and view GPU metrics.
Problem: DCGM's resolution is at the GPU level (per card).
Because Alnair can run multiple jobs on one GPU, process-level utilization is needed to monitor each job's resource usage.
Plan: use the NVML library's per-process GPU utilization and memory usage functions (Python example), and add a custom collector and exporter to expose them.
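
A rough sketch of what the plan could look like, not the final design. It assumes the `pynvml` bindings are installed on the GPU node and uses `nvmlDeviceGetComputeRunningProcesses` and `nvmlDeviceGetProcessUtilization` for the per-process numbers. To keep the sketch dependency-free it renders the Prometheus text exposition format by hand with the standard library instead of using a Prometheus client library. The metric names, the `ProcessSample` type, the `serve` helper, and port 9101 are all made up for illustration.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Iterable, List


@dataclass
class ProcessSample:
    """One process on one GPU (hypothetical type for this sketch)."""
    gpu_index: int
    pid: int
    sm_util_percent: int    # SM utilization reported by NVML, 0-100
    used_memory_bytes: int  # GPU memory held by the process


# (metric name, help text, ProcessSample attribute); names are assumptions.
METRICS = [
    ("alnair_process_gpu_sm_util_percent",
     "Per-process GPU SM utilization in percent.", "sm_util_percent"),
    ("alnair_process_gpu_memory_used_bytes",
     "Per-process GPU memory usage in bytes.", "used_memory_bytes"),
]


def sample_processes() -> List[ProcessSample]:
    """Query NVML for per-process utilization and memory on every GPU."""
    import pynvml  # imported lazily so render_metrics works without a GPU

    pynvml.nvmlInit()
    try:
        samples = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            # usedGpuMemory can be None when the driver cannot report it.
            memory = {
                p.pid: p.usedGpuMemory or 0
                for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            }
            try:
                # A timestamp of 0 asks for every sample NVML has buffered.
                utilization = pynvml.nvmlDeviceGetProcessUtilization(handle, 0)
            except pynvml.NVMLError:
                utilization = []  # no samples yet, or not supported
            sm_util = {u.pid: u.smUtil for u in utilization}
            for pid in sorted(set(memory) | set(sm_util)):
                samples.append(ProcessSample(
                    index, pid, sm_util.get(pid, 0), memory.get(pid, 0)))
        return samples
    finally:
        pynvml.nvmlShutdown()


def render_metrics(samples: Iterable[ProcessSample]) -> str:
    """Render samples as gauges in the Prometheus text exposition format."""
    samples = list(samples)
    lines = []
    for name, help_text, attr in METRICS:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for s in samples:
            labels = f'gpu="{s.gpu_index}",pid="{s.pid}"'
            lines.append(f"{name}{{{labels}}} {getattr(s, attr)}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Sample on every scrape so Prometheus always sees fresh values.
        body = render_metrics(sample_processes()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type",
                         "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9101) -> None:
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running `serve()` on a GPU node and pointing a Prometheus scrape job at that port would give one time series per `(gpu, pid)` pair. Mapping a PID back to an Alnair job or pod is not covered here and would need extra labels.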
