Currently, alnair-profiler uses NVIDIA DCGM-exporter to collect and view GPU metrics.
Problem: DCGM's resolution is at the GPU level (per card).
Because Alnair can run multiple jobs on one GPU, process-level utilization is needed to monitor each job's resource usage.
Plan: use the NVML library's per-process GPU utilization and memory usage functions (see the Python example), and add a custom collector and exporter to expose these per-process metrics.
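
The plan could be sketched roughly as below. This is an illustration under stated assumptions, not the alnair-profiler implementation: it assumes the pynvml bindings (nvidia-ml-py package) are installed on the GPU node, the metric names and the ProcessSample type are hypothetical, and only compute processes are listed. The collector reads per-process values from NVML; the render function formats them in the Prometheus text exposition format so an exporter can serve them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProcessSample:
    """One per-process reading (hypothetical type, for illustration only)."""
    gpu_index: int
    pid: int
    sm_util_percent: Optional[int]    # None if NVML returned no sample
    used_memory_bytes: Optional[int]  # None if NVML could not report it


def collect_process_samples() -> List[ProcessSample]:
    """Query NVML for per-process SM utilization and memory on every GPU."""
    import pynvml  # imported lazily so the formatting code works without a GPU

    pynvml.nvmlInit()
    try:
        samples = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            try:
                # Timestamp 0 asks for all samples currently buffered by the driver.
                util = {u.pid: u.smUtil
                        for u in pynvml.nvmlDeviceGetProcessUtilization(handle, 0)}
            except pynvml.NVMLError:
                # Raised when no samples exist or the query is unsupported.
                util = {}
            for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                samples.append(ProcessSample(
                    gpu_index=index,
                    pid=proc.pid,
                    sm_util_percent=util.get(proc.pid),
                    used_memory_bytes=proc.usedGpuMemory,
                ))
        return samples
    finally:
        pynvml.nvmlShutdown()


def render_metrics(samples: List[ProcessSample]) -> str:
    """Format samples as Prometheus text exposition (metric names are assumed)."""
    families = [
        ("alnair_process_gpu_sm_util_percent",
         "Per-process SM utilization in percent.", "sm_util_percent"),
        ("alnair_process_gpu_memory_used_bytes",
         "Per-process GPU memory in use in bytes.", "used_memory_bytes"),
    ]
    lines = []
    for name, help_text, attr in families:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for s in samples:
            value = getattr(s, attr)
            if value is None:
                continue  # skip readings NVML did not provide
            lines.append(f'{name}{{gpu="{s.gpu_index}",pid="{s.pid}"}} {value}')
    return "\n".join(lines) + "\n"
```

On a GPU node, render_metrics(collect_process_samples()) would produce the text an exporter returns from its /metrics endpoint. Mapping each pid back to the Alnair job or pod that owns it is left out of this sketch.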