How can I trace the data flow in GPUDirect calls? And how can I verify the data flow to and from an NVMe drive, and measure its bandwidth, throughput, and latency?