dcgm
FakeAI: Rapid Development and Testing for AI Infrastructure
GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
GPU telemetry with workload attribution. One OTLP agent per node ties hardware metrics (NVIDIA, AMD, Intel Gaudi) to the K8s pod or Slurm job burning the GPU — so you know who's paying for that idle H100.