Releases: leptonai/gpud

gpud-v0.5.1-rc-0

10 Jul 04:05
35c8acb

What's Changed

  • feat(nvidia/infiniband): remove "ibstatus" use for health checker by @gyuho in #946
  • feat(components/*): log component errors as warn level by @gyuho in #950
  • fix(nvidia/nccl, peermem): add missing cancel calls by @gyuho in #949
  • feat(nvidia/infiniband): track /sys/class/infiniband/[device], optional data collection by @gyuho in #943
  • fix(reboot): select only "reboot" events in "os" bucket by @gyuho in #951
  • feat(nfs): check mount fs type, degraded if not nfs by @gyuho in #952
  • fix(nvidia/xid, nvidia/sxid, docker): fix err.Error() panic by @gyuho in #953
  • feat(nvidia/gpu-counts): add gpu counts component by @gyuho in #944
  • feat(nvidia/infiniband): support custom infiniband class root dir by @gyuho in #954
  • chore(package-controller): remove redundant log lines by @cardyok in #955
  • fix(cmd/status): do not parse login timestamp if empty by @gyuho in #956
  • fix(nvml/processes): handle "no such file or directory" error when listing gpu processes by @gyuho in #960
  • feat(machine-info): add private ip fallback for aws using imds by @gyuho in #962
  • tests(components): use assert for unit tests by @gyuho in #961
  • feat(server): retain only 24 hours of metrics/events data (retain xid/sxid/reboot as exceptions) by @gyuho in #927
  • fix(nfs-checker): add missing checker.Clean, simplify the logic to only check the self-machine by @gyuho in #963
  • feat(nvidia/infiniband): more tests with class dir, separate evaluate_threshold file by @gyuho in #965
  • feat(nvidia): new device interface to get PCI Bus ID by @gyuho in #966
  • feat(components/xid): append GPU uuid in reason by @eahydra in #970
  • feat(nvidia/xid): xid 110 should suggest reboot first by @gyuho in #974
  • tests(nfs): make nfs error retry test more robust by @gyuho in #958
  • fix(pkg/providers): handle no public IP case in AWS by @gyuho in #975
  • feat(session): extend initializing grace period to 5 min and make persistence mode more frequent by @cardyok in #978
  • fix(session): make last reboot time global by @cardyok in #979
  • fix(nvidia/xid): remove stale test case by @gyuho in #977
  • feat(memory): make oom events human readable using cadvisor code by @gyuho in #980
  • feat(containerd): extend dangling pod count check in containerd by @cardyok in #981
  • fix(containerd): recreate context for kubelet list calls by @gyuho in #982
  • feat(infiniband): track historical ib port states to detect drop/flap by @gyuho in #971

Full Changelog: v0.5.0...v0.5.1-rc-0

gpud-v0.5.0

26 Jun 16:40

What's Changed

  • feat(metrics, server): implement scraper, store, syncer for Prometheus metrics using sqlite as an experimental feature by @gyuho in #577
  • nit(fd): remove unused prom gatherer from component by @gyuho in #583
  • feat(edge/derp): add Helsinki region for network checks by @gyuho in #580
  • nit(cpu): remove unused prom gatherer from component by @gyuho in #582
  • feat(host, pci): move reboot utils, simplify by @gyuho in #586
  • feat(metrics): label Prometheus metrics with component name by @gyuho in #579
  • feat(fuse): simplify metrics + poller by @gyuho in #585
  • feat(os): simplify poller, remove os component dependency from sxid/xid by @gyuho in #590
  • feat(network-latency): simplify metrics + poller by @gyuho in #592
  • feat(systemd): remove (in favor of individual component checks) by @gyuho in #587
  • feat(disk): simplify metrics + poller (remove old metrics method for component) by @gyuho in #584
  • feat(pci): refactor to clean up poller by @gyuho in #570
  • test(metrics/syncer): wait longer for syncer start/stop ut for slow CI by @gyuho in #578
  • feat(release): deprecate ubuntu20.04, remove default release file without ubuntu version by @gyuho in #596
  • feat(xid,sxid): use kmsg instead of dmesg by @eahydra in #602
  • fix(disk): set missing State.Error field by @gyuho in #598
  • test(metrics/syncer): make start/stop tests less flaky for slow CI by @gyuho in #599
  • nits(infiniband): remove id package, make code ordering consistent by @gyuho in #600
  • feat: use kmsg instead of dmesg by @eahydra in #603
  • chore: remove all unnecessary dmesg information by @eahydra in #605
  • feat(kubelet): check port open, list pods with failure threshold by @gyuho in #589
  • feat(gpud-manager/systemd): auto-add --log-file flag with default file path by @gyuho in #606
  • feat(cpu): simplify metrics, simplify states output and checks by @gyuho in #597
  • feat(nvidia/clock-speed): simplify with its own poller (no shared poller) by @gyuho in #607
  • feat(fd): simplify metrics with prometheus by @gyuho in #608
  • feat(memory): simplify metrics using Prometheus-based ones by @gyuho in #611
  • feat(nvidia/remapped-rows): simplify metrics by @gyuho in #612
  • feat(nvidia/hw-slowdown): simplify metrics, remove shared poller by @gyuho in #609
  • feat(nvidia/xid): mark Xid 143 as fatal (gpu init failure) by @gyuho in #614
  • feat(nvidia/temperature, power, util, ecc, gpm, gsp-firmware-mode, processes, persistence-mode, nvlink): simplify metrics, remove shared poller by @gyuho in #610
  • feat(nvidia/info): simplify, remove shared poller dependency by @gyuho in #613
  • feat(nvidia): remove shared poller package by @gyuho in #615
  • feat(pkg/login): clean up, increase unit test coverage by @gyuho in #616
  • feat(cpu, fd): clean up init, add unit tests by @gyuho in #617
  • tests(metrics/syncer): remove flaky tests by @gyuho in #619
  • feat(server): deprecate per-component metrics backend for global prometheus registry, support component selection in metrics store by @gyuho in #618
  • feat(metrics): define/use its own prometheus registry by @gyuho in #623
  • feat(fabric-manager): simplify to remove dbus library dependency by @gyuho in #622
  • feat(components/*): increase unit test coverage, make healthy/reason check consistent by @gyuho in #624
  • README: add test coverage codecov.io badge by @gyuho in #626
  • feat(components): add unit tests, set component method, move common pkg to api/v1 by @gyuho in #628
  • feat(server): remove local web UI by @gyuho in #625
  • tests(memory/bpf): add more unit tests by @gyuho in #632
  • go.mod: upgrade various dependencies (including fsnotify, gopsutil, etc) for bug fixes by @gyuho in #630
  • test(containerd/pod): add more unit tests by @gyuho in #631
  • fix(nvidia): use correct temperature threshold in variable name by @gyuho in #636
  • feat(api/v1): add "LoginRequest" struct by @gyuho in #633
  • feat(api/v1): rename events, states, infos to GPUdComponentEvents, GPUdComponentStates, GPUdComponentInfos by @gyuho in #634
  • feat(login): rename Gossip to SendGossip by @gyuho in #638
  • fix(disk): skip partition check if it does not exist anymore by @gyuho in #637
  • feat(components): rename state to health state by @gyuho in #639
  • nit(metrics, components): rename type.go to types.go by @gyuho in #642
  • feat(os, nvidia, gpud scan): add "HealthCheckResult" interface to simplify scan per component by @gyuho in #640
  • feat(components): remove deprecated healthy boolean fields in favor of health state by @gyuho in #645
  • feat(cpu): support one-off check health state (for "gpud scan") by @gyuho in #646
  • feat(memory): add check health states helper for "gpud scan" operation by @gyuho in #648
  • feat(network): add one-off check for network latency by @gyuho in #647
  • feat(gpud): remove "diagnose" command for now, move to "scan" command by @gyuho in #649
  • feat(session): add cookiejar and do health check before starting session by @cardyok in #643
  • feat(process/scheduler): add exclusive scheduler, support "bootstrap" session message by @gyuho in #635
  • feat(components): define registry interface for components init simplification by @gyuho in #650
  • feat(gossip, login): separate gossip pkg, implement new login/gossip operations by @gyuho in #641
  • fix(gpud run): create metadata table if not exists by @gyuho in #666
  • Improve connect test coverage by @Raajheer1 in #667
  • feat(components/*): refactor to add "Check" method for one-time operations, simplify Component interfaces by @gyuho in #651
  • feat(pkg, nvidia): remove unused nvidia query func, flags for scan command by @gyuho in #669
  • feat(login): populate resource spec cpu/memory + GPU counts nvidia.com/gpu by @gyuho in #668
  • feat(nvidia/bad-envs): rename Data to checkResult, add more unit tests by @gyuho in #671
  • test(client/v1): add more healthz client tests by @gyuho in #673
  • feat(reboot): reboot after 10s by default by @eahydra in #670
  • feat(gpud): clean up --endpoint flag parsing, persist in systemd service file by @gyuho in #681
  • feat(*): populate health state Component field, rename Data to checkResult by @gyuho in #674
  • feat(nvidia-query/nvml): remove unused shared get function by @gyuho in #679
  • Add testing for pkg/server by @Raajheer1 in #680
  • go.mod: upgrade sqlite3 to 1.14.28 for CVE-2025-29087 by @gyuho in #677
  • go.mod: bump up prometheus procfs to 0.16.1 by @gyuho in #676
  • test(e2e): rename import aliases, context variables by @gyuho in #675
  • feat(components): add more register methods, add Deregisterable interface by @gyuho in #672
  • feat(custom-plugins): implement server + client by @gyuho in #627
  • fix(nvidia/info, fabric manager): check nvml library exists by @gyuho in #682
  • feat(nvidia): clean up nvml instance creation, add driver/cuda version methods by @gyuho in #683
  • fix(server): pre-process endpoint URL for new server by @gyuho in #684
  • feat(nvml): clean up info components with new met...

gpud-v0.4.9

21 Apr 15:52
9316cef

GPUd release notes (2025-04-21T15:51:59Z)

Welcome to this new release!

What's Changed

  • fix(nvidia-query): require non-empty GPU device name from NVML to mark GPUs as installed by @gyuho in #689

Full Changelog: v0.4.8...v0.4.9

gpud-v0.4.8

27 Mar 07:33
deb28bb

GPUd release notes (2025-03-27T07:56:21Z)

Welcome to this new release!

What's Changed

  • fix(library): sort reason slice to make /states more deterministic by @gyuho in #567
  • go.mod: bump up go version to 1.24.1 by @gyuho in #569
  • feat(nvidia/bad-envs): simplify, remove poller imports by @gyuho in #573
  • feat(fd, cpu, memory, remapped-rows, info, containerd, kubelet, docker, tailscale): return data fetch error by @gyuho in #572

Full Changelog: v0.4.7...v0.4.8

gpud-v0.4.7

25 Mar 10:08
4ed36e8
Pre-release

GPUd release notes (2025-03-25T10:09:44Z)

Welcome to this new release!

What's Changed

  • go.mod: bump up go version to 1.23.7 by @gyuho in #556
  • feat(goreleaser): remove deprecated fields by @gyuho in #557
  • fix(package-controller): force wait cmd finish to retrieve result by @cardyok in #560
  • fix: panic on special environments by @eahydra in #561
  • fix(xid/sxid): update current state with reboot event included by @cardyok in #564
  • fix(latency/derp): close ts client once used by @gyuho in #566
  • fix(docker): close docker client once used by @gyuho in #565

Full Changelog: v0.4.6...v0.4.7

gpud-v0.4.6

22 Mar 04:31
c1eaabe
Pre-release

GPUd release notes (2025-03-22T04:31:07Z)

Welcome to this new release!

What's Changed

  • feat(components/file): remove unused component (in favor of upcoming external plugin) by @gyuho in #550
  • feat(component/library): simplify + increase unit test coverage by @gyuho in #551
  • feat(kernel-module): simplify + increase test coverage by @gyuho in #552
  • fix(metrics/state): fix select query on empty secondaryName by @gyuho in #553
  • fix(server): do not prematurely shutdown nvml instance for row remapping component by @gyuho in #555

Full Changelog: v0.4.5...v0.4.6

gpud-v0.4.5

19 Mar 05:08
0f32e00

GPUd release notes (2025-03-19T05:09:18Z)

Welcome to this new release!

What's Changed

  • test(dmesg): add missing close func call, reduce flakiness in dmesg unit tests by @gyuho in #457
  • fix(e2e/mock/nvml): add missing get proc util func by @gyuho in #459
  • dep(go mod): bump up prom client go to v1.21.0 by @gyuho in #458
  • tests(pkg/nvidia/nvml): unit test for util/temp files, fix flaky mock tests by @gyuho in #460
  • tests(pkg/process): increase unit test coverage by @gyuho in #465
  • tests(nvidia/nvml): increase test coverage of clock events/speed by @gyuho in #464
  • tests(nvidia-query/nvml): more unit tests using nvml mock by @gyuho in #461
  • feat(dmesg): log dropped events due to channel full by @gyuho in #462
  • feat(pkg/dmesg): simplify log line processor create funcs by @gyuho in #469
  • feat(power-supply): remove not useful battery checks by @gyuho in #468
  • feat(nvidia/hw-slowdown): display data_sources in alert message by @gyuho in #463
  • tests(pkg/log): add more unit tests by @gyuho in #474
  • feat(nvidia-query/nvidia-smi): improve attached GPUs mismatch error message by @gyuho in #475
  • feat(pkg/login): unit testable code base, add unit tests by @gyuho in #470
  • feat(pkg/latency): "Closest" helper for "join" region detection by @gyuho in #471
  • feat(pkg/update): split into smaller files, add unit tests by @gyuho in #473
  • nits(pkg/dmesg): remove unused by @gyuho in #477
  • feat(netutils/latency): move pkg, add ut from tailscale code base by @gyuho in #476
  • feat(nvidia/infiniband): watch mellanox power events via dmesg/kernel by @gyuho in #467
  • fix(session/serve): fix response error JSON encoding by @gyuho in #480
  • tests(nvidia-query/nvml): add more unit tests for nvlink calls by @gyuho in #481
  • tests(nvidia/nvml): add more unit tests by @gyuho in #483
  • nit(session): do not shadow component name var in metrics get within for-loops by @gyuho in #485
  • feat(nvidia/fabric-manager): simplify log watcher, remove fabric manager from shared poller by @gyuho in #479
  • nit(session): simplify "delete" handling by @gyuho in #484
  • feat(query/log): remove unused tail scan by @gyuho in #486
  • fix(infiniband): simplify port check message, handle "Polling" state by @gyuho in #482
  • feat(query/log): remove, and simplify "scan" dmesg func for one-off operations, remove "gpud logs" by @gyuho in #487
  • feat(nvidia-query): remove some redundant nvidia-smi dependencies by @gyuho in #489
  • fix(xid/sxid): remove critical field by @cardyok in #490
  • fix(e2e): add missing shutdown, nvidia-smi input by @gyuho in #492
  • feat(install): support arm install by @cardyok in #495
  • feat(kmsg): initial commit (experiment to replace dmesg watcher) by @gyuho in #491
  • feat(components): watch with kmsg (experimental) by @gyuho in #496
  • feat(containerd): simplify /states with goroutines by @gyuho in #493
  • nit(common/action): remove unused public methods by @gyuho in #497
  • nit(pkg/randutil): remove by @gyuho in #499
  • tests(pkg/host): add more uts by @gyuho in #498
  • fix(kmsg): fix uptime-based kmsg timestamp parsing by @gyuho in #500
  • feat(nvidia): only keep "nvidia-smi --query" by @gyuho in #501
  • feat(components): remove unused states helper functions by @gyuho in #503
  • nits(nvidia/row-remapping): log row remapping states, document memory capability by @gyuho in #502
  • tests(gpud-metrics): add more unit tests by @gyuho in #504
  • nits(eventstore): improve event store to avoid direct reliance on database by @eahydra in #507
  • feat(components): simplify kubelet/containerd/docker states by @gyuho in #505
  • feat(docker, containerd, kubelet, tailscale): clean up systemd service checks by @gyuho in #513
  • feat(nvidia): remove unused "error" component, smi usage for temp/power checks, remove smi parse by @gyuho in #514
  • feat(nvidia): remove nvidia-smi dependency in all places (use NVML instead) by @gyuho in #515
  • fix(e2e): add missing get cuda version nvml call mock by @gyuho in #516
  • feat(cpu, memory, fd): simplify data poll by @gyuho in #517
  • fix(nvidia/remapped-rows): clean up nvml mentions from message, remove unused SMI usage from nvidia ecc checks by @gyuho in #518
  • feat(nvidia/remapped-rows): only suggest hw inspection if not reboot by @gyuho in #519
  • feat(cpu): remove unused physical cores field by @gyuho in #520
  • fix(sxid): fix sxid event message by @cardyok in #521
  • fix(xid/sxid): update state error based on xid/sxid name by @cardyok in #522
  • build(deps): bump golang.org/x/net from 0.34.0 to 0.36.0 by @dependabot in #523
  • fix(xid/sxid): add rand millisecond to event timestamp to avoid conflict by @cardyok in #525
  • feat(nvidia): simulate "get remapped rows" (optional) by @gyuho in #524
  • fix(nvml): fix hw slowdown events db writes, support simulate hw slowdown flag by @gyuho in #526
  • nit(remapped-rows): simplify reason message by @gyuho in #529
  • fix(xid + sxid): truncate repair actions to pick hw inspection after reboot by @gyuho in #530
  • feat(os, peermem): ignore lsmod, system virt cmd timeouts (as non-actionable) by @gyuho in #531
  • feat(metrics): remove unused EMA sql track/query by @gyuho in #533
  • chore(xid+sxid+scan): polish message by @eahydra in #535
  • feat(containerd): reduce/simplify CRI list calls by @gyuho in #534
  • chore(xid+sxid): polish reason and error message by @eahydra in #537
  • chore(xid+sxid): polish event message by @eahydra in #539
  • feat(remapped-rows): handle "system not ready" nvml error, refactor remapped-rows component by @gyuho in #540
  • feat(kmsg): add default deduper by second-scale by @gyuho in #541
  • feat(components/info): simplify /states with struct by @gyuho in #542
  • feat(containerd/pod): improve list pod sandbox error message in /states by @gyuho in #543
  • fix(nvidia): handle nvml.ERROR_NOT_FOUND in utilization query by @gyuho in #544
  • fix(nvidia/processes): handle nil output/data for /states response by @gyuho in #547
  • feat(containerd): retry cri unix grpc connection until ready by @gyuho in #545
  • feat(eventstore): handle invalid sql conversion error better by @gyuho in #546
  • feat(components): add missing nil pointer checks for getStates by @gyuho in #548

Full Changelog: v0.4.4...v0.4.5

gpud-v0.4.4

24 Feb 12:35
218e1f6

GPUd release notes (2025-02-24T12:36:24Z)

Welcome to this new release!

What's Changed

  • feat(dmesg/watcher): set default stream buffer limit to 16KB by @gyuho in #397
  • fix(os): log which command fails when context timeout in get by @gyuho in #398
  • project(*): move non-component code to /pkg by @gyuho in #368
  • feat(dmesg): dedup logs by seconds if same content, bump up log channel buffer by @gyuho in #399
  • nits(components/memory): set missing message field in events, update "New" by @gyuho in #402
  • feat(nvidia/nccl): use new dmesg poller (simplified) by @gyuho in #401
  • nits(components): close dmesg watcher without sync.Once by @gyuho in #404
  • feat(dmesg): log line processor for dmesg by @gyuho in #406
  • fix(infiniband): do not call "ibstat" on empty thresholds by @gyuho in #414
  • fix(library): locate "libnvidia-ml.so.1" for later nvidia drivers (>=565.57.01) for library checks, error if not found by @gyuho in #415
  • fix(infiniband): set /states unhealthy when thresholds set but ibstat not found by @gyuho in #420
  • feat(network/edge/derpmap): add Ashburn (Virginia), Nuremberg (Germany) by @gyuho in #421
  • test(pkg/dmesg): make "TestDedupLogLines" deterministic in slow CI by @gyuho in #422
  • fix(gpud-state): do not rollback if apiversion update committed successfully by @gyuho in #426
  • test(dmesg): make peer mem logs watch tests less flaky by @gyuho in #417
  • feat(systemd): do not return error on uptime check failures by @gyuho in #427
  • test(pkg/dmesg): make dedup log line tests more deterministic with waits by @gyuho in #425
  • go module: upgrade nvlib, gopsutil, add missing go-cache by @gyuho in #429
  • feat(dmesg): set default dmesg watcher in log line processor if not specified by @gyuho in #428
  • feat(memory): use dmesg log line processor (shared code) by @gyuho in #418
  • feat(cpu): use new dmesg poller (simplified) by @gyuho in #405
  • feat(nvidia/nccl): use new dmesg log line processor (simplified) by @gyuho in #419
  • test(dmesg): output more details for flaky tests by @gyuho in #430
  • feat(components/fd): use new dmesg poller (simplified) by @gyuho in #400
  • feat(nvidia/peermem): use new dmesg poller (simplified) by @gyuho in #403
  • feat(gpud update): trim space when fetching version from "_latest.txt" by @gyuho in #424
  • feat(accelerator/nvidia): move nvidia-query dmesg helper for xid/sxid by @gyuho in #433
  • feat(dmesg, log): support match func in log scanner by @gyuho in #436
  • feat(dmesg): remove component, use match func diagnose/scan by @gyuho in #432
  • feat(nvidia): only use NVML for GPU memory usage tracking by @gyuho in #416
  • feat(nvidia): remove nvidia-smi fallback for row remapping issues by @gyuho in #407
  • feat(disk/lsblk): more debug info when "lsblk --version" parse fails by @gyuho in #443
  • cmd(gpud): support "run --log-file" with built-in log rotation by @gyuho in #437
  • feat(xid/sxid): optimize state reason and error by @cardyok in #438
  • charts/gpud: remove (that does not follow the "best security practice" -- we will work on newer version) by @gyuho in #444
  • feat(reboot): introduce initializing state for critical components on reboot by @cardyok in #439
  • feat(kubelet): rename component name by @cardyok in #445
  • feat(session): do parallel component collection by @cardyok in #441
  • fix(ibstat): add healthy state by @cardyok in #446
  • feat(install): support install specific version by @cardyok in #447
  • feat(infiniband): use events store to track historical ibstat status by @gyuho in #448
  • feat(nvidia-query/server): remove redundant channel select, log more by @gyuho in #449
  • feat(nvidia-query): use "ERROR_ARGUMENT_VERSION_MISMATCH" to decide whether GPM is supported or not by @gyuho in #450
  • tests(error/xid): add more nvrm dmesg regex unit tests for Xid 119 by @gyuho in #452
  • feat(pkg/process): support custom bash script file name and directory, change default tmp file name pattern by @gyuho in #451
  • fix(fd): take file_max into limit setting by @cardyok in #453
  • fix(fd): make degraded state healthy false by @cardyok in #454
  • fix(fd): fix ut for degraded condition by @cardyok in #455

Full Changelog: v0.4.3...v0.4.4

gpud-v0.4.3

12 Feb 08:40
57fa27b

GPUd release notes (2025-02-12T08:40:40Z)

Welcome to this new release!

What's Changed

  • test(db/events): make retention purge unit tests less flaky by @gyuho in #389
  • test(nvidia/query/nvml): add clock speed, errors unit tests using mock/nvml by @gyuho in #360
  • test(errdefs): use more helpers, increase test coverage by @gyuho in #358
  • tests(internal/session): add more unit tests with smaller functions by @gyuho in #359
  • debug(infiniband): output raw ibstat when issue found by @gyuho in #392
  • fix(process): do not exit read before reading all buffer, support larger initial buffer size for scanner by @gyuho in #393
  • fix(xid/sxid): rely on last reboot first by @cardyok in #373
  • test(rootkeys): add unit tests by @gyuho in #357
  • test(client/v1): increase unit test coverage by @gyuho in #353
  • feat(ib, disk): use combined output for ibstat, lsblk by @gyuho in #395

Full Changelog: v0.4.2...v0.4.3

gpud-v0.4.2

10 Feb 18:34
e0b05fa

GPUd release notes (2025-02-10T18:35:30Z)

Welcome to this new release!

What's Changed

  • fix(xid/sxid): only consider 3 day events and do not rely on purge by @cardyok in #390

Full Changelog: v0.4.1...v0.4.2