Releases: leptonai/gpud
gpud-v0.5.1-rc-0
What's Changed
- feat(nvidia/infiniband): remove "ibstatus" use for health checker by @gyuho in #946
- feat(components/*): log component errors as warn level by @gyuho in #950
- fix(nvidia/nccl, peermem): add missing cancel calls by @gyuho in #949
- feat(nvidia/infiniband): track /sys/class/infiniband/[device], optional data collection by @gyuho in #943
- fix(reboot): select only "reboot" events in "os" bucket by @gyuho in #951
- feat(nfs): check mount fs type, degraded if not nfs by @gyuho in #952
- fix(nvidia/xid, nvidia/sxid, docker): fix err.Error() panic by @gyuho in #953
- feat(nvidia/gpu-counts): add gpu counts component by @gyuho in #944
- feat(nvidia/infiniband): support custom infiniband class root dir by @gyuho in #954
- chore(package-controller): remove redundant log lines by @cardyok in #955
- fix(cmd/status): do not parse login timestamp if empty by @gyuho in #956
- fix(nvml/processes): handle "no such file or directory" error when listing gpu processes by @gyuho in #960
- feat(machine-info): add private ip fallback for aws using imds by @gyuho in #962
- tests(components): use assert for unit tests by @gyuho in #961
- feat(server): retain only 24 hours of metrics/events data (retain xid/sxid/reboot as exceptions) by @gyuho in #927
- fix(nfs-checker): add missing checker.Clean, simplify the logic to only check the self-machine by @gyuho in #963
- feat(nvidia/infiniband): more tests with class dir, separate evaluate_threshold file by @gyuho in #965
- feat(nvidia): new device interface to get PCI Bus ID by @gyuho in #966
- feat(components/xid): append GPU uuid in reason by @eahydra in #970
- feat(nvidia/xid): xid 110 should suggest reboot first by @gyuho in #974
- tests(nfs): make nfs error retry test more robust by @gyuho in #958
- fix(pkg/providers): handle no public IP case in AWS by @gyuho in #975
- feat(session): extend initializing grace period to 5 min and make persistence mode more frequent by @cardyok in #978
- fix(session): make last reboot time global by @cardyok in #979
- fix(nvidia/xid): remove stale test case by @gyuho in #977
- feat(memory): make oom events human readable using cadvisor code by @gyuho in #980
- feat(containerd): extend dangling pod count check in containerd by @cardyok in #981
- fix(containerd): recreate context for kubelet list calls by @gyuho in #982
- feat(infiniband): track historical ib port states to detect drop/flap by @gyuho in #971
Full Changelog: v0.5.0...v0.5.1-rc-0
gpud-v0.5.0
What's Changed
- feat(metrics, server): implement scraper, store, syncer for Prometheus metrics using sqlite as an experimental feature by @gyuho in #577
- nit(fd): remove unused prom gatherer from component by @gyuho in #583
- feat(edge/derp): add Helsinki region for network checks by @gyuho in #580
- nit(cpu): remove unused prom gatherer from component by @gyuho in #582
- feat(host, pci): move reboot utils, simplify by @gyuho in #586
- feat(metrics): label Prometheus metrics with component name by @gyuho in #579
- feat(fuse): simplify metrics + poller by @gyuho in #585
- feat(os): simplify poller, remove os component dependency from sxid/xid by @gyuho in #590
- feat(network-latency): simplify metrics + poller by @gyuho in #592
- feat(systemd): remove (in favor of individual component checks) by @gyuho in #587
- feat(disk): simplify metrics + poller (remove old metrics method for component) by @gyuho in #584
- feat(pci): refactor to clean up poller by @gyuho in #570
- test(metrics/syncer): wait longer for syncer start/stop ut for slow CI by @gyuho in #578
- feat(release): deprecate ubuntu20.04, remove default release file without ubuntu version by @gyuho in #596
- feat(xid,sxid): use kmsg instead of dmesg by @eahydra in #602
- fix(disk): set missing State.Error field by @gyuho in #598
- test(metrics/syncer): make start/stop tests less flaky for slow CI by @gyuho in #599
- nits(infiniband): remove id package, make code ordering consistent by @gyuho in #600
- feat: use kmsg instead of dmesg by @eahydra in #603
- chore: remove all unnecessary dmesg information by @eahydra in #605
- feat(kubelet): check port open, list pods with failure threshold by @gyuho in #589
- feat(gpud-manager/systemd): auto-add --log-file flag with default file path by @gyuho in #606
- feat(cpu): simplify metrics, simplify states output and checks by @gyuho in #597
- feat(nvidia/clock-speed): simplify with its own poller (no shared poller) by @gyuho in #607
- feat(fd): simplify metrics with prometheus by @gyuho in #608
- feat(memory): simplify metrics using Prometheus-based ones by @gyuho in #611
- feat(nvidia/remapped-rows): simplify metrics by @gyuho in #612
- feat(nvidia/hw-slowdown): simplify metrics, remove shared poller by @gyuho in #609
- feat(nvidia/xid): mark Xid 143 as fatal (gpu init failure) by @gyuho in #614
- feat(nvidia/temperature, power, util, ecc, gpm, gsp-firmware-mode, processes, persistence-mode, nvlink): simplify metrics, remove shared poller by @gyuho in #610
- feat(nvidia/info): simplify, remove shared poller dependency by @gyuho in #613
- feat(nvidia): remove shared poller package by @gyuho in #615
- feat(pkg/login): clean up, increase unit test coverage by @gyuho in #616
- feat(cpu, fd): clean up init, add unit tests by @gyuho in #617
- tests(metrics/syncer): remove flaky tests by @gyuho in #619
- feat(server): deprecate per-component metrics backend for global prometheus registry, support component selection in metrics store by @gyuho in #618
- feat(metrics): define/use its own prometheus registry by @gyuho in #623
- feat(fabric-manager): simplify to remove dbus library dependency by @gyuho in #622
- feat(components/*): increase unit test coverage, make healthy/reason check consistent by @gyuho in #624
- README: add test coverage codecov.io badge by @gyuho in #626
- feat(components): add unit tests, set component method, move common pkg to api/v1 by @gyuho in #628
- feat(server): remove local web UI by @gyuho in #625
- tests(memory/bpf): add more unit tests by @gyuho in #632
- go.mod: upgrade various dependencies (including fsnotify, gopsutil, etc) for bug fixes by @gyuho in #630
- test(containerd/pod): add more unit tests by @gyuho in #631
- fix(nvidia): use correct temperature threshold in variable name by @gyuho in #636
- feat(api/v1): add "LoginRequest" struct by @gyuho in #633
- feat(api/v1): rename events, states, infos to GPUdComponentEvents, GPUdComponentStates, GPUdComponentInfos by @gyuho in #634
- feat(login): rename Gossip to SendGossip by @gyuho in #638
- fix(disk): skip partition check if it does not exist anymore by @gyuho in #637
- feat(components): rename state to health state by @gyuho in #639
- nit(metrics, components): rename type.go to types.go by @gyuho in #642
- feat(os, nvidia, gpud scan): add "HealthCheckResult" interface to simplify scan per component by @gyuho in #640
- feat(components): remove deprecated healthy boolean fields in favor of health state by @gyuho in #645
- feat(cpu): support one-off check health state (for "gpud scan") by @gyuho in #646
- feat(memory): add check health states helper for "gpud scan" operation by @gyuho in #648
- feat(network): add one-off check for network latency by @gyuho in #647
- feat(gpud): remove "diagnose" command for now, move to "scan" command by @gyuho in #649
- feat(session): add cookiejar and do health check before starting session by @cardyok in #643
- feat(process/scheduler): add exclusive scheduler, support "bootstrap" session message by @gyuho in #635
- feat(components): define registry interface for components init simplification by @gyuho in #650
- feat(gossip, login): separate gossip pkg, implement new login/gossip operations by @gyuho in #641
- fix(gpud run): create metadata table if not exists by @gyuho in #666
- Improve connect test coverage by @Raajheer1 in #667
- feat(components/*): refactor to add "Check" method for one-time operations, simplify Component interfaces by @gyuho in #651
- feat(pkg, nvidia): remove unused nvidia query func, flags for scan command by @gyuho in #669
- feat(login): populate resource spec cpu/memory + GPU counts nvidia.com/gpu by @gyuho in #668
- feat(nvidia/bad-envs): rename Data to checkResult, add more unit tests by @gyuho in #671
- test(client/v1): add more healthz client tests by @gyuho in #673
- feat(reboot): reboot after 10s by default by @eahydra in #670
- feat(gpud): clean up --endpoint flag parsing, persist in systemd service file by @gyuho in #681
- feat(*): populate health state Component field, rename Data to checkResult by @gyuho in #674
- feat(nvidia-query/nvml): remove unused shared get function by @gyuho in #679
- Add testing for pkg/server by @Raajheer1 in #680
- go.mod: upgrade sqlite3 to 1.14.28 for CVE-2025-29087 by @gyuho in #677
- go.mod: bump up prometheus procfs to 0.16.1 by @gyuho in #676
- test(e2e): rename import aliases, context variables by @gyuho in #675
- feat(components): add more register methods, add Deregisterable interface by @gyuho in #672
- feat(custom-plugins): implement server + client by @gyuho in #627
- fix(nvidia/info, fabric manager): check nvml library exists by @gyuho in #682
- feat(nvidia): clean up nvml instance creation, add driver/cuda version methods by @gyuho in #683
- fix(server): pre-process endpoint URL for new server by @gyuho in #684
- feat(nvml): clean up info components with new met...
gpud-v0.4.9
GPUd release notes (2025-04-21T15:51:59Z)
Welcome to this new release!
What's Changed
- fix(nvidia-query): require non-empty GPU device name from NVML to mark GPUs as installed by @gyuho in #689
Full Changelog: v0.4.8...v0.4.9
gpud-v0.4.8
GPUd release notes (2025-03-27T07:56:21Z)
Welcome to this new release!
What's Changed
- fix(library): sort reason slice to make /states more deterministic by @gyuho in #567
- go.mod: bump up go version to 1.24.1 by @gyuho in #569
- feat(nvidia/bad-envs): simplify, remove poller imports by @gyuho in #573
- feat(fd, cpu, memory, remapped-rows, info, containerd, kubelet, docker, tailscale): return data fetch error by @gyuho in #572
Full Changelog: v0.4.7...v0.4.8
gpud-v0.4.7
GPUd release notes (2025-03-25T10:09:44Z)
Welcome to this new release!
What's Changed
- go.mod: bump up go version to 1.23.7 by @gyuho in #556
- feat(goreleaser): remove deprecated fields by @gyuho in #557
- fix(package-controller): force wait cmd finish to retrieve result by @cardyok in #560
- fix: panic on special environments by @eahydra in #561
- fix(xid/sxid): update current state with reboot event included by @cardyok in #564
- fix(latency/derp): close ts client once used by @gyuho in #566
- fix(docker): close docker client once used by @gyuho in #565
Full Changelog: v0.4.6...v0.4.7
gpud-v0.4.6
GPUd release notes (2025-03-22T04:31:07Z)
Welcome to this new release!
What's Changed
- feat(components/file): remove unused component (in favor of upcoming external plugin) by @gyuho in #550
- feat(component/library): simplify + increase unit test coverage by @gyuho in #551
- feat(kernel-module): simplify + increase test coverage by @gyuho in #552
- fix(metrics/state): fix select query on empty secondaryName by @gyuho in #553
- fix(server): do not prematurely shutdown nvml instance for row remapping component by @gyuho in #555
Full Changelog: v0.4.5...v0.4.6
gpud-v0.4.5
GPUd release notes (2025-03-19T05:09:18Z)
Welcome to this new release!
What's Changed
- test(dmesg): add missing close func call, reduce flakiness in dmesg unit tests by @gyuho in #457
- fix(e2e/mock/nvml): add missing get proc util func by @gyuho in #459
- dep(go mod): bump up prom client go v.1.21.0 by @gyuho in #458
- tests(pkg/nvidia/nvml): unit test for util/temp files, fix flaky mock tests by @gyuho in #460
- tests(pkg/process): increase unit test coverage by @gyuho in #465
- tests(nvidia/nvml): increase test coverage of clock events/speed by @gyuho in #464
- tests(nvidia-query/nvml): more unit tests using nvml mock by @gyuho in #461
- feat(dmesg): log dropped events due to channel full by @gyuho in #462
- feat(pkg/dmesg): simplify log line processor create funcs by @gyuho in #469
- feat(power-supply): remove not useful battery checks by @gyuho in #468
- feat(nvidia/hw-slowdown): display data_sources in alert message by @gyuho in #463
- tests(pkg/log): add more unit tests by @gyuho in #474
- feat(nvidia-query/nvidia-smi): improve attached GPUs mismatch error message by @gyuho in #475
- feat(pkg/login): unit testable code base, add unit tests by @gyuho in #470
- feat(pkg/latency): "Closest" helper for "join" region detection by @gyuho in #471
- feat(pkg/update): split into smaller files, add unit tests by @gyuho in #473
- nits(pkg/dmesg): remove unused by @gyuho in #477
- feat(netutils/latency): move pkg, add ut from tailscale code base by @gyuho in #476
- feat(nvidia/infiniband): watch mellanox power events via dmesg/kernel by @gyuho in #467
- fix(session/serve): fix response error JSON encoding by @gyuho in #480
- tests(nvidia-query/nvml): add more unit tests for nvlink calls by @gyuho in #481
- tests(nvidia/nvml): add more unit tests by @gyuho in #483
- nit(session): do not shadow component name var in metrics get within for-loops by @gyuho in #485
- feat(nvidia/fabric-manager): simplify log watcher, remove fabric manager from shared poller by @gyuho in #479
- nit(session): simplify "delete" handling by @gyuho in #484
- feat(query/log): remove unused tail scan by @gyuho in #486
- fix(infiniband): simplify port check message, handle "Polling" state by @gyuho in #482
- feat(query/log): remove, and simplify "scan" dmesg func for one-off operations, remove "gpud logs" by @gyuho in #487
- feat(nvidia-query): remove some redundant nvidia-smi dependencies by @gyuho in #489
- fix(xid/sxid): remove critical field by @cardyok in #490
- fix(e2e): add missing shutdown, nvidia-smi input by @gyuho in #492
- feat(install): support arm install by @cardyok in #495
- feat(kmsg): initial commit (experiment to replace dmesg watcher) by @gyuho in #491
- feat(components): watch with kmsg (experimental) by @gyuho in #496
- feat(containerd): simplify /states with goroutines by @gyuho in #493
- nit(common/action): remove unused public methods by @gyuho in #497
- nit(pkg/randutil): remove by @gyuho in #499
- tests(pkg/host): add more uts by @gyuho in #498
- fix(kmsg): fix uptime-based kmsg timestamp parsing by @gyuho in #500
- feat(nvidia): only keep "nvidia-smi --query" by @gyuho in #501
- feat(components): remove unused states helper functions by @gyuho in #503
- nits(nvidia/row-remapping): log row remapping states, document memory capability by @gyuho in #502
- tests(gpud-metrics): add more unit tests by @gyuho in #504
- nits(eventstore): improve event store to avoid direct reliance on database by @eahydra in #507
- feat(components): simplify kubelet/containerd/docker states by @gyuho in #505
- feat(docker, containerd, kubelet, tailscale): clean up systemd service checks by @gyuho in #513
- feat(nvidia): remove unused "error" component, smi usage for temp/power checks, remove smi parse by @gyuho in #514
- feat(nvidia): remove nvidia-smi dependency in all places (use NVML instead) by @gyuho in #515
- fix(e2e): add missing get cuda version nvml call mock by @gyuho in #516
- feat(cpu, memory, fd): simplify data poll by @gyuho in #517
- fix(nvidia/remapped-rows): clean up nvml mentions from message, remove unused SMI usage from nvidia ecc checks by @gyuho in #518
- feat(nvidia/remapped-rows): only suggest hw inspection if not reboot by @gyuho in #519
- feat(cpu): remove unused physical cores field by @gyuho in #520
- fix(sxid): fix sxid event message by @cardyok in #521
- fix(xid/sxid): update state error base on xid/sxid name by @cardyok in #522
- build(deps): bump golang.org/x/net from 0.34.0 to 0.36.0 by @dependabot in #523
- fix(xid/sxid): add rand millisecond to event timestamp to avoid conflict by @cardyok in #525
- feat(nvidia): simulate "get remapped rows" (optional) by @gyuho in #524
- fix(nvml): fix hw slowdown events db writes, support simulate hw slowdown flag by @gyuho in #526
- nit(remapped-rows): simplify reason message by @gyuho in #529
- fix(xid + sxid): truncate repair actions to pick hw inspection after reboot by @gyuho in #530
- feat(os, peermem): ignore lsmod, system virt cmd timeouts (as non-actionable) by @gyuho in #531
- feat(metrics): remove unused EMA sql track/query by @gyuho in #533
- chore(xid+sxid+scan): polish message by @eahydra in #535
- feat(containerd): reduce/simplify CRI list calls by @gyuho in #534
- chore(xid+sxid): polish reason and error message by @eahydra in #537
- chore(xid+sxid): polish event message by @eahydra in #539
- feat(remapped-rows): handle "system not ready" nvml error, refactor remapped-rows component by @gyuho in #540
- feat(kmsg): add default deduper by second-scale by @gyuho in #541
- feat(components/info): simplify /states with struct by @gyuho in #542
- feat(containerd/pod): improve list pod sandbox error message in /states by @gyuho in #543
- fix(nvidia): handle nvml.ERROR_NOT_FOUND in utilization query by @gyuho in #544
- fix(nvidia/processes): handle nil output/data for /states response by @gyuho in #547
- feat(containerd): retry cri unix grpc connection until ready by @gyuho in #545
- feat(eventstore): handle invalid sql conversion error better by @gyuho in #546
- feat(components): add missing nil pointer checks for getStates by @gyuho in #548
Full Changelog: v0.4.4...v0.4.5
gpud-v0.4.4
GPUd release notes (2025-02-24T12:36:24Z)
Welcome to this new release!
What's Changed
- feat(dmesg/watcher): set default stream buffer limit to 16KB by @gyuho in #397
- fix(os): log which command fails when context timeout in get by @gyuho in #398
- project(*): move non-component code to /pkg by @gyuho in #368
- feat(dmesg): dedup logs by seconds if same content, bump up log channel buffer by @gyuho in #399
- nits(components/memory): set missing message field in events, update "New" by @gyuho in #402
- feat(nvidia/nccl): use new dmesg poller (simplified) by @gyuho in #401
- nits(components): close dmesg watcher without sync.Once by @gyuho in #404
- feat(dmesg): log line processor for dmesg by @gyuho in #406
- fix(infiniband): do not call "ibstat" on empty thresholds by @gyuho in #414
- fix(library): locate "libnvidia-ml.so.1" for later nvidia drivers (>=565.57.01) for library checks, error if not found by @gyuho in #415
- fix(infiniband): set /states unhealthy when thresholds set but ibstat not found by @gyuho in #420
- feat(network/edge/derpmap): add Ashburn (Virginia), Nuremberg (Germany) by @gyuho in #421
- test(pkg/dmesg): make "TestDedupLogLines" deterministic in slow CI by @gyuho in #422
- fix(gpud-state): do not rollback if apiversion update committed successfully by @gyuho in #426
- test(dmesg): make peer mem logs watch tests less flaky by @gyuho in #417
- feat(systemd): do not return error on uptime check failures by @gyuho in #427
- test(pkg/dmesg): make dedup log line tests more deterministic with waits by @gyuho in #425
- go module: upgrade nvlib, gopsutil, add missing go-cache by @gyuho in #429
- feat(dmesg): set default dmesg watcher in log line processor if not specified by @gyuho in #428
- feat(memory): use dmesg log line processor (shared code) by @gyuho in #418
- feat(cpu): use new dmesg poller (simplified) by @gyuho in #405
- feat(nvidia/nccl): use new dmesg log line processor (simplified) by @gyuho in #419
- test(dmesg): output more details for flaky tests by @gyuho in #430
- feat(components/fd): use new dmesg poller (simplified) by @gyuho in #400
- feat(nvidia/peermem): use new dmesg poller (simplified) by @gyuho in #403
- feat(gpud update): trim space when fetching version from "_latest.txt" by @gyuho in #424
- feat(accelerator/nvidia): move nvidia-query dmesg helper for xid/sxid by @gyuho in #433
- feat(dmesg, log): support match func in log scanner by @gyuho in #436
- feat(dmesg): remove component, use match func diagnose/scan by @gyuho in #432
- feat(nvidia): only use NVML for GPU memory usage tracking by @gyuho in #416
- feat(nvidia): remove nvidia-smi fallback for row remapping issues by @gyuho in #407
- feat(disk/lsblk): more debug info when "lsblk --version" parse fails by @gyuho in #443
- cmd(gpud): support "run --log-file" with built-in log rotation by @gyuho in #437
- feat(xid/sxid): optimize state reason and error by @cardyok in #438
- charts/gpud: remove (that does not follow the "best security practice" -- we will work on newer version) by @gyuho in #444
- feat(reboot): introduce initializing state for critical components on reboot by @cardyok in #439
- feat(kubelet): rename component name by @cardyok in #445
- feat(session): do parallel component collection by @cardyok in #441
- fix(ibstat): add healthy state by @cardyok in #446
- feat(install): support install specific version by @cardyok in #447
- feat(infiniband): use events store to track historical ibstat status by @gyuho in #448
- feat(nvidia-query/server): remove redundant channel select, log more by @gyuho in #449
- feat(nvidia-query): use "ERROR_ARGUMENT_VERSION_MISMATCH" to decide whether GPM is supported or not by @gyuho in #450
- tests(error/xid): add more nvrm dmesg regex unit tests for Xid 119 by @gyuho in #452
- feat(pkg/process): support custom bash script file name and directory, change default tmp file name pattern by @gyuho in #451
- fix(fd): take file_max into limit setting by @cardyok in #453
- fix(fd): make degraded state healthy false by @cardyok in #454
- fix(fd): fix ut for degraded condition by @cardyok in #455
Full Changelog: v0.4.3...v0.4.4
gpud-v0.4.3
GPUd release notes (2025-02-12T08:40:40Z)
Welcome to this new release!
What's Changed
- test(db/events): make retention purge unit tests less flaky by @gyuho in #389
- test(nvidia/query/nvml): add clock speed, errors unit tests using mock/nvml by @gyuho in #360
- test(errdefs): use more helpers, increase test coverage by @gyuho in #358
- tests(internal/session): add more unit tests with smaller functions by @gyuho in #359
- debug(infiniband): output raw ibstat when issue found by @gyuho in #392
- fix(process): do not exit read before reading all buffer, support larger initial buffer size for scanner by @gyuho in #393
- fix(xid/sxid): rely on last reboot first by @cardyok in #373
- test(rootkeys): add unit tests by @gyuho in #357
- test(client/v1): increase unit test coverage by @gyuho in #353
- feat(ib, disk): use combined output for ibstat, lsblk by @gyuho in #395
Full Changelog: v0.4.2...v0.4.3
gpud-v0.4.2
GPUd release notes (2025-02-10T18:35:30Z)
Welcome to this new release!
What's Changed
Full Changelog: v0.4.1...v0.4.2