
Upgrade silent no-op: talm reports success when Talos auto-rolled back the install #175

@lexfrei

Problem

An exit code of 0 from talm upgrade plus a reported "post check passed" does not prove the node is actually running the target Talos version. A failed Talos boot triggers A/B auto-rollback, but talm still reports success because the apply RPC was acknowledged.

Reproducible on the dev17 OCI cluster, confirmed end-to-end:

  1. Node running ghcr.io/cozystack/cozystack/talos:v1.12.6 (carries 9 bundled extensions: drbd, zfs, amdgpu, i915, intel/amd-ucode, intel-ice-firmware, bnx2, qlogic-firmware).
  2. talm upgrade --image ghcr.io/siderolabs/installer:v1.13.0 -f nodes/node0.yaml returns exit 0 with post check passed.
  3. Talos pulls the new installer, writes A partition, reboots.
  4. New A partition lacks the extensions present in B; the machine config references state that needs them (zfs-backed /var/lib/local).
  5. Boot fails its readiness check, Talos auto-rolls back to B.
  6. talm get version after the apply returns v1.12.6 — the operator's "upgrade" silently no-op'd.

On a production fleet this scales badly: dozens-to-hundreds of "successful" upgrades can leave every node on the old version, undetected for days.

Proposed fix — Phase 2C post-upgrade version verify

Layer on top of the apply-safety gates introduced in #172 / #173:

  1. Before ApplyConfiguration --image, capture the target version from the image tag (or query installer image labels — org.opencontainers.image.version).

  2. After the apply RPC returns success and the post-boot reconcile window elapses (the same preflightCOSIReadTimeout-style cap Phase 2B uses), read the runtime.Version COSI resource on each target node.

  3. If running.Version != target.Version, return a hint-bearing blocker:

    post-upgrade: requested upgrade to v1.13.0 but running version is v1.12.6
    — Talos auto-rolled back. Check installer compatibility (extensions,
    vendor, architecture).
    hint: cross-vendor upgrade (cozystack → siderolabs) drops bundled
    extensions; use the cozystack-built image at the target version.
    

Optional pre-upgrade extension compatibility check

Before applying, walk current ExtensionStatus COSI and compare against the new installer's manifest (if exposed via talosctl image inspect or similar). Block the upgrade pre-flight with the list of missing extensions, citing each. Skippable with --skip-extension-check for migrations where the operator knows extensions are intentionally dropped.

Scope

  • New helper in pkg/applycheck/ for extension-set comparison (unit-testable).
  • New preflight hook verifyPostUpgradeVersion in pkg/commands/preflight_apply_safety.go.
  • Wire into the upgrade flow, not just apply — the upgrade command has its own RPC path (talosctl_wrapper.go upgrade handler).
  • Default off (like --skip-post-apply-verify) until tested at scale; documented opt-in via --verify-post-upgrade=true.

Out of scope

  • General OS-level diff (kernel modules outside Talos extension framework).
  • Detecting partial / silent failures of the new install BEFORE boot (Talos itself is the only source of truth there).

Test plan

Real-Talos validation against the dev17 cluster:

  • Bad upgrade: cozystack-1.12 → siderolabs-1.13. Expected: blocker with version-mismatch message.
  • Good upgrade: cozystack-1.12 → cozystack-1.13 (once cozystack publishes 1.13). Expected: silent pass.
  • Same-version no-op upgrade: cozystack-1.12.6 → cozystack-1.12.6. Expected: silent pass (no version change but matches).

Mock-side: unit tests for the version comparator (expected vs running, image tag parse), and for the extension-compat helper.

Surfaced during

Real-Talos exercise of #173. Adversarial QA against dev17 found the silent-no-op as part of the cross-version upgrade scenario in the manual test plan section K.

Metadata

Labels

area/apply, kind/feature
