Problem
A talm upgrade that exits 0 and reports "post check passed" does not prove that the node is actually running the target Talos version. A failed Talos boot triggers A/B auto-rollback, but talm still reports success because the apply RPC was acknowledged.
Reproducible on the dev17 OCI cluster, confirmed end-to-end:
- Node running ghcr.io/cozystack/cozystack/talos:v1.12.6 (carries 9 bundled extensions: drbd, zfs, amdgpu, i915, intel/amd-ucode, intel-ice-firmware, bnx2, qlogic-firmware).
- talm upgrade --image ghcr.io/siderolabs/installer:v1.13.0 -f nodes/node0.yaml returns exit 0 with post check passed.
- Talos pulls the new installer, writes A partition, reboots.
- New A partition lacks the extensions present in B; the machine config references state that needs them (zfs-backed /var/lib/local).
- Boot fails its readiness check, Talos auto-rolls back to B.
- talm get version after the apply returns v1.12.6: the operator's "upgrade" silently no-op'd.
On a production fleet this scales badly: dozens-to-hundreds of "successful" upgrades can leave every node on the old version, undetected for days.
Proposed fix — Phase 2C post-upgrade version verify
Layer on top of the apply-safety gates introduced in #172 / #173:
- Before ApplyConfiguration --image, capture the target version from the image tag (or query the installer image labels, e.g. org.opencontainers.image.version).
- After the apply RPC returns success and the post-boot reconcile window elapses (the same preflightCOSIReadTimeout-style cap Phase 2B uses), read the runtime.Version COSI resource on each target node.
- If running.Version != target.Version, return a hint-bearing blocker:
post-upgrade: requested upgrade to v1.13.0 but running version is v1.12.6
— Talos auto-rolled back. Check installer compatibility (extensions,
vendor, architecture).
hint: cross-vendor upgrade (cozystack → siderolabs) drops bundled
extensions; use the cozystack-built image at the target version.
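The tag capture and the comparison could look roughly like the sketch below. It is illustrative only: targetVersionFromImage, normalizeVersion and checkRunningVersion are hypothetical names, not existing talm functions, and the sketch assumes the image tag equals the Talos version string (the label-based lookup is not covered).

```go
package main

import (
	"fmt"
	"strings"
)

// targetVersionFromImage extracts the tag from an installer image reference,
// e.g. "ghcr.io/siderolabs/installer:v1.13.0" -> "v1.13.0".
// Hypothetical helper: references without a tag (untagged or digest-only)
// are rejected, because no target version can be derived from them.
func targetVersionFromImage(image string) (string, error) {
	ref := image
	// Drop a trailing "@sha256:..." digest if present.
	if i := strings.Index(ref, "@"); i >= 0 {
		ref = ref[:i]
	}
	// The tag separator must come after the last '/', so a registry port
	// such as "localhost:5000/installer" is not mistaken for a tag.
	slash := strings.LastIndex(ref, "/")
	colon := strings.LastIndex(ref, ":")
	if colon < 0 || colon < slash || colon == len(ref)-1 {
		return "", fmt.Errorf("image %q has no tag to derive the target version from", image)
	}
	return ref[colon+1:], nil
}

// normalizeVersion makes "1.13.0" and "v1.13.0" compare equal.
func normalizeVersion(v string) string {
	return "v" + strings.TrimPrefix(strings.TrimSpace(v), "v")
}

// checkRunningVersion returns nil when the node runs the target version and
// a blocker error modelled on the message proposed above otherwise.
func checkRunningVersion(target, running string) error {
	t, r := normalizeVersion(target), normalizeVersion(running)
	if t == r {
		return nil
	}
	return fmt.Errorf("post-upgrade: requested upgrade to %s but running version is %s; "+
		"Talos auto-rolled back. Check installer compatibility (extensions, vendor, architecture)", t, r)
}
```

With the values from the reproduction, checkRunningVersion("v1.13.0", "v1.12.6") returns the blocker, while the same-version case returns nil.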
Optional pre-upgrade extension compatibility check
Before applying, walk the node's current ExtensionStatus COSI resources and compare them against the new installer's manifest (if exposed via talosctl image inspect or similar). Block the upgrade in pre-flight and list each missing extension by name. Skippable with --skip-extension-check for migrations where the operator knows extensions are intentionally dropped.
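A minimal sketch of the extension-set comparison, assuming both sides have already been reduced to plain extension names. How the target installer's extension list is obtained is left open above, so it is simply a parameter here; missingExtensions and checkExtensions are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// missingExtensions returns the extension names present on the node
// (current) but absent from the target installer, deduplicated and sorted
// so the resulting message is stable.
func missingExtensions(current, target []string) []string {
	have := make(map[string]struct{}, len(target))
	for _, name := range target {
		have[name] = struct{}{}
	}
	seen := make(map[string]struct{})
	var missing []string
	for _, name := range current {
		if _, ok := have[name]; ok {
			continue
		}
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		missing = append(missing, name)
	}
	sort.Strings(missing)
	return missing
}

// checkExtensions blocks the upgrade when extensions would be dropped,
// unless skip is set (mirroring the proposed --skip-extension-check flag).
func checkExtensions(current, target []string, skip bool) error {
	if skip {
		return nil
	}
	missing := missingExtensions(current, target)
	if len(missing) == 0 {
		return nil
	}
	return fmt.Errorf("pre-upgrade: target installer lacks %d extension(s) present on the node: %s",
		len(missing), strings.Join(missing, ", "))
}
```

Keeping the comparison a pure function over two string slices is what makes it unit-testable without a live node.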
Scope
- New helper in pkg/applycheck/ for extension-set comparison (unit-testable).
- New preflight hook verifyPostUpgradeVersion in pkg/commands/preflight_apply_safety.go.
- Wire into the upgrade flow, not just apply: the upgrade command has its own RPC path (talosctl_wrapper.go upgrade handler).
- Default off (like --skip-post-apply-verify) until tested at scale; documented opt-in via --verify-post-upgrade=true.
Out of scope
- General OS-level diff (kernel modules outside Talos extension framework).
- Detecting partial / silent failures of the new install BEFORE boot (Talos itself is the only source of truth there).
Test plan
Real-Talos validation against the dev17 cluster:
- Bad upgrade: cozystack-1.12 → siderolabs-1.13. Expected: blocker with version-mismatch message.
- Good upgrade: cozystack-1.12 → cozystack-1.13 (once cozystack publishes 1.13). Expected: silent pass.
- Same-version no-op upgrade: cozystack-1.12.6 → cozystack-1.12.6. Expected: silent pass (no version change but matches).
Mock-side: unit tests for the version comparator (expected vs running, image tag parse), and for the extension-compat helper.
Surfaced during
Real-Talos exercise of #173. Adversarial QA against dev17 found the silent-no-op as part of the cross-version upgrade scenario in the manual test plan section K.