feat(apply): apply-time safety gates — declared-resource existence + drift preview/verify#173
feat(apply): apply-time safety gates — declared-resource existence + drift preview/verify#173lexfrei wants to merge 1 commit into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the applycheck package to implement safety gates for Talos MachineConfig applications, including host-resource reference validation and structural diffing. These checks are integrated into the apply command with new flags to control resource validation, drift previews, and post-apply verification. A review comment identifies an improvement opportunity in error handling within cosiMachineConfigReader to prevent transient errors from being incorrectly reported as maintenance connection issues.
| if err != nil { | ||
| // PermissionDenied on the insecure path is the documented | ||
| // Sensitive-resource degrade signal. Distinguishing transient | ||
| // failures from auth-side denial would require parsing the | ||
| // gRPC status; for the pre-merge cut we treat any read | ||
| // failure as "could not read" and let the caller decide | ||
| // whether to surface that as a blocker (Phase 2B never does; | ||
| // Phase 2A never does either). | ||
| return nil, false, nil | ||
| } |
There was a problem hiding this comment.
The current implementation swallows the error from safe.StateGet. This can lead to misleading output in previewDrift and verifyAppliedState, where a transient network or API error is reported as a maintenance connection issue. It's better to return the error so the caller can provide accurate feedback.
| if err != nil { | |
| // PermissionDenied on the insecure path is the documented | |
| // Sensitive-resource degrade signal. Distinguishing transient | |
| // failures from auth-side denial would require parsing the | |
| // gRPC status; for the pre-merge cut we treat any read | |
| // failure as "could not read" and let the caller decide | |
| // whether to surface that as a blocker (Phase 2B never does; | |
| // Phase 2A never does either). | |
| return nil, false, nil | |
| } | |
| if err != nil { | |
| // PermissionDenied on the insecure path is the documented | |
| // Sensitive-resource degrade signal. Distinguishing transient | |
| // failures from auth-side denial would require parsing the | |
| // gRPC status; for the pre-merge cut we treat any read | |
| // failure as "could not read" and let the caller decide | |
| // whether to surface that as a blocker (Phase 2B never does; | |
| // Phase 2A never does either). | |
| return nil, false, err | |
| } |
Capture the matrix of cases the gates were validated against during PR #173 development: Phase 1 link / disk / selector inputs, Phase 2A diff classification, Phase 2B mode gating, real-Talos validation steps against the dev17 reference cluster. Documents the known limitations (upgrade has no gates, insecure path can't render with discovery, --init --update non-tty UX gap). Future-self reference for re-validation after touching applycheck or the preflight hooks. Refs: #172 Signed-off-by: Aleksei Sviridkin <f@lex.la>
Adds docs/manual-test-plan.md, a checklist for exercising talm against a real Talos cluster. Complementary to the narrow apply-safety-gates plan: it walks the whole CLI surface (init, template, apply, upgrade, rotate-ca, meta, reset, completion, read-only diagnostics) with expected outcomes and the failure modes to watch for. Each section captures the exact commands run during the dev17 adversarial sweep that produced the eight bugs PR #173 fixes, plus the destructive sanity-check block to run after reset / rotate-ca / mode=reboot to confirm cluster health. Extends naturally — new sections drop in alphabetically (K, L, ...). Also adds nit.md to .gitignore so the in-flight notes file doesn't get accidentally committed. Refs: #172 Signed-off-by: Aleksei Sviridkin <f@lex.la>
…kip + linksDisksReader three-valued + md reflow Blockers from branch-review pass 2 (#173): 1. Phase 2C ran WithClient even on --insecure upgrade. The maintenance / pre-auth connection cannot reach the auth-only COSI path; pre-fix the gate either silently surrendered with a 'version unreadable' line or, worse, connected to an unrelated node from the talosconfig context and verified its state. Extract shouldRunPostUpgradeVerify(insecure, staged, skip) pure predicate; capture --insecure flag BEFORE original RunE (alongside --image and --stage). Mirrors cosiMachineConfigReader's insecure-path branch in apply. 2. Phase 2C ran after --stage upgrades — guaranteed false positive blocker since the new partition isn't activated until the next reboot and runtime.Version still reports the OLD version. Same predicate now skips on --stage, mirroring shouldRunPostApplyVerify's STAGED case for apply. 3. docs/manual-test-plan.md K2-pre block claimed Phase 2C 'until it lands', contradicting the new K1-pre section right above and the README addition. Reworked K2-pre as the manual fallback for --skip-post-upgrade-verify / --insecure flows that the gate skips by design. K1-pre's expected output is synced with the actual emit (two-hypothesis hint, waiting line, skip list). 4. cosiLinksDisksReader had a two-valued (snapshot, ok) signature that conflated transient COSI errors with auth-disallowed. Mirror machineConfigReader's three-valued (snapshot, ok, err) shape: - err != nil -> transient (apid timeout, network blip); surface underlying cause, the operator's config isn't wrong. - !ok && err==nil -> auth-disallowed / resource unreachable by design; suggest --skip-resource-validation. - ok -> success. preflightValidateResources splits the two branches into distinct error messages (no more misleading 'config is wrong' on a 2s COSI timeout). New regression test TestPreflightValidateResources_TransientErr_WrapsUnderlying pins the contract. 5. Phase 2C silent no-op when both --nodes and talosconfig context's Nodes are empty. Now prints an explanatory line so the operator sees the gate they opted into didn't actually run. Recommendations addressed: - Phase 2C 90s wait now prints a 'waiting...' line up front so the operator's terminal isn't a mystery hang. - targetImage == '' path prints a skip notice instead of silent no-op. - transportNVMe const split into transportNVMe (COSI Disk.Transport value) + selectorTypeNVMe (v1alpha1 InstallDiskSelector.type enum). They're the same string today; documenting the contract separation prevents a silent future regression. - Phase 2C error message reworded to lead with both hypotheses ('auto-rolled back OR still booting'), matching the two- hypothesis hint constant. Documentation reflow (separate, no behaviour change): user pushed back on hardwrapped prose in human-rendered .md and PR/issue bodies. Reflowed docs/manual-test-plan.md, docs/apply-safety-gates-test-plan.md, nit.md, GitHub issue #178, and PR #173 body to one-line-per-paragraph per CLAUDE.md's prose rule. Tables, code blocks, and bullet structure preserved verbatim. Recommendation NOT addressed in this pass: the parentDir-shadowing-the-package-constant nit. Pre-existing across ~6 sites; folding it into this PR adds noise. Worth a follow-up issue (kind/cleanup, area/commands, good first issue). Refs: #172, #175 Signed-off-by: Aleksei Sviridkin <f@lex.la>
Adds four apply-time safety gates to talm (umbrella #172) plus the talm init --update non-tty UX fix from #174. The Phase 2C gate also closes #175 by detecting silent Talos auto-rollback after a failed cross-vendor upgrade. # Phase 1 — declared resource existence Pre-apply walker that scans the rendered MachineConfig for declared links and disks, then asserts each one resolves against the node's live COSI snapshot. Catches typos before they hit talosctl: - LinkConfig.name, DHCPv4Config.name, DHCPv6Config.name, EthernetConfig.name — references to existing links. - BondConfig.links[], BridgeConfig.links[], VLANConfig.parent, Layer2VIPConfig.link, HCloudVIPConfig.link, machine.network. interfaces[].interface (v1.11 nested form), routes[].linkName, addresses[].linkName. - machine.install.disk (literal path) and machine.install.disk Selector (size/model/serial/wwid/type/busPath filter). - UserVolumeConfig.provisioning.diskSelector (v1.12) and machine.disks[].device (v1.11). Selector semantics mirror Talos's v1alpha1_provider.go disk resolution: humanize.ParseBytes for size, virtual / readonly / CDROM exclusion, by-id / by-path stable symlinks accepted. The type-selector enum (nvme / sd / hdd / ssd) maps to (Transport, Rotational) per Talos; case handling is intentionally permissive (predicateType lowercases) — Phase 1 catches declared-resource mismatches, not Talos's input-format strictness. Bond / Bridge / VLAN / Wireguard / Dummy / LinkAlias config kinds are intentionally excluded from name-emission: their .name describes a NEW resource being created, not a reference to an existing one. The walker's multidocHandlers dispatch table distinguishes the two classes explicitly. Walker is YAML-only — no Talos machinery types in its surface — so contract tests in pkg/applycheck/refs_test.go run without standing up a live cluster. Real Talos YAML keys (BridgeConfig. links not ports, VLANConfig.parent not link) are pinned to test so a future Talos schema rename surfaces as a red test. Skip via --skip-resource-validation. Default on. Hint candidate lists are capped at 10 entries with a "... and N more" suffix so cozystack hosts with many dm/drbd/loop devices don't drown the actual finding line. # Phase 2A — pre-apply drift preview Reads the node's current MachineConfig via COSI and emits a +/-/~/= diff against the rendered config that's about to be applied. Informational — never blocks. Runs on every apply mode including --dry-run. Diff is doc-structural, keyed by (Kind, Name) for v1.12 multi-doc and synthetic MachineConfig{} for v1.11 root. Re-serialization order is irrelevant (TestDiff_KeyOrderIrrelevant). Slice values are surfaced as multiset add/remove buckets ("removed [127.0.0.1]" for certSANs duplicate-cleanup) rather than full-slice dumps; map / nested values render in YAML flow style ({k: v, k2: v2}) rather than Go's map[k:v k2:v2]. Empty nested maps emit as their own leaf so an absent / present transition surfaces visibly. Duplicate (Kind, Name) in a single doc stream is a config defect (Talos's behaviour on duplicates is unspecified) — parseDocs surfaces it as an error so the operator sees the real problem instead of a misleading drift line. Multi-node output is per-node-distinguished: the header includes "(node X.Y.Z.W)" when nodeID is non-empty so an operator scanning stderr from a multi-node apply can tell whose diff is whose. The auth-template path threads the node through cosiPreflightContext, which downshifts client.WithNodes (plural, single-element — needed for template lookup round-tripping) to client.WithNode (singular — needed because apid's COSI router rejects the plural metadata key); multi-element ctx is refused up front. On insecure / maintenance connection the MachineConfig resource is Sensitive and unreachable, so the gate degrades gracefully with a one-line "drift verification unavailable on maintenance connection" warning. Skip via --skip-drift-preview. Default on. # Phase 2B — post-apply state verification After ApplyConfiguration returns success, re-reads the on-node MachineConfig and structurally compares against the bytes that were sent. Divergence blocks with a per-document diff, catching silent doc drops (Talos parser ignored an unknown field) and controller reverts. Mode skips: - REBOOT / STAGED / TRY — verify cannot reach a stable ActiveID; pinned by shouldRunPostApplyVerify matrix. - AUTO — Talos's apply-server promotes AUTO to REBOOT internally when the change requires it, so the verify would race the reboot. Conservative skip; operators wanting verify on no-reboot AUTO applies pass --mode=no-reboot explicitly. - --dry-run — no apply happened. Reader err != nil is a hard blocker (the gate exists to catch silent rollbacks; swallowing a read error would defeat its purpose). ok=false err=nil on the auth path degrades to a maintenance-connection warning. Skip via --skip-post-apply-verify. Default OFF until the Talos- mutated-field allowlist lands (#172) — Talos mutates cert hashes and timestamps post-apply that surface as false-positive divergence today. # Phase 2C — post-upgrade version verify After talosctl upgrade returns success, waits 90s for the node to finish booting then reads runtime.Version via COSI and compares the running version's (Major, Minor) contract against the contract parsed from the target image tag. Cross-minor mismatch surfaces as a hint-bearing blocker with a two- hypothesis hint (silent A/B rollback after the new partition failed its boot readiness check, OR the node is slower than the reconcile window). Closes #175: cross-vendor upgrade silent no-op. The upgrade RPC acks success while Talos has actually rolled back to the previous partition (cross-vendor image, missing extensions, failed boot readiness check). The gate catches this by reading runtime.Version on the auth path post-boot. Target image is captured BEFORE talosctl's RunE — talosctl's upgrade flow can overwrite --image with the node's running install.image (no-op upgrade), which would mask the version mismatch the gate exists to catch. Mode skips: - --insecure — auth-only COSI path is unreachable; hard early-return in shouldRunPostUpgradeVerify. - --stage — new partition not yet activated until reboot; runtime.Version would always report the old version. Best-effort surrender on digest-pinned images and unparseable tags (the gate can't extract a version literal). Reader err is the rollback signal — a node that hung mid-boot looks like "connection refused" from COSI; surfaces as a hint-bearing blocker (TestVerifyPostUpgradeVersion_ReaderConnectionRefused_ NotSilent). versionReader signature is three-valued (string, bool, error) to distinguish real read failures from by-design unreachable cases. Multi-node verify is collect-then-block: every node's verdict is captured (errors.Join) rather than short-circuiting on the first failure. talosctl upgrade has already executed on every node by the time this loop runs, so stopping at the first failure would hide partial rollbacks. The 90s reconcile wait is deferred until after node resolution so an operator with an empty --nodes list doesn't sit through 90s of "waiting for the node to finish booting..." just to be told there was no target node. Skip via --skip-post-upgrade-verify. Default on. # init --update non-tty handling (#174) Closes #174. Previously, talm init --update under a non-tty stdin (CI, scripted) hung on an interactive overwrite prompt that EOF'd into a cryptic error. Now: - resolveOverwritePolicy(force, isTTY) is a pure predicate returning Ask / Force / NonInteractive. Non-tty without --force emits a hint-bearing error pointing the operator at --force. - stdinIsTTY uses term.IsTerminal instead of the naive ModeCharDevice check that falsely accepted /dev/null as a tty (and produced the original EOF wedge). - init --image runs validateImageRefShape before writing, rejecting malformed --image values (e.g. "::malformed", "no-slash:tag") up front instead of leaving a broken values.yaml that fails later inside configloader. # Implementation notes Engine package is unchanged. All gate logic lives in pkg/applycheck (walker, validator, differ, selector) and pkg/commands/preflight_apply_safety.go (Phase 1/2A/2B hooks + COSI readers) plus pkg/commands/preflight_upgrade_verify.go (Phase 2C). The three readers (linksDisksReader, machineConfigReader, versionReader) share the (snapshot, ok, err) three-valued contract for test-fake symmetry, though the policy on each output combination differs per gate. cosiLinksDisksReader gives each ListAll call its own context.WithTimeout budget — a shared budget would leak the first read's elapsed time into the second's deadline and produce false transient-timeout blockers on slow nodes. multisetDiff's equality key is fmt.Sprintf("%v", item) wrapped behind canonicalKey() — fmt sorts map keys for stable output (documented in the fmt package), and the wrapper isolates the contract for a future swap. Pinned by a 100-iteration stability test. # Documentation - README's "Apply-time safety gates" section describes all four gates with their skip-flag defaults. - docs/apply-safety-gates-test-plan.md — narrow QA reference for the gates: matrix of mode × skip-flag × insecure × --dry-run for every gate, real-Talos validation steps. - docs/manual-test-plan.md — broader end-to-end QA matrix covering the whole talm CLI surface. - golangci-lint timeout bumped to 10m in pr.yml to accommodate the additional package's analysis time. Test surface includes red+green pairs for every contract-level predicate (TDD per project rule) plus integration tests for the preflight hooks via function-type fakes. Walker, differ, validator, selector each have their own contract test file. TestNoWorkflowLeakageInRepoSource is extended to ban private operator identifiers from committed source. Closes #172 Closes #174 Closes #175 Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
69c4d5d to
558f680
Compare
Summary
Implements both phases of the apply-time safety gates umbrella (#172), with the post-/branch-review polish, the bug fixes that surfaced during real-Talos validation against a live OCI v1.12.6 cluster, and the
init --updatenon-tty UX fix that the validation surfaced as a blocker.Phase 1 — declared-resource existence (block-by-default,
--skip-resource-validationopt-out): walks the rendered MachineConfig, extracts host-resource references (link names from v1.11machine.network.interfaces[]+ v1.12 multi-docLinkConfig.name,BondConfig.links[],VLANConfig.parent,BridgeConfig.links[],Layer2VIPConfig.link,HCloudVIPConfig.link,DHCPv4Config.name,DHCPv6Config.name,EthernetConfig.name; install disk literal +diskSelector;UserVolumeConfig.provisioning.diskSelector), and verifies each againstDisks.block.talos.dev+LinkStatuses.net.talos.devsnapshots read from COSI. Findings cite the JSONPath source so operators can locate the offending value directly. Virtual-creator documents (BondConfig.name,VLANConfig.name,BridgeConfig.name,WireguardConfig.name,DummyLinkConfig.name,LinkAliasConfig.name) are intentionally skipped — their.nameis the new resource being created, not an existing-link reference.Phase 2 — drift gates:
--skip-drift-previewopt-out): reads on-nodeMachineConfigs.config.talos.devand diffs against what's about to be sent, rendering+/-/~/=per(kind, name)pair. Informational only — never blocks. Runs on--dry-runtoo — that's exactly the case operators want a preview for.--skip-post-apply-verifyopt-out, default off until the Talos-mutated-field allowlist lands — see Apply-time safety gates: validate declared resources, preview & verify drift #172): re-reads on-node MachineConfig afterApplyConfigurationsuccess and structurally compares against the bytes sent. Blocks on divergence. Auto-skipped on--mode=staged/--mode=try/--mode=reboot/--dry-run(reboot kills the COSI connection mid-verify; staged/try diverge from "what was sent is what is on-node now" by design).Both Phase 2 gates degrade gracefully on the insecure / maintenance connection:
MachineConfigismeta.Sensitive, the COSI reader returnsok=falseon that path deterministically (the maintenance-vs-auth path is statically known from--insecure), and the gate prints one explanatory line (drift verification unavailable on maintenance connection) without blocking.talm init --updatenon-tty handling (closes #174): the preset-template overwrite prompt used to wedge onEOFwhen stdin wasn't a tty, silently leaving the project on a stale preset. Now resolves to one of three policies:--forceaccepts all overwrites with a per-file log line; tty + no--forcekeeps the historical interactive prompt; non-tty + no--forceerrors with a hint pointing at--force. Surfaced during real-Talos validation when the apply-safety gates appeared to misbehave because the preset on the test cluster was stale.What changed
New package
pkg/applycheck:WalkRefs— multi-doc YAML walker for v1.11 + v1.12 forms. Tolerant of malformed shapes (no panic on strings-where-maps-expected, ints-where-strings-expected, null docs, missing keys). Emits refs only for existing links the apply references, not for virtual links the apply creates (BondConfig.name/VLANConfig.name/BridgeConfig.nameare correctly skipped — they describe new resources, not pre-existing ones).ValidateRefs+matchSelector— Phase 1 validation againstHostSnapshot. Disk selectors handle model/busPath globs, exact serial/wwid/modalias/UUID, type via the(Transport, Rotational)mapping that mirrorsv1alpha1_provider.go:1325-1351(nvme→Transport=="nvme",sd→Transport=="mmc",hdd→Rotational==true,ssd→Rotational==false), size viahumanize.ParseBytes(case-insensitive SI/IEC units, comparators>= / <= / > / < / ==). Read-only, CDROM, and kernel-virtual (bus_path=/virtual— dm/drbd/loop) devices are excluded from candidacy. Stable disk symlinks (/dev/disk/by-id/...,/dev/disk/by-path/...) accepted forRefKindDiskLiteral.Diff+FilterChanged+FormatChange— Phase 2 structural differ by(kind, name)with leaf-levelFieldChangefor updates.FieldChange.HasOld/HasNewdistinguish(absent)from literalnull.reflect.DeepEqualgates structural equality, so re-serializing either side with different key ordering doesn't produce false drift. Tolerates UTF-8 BOM, CRLF line endings, no trailing newline, and only-separator documents.New preflight hooks in
pkg/commands/preflight_apply_safety.go:preflightValidateResources,previewDrift,verifyAppliedState— mirror thepreflightCheckTalosVersionshape and use function-type readers (linksDisksReader,machineConfigReader) so unit tests don't need a live Talos client.cosiLinksDisksReader,cosiMachineConfigReader— production readers backed bysafe.StateListAll/safe.StateGet. The MachineConfig reader takes aninsecureflag and short-circuits on the maintenance path rather than swallowing every COSI error as "degrade"; transient errors propagate soverifyAppliedStatesurfaces them as blockers.Wired into both apply paths in
pkg/commands/apply.go:buildApplyClosure(auth / template-rendering path).applyOneFileDirectPatchMode(direct-patch / insecure path) per-node loop.runPreApplyGates/runPostApplyGatecentralize the skip-flag plumbing.shouldRunPostApplyVerifyis the testable mode-gating predicate (staged/try always skip; dry-run skips Phase 2B but not Phase 2A).Three new flags on
talm apply:--skip-resource-validation(default false)--skip-drift-preview(default false)--skip-post-apply-verify(default true — flip to=falseto enable until the Talos-mutated-field allowlist lands)talm init --updateUX:resolveOverwritePolicy(force, isTTY) -> {Ask, Force, NonInteractive}is a pure predicate; pin-tested independently of stdin.askUserOverwriteconsults the resolved policy:Forcebypasses the prompt with a one-line log;NonInteractivereturns a hint-bearing error;Askkeeps the historical interactive prompt.stdinIsTTYusesterm.IsTerminal(stdin.Fd())instead of the naiveModeCharDevicecheck (the latter falsely accepted/dev/nullas a tty and led to the EOF wedge).Tests
TDD red-first per phase. 26 commits, structured as red/green pairs for each gate. 5 new test files; ~310 cases run when subtests are counted across
pkg/applycheckandpkg/commands. Categories:Lint clean on host and
GOOS=windows.go vet ./...clean. Race detector clean.Real-Talos validation
Run against the OCI 3-node Talos v1.12.6 reference cluster. Surfaced seven false-positives unit tests didn't catch, plus the
init --updateUX gap:bus_path=/virtualfilter.--dry-runwas skipping Phase 2A drift preview entirely — fixed (preview is read-only, drift visibility is the whole point of dry-run).namefields (BondConfig.nameetc.) as required link refs, blocking every legitimate new-virtual-link apply — fixed./dev/disk/by-id/...literals (the recommended stable form) were rejected because the walker only matchedDevPath— fixed by also acceptingDisk.Symlinks.--mode=rebootkilled the COSI connection mid-verify, surfacing as apost-apply blockeron every successful reboot apply — fixed by skipping Phase 2B on--mode=rebootsame as staged/try.BridgeConfigwalker usedports, but the real Talos v1alpha1 tag islinks(bridge.go BridgeLinks) — fixed; pre-fix Bridge slave refs were silently never validated.VLANConfigwalker usedlinkfor the parent, but the real tag isparent(vlan.go ParentLinkConfig) — fixed; pre-fix VLAN parent refs were silently never validated.talm init --updateEOFed on the preset-template prompt in non-tty contexts, leaving the project on a stale preset that didn't ship the gate-relevant helpers — fixed by the policy-driven prompt +--forceoverload.All eight pinned by regression tests. Test plan reference saved at
docs/apply-safety-gates-test-plan.mdfor future re-validation.Closes
Closes #172. Closes #174.