Skip to content

feat(apply): apply-time safety gates — declared-resource existence + drift preview/verify#173

Draft
lexfrei wants to merge 1 commit into
mainfrom
feat/apply-safety-gates
Draft

feat(apply): apply-time safety gates — declared-resource existence + drift preview/verify#173
lexfrei wants to merge 1 commit into
mainfrom
feat/apply-safety-gates

Conversation

@lexfrei
Copy link
Copy Markdown
Contributor

@lexfrei lexfrei commented May 11, 2026

Summary

Implements both phases of the apply-time safety gates umbrella (#172), with the post-/branch-review polish, the bug fixes that surfaced during real-Talos validation against a live OCI v1.12.6 cluster, and the init --update non-tty UX fix that the validation surfaced as a blocker.

Phase 1 — declared-resource existence (block-by-default, --skip-resource-validation opt-out): walks the rendered MachineConfig, extracts host-resource references (link names from v1.11 machine.network.interfaces[] + v1.12 multi-doc LinkConfig.name, BondConfig.links[], VLANConfig.parent, BridgeConfig.links[], Layer2VIPConfig.link, HCloudVIPConfig.link, DHCPv4Config.name, DHCPv6Config.name, EthernetConfig.name; install disk literal + diskSelector; UserVolumeConfig.provisioning.diskSelector), and verifies each against Disks.block.talos.dev + LinkStatuses.net.talos.dev snapshots read from COSI. Findings cite the JSONPath source so operators can locate the offending value directly. Virtual-creator documents (BondConfig.name, VLANConfig.name, BridgeConfig.name, WireguardConfig.name, DummyLinkConfig.name, LinkAliasConfig.name) are intentionally skipped — their .name is the new resource being created, not an existing-link reference.

Phase 2 — drift gates:

  • Phase 2A — pre-apply preview (--skip-drift-preview opt-out): reads on-node MachineConfigs.config.talos.dev and diffs against what's about to be sent, rendering +/-/~/= per (kind, name) pair. Informational only — never blocks. Runs on --dry-run too — that's exactly the case operators want a preview for.
  • Phase 2B — post-apply verify (--skip-post-apply-verify opt-out, default off until the Talos-mutated-field allowlist lands — see Apply-time safety gates: validate declared resources, preview & verify drift #172): re-reads on-node MachineConfig after ApplyConfiguration success and structurally compares against the bytes sent. Blocks on divergence. Auto-skipped on --mode=staged / --mode=try / --mode=reboot / --dry-run (reboot kills the COSI connection mid-verify; staged/try diverge from "what was sent is what is on-node now" by design).

Both Phase 2 gates degrade gracefully on the insecure / maintenance connection: MachineConfig is meta.Sensitive, the COSI reader returns ok=false on that path deterministically (the maintenance-vs-auth path is statically known from --insecure), and the gate prints one explanatory line (drift verification unavailable on maintenance connection) without blocking.

talm init --update non-tty handling (closes #174): the preset-template overwrite prompt used to wedge on EOF when stdin wasn't a tty, silently leaving the project on a stale preset. Now resolves to one of three policies: --force accepts all overwrites with a per-file log line; tty + no --force keeps the historical interactive prompt; non-tty + no --force errors with a hint pointing at --force. Surfaced during real-Talos validation when the apply-safety gates appeared to misbehave because the preset on the test cluster was stale.

What changed

New package pkg/applycheck:

  • WalkRefs — multi-doc YAML walker for v1.11 + v1.12 forms. Tolerant of malformed shapes (no panic on strings-where-maps-expected, ints-where-strings-expected, null docs, missing keys). Emits refs only for existing links the apply references, not for virtual links the apply creates (BondConfig.name / VLANConfig.name / BridgeConfig.name are correctly skipped — they describe new resources, not pre-existing ones).
  • ValidateRefs + matchSelector — Phase 1 validation against HostSnapshot. Disk selectors handle model/busPath globs, exact serial/wwid/modalias/UUID, type via the (Transport, Rotational) mapping that mirrors v1alpha1_provider.go:1325-1351 (nvmeTransport=="nvme", sdTransport=="mmc", hddRotational==true, ssdRotational==false), size via humanize.ParseBytes (case-insensitive SI/IEC units, comparators >= / <= / > / < / ==). Read-only, CDROM, and kernel-virtual (bus_path=/virtual — dm/drbd/loop) devices are excluded from candidacy. Stable disk symlinks (/dev/disk/by-id/..., /dev/disk/by-path/...) accepted for RefKindDiskLiteral.
  • Diff + FilterChanged + FormatChange — Phase 2 structural differ by (kind, name) with leaf-level FieldChange for updates. FieldChange.HasOld / HasNew distinguish (absent) from literal null. reflect.DeepEqual gates structural equality, so re-serializing either side with different key ordering doesn't produce false drift. Tolerates UTF-8 BOM, CRLF line endings, no trailing newline, and only-separator documents.

New preflight hooks in pkg/commands/preflight_apply_safety.go:

  • preflightValidateResources, previewDrift, verifyAppliedState — mirror the preflightCheckTalosVersion shape and use function-type readers (linksDisksReader, machineConfigReader) so unit tests don't need a live Talos client.
  • cosiLinksDisksReader, cosiMachineConfigReader — production readers backed by safe.StateListAll / safe.StateGet. The MachineConfig reader takes an insecure flag and short-circuits on the maintenance path rather than swallowing every COSI error as "degrade"; transient errors propagate so verifyAppliedState surfaces them as blockers.

Wired into both apply paths in pkg/commands/apply.go:

  • buildApplyClosure (auth / template-rendering path).
  • applyOneFileDirectPatchMode (direct-patch / insecure path) per-node loop.
  • runPreApplyGates / runPostApplyGate centralize the skip-flag plumbing.
  • shouldRunPostApplyVerify is the testable mode-gating predicate (staged/try always skip; dry-run skips Phase 2B but not Phase 2A).

Three new flags on talm apply:

  • --skip-resource-validation (default false)
  • --skip-drift-preview (default false)
  • --skip-post-apply-verify (default true — flip to =false to enable until the Talos-mutated-field allowlist lands)

talm init --update UX:

  • resolveOverwritePolicy(force, isTTY) -> {Ask, Force, NonInteractive} is a pure predicate; pin-tested independently of stdin.
  • askUserOverwrite consults the resolved policy: Force bypasses the prompt with a one-line log; NonInteractive returns a hint-bearing error; Ask keeps the historical interactive prompt.
  • stdinIsTTY uses term.IsTerminal(stdin.Fd()) instead of the naive ModeCharDevice check (the latter falsely accepted /dev/null as a tty and led to the EOF wedge).

Tests

TDD red-first per phase. 26 commits, structured as red/green pairs for each gate. 5 new test files; ~310 cases run when subtests are counted across pkg/applycheck and pkg/commands. Categories:

  • Walker contract (v1.11 nested + v1.12 multi-doc + adversarial malformed input + the virtual-link-name false-positive guard).
  • Selector evaluator (every InstallDiskSelector field shape; type-via-(Transport,Rotational); humanize size matrix; readonly/CDROM/virtual exclusions; by-id symlink acceptance).
  • Differ (add/remove/update/equal; identical inputs; FilterChanged; absent-vs-null FieldChange flags; invalid-YAML error path).
  • Preflight hooks (link missing, disk missing, reader failure, insecure-path degrade for previewDrift + verifyAppliedState, mode-gating matrix).
  • Overwrite-policy (the (force, tty) -> policy matrix; each policy's prompt/log/error behaviour against an injected stdin reader).

Lint clean on host and GOOS=windows. go vet ./... clean. Race detector clean.

Real-Talos validation

Run against the OCI 3-node Talos v1.12.6 reference cluster. Surfaced seven false-positives unit tests didn't catch, plus the init --update UX gap:

  1. Virtual disks (dm/drbd/loop) polluted Phase 1 hint output — fixed by bus_path=/virtual filter.
  2. --dry-run was skipping Phase 2A drift preview entirely — fixed (preview is read-only, drift visibility is the whole point of dry-run).
  3. Walker emitted virtual-link name fields (BondConfig.name etc.) as required link refs, blocking every legitimate new-virtual-link apply — fixed.
  4. /dev/disk/by-id/... literals (the recommended stable form) were rejected because the walker only matched DevPath — fixed by also accepting Disk.Symlinks.
  5. --mode=reboot killed the COSI connection mid-verify, surfacing as a post-apply blocker on every successful reboot apply — fixed by skipping Phase 2B on --mode=reboot same as staged/try.
  6. BridgeConfig walker used ports, but the real Talos v1alpha1 tag is links (bridge.go BridgeLinks) — fixed; pre-fix Bridge slave refs were silently never validated.
  7. VLANConfig walker used link for the parent, but the real tag is parent (vlan.go ParentLinkConfig) — fixed; pre-fix VLAN parent refs were silently never validated.
  8. talm init --update EOFed on the preset-template prompt in non-tty contexts, leaving the project on a stale preset that didn't ship the gate-relevant helpers — fixed by the policy-driven prompt + --force overload.

All eight pinned by regression tests. Test plan reference saved at docs/apply-safety-gates-test-plan.md for future re-validation.

Closes

Closes #172. Closes #174.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 11, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9ac7706a-3068-46ef-94e4-a680562f1e2b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/apply-safety-gates

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the applycheck package to implement safety gates for Talos MachineConfig applications, including host-resource reference validation and structural diffing. These checks are integrated into the apply command with new flags to control resource validation, drift previews, and post-apply verification. A review comment identifies an improvement opportunity in error handling within cosiMachineConfigReader to prevent transient errors from being incorrectly reported as maintenance connection issues.

Comment on lines +304 to +313
if err != nil {
// PermissionDenied on the insecure path is the documented
// Sensitive-resource degrade signal. Distinguishing transient
// failures from auth-side denial would require parsing the
// gRPC status; for the pre-merge cut we treat any read
// failure as "could not read" and let the caller decide
// whether to surface that as a blocker (Phase 2B never does;
// Phase 2A never does either).
return nil, false, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation swallows the error from safe.StateGet. This can lead to misleading output in previewDrift and verifyAppliedState, where a transient network or API error is reported as a maintenance connection issue. It's better to return the error so the caller can provide accurate feedback.

Suggested change
if err != nil {
// PermissionDenied on the insecure path is the documented
// Sensitive-resource degrade signal. Distinguishing transient
// failures from auth-side denial would require parsing the
// gRPC status; for the pre-merge cut we treat any read
// failure as "could not read" and let the caller decide
// whether to surface that as a blocker (Phase 2B never does;
// Phase 2A never does either).
return nil, false, nil
}
if err != nil {
// PermissionDenied on the insecure path is the documented
// Sensitive-resource degrade signal. Distinguishing transient
// failures from auth-side denial would require parsing the
// gRPC status; for the pre-merge cut we treat any read
// failure as "could not read" and let the caller decide
// whether to surface that as a blocker (Phase 2B never does;
// Phase 2A never does either).
return nil, false, err
}

lexfrei added a commit that referenced this pull request May 11, 2026
Capture the matrix of cases the gates were validated against during
PR #173 development: Phase 1 link / disk / selector inputs, Phase 2A
diff classification, Phase 2B mode gating, real-Talos validation
steps against the dev17 reference cluster. Documents the known
limitations (upgrade has no gates, insecure path can't render with
discovery, --init --update non-tty UX gap).

Future-self reference for re-validation after touching applycheck
or the preflight hooks.

Refs: #172
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added a commit that referenced this pull request May 11, 2026
Adds docs/manual-test-plan.md, a checklist for exercising talm
against a real Talos cluster. Complementary to the narrow
apply-safety-gates plan: it walks the whole CLI surface (init,
template, apply, upgrade, rotate-ca, meta, reset, completion,
read-only diagnostics) with expected outcomes and the failure
modes to watch for.

Each section captures the exact commands run during the dev17
adversarial sweep that produced the eight bugs PR #173 fixes,
plus the destructive sanity-check block to run after reset /
rotate-ca / mode=reboot to confirm cluster health.

Extends naturally — new sections drop in alphabetically (K, L, ...).

Also adds nit.md to .gitignore so the in-flight notes file
doesn't get accidentally committed.

Refs: #172
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added a commit that referenced this pull request May 12, 2026
…kip + linksDisksReader three-valued + md reflow

Blockers from branch-review pass 2 (#173):

1. Phase 2C ran WithClient even on --insecure upgrade. The
   maintenance / pre-auth connection cannot reach the auth-only
   COSI path; pre-fix the gate either silently surrendered with a
   'version unreadable' line or, worse, connected to an unrelated
   node from the talosconfig context and verified its state.
   Extract shouldRunPostUpgradeVerify(insecure, staged, skip)
   pure predicate; capture --insecure flag BEFORE original RunE
   (alongside --image and --stage). Mirrors
   cosiMachineConfigReader's insecure-path branch in apply.

2. Phase 2C ran after --stage upgrades — guaranteed false positive
   blocker since the new partition isn't activated until the next
   reboot and runtime.Version still reports the OLD version. Same
   predicate now skips on --stage, mirroring
   shouldRunPostApplyVerify's STAGED case for apply.

3. docs/manual-test-plan.md K2-pre block claimed Phase 2C 'until
   it lands', contradicting the new K1-pre section right above and
   the README addition. Reworked K2-pre as the manual fallback for
   --skip-post-upgrade-verify / --insecure flows that the gate
   skips by design. K1-pre's expected output is synced with the
   actual emit (two-hypothesis hint, waiting line, skip list).

4. cosiLinksDisksReader had a two-valued (snapshot, ok) signature
   that conflated transient COSI errors with auth-disallowed.
   Mirror machineConfigReader's three-valued
   (snapshot, ok, err) shape:

     - err != nil  -> transient (apid timeout, network blip);
       surface underlying cause, the operator's config isn't
       wrong.
     - !ok && err==nil -> auth-disallowed / resource unreachable
       by design; suggest --skip-resource-validation.
     - ok           -> success.

   preflightValidateResources splits the two branches into
   distinct error messages (no more misleading 'config is wrong'
   on a 2s COSI timeout). New regression test
   TestPreflightValidateResources_TransientErr_WrapsUnderlying
   pins the contract.

5. Phase 2C silent no-op when both --nodes and talosconfig
   context's Nodes are empty. Now prints an explanatory line so
   the operator sees the gate they opted into didn't actually
   run.

Recommendations addressed:

- Phase 2C 90s wait now prints a 'waiting...' line up front so
  the operator's terminal isn't a mystery hang.
- targetImage == '' path prints a skip notice instead of silent
  no-op.
- transportNVMe const split into transportNVMe (COSI Disk.Transport
  value) + selectorTypeNVMe (v1alpha1 InstallDiskSelector.type
  enum). They're the same string today; documenting the contract
  separation prevents a silent future regression.
- Phase 2C error message reworded to lead with both hypotheses
  ('auto-rolled back OR still booting'), matching the two-
  hypothesis hint constant.

Documentation reflow (separate, no behaviour change): user pushed
back on hardwrapped prose in human-rendered .md and PR/issue
bodies. Reflowed docs/manual-test-plan.md,
docs/apply-safety-gates-test-plan.md, nit.md, GitHub issue #178,
and PR #173 body to one-line-per-paragraph per CLAUDE.md's prose
rule. Tables, code blocks, and bullet structure preserved
verbatim.

Recommendation NOT addressed in this pass: the
parentDir-shadowing-the-package-constant nit. Pre-existing across
~6 sites; folding it into this PR adds noise. Worth a follow-up
issue (kind/cleanup, area/commands, good first issue).

Refs: #172, #175
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Adds four apply-time safety gates to talm (umbrella #172) plus the
talm init --update non-tty UX fix from #174. The Phase 2C gate also
closes #175 by detecting silent Talos auto-rollback after a failed
cross-vendor upgrade.

# Phase 1 — declared resource existence

Pre-apply walker that scans the rendered MachineConfig for declared
links and disks, then asserts each one resolves against the node's
live COSI snapshot. Catches typos before they hit talosctl:

  - LinkConfig.name, DHCPv4Config.name, DHCPv6Config.name,
    EthernetConfig.name — references to existing links.
  - BondConfig.links[], BridgeConfig.links[], VLANConfig.parent,
    Layer2VIPConfig.link, HCloudVIPConfig.link, machine.network.
    interfaces[].interface (v1.11 nested form), routes[].linkName,
    addresses[].linkName.
  - machine.install.disk (literal path) and machine.install.disk
    Selector (size/model/serial/wwid/type/busPath filter).
  - UserVolumeConfig.provisioning.diskSelector (v1.12) and
    machine.disks[].device (v1.11).

Selector semantics mirror Talos's v1alpha1_provider.go disk
resolution: humanize.ParseBytes for size, virtual / readonly /
CDROM exclusion, by-id / by-path stable symlinks accepted. The
type-selector enum (nvme / sd / hdd / ssd) maps to (Transport,
Rotational) per Talos; case handling is intentionally permissive
(predicateType lowercases) — Phase 1 catches declared-resource
mismatches, not Talos's input-format strictness.

Bond / Bridge / VLAN / Wireguard / Dummy / LinkAlias config kinds
are intentionally excluded from name-emission: their .name
describes a NEW resource being created, not a reference to an
existing one. The walker's multidocHandlers dispatch table
distinguishes the two classes explicitly.

Walker is YAML-only — no Talos machinery types in its surface —
so contract tests in pkg/applycheck/refs_test.go run without
standing up a live cluster. Real Talos YAML keys (BridgeConfig.
links not ports, VLANConfig.parent not link) are pinned to test
so a future Talos schema rename surfaces as a red test.

Skip via --skip-resource-validation. Default on.

Hint candidate lists are capped at 10 entries with a "... and N
more" suffix so cozystack hosts with many dm/drbd/loop devices
don't drown the actual finding line.

# Phase 2A — pre-apply drift preview

Reads the node's current MachineConfig via COSI and emits a
+/-/~/= diff against the rendered config that's about to be
applied. Informational — never blocks. Runs on every apply mode
including --dry-run.

Diff is doc-structural, keyed by (Kind, Name) for v1.12 multi-doc
and synthetic MachineConfig{} for v1.11 root. Re-serialization
order is irrelevant (TestDiff_KeyOrderIrrelevant). Slice values
are surfaced as multiset add/remove buckets ("removed [127.0.0.1]"
for certSANs duplicate-cleanup) rather than full-slice dumps;
map / nested values render in YAML flow style ({k: v, k2: v2})
rather than Go's map[k:v k2:v2]. Empty nested maps emit as their
own leaf so an absent / present transition surfaces visibly.

Duplicate (Kind, Name) in a single doc stream is a config defect
(Talos's behaviour on duplicates is unspecified) — parseDocs
surfaces it as an error so the operator sees the real problem
instead of a misleading drift line.

Multi-node output is per-node-distinguished: the header includes
"(node X.Y.Z.W)" when nodeID is non-empty so an operator
scanning stderr from a multi-node apply can tell whose diff is
whose. The auth-template path threads the node through
cosiPreflightContext, which downshifts client.WithNodes (plural,
single-element — needed for template lookup round-tripping) to
client.WithNode (singular — needed because apid's COSI router
rejects the plural metadata key); multi-element ctx is refused
up front.

On insecure / maintenance connection the MachineConfig resource
is Sensitive and unreachable, so the gate degrades gracefully
with a one-line "drift verification unavailable on maintenance
connection" warning.

Skip via --skip-drift-preview. Default on.

# Phase 2B — post-apply state verification

After ApplyConfiguration returns success, re-reads the on-node
MachineConfig and structurally compares against the bytes that
were sent. Divergence blocks with a per-document diff, catching
silent doc drops (Talos parser ignored an unknown field) and
controller reverts.

Mode skips:
  - REBOOT / STAGED / TRY — verify cannot reach a stable
    ActiveID; pinned by shouldRunPostApplyVerify matrix.
  - AUTO — Talos's apply-server promotes AUTO to REBOOT
    internally when the change requires it, so the verify would
    race the reboot. Conservative skip; operators wanting verify
    on no-reboot AUTO applies pass --mode=no-reboot explicitly.
  - --dry-run — no apply happened.

Reader err != nil is a hard blocker (the gate exists to catch
silent rollbacks; swallowing a read error would defeat its
purpose). ok=false err=nil on the auth path degrades to a
maintenance-connection warning.

Skip via --skip-post-apply-verify. Default OFF until the Talos-
mutated-field allowlist lands (#172) — Talos mutates cert hashes
and timestamps post-apply that surface as false-positive
divergence today.

# Phase 2C — post-upgrade version verify

After talosctl upgrade returns success, waits 90s for the node
to finish booting then reads runtime.Version via COSI and
compares the running version's (Major, Minor) contract against
the contract parsed from the target image tag. Cross-minor
mismatch surfaces as a hint-bearing blocker with a two-
hypothesis hint (silent A/B rollback after the new partition
failed its boot readiness check, OR the node is slower than the
reconcile window).

Closes #175: cross-vendor upgrade silent no-op. The upgrade RPC
acks success while Talos has actually rolled back to the
previous partition (cross-vendor image, missing extensions,
failed boot readiness check). The gate catches this by reading
runtime.Version on the auth path post-boot.

Target image is captured BEFORE talosctl's RunE — talosctl's
upgrade flow can overwrite --image with the node's running
install.image (no-op upgrade), which would mask the version
mismatch the gate exists to catch.

Mode skips:
  - --insecure — auth-only COSI path is unreachable; hard
    early-return in shouldRunPostUpgradeVerify.
  - --stage — new partition not yet activated until reboot;
    runtime.Version would always report the old version.

Best-effort surrender on digest-pinned images and unparseable
tags (the gate can't extract a version literal). Reader err is
the rollback signal — a node that hung mid-boot looks like
"connection refused" from COSI; surfaces as a hint-bearing
blocker (TestVerifyPostUpgradeVersion_ReaderConnectionRefused_
NotSilent). versionReader signature is three-valued (string,
bool, error) to distinguish real read failures from by-design
unreachable cases.

Multi-node verify is collect-then-block: every node's verdict
is captured (errors.Join) rather than short-circuiting on the
first failure. talosctl upgrade has already executed on every
node by the time this loop runs, so stopping at the first
failure would hide partial rollbacks.

The 90s reconcile wait is deferred until after node resolution
so an operator with an empty --nodes list doesn't sit through
90s of "waiting for the node to finish booting..." just to be
told there was no target node.

Skip via --skip-post-upgrade-verify. Default on.

# init --update non-tty handling (#174)

Closes #174. Previously, talm init --update under a non-tty
stdin (CI, scripted) hung on an interactive overwrite prompt
that EOF'd into a cryptic error. Now:

  - resolveOverwritePolicy(force, isTTY) is a pure predicate
    returning Ask / Force / NonInteractive. Non-tty without
    --force emits a hint-bearing error pointing the operator
    at --force.
  - stdinIsTTY uses term.IsTerminal instead of the naive
    ModeCharDevice check that falsely accepted /dev/null as a
    tty (and produced the original EOF wedge).
  - init --image runs validateImageRefShape before writing,
    rejecting malformed --image values (e.g. "::malformed",
    "no-slash:tag") up front instead of leaving a broken
    values.yaml that fails later inside configloader.

# Implementation notes

Engine package is unchanged. All gate logic lives in
pkg/applycheck (walker, validator, differ, selector) and
pkg/commands/preflight_apply_safety.go (Phase 1/2A/2B hooks +
COSI readers) plus pkg/commands/preflight_upgrade_verify.go
(Phase 2C).

The three readers (linksDisksReader, machineConfigReader,
versionReader) share the (snapshot, ok, err) three-valued
contract for test-fake symmetry, though the policy on each
output combination differs per gate.

cosiLinksDisksReader gives each ListAll call its own
context.WithTimeout budget — a shared budget would leak the
first read's elapsed time into the second's deadline and
produce false transient-timeout blockers on slow nodes.

multisetDiff's equality key is fmt.Sprintf("%v", item) wrapped
behind canonicalKey() — fmt sorts map keys for stable output
(documented in the fmt package), and the wrapper isolates the
contract for a future swap. Pinned by a 100-iteration
stability test.

# Documentation

  - README's "Apply-time safety gates" section describes all
    four gates with their skip-flag defaults.
  - docs/apply-safety-gates-test-plan.md — narrow QA reference
    for the gates: matrix of mode × skip-flag × insecure ×
    --dry-run for every gate, real-Talos validation steps.
  - docs/manual-test-plan.md — broader end-to-end QA matrix
    covering the whole talm CLI surface.
  - golangci-lint timeout bumped to 10m in pr.yml to
    accommodate the additional package's analysis time.

Test surface includes red+green pairs for every contract-level
predicate (TDD per project rule) plus integration tests for the
preflight hooks via function-type fakes. Walker, differ,
validator, selector each have their own contract test file.
TestNoWorkflowLeakageInRepoSource is extended to ban private
operator identifiers from committed source.

Closes #172
Closes #174
Closes #175

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei lexfrei force-pushed the feat/apply-safety-gates branch from 69c4d5d to 558f680 Compare May 12, 2026 03:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant