|
| 1 | +# Apply-time safety gates: test plan |
| 2 | + |
| 3 | +A reference checklist for validating changes to the apply-time safety gates introduced in #172 / PR #173. Covers the contract tests that ship with the package plus the manual real-Talos validation steps that surface issues unit tests cannot. |
| 4 | + |
| 5 | +## Build under test |
| 6 | + |
| 7 | +```bash |
| 8 | +cd ~/git/github.com/cozystack/talm && go build -o /tmp/talm-safety ./ |
| 9 | +``` |
| 10 | + |
| 11 | +Run all matrix cells against the binary at `/tmp/talm-safety`. Use the dev17 reference cluster (`~/git/github.com/lexfrei/work/env/dev17/talm/`, OCI 3-node Talos v1.12.6) for live runs. |
| 12 | + |
| 13 | +## Phase 1 — declared-resource existence |
| 14 | + |
| 15 | +### Link references |
| 16 | + |
| 17 | +| Case | How to trigger | Expected | |
| 18 | +| --- | --- | --- | |
| 19 | +| Typoed `LinkConfig.name` | Add `LinkConfig{name: eth9999}` to a node body | `[blocker] declared link "eth9999" not found …` plus available-links hint | |
| 20 | +| Typoed bond slave | Add `BondConfig{name: bond0, links: [ghost0, ens5]}` | Blocker on `ghost0` only; `bond0` (new bond) NOT flagged | |
| 21 | +| Typoed VLAN parent | Add `VLANConfig{name: ens5.99, link: ghost0, vlanID: 99}` | Blocker on `link: ghost0`; `name: ens5.99` not flagged (new VLAN) | |
| 22 | +| Typoed bridge port | Add `BridgeConfig{name: br99, ports: [ghost0]}` | Blocker on `ghost0` only; `br99` not flagged | |
| 23 | +| Typoed Layer2VIP link | Set `vipLink: ghost0` in values | Blocker on `link: ghost0` | |
| 24 | +| Legacy v1.11 interface | `machine.network.interfaces[].interface: eth9999` | Blocker; same hint shape | |
| 25 | + |
| 26 | +### Disk references |
| 27 | + |
| 28 | +| Case | How to trigger | Expected | |
| 29 | +| --- | --- | --- | |
| 30 | +| Bad literal disk | Set `machine.install.disk: /dev/sdz` | Blocker, hint lists real disks (sda, sdb) — **must omit** virtual class (dm-*, drbd*, loop*) | |
| 31 | +| Bad model selector | `diskSelector: {model: Samsumg}` | Blocker "matches zero disks", hint lists candidate disks with size | |
| 32 | +| Impossible size | `diskSelector: {size: ">= 99TB"}` | Blocker "matches zero disks" | |
| 33 | +| Lowercase units | `diskSelector: {size: ">= 100gb"}` | Matches as if `>= 100GB` (`humanize.ParseBytes` is case-insensitive) |
| 34 | +| Mixed case + spaces | `diskSelector: {size: "<= 200000MiB"}` | Parsed correctly | |
| 35 | +| Multiple matches | `diskSelector: {type: ssd}` on host with several SSDs | Warning (not blocker) "matches multiple disks; install picks the first match" | |
| 36 | +| Type semantics | `type: nvme/sd/hdd/ssd` per-disk Transport+Rotational | Mirror Talos `v1alpha1_provider.go:1325-1351` mapping | |
| 37 | +| Readonly excluded | Selector + a readonly disk on host | Readonly disk not counted as match | |
| 38 | +| CDROM excluded | Selector + a CD drive on host | CD not counted as match | |
| 39 | +| Virtual excluded | Selector on cozystack host with many dm/drbd/loop | dm/drbd/loop not counted; hint omits them | |
| 40 | + |
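The exclusion rows above (readonly, CD-ROM, virtual) can be read as one candidate filter applied before selector matching and before the hint is built. The following is a minimal sketch under stated assumptions: the `Disk` struct and the `isCandidate` / `candidates` helpers are hypothetical names, not talm or Talos APIs, and the prefix list only restates the `dm-*`, `drbd*`, `loop*` classes named in the table.

```go
package main

import (
	"fmt"
	"strings"
)

// Disk is a hypothetical, trimmed-down view of a discovered block device.
type Disk struct {
	DevPath  string // e.g. "/dev/sda"
	Readonly bool
	CDROM    bool
}

// virtualPrefixes restates the virtual classes from the table (assumption:
// matching on the device name prefix is sufficient for this illustration).
var virtualPrefixes = []string{"dm-", "drbd", "loop"}

// isCandidate reports whether a disk may count as a selector match
// or appear in the available-disks hint.
func isCandidate(d Disk) bool {
	if d.Readonly || d.CDROM {
		return false
	}
	name := strings.TrimPrefix(d.DevPath, "/dev/")
	for _, p := range virtualPrefixes {
		if strings.HasPrefix(name, p) {
			return false
		}
	}
	return true
}

// candidates keeps only the disks that pass isCandidate, preserving order.
func candidates(disks []Disk) []string {
	var out []string
	for _, d := range disks {
		if isCandidate(d) {
			out = append(out, d.DevPath)
		}
	}
	return out
}

func main() {
	disks := []Disk{
		{DevPath: "/dev/sda"},
		{DevPath: "/dev/sdb"},
		{DevPath: "/dev/dm-0"},
		{DevPath: "/dev/drbd1000"},
		{DevPath: "/dev/loop3"},
		{DevPath: "/dev/sr0", CDROM: true},
		{DevPath: "/dev/sdc", Readonly: true},
	}
	fmt.Println(candidates(disks)) // prints [/dev/sda /dev/sdb]
}
```

Filtering once, ahead of both matching and hint rendering, keeps the two in agreement: a disk that cannot be a match never shows up as a suggestion either.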
| 41 | +### Opt-out |
| 42 | + |
| 43 | +| Case | Trigger | Expected | |
| 44 | +| --- | --- | --- | |
| 45 | +| `--skip-resource-validation` | Pass with bad selector + bad link | No Phase 1 output; render proceeds | |
| 46 | + |
| 47 | +## Phase 2A — pre-apply drift preview |
| 48 | + |
| 49 | +### Diff classification |
| 50 | + |
| 51 | +| Case | Trigger | Expected | |
| 52 | +| --- | --- | --- | |
| 53 | +| Identical desired/on-node | First-run apply after the same render | `0 addition, 0 removal, 0 update, N unchanged.` | |
| 54 | +| Removed doc | Apply config that drops a previously-emitted doc (e.g. dropping a LinkConfig that was on-node) | `- LinkConfig{name: …}` line | |
| 55 | +| Added doc | Apply config that adds a fresh doc | `+ LinkConfig{name: …}` line | |
| 56 | +| Updated leaf | Change one nested field (e.g. `clusterDomain`) | `~ MachineConfig` plus `cluster.network.dnsDomain: cozy.local -> cozy.example` | |
| 57 | +| Identical inputs include Equal entries | Verified via Diff API; OpEqual entries returned, FilterChanged drops them | — | |
| 58 | +| Distinguish absent vs null | YAML `extraField: null` added to one side | FieldChange.HasOld=false / HasNew=true; formatter renders `(absent) -> <nil>` | |
| 59 | +| Stable ordering | Re-run on same inputs | Identical output bytes | |
| 60 | + |
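The "absent vs null" row depends on tracking key presence separately from the value, since a YAML `null` and a missing key both decode to a nil value. The sketch below illustrates that idea with Go's comma-ok map lookup. The `HasOld` / `HasNew` field names come from the table; the rest of the `FieldChange` shape and the `compareKey`, `render`, and `formatChange` helpers are assumptions made for illustration, not the package's actual API.

```go
package main

import "fmt"

// FieldChange carries both values plus presence flags. Only HasOld/HasNew
// are named in the test plan; the other fields are assumed for this sketch.
type FieldChange struct {
	Path   string
	Old    any
	New    any
	HasOld bool
	HasNew bool
}

// compareKey looks one key up on both sides. The comma-ok form separates
// "key missing" from "key present with a nil value".
func compareKey(path string, oldDoc, newDoc map[string]any) FieldChange {
	oldVal, hasOld := oldDoc[path]
	newVal, hasNew := newDoc[path]
	return FieldChange{Path: path, Old: oldVal, New: newVal, HasOld: hasOld, HasNew: hasNew}
}

// render prints "(absent)" for a missing key and the %v form otherwise;
// fmt renders a nil interface value as "<nil>".
func render(v any, present bool) string {
	if !present {
		return "(absent)"
	}
	return fmt.Sprintf("%v", v)
}

// formatChange produces a one-line "path: old -> new" description.
func formatChange(c FieldChange) string {
	return fmt.Sprintf("%s: %s -> %s", c.Path, render(c.Old, c.HasOld), render(c.New, c.HasNew))
}

func main() {
	oldDoc := map[string]any{}                  // key not present
	newDoc := map[string]any{"extraField": nil} // key present, value null
	fmt.Println(formatChange(compareKey("extraField", oldDoc, newDoc)))
	// prints: extraField: (absent) -> <nil>
}
```

Comparing only the values would report no change here, because both sides read back as nil; the presence flags are what make the addition visible.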
| 61 | +### Path / mode interactions |
| 62 | + |
| 63 | +| Case | Trigger | Expected | |
| 64 | +| --- | --- | --- | |
| 65 | +| Dry-run shows preview | `talm apply --dry-run -f node.yaml` | Phase 2A runs; this is the "show me what would change" contract | |
| 66 | +| Insecure path | `talm apply -i -f node.yaml` (where chart can render offline) | `talm: drift verification unavailable on maintenance connection`; no block | |
| 67 | +| `--skip-drift-preview` | Pass with any change | Preview suppressed entirely | |
| 68 | + |
| 69 | +## Phase 2B — post-apply state verification |
| 70 | + |
| 71 | +Phase 2B is off by default until the Talos-mutated-field allowlist lands. Enable it explicitly with `--skip-post-apply-verify=false`.
| 72 | + |
| 73 | +| Case | Trigger | Expected | |
| 74 | +| --- | --- | --- | |
| 75 | +| Clean apply | Apply config matching on-node, `--skip-post-apply-verify=false` | Silent success (no output, no error) | |
| 76 | +| Mode=staged | `--mode=staged --skip-post-apply-verify=false` | Phase 2B skipped (staged store doesn't change ActiveID) | |
| 77 | +| Mode=try | `--mode=try --skip-post-apply-verify=false` | Phase 2B skipped (rollback timer races verify) | |
| 78 | +| Mode=no-reboot/reboot/auto | Real apply with verify enabled | Phase 2B runs | |
| 79 | +| Dry-run | `--dry-run --skip-post-apply-verify=false` | Phase 2B skipped (no real apply) | |
| 80 | +| Reader error | Simulated COSI hiccup on auth path | Hint-bearing blocker `post-apply: re-reading on-node MachineConfig`, exit non-zero (the gate is here to catch silent rollbacks — error is not swallowed) | |
| 81 | +| Insecure path | `talm apply -i --skip-post-apply-verify=false` | `drift verification unavailable on maintenance connection` line; no block | |
| 82 | + |
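The skip conditions in the table above amount to a small decision function. This is a sketch only: `VerifyInput` and `verifyDecision` are hypothetical names, the order in which the conditions are checked is an assumption, and the reason strings paraphrase the table rather than quote talm output (except the maintenance-connection message, which is copied from the table).

```go
package main

import "fmt"

// VerifyInput gathers the inputs the table varies. All names are hypothetical.
type VerifyInput struct {
	Mode       string // "auto", "no-reboot", "reboot", "staged" or "try"
	DryRun     bool
	Insecure   bool
	SkipVerify bool // value of --skip-post-apply-verify
}

// verifyDecision reports whether post-apply verification should run and,
// if not, why it is skipped. The check order is an assumption.
func verifyDecision(in VerifyInput) (bool, string) {
	switch {
	case in.SkipVerify:
		return false, "disabled by flag"
	case in.DryRun:
		return false, "dry-run: no real apply"
	case in.Insecure:
		return false, "drift verification unavailable on maintenance connection"
	case in.Mode == "staged":
		return false, "staged config is not active yet"
	case in.Mode == "try":
		return false, "rollback timer races verify"
	}
	return true, ""
}

func main() {
	cases := []VerifyInput{
		{Mode: "no-reboot"},
		{Mode: "staged"},
		{Mode: "try"},
		{Mode: "auto", DryRun: true},
		{Mode: "auto", Insecure: true},
	}
	for _, c := range cases {
		run, reason := verifyDecision(c)
		fmt.Printf("mode=%s dryRun=%t insecure=%t -> run=%t %s\n",
			c.Mode, c.DryRun, c.Insecure, run, reason)
	}
}
```

Returning a reason alongside the boolean makes each skipped row of the matrix assertable in a contract test without parsing log output.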
| 83 | +## Real-Talos validation (dev17) |
| 84 | + |
| 85 | +Before requesting human review, exercise the gates against a live Talos node. |
| 86 | + |
| 87 | +### Setup |
| 88 | + |
| 89 | +```bash |
| 90 | +cd ~/git/github.com/lexfrei/work/env/dev17/talm |
| 91 | +``` |
| 92 | + |
| 93 | +dev17 carries an OCI 3-node Talos v1.12.6 cluster (158.101.113.227 / 129.158.237.173 / 157.151.143.81). The vendored talm library in `charts/talm/` may need `talm init --update --preset cozystack` to pick up new helpers. Preset templates (`templates/_helpers.tpl`) require interactive confirmation and are not updated when the command runs without a tty (known gap, see #174).
| 94 | + |
| 95 | +### Sanity check |
| 96 | + |
| 97 | +```bash |
| 98 | +/tmp/talm-safety template -f nodes/node0.yaml > /tmp/rendered.yaml |
| 99 | +test -s /tmp/rendered.yaml || echo "render failed" |
| 100 | +``` |
| 101 | + |
| 102 | +### Phase 1 (auth path) |
| 103 | + |
| 104 | +```bash |
| 105 | +# Clean run — should silently pass: |
| 106 | +/tmp/talm-safety apply --dry-run -f nodes/node0.yaml |
| 107 | + |
| 108 | +# Inject a bad link ref (cp + edit a temp file inside the talm project): |
| 109 | +cp nodes/node0.yaml nodes/_test-bad.yaml |
| 110 | +echo -e "---\napiVersion: v1alpha1\nkind: LinkConfig\nname: eth9999" >> nodes/_test-bad.yaml |
| 111 | +/tmp/talm-safety apply --dry-run -f nodes/_test-bad.yaml # expect [blocker] |
| 112 | +rm nodes/_test-bad.yaml |
| 113 | +``` |
| 114 | + |
| 115 | +### Phase 2A (drift preview) |
| 116 | + |
| 117 | +```bash |
| 118 | +# Dry-run against a clean cluster; expect "0 addition, 0 removal, 0 update, N unchanged.":
| 119 | +/tmp/talm-safety apply --dry-run -f nodes/node0.yaml |
| 120 | + |
| 121 | +# Force a leaf change via values.yaml (back up then revert): |
| 122 | +sed -i.bak 's/^clusterDomain: .*/clusterDomain: cozy.example/' values.yaml |
| 123 | +/tmp/talm-safety apply --dry-run -f nodes/node0.yaml | grep -E "^ [-+~=]|^ "
| 124 | +mv values.yaml.bak values.yaml |
| 125 | +``` |
| 126 | + |
| 127 | +### Phase 2B (real apply with verify enabled) |
| 128 | + |
| 129 | +```bash |
| 130 | +/tmp/talm-safety apply --mode=no-reboot --skip-post-apply-verify=false -f nodes/node0.yaml |
| 131 | +# Expected: drift preview + 'Applied configuration without a reboot' + silent post-apply verify |
| 132 | +``` |
| 133 | + |
| 134 | +### Multi-node + mix |
| 135 | + |
| 136 | +```bash |
| 137 | +/tmp/talm-safety apply --dry-run -f nodes/node0.yaml -f nodes/node1.yaml -f nodes/node2.yaml |
| 138 | +# Each node renders its own preview; per-node independence. |
| 139 | +``` |
| 140 | + |
| 141 | +### Insecure path |
| 142 | + |
| 143 | +`talm apply -i` exercises the maintenance connection. On dev17 the chart uses live discovery (`lookup "disks"`), which fails over an insecure connection because COSI reads require authentication. The render therefore errors before any gate runs; this is existing talm behaviour, not a regression.
| 144 | + |
| 145 | +## Implementation health |
| 146 | + |
| 147 | +Run as part of every push: |
| 148 | + |
| 149 | +```bash |
| 150 | +go test ./... |
| 151 | +go test -race ./pkg/applycheck/... ./pkg/commands/... |
| 152 | +golangci-lint run ./... |
| 153 | +GOOS=windows golangci-lint run ./... |
| 154 | +go vet ./... |
| 155 | +``` |
| 156 | + |
| 157 | +## Known limitations / follow-ups |
| 158 | + |
| 159 | +- **`talm init --update` non-tty UX gap** (#174): preset-template overwrites require interactive confirmation, so a CI-driven refresh leaves the operator on a stale preset that doesn't surface new validation logic. Work around by copying preset templates from the repo manually or running update under a tty. |
| 160 | +- **Talos-mutated-field allowlist** (open in #172): Phase 2B reports cert hashes / timestamps as divergence today; the verify is off by default until an allowlist lands. |
| 161 | +- **`talm upgrade` has no preflight gates**: the upgrade flow wraps `talosctl upgrade` and doesn't route through `buildApplyClosure` / `applyOneFileDirectPatchMode`. Wiring would require either reproducing the gate calls in `upgrade_handler.go` or refactoring the apply flow. |
| 162 | +- **Phase 1/2 on `--insecure`**: the safety gates can't run before the chart renders, and the chart's `lookup` calls need an authenticated COSI connection. On the insecure path, effectively no gates run today.