Commit bafa4a8

docs(apply-safety): test plan reference for the apply-time gates
Capture the matrix of cases the gates were validated against during PR #173 development: Phase 1 link / disk / selector inputs, Phase 2A diff classification, Phase 2B mode gating, real-Talos validation steps against the dev17 reference cluster. Documents the known limitations (upgrade has no gates, insecure path can't render with discovery, --init --update non-tty UX gap). Future-self reference for re-validation after touching applycheck or the preflight hooks.

Refs: #172

Signed-off-by: Aleksei Sviridkin <f@lex.la>
1 parent 8745dad commit bafa4a8

1 file changed

Lines changed: 162 additions & 0 deletions

@@ -0,0 +1,162 @@
# Apply-time safety gates: test plan

A reference checklist for validating changes to the apply-time safety gates introduced in #172 / PR #173. It covers the contract tests that ship with the package and the manual real-Talos validation steps that surface issues unit tests cannot.

## Build under test

```bash
cd ~/git/github.com/cozystack/talm && go build -o /tmp/talm-safety ./
```

Run all matrix cells against the binary at `/tmp/talm-safety`. Use the dev17 reference cluster (`~/git/github.com/lexfrei/work/env/dev17/talm/`, OCI 3-node Talos v1.12.6) for live runs.
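
Because every cell depends on that binary, a small guard can catch a missing or stale build before any cell runs. The sketch below is a hypothetical helper, not part of talm; `require_binary` and the `TALM_BIN` variable are names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail early if the binary under test is missing.
# TALM_BIN defaults to the build output path used in this document.
TALM_BIN="${TALM_BIN:-/tmp/talm-safety}"

# require_binary PATH -> exit 0 if PATH is a regular, executable file
require_binary() {
  local bin="$1"
  if [[ ! -f "$bin" || ! -x "$bin" ]]; then
    echo "binary under test not found or not executable: $bin" >&2
    return 1
  fi
}
```

Calling `require_binary "$TALM_BIN"` at the top of a validation script would stop the run before any cell is attempted.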

## Phase 1: declared-resource existence

### Link references

| Case | How to trigger | Expected |
| --- | --- | --- |
| Typoed `LinkConfig.name` | Add `LinkConfig{name: eth9999}` to a node body | `[blocker] declared link "eth9999" not found …` plus available-links hint |
| Typoed bond slave | Add `BondConfig{name: bond0, links: [ghost0, ens5]}` | Blocker on `ghost0` only; `bond0` (new bond) NOT flagged |
| Typoed VLAN parent | Add `VLANConfig{name: ens5.99, link: ghost0, vlanID: 99}` | Blocker on `link: ghost0`; `name: ens5.99` not flagged (new VLAN) |
| Typoed bridge port | Add `BridgeConfig{name: br99, ports: [ghost0]}` | Blocker on `ghost0` only; `br99` not flagged |
| Typoed Layer2VIP link | Set `vipLink: ghost0` in values | Blocker on `link: ghost0` |
| Legacy v1.11 interface | `machine.network.interfaces[].interface: eth9999` | Blocker; same hint shape |
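
Several rows above expect a blocker on one name and explicitly no blocker on another. A rough way to check that against captured output is sketched below. Only the `[blocker]` prefix and the quoted-name shape come from the table; the helper name `expect_blocker_only` and the assumption that every flagged name appears in double quotes are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper: assert that captured gate output blocks on one name
# and does not block on another. The "[blocker]" prefix is taken from the
# table; the rest of the message format is an assumption.

# expect_blocker_only OUTPUT FLAGGED NOT_FLAGGED
expect_blocker_only() {
  local output="$1" flagged="$2" not_flagged="$3"
  local blockers
  # Keep only lines carrying the literal "[blocker]" marker.
  blockers="$(grep -F '[blocker]' <<<"$output" || true)"

  if ! grep -qF "\"$flagged\"" <<<"$blockers"; then
    echo "expected a blocker naming \"$flagged\"" >&2
    return 1
  fi
  if grep -qF "\"$not_flagged\"" <<<"$blockers"; then
    echo "unexpected blocker naming \"$not_flagged\"" >&2
    return 1
  fi
}
```

For the bond row, the output of a dry-run could be captured into a variable and checked with `expect_blocker_only "$out" ghost0 bond0`. Whether findings are written to stdout or stderr is not stated here, so capturing both streams is the safer choice.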

### Disk references

| Case | How to trigger | Expected |
| --- | --- | --- |
| Bad literal disk | Set `machine.install.disk: /dev/sdz` | Blocker; hint lists real disks (sda, sdb) and **must omit** the virtual class (`dm-*`, `drbd*`, `loop*`) |
| Bad model selector | `diskSelector: {model: Samsumg}` | Blocker "matches zero disks"; hint lists candidate disks with size |
| Impossible size | `diskSelector: {size: ">= 99TB"}` | Blocker "matches zero disks" |
| Lowercase units | `diskSelector: {size: ">= 100gb"}` | Matches as if `>= 100GB` (`humanize.ParseBytes` is case-insensitive) |
| Mixed case + spaces | `diskSelector: {size: "<= 200000MiB"}` | Parsed correctly |
| Multiple matches | `diskSelector: {type: ssd}` on host with several SSDs | Warning (not blocker) "matches multiple disks; install picks the first match" |
| Type semantics | `type: nvme/sd/hdd/ssd` per-disk Transport+Rotational | Mirrors the Talos `v1alpha1_provider.go:1325-1351` mapping |
| Readonly excluded | Selector + a readonly disk on host | Readonly disk not counted as match |
| CDROM excluded | Selector + a CD drive on host | CD not counted as match |
| Virtual excluded | Selector on cozystack host with many dm/drbd/loop | dm/drbd/loop not counted; hint omits them |
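
When reviewing a hint by hand, it helps to know which disk names should never appear in it. The sketch below classifies names purely by the prefixes listed in the table (`dm-*`, `drbd*`, `loop*`). This is an illustration only: how talm itself detects the virtual class is not shown in this document and may rely on other disk attributes. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Illustrative only: name-based classification of the "virtual" disk class
# listed in the table (dm-*, drbd*, loop*). talm's own classification is
# not shown in this document and may differ.

# is_virtual_disk NAME -> exit 0 if NAME (with or without /dev/) is virtual
is_virtual_disk() {
  local name="${1#/dev/}"
  case "$name" in
    dm-*|drbd*|loop*) return 0 ;;
    *) return 1 ;;
  esac
}

# filter_hint_disks NAME... -> prints the names a hint is expected to list
filter_hint_disks() {
  local d
  for d in "$@"; do
    is_virtual_disk "$d" || printf '%s\n' "$d"
  done
}
```

Feeding the device names seen on a host through `filter_hint_disks` gives the list to compare against the hint printed by the blocker.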

### Opt-out

| Case | Trigger | Expected |
| --- | --- | --- |
| `--skip-resource-validation` | Pass with bad selector + bad link | No Phase 1 output; render proceeds |

## Phase 2A: pre-apply drift preview

### Diff classification

| Case | Trigger | Expected |
| --- | --- | --- |
| Identical desired/on-node | First-run apply after the same render | `0 addition, 0 removal, 0 update, N unchanged.` |
| Removed doc | Apply config that drops a previously-emitted doc (e.g. dropping a LinkConfig that was on-node) | `- LinkConfig{name: …}` line |
| Added doc | Apply config that adds a fresh doc | `+ LinkConfig{name: …}` line |
| Updated leaf | Change one nested field (e.g. `clusterDomain`) | `~ MachineConfig` plus `cluster.network.dnsDomain: cozy.local -> cozy.example` |
| Identical inputs include Equal entries | Verified via Diff API | OpEqual entries returned; FilterChanged drops them |
| Distinguish absent vs null | YAML `extraField: null` added to one side | FieldChange.HasOld=false / HasNew=true; formatter renders `(absent) -> <nil>` |
| Stable ordering | Re-run on same inputs | Identical output bytes |
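
The "Stable ordering" row can be checked mechanically by running the same command twice and comparing the output byte for byte. The helper below is a minimal sketch; `assert_stable` is a name introduced here, and it compares stdout only, on the assumption that the preview is written there.

```bash
#!/usr/bin/env bash
# Hypothetical helper for the "Stable ordering" case: run a command twice
# and require byte-identical stdout.

# assert_stable CMD [ARGS...]
assert_stable() {
  local first second status=0
  first="$(mktemp)"
  second="$(mktemp)"
  # Exit codes of the command itself are ignored; only output bytes matter.
  "$@" >"$first" 2>/dev/null || true
  "$@" >"$second" 2>/dev/null || true
  if ! cmp -s "$first" "$second"; then
    echo "output differs between runs: $*" >&2
    status=1
  fi
  rm -f "$first" "$second"
  return "$status"
}
```

A typical call would be `assert_stable /tmp/talm-safety apply --dry-run -f nodes/node0.yaml`, run from inside the talm project directory.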

### Path / mode interactions

| Case | Trigger | Expected |
| --- | --- | --- |
| Dry-run shows preview | `talm apply --dry-run -f node.yaml` | Phase 2A runs; this is the "show me what would change" contract |
| Insecure path | `talm apply -i -f node.yaml` (where chart can render offline) | `talm: drift verification unavailable on maintenance connection`; no block |
| `--skip-drift-preview` | Pass with any change | Preview suppressed entirely |

## Phase 2B: post-apply state verification

Off by default until the Talos-mutated-field allowlist lands. Enable it explicitly with `--skip-post-apply-verify=false`.

| Case | Trigger | Expected |
| --- | --- | --- |
| Clean apply | Apply config matching on-node, `--skip-post-apply-verify=false` | Silent success (no output, no error) |
| Mode=staged | `--mode=staged --skip-post-apply-verify=false` | Phase 2B skipped (staged store doesn't change ActiveID) |
| Mode=try | `--mode=try --skip-post-apply-verify=false` | Phase 2B skipped (rollback timer races verify) |
| Mode=no-reboot/reboot/auto | Real apply with verify enabled | Phase 2B runs |
| Dry-run | `--dry-run --skip-post-apply-verify=false` | Phase 2B skipped (no real apply) |
| Reader error | Simulated COSI hiccup on auth path | Hint-bearing blocker `post-apply: re-reading on-node MachineConfig`, exit non-zero (the gate exists to catch silent rollbacks, so the error is not swallowed) |
| Insecure path | `talm apply -i --skip-post-apply-verify=false` | `drift verification unavailable on maintenance connection` line; no block |
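
The mode rows of the table reduce to a small predicate, which can be handy when scripting the matrix. The sketch below restates the documented expectations only; it is not talm's implementation, `verify_expected` is a hypothetical name, and the insecure-path and reader-error rows are deliberately left out.

```bash
#!/usr/bin/env bash
# Restates the Phase 2B table as a predicate. This mirrors the documented
# expectations only; it is not talm's implementation.

# verify_expected MODE DRY_RUN(true|false) VERIFY_ENABLED(true|false)
# exit 0: the table says Phase 2B runs; exit 1: skipped; exit 2: unknown mode
verify_expected() {
  local mode="$1" dry_run="$2" enabled="$3"
  [[ "$enabled" == "true" ]] || return 1   # verify is off by default
  [[ "$dry_run" == "false" ]] || return 1  # dry-run performs no real apply
  case "$mode" in
    no-reboot|reboot|auto) return 0 ;;
    staged|try) return 1 ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}
```

Here `enabled` being `true` corresponds to passing `--skip-post-apply-verify=false` on the command line.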

## Real-Talos validation (dev17)

Before requesting human review, exercise the gates against a live Talos node.

### Setup

```bash
cd ~/git/github.com/lexfrei/work/env/dev17/talm
```

dev17 carries an OCI 3-node Talos v1.12.6 cluster (158.101.113.227 / 129.158.237.173 / 157.151.143.81). The vendored talm library in `charts/talm/` may need `talm init --update --preset cozystack` to pick up new helpers. Preset templates (`templates/_helpers.tpl`) require interactive confirmation and are not updated automatically in a non-tty session (known gap, see #174).

### Sanity check

```bash
/tmp/talm-safety template -f nodes/node0.yaml > /tmp/rendered.yaml
test -s /tmp/rendered.yaml || echo "render failed"
```

### Phase 1 (auth path)

```bash
# Clean run: should pass silently.
/tmp/talm-safety apply --dry-run -f nodes/node0.yaml

# Inject a bad link ref (cp + edit a temp file inside the talm project):
cp nodes/node0.yaml nodes/_test-bad.yaml
echo -e "---\napiVersion: v1alpha1\nkind: LinkConfig\nname: eth9999" >> nodes/_test-bad.yaml
/tmp/talm-safety apply --dry-run -f nodes/_test-bad.yaml # expect [blocker]
rm nodes/_test-bad.yaml
```
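
If the dry-run in the injection step aborts the script, the temporary file can be left behind in `nodes/`. One way to make the cleanup unconditional is sketched below. `with_bad_link` and its argument order are hypothetical, not part of talm. It appends the same `LinkConfig` document as the manual step and passes the path of the modified copy as the last argument of the given command.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the bad-link injection: the temp file is
# removed even if the command fails. The function name and arguments are
# illustrative, not part of talm.

# with_bad_link NODE_FILE LINK_NAME CMD [ARGS...]
# Runs CMD with the path of the modified copy appended as the last argument.
with_bad_link() {
  local node_file="$1" link_name="$2"
  shift 2
  local tmp status=0
  tmp="$(dirname "$node_file")/_test-bad.yaml"
  cp "$node_file" "$tmp" || return 1
  # printf is used instead of `echo -e`, whose behaviour varies by shell.
  printf -- '---\napiVersion: v1alpha1\nkind: LinkConfig\nname: %s\n' \
    "$link_name" >>"$tmp"
  "$@" "$tmp" || status=$?
  rm -f "$tmp"
  return "$status"
}
```

With this in place, the injection step becomes `with_bad_link nodes/node0.yaml eth9999 /tmp/talm-safety apply --dry-run -f`, and the wrapper returns the exit status of the dry-run.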

### Phase 2A (drift preview)

```bash
# Dry-run against a clean cluster: should report 0/0/0 with everything unchanged.
/tmp/talm-safety apply --dry-run -f nodes/node0.yaml

# Force a leaf change via values.yaml (back up then revert):
sed -i.bak 's/^clusterDomain: .*/clusterDomain: cozy.example/' values.yaml
/tmp/talm-safety apply --dry-run -f nodes/node0.yaml | grep -E "^ [-+~=]|^ "
mv values.yaml.bak values.yaml
```

### Phase 2B (real apply with verify enabled)

```bash
/tmp/talm-safety apply --mode=no-reboot --skip-post-apply-verify=false -f nodes/node0.yaml
# Expected: drift preview + 'Applied configuration without a reboot' + silent post-apply verify
```

### Multi-node + mix

```bash
/tmp/talm-safety apply --dry-run -f nodes/node0.yaml -f nodes/node1.yaml -f nodes/node2.yaml
# Each node renders its own preview; per-node independence.
```

### Insecure path

`talm apply -i` exercises the maintenance connection. On dev17 the chart uses live discovery (`lookup "disks"`), which fails on the insecure path because COSI requires an authenticated connection. The render errors out before the gate runs; that is existing talm behaviour, not a regression.

## Implementation health

Run as part of every push:

```bash
go test ./...
go test -race ./pkg/applycheck/... ./pkg/commands/...
golangci-lint run ./...
GOOS=windows golangci-lint run ./...
go vet ./...
```
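
Running the list by hand stops at whichever command the operator notices failing first. A small runner that executes every check and reports all failures at the end is sketched below. `run_checks` is a hypothetical helper, and passing each check as a single quoted command line is a choice made here so that environment prefixes such as `GOOS=windows` keep working.

```bash
#!/usr/bin/env bash
# Hypothetical runner: execute every check and report failures at the end
# instead of stopping at the first one.

# run_checks "cmd one" "cmd two" ...
# Each argument is a full command line, run via `bash -c`.
run_checks() {
  local failed=0 check
  for check in "$@"; do
    echo "==> $check"
    if ! bash -c "$check"; then
      echo "FAILED: $check" >&2
      failed=$((failed + 1))
    fi
  done
  [[ "$failed" -eq 0 ]]
}

# Example invocation with the commands listed in this section:
# run_checks 'go test ./...' \
#   'go test -race ./pkg/applycheck/... ./pkg/commands/...' \
#   'golangci-lint run ./...' \
#   'GOOS=windows golangci-lint run ./...' \
#   'go vet ./...'
```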

## Known limitations / follow-ups

- **`talm init --update` non-tty UX gap** (#174): preset-template overwrites require interactive confirmation, so a CI-driven refresh leaves the operator on a stale preset that doesn't surface new validation logic. Work around it by copying preset templates from the repo manually or by running the update under a tty.
- **Talos-mutated-field allowlist** (open in #172): Phase 2B currently reports cert hashes and timestamps as divergence; the verify stays off by default until an allowlist lands.
- **`talm upgrade` has no preflight gates**: the upgrade flow wraps `talosctl upgrade` and doesn't route through `buildApplyClosure` / `applyOneFileDirectPatchMode`. Wiring the gates in would require either reproducing the gate calls in `upgrade_handler.go` or refactoring the apply flow.
- **Phase 1/2 on `--insecure`**: the safety gates can't run before the chart renders, and the chart's `lookup` calls need an authenticated COSI connection. In effect, the insecure path has no gates today.
