Skip to content

[FIXED] Incomplete route pool on premature pong#7971

Merged
neilalexander merged 1 commit intomainfrom
maurice/route-pool-incomplete
Mar 19, 2026
Merged

[FIXED] Incomplete route pool on premature pong#7971
neilalexander merged 1 commit intomainfrom
maurice/route-pool-incomplete

Conversation

@MauriceVanVeen
Copy link
Copy Markdown
Member

If the first PONG was received before c.route.startNewRoute was initialized it would result in the route pool not being fully populated. Instead of relying on the very first pong, we can just rely on the first pong after initializing the field.

Signed-off-by: Maurice van Veen github@mauricevanveen.com

@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner March 18, 2026 20:15
@synadia-claude-reviewer
Copy link
Copy Markdown

synadia-claude-reviewer bot commented Mar 18, 2026

Claude finished @MauriceVanVeen's task in 1m 50s —— View job


PR Review: [FIXED] Incomplete route pool on premature pong

  • Gather context
  • Analyze changes in server/client.go
  • Analyze test changes in server/routes_test.go
  • Post review findings

Review

Correctness: The fix is correct and well-targeted.

The bug is clear: c.flags.setIfNotSet(firstPong) would return true on the very first PONG, which could arrive before processRouteInfo/addRoute had set c.route.startNewRoute. Once firstPong was consumed, subsequent PONGs (including the one sent right after startNewRoute is assigned at route.go:2413) would be ignored for route-pool chaining, leaving the pool incomplete.

The fix — replacing the flag check with a simple nil-guard on startNewRoute — is the minimal and correct solution. Since:

  1. startNewRoute is set exactly once per connection in addRoute (route.go:2408)
  2. A PING is sent immediately after (route.go:2413), inside the same lock
  3. startNewRoute is cleared upon consumption in processPong

...there's no risk of double-triggering, and the fix correctly handles the race.

Minor issues:

  1. Stale comments in route.go (not changed in this PR): The comments at route.go:90-92 and route.go:2403-2405 still say "first PONG", which is no longer accurate — it's now "the PONG in response to the PING sent after startNewRoute is initialized". These could be updated for clarity.

  2. Confusing error message in test (routes_test.go):

    if strings.HasPrefix(line, "PONG") {
        return errors.New("stalled on PONG")
    }

    The error message "stalled on PONG" is misleading — we're not stalled, we're just still polling. A message like "S2 responded to our PING; still waiting for S2's timer PING" would be clearer.

  3. done WaitGroup usage: The defer done.Wait() / defer done.Done() synchronization pattern (keeping handler goroutines alive until the main test finishes its second-connection check) is valid but non-obvious. A brief comment explaining why would help future readers.

Overall: The fix is sound. The test does exercise the race condition correctly. The stale comments in route.go are the only thing worth a follow-up, and they're pre-existing.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/route-pool-incomplete branch from 1aa2184 to e5678cb Compare March 19, 2026 06:51
@MauriceVanVeen
Copy link
Copy Markdown
Member Author

@claude, brief re-review please.

@synadia-claude-reviewer
Copy link
Copy Markdown

synadia-claude-reviewer bot commented Mar 19, 2026

Claude finished @MauriceVanVeen's task in 53s —— View job


Re-review

  • Read changed files
  • Post re-review

Overall: LGTM. The fix is correct and the previous stale-comment concern in route.go:90-92 has been addressed.

The one remaining stale comment is at route.go:2403-2405, which was not changed in this PR:

// In pool mode, if we did not yet reach the cap, try to connect a new connection,
// but do so only after receiving the first PONG to our PING, which will ensure
// that we have proper authentication.

This still says "first PONG to our PING" — now that the mechanism is a nil-guard on startNewRoute rather than a firstPong flag, "first" is slightly misleading (it's the PONG in response to the specific PING sent right after startNewRoute is set). Low priority, but worth a one-word tweak. Fix this →

Test (TestRoutePoolFirstPongBlocksChain): The test correctly exercises the race — it delivers a PONG before INFO, then sends INFO + a second PING/PONG, and verifies that startNewRoute is still consumed and triggers the next pool connection. The logic is sound.

Two small nits from the previous review remain open (not blocking):

  • The checkFor loop exit message ("expected PING, got %q") is the polling error — that's fine and accurate. But the loop is actually looking for a PING from S2, not waiting for S2's timer to fire; a comment clarifying this would help.
  • The keepAlive / ready WaitGroup interaction (goroutines stay alive via defer keepAlive.Wait() until keepAlive.Done() in the main goroutine) is non-obvious and a brief comment would help future readers.

Neither of these are blockers. The fix itself is minimal, correct, and well-tested.


Copy link
Copy Markdown
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@neilalexander neilalexander merged commit 5b63063 into main Mar 19, 2026
50 checks passed
@neilalexander neilalexander deleted the maurice/route-pool-incomplete branch March 19, 2026 15:23
neilalexander added a commit that referenced this pull request Mar 24, 2026
Includes the following:

- #7781
- #7973
- #7970
- #7971

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants