Skip to content

Conversation

arjan-bal
Copy link
Contributor

@arjan-bal arjan-bal commented Sep 28, 2025

Related issue: b/415354418

Problem

On connection breakage, the pickfirst leaf balancer enters idle and returns an Idle picker that calls the balancer's ExitIdle method only the first time Pick is called. The following sequence of events will cause the balancer to get stuck in Idle state:

  1. Existing connection breaks, SubConn requests re-resolution and reports IDLE. In turn PF updates the ClientConn state to IDLE with an Idle picker.
  2. An RPC is made, triggering balancer.ExitIdle through the idle picker. The balancer attempts to re-connect the failed SubConn.
  3. The resolver produces a new endpoint list, removing the endpoint used by the existing SubConn. PF removes the existing SubConn. Since the balancer didn't update the ClientConn state to CONNECTING yet, pickfirst thinks that it's still in IDLE and doesn't start connecting to the new endpoints.
  4. New RPC requests trigger the idle picker, but it's a no-op since it only triggers the balancer's ExitIdle method once.

Fix

This change moves the ClientConn into Connecting immediately when the ExitIdle method is called. This ensures that the balancer continues to re-connect when a new endpoint list is produced by the resolver.

RELEASE NOTES:

  • balancer/pickfirst: Fix bug that can cause balancer to get stuck in IDLE state on backend address change.

@arjan-bal arjan-bal added this to the 1.76 Release milestone Sep 28, 2025
@arjan-bal arjan-bal added Type: Bug Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. labels Sep 28, 2025
Copy link

codecov bot commented Sep 28, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 82.07%. Comparing base (bb71072) to head (b90ab74).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
balancer/pickfirst/pickfirstleaf/pickfirstleaf.go 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8615      +/-   ##
==========================================
+ Coverage   82.00%   82.07%   +0.07%     
==========================================
  Files         415      415              
  Lines       40697    40704       +7     
==========================================
+ Hits        33372    33409      +37     
+ Misses       5937     5910      -27     
+ Partials     1388     1385       -3     
Files with missing lines Coverage Δ
balancer/pickfirst/pickfirstleaf/pickfirstleaf.go 88.41% <87.50%> (+1.33%) ⬆️

... and 16 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@arjan-bal arjan-bal changed the title pickfirstleaf: Avoid getting suck in idle on connection breakage pickfirstleaf: Avoid getting stuck in idle on connection breakage Sep 29, 2025
}

// Calling the idle picker should result in the SubConn being re-connected.
if err := cc.WaitForPickerWithErr(ctx, balancer.ErrNoSubConnAvailable); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should this not be a WaitFor call? We just want to call the current picker once at this point, right? We know it's updated because the state is Idle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we don't need to wait for the picker, we can use the current picker. Changed to fetching the current picker, calling its Pick method and verifying it returns the queueing error.

@arjan-bal arjan-bal changed the title pickfirstleaf: Avoid getting stuck in idle on connection breakage pickfirstleaf: Avoid getting stuck in IDLE on connection breakage Sep 29, 2025
// update with a new address is received. The test verifies that the balancer
// creates a new SubConn for the new address and that the ClientConn eventually
// becomes READY.
func (s) TestPickFirstLeaf_Reconnection(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have this as an e2e style test. We wouldn't be able to use the blocking context dialer testutil that we have, since we can only control the connection after we start connecting. So, by that time, the subchannel would have already moved to Connecting. I spent a couple of minutes on it, but couldn't think of a way to be able to do this in a e2e style. But if you have an idea that can work without much work, that would be great. But I wouldn't block this PR for that test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, by that time, the subchannel would have already moved to Connecting.

I thought of using a real channel, but couldn't figure out a simple way to delay/drop the Connecting update, so decided to use the fake channel.

One solution I thought of was to wrap the SubConn state listener in a parent LB to intercept and drop connectivity updates. That was making the test more complex than using a fake channel.

@easwars easwars assigned arjan-bal and unassigned easwars Sep 30, 2025
@arjan-bal arjan-bal merged commit 51b15ad into grpc:master Sep 30, 2025
15 checks passed
arjan-bal added a commit to arjan-bal/grpc-go that referenced this pull request Oct 1, 2025
…pc#8615)

Related issue: b/415354418

## Problem

On connection breakage, the pickfirst leaf balancer enters idle and
returns an `Idle picker` that calls the balancer's `ExitIdle` method
only the first time `Pick` is called. The following sequence of events
will cause the balancer to get stuck in `Idle` state:
1. Existing connection breaks, SubConn [requests re-resolution and
reports
IDLE](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/clientconn.go#L1388-L1393).
In turn PF updates the ClientConn state to IDLE with an `Idle picker`.
1. An RPC is made, triggering `balancer.ExitIdle` through the idle
picker. The balancer attempts to re-connect the failed SubConn.
1. The resolver produces a new endpoint list, removing the endpoint used
by the existing SubConn. PF removes the existing SubConn. Since the
balancer didn't update the ClientConn state to CONNECTING yet, pickfirst
thinks that it's still in IDLE and doesn't start connecting to the new
endpoints.
1. New RPC requests trigger the idle picker, but it's a no-op since it
only [triggers the balancer's ExitIdle method
once](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663).

## Fix

This change moves the ClientConn into Connecting immediately when the
`ExitIdle` method is called. This ensures that the balancer continues to
re-connect when a new endpoint list is produced by the resolver.

RELEASE NOTES:
* balancer/pickfirst: Fix bug that can cause balancer to get stuck in
`IDLE` state on connection failure.
arjan-bal added a commit to arjan-bal/grpc-go that referenced this pull request Oct 1, 2025
…pc#8615)

Related issue: b/415354418

## Problem

On connection breakage, the pickfirst leaf balancer enters idle and
returns an `Idle picker` that calls the balancer's `ExitIdle` method
only the first time `Pick` is called. The following sequence of events
will cause the balancer to get stuck in `Idle` state:
1. Existing connection breaks, SubConn [requests re-resolution and
reports
IDLE](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/clientconn.go#L1388-L1393).
In turn PF updates the ClientConn state to IDLE with an `Idle picker`.
1. An RPC is made, triggering `balancer.ExitIdle` through the idle
picker. The balancer attempts to re-connect the failed SubConn.
1. The resolver produces a new endpoint list, removing the endpoint used
by the existing SubConn. PF removes the existing SubConn. Since the
balancer didn't update the ClientConn state to CONNECTING yet, pickfirst
thinks that it's still in IDLE and doesn't start connecting to the new
endpoints.
1. New RPC requests trigger the idle picker, but it's a no-op since it
only [triggers the balancer's ExitIdle method
once](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663).

## Fix

This change moves the ClientConn into Connecting immediately when the
`ExitIdle` method is called. This ensures that the balancer continues to
re-connect when a new endpoint list is produced by the resolver.

RELEASE NOTES:
* balancer/pickfirst: Fix bug that can cause balancer to get stuck in
`IDLE` state on connection failure.
@arjan-bal arjan-bal deleted the pf-stuck-bug branch October 1, 2025 08:23
arjan-bal added a commit that referenced this pull request Oct 3, 2025
Original PRs: #8610, #8615

RELEASE NOTES:
* balancer/pick_first:
* Fix bug that can cause balancer to get stuck in `IDLE` state on
backend address change.
* When configured, shuffle addresses in resolver updates that lack
endpoints. Since gRPC automatically adds endpoints to resolver updates,
this bug should only affect implementers of custom LB policies that use
pick_first for delegation but don't forward the endpoints.
Pranjali-2501 pushed a commit to Pranjali-2501/grpc-go that referenced this pull request Oct 6, 2025
…pc#8615)

Related issue: b/415354418

## Problem

On connection breakage, the pickfirst leaf balancer enters idle and
returns an `Idle picker` that calls the balancer's `ExitIdle` method
only the first time `Pick` is called. The following sequence of events
will cause the balancer to get stuck in `Idle` state:
1. Existing connection breaks, SubConn [requests re-resolution and
reports
IDLE](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/clientconn.go#L1388-L1393).
In turn PF updates the ClientConn state to IDLE with an `Idle picker`.
1. An RPC is made, triggering `balancer.ExitIdle` through the idle
picker. The balancer attempts to re-connect the failed SubConn.
1. The resolver produces a new endpoint list, removing the endpoint used
by the existing SubConn. PF removes the existing SubConn. Since the
balancer didn't update the ClientConn state to CONNECTING yet, pickfirst
thinks that it's still in IDLE and doesn't start connecting to the new
endpoints.
1. New RPC requests trigger the idle picker, but it's a no-op since it
only [triggers the balancer's ExitIdle method
once](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663).

## Fix

This change moves the ClientConn into Connecting immediately when the
`ExitIdle` method is called. This ensures that the balancer continues to
re-connect when a new endpoint list is produced by the resolver.

RELEASE NOTES:
* balancer/pickfirst: Fix bug that can cause balancer to get stuck in
`IDLE` state on connection failure.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants