-
Notifications
You must be signed in to change notification settings - Fork 4.6k
pickfirstleaf: Avoid getting stuck in IDLE on connection breakage #8615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8615 +/- ##
==========================================
+ Coverage 82.00% 82.07% +0.07%
==========================================
Files 415 415
Lines 40697 40704 +7
==========================================
+ Hits 33372 33409 +37
+ Misses 5937 5910 -27
+ Partials 1388 1385 -3
🚀 New features to boost your workflow:
|
} | ||
|
||
// Calling the idle picker should result in the SubConn being re-connected. | ||
if err := cc.WaitForPickerWithErr(ctx, balancer.ErrNoSubConnAvailable); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should this not be a WaitFor
call? We just want to call the current picker once at this point, right? We know it's updated because the state is Idle
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we don't need to wait for the picker, we can use the current picker. Changed to fetching the current picker, calling its Pick
method and verifying it returns the queueing error.
// update with a new address is received. The test verifies that the balancer | ||
// creates a new SubConn for the new address and that the ClientConn eventually | ||
// becomes READY. | ||
func (s) TestPickFirstLeaf_Reconnection(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have this as an e2e style test. We wouldn't be able to use the blocking context dialer testutil that we have, since we can only control the connection after we start connecting. So, by that time, the subchannel would have already moved to Connecting
. I spent a couple of minutes on it, but couldn't think of a way to be able to do this in a e2e style. But if you have an idea that can work without much work, that would be great. But I wouldn't block this PR for that test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, by that time, the subchannel would have already moved to Connecting.
I thought of using a real channel, but couldn't figure out a simple way to delay/drop the Connecting
update, so decided to use the fake channel.
One solution I thought of was to wrap the SubConn state listener in a parent LB to intercept and drop connectivity updates. That was making the test more complex than using a fake channel.
…pc#8615) Related issue: b/415354418 ## Problem On connection breakage, the pickfirst leaf balancer enters idle and returns an `Idle picker` that calls the balancer's `ExitIdle` method only the first time `Pick` is called. The following sequence of events will cause the balancer to get stuck in `Idle` state: 1. Existing connection breaks, SubConn [requests re-resolution and reports IDLE](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/clientconn.go#L1388-L1393). In turn PF updates the ClientConn state to IDLE with an `Idle picker`. 1. An RPC is made, triggering `balancer.ExitIdle` through the idle picker. The balancer attempts to re-connect the failed SubConn. 1. The resolver produces a new endpoint list, removing the endpoint used by the existing SubConn. PF removes the existing SubConn. Since the balancer didn't update the ClientConn state to CONNECTING yet, pickfirst thinks that it's still in IDLE and doesn't start connecting to the new endpoints. 1. New RPC requests trigger the idle picker, but it's a no-op since it only [triggers the balancer's ExitIdle method once](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663). ## Fix This change moves the ClientConn into Connecting immediately when the `ExitIdle` method is called. This ensures that the balancer continues to re-connect when a new endpoint list is produced by the resolver. RELEASE NOTES: * balancer/pickfirst: Fix bug that can cause balancer to get stuck in `IDLE` state on connection failure.
…pc#8615) Related issue: b/415354418 ## Problem On connection breakage, the pickfirst leaf balancer enters idle and returns an `Idle picker` that calls the balancer's `ExitIdle` method only the first time `Pick` is called. The following sequence of events will cause the balancer to get stuck in `Idle` state: 1. Existing connection breaks, SubConn [requests re-resolution and reports IDLE](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/clientconn.go#L1388-L1393). In turn PF updates the ClientConn state to IDLE with an `Idle picker`. 1. An RPC is made, triggering `balancer.ExitIdle` through the idle picker. The balancer attempts to re-connect the failed SubConn. 1. The resolver produces a new endpoint list, removing the endpoint used by the existing SubConn. PF removes the existing SubConn. Since the balancer didn't update the ClientConn state to CONNECTING yet, pickfirst thinks that it's still in IDLE and doesn't start connecting to the new endpoints. 1. New RPC requests trigger the idle picker, but it's a no-op since it only [triggers the balancer's ExitIdle method once](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663). ## Fix This change moves the ClientConn into Connecting immediately when the `ExitIdle` method is called. This ensures that the balancer continues to re-connect when a new endpoint list is produced by the resolver. RELEASE NOTES: * balancer/pickfirst: Fix bug that can cause balancer to get stuck in `IDLE` state on connection failure.
Original PRs: #8610, #8615 RELEASE NOTES: * balancer/pick_first: * Fix bug that can cause balancer to get stuck in `IDLE` state on backend address change. * When configured, shuffle addresses in resolver updates that lack endpoints. Since gRPC automatically adds endpoints to resolver updates, this bug should only affect implementers of custom LB policies that use pick_first for delegation but don't forward the endpoints.
…pc#8615) Related issue: b/415354418 ## Problem On connection breakage, the pickfirst leaf balancer enters idle and returns an `Idle picker` that calls the balancer's `ExitIdle` method only the first time `Pick` is called. The following sequence of events will cause the balancer to get stuck in `Idle` state: 1. Existing connection breaks, SubConn [requests re-resolution and reports IDLE](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/clientconn.go#L1388-L1393). In turn PF updates the ClientConn state to IDLE with an `Idle picker`. 1. An RPC is made, triggering `balancer.ExitIdle` through the idle picker. The balancer attempts to re-connect the failed SubConn. 1. The resolver produces a new endpoint list, removing the endpoint used by the existing SubConn. PF removes the existing SubConn. Since the balancer didn't update the ClientConn state to CONNECTING yet, pickfirst thinks that it's still in IDLE and doesn't start connecting to the new endpoints. 1. New RPC requests trigger the idle picker, but it's a no-op since it only [triggers the balancer's ExitIdle method once](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663). ## Fix This change moves the ClientConn into Connecting immediately when the `ExitIdle` method is called. This ensures that the balancer continues to re-connect when a new endpoint list is produced by the resolver. RELEASE NOTES: * balancer/pickfirst: Fix bug that can cause balancer to get stuck in `IDLE` state on connection failure.
Related issue: b/415354418
Problem
On connection breakage, the pickfirst leaf balancer enters idle and returns an
Idle picker
that calls the balancer'sExitIdle
method only the first timePick
is called. The following sequence of events will cause the balancer to get stuck inIdle
state:Idle picker
.balancer.ExitIdle
through the idle picker. The balancer attempts to re-connect the failed SubConn.Fix
This change moves the ClientConn into Connecting immediately when the
ExitIdle
method is called. This ensures that the balancer continues to re-connect when a new endpoint list is produced by the resolver.RELEASE NOTES:
IDLE
state on backend address change.