Fix a race that can occur while closing a socket that is connecting #41
Conversation
Previously, this code used a non-nil socket as a placeholder in server.AcquireSocket. This could lead to a race if server.Close is called while a socket is connecting. This commit changes the code to use a nil socket placeholder instead. If the socket successfully connects, the nil placeholder is replaced with the valid socket; if the socket fails to connect, the nil placeholder is removed.
PTAL @scode
🔒 DO NOT MERGE - We cannot rely on the size of the liveSockets slice. Since the size of the liveSockets slice may change while we are connecting, we cannot reliably replace the nil placeholder we inserted. Instead, we replace/remove the first nil placeholder we find.
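For illustration, here is a minimal sketch of the replace-or-remove-first-nil approach described above. The types and helper names here are simplified stand-ins, not the actual mgo code (the real logic lives in mongoServer.AcquireSocket):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type mongoSocket struct{ addr string }

type mongoServer struct {
	sync.Mutex
	liveSockets []*mongoSocket
}

// acquire is a hypothetical stand-in for the acquire path.
func (s *mongoServer) acquire(dial func() (*mongoSocket, error)) (*mongoSocket, error) {
	s.Lock()
	// Reserve a spot with a nil placeholder so connecting sockets still
	// count against the pool limit while we dial without holding the lock.
	s.liveSockets = append(s.liveSockets, nil)
	s.Unlock()

	sock, err := dial()

	s.Lock()
	defer s.Unlock()
	// The slice may have changed while we were dialing, so we cannot trust
	// a remembered index; find the first nil placeholder instead.
	for i, cur := range s.liveSockets {
		if cur != nil {
			continue
		}
		if err != nil {
			// Dial failed: drop the placeholder.
			s.liveSockets = append(s.liveSockets[:i], s.liveSockets[i+1:]...)
			return nil, err
		}
		// Dial succeeded: publish the now-connected socket.
		s.liveSockets[i] = sock
		return sock, nil
	}
	return nil, errors.New("placeholder missing; server closed?")
}

func main() {
	s := &mongoServer{}
	sock, err := s.acquire(func() (*mongoSocket, error) {
		return &mongoSocket{addr: "localhost:27017"}, nil
	})
	fmt.Println(sock, err, len(s.liveSockets))
}
```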
🔓 Fixed, with 0f7e0d4.
The close case is the primary one I've been looking at; I didn't see this being an issue (but only incidentally). The socket close/kill path calls into the server, but those calls have gates checking for the server being closed. Since you determined there's a race, though, I want to look again and see if I can convince myself of it as well. I also want to double-check all uses of live sockets. I don't want to rush it, and I have meetings today; I may not get to it until tomorrow.
Discussed a bit offline, and wanted to add some more color to the specific issue being fixed here. While #38 does fix a race condition between server close and socket acquisition, it doesn't close the whole window.

In general, all the calls into the server from the socket close/kill path are gated by checks for the server being closed. However, the remaining issue is that the socket itself is not safe for concurrent use until after it has finished being created. Socket structs are created unlocked and added to liveSockets before connecting completes, so a concurrent mongoServer.Close() can end up calling Close() on an unlocked, not-yet-connected socket.

That's the issue this PR is fixing. We could also potentially fix it by adding locking and checking for the connecting state in the socket itself. I've checked each use of liveSockets as well.
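To make the window concrete, here is a hypothetical standalone reproduction of the pre-fix hazard (not the actual mgo code); `go run -race` flags the conflicting writes:

```go
package main

import (
	"sync"
	"time"
)

type mongoSocket struct {
	conn string // stands in for real, unguarded connection state
}

func (s *mongoSocket) Close() { s.conn = "" } // writes without the socket's lock

type mongoServer struct {
	sync.Mutex
	liveSockets []*mongoSocket
}

func main() {
	srv := &mongoServer{}
	sock := &mongoSocket{}

	// Pre-fix acquire path: publish the socket before it is connected.
	srv.Lock()
	srv.liveSockets = append(srv.liveSockets, sock)
	srv.Unlock()

	// The dial keeps mutating the socket outside any lock.
	go func() { sock.conn = "tcp://example:27017" }()

	// A concurrent server close walks liveSockets and closes everything,
	// racing with the dial above; the race detector reports the conflict.
	srv.Lock()
	for _, s := range srv.liveSockets {
		s.Close()
	}
	srv.Unlock()

	time.Sleep(10 * time.Millisecond)
}
```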
```go
// connection is completed, is that an unconnected socket is not safe
// for concurrent use. Concurrent mongoServer.Close() calls may race with us
// and otherwise call Close() on an unlocked not-yet-connected socket.
for i, s := range server.liveSockets {
```
I feel slightly uncomfortable with the fact that this is O(n) during connection acquisition (i.e. O(n^2) to open n new connections).
However, the constant cost is very low compared to the actual connection establishment and the connection caps we use are relatively small. I'd rather us pay this small cost than go with a more complicated fix with a higher risk of introducing a bug. Just wanted to call it out as a decision.
I thought about that, but I think we'll have much bigger problems if the connection count becomes so large that this is unreasonably slow. I'm pretty sure it would need to be around two million before this takes anywhere near a millisecond (searching 1mil pointers took ~500μs on my machine, example test code below).
```go
package main

import (
	"math/rand"
	"testing"
	"time"
)

func init() {
	rand.Seed(time.Now().UnixNano())
}

func FindNil(a []*uint64) int {
	for i, p := range a {
		if p == nil {
			return i
		}
	}
	return -1
}

func BenchmarkPointer(b *testing.B) {
	const N = 1000000
	a := make([]*uint64, N)
	for i := range a {
		u := uint64(i)
		a[i] = &u
	}
	n := rand.Intn(1000) // place near end
	a[len(a)-n-1] = nil
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		FindNil(a)
	}
}
```
Agreed. Sorry, I didn't mean to make you do a test :)
No worries, I was actually curious about how long it would take.
scode left a comment:
+1 on my account (charlie, feel free to review on account of the comments I pushed). Giving other teammates a chance to take a look before shipping.
```diff
-	server.liveSockets = append(server.liveSockets, socket)
+	// hold a spot in the liveSockets slice to ensure connecting sockets are counted
+	// against the total connection cap.
+	server.liveSockets = append(server.liveSockets, nil)
```
I would prefer the use of an atomic counter for the liveSockets count for bookkeeping, but that might need more code changes than this isolated one.
I thought about that, but it requires additional coordination, since the following example is flaky when n < limit - 1 and more than two processes are concurrently checking the limit:
```go
const limit = 8

var p int64

n := atomic.LoadInt64(&p)
if n < limit && atomic.CompareAndSwapInt64(&p, n, n+1) {
	// do the thing
} else {
	// if another process incremented p, we would need a
	// bounded loop to do this properly and even that
	// isn't great since to be truly correct we would
	// also need a sync.Cond.
}
```
Basically, if we used atomics, we'd end up with a sub-par mutex or a racy solution.
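For context, here is a sketch of the bounded-loop variant mentioned above, which amounts to a hand-rolled try-lock. This is an assumed shape, not proposed code; the function names are hypothetical:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const limit = 8

var p int64

// tryAcquire reserves a slot with a bounded CAS retry loop. This is the
// "sub-par mutex": under contention it spins, and to wait correctly for a
// slot to free up we would additionally need a sync.Cond or a channel.
func tryAcquire() bool {
	for attempts := 0; attempts < 16; attempts++ {
		n := atomic.LoadInt64(&p)
		if n >= limit {
			return false // pool is full
		}
		if atomic.CompareAndSwapInt64(&p, n, n+1) {
			return true // slot reserved
		}
		// CAS lost to a concurrent update; reload and retry.
	}
	return false // gave up under contention
}

func release() { atomic.AddInt64(&p, -1) }

func main() {
	if tryAcquire() {
		defer release()
		fmt.Println("acquired a slot")
	}
}
```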
Yeah, I don't think an atomic counter really helps us.
Normally I'd prefer separate, explicit bookkeeping of sockets in connecting states rather than sticking nils into liveSockets, but given that the upstream project is dead, our expected usage is limited, and we lack good testing, I think the minimal surgical change is preferred.
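For the record, the explicit-bookkeeping alternative might have looked something like this; the `connecting` field and `inUse` helper are hypothetical, not anything in the codebase:

```go
package main

import "sync"

type mongoSocket struct{}

type mongoServer struct {
	sync.Mutex
	liveSockets []*mongoSocket
	connecting  int // sockets currently dialing, not yet in liveSockets
}

// inUse returns the count a pool-limit check would compare against
// poolLimit, so connecting sockets still count toward the cap.
// The caller must hold the server lock.
func (s *mongoServer) inUse() int {
	return len(s.liveSockets) + s.connecting
}

func main() {}
```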
```diff
 	}
-	// hold our spot in the liveSockets slice
-	server.liveSockets = append(server.liveSockets, socket)
+	// hold a spot in the liveSockets slice to ensure connecting sockets are counted
```
Can you explain why this is important, and why a close on the socket is no longer needed (even though going over the pool limit would be detrimental)? This is the thundering herd problem, I am assuming (where we run out of connections), so it might be a good comment to leave here for future readers.
Pushing a comment explaining the pool limit on the mongoCluster socket, and adjusting the terminology here to relate it more obviously to the poolLimit parameter, which can be traced back to mongoCluster.
In case it matters to anyone in the future, here's the relevant change that changed how pool limits work.
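To spell out why the placeholder matters for the cap, here is a hypothetical shape of the limit check; the identifiers and error text are illustrative, and the real check sits in the acquire path traced back to mongoCluster's poolLimit:

```go
package main

import (
	"errors"
	"sync"
)

type mongoSocket struct{}

type mongoServer struct {
	sync.Mutex
	liveSockets []*mongoSocket
}

// reserveSlot sketches the pool-limit check. Because nil placeholders for
// in-flight dials are included in len(liveSockets), a thundering herd of
// new connections cannot blow past poolLimit while dials are in progress.
func (s *mongoServer) reserveSlot(poolLimit int) error {
	s.Lock()
	defer s.Unlock()
	if len(s.liveSockets) >= poolLimit {
		return errors.New("pool limit reached")
	}
	s.liveSockets = append(s.liveSockets, nil)
	return nil
}

func main() {}
```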
Thanks for adding comments and clearly explaining the issue here (both are super useful). I have nothing more to add, so please feel free to merge whenever you feel this has been sufficiently reviewed.
+1
gitsen left a comment:
+1
This is a follow-on to #40 and fixes the race identified here.