
Conversation

@charlievieth

Previously, this code used a non-nil socket as a placeholder in server.AcquireSocket. This could lead to a race if server.Close is called while a socket is connecting. This commit changes the code to use a nil socket placeholder instead. If the socket connects successfully, the nil placeholder is replaced with the valid socket; if the connection fails, the placeholder is removed.

This is a follow-on to #40 and fixes the race identified here.

Previously, this code used a non-nil socket as a placeholder in
server.AcquireSocket.  This could lead to a race if server.Close is
called while a socket is connecting.  This commit changes the code to
use a nil socket placeholder instead.  If the socket connects
successfully, the nil placeholder is replaced with the valid socket;
if the connection fails, the placeholder is removed.
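
For readers skimming the thread, here is a minimal sketch of the approach described above. The mongoServer/mongoSocket fields, the use of net.Dial as the connect step, and the index bookkeeping are all illustrative stand-ins, not the actual mgo code:

package sketch

import (
	"net"
	"sync"
)

// Illustrative stand-ins for the real mgo types.
type mongoSocket struct{ conn net.Conn }

type mongoServer struct {
	sync.Mutex
	addr        string
	liveSockets []*mongoSocket
}

// acquireSocket reserves a slot with a nil placeholder before dialing, so a
// concurrent Close never sees a socket that has not finished initializing,
// then fills the slot on success or drops it on failure.
func (server *mongoServer) acquireSocket() (*mongoSocket, error) {
	server.Lock()
	server.liveSockets = append(server.liveSockets, nil) // placeholder
	// NOTE: remembering this index assumes liveSockets cannot shrink while
	// we dial; the follow-up commit 0f7e0d4 below drops that assumption.
	i := len(server.liveSockets) - 1
	server.Unlock()

	conn, err := net.Dial("tcp", server.addr) // slow work, no lock held

	server.Lock()
	defer server.Unlock()
	if err != nil {
		// Connection failed: remove the placeholder so it stops counting
		// against the connection cap.
		server.liveSockets = append(server.liveSockets[:i], server.liveSockets[i+1:]...)
		return nil, err
	}
	socket := &mongoSocket{conn: conn}
	server.liveSockets[i] = socket // publish only the fully initialized socket
	return socket, nil
}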
@charlievieth
Author

PTAL @scode

@charlievieth
Author

charlievieth commented Jul 24, 2019

🔒 DO NOT MERGE - We cannot rely on the size of the liveSockets slice being static between calls. Fixing now.

Since the size of the liveSockets slice may change while we are
connecting we cannot reliably replace the nil-placeholder we inserted.
Instead we replace/remove the first nil-placeholder we find.
@charlievieth
Author

🔓 Fixed with 0f7e0d4:

server: replace arbitrary nil-placeholders in AcquireSocket

Since the size of the liveSockets slice may change while we are
connecting we cannot reliably replace the nil-placeholder we inserted.
Instead we replace/remove the first nil-placeholder we find.
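
Continuing the stand-in types from the sketch near the top of the thread, the revised bookkeeping would look roughly like this; rather than trusting a saved index, it replaces or removes whichever nil placeholder it finds first (callers hold server.Lock):

// replaceNilPlaceholder fills the first nil slot with the connected socket.
func (server *mongoServer) replaceNilPlaceholder(socket *mongoSocket) {
	for i, s := range server.liveSockets {
		if s == nil {
			server.liveSockets[i] = socket
			return
		}
	}
}

// removeNilPlaceholder drops the first nil slot after a failed connection.
func (server *mongoServer) removeNilPlaceholder() {
	for i, s := range server.liveSockets {
		if s == nil {
			server.liveSockets = append(server.liveSockets[:i], server.liveSockets[i+1:]...)
			return
		}
	}
}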

@scode

scode commented Jul 24, 2019

The close case is the primary one I've been looking at; I didn't see this being an issue (but only incidentally). The socket close/kill path calls into the server, but those calls have gates checking for the server being closed. Since you determined there's a race though, I want to look again and see if I can convince myself of it as well. I also want to double-check all uses of live sockets.

I don't want to rush it, and I have meetings today; I may not get to it until tomorrow.

@scode scode self-requested a review July 24, 2019 19:10
@scode

scode commented Jul 24, 2019

Discussed a bit offline, and wanted to add some more color to the specific issue being fixed here. While #38 does fix a race condition between server close and AcquireSocket, it doesn't do so in all cases.

In general, all the calls into mongoServer methods that happen in the mongoSocket close path are appropriately gated on server status, such that calling mongoSocket.Close() for a socket whose server is already closed should be safe from the perspective of entering a closed server.

However, the remaining issue is that the socket itself is not safe for concurrent use until after it has finished being created. Socket structs are created unlocked, and added to liveSockets immediately. Any concurrent server close that happens will then potentially call Close() on an unlocked socket object.

That's the issue this PR is fixing. We could also potentially fix it by adding locking and checking for the connecting state in Close(). However, I think I prefer this approach since it more obviously prevents the race, by locally ensuring we never publish an object that hasn't finished initializing.

I've checked each use of liveSockets and I don't see any reason why a nil being present there would cause a problem, aside from the Close() path where this PR is adding a nil check.
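
For concreteness, the nil check on the close path amounts to roughly the following, again using the stand-in types from the earlier sketch rather than the real code (the real close path also flips the server's status, which is what gates the calls mentioned above):

// close skips nil placeholders: a nil entry is a socket that is still
// connecting and was never published, so there is nothing to close for it.
func (server *mongoServer) close() {
	server.Lock()
	liveSockets := server.liveSockets
	server.liveSockets = nil
	server.Unlock()

	for _, s := range liveSockets {
		if s == nil {
			continue
		}
		s.conn.Close()
	}
}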

// connection is completed, is that an unconnected socket is not safe
// for concurrent use. Concurrent mongoServer.Close() calls may race with us
// and otherwise call Close() on an unlocked not-yet-connected socket.
for i, s := range server.liveSockets {

@scode scode Jul 24, 2019


I feel slightly uncomfortable with the fact that this is O(n) during connection acquisition (i.e. O(n^2) to open n new connections).

However, the constant cost is very low compared to the actual connection establishment, and the connection caps we use are relatively small. I'd rather we pay this small cost than go with a more complicated fix with a higher risk of introducing a bug. Just wanted to call it out as a decision.

Author

@charlievieth charlievieth Jul 24, 2019


I thought about that, but I think we'll have much bigger problems if the connection count becomes so large that this is unreasonably slow. I'm pretty sure it would need to be around two million before this takes anywhere near a millisecond (searching 1mil pointers took ~500μs on my machine, example test code below).

package main

import (
	"math/rand"
	"testing"
	"time"
)

func init() {
	rand.Seed(time.Now().UnixNano())
}

func FindNil(a []*uint64) int {
	for i, p := range a {
		if p == nil {
			return i
		}
	}
	return -1
}

// sink prevents the compiler from eliminating the FindNil call as dead code.
var sink int

func BenchmarkPointer(b *testing.B) {
	const N = 1000000
	a := make([]*uint64, N)
	for i := range a {
		u := uint64(i)
		a[i] = &u
	}
	n := rand.Intn(1000) // place near end
	a[len(a)-n-1] = nil
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		sink = FindNil(a)
	}
}


Agreed. Sorry, I didn't mean to make you do a test :)

Author


No worries, I was actually curious how long it would take.


@scode scode left a comment


+1 on my account (charlie, feel free to review on account of the comments I pushed). Giving other teammates a chance to take a look before shipping.

server.liveSockets = append(server.liveSockets, socket)
// hold a spot in the liveSockets slice to ensure connecting sockets are counted
// against the total connection cap.
server.liveSockets = append(server.liveSockets, nil)

I would prefer the use of an atomic counter for the liveSockets count for bookkeeping, but that might need more code changes than this isolated one.

Author

@charlievieth charlievieth Jul 24, 2019


I thought about that, but that requires additional coordination, since the following example is flaky when n < limit - 1 and more than two processes are concurrently checking the limit:

	const limit = 8
	var p int64
	n := atomic.LoadInt64(&p)
	if n < limit && atomic.CompareAndSwapInt64(&p, n, n+1) {
		// do the thing
	} else {
		// If another process incremented p, we would need a bounded loop
		// to do this properly, and even that isn't great, since to be
		// truly correct we would also need a sync.Cond.
	}

Author


Basically, if we used atomics we'd end up with a sub-par mutex or a racy solution.


Yeah, I don't think an atomic counter really helps us.

Normally I'd prefer we do separate, explicit bookkeeping of sockets in connecting states rather than sticking nils into liveSockets, but given that the upstream project is dead, our expected usage is limited, and we lack good testing, I think the minimal surgical change is preferred.

}
// hold our spot in the liveSockets slice
server.liveSockets = append(server.liveSockets, socket)
// hold a spot in the liveSockets slice to ensure connecting sockets are counted

Can you explain why this is important, and why a close on a socket is no longer needed (even if going over the pool limit would be detrimental)? I'm assuming this is the thundering herd problem (where we run out of connections), so it might be a good comment to leave here for future readers.


Pushing a comment explaining the pool limit on the mongoCluster socket, and adjusting the terminology here to more obviously relate it to the poolLimit parameter, which can be traced back to mongoCluster.

In case it matters to anyone in the future, here's the relevant change that altered how pool limits work.
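
To make the pool-limit interaction concrete: the cap is enforced against the length of liveSockets, so a connecting socket has to occupy a slot or many concurrent dials could overshoot the limit. A rough sketch with the same stand-in types (poolLimit is passed as a plain parameter here; the real value is plumbed through from mongoCluster):

// tryReserveSlot shows why the nil placeholder matters for the pool limit:
// a connecting socket occupies a slot immediately, so concurrent callers
// cannot all dial at once and blow past the cap.
func (server *mongoServer) tryReserveSlot(poolLimit int) bool {
	server.Lock()
	defer server.Unlock()
	if poolLimit > 0 && len(server.liveSockets) >= poolLimit {
		return false // at the cap; the caller must wait or give up
	}
	server.liveSockets = append(server.liveSockets, nil) // counted while connecting
	return true
}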

@charlievieth
Author

+1 on my account (charlie, feel free to review on account of the comments I pushed). Giving other teammates a chance to take a look before shipping.

Thanks for adding comments and clearly explaining the issue here (both are super useful). I have nothing more to add so please feel free to merge whenever you feel this has been sufficiently reviewed.

@gitsen

gitsen commented Jul 25, 2019

+1

@scode scode merged commit 8de64b2 into lyft-v2-4 Jul 25, 2019

@gitsen gitsen left a comment


+1
