Fix a race that can occur while closing a socket that is connecting #41
Conversation
Previously, this code used a non-nil socket as a placeholder in server.AcquireSocket. This could lead to a race if server.Close is called while a socket is connecting. This commit changes the code to use a nil socket placeholder instead. If the socket successfully connects, the nil placeholder is replaced with the valid socket; if the socket fails to connect, the nil placeholder is removed.
PTAL @scode
🔒 DO NOT MERGE - We cannot rely on the size of the liveSockets slice. Since the size of the liveSockets slice may change while we are connecting, we cannot reliably replace the nil placeholder we inserted. Instead, we replace/remove the first nil placeholder we find.
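For illustration, here is a minimal sketch of the replace-or-remove-first-nil approach described above. The types and helper names here are simplified stand-ins, not the actual mgo code (the real logic lives in mongoServer.AcquireSocket):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type mongoSocket struct{ addr string }

type mongoServer struct {
	sync.Mutex
	liveSockets []*mongoSocket
}

// acquire is a hypothetical stand-in for the acquire path.
func (s *mongoServer) acquire(dial func() (*mongoSocket, error)) (*mongoSocket, error) {
	s.Lock()
	// Reserve a spot with a nil placeholder so connecting sockets still
	// count against the pool limit while we dial without holding the lock.
	s.liveSockets = append(s.liveSockets, nil)
	s.Unlock()

	sock, err := dial()

	s.Lock()
	defer s.Unlock()
	// The slice may have changed while we were dialing, so we cannot trust
	// a remembered index; find the first nil placeholder instead.
	for i, cur := range s.liveSockets {
		if cur != nil {
			continue
		}
		if err != nil {
			// Dial failed: drop the placeholder.
			s.liveSockets = append(s.liveSockets[:i], s.liveSockets[i+1:]...)
			return nil, err
		}
		// Dial succeeded: publish the now-connected socket.
		s.liveSockets[i] = sock
		return sock, nil
	}
	return nil, errors.New("placeholder missing; server closed?")
}

func main() {
	s := &mongoServer{}
	sock, err := s.acquire(func() (*mongoSocket, error) {
		return &mongoSocket{addr: "localhost:27017"}, nil
	})
	fmt.Println(sock, err, len(s.liveSockets))
}
```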
🔓 Fixed, with 0f7e0d4.
The close case is the primary one I've been looking at; I didn't see this being an issue (but only incidentally). The socket close/kill path calls into the server, but those calls have gates checking for the server being closed. Since you determined there's a race, though, I want to look again and see if I can convince myself of it as well. I also want to double-check all uses of live sockets. I don't want to rush it, and I have meetings today; I may not get to it until tomorrow.
Discussed a bit offline, and wanted to add some more color to the specific issue being fixed here. While #38 does fix a race condition between server close and socket acquisition, it doesn't close the whole window.

In general, all the calls into the server from the socket close/kill path are gated by checks for the server being closed. However, the remaining issue is that the socket itself is not safe for concurrent use until after it has finished being created. Socket structs are created unlocked and added to liveSockets before connecting completes, so a concurrent mongoServer.Close() can end up calling Close() on an unlocked, not-yet-connected socket.

That's the issue this PR is fixing. We could also potentially fix it by adding locking and checking for the connecting state in the socket itself. I've checked each use of liveSockets as well.
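To make the window concrete, here is a hypothetical standalone reproduction of the pre-fix hazard (not the actual mgo code); `go run -race` flags the conflicting writes:

```go
package main

import (
	"sync"
	"time"
)

type mongoSocket struct {
	conn string // stands in for real, unguarded connection state
}

func (s *mongoSocket) Close() { s.conn = "" } // writes without the socket's lock

type mongoServer struct {
	sync.Mutex
	liveSockets []*mongoSocket
}

func main() {
	srv := &mongoServer{}
	sock := &mongoSocket{}

	// Pre-fix acquire path: publish the socket before it is connected.
	srv.Lock()
	srv.liveSockets = append(srv.liveSockets, sock)
	srv.Unlock()

	// The dial keeps mutating the socket outside any lock.
	go func() { sock.conn = "tcp://example:27017" }()

	// A concurrent server close walks liveSockets and closes everything,
	// racing with the dial above; the race detector reports the conflict.
	srv.Lock()
	for _, s := range srv.liveSockets {
		s.Close()
	}
	srv.Unlock()

	time.Sleep(10 * time.Millisecond)
}
```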
```go
// connection is completed, is that an unconnected socket is not safe
// for concurrent use. Concurrent mongoServer.Close() calls may race with us
// and otherwise call Close() on an unlocked not-yet-connected socket.
for i, s := range server.liveSockets {
```
I feel slightly uncomfortable with the fact that this is O(n) during connection acquisition (i.e. O(n^2) to open n new connections).
However, the constant cost is very low compared to the actual connection establishment and the connection caps we use are relatively small. I'd rather us pay this small cost than go with a more complicated fix with a higher risk of introducing a bug. Just wanted to call it out as a decision.
I thought about that, but I think we'll have much bigger problems if the connection count becomes so large that this is unreasonably slow. I'm pretty sure it would need to be around two million before this takes anywhere near a millisecond (searching 1mil pointers took ~500μs on my machine, example test code below).
```go
package main

import (
	"math/rand"
	"testing"
	"time"
)

func init() {
	rand.Seed(time.Now().UnixNano())
}

func FindNil(a []*uint64) int {
	for i, p := range a {
		if p == nil {
			return i
		}
	}
	return -1
}

func BenchmarkPointer(b *testing.B) {
	const N = 1000000
	a := make([]*uint64, N)
	for i := range a {
		u := uint64(i)
		a[i] = &u
	}
	n := rand.Intn(1000) // place near end
	a[len(a)-n-1] = nil
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		FindNil(a)
	}
}
```
Agreed. Sorry, I didn't mean to make you do a test :)
No worries, I was actually curious about how long it would take.
scode left a comment:
+1 on my account (charlie, feel free to review on account of the comments I pushed). Giving other teammates a chance to take a look before shipping.
```diff
-	server.liveSockets = append(server.liveSockets, socket)
+	// hold a spot in the liveSockets slice to ensure connecting sockets are counted
+	// against the total connection cap.
+	server.liveSockets = append(server.liveSockets, nil)
```
I would prefer the use of an atomic counter for the liveSockets count for bookkeeping, but that might need more code changes than this isolated one.
I thought about that, but it requires additional coordination, since the following example is flaky when n < limit - 1 and more than two processes are concurrently checking the limit:
```go
const limit = 8

var p int64

n := atomic.LoadInt64(&p)
if n < limit && atomic.CompareAndSwapInt64(&p, n, n+1) {
	// do the thing
} else {
	// if another process incremented p, we would need a
	// bounded loop to do this properly and even that
	// isn't great since to be truly correct we would
	// also need a sync.Cond.
}
```
Basically, if we used atomics, we'd end up with a sub-par mutex or a racy solution.
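For context, here is a sketch of the bounded-loop variant mentioned above, which amounts to a hand-rolled try-lock. This is an assumed shape, not proposed code; the function names are hypothetical:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const limit = 8

var p int64

// tryAcquire reserves a slot with a bounded CAS retry loop. This is the
// "sub-par mutex": under contention it spins, and to wait correctly for a
// slot to free up we would additionally need a sync.Cond or a channel.
func tryAcquire() bool {
	for attempts := 0; attempts < 16; attempts++ {
		n := atomic.LoadInt64(&p)
		if n >= limit {
			return false // pool is full
		}
		if atomic.CompareAndSwapInt64(&p, n, n+1) {
			return true // slot reserved
		}
		// CAS lost to a concurrent update; reload and retry.
	}
	return false // gave up under contention
}

func release() { atomic.AddInt64(&p, -1) }

func main() {
	if tryAcquire() {
		defer release()
		fmt.Println("acquired a slot")
	}
}
```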
Yeah, I don't think an atomic counter really helps us.
Normally I'd prefer separate, explicit bookkeeping of sockets in connecting states rather than sticking nils into liveSockets, but given that the upstream project is dead, our expected usage is limited, and we lack good testing, I think the minimal surgical change is preferred.
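For the record, the explicit-bookkeeping alternative might have looked something like this; the `connecting` field and `inUse` helper are hypothetical, not anything in the codebase:

```go
package main

import "sync"

type mongoSocket struct{}

type mongoServer struct {
	sync.Mutex
	liveSockets []*mongoSocket
	connecting  int // sockets currently dialing, not yet in liveSockets
}

// inUse returns the count a pool-limit check would compare against
// poolLimit, so connecting sockets still count toward the cap.
// The caller must hold the server lock.
func (s *mongoServer) inUse() int {
	return len(s.liveSockets) + s.connecting
}

func main() {}
```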
```diff
 	}
-	// hold our spot in the liveSockets slice
-	server.liveSockets = append(server.liveSockets, socket)
+	// hold a spot in the liveSockets slice to ensure connecting sockets are counted
```
Can you explain why this is important, and why a close on the socket is no longer needed (even though going over the pool limit would be detrimental)? This is the thundering herd problem, I am assuming (where we run out of connections), so it might be a good comment to leave here for future readers.
Pushing a comment explaining the pool limit on the mongoCluster socket, and adjusting the terminology here to relate it more obviously to the poolLimit parameter, which can be traced back to mongoCluster.
In case it matters to anyone in the future, here's the relevant change that changed how pool limits work.
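To spell out why the placeholder matters for the cap, here is a hypothetical shape of the limit check; the identifiers and error text are illustrative, and the real check sits in the acquire path traced back to mongoCluster's poolLimit:

```go
package main

import (
	"errors"
	"sync"
)

type mongoSocket struct{}

type mongoServer struct {
	sync.Mutex
	liveSockets []*mongoSocket
}

// reserveSlot sketches the pool-limit check. Because nil placeholders for
// in-flight dials are included in len(liveSockets), a thundering herd of
// new connections cannot blow past poolLimit while dials are in progress.
func (s *mongoServer) reserveSlot(poolLimit int) error {
	s.Lock()
	defer s.Unlock()
	if len(s.liveSockets) >= poolLimit {
		return errors.New("pool limit reached")
	}
	s.liveSockets = append(s.liveSockets, nil)
	return nil
}

func main() {}
```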
Thanks for adding comments and clearly explaining the issue here (both are super useful). I have nothing more to add, so please feel free to merge whenever you feel this has been sufficiently reviewed.
+1
gitsen left a comment:
+1
This is a follow-on to #40 and fixes the race identified here.