[cassandra] Add Configuration.Close() to ensure TLS cert watcher is closed#4515
[cassandra] Add Configuration.Close() to ensure TLS cert watcher is closed#4515yurishkuro merged 2 commits intojaegertracing:mainfrom
Conversation
Head branch was pushed to by a user without write access
|
The first push failed because of the test case in TestProxyBuilder in build_test.go file is calling fswatcher.go Close function twice. Added protection so that if the channel is already closed, it will not try to attempt to close the channel again. |
pkg/fswatcher/fswatcher.go
Outdated
| close(w.stop) | ||
| w.wg.Wait() | ||
| } | ||
| w.isClosed = true |
There was a problem hiding this comment.
unprotected bool can be a source of race condition.
There was a problem hiding this comment.
I see, let me add the mutex lock to prevent the race conditions.
|
please make sure all commits are signed (easiest is to squash into one) |
Codecov ReportPatch coverage has no change and project coverage change:
Additional details and impacted files@@ Coverage Diff @@
## main #4515 +/- ##
==========================================
+ Coverage 97.03% 97.04% +0.01%
==========================================
Files 301 301
Lines 17817 17817
==========================================
+ Hits 17289 17291 +2
+ Misses 424 422 -2
Partials 104 104
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
eae8f07 to
274d6c4
Compare
|
I squashed and signed all my commits. Can you approve it again? |
pkg/fswatcher/fswatcher.go
Outdated
| // Close closes the watcher. | ||
| // Close closes the watcher and stops the background go routine of FSWatcher. | ||
| func (w *FSWatcher) Close() error { | ||
| if w.isClosed.CompareAndSwap(false, true) { |
There was a problem hiding this comment.
I was looking closer at this and I am not convinced this is all needed. When we call w.watcher.Close(), it closes the Events and Errors channels, so our goroutine will exit naturally. Yes, there is a race condition if you're checking for its exit immediately, but that cannot be completely avoided, even with these changes the goroutine may go to sleep after L143 lowers the semaphore.
If you observed that the goroutine is still running for a long time after close, there may be something else going on.
We can still use the waitgroup introduced here, but without the channel/bool:
func (w *FSWatcher) Close() error {
// watcher.Close() takes care of multiple calls by returning "already closed" error
if err := w.watcher.Close(); err != nil {
return err
}
w.wg.Wait()
return nil
}There was a problem hiding this comment.
I see. let me send out a new commit with these changes.
There was a problem hiding this comment.
Can you elaborate how you determined that the GR was still running? Can it be reproduced with a unit test?
There was a problem hiding this comment.
So you are right, the original code should work fine without any issues. I dig deeper into why this issue will happen in the first place. The main reason is that when you create a new Cassandra Cluster on L149, the code calls c.TLS.Config(). The c.TLS.Config call newCertWatcher which then calls watchCertPair(). watchCertPair() call fswatcher.new() which creates a new leaky go routine. After that in Cassandra, there is no method or function to Close the goroutine. In short, the code is working correctly, it is just that when creating the newCluster, the Close function for fswatcher was never called or the method was never implemented to close that go Routine.
https://github.com/jaegertracing/jaeger/blob/main/pkg/cassandra/config/config.go#L149
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/options.go#L99
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L69
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L97
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L106
There was a problem hiding this comment.
I propose that in https://github.com/jaegertracing/jaeger/blob/main/pkg/cassandra/config/config.go we implement a function called Close that will allow us to close that goroutine.
func (c *Configuration) Close() error { return c.TLS.Close() }
eb2efd2 to
e2b9f11
Compare
ea618cb to
b42bea8
Compare
| return cluster, nil | ||
| } | ||
|
|
||
| func (c *Configuration) Close() error { |
There was a problem hiding this comment.
Since this function did not exist before, nothing is calling it, so just adding it here is not going to address the issue in the ticket.
There was a problem hiding this comment.
So I looked at the code base, the leaky go routine only happens when new Cassandra Cluster is created directly with no option to close. So, the leaky-go routine is happening on our side, not on your end. I am thinking about changing a ticket from bug to feature request so that users who created the new Cassandra Cluster will have the option to close the fswatcher leaky go routine.
…losed (jaegertracing#4515) Which problem is this PR solving? Resolves jaegertracing#4514 Short description of the changes Implement a Close Function in Cassandra to allow us to close the leaky-go routine. Signed-off-by: kennyaz <115052215+kennyaz@users.noreply.github.com> Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com> Signed-off-by: KevinSchneider <kevin.schneider@target.com>
Which problem is this PR solving?
Resolves #4514
Short description of the changes
Implement a Close Function in Cassandra to allow us to close the leaky-go routine.