fix: Attempt to repair disconnected/failed master nodes before failing over #1105
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1105 +/- ##
==========================================
+ Coverage 35.20% 44.34% +9.14%
==========================================
Files 19 20 +1
Lines 3213 3412 +199
==========================================
+ Hits 1131 1513 +382
+ Misses 2015 1813 -202
- Partials 67 86 +19
@drivebyer do you mind having a look please?
Sure, I would add some end-to-end tests to improve this fix.
Force-pushed from 0a01dc7 to 89b3b52.
Oh, thanks for adding them! Was just getting around to it :) I've had a hell of a time battling flakes on the e2e tests
Thanks for your help @drivebyer! Is there a planned release coming soon so I can pick up this change? Looks like the last release was in July of this year
I’m not sure about the exact timing of the next release. If you’re in a hurry, you could build your own image using the Dockerfile from this link: https://github.com/OT-CONTAINER-KIT/redis-operator/blob/master/Dockerfile.
Ended up closing #1101, as I kept forgetting to sign commits and the git history was getting out of control from all the rebasing.
Description
Fixes #1100
As stated in the above issue, a cluster whose leaders are unhealthy as a result of being scaled to zero nodes can be recovered without issuing a failover (which leads to data loss).
The failed/disconnected nodes simply need to have their address updated with the IP of the new leader pods.
CLUSTER MEET is able to map the address specified to the existing host & port, meaning we don't need to wipe the master & start afresh. If this fails, we fall back to the failover.
Type of change
This is a best-effort attempt
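The repair-then-fall-back flow from the Description could be sketched roughly as below. This is only an illustration, not the code in this PR: the `clusterClient` interface, the `podAddr` type, `repairOrFailover`, and the fake client are all hypothetical names made up for the example. `Meet` stands in for issuing CLUSTER MEET with the new pod IP, and `Failover` stands in for the existing failover path.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterClient is a hypothetical abstraction over the two operations involved.
// It is NOT the operator's real API.
type clusterClient interface {
	Meet(ip string, port int) error // stands in for CLUSTER MEET <ip> <port>
	Failover() error                // stands in for the existing failover path
}

// podAddr pairs a (hypothetical) leader pod name with its new IP.
type podAddr struct {
	Name string
	IP   string
}

// repairOrFailover tries to re-introduce every unhealthy leader at its new
// address. On the first failed repair it gives up and runs the failover.
// It returns true only when every repair succeeded.
func repairOrFailover(c clusterClient, pods []podAddr, port int) (bool, error) {
	for _, p := range pods {
		if err := c.Meet(p.IP, port); err != nil {
			// Best effort only: fall back to the old behaviour.
			return false, c.Failover()
		}
	}
	return true, nil
}

// fakeClient records calls so the flow can be exercised without Redis.
type fakeClient struct {
	failMeetFor string // IP for which Meet should fail
	meets       []string
	failedOver  bool
}

func (f *fakeClient) Meet(ip string, port int) error {
	if ip == f.failMeetFor {
		return errors.New("meet failed")
	}
	f.meets = append(f.meets, fmt.Sprintf("%s:%d", ip, port))
	return nil
}

func (f *fakeClient) Failover() error {
	f.failedOver = true
	return nil
}

func main() {
	c := &fakeClient{}
	pods := []podAddr{{Name: "leader-0", IP: "10.0.0.5"}}
	repaired, err := repairOrFailover(c, pods, 6379)
	fmt.Println(repaired, err, c.meets, c.failedOver)
}
```

The point of the ordering is that the non-destructive step always runs first, so the failover (and its data loss) only happens when the repair cannot be completed.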
Checklist
If new strategy fails, failover still works as expected
and new strategy working as expected
corresponding logs
Additional Context
There's a small bit of refactoring too
CLUSTER NODES response for a bit more safety on helper functions nodeFailedOrDisconnected and nodeIsOfType
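As a rough illustration of what stricter handling of the CLUSTER NODES response might look like, here is a minimal sketch. The helper names nodeFailedOrDisconnected and nodeIsOfType come from this PR, but their signatures, the `clusterNode` type, `parseClusterNodes`, and `hasFlag` are assumptions made for the example and may differ from the real code. The field positions follow the documented CLUSTER NODES line format (id, address, flags, master, ping-sent, pong-recv, config-epoch, link-state, slots). Treating only the `fail` flag and a `disconnected` link state as unhealthy is likewise an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNode keeps only the fields of a CLUSTER NODES line the checks need.
type clusterNode struct {
	ID        string
	Addr      string
	Flags     []string // e.g. ["myself", "master"] or ["master", "fail"]
	LinkState string   // "connected" or "disconnected"
}

// parseClusterNodes splits a CLUSTER NODES reply into nodes. Lines with fewer
// than 8 fields cannot contain the link-state column and are skipped rather
// than indexed blindly.
func parseClusterNodes(reply string) []clusterNode {
	var nodes []clusterNode
	for _, line := range strings.Split(strings.TrimSpace(reply), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 8 {
			continue
		}
		nodes = append(nodes, clusterNode{
			ID:        fields[0],
			Addr:      fields[1],
			Flags:     strings.Split(fields[2], ","),
			LinkState: fields[7],
		})
	}
	return nodes
}

// hasFlag reports whether the node carries exactly the given flag.
func hasFlag(n clusterNode, flag string) bool {
	for _, f := range n.Flags {
		if f == flag {
			return true
		}
	}
	return false
}

// nodeIsOfType checks the role flag ("master" or "slave").
func nodeIsOfType(n clusterNode, nodeType string) bool {
	return hasFlag(n, nodeType)
}

// nodeFailedOrDisconnected flags nodes marked "fail" or with a dropped link.
func nodeFailedOrDisconnected(n clusterNode) bool {
	return hasFlag(n, "fail") || n.LinkState == "disconnected"
}

func main() {
	// Made-up sample reply; IDs and addresses are illustrative only.
	reply := "aaa 10.0.0.7:6379@16379 master,fail - 0 0 1 disconnected 0-5460\n" +
		"bbb 10.0.0.8:6379@16379 myself,master - 0 0 2 connected 5461-10922\n"
	for _, n := range parseClusterNodes(reply) {
		fmt.Println(n.ID, nodeIsOfType(n, "master"), nodeFailedOrDisconnected(n))
	}
}
```

Comparing whole flags and whole columns, instead of substring-matching the raw line, avoids false positives such as a node ID or hostname that happens to contain "fail".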