fix: Attempt to repair disconnected/failed master nodes before failing over #1101
Closed
nashluffy wants to merge 33 commits into OT-CONTAINER-KIT:master
4e2b4d7 to e633a19
nashluffy commented Oct 13, 2024
@@ -1,5 +1,5 @@
 # Build the manager binary
-FROM golang:1.21 as builder
+FROM golang:1.21 AS builder
On make docker-build I saw a warning about inconsistent casing in the Dockerfile; this is just a drive-by fix.
nashluffy commented Oct 13, 2024
reqLogger.Error(err, "failed to repair disconnected masters")
}

err := retry.Do(func() error {
It can take a few seconds after issuing CLUSTER MEET for the change to be reflected, so we retry 3 times with a 5-second back-off before proceeding to start the failover.
Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: drivebyer <wuyangmuc@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>
…-KIT#1102) * update controller-gen to fix make manifests, update Makefile Signed-off-by: Nash Luffman <nashluffman@gmail.com> * update envtest version Signed-off-by: Nash Luffman <nashluffman@gmail.com> * fix path Signed-off-by: Nash Luffman <nashluffman@gmail.com> * add kind as dependency Signed-off-by: Nash Luffman <nashluffman@gmail.com> --------- Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
70b2179 to dc6edc6
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1101      +/-   ##
===========================================
+ Coverage   35.20%   45.63%   +10.43%
===========================================
  Files          19       20        +1
  Lines        3213     2763      -450
===========================================
+ Hits         1131     1261      +130
+ Misses       2015     1416      -599
- Partials       67       86       +19
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v3...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>
closing in favour of a less messy PR where I don't forget to sign-off the commits :(
Description
Fixes #1100
As stated in the above issue, a cluster whose leaders are unhealthy as a result of being scaled to zero nodes can be recovered without issuing a failover (which leads to data loss).
The failed/disconnected nodes simply need to have their addresses updated with the IPs of the new leader pods.
CLUSTER MEET is able to map the address specified to the existing host & port, meaning we don't need to wipe the master & start afresh. If this fails, we fall back to the failover.
Type of change
This is a best-effort attempt
Checklist
If new strategy fails, failover still works as expected
and new strategy working as expected
corresponding logs
Additional Context
There's a small bit of refactoring too
CLUSTER NODES response for a bit more safety on helper functions nodeFailedOrDisconnected and nodeIsOfType
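As an illustration of what parsing the CLUSTER NODES response into a typed value might look like, here is a sketch assuming the standard space-separated line layout (id, address, flags, master id, ping-sent, pong-recv, config-epoch, link-state, slots). The helper names come from the PR, but the clusterNode type, the parser, and the helper signatures and bodies below are assumptions and may differ from the actual refactoring.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNode holds the fields of one CLUSTER NODES line used here.
// Hypothetical type for illustration only.
type clusterNode struct {
	ID        string
	Addr      string
	Flags     []string
	LinkState string
}

// parseClusterNodes splits raw CLUSTER NODES output into typed nodes and
// rejects lines with fewer than the eight leading fields.
func parseClusterNodes(out string) ([]clusterNode, error) {
	var nodes []clusterNode
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) < 8 {
			return nil, fmt.Errorf("malformed CLUSTER NODES line: %q", line)
		}
		nodes = append(nodes, clusterNode{
			ID:        fields[0],
			Addr:      fields[1],
			Flags:     strings.Split(fields[2], ","),
			LinkState: fields[7],
		})
	}
	return nodes, nil
}

// hasFlag reports whether the node carries the exact flag.
func (n clusterNode) hasFlag(flag string) bool {
	for _, f := range n.Flags {
		if f == flag {
			return true
		}
	}
	return false
}

// nodeIsOfType reports whether the node has the given role flag,
// for example "master" or "slave". Assumed signature.
func nodeIsOfType(n clusterNode, role string) bool {
	return n.hasFlag(role)
}

// nodeFailedOrDisconnected reports whether the node is flagged "fail"
// or its link-state is "disconnected". Assumed signature.
func nodeFailedOrDisconnected(n clusterNode) bool {
	return n.hasFlag("fail") || n.LinkState == "disconnected"
}
```

Matching on parsed fields rather than on substrings of the raw line avoids false positives, such as a "fail?" flag being mistaken for "fail".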