fix: Attempt to repair disconnected/failed master nodes before failing over by nashluffy · Pull Request #1101 · OT-CONTAINER-KIT/redis-operator

nashluffy · 2024-10-13T20:10:48Z

Description

As stated in the above issue, a cluster that has unhealthy leaders as the result of being scaled to zero nodes can be recovered from without having to issue a failover (which leads to data loss).

The failed/disconnected nodes simply need to have their address updated with the IP of the new leader pods. CLUSTER MEET is able to map the address specified to the existing host & port, meaning we don't need to wipe the master & start afresh. If this fails, we fall back to the failover.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)

This is a best-effort attempt

Checklist

Tests have been added/modified and all tests pass.
Functionality/bugs have been confirmed to be unchanged or fixed.
I have performed a self-review of my own code.
Documentation has been updated or added where necessary.

If new strategy fails, failover still works as expected

{"level":"info","ts":"2024-10-13T19:26:29Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:41Z","logger":"controllers.RedisCluster","msg":"unhealthy nodes exist after attempting to repair disconnected masters; starting failover","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:52Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Creating redis cluster by executing cluster creation commands","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Not all leader are part of the cluster...","Request.Namespace":"default","Request.Name":"redis-cluster","Leaders.Count":1,"Instance.Size":3}
{"level":"info","ts":"2024-10-13T19:27:58Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:27:59Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

and new strategy working as expected

# existing data
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

# scale down statefulset & operator
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=0
deployment.apps/redis-operator-redis-operator scaled
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=0
statefulset.apps/redis-cluster-leader scaled

# scale back up
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=3
statefulset.apps/redis-cluster-leader scaled
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=1
deployment.apps/redis-operator-redis-operator scaled

# observe data persisted
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

corresponding logs

{"level":"info","ts":"2024-10-13T20:04:11Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:19Z","logger":"controllers.RedisCluster","msg":"repairing unhealthy masters successful, no unhealthy masters left","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:49Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:50Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

Additional Context

There's a small bit of refactoring too

define type for CLUSTER NODES response for a bit more safety on helper functions
extract some common functionality that operates on above type, ie nodeFailedOrDisconnected and nodeIsOfType
renames some functions to better represent what they do

nashluffy · 2024-10-13T20:12:35Z

Dockerfile

@@ -1,5 +1,5 @@
 # Build the manager binary
-FROM golang:1.21 as builder
+FROM golang:1.21 AS builder


on make docker-build I saw a warning about inconsistent casing in the dockerfile, just a drive-by

nashluffy · 2024-10-13T20:13:58Z

pkg/controllers/rediscluster/rediscluster_controller.go

+			reqLogger.Error(err, "failed to repair disconnected masters")
+		}
+
+		err := retry.Do(func() error {


It can take a few seconds after issuing CLUSTER MEET for it to be reflected, so we retry 3 times with a 5 second back-off before proceeding to start failover

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Signed-off-by: drivebyer <wuyangmuc@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

…-KIT#1102) * update controller-gen to fix make manifests, update Makefile Signed-off-by: Nash Luffman <nashluffman@gmail.com> * update envtest version Signed-off-by: Nash Luffman <nashluffman@gmail.com> * fix path Signed-off-by: Nash Luffman <nashluffman@gmail.com> * add kind as dependency Signed-off-by: Nash Luffman <nashluffman@gmail.com> --------- Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

codecov · 2024-10-14T15:50:17Z

Codecov Report

Attention: Patch coverage is 40.25974% with 46 lines in your changes missing coverage. Please review.

Project coverage is 45.63%. Comparing base (d121d86) to head (aac668b).
Report is 120 commits behind head on master.

Files with missing lines	Patch %	Lines
pkg/k8sutils/redis.go	54.38%	22 Missing and 4 partials ⚠️
...ontrollers/rediscluster/rediscluster_controller.go	0.00%	20 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #1101       +/-   ##
===========================================
+ Coverage   35.20%   45.63%   +10.43%     
===========================================
  Files          19       20        +1     
  Lines        3213     2763      -450     
===========================================
+ Hits         1131     1261      +130     
+ Misses       2015     1416      -599     
- Partials       67       86       +19

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

…rator into repair-nodes

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v3...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Signed-off-by: drivebyer <wuyangmuc@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

…-KIT#1102) * update controller-gen to fix make manifests, update Makefile Signed-off-by: Nash Luffman <nashluffman@gmail.com> * update envtest version Signed-off-by: Nash Luffman <nashluffman@gmail.com> * fix path Signed-off-by: Nash Luffman <nashluffman@gmail.com> * add kind as dependency Signed-off-by: Nash Luffman <nashluffman@gmail.com> --------- Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

…-KIT#1102) * update controller-gen to fix make manifests, update Makefile Signed-off-by: Nash Luffman <nashluffman@gmail.com> * update envtest version Signed-off-by: Nash Luffman <nashluffman@gmail.com> * fix path Signed-off-by: Nash Luffman <nashluffman@gmail.com> * add kind as dependency Signed-off-by: Nash Luffman <nashluffman@gmail.com> --------- Signed-off-by: Nash Luffman <nashluffman@gmail.com>

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

…rator into repair-nodes

nashluffy · 2024-10-16T07:49:19Z

closing in favour of a less messy PR where I don't forget to sign-off the commits :(

nashluffy requested a review from iamabhishek-dubey as a code owner October 13, 2024 20:10

nashluffy force-pushed the repair-nodes branch from 4e2b4d7 to e633a19 Compare October 13, 2024 20:11

nashluffy commented Oct 13, 2024

View reviewed changes

nashluffy changed the title ~~Attempt to repair disconnected/failed master nodes before failing over~~ fix: Attempt to repair disconnected/failed master nodes before failing over Oct 14, 2024

nashluffy and others added 8 commits October 14, 2024 16:40

add support for repairing disconnected/failed master nodes

8a889fb

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

fix test

2b42703

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

cleanup func names

fe27ea9

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

cleanup hostname, more defensive

f3e37bd

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

handle failover

f2ed592

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

feat: upgrade redis/sentinel image to 7.0.15 (OT-CONTAINER-KIT#1099)

edf6e8b

Signed-off-by: drivebyer <wuyangmuc@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

golangci-lint fixes

dc6edc6

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

nashluffy force-pushed the repair-nodes branch from 70b2179 to dc6edc6 Compare October 14, 2024 15:40

Merge branch 'master' into repair-nodes

e6ab2f2

mluffman and others added 14 commits October 15, 2024 11:12

add tests

c347840

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Merge branch 'repair-nodes' of https://github.com/Nashluffy/redis-ope…

7b1953a

…rator into repair-nodes

small refactor

ec5917b

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

make fmt

26d7b97

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

fix make dep

85bbefd

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

fix make dep

b6d1329

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

makefile dep

dcbba9b

add support for repairing disconnected/failed master nodes

4265b49

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

fix test

f034530

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

cleanup func names

2afb953

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

cleanup hostname, more defensive

89d4b76

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

handle failover

f0e3678

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

feat: upgrade redis/sentinel image to 7.0.15 (OT-CONTAINER-KIT#1099)

0e875fb

Signed-off-by: drivebyer <wuyangmuc@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

nashluffy and others added 10 commits October 16, 2024 08:29

golangci-lint fixes

27a90c3

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

add tests

a3e2915

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

small refactor

639389b

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

make fmt

4caf4ab

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

fix make dep

f44349c

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

fix make dep

412c411

Signed-off-by: Nash Luffman <nashluffman@gmail.com> Signed-off-by: mluffman <mluffman@thoughtmachine.net>

makefile dep

961c652

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

Merge branch 'repair-nodes' of https://github.com/Nashluffy/redis-ope…

aac668b

…rator into repair-nodes

nashluffy closed this Oct 16, 2024

nashluffy mentioned this pull request Oct 16, 2024

fix: Attempt to repair disconnected/failed master nodes before failing over #1105

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Attempt to repair disconnected/failed master nodes before failing over#1101

fix: Attempt to repair disconnected/failed master nodes before failing over#1101
nashluffy wants to merge 33 commits intoOT-CONTAINER-KIT:masterfrom
nashluffy:repair-nodes

nashluffy commented Oct 13, 2024

Uh oh!

nashluffy Oct 13, 2024

Uh oh!

nashluffy Oct 13, 2024

Uh oh!

codecov bot commented Oct 14, 2024 •

edited

Loading

Uh oh!

nashluffy commented Oct 16, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

nashluffy commented Oct 13, 2024

Uh oh!

nashluffy Oct 13, 2024

Choose a reason for hiding this comment

Uh oh!

nashluffy Oct 13, 2024

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Oct 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

nashluffy commented Oct 16, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Oct 14, 2024 •

edited

Loading