fix: Attempt to repair disconnected/failed master nodes before failing over by nashluffy · Pull Request #1105 · OT-CONTAINER-KIT/redis-operator

nashluffy · 2024-10-16T07:50:11Z

Ended up closing #1101, as I kept forgetting to sign commits and the git history was getting out of control from having to rebase.

Description

Fixes #1100

As stated in the above issue, a cluster whose leaders are unhealthy as a result of being scaled to zero nodes can be recovered without issuing a failover (which leads to data loss).

The failed/disconnected nodes simply need to have their addresses updated with the IPs of the new leader pods. CLUSTER MEET is able to map the address specified to the existing host & port, meaning we don't need to wipe the master & start afresh. If this fails, we fall back to the failover.
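
The repair-then-fall-back flow described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `leaderRepairer` interface, `repairOrFailover`, and the in-memory `fakeCluster` are hypothetical names introduced here so the control flow can run without a real cluster.

```go
package main

import (
	"errors"
	"fmt"
)

// leaderRepairer is a hypothetical abstraction for this sketch; it is not
// the operator's real API.
type leaderRepairer interface {
	// UnhealthyLeaders maps pod name -> current pod IP for leaders that
	// CLUSTER NODES reports as failed or disconnected.
	UnhealthyLeaders() (map[string]string, error)
	// Meet stands in for issuing CLUSTER MEET <ip> <port>.
	Meet(ip string, port int) error
	// Failover stands in for the existing, destructive recovery path.
	Failover() error
}

// repairOrFailover tries the non-destructive repair first and only falls
// back to failover if some leader is still unhealthy afterwards.
func repairOrFailover(r leaderRepairer, port int) (failedOver bool, err error) {
	unhealthy, err := r.UnhealthyLeaders()
	if err != nil {
		return false, err
	}
	for _, ip := range unhealthy {
		if err := r.Meet(ip, port); err != nil {
			break // best effort: the check below decides what happens next
		}
	}
	remaining, err := r.UnhealthyLeaders()
	if err != nil {
		return false, err
	}
	if len(remaining) == 0 {
		return false, nil
	}
	return true, r.Failover()
}

// fakeCluster is an in-memory stand-in used only to exercise the flow.
type fakeCluster struct {
	unhealthy  map[string]string
	meetBroken bool
	failedOver bool
}

func (f *fakeCluster) UnhealthyLeaders() (map[string]string, error) {
	out := make(map[string]string, len(f.unhealthy))
	for name, ip := range f.unhealthy {
		out[name] = ip
	}
	return out, nil
}

func (f *fakeCluster) Meet(ip string, port int) error {
	if f.meetBroken {
		return errors.New("CLUSTER MEET failed")
	}
	for name, addr := range f.unhealthy {
		if addr == ip {
			delete(f.unhealthy, name) // the node is reachable again
		}
	}
	return nil
}

func (f *fakeCluster) Failover() error {
	f.failedOver = true
	f.unhealthy = map[string]string{}
	return nil
}

func main() {
	c := &fakeCluster{unhealthy: map[string]string{"redis-cluster-leader-0": "10.0.0.5"}}
	failedOver, err := repairOrFailover(c, 6379)
	fmt.Println("failed over:", failedOver, "error:", err)
}
```

The re-check after the MEET attempts is what keeps the repair best-effort: any error or leftover unhealthy leader simply routes to the existing failover path.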

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)

This is a best-effort attempt

Checklist

Tests have been added/modified and all tests pass.
Functionality/bugs have been confirmed to be unchanged or fixed.
I have performed a self-review of my own code.
Documentation has been updated or added where necessary.

If the new strategy fails, failover still works as expected:

{"level":"info","ts":"2024-10-13T19:26:29Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:41Z","logger":"controllers.RedisCluster","msg":"unhealthy nodes exist after attempting to repair disconnected masters; starting failover","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:52Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Creating redis cluster by executing cluster creation commands","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Not all leader are part of the cluster...","Request.Namespace":"default","Request.Name":"redis-cluster","Leaders.Count":1,"Instance.Size":3}
{"level":"info","ts":"2024-10-13T19:27:58Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:27:59Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

And the new strategy working as expected:

# existing data
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

# scale down statefulset & operator
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=0
deployment.apps/redis-operator-redis-operator scaled
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=0
statefulset.apps/redis-cluster-leader scaled

# scale back up
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=3
statefulset.apps/redis-cluster-leader scaled
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=1
deployment.apps/redis-operator-redis-operator scaled

# observe data persisted
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

Corresponding logs:

{"level":"info","ts":"2024-10-13T20:04:11Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:19Z","logger":"controllers.RedisCluster","msg":"repairing unhealthy masters successful, no unhealthy masters left","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:49Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:50Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

Additional Context

There's a small bit of refactoring too

define a type for the CLUSTER NODES response for a bit more safety in helper functions
extract some common functionality that operates on the above type, i.e. nodeFailedOrDisconnected and nodeIsOfType
rename some functions to better represent what they do
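
As a rough illustration of the refactoring listed above, helpers over a typed CLUSTER NODES response might look like the sketch below. Only the helper names `nodeFailedOrDisconnected` and `nodeIsOfType` come from this PR; the `clusterNode` struct, `parseClusterNodes`, the signatures, and the sample output (made-up node IDs and IPs) are assumptions for illustration. Treating the `fail?` flag as failed is also a choice made in this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNode holds the CLUSTER NODES fields this sketch needs.
// The struct layout is an assumption, not the type defined in the PR.
type clusterNode struct {
	ID        string
	Addr      string   // <ip:port@cport[,hostname]>
	Flags     []string // e.g. myself, master, slave, fail
	LinkState string   // connected or disconnected
}

// parseClusterNodes splits raw CLUSTER NODES output into typed nodes.
func parseClusterNodes(raw string) ([]clusterNode, error) {
	var nodes []clusterNode
	for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		if len(fields) < 8 {
			return nil, fmt.Errorf("malformed CLUSTER NODES line: %q", line)
		}
		nodes = append(nodes, clusterNode{
			ID:        fields[0],
			Addr:      fields[1],
			Flags:     strings.Split(fields[2], ","),
			LinkState: fields[7],
		})
	}
	return nodes, nil
}

// nodeIsOfType reports whether the node carries the given role flag.
func nodeIsOfType(n clusterNode, nodeType string) bool {
	for _, f := range n.Flags {
		if f == nodeType {
			return true
		}
	}
	return false
}

// nodeFailedOrDisconnected reports whether the node is flagged as failed
// (fail or fail?) or its link state is disconnected.
func nodeFailedOrDisconnected(n clusterNode) bool {
	for _, f := range n.Flags {
		if f == "fail" || f == "fail?" {
			return true
		}
	}
	return n.LinkState == "disconnected"
}

// sampleClusterNodes is made-up output used only to exercise the helpers.
const sampleClusterNodes = `aaaa0001 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-5460
aaaa0002 10.0.0.2:6379@16379 master,fail - 1728849600000 1728849590000 2 disconnected 5461-10922
aaaa0003 10.0.0.3:6379@16379 slave aaaa0001 0 1728849600000 1 connected
`

func main() {
	nodes, err := parseClusterNodes(sampleClusterNodes)
	if err != nil {
		panic(err)
	}
	for _, n := range nodes {
		if nodeIsOfType(n, "master") && nodeFailedOrDisconnected(n) {
			fmt.Println("needs repair:", n.ID, n.Addr)
		}
	}
}
```

Matching flags exactly rather than by substring avoids treating a flag such as `nofailover` as a failure.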

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

codecov · 2024-10-16T08:00:54Z

Codecov Report

Attention: Patch coverage is 34.04255% with 62 lines in your changes missing coverage. Please review.

Project coverage is 44.34%. Comparing base (d121d86) to head (628d1b2).
Report is 121 commits behind head on master.

Files with missing lines	Patch %	Lines
pkg/k8sutils/redis.go	47.69%	30 Missing and 4 partials ⚠️
...ontrollers/rediscluster/rediscluster_controller.go	3.44%	28 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1105      +/-   ##
==========================================
+ Coverage   35.20%   44.34%   +9.14%     
==========================================
  Files          19       20       +1     
  Lines        3213     3412     +199     
==========================================
+ Hits         1131     1513     +382     
+ Misses       2015     1813     -202     
- Partials       67       86      +19


nashluffy · 2024-10-17T08:34:04Z

@drivebyer do you mind having a look please?

drivebyer · 2024-10-17T09:11:03Z

> @drivebyer do you mind having a look please?

Sure, I would add some end-to-end tests to improve this fix.

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

nashluffy · 2024-10-17T13:45:17Z

> > @drivebyer do you mind having a look please?
>
> Sure, I would add some end-to-end tests to improve this fix.

Oh, thanks for adding them! Was just getting around to it :) I've had a hell of a time battling flakes on the e2e tests

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

nashluffy · 2024-10-21T10:22:38Z

Thanks for your help @drivebyer! Is there a planned release coming soon so I can pick up this change? Looks like the last release was in July of this year.

drivebyer · 2024-10-22T02:14:39Z

> Thanks for your help @drivebyer! Is there planned release coming soon so I can pick up this change?

I’m not sure about the exact timing of the next release. If you’re in a hurry, you could build your own image using the Dockerfile from this link: https://github.com/OT-CONTAINER-KIT/redis-operator/blob/master/Dockerfile.

add support for repairing leaders

1352f28

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

nashluffy requested a review from iamabhishek-dubey as a code owner October 16, 2024 07:50

add make lint directive

a836b62

Signed-off-by: mluffman <mluffman@thoughtmachine.net>

add test

89b3b52

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

drivebyer force-pushed the repair-leaders branch from 0a01dc7 to 89b3b52 Compare October 17, 2024 10:21

drivebyer added 2 commits October 17, 2024 18:24

fix lint

d45e154

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

fix lint

1aa6b66

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

drivebyer added 4 commits October 17, 2024 22:23

chainsaw

aea0f30

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

update

7d34dcf

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

update

7e37593

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

no parallel

628d1b2

Signed-off-by: drivebyer <wuyangmuc@gmail.com>

drivebyer merged commit de6b066 into OT-CONTAINER-KIT:master Oct 18, 2024

dimpavloff mentioned this pull request Jul 1, 2025

feat: RedisCluster controller: attempt to repair disconnected nodes whenever detected #1426

Merged

4 tasks

drivebyer merged 9 commits into OT-CONTAINER-KIT:master from nashluffy:repair-leaders
