
fix: Attempt to repair disconnected/failed master nodes before failing over #1105

Merged
drivebyer merged 9 commits into OT-CONTAINER-KIT:master from nashluffy:repair-leaders
Oct 18, 2024
Conversation

@nashluffy
Contributor

@nashluffy nashluffy commented Oct 16, 2024

Ended up closing #1101 as I kept forgetting to sign commits, and the git history was getting out of control from repeated rebases.

Description

Fixes #1100

As stated in the above issue, a cluster left with unhealthy leaders after being scaled down to zero nodes can be recovered without issuing a failover (which causes data loss).

The failed/disconnected nodes simply need their addresses updated with the IPs of the new leader pods. CLUSTER MEET maps the specified address onto the existing host & port entry, so we don't need to wipe the master and start afresh. If this fails, we fall back to the failover.
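The repair step can be sketched roughly as follows. This is a minimal, self-contained Go sketch, not the operator's actual code: the `clusterNode` type, the `repairCommands` helper, and the sample IDs/IPs are all hypothetical stand-ins for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNode is a hypothetical stand-in for one entry of the
// CLUSTER NODES reply; field names are illustrative only.
type clusterNode struct {
	ID    string
	Addr  string // host:port as last recorded by the cluster
	Flags string // e.g. "master,fail" or "master"
}

// failedOrDisconnected reports whether a node needs repair. In the real
// CLUSTER NODES reply, "fail"/"fail?" appear in the flags field and
// "disconnected" in the link-state field; this sketch folds them together.
func failedOrDisconnected(n clusterNode) bool {
	return strings.Contains(n.Flags, "fail") || strings.Contains(n.Flags, "disconnected")
}

// repairCommands builds the CLUSTER MEET commands that re-announce the new
// pod IP for each failed/disconnected master. Because CLUSTER MEET against
// an already-known node updates its address in place, slot ownership (and
// therefore data) is preserved instead of being lost to a failover.
func repairCommands(nodes []clusterNode, newIP map[string]string) []string {
	var cmds []string
	for _, n := range nodes {
		if !failedOrDisconnected(n) {
			continue
		}
		if ip, ok := newIP[n.ID]; ok {
			cmds = append(cmds, fmt.Sprintf("CLUSTER MEET %s 6379", ip))
		}
	}
	return cmds
}

func main() {
	nodes := []clusterNode{
		{ID: "a1", Addr: "10.0.0.3:6379", Flags: "master,fail"},
		{ID: "b2", Addr: "10.0.0.4:6379", Flags: "master"},
	}
	// a1's pod came back with a new IP; b2 is healthy and untouched.
	for _, c := range repairCommands(nodes, map[string]string{"a1": "10.0.0.9"}) {
		fmt.Println(c)
	}
}
```

Only if commands like these fail to bring the masters back does the operator fall through to the existing failover path.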

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

This is a best-effort attempt.

Checklist

  • Tests have been added/modified and all tests pass.
  • Functionality/bugs have been confirmed to be unchanged or fixed.
  • I have performed a self-review of my own code.
  • Documentation has been updated or added where necessary.

If the new strategy fails, failover still works as expected:

{"level":"info","ts":"2024-10-13T19:26:29Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:41Z","logger":"controllers.RedisCluster","msg":"unhealthy nodes exist after attempting to repair disconnected masters; starting failover","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:52Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Creating redis cluster by executing cluster creation commands","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Not all leader are part of the cluster...","Request.Namespace":"default","Request.Name":"redis-cluster","Leaders.Count":1,"Instance.Size":3}
{"level":"info","ts":"2024-10-13T19:27:58Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:27:59Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

And the new strategy works as expected:

# existing data
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

# scale down statefulset & operator
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=0
deployment.apps/redis-operator-redis-operator scaled
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=0
statefulset.apps/redis-cluster-leader scaled

# scale back up
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=3
statefulset.apps/redis-cluster-leader scaled
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=1
deployment.apps/redis-operator-redis-operator scaled

# observe data persisted
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

Corresponding logs:

{"level":"info","ts":"2024-10-13T20:04:11Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:19Z","logger":"controllers.RedisCluster","msg":"repairing unhealthy masters successful, no unhealthy masters left","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:49Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:50Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

Additional Context

There's a small bit of refactoring too:

  • define a type for the CLUSTER NODES response, for a bit more safety in helper functions
  • extract some common functionality operating on that type, i.e. nodeFailedOrDisconnected and nodeIsOfType
  • rename some functions to better reflect what they do
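The refactor described above might look roughly like this. The helper names (nodeFailedOrDisconnected, nodeIsOfType) are taken from the PR description, but the struct layout and parsing below are a sketch based on the documented CLUSTER NODES reply format, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// redisClusterNode models one line of the CLUSTER NODES reply.
// Field names are illustrative; the type in the PR may differ.
type redisClusterNode struct {
	ID        string
	Addr      string
	Flags     []string // e.g. ["master"] or ["master", "fail"]
	LinkState string   // "connected" or "disconnected"
}

// parseClusterNode parses a single CLUSTER NODES line. The reply's
// whitespace-separated fields are: id, addr, flags, master-id,
// ping-sent, pong-recv, config-epoch, link-state, slots...
func parseClusterNode(line string) redisClusterNode {
	f := strings.Fields(line)
	return redisClusterNode{
		ID:        f[0],
		Addr:      f[1],
		Flags:     strings.Split(f[2], ","),
		LinkState: f[7],
	}
}

// nodeIsOfType reports whether the node carries the given flag.
func nodeIsOfType(n redisClusterNode, t string) bool {
	for _, fl := range n.Flags {
		if fl == t {
			return true
		}
	}
	return false
}

// nodeFailedOrDisconnected reports whether the node is a repair candidate:
// flagged fail/fail?, or with a disconnected cluster-bus link.
func nodeFailedOrDisconnected(n redisClusterNode) bool {
	return nodeIsOfType(n, "fail") || nodeIsOfType(n, "fail?") || n.LinkState == "disconnected"
}

func main() {
	n := parseClusterNode("07c3 10.244.1.5:6379@16379 master,fail - 0 0 2 disconnected 5461-10922")
	fmt.Println(nodeIsOfType(n, "master"), nodeFailedOrDisconnected(n))
}
```

Centralizing these predicates on one typed representation means the controller and the helpers can't disagree about what "unhealthy" means.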

Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
@codecov

codecov bot commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 34.04255% with 62 lines in your changes missing coverage. Please review.

Project coverage is 44.34%. Comparing base (d121d86) to head (628d1b2).
Report is 121 commits behind head on master.

Files with missing lines Patch % Lines
pkg/k8sutils/redis.go 47.69% 30 Missing and 4 partials ⚠️
...ontrollers/rediscluster/rediscluster_controller.go 3.44% 28 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1105      +/-   ##
==========================================
+ Coverage   35.20%   44.34%   +9.14%     
==========================================
  Files          19       20       +1     
  Lines        3213     3412     +199     
==========================================
+ Hits         1131     1513     +382     
+ Misses       2015     1813     -202     
- Partials       67       86      +19     


@nashluffy
Contributor Author

@drivebyer do you mind having a look please?

@drivebyer
Collaborator

@drivebyer do you mind having a look please?

Sure, I would add some end-to-end tests to improve this fix.

Signed-off-by: drivebyer <wuyangmuc@gmail.com>
Signed-off-by: drivebyer <wuyangmuc@gmail.com>
Signed-off-by: drivebyer <wuyangmuc@gmail.com>
@nashluffy
Contributor Author

nashluffy commented Oct 17, 2024

@drivebyer do you mind having a look please?

Sure, I would add some end-to-end tests to improve this fix.

Oh, thanks for adding them! Was just getting around to it :) I've had a hell of a time battling flakes on the e2e tests

Signed-off-by: drivebyer <wuyangmuc@gmail.com>
Signed-off-by: drivebyer <wuyangmuc@gmail.com>
Signed-off-by: drivebyer <wuyangmuc@gmail.com>
Signed-off-by: drivebyer <wuyangmuc@gmail.com>
@drivebyer drivebyer merged commit de6b066 into OT-CONTAINER-KIT:master Oct 18, 2024
@nashluffy
Contributor Author

Thanks for your help @drivebyer! Is there a planned release coming soon so I can pick up this change? Looks like the last release was in July of this year.

@drivebyer
Collaborator

Thanks for your help @drivebyer! Is there planned release coming soon so I can pick up this change?

I’m not sure about the exact timing of the next release. If you’re in a hurry, you could build your own image using the Dockerfile from this link: https://github.com/OT-CONTAINER-KIT/redis-operator/blob/master/Dockerfile.



Development

Successfully merging this pull request may close these issues.

rediscluster failing to recover when scaling up from zero nodes
