Skip to content

fix: Attempt to repair disconnected/failed master nodes before failing over#1101

Closed
nashluffy wants to merge 33 commits intoOT-CONTAINER-KIT:masterfrom
nashluffy:repair-nodes
Closed

fix: Attempt to repair disconnected/failed master nodes before failing over#1101
nashluffy wants to merge 33 commits intoOT-CONTAINER-KIT:masterfrom
nashluffy:repair-nodes

Conversation

@nashluffy
Copy link
Copy Markdown
Contributor

Description

Fixes #1100

As stated in the above issue, a cluster that has unhealthy leaders as the result of being scaled to zero nodes can be recovered from without having to issue a failover (which leads to data loss).

The failed/disconnected nodes simply need to have their address updated with the IP of the new leader pods. CLUSTER MEET is able to map the address specified to the existing host & port, meaning we don't need to wipe the master & start afresh. If this fails, we fall back to the failover.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

This is a best-effort attempt

Checklist

  • Tests have been added/modified and all tests pass.
  • Functionality/bugs have been confirmed to be unchanged or fixed.
  • I have performed a self-review of my own code.
  • Documentation has been updated or added where necessary.

If new strategy fails, failover still works as expected

{"level":"info","ts":"2024-10-13T19:26:29Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:41Z","logger":"controllers.RedisCluster","msg":"unhealthy nodes exist after attempting to repair disconnected masters; starting failover","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:52Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Creating redis cluster by executing cluster creation commands","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Not all leader are part of the cluster...","Request.Namespace":"default","Request.Name":"redis-cluster","Leaders.Count":1,"Instance.Size":3}
{"level":"info","ts":"2024-10-13T19:27:58Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:27:59Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

and new strategy working as expected

# existing data
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

# scale down statefulset & operator
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=0
deployment.apps/redis-operator-redis-operator scaled
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=0
statefulset.apps/redis-cluster-leader scaled

# scale back up
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=3
statefulset.apps/redis-cluster-leader scaled
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=1
deployment.apps/redis-operator-redis-operator scaled

# observe data persisted
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

corresponding logs

{"level":"info","ts":"2024-10-13T20:04:11Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:19Z","logger":"controllers.RedisCluster","msg":"repairing unhealthy masters successful, no unhealthy masters left","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:49Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:50Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

Additional Context

There's a small bit of refactoring too

  • define type for CLUSTER NODES response for a bit more safety on helper functions
  • extract some common functionality that operates on above type, ie nodeFailedOrDisconnected and nodeIsOfType
  • renames some functions to better represent what they do

@@ -1,5 +1,5 @@
# Build the manager binary
FROM golang:1.21 as builder
FROM golang:1.21 AS builder
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on make docker-build I saw a warning about inconsistent casing in the dockerfile, just a drive-by

reqLogger.Error(err, "failed to repair disconnected masters")
}

err := retry.Do(func() error {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can take a few seconds after issuing CLUSTER MEET for it to be reflected, so we retry 3 times with a 5 second back-off before proceeding to start failover

@nashluffy nashluffy changed the title Attempt to repair disconnected/failed master nodes before failing over fix: Attempt to repair disconnected/failed master nodes before failing over Oct 14, 2024
nashluffy and others added 8 commits October 14, 2024 16:40
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: drivebyer <wuyangmuc@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
…-KIT#1102)

* update controller-gen to fix make manifests, update Makefile

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* update envtest version

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* fix path

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* add kind as dependency

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

---------

Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
@codecov
Copy link
Copy Markdown

codecov bot commented Oct 14, 2024

Codecov Report

Attention: Patch coverage is 40.25974% with 46 lines in your changes missing coverage. Please review.

Project coverage is 45.63%. Comparing base (d121d86) to head (aac668b).
Report is 120 commits behind head on master.

Files with missing lines Patch % Lines
pkg/k8sutils/redis.go 54.38% 22 Missing and 4 partials ⚠️
...ontrollers/rediscluster/rediscluster_controller.go 0.00% 20 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #1101       +/-   ##
===========================================
+ Coverage   35.20%   45.63%   +10.43%     
===========================================
  Files          19       20        +1     
  Lines        3213     2763      -450     
===========================================
+ Hits         1131     1261      +130     
+ Misses       2015     1416      -599     
- Partials       67       86       +19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mluffman and others added 14 commits October 15, 2024 11:12
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: drivebyer <wuyangmuc@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
nashluffy and others added 10 commits October 16, 2024 08:29
…-KIT#1102)

* update controller-gen to fix make manifests, update Makefile

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* update envtest version

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* fix path

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* add kind as dependency

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

---------

Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
…-KIT#1102)

* update controller-gen to fix make manifests, update Makefile

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* update envtest version

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* fix path

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

* add kind as dependency

Signed-off-by: Nash Luffman <nashluffman@gmail.com>

---------

Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: Nash Luffman <nashluffman@gmail.com>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
Signed-off-by: mluffman <mluffman@thoughtmachine.net>
@nashluffy
Copy link
Copy Markdown
Contributor Author

closing in favour of a less messy PR where I don't forget to sign-off the commits :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rediscluster failing to recover when scaling up from zero nodes

2 participants