raft: Use TransferLeadership to make leader demotion safer #1939

LK4D4 merged 1 commit into moby:master
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master   #1939      +/-   ##
==========================================
- Coverage   54.29%   53.99%    -0.31%
==========================================
  Files         108      108
  Lines       18586    18588        +2
==========================================
- Hits        10092    10036       -56
- Misses       7257     7324       +67
+ Partials     1237     1228        -9

Continue to review full report at Codecov.
Note to self: need to confirm this won't cause any problems in mixed-version swarms. I don't think it will, but it's important to be sure.
manager/state/raft/raft.go
Outdated
defer ticker.Stop()
for {
	leader := n.leader()
	if leader != 0 && leader != n.Config.ID {
Replace 0 with raft.None.
Also, shouldn't we wait for the transferee to become leader? Or is it possible that another node becomes leader?
I was thinking that if a different node starts an election at the same time, that one might win.
manager/state/raft/raft.go
Outdated
	}
	select {
	case <-ctx.Done():
		return errors.Wrap(err, "timed out waiting for leadership change")
It might not be a timeout, I think; maybe better to just use errors.Wrap(err, "failed to transfer leadership").
manager/state/raft/raft.go
Outdated
start := time.Now()
log.G(ctx).Infof("raft: transfer leadership %x -> %x", n.Config.ID, transferee)
n.raftNode.TransferLeadership(ctx, n.Config.ID, transferee)
ticker := time.NewTicker(100 * time.Millisecond)
I see that etcd uses TickInterval here. Do you think it makes sense in this case?
Changed to TickInterval / 10 so we don't have to wait a full tick to recognize the leader has changed.
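Putting those suggestions together, a minimal sketch of the wait loop might look like this. It is illustrative only, assuming the n.leader(), n.Config.ID, and n.opts.TickInterval names that appear in the diffs in this thread:

// Poll at a fraction of the tick interval so a leadership change is
// noticed without waiting a full tick.
ticker := time.NewTicker(n.opts.TickInterval / 10)
defer ticker.Stop()

for {
	leader := n.leader()
	// Use raft.None rather than a literal 0. Any leader other than this
	// node counts as success: if a different node started an election at
	// the same time, that node (not the transferee) may have won.
	if leader != raft.None && leader != n.Config.ID {
		break
	}

	select {
	case <-ctx.Done():
		return errors.Wrap(ctx.Err(), "failed to transfer leadership")
	case <-ticker.C:
	}
}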
manager/state/raft/raft.go
Outdated
}

func (n *Node) removeSelfGracefully(ctx context.Context) error {
	transferCtx, cancelTransfer := context.WithTimeout(ctx, 10*time.Second)
It looks a little too hardcoded. Maybe we should derive it from ElectionTick as etcd does?
etcd seems to use this:

// ReqTimeout returns timeout for request to finish.
func (c *ServerConfig) ReqTimeout() time.Duration {
	// 5s for queue waiting, computation and disk IO delay
	// + 2 * election timeout for possible leader election
	return 5*time.Second + 2*time.Duration(c.ElectionTicks)*time.Duration(c.TickMs)*time.Millisecond
}

That seems reasonable, but I'm not sure it's that much better than a simple fixed timeout (it hardcodes 5 seconds, for example). I guess I'll switch to it anyway.
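Translated to swarmkit's settings, the derivation could be sketched as follows; transferTimeout and the parameter names are assumptions based on this discussion, not the exact configuration fields:

package raftutil // hypothetical package, for illustration only

import "time"

// transferTimeout mirrors etcd's ReqTimeout: a fixed allowance for queue
// waiting, computation, and disk I/O, plus two election timeouts in case a
// leader election is needed. electionTick is the number of ticks per
// election timeout; tickInterval is the duration of one tick.
func transferTimeout(electionTick int, tickInterval time.Duration) time.Duration {
	return 5*time.Second + 2*time.Duration(electionTick)*tickInterval
}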
manager/state/raft/raft.go
Outdated
	}

	return ErrCannotRemoveMember

// transferLeadership attempts to transfer leadership to a different node,
Non-blocking: would it make sense to duplicate the stopMu part of this comment up in removeSelfGracefully as well, since removeSelfGracefully doesn't hold stopMu either?
manager/state/raft/raft.go
Outdated
		case <-ticker.C:
		}
	}
	log.G(ctx).Infof("raft: transfer leadership %x -> %x finished in %v", n.Config.ID, transferee, time.Since(start))
Should this be leader instead of transferee, in case some other node started leader election and then won?
Force-pushed from f7b7644 to 4e0b605
Addressed comments, PTAL
@aaronlehmann I've tried TestDemoteToSingleManager, and not a single transfer finished successfully:
manager/state/raft/raft.go
Outdated
	return errors.Wrap(err, "failed to get longest-active member")
}
start := time.Now()
log.G(ctx).Infof("raft: transfer leadership %x -> %x", n.Config.ID, transferee)
This message duplicates one in etcd/raft; not sure if we care.
Tests are passing okay, but transfer leadership doesn't work.
@LK4D4 Ah sorry, yes, thanks for clarifying. I'm seeing the same.
Force-pushed from 4e0b605 to 8ba1a53
Thanks for pointing out that this wasn't working. The first problem here was that the context would be cancelled when the old leader lost the leadership. This caused the function to return an error. I fixed this, but afterwards it turned out that the node would get removed twice: once by its own removal call, and once by the new leader's reconciliation loop. So I'm trying a new, simpler approach. PTAL
...of course the tests consistently pass on my machine :/
Force-pushed from 8ba1a53 to d667824
		case <-ticker.C:
		}
	}
	return nil
Transfer finish is not logged by the raft library :(
It's not, but raft does log when a new leader is elected, so I thought that would be enough.
manager/state/raft/raft.go
Outdated
ErrMemberUnknown = errors.New("raft: member unknown")
// ErrCantRemoveSelf is returned if RemoveMember is called with the
// local node as the argument.
ErrCantRemoveSelf = errors.New("raft: can't remove self")
The error above uses the word Cannot; maybe better to keep it consistent.
manager/role_manager.go
Outdated
if err == raft.ErrCantRemoveSelf {
	// Don't use rmCtx, because we expect to lose
	// leadership, which will cancel this context.
	log.L.Info("demoted; ceding leadership")
I've lived in the USA for almost three years, and only vaguely know the meaning of the word ceding :) Let's use transferring or something more common.
Have minor comments.
Force-pushed from d667824 to 392bdc3
I've made the cosmetic changes. I'm happy to add back the "transferred leadership" log message if you think it makes sense. I had some test failures in CI with an earlier version of this, but I can't reproduce them locally. Can you try running the tests a few times to see if they are stable?
@aaronlehmann Sure, will run tests now.
Force-pushed from 392bdc3 to e05da5a
Added back the transfer timing.

Adding back the
Tests are quite stable for me, no leaks, no races.

I'm rebuilding it a few more times in CI.

I can't get this to fail in CI anymore. Either the initial failures were a fluke, or adding back
manager/role_manager.go
Outdated
	}

-	rmCtx, rmCancel := context.WithTimeout(rm.ctx, 5*time.Second)
+	rmCtx, rmCancel := context.WithTimeout(context.Background(), 5*time.Second)
Non-blocking question: Does this need to be context.Background() if rm.raft.TransferLeadership uses a new context?
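For reference, a condensed sketch of the pattern being asked about, using identifiers from the diffs above (rm, nodeID, and raft.ErrCantRemoveSelf are assumptions taken from this thread, not the exact final code); the question hinges on rm.ctx being cancelled the moment this node stops being leader:

// Derive the removal deadline from context.Background() rather than rm.ctx,
// because rm.ctx is tied to this node's leadership and a successful
// demotion cancels it mid-operation.
rmCtx, rmCancel := context.WithTimeout(context.Background(), 5*time.Second)
defer rmCancel()

if err := rm.raft.RemoveMember(rmCtx, nodeID); err == raft.ErrCantRemoveSelf {
	// Expected when the leader demotes itself: leadership has to be
	// transferred instead, after which the new leader removes this node.
	log.L.Info("demoted; transferring leadership")
}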
assert.Len(t, nodes[4].GetMemberlist(), 4)
}

func TestRaftLeaderLeave(t *testing.T) {
Should this test be modified to just fail with the expected error rather than removed entirely?
Actually even dumber question: does anything besides the tests call Leave on a raft member anymore? (I assume even if not, we'd have to leave it in for backwards compatibility, but it might be nice to comment)
Actually even dumber question: does anything besides the tests call Leave on a raft member anymore?

Nope. Maybe we should remove it.
Nice, I'm not getting any intermittent context exceeded errors in other integration tests anymore. Just have a few questions above, but other than that LGTM.
Force-pushed from e05da5a to ca85e3a
When we demote the leader, we currently wait for all queued messages to be sent, as a best-effort approach to making sure the other nodes find out that the node removal has been committed, and stop treating the current leader as a cluster member. This doesn't work perfectly.

To make this more robust, use TransferLeadership when the leader is trying to remove itself. The new leader's reconciliation loop will kick in and remove the old leader.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed from ca85e3a to 5470e07
I've updated this to work slightly differently. If the leadership transfer times out, we fall back to self-demoting the old way. I've tested this with a 1.14-dev leader in a cluster with a 1.12.6 follower, and demoting the leader works properly (after the timeout elapses). PTAL. Hoping to merge this soon if it looks good.
LGTM! Thank you for tracking down a different unrelated error!
When we demote the leader, we currently wait for all queued messages to be sent, as a best-effort approach to making sure the other nodes find out that the node removal has been committed, and stop treating the current leader as a cluster member. This doesn't work perfectly.

To make this more robust, use TransferLeadership when the leader is trying to remove itself. The new leader's reconciliation loop will kick in and remove the old leader.

cc @LK4D4 @cyli