Avoid returning early on agent join failures by mrjana · Pull Request #1473 · moby/libnetwork

mrjana · 2016-09-27T07:17:01Z

When a gossip join fails, do not return early in the call chain: a join failure is most likely transient, and the retry logic built into networkdb will retry and succeed. Returning early prevents the ingress network/sandbox from being initialized, which causes a problem even after the gossip join succeeds on retry.

Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>

mavenugo · 2016-09-27T14:01:18Z

agent.go

 	if remoteAddr != "" {
 		if err := c.agentJoin(remoteAddr); err != nil {
 			logrus.Errorf("Error in agentJoin : %v", err)
-			return nil


mavenugo: This was not returning the error in the first place (it returned nil), so I am not sure how this will fix the issue mentioned in the description. The only possible effect is that agentInitDone is not reinitialized. Is that the issue you intend to fix?

mrjana: When we return early, agentInitDone is not closed. If it is not closed, the goroutine waiting on it never proceeds to plumb the ingress sandbox.

mavenugo: Got it. I just noticed that even though the networkDB retry logic takes effect, the call to nDB.sendNodeEvent is not part of that retry. I am not sure whether that is required; just sharing my observation.

mrjana: It is not strictly needed, as the other node will be able to work with just a join notification from memberlist. But I think it is still correct to make that fix. Will update the PR.

sanimej · 2016-09-27T15:27:57Z

In what cases do we see a gossip join failure? Since we already got the keys, the gRPC session should be working.

mrjana · 2016-09-27T15:33:00Z

@sanimej When all the nodes in the cluster are restarted at the same time, a daemon that joined the cluster using swarm join might try to join the node that initialized the cluster using swarm init before that node has reached the point where it initializes gossip. In that case this can happen.


sanimej · 2016-09-27T15:47:58Z

LGTM. As a side note, we don't need NodeEventTypeJoin since the memberlist join is sufficient.

mavenugo · 2016-09-27T16:04:05Z

LGTM.

GordonTheTurtle added status/0-triage labels Sep 27, 2016

mavenugo reviewed Sep 27, 2016


mrjana changed the title ~~Avoid propogating agent join failures~~ Avoid returning early on agent join failures Sep 27, 2016

mrjana force-pushed the agent branch from 727b917 to 23a782b September 27, 2016 15:37

mavenugo merged commit f1aa086 into moby:master Sep 27, 2016
