Proposal: Container Rescheduling #1488
Description
Background
The goal of this proposal is to reschedule containers automatically in case of node failure.
This is currently one of the most requested features for Swarm.
Configuration
The behavior should be user controllable and disabled by default since rescheduling can have nasty effects on stateful containers.
The user can select the policy at run time using the `reschedule` environment variable:

```shell
docker run -e reschedule:on-node-failure redis
```
Possible values for reschedule are:
- no: Never reschedule the container (default).
- on-node-failure: Reschedule the container whenever the node fails.
The reason this is more complicated than a simple yes/no is that in the future we might have more sophisticated rescheduling policies (for instance, we might want to reschedule containers to re-spread or re-pack them). Open question: is this really necessary?
Rescheduling policies will be stored as a container label: com.docker.swarm.reschedule-policy
Persistence
Ideally, Swarm would store all containers (at least those that should be rescheduled) persistently. That way, the manager can figure out which containers are down and take action.
Unfortunately, we currently don't have a shared state and this feature has been postponed because of that for a long time.
Since this is one of the most requested features, I propose we take a different approach until we have shared state (shared state has been postponed over usability concerns - we don't want to make a kv store a dependency for Swarm).
By storing the rescheduling policy as a container label, we are able to reconstruct the desired state at startup time.
Since we are already storing constraints, affinities etc as container labels (exactly for this reason), the manager will have all the information it needs to perform rescheduling.
This means we can restart the manager as much as we want and it will resume rescheduling as expected.
However, the problem arises when a node goes down while the manager is not running: in that case, we won't "remember" that container even existed when the manager is started again.
This situation can be mitigated by replication. The rescheduler would run on the primary manager and, upon failure, the replica elected as the new primary would take over rescheduling responsibilities.
Since every manager is aware of the cluster state (containers & rescheduling policy), it means that as long as at least one manager is still running we won't forget about containers.
Failure detection
This functionality is already provided by cluster/engine.
Engine actively heartbeats nodes in the cluster every X seconds. After Y consecutive failures, the node is marked as unhealthy.
Rescheduling can rely on the health status already available.
Resurrection
Eventually, a node may come back to life and re-join the cluster. If the node has containers that were rescheduled, we will end up with duplicates.
Swarm should monitor incoming nodes and, upon detecting a duplicate container, destroy the oldest one (keeping the most recently created container alive). This behavior could eventually be made configurable by the user (keep oldest, keep newest, ...), although we may want to avoid providing that option until we see a valid use case.
If duplicate containers were started with a --restart option, there will be a small window during which both containers are running at the same time. This can be a serious problem if only one instance of that container is supposed to run at any given time.
We could force all containers that have rescheduling enabled to never restart automatically. In that case, whenever a node joins, Swarm could decide either to start the containers or to destroy them if they are duplicates.
However, there are many drawbacks to this approach:
- Restart policies have to be handled by Swarm. This introduces high complexity since we'd have to re-implement things such as --restart=on-failure:5, which require maintaining lots of state.
- If the manager is down, containers won't start automatically. This is a serious issue since it could lead to outages. Up until now, if Swarm is down the engines continue to operate normally, and this change would break that contract.
- Swarm might miss events, leading to containers not getting properly restarted.
Furthermore, it doesn't actually entirely solve the issue. If the node didn't actually die (e.g. it just froze for a while, there was a netsplit, networking temporarily dropped, ...), we will end up with duplicate containers running for a while anyway.
Given all the potential issues that might arise by handling the restart policy on the Swarm side and the fact that duplicate containers may end up running at the same time anyway, I suggest we do not interfere with --restart and document the fact that rescheduled containers may be running in parallel for a short time window.
Networking
When rescheduling containers, Swarm must handle multi-host networking properly.
The goal is for the new container to take over from the previous one.
In an overlay network setup, this may involve:
- Making sure the new container takes over the IP address of the old container
- Ensuring service discovery works properly
- Cutting the old container off the network before starting the new one. Even though we presume the node to be down, it might still be up and running. Disconnecting the old container would alleviate side effects of duplicate containers.