Proposal: Container Rescheduling #1488
Description
Background
The goal of this proposal is to reschedule containers automatically in case of node failure.
This is currently one of the most requested features for Swarm.
Configuration
The behavior should be user controllable and disabled by default since rescheduling can have nasty effects on stateful containers.
The user can select the policy at run time using the `reschedule` environment variable:

```shell
docker run -e reschedule:on-node-failure redis
```
Possible values for reschedule are:
- no: Never reschedule the container (default).
- on-node-failure: Reschedule the container whenever the node fails.
The reason this is more complicated than a simple yes/no is that in the future we might have more sophisticated rescheduling policies (for instance, we might want to reschedule containers to re-spread or re-pack them). Open question: is this really necessary?
Rescheduling policies will be stored as a container label: com.docker.swarm.reschedule-policy
Persistence
Ideally, Swarm would store all containers (at least those that should be rescheduled) persistently. That way, the manager can figure out which containers are down and take action.
Unfortunately, we currently don't have a shared state and this feature has been postponed because of that for a long time.
Since this is one of the most requested features, I propose we take a different approach until we have shared state (shared state has been postponed over usability concerns - we don't want to make a kv store a dependency for Swarm).
By storing the rescheduling policy as a container label, we are able to reconstruct the desired state at startup time.
Since we are already storing constraints, affinities etc as container labels (exactly for this reason), the manager will have all the information it needs to perform rescheduling.
This means we can restart the manager as much as we want and it will resume rescheduling as expected.
However, the problem arises when a node goes down while the manager is not running: in that case, we won't "remember" that container even existed when the manager is started again.
This situation can be mitigated by replication. The rescheduler would run on the primary manager and, upon failure, the replica elected as the new primary would take over rescheduling responsibilities.
Since every manager is aware of the cluster state (containers & rescheduling policy), it means that as long as at least one manager is still running we won't forget about containers.
Failure detection
This functionality is already provided by cluster/engine.
Engine actively heartbeats nodes in the cluster every X seconds. After Y consecutive failures, the node is marked as unhealthy.
Rescheduling can rely on the health status already available.
Resurrection
Eventually, a node may come back to life and re-join the cluster. If the node has containers that were rescheduled, we will end up with duplicates.
Swarm should monitor incoming nodes and, upon detecting a duplicate container, destroy the oldest one (keeping the most recently created container alive). This behavior could eventually be made configurable by the user (keep oldest, keep newest, ...), although we may want to avoid providing that option until we see a valid use case.
If duplicate containers were started with a --restart option, there will be a small window during which both containers are running at the same time. This can be a serious problem if only one instance of that container is supposed to run at any given time.
We could force all containers that have rescheduling enabled to never restart automatically. In that case, whenever a node joins, Swarm could decide either to start the containers or to destroy them if they are duplicates.
However, there are many drawbacks to this approach:
- Restart policies have to be handled by Swarm. This introduces high complexity since we'd have to re-implement things such as --restart=on-failure:5, which require maintaining lots of state.
- If the manager is down, containers won't start automatically. This is a serious issue since it could lead to outages. Up until now, if Swarm is down the engines continue to operate normally, and this change would break that contract.
- Swarm might miss events, leading to containers not getting properly restarted.
Furthermore, it doesn't actually entirely solve the issue. If the node didn't actually die (e.g. it just froze for a while, there was a netsplit, networking temporarily dropped, ...), we will end up with duplicate containers running for a while anyway.
Given all the potential issues that might arise by handling the restart policy on the Swarm side and the fact that duplicate containers may end up running at the same time anyway, I suggest we do not interfere with --restart and document the fact that rescheduled containers may be running in parallel for a short time window.
Networking
When rescheduling containers, Swarm must handle multi-host networking properly.
The goal is for the new container to take over from the previous one.
In an overlay network setup, this may involve:
- Making sure the new container takes over the IP address of the old container
- Ensuring service discovery works properly
- Cutting the old container off the network before starting the new one. Even though we presume the node to be down, it might still be up and running. Disconnecting the old container would alleviate side effects of duplicate containers.