Skip to content

Don't PublishAllPorts AND specify the same IP bindings, else we get "address already in use" reliably on docker v28 #776

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 8, 2025

Conversation

anoadragon453
Copy link
Member

@anoadragon453 anoadragon453 commented May 7, 2025

Docker v28 refactored the networking code. Complement was driving the networking API incorrectly, by using PublishAllPorts in addition to specifying specific PortBindings. This was because a niche use for Homerunner needed the homeservers to bind to a specific address, see #560 However, Docker was happy with this up until v28 where it would produce "address already in use" errors. This then created unrelated problems because the error occurred during ContainerStart. We never cleaned up the container properly, thus subsequent attempts to create a container with the same name would produce Conflict. The container name "/complement_crypto_dirty_hs1" is already in use by container "ef141595c2188aa4d0ece3aa5f74fdb43a7ba276cf95a00076ad387881d7b24a". You have to remove (or rename) that container to be able to reuse that name


Old theory:

See the explanation in moby/moby#49935.

Also see the Docker v28 release notes:

  • Add a new netlabel com.docker.network.endpoint.ifname to customize the interface name used when connecting a container to a network. It's supported by all built-in network drivers on Linux. moby/moby#49155

    • When a container is created with multiple networks specified, there's no guarantee on the order networks will be connected to the container. So, if a custom interface name uses the same prefix as the auto-generated names, for example eth, the container might fail to start.

    • The recommended practice is to use a different prefix, for example en0, or a numerical suffix high enough to never collide, for example eth100.

    • This label can be specified on docker network connect via the --driver-opt flag, for example:

      docker network connect --driver-opt=com.docker.network.endpoint.ifname=foobar …
    • Or via the long-form --network flag on docker run, for example:

      docker run --network=name=bridge,driver-opt=com.docker.network.endpoint.ifname=foobar …

Closes #775

…ult gateway

These options were introduced in v28. They are ignored by older Docker
client versions.
@anoadragon453 anoadragon453 requested review from kegsay and a team as code owners May 7, 2025 16:42
@anoadragon453
Copy link
Member Author

Next issue to contend with:

CreateDirtyDeployment failed: CreateDirtyDeployment: dirty: failed to list networks. Error response from daemon: client version 1.49 is too new. Maximum supported API version is 1.48

I plan to split the go1.22 -> 1.23 update into a separate PR once CI is passing.

@anoadragon453 anoadragon453 marked this pull request as draft May 7, 2025 17:35
@3nprob 3nprob mentioned this pull request May 7, 2025
1 task
@3nprob
Copy link

3nprob commented May 7, 2025

Next issue to contend with:

CreateDirtyDeployment failed: CreateDirtyDeployment: dirty: failed to list networks. Error response from daemon: client version 1.49 is too new. Maximum supported API version is 1.48

Looks like this is just about matching the minor version:

@kegsay kegsay changed the title Update docker to v28.1.1 and pin network interface name Don't PublishAllPorts AND specify the same IP bindings, else we get "address already in use" reliably on docker v28 May 8, 2025
@kegsay kegsay marked this pull request as ready for review May 8, 2025 08:06
@kegsay kegsay merged commit d2e04c9 into main May 8, 2025
4 checks passed
@kegsay kegsay deleted the anoa/fix_network_interface_flake branch May 8, 2025 08:12
MadLittleMods added a commit that referenced this pull request May 26, 2025
…server `BaseURL`

Regressed in #776
which started using the Docker default (`0.0.0.0`).

This caused downstream issues in some of our out-of-repo Complement tests
which stress some SSO/OIDC login flows and expect to be using the canonical
`public_baseurl` configured in Synapse.
MadLittleMods added a commit that referenced this pull request May 29, 2025
…server `BaseURL` (#781)

Regressed in #776 which started using the Docker default `HostIP` (`0.0.0.0`) because we removed the specific `HSPortBindingIP` port bindings for `8008` and `8448` (see "Root cause" section below).

### Problem

This caused downstream issues in some of our out-of-repo Complement tests (Element internal) which stress some SSO/OIDC login flows with Synapse. 

Before #776, [`SsoRedirectServlet`](https://github.com/element-hq/synapse/blob/33ba8860c43d4770ea119a09a4fcbbf366f3b32e/synapse/rest/client/login.py#L645-L683) received a request like this `http://127.0.0.1:8008/_matrix/client/v3/login/sso/redirect/oidc-test_idp?redirectUrl=http%3A%2F%2Fapp.my-fake-client.io%2Fclient-callback`. But now it's seeing `http://0.0.0.0:8008/_matrix/client/v3/login/sso/redirect/oidc-test_idp?redirectUrl=http%3A%2F%2Fapp.my-fake-client.io%2Fclient-callback` and tries to redirect to the canonical `http://127.0.0.1:8008/` address which would then go to the step we expect in the tests (authorization endpoint).

Basically, the `host` header in the request used to be set to `127.0.0.1:8008` but now it's `0.0.0.0:8008`. Synapse wants to use the canonical URL so the cookies are available so it redirects you to the canonical `public_baseurl` version first. Relevant Synapse code that cares about using the canonical URL to get cookies back -> [`SsoRedirectServlet`](https://github.com/element-hq/synapse/blob/33ba8860c43d4770ea119a09a4fcbbf366f3b32e/synapse/rest/client/login.py#L645-L683)

We could alternatively, update our out-of-repo tests to account for this extra redirect but it seems more appropriate to update Complement to use the the configured `HSPortBindingIP` as expected.


### Root cause

The host name used in the requests during the Complement tests is from the `client.BaseURL` which is sourced from the [Docker container port mapping/bindings](https://github.com/matrix-org/complement/blob/28a09014afd1f9c75a3549b4cd9d8fc480ba1149/internal/docker/deployer.go#L521) in Complement.

In #776, we removed the explicit port bindings for `8008` with `HSPortBindingIP` ([`127.0.0.1`](https://github.com/matrix-org/complement/blob/28a09014afd1f9c75a3549b4cd9d8fc480ba1149/config/config.go#L180)) so now it uses Docker's default `0.0.0.0`.

```patch
diff --git a/internal/docker/deployer.go b/internal/docker/deployer.go
index cdcf51a..a186b30 100644
--- a/internal/docker/deployer.go
+++ b/internal/docker/deployer.go
@@ -382,20 +380,8 @@ func deployImage(
 	}, &container.HostConfig{
 		CapAdd:          []string{"NET_ADMIN"}, // TODO : this should be some sort of option
 		PublishAllPorts: true,
-		PortBindings: nat.PortMap{
-			nat.Port("8008/tcp"): []nat.PortBinding{
-				{
-					HostIP: cfg.HSPortBindingIP,
-				},
-			},
-			nat.Port("8448/tcp"): []nat.PortBinding{
-				{
-					HostIP: cfg.HSPortBindingIP,
-				},
-			},
-		},
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Docker v28 update causes failed to listen on TCP socket: address already in use
3 participants