Feature request description
cockpit-podman is plagued with lots of race conditions and flaky tests. I have investigated many of them, but the remaining ones are due to a fundamental issue with the monitoring API.
The UI uses the `libpod/events` API, which notifies about high-level actions such as `start` or `died`, for example:
{"status":"start","id":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","from":"localhost/test-busybox:latest","Type":"container","Action":"start","Actor":{"ID":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","Attributes":{"containerExitCode":"0","image":"localhost/test-busybox:latest","name":"swamped-crate","podId":""}},"scope":"local","time":1688533611,"timeNano":1688533611933738591}
However, this event does not contain most (if any) of the properties that the UI needs to show, so in reaction to these events the UI does a `containers/json` query for that container:
{"method":"GET","path":"/v1.12/libpod/containers/json","body":"","params":{"all":true,"filters":"{\"id\":[\"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2\"]}"}}
which then responds with all the info that the UI needs:
[{"AutoRemove":false,"Command":["sh","-c","echo 123; sleep infinity"],"Created":"2023-07-05T05:06:41.355776664Z","CreatedAt":"","Exited":false,"ExitedAt":1688533611,"ExitCode":0,"Id":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","Image":"localhost/test-busybox:latest","ImageID":"24ac8b76cfb0440579ade1908a8a765d3c8a62bd366058cf84e1a7d6754ee585","IsInfra":false,"Labels":null,"Mounts":[],"Names":["swamped-crate"],"Namespaces":{},"Networks":[],"Pid":19176,"Pod":"","PodName":"","Ports":null,"Size":null,"StartedAt":1688533611,"State":"running","Status":""}]
The problem is that this is racy: the `/containers/json` call is necessarily asynchronous, and when events come in bursts, the queries overlap, and podman's replies do not come back in the same order as the requests. Below is a log capture from a part of a test that performs a few container operations, such as stopping and restarting a container. I stripped out all the JSON data for clarity; the important bit is the ordering:
> debug: podman user call 44:
> debug: podman user call 45:
> debug: podman user call 45 result:
> debug: podman user call 46:
> debug: podman user call 44 result:
> debug: podman user call 47:
> debug: podman user call 46 result:
> debug: podman user call 47 result:
> debug: podman user call 48:
> debug: podman user call 43 result:
> debug: podman user call 49:
> debug: podman user call 49 result:
> debug: podman user call 50:
> debug: podman user call 48 result:
> debug: podman user call 51:
> debug: podman user call 50 result:
> debug: podman user call 52:
> debug: podman user call 52 result:
> debug: podman user call 51 result:
So if the container moves through "Running" → "Exited" → "Stopped" → "Restarting" → "Running", the jumbled reply order can swap intermediate states, and the final state reported in the UI ends up as e.g. "Restarting" or "Exited". The latter happened in this run: the screenshot says "Exited", but `podman ps` says "Up" (i.e. "Running"), as can be seen in the "----- user containers -----" dump in the log.
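To make the failure mode concrete, here is a minimal sketch of the racy pattern in TypeScript. The names (`queryContainer`, `onEvent`) and the trimmed-down `Container` type are illustrative, not the actual cockpit-podman code; the point is that nothing orders one query's reply relative to the replies for earlier events:

```typescript
// Minimal sketch of the racy pattern (illustrative names, not the real
// cockpit-podman code). Each event triggers an independent async query;
// whichever reply resolves *last* wins, regardless of event order.

type Container = { Id: string; State: string /* ...other fields... */ };

const containers = new Map<string, Container>();

// Hypothetical wrapper around GET /libpod/containers/json?filters={"id":[...]}
declare function queryContainer(id: string): Promise<Container>;

function onEvent(event: { Type: string; Actor: { ID: string } }) {
    if (event.Type !== "container")
        return;
    // Fire-and-forget: nothing orders this reply relative to replies
    // for earlier events concerning the same container.
    queryContainer(event.Actor.ID).then(info => {
        // If the reply for an *older* event arrives after the reply for a
        // newer one, this overwrites fresh state with stale state.
        containers.set(info.Id, info);
    });
}
```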
Suggest potential solution
My preferred solution would be to avoid having to call `/containers/json` after a "start" or "rename" event in the first place. That call only generates additional API traffic, and thus more computational overhead on both the podman and the UI side, and is prone to these kinds of race conditions. D-Bus services like systemd or udisks generally solve this with the PropertiesChanged signal: a notification carrying the set of changed properties is emitted whenever something changes. These notifications are naturally ordered correctly, and the watcher can tally them up to always have an accurate model of the state without having to do extra "get" calls.
For the podman API, this cannot just be squeezed into the existing `start` (or `remove`, etc.) events, as container properties can change more often than, and independently of, the coarse-grained lifecycle events.
Perhaps podman could introduce a new event type `changed` that fires whenever any property changes, and deliver the `/containers/json` info for the container(s) which changed (see the sketch below). Both "your" (podman) and "my" (cockpit-podman) sides already have the code to generate/parse this information, so it would just mean some minor plumbing changes.
If this is expensive, it may also be adequate to require listeners to explicitly opt into these notifications, although connecting to `/events` generally already means that the listener wants this kind of information.
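To make the proposal concrete, here is a sketch of how the UI side could consume such an event. The `changed` action and the inline `Container` payload field are assumptions for illustration, not existing podman API; the payload would simply reuse the `/containers/json` schema that both sides already generate and parse:

```typescript
// Hypothetical `changed` event: the "changed" action and the `Container`
// payload field are assumptions for illustration, not existing podman API.

type Container = { Id: string; State: string /* ...other fields... */ };

interface ChangedEvent {
    Type: "container";
    Action: "changed";      // assumed new event type
    Actor: { ID: string };
    Container: Container;   // assumed inline /containers/json entry
}

const containers = new Map<string, Container>();

function onChanged(event: ChangedEvent) {
    // No follow-up query: events arrive in order on the single /events
    // stream, so applying them directly keeps the model consistent.
    containers.set(event.Container.Id, event.Container);
}
```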
Have you considered any alternatives?
It may also be possible to change podman to never reply to requests out of order. I don't know how easy or hard that is with Go and goroutines; I do know that reordering the replies on the JavaScript client side is very hard.
It might be easier on our side to completely serialize all API calls (see the sketch below), but that would make the UI very slow, especially with many containers. The calls are independent of each other, so serializing them is not conceptually necessary.
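For illustration, full serialization on the client side could be as simple as a promise chain; this is a sketch, with `queryContainer` again being a hypothetical wrapper around `/containers/json`. It restores ordering, but throughput drops to one round trip per update:

```typescript
// Sketch of full client-side serialization: a promise chain that issues
// at most one request at a time. Ordering is correct, but with many
// containers every update waits for all previously queued round trips.

declare function queryContainer(id: string): Promise<unknown>; // hypothetical

let queue: Promise<unknown> = Promise.resolve();

function queryContainerSerialized(id: string): Promise<unknown> {
    const next = queue.then(() => queryContainer(id));
    // Keep the chain alive even if one request fails.
    queue = next.catch(() => undefined);
    return next;
}
```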
Additional context
No response