@kolyshkin
Contributor

@kolyshkin kolyshkin commented Apr 16, 2025

  1. We are seeing a ton of flakes on the almalinux-8 CI job, all caused by criu's inability to freeze a cgroup. This was worked around in criu (Freeze fixes and v1 kludges checkpoint-restore/criu#2545), but obviously we can't rely on a distro vendor to update the package.

Let's use a copr (thanks to @adrianreber!)

Fixes: #4273

  2. ssh-keygen stopped working in AlmaLinux 8, fix this as well (see commit for details).

Fixes: #4731

@kolyshkin kolyshkin force-pushed the ci-criu branch 10 times, most recently from c08c967 to 33f6034 Compare April 16, 2025 21:10
@kolyshkin kolyshkin marked this pull request as ready for review April 16, 2025 21:24
@rata
Member

rata commented Apr 17, 2025

This LGTM, but why did it start happening now? A similar thing is happening on the containerd repo with criu now.

Is there a kernel issue that causes this?

Ideally, criu's CI would catch these things and fix them before we even notice. I don't know what is causing it. My guess is a kernel update, in which case it would be great if something like Debian's proposed-updates could be used in criu CI to fix it before it hits everyone else's CI.

Member

@rata rata left a comment

LGTM

@rata
Member

rata commented Apr 17, 2025

It's failing due to another criu failure, the ppa for ubuntu :(

Do we want to change our CI to test criu in non-blocking jobs only or something like that?

@kolyshkin
Contributor Author

> This LGTM, but why did it start happening now? A similar thing is happening on the containerd repo with criu now.
>
> Is there a kernel issue that causes this?
>
> Ideally, criu's CI would catch these things and fix them before we even notice. I don't know what is causing it. My guess is a kernel update, in which case it would be great if something like Debian's proposed-updates could be used in criu CI to fix it before it hits everyone else's CI.

cgroup v1 freezer was always unreliable. I very much hope we'll drop v1 support in a few years.

@kolyshkin kolyshkin force-pushed the ci-criu branch 7 times, most recently from 8d8682f to f84c164 Compare April 17, 2025 22:56
@kolyshkin
Contributor Author

> It's failing due to another criu failure, the ppa for ubuntu :(

This is a launchpad failure. Heck, even github itself is not available at times from the GHA.

But it looks like we found another issue with CRIU v4.1: #4729

We are seeing a ton of flakes on almalinux-8 CI job, all caused by criu's
inability to freeze a cgroup. This was worked around in criu [1], but
obviously we can't rely on a distro vendor to update the package.

Let's use a copr (thanks to Adrian Reber!)

[1]: checkpoint-restore/criu#2545

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin force-pushed the ci-criu branch 2 times, most recently from 5164096 to 58d5ccb Compare April 17, 2025 23:11
For some reason, ssh-keygen is unable to write to /root even as root on
AlmaLinux 8:

	# id
	uid=0(root) gid=0(root) groups=0(root) context=system_u:system_r:initrc_t:s0
	# id -Z
	ls -ld /root
	# ssh-keygen -t ecdsa -N "" -f /root/rootless.key || cat /var/log/audit/audit.log
	Saving key "/root/rootless.key" failed: Permission denied

The audit.log shows:

> type=AVC msg=audit(1744834995.352:546): avc:  denied  { dac_override } for  pid=13471 comm="ssh-keygen" capability=1  scontext=system_u:system_r:ssh_keygen_t:s0 tcontext=system_u:system_r:ssh_keygen_t:s0 tclass=capability permissive=0
> type=SYSCALL msg=audit(1744834995.352:546): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=5641c7587520 a2=241 a3=180 items=0 ppid=4978 pid=13471 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ssh-keygen" exe="/usr/bin/ssh-keygen" subj=system_u:system_r:ssh_keygen_t:s0 key=(null)␝ARCH=x86_64 SYSCALL=openat AUID="unset" UID="root" GID="root" EUID="root" SUID="root" FSUID="root" EGID="root" SGID="root" FSGID="root"

A workaround is to use /root/.ssh directory instead of just /root.

While at it, let's unify rootless user and key setup into a single place.

Signed-off-by: Kir Kolyshkin <[email protected]>
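
For readers triaging similar failures, the AVC record quoted above can be reduced to the two fields that matter here: the command that was denied and the permission it lacked. The following is a minimal sketch, not part of the commit; `avc_denials` is a hypothetical helper name, and it assumes the record layout shown in the quoted audit.log excerpt.

```shell
#!/bin/bash
# Sketch only: avc_denials is a hypothetical helper, not part of runc or
# the audit tooling. It assumes the AVC record layout quoted above.
#
# Reads audit records on stdin and prints "<comm> <permission>" for each
# AVC denial. Records without an "avc: denied" part (such as the SYSCALL
# record) are skipped.
avc_denials() {
	sed -n 's/.*avc:  *denied  *{ \([^}]*\) }.* comm="\([^"]*\)".*/\2 \1/p'
}
```

Fed the log from the commit message (for example, `avc_denials </var/log/audit/audit.log` as root), this would report `ssh-keygen dac_override`. The denied capability is `dac_override`, which is what root normally relies on to bypass file permission bits, so the denial is consistent with the "Permission denied" error even for uid 0.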
@kolyshkin kolyshkin changed the title ci: install newer criu for almalinux-8 ci fixes (ssh-keygen and criu version bump for almalinux 8) Apr 17, 2025
@kolyshkin kolyshkin added the area/ci, backport/1.2-todo, and backport/1.3-todo labels Apr 18, 2025
@kolyshkin
Contributor Author

CI is flaking because of launchpad.net criu repo error.

Hope that's temporary -- if not, we can switch to https://build.opensuse.org/project/show/devel:tools:criu (which I think we did in the past).

Unfortunately there are no official criu packages for Ubuntu 24.04 😕 (checkpoint-restore/criu#2404, https://bugs.launchpad.net/ubuntu/+source/criu/+bug/2066148).

# shellcheck disable=SC2174 # Silence "-m only applies to the deepest directory".
mkdir -p -m 0700 "$HOME/.ssh"
ssh-keygen -t ecdsa -N "" -f "$HOME/.ssh/rootless.key"
sudo mkdir -p -m 0700 /home/rootless/.ssh
Member

Do we really need sudo here and in the next lines? I know it doesn't hurt to keep it, I just want to know the reason.

Contributor Author

@kolyshkin kolyshkin Apr 19, 2025

Yes, because this is not always called from a root user.

There are three users of this script -- two (the .cirrus.yml job and script/setup_host_fedora, which itself is called from GHA under sudo) are running as root, and the third one (the add rootless user step in .github/workflows/test.yml) is not running as root.

First I tried calling the script itself via sudo, but the third job above actually allows the default GHA user ("runner") to do ssh root@localhost ssh rootless@localhost, and if we run it as root it won't achieve this result. This is also the reason why I use $HOME here.

Contributor Author

This is also the reason why the script itself says

# Allow both the current user and rootless itself to use
# ssh rootless@localhost in tests/rootless.sh.

instead of the older (and, for the add rootless user step, incorrect) message saying:

# Allow root and rootless itself to execute `ssh rootless@localhost` in tests/rootless.sh
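
To make the reasoning in this thread concrete, here is a minimal sketch of the flow being discussed: the key pair is created for whoever calls the script (hence `$HOME`), while the steps that touch the rootless account's home go through an optional privilege wrapper (hence `sudo`). This is an illustration under stated assumptions, not the script from the PR; `setup_rootless_key` and its three parameters are hypothetical names.

```shell
#!/bin/bash
# Sketch only: setup_rootless_key and its arguments are hypothetical and
# are not the actual runc CI script.
#
#   $1  directory for the caller's key pair (e.g. "$HOME/.ssh")
#   $2  .ssh directory of the rootless account (e.g. /home/rootless/.ssh)
#   $3  optional privilege wrapper such as "sudo"; leave empty when the
#       caller can already write to $2
setup_rootless_key() {
	local key_dir="$1" target_dir="$2" priv="${3:-}"

	# Create the key under the caller's own .ssh directory rather than
	# directly under the home directory (the AlmaLinux 8 workaround).
	# shellcheck disable=SC2174 # -m only applies to the deepest directory.
	mkdir -p -m 0700 "$key_dir"
	ssh-keygen -q -t ecdsa -N "" -f "$key_dir/rootless.key"

	# Authorize the caller's public key for the rootless account. These
	# steps write to another user's home, so they use the wrapper.
	# shellcheck disable=SC2174
	$priv mkdir -p -m 0700 "$target_dir"
	$priv tee -a "$target_dir/authorized_keys" <"$key_dir/rootless.key.pub" >/dev/null
	$priv chmod 0600 "$target_dir/authorized_keys"
}
```

A non-root caller such as the GHA "runner" user would invoke it as `setup_rootless_key "$HOME/.ssh" /home/rootless/.ssh sudo`, while a root caller can omit the third argument. A real setup would also need to hand ownership of the target directory to the rootless account, which this sketch leaves out.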

@kolyshkin kolyshkin requested a review from lifubang April 20, 2025 19:40
@lifubang lifubang merged commit eeae96b into opencontainers:main Apr 21, 2025
34 checks passed
@kolyshkin
Contributor Author

1.3 backport: #4737

@kolyshkin kolyshkin added the backport/1.3-done label and removed the backport/1.3-todo label Apr 22, 2025
@kolyshkin
Contributor Author

1.2 backport: #4742

@kolyshkin kolyshkin added the backport/1.2-done label and removed the backport/1.2-todo label Apr 23, 2025

Development

Successfully merging this pull request may close these issues.

[CI] Saving key "/root/rootless.key" failed: Permission denied
flaky tests: TestUsernsCheckpoint, TestCheckpoint