
Conversation

@HirazawaUi

Based on the context of the Slack discussion, I’m trying to debug the e2e tests for kind.

ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1743347042802559

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: HirazawaUi
Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from aojea and stmcginnis April 2, 2025 06:45
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes and size/XS labels Apr 2, 2025
@HirazawaUi HirazawaUi force-pushed the test-e2e branch 2 times, most recently from a66d7f3 to 2905fec on April 2, 2025 07:46
@k8s-ci-robot k8s-ci-robot added the size/S and removed the size/XS labels Apr 2, 2025
@HirazawaUi HirazawaUi force-pushed the test-e2e branch 2 times, most recently from ef45486 to 7a71a97 on April 2, 2025 10:09

docker ps -a

docker exec -it kind-worker journalctl -n 2000
Member

We already dump these logs to $ARTIFACTS; click the "artifacts" link at the top of the job results and browse there.

@HirazawaUi HirazawaUi Apr 2, 2025
Author

Yes, but that only includes the logs before kind delete cluster. I’d like to get the logs after kind delete cluster to figure out what’s blocking it :)
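One rough way to do that (just a sketch of my own, not what this PR adds; the node name kind-worker is assumed) is to keep following the node's journal in the background while the delete runs, so output keeps streaming past the point where the normal $ARTIFACTS dump stops:

package main

import (
	"os"
	"os/exec"
)

func main() {
	// Follow the worker node's journal in the background; the stream ends
	// naturally once the node container is removed.
	follow := exec.Command("docker", "exec", "kind-worker", "journalctl", "-f")
	follow.Stdout = os.Stdout
	follow.Stderr = os.Stderr
	if err := follow.Start(); err != nil {
		panic(err)
	}

	// Run the delete; any node logs emitted while it hangs keep streaming above.
	del := exec.Command("kind", "delete", "cluster")
	del.Stdout = os.Stdout
	del.Stderr = os.Stderr
	_ = del.Run()

	// Clean up the follower if it is still running.
	_ = follow.Process.Kill()
}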

@HirazawaUi HirazawaUi force-pushed the test-e2e branch 2 times, most recently from d7a12f5 to 4ebdb37 on April 2, 2025 14:06
@k8s-ci-robot k8s-ci-robot added the size/M and removed the size/S labels Apr 2, 2025
@HirazawaUi
Author

HirazawaUi commented Apr 3, 2025

Apr 02 15:56:17 kind-worker kubelet[284]: I0402 15:56:17.112753     284 pod_container_manager_linux.go:217] "Failed to delete cgroup paths" cgroupName=["kubelet","kubepods","besteffort","pod582b8c89-eb89-4661-9710-24bf6f0434a9"] err="unable to destroy cgroup paths for cgroup [kubelet kubepods besteffort pod582b8c89-eb89-4661-9710-24bf6f0434a9] : Failed to remove paths: map[:/sys/fs/cgroup/unified/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice blkio:/sys/fs/cgroup/blkio/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice cpuset:/sys/fs/cgroup/cpuset/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice devices:/sys/fs/cgroup/devices/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice freezer:/sys/fs/cgroup/freezer/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice hugetlb:/sys/fs/cgroup/hugetlb/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice memory:/sys/fs/cgroup/memory/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice name=systemd:/sys/fs/cgroup/systemd/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice net_cls:/sys/fs/cgroup/net_cls,net_prio/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice perf_event:/sys/fs/cgroup/perf_event/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice pids:/sys/fs/cgroup/pids/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice rdma:/sys/fs/cgroup/rdma/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice]"
    State: degraded
    Units: 163 loaded (incl. loaded aliases)
     Jobs: 0 queued
   Failed: 3 units
    Since: Wed 2025-04-02 15:36:29 UTC; 19min ago
  systemd: 252.33-1~deb12u1
  Tainted: cgroupsv1
   CGroup: /
           ├─326557 systemctl status
           ├─init.scope
           │ └─1 /sbin/init
           ├─kubelet.slice
           │ ├─kubelet-kubepods.slice
           │ │ ├─kubelet-kubepods-besteffort.slice
           │ │ │ ├─kubelet-kubepods-besteffort-pod582b8c89_eb89_4661_9710_24bf6f0434a9.slice
           │ │ │ │ └─cri-containerd-cf59fe4ce27393aa3e006673aed35954e33657b9e496924f69af1e90c9ecd4d6.scope
           │ │ │ │   └─50390 runc init

This seems to indicate that systemd has to clean up all services and cgroups when it shuts down, and it can get stuck in that cleanup phase if the pod's leftover processes never exit.
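A quick way to check this (just a sketch, assuming the cgroup v1 layout shown in the log above, run inside the node via docker exec) is to list which kubelet pod cgroups still contain live processes:

package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Assumption: cgroup v1 with the systemd hierarchy, as in the kubelet log above.
	root := "/sys/fs/cgroup/systemd/kubelet.slice"
	_ = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || !d.IsDir() {
			return nil
		}
		data, err := os.ReadFile(filepath.Join(path, "cgroup.procs"))
		if err != nil {
			return nil
		}
		if pids := strings.Fields(string(data)); len(pids) > 0 {
			// Any PIDs left here are what systemd would have to wait for on shutdown.
			fmt.Printf("%s: %v\n", path, pids)
		}
		return nil
	})
}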

I will try to build a containerd binary without the null pointer bug and try again.

@HirazawaUi HirazawaUi force-pushed the test-e2e branch 10 times, most recently from cce6559 to c14e42e on April 5, 2025 14:46
@HirazawaUi
Author

After applying some simple fixes, I found that systemd on some kind nodes is not "degraded", yet I still cannot clean up the kind nodes. This suggests that my earlier assumption was not accurate and needs further verification.

State: running
Units: 155 loaded (incl. loaded aliases)
 Jobs: 0 queued
Failed: 0 units
Since: Sat 2025-04-05 09:50:45 UTC; 21min ago
systemd: 252.33-1~deb12u1
Tainted: cgroupsv1
CGroup: /
       ├─341398 systemctl status
       ├─init.scope
       │ └─1 /sbin/init
       ├─kubelet.slice
       │ ├─kubelet-kubepods.slice
       │ │ ├─kubelet-kubepods-besteffort.slice
       │ │ │ └─kubelet-kubepods-besteffort-pod4f7637eb_7d64_41c4_83a0_2e15e9c7c2f2.slice
       │ │ │   ├─cri-containerd-62369af3ecafae0688bcf2b95e962560240c347c3722ab70d5f49fd6e6607120.scope
       │ │ │   │ └─878 /pause
       │ │ │   └─cri-containerd-a4ecfa4ad2c9663be29e912ba786db47913bed26df005d94a61dc204724e3cab.scope
       │ │ │     └─905 /usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf --hostname-override=kind-worker --v=4
       │ │ └─kubelet-kubepods-pod0819ea3c_c081_4bd0_8e4f_4406412ac59a.slice
       │ │   ├─cri-containerd-e8f88da663534607adb54ade0585755c7dfbe995e7ed0a7f0825f6f1f2067241.scope
       │ │   │ └─401 /pause
       │ │   └─cri-containerd-eb9177bc2926710e28d31222820b24835f306044b9ecc4eeb58e4bb30a8a4cef.scope
       │ │     └─604 /bin/kindnetd
       │ └─kubelet.service
       │   └─283 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-log-max-files=10 --container-log-max-size=100Mi --container-runtime-endpoint=unix:///run/containerd/containerd.sock --node-ip=172.18.0.4 --node-labels= --pod-infra-container-image=registry.k8s.io/pause:3.10 --provider-id=kind://docker/kind/kind-worker --v=4 --runtime-cgroups=/system.slice/containerd.service
       └─system.slice

@k8s-ci-robot k8s-ci-robot reopened this Apr 21, 2025
@k8s-ci-robot
Contributor

@HirazawaUi: Reopened this PR.

In response to this:

/reopen
Will continue debugging this issue when I have time.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@HirazawaUi HirazawaUi force-pushed the test-e2e branch 7 times, most recently from 71fc719 to 052d8f1 on April 23, 2025 01:44
@HirazawaUi
Author

I resolved the issue by not passing containerd's context to the exec.CommandContext() call that executes runc create. However, I suspect that dropping the context like this would be unacceptable to containerd's maintainers (I can't even fully convince myself), so I'll keep investigating and try to fix this properly in either runc or containerd.
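For anyone following along, the behavioral difference is roughly what the sketch below shows (illustrative only, not containerd's actual code; sleep stands in for the runc create step):

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Simulate the parent (containerd) shutting down shortly after starting the child.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(100 * time.Millisecond)
		cancel()
	}()

	// With CommandContext, cancelling ctx kills the child (SIGKILL), so a
	// "runc create"-style step can die half way through and leave state
	// (container, cgroup) behind for someone else to clean up.
	withCtx := exec.CommandContext(ctx, "sleep", "5")
	fmt.Println("with context:", withCtx.Run()) // error: signal: killed

	// Without the context, the child always runs to completion, even while
	// the parent is shutting down, which is what the workaround relies on.
	withoutCtx := exec.Command("sleep", "5")
	fmt.Println("without context:", withoutCtx.Run()) // nil
}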

@HirazawaUi HirazawaUi force-pushed the test-e2e branch 5 times, most recently from 2d57c4a to ce1a71a on April 24, 2025 04:31
@BenTheElder
Member

I'm working on upgrading containerd again in #3920, though IIRC some more fixes are not coming until 2.1.x

@HirazawaUi
Author

I'm working on upgrading containerd again in #3920, though IIRC some more fixes are not coming until 2.1.x

Yes, I tried the latest version of containerd, but I got the same result. I've opened a PR in the containerd repository to fix it :)
ref: containerd/containerd#11755

@HirazawaUi
Author

/retest

@HirazawaUi HirazawaUi force-pushed the test-e2e branch 3 times, most recently from 3b84d3e to 2155088 on April 26, 2025 06:15
@HirazawaUi
Author

/retest

1 similar comment
@HirazawaUi
Author

/retest

@k8s-ci-robot
Contributor

@HirazawaUi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                            Commit   Details  Required  Rerun command
pull-kind-verify                                     d7bfc65  link     true      /test pull-kind-verify
pull-kind-e2e-kubernetes                             d7bfc65  link     true      /test pull-kind-e2e-kubernetes
pull-kind-conformance-parallel-dual-stack-ipv4-ipv6  d7bfc65  link     true      /test pull-kind-conformance-parallel-dual-stack-ipv4-ipv6

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@HirazawaUi
Author

I've finally fixed this issue in the runc repo. Truly appreciate @BenTheElder's patient assistance - thank you so much!

/close

@k8s-ci-robot
Contributor

@HirazawaUi: Closed this PR.

In response to this:

I've finally fixed this issue in the runc repo. Truly appreciate @BenTheElder's patient assistance - thank you so much!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
