fuweid commented Sep 7, 2020

If the procRun state has been synced and the runc-create process is then
killed for some reason, the runc-init[2:stage] process is leaked. The
runc command also fails to load the container from the root directory,
because the container doesn't have a state.json yet.

To make it possible to clean up the leaked runc-init[2:stage] process,
we should store the state before syncing procRun.

current workflow:

[  child  ] <-> [   parent   ]

procHooks   --> [run hooks]
            <-- procResume

procReady   --> [final setup]
            <-- procRun

                ( killed for some reason )
                ( store state.json )

expected workflow:

[  child  ] <-> [   parent   ]

procHooks   --> [run hooks]
            <-- procResume

procReady   --> [final setup]
                store state.json
            <-- procRun

Signed-off-by: Wei Fu [email protected]

fuweid force-pushed the update-state-store branch from 66b39cf to ba0246d September 7, 2020 15:23
AkihiroSuda added this to the 1.0.0 milestone Sep 8, 2020

fuweid commented Sep 10, 2020

ping @AkihiroSuda @kolyshkin PTAL, thanks!

fuweid commented Sep 17, 2020

ping @opencontainers/runc-maintainers PTAL, thanks~

kolyshkin left a comment

LGTM. Wonder what was the reason of runc-create's premature demise that you saw?

fuweid commented Sep 20, 2020

> LGTM. Wonder what was the reason of runc-create's premature demise that you saw?

@kolyshkin I didn't see what happened at the moment. We use Kubernetes and containerd. The CRI StartContainer call returned a timeout error, and the runc-init process was leaked while it was trying to open exec.fifo.

This issue is hard to reproduce. I see a few cases every few days in production. When it shows up, we see the system reclaiming much more memory, or an OOM for the container, during runc-create. But I haven't found the root cause.

This patch is meant to prevent runc-init from leaking. I will share the root cause if I can get a clue from the log monitoring. :)

fuweid commented Sep 23, 2020

ping @AkihiroSuda ~
