@wbpcode
Member

@wbpcode wbpcode commented Oct 6, 2024

Commit Message: wasm: restart wasm vm if it's failed
Additional Description:

An experimental PR that supports automatic reloading when the wasm VM has failed (panic(), abort(), etc.).

Risk Level: low. The wasm is not production ready anyway.
Testing: unit. waiting.
Docs Changes: n/a.
Release Notes: n/a.
Platform Specific Features: n/a.

@repokitteh-read-only

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #36456 was opened by wbpcode.

@wbpcode
Member Author

wbpcode commented Oct 6, 2024

Hi @mpwarres @kyessenov, I am basically a beginner when it comes to the wasm module. Could you take a quick look at this when you have some free time, to see whether it makes sense and is done the right way?

If it's OK, then I can try to complete the PR with some more tests and code optimizations.

@kyessenov
Contributor

We need some kind of rate limiting I think, since plugins can fail on configuration and this would cause a crash loop without back off. Also, we need some telemetry exposed so that operators can tell when plugins are crashing and the endpoint can be temporarily removed from rotation till recovery. Basically, what k8s normally does with pods.

@wbpcode
Member Author

wbpcode commented Oct 8, 2024

We need some kind of rate limiting I think, since plugins can fail on configuration and this would cause a crash loop without back off.

Almost all onConfigure failures will result in a configuration rejection, so they will never reach the reload logic. I think this is mainly meant for the occasional runtime crash bug?

Also, we need some telemetry exposed so that operators can tell when plugins are crashing and the endpoint can be temporarily removed from rotation till recovery. Basically, what k8s normally does with pods.

This sounds great. I'd prefer to treat it as independent work and implement it in a separate PR.

UNSPECIFIED = 0;

// All plugins associated with the VM will be reloaded then new requests are processed. This
// make sense when the VM failure is caused by runtime exception, abort(), or panic().
Contributor

What if the failure is caused by something else? If we think this behavior makes sense only in these cases, can we trigger the behavior only in these cases, rather than doing it for any failure? Or is there no down-side to doing this on other types of failures?

Member Author

I think we can do this for runtime errors first, and add support for other failure types in the future if necessary.

@wbpcode
Member Author

wbpcode commented Oct 10, 2024

Will address all the comments and add tests.

@wbpcode
Member Author

wbpcode commented Oct 11, 2024

/wait test

Signed-off-by: wangbaiping <[email protected]>
wangbaiping added 2 commits October 13, 2024 10:32
Signed-off-by: wangbaiping <[email protected]>
Signed-off-by: wangbaiping <[email protected]>
@wbpcode
Member Author

wbpcode commented Oct 13, 2024

blocked by #36556

wangbaiping added 2 commits October 14, 2024 10:03
Signed-off-by: wangbaiping <[email protected]>
Signed-off-by: wangbaiping <[email protected]>
@wbpcode
Member Author

wbpcode commented Oct 14, 2024

Refactored all the code on top of #36556, which required a force push.

Sorry to all reviewers 🙏

Signed-off-by: wangbaiping <[email protected]>
@wbpcode
Member Author

wbpcode commented Oct 16, 2024

/retest

wbpcode added a commit that referenced this pull request Oct 17, 2024
Commit Message: wasm: clean up the code
Additional Description:

While working on #36456, I found
that there is a lot of redundant code in the wasm extensions, and that wasm
loading and creation are spread across multiple places.
This redundancy and fragmentation made #36456 more and more
complex.

In the end, I split the code cleanup out into an independent PR.

This PR doesn't change any logic; it only merges duplicated logic.

Risk Level: n/a.
Testing: n/a.
Docs Changes: n/a.
Release Notes: n/a.

---------

Signed-off-by: wangbaiping <[email protected]>
wangbaiping/wbpcode added 2 commits October 17, 2024 15:06
@wbpcode wbpcode requested a review from kyessenov as a code owner October 17, 2024 07:10
@wbpcode
Member Author

wbpcode commented Oct 17, 2024

finally, it's done.

@wbpcode
Member Author

wbpcode commented Oct 17, 2024

friendly ping @markdroth for another API review.

@wbpcode
Member Author

wbpcode commented Oct 17, 2024

And also friendly ping @kyessenov and @mpwarres for a code review.

Because the API won't have a big effect on the implementation, I think we can do the code review and the API review at the same time. :)

@wbpcode wbpcode changed the title from "wasm: restart wasm vm if it's failed" to "wasm: restart wasm vm if it's failed because runtime error" Oct 17, 2024
@wbpcode
Member Author

wbpcode commented Oct 17, 2024

After this PR, at least an occasional runtime error in a wasm extension won't block the whole worker.

}

message ReloadConfig {
// Backoff strategy for the VM failure reload. If not specified, the default 1s base interval
Contributor

what's the value of the default base interval?

Member Author

1s? This is documented in the comment.

  // Backoff strategy for the VM failure reload. If not specified, the default 1s base interval
  // will be applied.
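
For illustration only, a backoff that starts from a 1s base interval could look like the sketch below. The class name `ReloadBackoff`, the doubling policy, and the 10s cap are assumptions made for this example; they are not taken from the PR or from Envoy's actual backoff implementation.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper for illustration; not Envoy's real backoff strategy.
class ReloadBackoff {
public:
  // The 1s default base matches the documented default. The 10s cap is an assumption.
  explicit ReloadBackoff(std::chrono::milliseconds base = std::chrono::milliseconds(1000),
                         std::chrono::milliseconds max = std::chrono::milliseconds(10000))
      : base_(base), max_(max), next_(base) {}

  // Returns the interval to wait before the next reload attempt,
  // then doubles the stored interval up to the cap.
  std::chrono::milliseconds nextInterval() {
    const std::chrono::milliseconds current = next_;
    next_ = std::min(next_ * 2, max_);
    return current;
  }

  // Call after a successful reload so the next failure starts from the base again.
  void reset() { next_ = base_; }

private:
  std::chrono::milliseconds base_;
  std::chrono::milliseconds max_;
  std::chrono::milliseconds next_;
};
```

With the defaults above, repeated failures would wait 1s, 2s, 4s, 8s, and then 10s for every further attempt until `reset()` is called.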

@mpwarres
Contributor

Hi @mpwarres @kyessenov, I am basically a beginner when it comes to the wasm module. Could you take a quick look at this when you have some free time, to see whether it makes sense and is done the right way?

I think it makes sense FWIW. Looking at the implementation now.

Signed-off-by: wangbaiping/wbpcode <[email protected]>
@wbpcode
Member Author

wbpcode commented Oct 22, 2024

/retest

Contributor

@markdroth markdroth left a comment

/lgtm api

Contributor

@mpwarres mpwarres left a comment

Sorry for the delayed review. LGTM, all comments are minor / optional. Thanks!

// updated anyway.
handle_wrapper.last_load = now;
PluginHandleSharedPtr new_load = getOrCreateThreadLocalPlugin(base_wasm_, plugin_, dispatcher);
if (new_load != nullptr) {
Contributor

Should the new_load == nullptr case also count as a WasmEvent::VmReloadFailure?
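
To make the question concrete, here is a minimal, self-contained sketch of a reload step that treats a null result as a failure. Everything in it (`Plugin`, `HandleWrapper`, `ReloadResult`, `maybeReload`, and the `create` callback standing in for `getOrCreateThreadLocalPlugin`) is a hypothetical stand-in for illustration, not the code in this PR.

```cpp
#include <chrono>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-ins for the real plugin handle types.
struct Plugin {};
using PluginPtr = std::shared_ptr<Plugin>;

enum class ReloadResult { Skipped, Success, Failure };

struct HandleWrapper {
  PluginPtr handle;
  std::chrono::steady_clock::time_point last_load;
};

// Attempts a reload if at least min_interval has passed since the last attempt.
// `create` is an assumed callback that returns nullptr when the plugin cannot be created.
ReloadResult maybeReload(HandleWrapper& wrapper, std::chrono::steady_clock::time_point now,
                         std::chrono::milliseconds min_interval,
                         const std::function<PluginPtr()>& create) {
  if (now - wrapper.last_load < min_interval) {
    return ReloadResult::Skipped; // Still backing off; keep the current handle.
  }
  // The timestamp is updated whether or not the reload succeeds.
  wrapper.last_load = now;
  PluginPtr new_load = create();
  if (new_load == nullptr) {
    return ReloadResult::Failure; // Counted as a reload failure in this sketch.
  }
  wrapper.handle = std::move(new_load);
  return ReloadResult::Success;
}
```

In this sketch the caller could bump a failure counter whenever `ReloadResult::Failure` is returned, which covers the `new_load == nullptr` case as well.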

wangbaiping(wbpcode) added 2 commits October 24, 2024 06:50
@wbpcode
Member Author

wbpcode commented Oct 24, 2024

/retest

Contributor

@kyessenov kyessenov left a comment

Great work!

@wbpcode wbpcode merged commit 64b4d2e into envoyproxy:main Oct 24, 2024
21 checks passed
@wbpcode wbpcode deleted the dev-wasm-restart branch October 24, 2024 23:46
@vibneiro

Is there a way to see envoy stats via http if wasm crashed? e.g. via curl http://localhost:15000/stats?

@wbpcode
Member Author

wbpcode commented Feb 8, 2025

Is there a way to see envoy stats via http if wasm crashed? e.g. via curl http://localhost:15000/stats?

Yeah, but the port is whatever your admin port is. If Istio is used as the control plane, it is usually 15000.

6 participants