wasm: restart wasm vm if it's failed because runtime error #36456

wbpcode · 2024-10-06T03:52:07Z

Commit Message: wasm: restart wasm vm if it's failed
Additional Description:

An experimental PR that supports automatic reloading when the wasm VM has failed (panic(), abort(), etc.).

Risk Level: low. The wasm is not production ready anyway.
Testing: unit. waiting.
Docs Changes: n/a.
Release Notes: n/a.
Platform Specific Features: n/a.

repokitteh-read-only · 2024-10-06T03:52:17Z

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

wbpcode · 2024-10-06T03:56:53Z

Hi @mpwarres @kyessenov, I am basically a beginner with the wasm module. Could you take a quick look at this when you get some free time, to see whether it makes sense and is done the right way?

If it's OK, then I can try to complete the PR with some more tests and code optimizations.

api/envoy/extensions/wasm/v3/wasm.proto

kyessenov · 2024-10-07T23:53:04Z

We need some kind of rate limiting I think, since plugins can fail on configuration and this would cause a crash loop without back off. Also, we need some telemetry exposed so that operators can tell when plugins are crashing and the endpoint can be temporarily removed from rotation till recovery. Basically, what k8s normally does with pods.

wbpcode · 2024-10-08T01:55:30Z

> We need some kind of rate limiting I think, since plugins can fail on configuration and this would cause a crash loop without back off.

Almost all onConfigure failures result in a configuration rejection, so they never reach the reload logic. I think this is mainly useful for the occasional runtime crash bug?

> Also, we need some telemetry exposed so that operators can tell when plugins are crashing and the endpoint can be temporarily removed from rotation till recovery. Basically, what k8s normally does with pods.

This sounds great, but I'd prefer to treat it as independent work and implement it in a separate PR.

markdroth · 2024-10-08T15:58:18Z

api/envoy/extensions/wasm/v3/wasm.proto

+  UNSPECIFIED = 0;
+
+  // All plugins associated with the VM will be reloaded then new requests are processed. This
+  // make sense when the VM failure is caused by runtime exception, abort(), or panic().


What if the failure is caused by something else? If we think this behavior makes sense only in these cases, can we trigger the behavior only in these cases, rather than doing it for any failure? Or is there no down-side to doing this on other types of failures?

I think we can do this for runtime errors first, and add support for other failure types in the future if necessary.
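
To make the "runtime errors first" scoping concrete, here is a minimal sketch of the decision. The enum and function names are hypothetical and chosen for illustration only; they are not the names used in wasm.proto or in the Envoy sources.

```cpp
// Hypothetical names for illustration; not the types from wasm.proto.
enum class FailureKind { RuntimeError, ConfigurationError, Other };
enum class FailurePolicy { Unspecified, Reload };

// Reload only when the policy asks for it and the failure is a runtime error
// (runtime exception, abort(), panic()). Every other failure kind keeps the
// existing behaviour.
constexpr bool shouldReload(FailurePolicy policy, FailureKind kind) {
  return policy == FailurePolicy::Reload && kind == FailureKind::RuntimeError;
}
```

Keeping the check in one predicate would make it straightforward to admit more failure kinds later without touching the callers.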

api/envoy/extensions/wasm/v3/wasm.proto

wbpcode · 2024-10-10T09:43:23Z

Will address all the comments and add tests.

wbpcode · 2024-10-11T12:18:02Z

/wait test

wbpcode · 2024-10-13T04:34:25Z

blocked by #36556

wbpcode · 2024-10-14T11:06:53Z

Refactored all the code on top of #36556; a force push was used.

Sorry to all reviewers 🙏

wbpcode · 2024-10-16T16:01:46Z

/retest

Commit Message: wasm: clean up the code
Additional Description:

While working on #36456, I found a lot of redundant code in the wasm extensions, and the wasm loading and creation logic is spread across multiple places. This redundancy and fragmentation made #36456 more and more complex, so I finally split the code cleanup out into an independent PR.

This PR doesn't change any logic; it only merges duplicated logic.

Risk Level: n/a.
Testing: n/a.
Docs Changes: n/a.
Release Notes: n/a.

---------

Signed-off-by: wangbaiping <[email protected]>

wbpcode · 2024-10-17T07:59:51Z

finally, it's done.

wbpcode · 2024-10-17T08:00:08Z

friendly ping @markdroth for another API review.

wbpcode · 2024-10-17T08:01:52Z

And also friendly ping @kyessenov and @mpwarres for a code review.

Because the API won't have a big effect on the implementation, I think we can do the code review and the API review at the same time. :)

wbpcode · 2024-10-17T08:07:46Z

After this PR, at least an occasional runtime error in a wasm extension won't block the whole worker.

source/extensions/common/wasm/wasm.cc

kyessenov · 2024-10-17T14:16:37Z

api/envoy/extensions/wasm/v3/wasm.proto

+}
+
+message ReloadConfig {
+  // Backoff strategy for the VM failure reload. If not specified, the default 1s base interval


what's the value of the default base interval?

1s? This is documented in the comment.

// Backoff strategy for the VM failure reload. If not specified, the default 1s base interval
// will be applied.
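
For readers following the rate-limiting concern raised earlier in the thread, a backoff gate for reload attempts could look roughly like the sketch below. It assumes a simple doubling strategy with a cap; the class `ReloadBackoff` and its methods are hypothetical and are not the implementation in this PR.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper, not Envoy code: decides whether a failed VM may be
// reloaded yet, doubling the wait after each failed reload up to a maximum.
class ReloadBackoff {
public:
  using Clock = std::chrono::steady_clock;
  using Millis = std::chrono::milliseconds;

  ReloadBackoff(Millis base, Millis max) : base_(base), max_(max), current_(base) {}

  // Returns true if the current interval has elapsed since the last allowed
  // attempt. The attempt time is recorded only when an attempt is allowed.
  bool shouldAttempt(Clock::time_point now) {
    if (attempted_ && now - last_attempt_ < current_) {
      return false;
    }
    attempted_ = true;
    last_attempt_ = now;
    return true;
  }

  // A failed reload doubles the interval, capped at max_.
  void onFailure() { current_ = std::min(current_ * 2, max_); }

  // A successful reload resets the interval to the base value.
  void onSuccess() { current_ = base_; }

  Millis currentInterval() const { return current_; }

private:
  const Millis base_;
  const Millis max_;
  Millis current_;
  Clock::time_point last_attempt_{};
  bool attempted_{false};
};
```

With a 1s base interval, a VM that crashes again immediately after a reload would not be reloaded more than once per second, which bounds the cost of a crash loop.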

api/envoy/extensions/wasm/v3/wasm.proto

mpwarres · 2024-10-21T19:53:59Z

> Hi @mpwarres @kyessenov, I am basically a beginner with the wasm module. Could you take a quick look at this when you get some free time, to see whether it makes sense and is done the right way?

I think it makes sense FWIW. Looking at the implementation now.

wbpcode · 2024-10-22T03:49:57Z

/retest

markdroth

/lgtm api

api/envoy/extensions/wasm/v3/wasm.proto

mpwarres

Sorry for the delayed review. LGTM, all comments are minor / optional. Thanks!

mpwarres · 2024-10-24T04:41:02Z

source/extensions/common/wasm/wasm.cc

+  // updated anyway.
+  handle_wrapper.last_load = now;
+  PluginHandleSharedPtr new_load = getOrCreateThreadLocalPlugin(base_wasm_, plugin_, dispatcher);
+  if (new_load != nullptr) {


Should the new_load == nullptr case also count as a WasmEvent::VmReloadFailure?
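
A sketch of what counting the null case as a reload failure might look like. All types here (`Plugin`, `ReloadStats`, `tryReload`) are hypothetical stand-ins rather than the classes in wasm.cc, and keeping the old handle on failure is an assumption made for illustration.

```cpp
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-ins for the plugin handle and the stats handler.
struct Plugin {};
using PluginPtr = std::shared_ptr<Plugin>;

struct ReloadStats {
  int reload_success = 0;
  int reload_failure = 0;
};

// Attempts to create a replacement plugin. A null result is counted as a
// reload failure and the existing handle is left untouched; otherwise the
// handle is swapped and a success is counted.
bool tryReload(PluginPtr& handle, const std::function<PluginPtr()>& create,
               ReloadStats& stats) {
  PluginPtr new_load = create();
  if (new_load == nullptr) {
    ++stats.reload_failure;
    return false;
  }
  handle = std::move(new_load);
  ++stats.reload_success;
  return true;
}
```

Recording both outcomes gives operators a counter to alert on when reloads keep failing.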

source/extensions/common/wasm/stats_handler.h

source/extensions/common/wasm/wasm.cc

wbpcode · 2024-10-24T07:44:55Z

/retest

kyessenov

Great work!

vibneiro · 2025-01-27T03:31:01Z

Is there a way to see envoy stats via http if wasm crashed? e.g. via curl http://localhost:15000/stats?

wbpcode · 2025-02-08T02:17:10Z

> Is there a way to see envoy stats via http if wasm crashed? e.g. via curl http://localhost:15000/stats?

Yeah, but the port depends on your admin port. If Istio is used as the control plane, it is usually 15000.

repokitteh-read-only bot added the api label Oct 6, 2024

repokitteh-read-only bot assigned markdroth Oct 6, 2024

wbpcode assigned mpwarres and kyessenov Oct 6, 2024

ramaraochavali reviewed Oct 6, 2024

api/envoy/extensions/wasm/v3/wasm.proto (outdated, resolved)

ramaraochavali reviewed Oct 6, 2024

api/envoy/extensions/wasm/v3/wasm.proto (outdated, resolved)

markdroth reviewed Oct 8, 2024

repokitteh-read-only bot added waiting and removed waiting labels Oct 11, 2024

wasm: clean up the code

e472ab7

Signed-off-by: wangbaiping <[email protected]>

wbpcode mentioned this pull request Oct 12, 2024

wasm: clean up the code #36556

Merged

wangbaiping added 2 commits October 13, 2024 10:32

fix test

6106740

Signed-off-by: wangbaiping <[email protected]>

support singleton

2cc8acb

Signed-off-by: wangbaiping <[email protected]>

wangbaiping added 2 commits October 14, 2024 10:03

improve coverage

3a8fd67

Signed-off-by: wangbaiping <[email protected]>

check check

91f2db4

Signed-off-by: wangbaiping <[email protected]>

wbpcode force-pushed the dev-wasm-restart branch from 2ab0826 to e84510f Compare October 14, 2024 11:04

wangbaiping added 4 commits October 15, 2024 11:03

address some comments

8270067

Signed-off-by: wangbaiping <[email protected]>

merge main and resolve conflicts

664ee1f

complete unit tests

5b9a99f

Signed-off-by: wangbaiping <[email protected]>

Merge branch 'main' of https://github.com/envoyproxy/envoy into dev-w…

151568a

…asm-restart

wbpcode force-pushed the dev-wasm-restart branch from e84510f to 151568a Compare October 15, 2024 11:22

fix building

ac55175

Signed-off-by: wangbaiping <[email protected]>

wangbaiping/wbpcode added 2 commits October 17, 2024 15:06

fix test

5547658

Signed-off-by: wangbaiping/wbpcode <[email protected]>

Merge branch 'main' of https://github.com/envoyproxy/envoy into dev-w…

fb5be22

…asm-restart

wbpcode requested a review from kyessenov as a code owner October 17, 2024 07:10

wbpcode changed the title ~~wasm: restart wasm vm if it's failed~~ wasm: restart wasm vm if it's failed because runtime error Oct 17, 2024

Merge branch 'main' of https://github.com/envoyproxy/envoy into dev-w…

bc79a2d

…asm-restart

kyessenov reviewed Oct 21, 2024

address comments

84c606c

Signed-off-by: wangbaiping/wbpcode <[email protected]>

markdroth reviewed Oct 22, 2024

api/envoy/extensions/wasm/v3/wasm.proto (resolved)

repokitteh-read-only bot removed the api label Oct 22, 2024

mpwarres approved these changes Oct 24, 2024

wangbaiping(wbpcode) added 2 commits October 24, 2024 06:50

address comments

1c8f6a1

Signed-off-by: wangbaiping(wbpcode) <[email protected]>

Merge branch 'main' of https://github.com/envoyproxy/envoy into dev-w…

a809433

…asm-restart

kyessenov approved these changes Oct 24, 2024

wbpcode merged commit 64b4d2e into envoyproxy:main Oct 24, 2024
21 checks passed

wbpcode deleted the dev-wasm-restart branch October 24, 2024 23:46

ramaraochavali mentioned this pull request Dec 30, 2024

503 errors after WASM crash istio/istio#54490

Closed

2 tasks

maleck13 mentioned this pull request Mar 20, 2025

Try out the WASM VM restart in Envoy 1.33 Kuadrant/wasm-shim#171

Closed

2 tasks

alexsnaps mentioned this pull request Mar 24, 2025

Document the trust model and panics in the codebase. proxy-wasm/proxy-wasm-rust-sdk#282

Open

PiotrSikora mentioned this pull request Mar 27, 2025

Panics in the SDK proxy-wasm/proxy-wasm-rust-sdk#310

Open
