wasm: restart wasm vm if it's failed because of a runtime error #36456
Conversation
CC @envoyproxy/runtime-guard-changes: FYI only for changes made to
Hi @mpwarres @kyessenov, I am basically a beginner when it comes to the wasm module. Could you take a quick look at this to see whether it makes sense or is done the correct way when you get some free time? If it's OK, I can then complete the PR with more tests and some code optimizations.
We need some kind of rate limiting, I think, since plugins can fail on configuration and this would cause a crash loop without back-off. We also need some telemetry exposed so that operators can tell when plugins are crashing and the endpoint can be temporarily removed from rotation until it recovers. Basically, what k8s normally does with pods.
Almost all onConfigure failures will result in a configuration rejection, so they will never reach the reload logic. I think this is mainly used for the occasional runtime crash bug?
This sounds great. I prefer to treat it as independent work and implement it in a separate PR.
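For illustration, here is a minimal standalone sketch of the kind of back-off the comment above asks for, assuming a jittered exponential strategy; the class and method names (ReloadBackoff, nextDelay, reset) are hypothetical and are not Envoy's actual API.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical sketch: each consecutive VM reload failure doubles the wait
// (with full jitter), and a successful reload resets the counter so the next
// crash starts again from the base interval.
class ReloadBackoff {
public:
  ReloadBackoff(std::chrono::milliseconds base, std::chrono::milliseconds max)
      : base_(base), max_(max), rng_(std::random_device{}()) {}

  // Returns how long to wait before the next reload attempt.
  std::chrono::milliseconds nextDelay() {
    const int64_t capped = std::min<int64_t>(
        max_.count(), base_.count() * (int64_t{1} << std::min(retries_, 16)));
    ++retries_;
    // Full jitter: pick a random delay in [0, capped] to avoid synchronized
    // reload storms across workers.
    std::uniform_int_distribution<int64_t> dist(0, capped);
    return std::chrono::milliseconds(dist(rng_));
  }

  // Call after a successful reload.
  void reset() { retries_ = 0; }

private:
  std::chrono::milliseconds base_;
  std::chrono::milliseconds max_;
  int retries_{0};
  std::mt19937_64 rng_;
};
```

Repeated failures quickly back off toward the cap, which is what prevents the crash loop described above.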
UNSPECIFIED = 0;

// All plugins associated with the VM will be reloaded then new requests are processed. This
// make sense when the VM failure is caused by runtime exception, abort(), or panic().
What if the failure is caused by something else? If we think this behavior makes sense only in these cases, can we trigger the behavior only in these cases, rather than doing it for any failure? Or is there no downside to doing this on other types of failures?
I think we can do this for runtime errors first, and add support for other failure types in the future if necessary.
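As a rough sketch of what "reload only on runtime errors" could look like, assuming a simple failure-kind enum; the names below are illustrative stand-ins, not the proto's real values.

```cpp
// Hypothetical failure classification: only transient runtime crashes are
// worth a reload; a rejected configuration would just crash-loop through
// onConfigure() again.
enum class VmFailure {
  None,
  ConfigurationRejected,  // onConfigure() returned false; reloading won't help.
  RuntimeError,           // trap, abort(), or panic() while serving traffic.
};

inline bool isReloadableFailure(VmFailure failure) {
  return failure == VmFailure::RuntimeError;
}
```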
Will address all the comments and add tests.

/wait test
blocked by #36556
Refactored all the code based on #36556, and a force push was used. Sorry to all reviewers 🙏
/retest
Commit Message: wasm: clean up the code
Additional Description: While working on #36456, I found a lot of redundant code in the wasm extensions, and the wasm loading and creation logic is spread across multiple places. This redundancy and fragmentation made #36456 more and more complex, so I split the code clean-up out as an independent PR. This PR doesn't change any logic; it only merges duplicated logic.
Risk Level: n/a.
Testing: n/a.
Docs Changes: n/a.
Release Notes: n/a.
---------
Signed-off-by: wangbaiping <[email protected]>
Finally, it's done.

Friendly ping @markdroth for another API review.

And also a friendly ping to @kyessenov and @mpwarres for a code review. Since the API won't have a big effect on the implementation, I think we can do the code review and the API review at the same time. :)

After this PR, at least the occasional runtime error in a wasm extension won't block the whole worker.
}

message ReloadConfig {
  // Backoff strategy for the VM failure reload. If not specified, the default 1s base interval
what's the value of the default base interval?
1s? This is documented in the comment:

// Backoff strategy for the VM failure reload. If not specified, the default 1s base interval
// will be applied.
I think it makes sense FWIW. Looking at the implementation now.
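A minimal sketch of the "default 1s base interval" semantics being discussed, assuming the backoff fields are optional; the struct and function below are hypothetical, shown only to make the fallback explicit.

```cpp
#include <chrono>
#include <optional>

// Hypothetical mirror of the proto's backoff fields.
struct ReloadBackoffConfig {
  std::optional<std::chrono::milliseconds> base_interval;
  std::optional<std::chrono::milliseconds> max_interval;
};

inline std::chrono::milliseconds
effectiveBaseInterval(const ReloadBackoffConfig& config) {
  // Unspecified backoff -> the documented default of 1 second.
  return config.base_interval.value_or(std::chrono::seconds(1));
}
```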
/retest
markdroth
left a comment
/lgtm api
mpwarres
left a comment
Sorry for the delayed review. LGTM, all comments are minor / optional. Thanks!
// updated anyway.
handle_wrapper.last_load = now;
PluginHandleSharedPtr new_load = getOrCreateThreadLocalPlugin(base_wasm_, plugin_, dispatcher);
if (new_load != nullptr) {
Should the new_load == nullptr case also count as a WasmEvent::VmReloadFailure?
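A standalone sketch of what counting the new_load == nullptr case as a reload failure might look like; the types and stat names below are hypothetical placeholders, not the actual Envoy implementation.

```cpp
#include <cstdint>
#include <functional>
#include <memory>

struct PluginHandle {};  // placeholder for the real per-thread plugin handle
using PluginHandlePtr = std::shared_ptr<PluginHandle>;

struct ReloadStats {
  uint64_t vm_reload_success{0};
  uint64_t vm_reload_failure{0};
};

PluginHandlePtr tryReloadPlugin(const std::function<PluginHandlePtr()>& create,
                                ReloadStats& stats) {
  PluginHandlePtr handle = create();
  if (handle != nullptr) {
    ++stats.vm_reload_success;
  } else {
    // new_load == nullptr: the reload attempt itself failed, so count it as a
    // failure rather than silently keeping the broken handle.
    ++stats.vm_reload_failure;
  }
  return handle;
}
```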
/retest
kyessenov
left a comment
Great work!
Is there a way to see Envoy stats via HTTP if wasm crashed? e.g. via curl http://localhost:15000/stats?

Yeah, but the port depends on your admin port. If Istio is used as the control plane, it is usually 15000.
Commit Message: wasm: restart wasm vm if it's failed
Additional Description:
An experimental PR that supports automatic reloading when the wasm VM has failed (panic(), abort(), etc.).
Risk Level: low. Wasm is not production ready anyway.
Testing: unit (pending).
Docs Changes: n/a.
Release Notes: n/a.
Platform Specific Features: n/a.