
[v2] Gather crash reports #117

Closed
@ArtemGr

Description

  • Functional test (those things tend to be flaky; we need to automatically test that crashes are caught).
  • should] For the binary Docker deployments and the internal testing there, there should be an option of letting the OS dump the core instead of manually collecting the stack trace. No need to upload the core anywhere, at least not within this issue's timeframe: the testing developers will get it manually from the image.
  • Handle Windows exceptions (example, _EXCEPTION_POINTERS, ru, StackWalker). This should speed up development by allowing me to test the crash-handling code from my primary development environments. It will also work for the Windows deployments of MM. => POC.
  • -> A full Windows build is likely necessary. Otherwise we'll be getting linking errors about missing symbols like os_portable::OS_init.
  • -> -> Try to link the C MM1 library to the MM2 Rust binary instead of the other way around.
  • Handle Unix signals (having the Linux and macOS deployments in mind).
  • should] Capture not just the C crashes but also the Rust panics. => Global panic handlers aren't stable yet (#[panic_handler]). We could set thread-local hooks, but I'll first try a simpler (?) route: using RUST_BACKTRACE and getting crash reports by capturing the standard output.
  • -> Dump C/Rust backtraces to standard output and save-to-share log.
  • -> -> Rehash invariants regarding the logging of sensitive information.
  • Backtrace without line numbers first, to improve reliability? => Not an option.
  • Scan the logs folder and see if some of the logs might constitute a crash or a failure.
  • won't] Watchdog. Mark the log dirty when we're doing something and clean when we're finished. That way we can know that there was a failure even if we were too dead to capture the backtrace in the log. We'll probably need a helper, a forked process that leaves a trace that the computer and filesystem were online. If the system was online but MM failed to leave the clean mark then something went wrong and the log might help us figure it out. It's important to leave the clean mark whenever MM is killed. Won't do it in this issue, but should probably create a separate one.
  • Make sure it works on macOS, HyperDEX team mostly deploys there.
  • -> Test the macOS build?
  • -> CI macOS build?
  • -> Remove the strip from the builds, turn debug options on.
  • Figure out where to send/store them. => I'd like to use Google Cloud Storage for that. Should ask Artem for alternatives. => For starters it might be good enough if they're printed to stdout.
  • Consider implementing this for MM1 too. => The plan is to make a new release of MM reusing the same CI build chain.
