-
Notifications
You must be signed in to change notification settings - Fork 44
FMA for Cannon updates for Go 1.23 and Kona #279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@mbaxter could you take a look and see if you have anything to add or change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the feature toggles section might need to be adjusted a bit to reflect the latest code, but generally looks good to me 👍
- **Description:** A [feature toggle](https://github.com/ethereum-optimism/optimism/pull/15487) was added. The contract could be deployed with the wrong version. | ||
- **Risk Assessment:** low | ||
- **Mitigations:** | ||
1. The version number is checked in the constructor, and currently it's required to be 7 (the latest version) so we shouldn't be able to deploy MIPS64.sol with the wrong version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by this toggle. The linked PR shows that it requires version 6, not 7. And if the input value has to be some static value, why is that added as a constructor argument at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That PR is the one that added the feature toggle, but the changes for Go 1.23 updated it to 7. And I'll let @mbaxter answer your question about the design of the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The linked PR shows version 6 because that PR was laying the groundwork for Go 1.23 - it just handles all of the plumbing required to be able to inject the stateVersion
. You can find the PR's that build on top of this here.
Currently in develop
, we support 2 versions: 7 (current) and 8 (the next "experimental" version). The U16 release branch supports only version 7. The general idea is that we want to be able to support the governance-approved version while continuing to develop the next version of the VM.
### FM2: Stack depth-related refactoring with new dclo/dclz instructions introduced a bug | ||
|
||
- **Description:** Arguments were consolidated into a struct to avoid "stack too deep" issues. | ||
- **Risk Assessment:** low | ||
- **Mitigations:** | ||
1. We have comprehensive differential testing on all VM instructions, which should catch any potential refactoring-related bugs | ||
2. This is a trivial refactoring | ||
- **Detection:** We rely on our tests. | ||
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding of this failure mode correct?
edit: seems like no, based on the below failure mode saying these are new instructions?
### FM2: Stack depth-related refactoring with new dclo/dclz instructions introduced a bug | |
- **Description:** Arguments were consolidated into a struct to avoid "stack too deep" issues. | |
- **Risk Assessment:** low | |
- **Mitigations:** | |
1. We have comprehensive differential testing on all VM instructions, which should catch any potential refactoring-related bugs | |
2. This is a trivial refactoring | |
- **Detection:** We rely on our tests. | |
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract. | |
### FM2: Stack depth-related refactoring with new dclo/dclz instructions introduced a bug | |
- **Description:** In the solidity code, arguments were consolidated into a struct to avoid "stack too deep" issues, which may have introduced a bug. | |
- **Risk Assessment:** low | |
- **Mitigations:** | |
1. We have comprehensive differential testing on all VM instructions between go and solidity, which should catch any potential refactoring-related bugs. In this case, the solidity code was changed but the go code was unchanged, therefore we have confidence a bug was not introduced from the refactor. | |
2. This is a trivial refactoring | |
- **Detection:** We rely on our tests. | |
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have taken your edits, thanks.
### FM3: New dclo/dclz instructions are incorrectly implemented | ||
|
||
- **Description:** There are two new instructions, there could be a bug in the implementation. They aren't used by op-program, but would be used if we ever deployed Kona on Cannon. | ||
- **Risk Assessment:** low | ||
- **Mitigations:** | ||
1. These instructions aren't emitted by the Go compiler, so behavior should not affect the VM when running op-program | ||
2. If we ever do deploy Kona on Cannon we will do more testing, including running it on mainnet data for weeks in VM Runner. | ||
- **Detection:** The program would crash if it used those instructions and they were incorrectly implemented. | ||
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there other onchain changes we'd need to ship kona to prod? i.e why we are shipping these to production if we aren't going to use them in production, is this to help roll out kona incrementally in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these are the only changes needed. They're included here because it gives us the option to use Kona with Cannon if needed.
|
||
### FM4: Incomplete Go 1.23 support (missing syscalls) | ||
|
||
- **Description:** It's possible that the Go 1.23 compiler uses additional syscalls that we haven't noticed and they aren't implemented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we typically notice the new syscalls that we need to implement? It sounds like this is empirically based on running the challenger, as opposed to e.g. go release notes giving us this info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two ways:
vm-compat
is a tool that runs in CI and detects new syscalls referenced in the op-program binary (I added this to the text).op-challenger-runner
continually runs op-program against mainnet blocks
- **Detection:** We rely on our tests. | ||
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would we detect and recover from this in production? Would our challenger predict incorrect game result, and we pause and follow an existing runbook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That depends what you mean by "in production". We have never actually needed to run op-program (i.e. play a dispute game all the way down to step()
) in production, so this probably means that op-challenger-runner (AKA "VM runner") would fail and alert us, which would mean we're aware of a vulnerability that hasn't yet been exploited, so we can fix it and disclose as per our policies (which at the moment wouldn't require a pause).
- **Mitigations:** | ||
1. We have comprehensive differential testing on all VM instructions, which should catch any potential refactoring-related bugs | ||
2. This is a trivial refactoring | ||
- **Detection:** We rely on our tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment to the above about how this gets detected in production
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no generalized automatic way to catch a bug, but this refactoring was pretty trivial and has been audited.
- **Mitigations:** | ||
1. The version number is checked in the constructor, and currently it's required to be 7 (the latest version) so we shouldn't be able to deploy MIPS64.sol with the wrong version. | ||
2. This logic is fairly simple, it's just a check against the version number to enable features, so it's easy to reason about and low risk of being implemented incorrectly. | ||
- **Detection:** We have manually reviewed for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment to the above about how this gets detected in production, do have some post-deploy assertions in OPCM or superchain-ops that verify the config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no assertions in OPCM to verify the version.
- **Description:** the eventfd and mprotect syscalls were implemented as a noop, because it was determined that it won't be used by op-program even though there is a reference to it in the binary. | ||
- **Risk Assessment:** medium | ||
- **Mitigations:** | ||
1. We have been running op-challenger-runner on production data for several weeks with the new VM | ||
- **Detection:** We rely on our tests. | ||
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar questions to above—how did we determine this, how do we detect/recover in production?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this comment.
Adds an FMA for Cannon updates for Go 1.23 and Kona.
Closes ethereum-optimism/release-management#323