Skip to content

FMA for Cannon updates for Go 1.23 and Kona #279

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

pauldowman
Copy link
Contributor

Adds an FMA for Cannon updates for Go 1.23 and Kona.

Closes ethereum-optimism/release-management#323

@pauldowman
Copy link
Contributor Author

@mbaxter could you take a look and see if you have anything to add or change?

Copy link
Contributor

@mbaxter mbaxter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the feature toggles section might need to be adjusted a bit to reflect the latest code, but generally looks good to me 👍

@pauldowman pauldowman changed the title Begin FMA for Cannon updates for Go 1.23 and Kona FMA for Cannon updates for Go 1.23 and Kona May 21, 2025
@pauldowman pauldowman marked this pull request as ready for review May 21, 2025 23:24
@pauldowman pauldowman requested a review from mds1 May 21, 2025 23:24
- **Description:** A [feature toggle](https://github.com/ethereum-optimism/optimism/pull/15487) was added. The contract could be deployed with the wrong version.
- **Risk Assessment:** low
- **Mitigations:**
1. The version number is checked in the constructor, and currently it's required to be 7 (the latest version) so we shouldn't be able to deploy MIPS64.sol with the wrong version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this toggle. The linked PR shows that it requires version 6, not 7. And if the input value has to be some static value, why is that added as a constructor argument at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That PR is the one that added the feature toggle, but the changes for Go 1.23 updated it to 7. And I'll let @mbaxter answer your question about the design of the constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linked PR shows version 6 because that PR was laying the groundwork for Go 1.23 - it just handles all of the plumbing required to be able to inject the stateVersion. You can find the PR's that build on top of this here.

Currently in develop, we support 2 versions: 7 (current) and 8 (the next "experimental" version). The U16 release branch supports only version 7. The general idea is that we want to be able to support the governance-approved version while continuing to develop the next version of the VM.

Comment on lines 71 to 79
### FM2: Stack depth-related refactoring with new dclo/dclz instructions introduced a bug

- **Description:** Arguments were consolidated into a struct to avoid "stack too deep" issues.
- **Risk Assessment:** low
- **Mitigations:**
1. We have comprehensive differential testing on all VM instructions, which should catch any potential refactoring-related bugs
2. This is a trivial refactoring
- **Detection:** We rely on our tests.
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract.
Copy link
Contributor

@mds1 mds1 May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding of this failure mode correct?

edit: seems like no, based on the below failure mode saying these are new instructions?

Suggested change
### FM2: Stack depth-related refactoring with new dclo/dclz instructions introduced a bug
- **Description:** Arguments were consolidated into a struct to avoid "stack too deep" issues.
- **Risk Assessment:** low
- **Mitigations:**
1. We have comprehensive differential testing on all VM instructions, which should catch any potential refactoring-related bugs
2. This is a trivial refactoring
- **Detection:** We rely on our tests.
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract.
### FM2: Stack depth-related refactoring with new dclo/dclz instructions introduced a bug
- **Description:** In the solidity code, arguments were consolidated into a struct to avoid "stack too deep" issues, which may have introduced a bug.
- **Risk Assessment:** low
- **Mitigations:**
1. We have comprehensive differential testing on all VM instructions between go and solidity, which should catch any potential refactoring-related bugs. In this case, the solidity code was changed but the go code was unchanged, therefore we have confidence a bug was not introduced from the refactor.
2. This is a trivial refactoring
- **Detection:** We rely on our tests.
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have taken your edits, thanks.

Comment on lines +81 to +89
### FM3: New dclo/dclz instructions are incorrectly implemented

- **Description:** There are two new instructions, there could be a bug in the implementation. They aren't used by op-program, but would be used if we ever deployed Kona on Cannon.
- **Risk Assessment:** low
- **Mitigations:**
1. These instructions aren't emitted by the Go compiler, so behavior should not affect the VM when running op-program
2. If we ever do deploy Kona on Cannon we will do more testing, including running it on mainnet data for weeks in VM Runner.
- **Detection:** The program would crash if it used those instructions and they were incorrectly implemented.
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other onchain changes we'd need to ship kona to prod? i.e why we are shipping these to production if we aren't going to use them in production, is this to help roll out kona incrementally in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these are the only changes needed. They're included here because it gives us the option to use Kona with Cannon if needed.


### FM4: Incomplete Go 1.23 support (missing syscalls)

- **Description:** It's possible that the Go 1.23 compiler uses additional syscalls that we haven't noticed and they aren't implemented.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we typically notice the new syscalls that we need to implement? It sounds like this is empirically based on running the challenger, as opposed to e.g. go release notes giving us this info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two ways:

  1. vm-compat is a tool that runs in CI and detects new syscalls referenced in the op-program binary (I added this to the text).
  2. op-challenger-runner continually runs op-program against mainnet blocks

Comment on lines 97 to 98
- **Detection:** We rely on our tests.
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we detect and recover from this in production? Would our challenger predict incorrect game result, and we pause and follow an existing runbook?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends what you mean by "in production". We have never actually needed to run op-program (i.e. play a dispute game all the way down to step()) in production, so this probably means that op-challenger-runner (AKA "VM runner") would fail and alert us, which would mean we're aware of a vulnerability that hasn't yet been exploited, so we can fix it and disclose as per our policies (which at the moment wouldn't require a pause).

- **Mitigations:**
1. We have comprehensive differential testing on all VM instructions, which should catch any potential refactoring-related bugs
2. This is a trivial refactoring
- **Detection:** We rely on our tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment to the above about how this gets detected in production

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no generalized automatic way to catch a bug, but this refactoring was pretty trivial and has been audited.

- **Mitigations:**
1. The version number is checked in the constructor, and currently it's required to be 7 (the latest version) so we shouldn't be able to deploy MIPS64.sol with the wrong version.
2. This logic is fairly simple, it's just a check against the version number to enable features, so it's easy to reason about and low risk of being implemented incorrectly.
- **Detection:** We have manually reviewed for this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment to the above about how this gets detected in production, do have some post-deploy assertions in OPCM or superchain-ops that verify the config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no assertions in OPCM to verify the version.

Comment on lines +102 to +107
- **Description:** the eventfd and mprotect syscalls were implemented as a noop, because it was determined that it won't be used by op-program even though there is a reference to it in the binary.
- **Risk Assessment:** medium
- **Mitigations:**
1. We have been running op-challenger-runner on production data for several weeks with the new VM
- **Detection:** We rely on our tests.
- **Recovery Path(s)**: It would require fixing the bug and upgrading the contract.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar questions to above—how did we determine this, how do we detect/recover in production?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[U16] FMA for Cannon updates to support Go 1.23 and Kona
3 participants