Add first version of the spec for the Head-MPT State network #389
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,267 @@ | ||
# Execution Head-MPT State Network | ||
|
||
| 🚧 THE SPEC IS IN A STATE OF FLUX AND SHOULD BE CONSIDERED UNSTABLE 🚧 <br> _Clients should implement Account Trie first and reevaluate the proposed approach for Storage Tries_ | | ||
|-| | ||
|
||
This document is the specification for the Portal Network that supports on-demand availability of | ||
Merkle Patricia Trie state data from the execution chain, for blocks close to the head of the chain. | ||
|
||
While similar to the [Execution State network](state-network.md), it has some unique features: | ||
|
||
- It supports direct lookup of account state, contract code, and contract storage | ||
- It only stores state that is present in any of the latest 256 blocks | ||
- Each node is responsible for storing a subtree of the entire state trie | ||
|
||
## Overview | ||
|
||
The Execution Head-MPT State Network is a | ||
[Kademlia](https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf) DHT that uses the | ||
[Portal Wire Protocol](../portal-wire-protocol.md) to establish an overlay network on top of the | ||
[Discovery v5](https://github.com/ethereum/devp2p/blob/master/discv5/discv5-wire.md) protocol. | ||
|
||
Nodes are responsible for storing a fixed state subtree across all 256 recent blocks. | ||
|
||
Nodes are expected to have access to the latest 256 block headers, which they will use to validate | ||
content and handle re-orgs. The [History](../history/history-network.md) and | ||
[Beacon](../beacon-chain/beacon-network.md) networks can be used for this purpose, but | ||
implementations can use other out-of-protocol solutions as well. | ||
|
||
Content is gossiped as subtries of each block's trie-diff. This provides a very efficient way to keep a node's | ||
subtrie updated as the chain progresses. | ||
|
||
### Data | ||
|
||
The network stores execution layer state content, which encompasses the following data from the | ||
latest 256 blocks: | ||
|
||
- Block trie-diffs | ||
- Account trie | ||
- All contract bytecode (planned but not implemented) | ||
- All contract storage tries (planned but not implemented) | ||
|
||
#### Types | ||
|
||
Available content types are: | ||
|
||
- Block trie-diff, identifiable by block hash and subtrie path | ||
- Account trie nodes, identifiable by block hash and account trie path | ||
- Contract's bytecode, identifiable by block hash and account's address | ||
- Contract's storage trie, identifiable by block hash, contract's address, and storage trie path | ||
|
||
#### Retrieval | ||
|
||
Every content type is retrievable using its identifiers. This means that account state (balance, | ||
nonce, bytecode, state root) and contract storage are retrievable using a single content lookup. | ||
|
||
## Specification | ||
|
||
### Distance Function | ||
|
||
The network uses the standard XOR distance metric, defined in the | ||
[portal wire protocol](../portal-wire-protocol.md#xor-distance-function) specification. | ||
|
||
The only difference is that arguments can have fewer than 256 bits, but both arguments must have the | ||
same length. | ||
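As a minimal sketch of this rule, the distance can be computed on equal-length bit strings. Representing arguments as strings of `0` and `1` characters is an assumption made here for illustration only; the spec does not prescribe an in-memory representation.

```python
def distance(a: str, b: str) -> int:
    """XOR distance between two equal-length bit strings such as "1010".

    The bit-string representation is a hypothetical choice for this sketch.
    """
    if len(a) != len(b):
        raise ValueError("both arguments must have the same number of bits")
    if not a:
        # Zero-length arguments (e.g. the path of the trie root) are at distance 0.
        return 0
    return int(a, 2) ^ int(b, 2)
```

For example, `distance("1010", "0110")` XORs the two 4-bit values, while passing arguments of different lengths raises an error instead of silently padding.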
|
||
### Content ID Derivation Function | ||
|
||
> 🚧 **TODO**: Consider changing name to Content Path | ||
|
||
The content id is derived only from the content's trie path (and the contract's address in the case of a | ||
contract's trie node). Its primary use case is in determining which nodes on the network should | ||
store the content. | ||
|
||
The content id has the following properties: | ||
|
||
- It has variable length, between 0 and 256 bits (only multiples of 4), representing the trie path | ||
- It's not unique, meaning different content can have the same content id | ||
- The trie path of the contract's storage trie node is modified using the contract's address | ||
|
||
The derivation function is slightly different for different types and is defined below. | ||
Comment on lines +70 to +80

Review comment: I'm not understanding the motivation/need/justification for a variable length content-id.

Review comment: I guess that they would have to be 32 bytes. And you would generate them using my approach plus appending zeroes, right? The problem is that you can't distinguish between root node, and first node of the first level (path: 0), and first node on the second level (path: 00), as they would all have content_id: 0x000000..00. You can make it 34 bytes long and include (e.g. at start) the length. But in that case you might just as well make them different length. I could be mistaken and there is some way around it.
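The ambiguity described in the review comment above can be illustrated with a short sketch. The function `padded_id` is hypothetical and is not part of the spec: it packs a nibble path into a fixed 32-byte value by right-padding with zeros, which is the alternative the comment argues against.

```python
def padded_id(path: list[int]) -> bytes:
    """Hypothetical fixed-length id: pack nibbles, then right-pad with zero bits."""
    value = 0
    for nibble in path:
        value = (value << 4) | nibble
    bits = 4 * len(path)
    return (value << (256 - bits)).to_bytes(32, "big")


def variable_length_id(path: list[int]) -> tuple[int, int]:
    """Variable-length id modelled as (length in bits, path value)."""
    value = 0
    for nibble in path:
        value = (value << 4) | nibble
    return (4 * len(path), value)


# The root, path "0" and path "00" collide once zero-padded to 32 bytes...
collide = padded_id([]) == padded_id([0]) == padded_id([0, 0])
# ...but remain distinct when the length is part of the id.
distinct = len({variable_length_id(p) for p in ([], [0], [0, 0])}) == 3
```

Under these assumptions `collide` and `distinct` are both true, which is the reason given in the thread for keeping the length as part of the content id.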
||
|
||
### Wire Protocol | ||
|
||
#### Protocol Identifier | ||
|
||
As specified in the [Protocol identifiers](../portal-wire-protocol.md#protocol-identifiers) section | ||
of the Portal wire protocol, the `protocol` field in the `TALKREQ` message **MUST** contain the | ||
value of `0x5009`. | ||
|
||
#### Supported Message Types | ||
|
||
The network supports the following protocol messages: | ||
|
||
- `Ping` - `Pong` | ||
- `Find Nodes` - `Nodes` | ||
- `Find Content` - `Found Content` | ||
- `Offer` - `Accept` | ||
|
||
#### `Ping.payload` & `Pong.payload` | ||
|
||
The payload type of the first `Ping` message between nodes MUST be | ||
[Type 0: Client Info, Radius, and Capabilities Payload](../ping-extensions/extensions/type-0.md). | ||
Nodes then upgrade to the latest payload type supported by both clients. | ||
|
||
Currently supported payloads, from latest to oldest: | ||
- [Type 1 Basic Radius Payload](../ping-extensions/extensions/type-1.md) | ||
|
||
### Routing Table | ||
|
||
The network uses the standard routing table structure from the Portal Wire Protocol. | ||
|
||
### Node State | ||
|
||
#### Data Radius | ||
|
||
The network includes one additional piece of node state that should be tracked: `data_radius`. This | ||
value is a 256-bit integer and represents the data that a node is "interested" in. The value may | ||
fluctuate as the contents of the local key-value store change. | ||
|
||
> 🚧 **TODO**: consider making radius power of two | ||
|
||
A node should track its own radius value and provide this value in all `Ping` or `Pong` | ||
messages it sends to other nodes. A node is expected to maintain `data_radius` information for each | ||
node in its local routing table. | ||
|
||
We define the following function to determine whether a node in the network should be interested in a | ||
piece of content. | ||
|
||
```py | ||
def interested(node, content): | ||
    bits = content.id.length | ||
    return distance(node.id[:bits], content.id) <= node.radius[:bits] | ||
``` | ||
|
||
Review comment: For the trie diff data type which has a path of len = 8, if a node has a radius where [...] I guess a similar problem can happen for the other data types that have a longer path but with a lower probability of occurrence.

Review comment: I assumed that type of distance and radius is U256 (unsigned 256 bit integer). When represented as bytes, I assumed big-endian ordering, so if [...] Is that what you understood as well? And is logic still faulty?

Review comment: Actually I misunderstood the python slice syntax here, I read it as the last 8 bits but it is actually the first 8 bits of the radius. There is still a similar problem as described above but the reverse scenario. If nodes that have a radius where the first 8 bits are 0 (very common) then they will store almost no content (only content with a distance of zero). Perhaps we should change the distance function to something like this: [...]

Review comment: [...] That's why I use [...] I think that logdistance would work as well (at least in the case of the gossip), as they should be pretty much the same.

Review comment: [...] I see. Well given that scenario, if a node sets its storage capacity to zero and therefore sets its radius to zero, then it would still store one of the subtrie-diffs, is that right?

Review comment: Huh, that's interesting. We have the current expectation of being able to set the storage to 0 and accept nothing in history. It would be surprising to me to set storage to 0 and accept/store content anyway.

Review comment: [...] Well, yes and no. According to spec (history, state), we define following: [...] So even if [...] With that being said, I'm also fine having some rules that loosen the requirements for storing, for example one of these would work: [...]

Review comment: I agree that nodes that have a zero radius should be allowed to store nothing even with the existing distance function but why not change the distance function to use [...] For example: [...]

Review comment: Then the node that has the maximum radius (0xFFFF..FF) shouldn't store content that is farthest away (e.g. [...] Also, comparing trie path of the subtrie (or any inner node for that matter) works much better for [...]
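The `interested` check above can be made concrete with a small runnable sketch. It assumes, as stated in the review thread, that node ids and radii are unsigned 256-bit integers in big-endian order, so that the `[:bits]` slice means "the first `bits` bits". The `ContentId` class and the `prefix` helper are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass

ID_BITS = 256  # node ids and radii are modelled as 256-bit unsigned integers


@dataclass(frozen=True)
class ContentId:
    value: int  # trie path interpreted as a big-endian integer
    bits: int   # path length in bits (a multiple of 4, between 0 and 256)


def prefix(x: int, bits: int) -> int:
    """Return the first `bits` bits of a 256-bit integer."""
    return x >> (ID_BITS - bits)


def interested(node_id: int, radius: int, content_id: ContentId) -> bool:
    """Compare XOR distance and radius on the first `bits` bits only."""
    dist = prefix(node_id, content_id.bits) ^ content_id.value
    return dist <= prefix(radius, content_id.bits)
```

Note that in this sketch a node with radius zero is still interested in content whose path equals the prefix of its own node id, which matches the behaviour discussed in the review thread rather than resolving it.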
|
||
> 🚧 **TODO**: revisit if this is correct for non-leaf nodes | ||
|
||
> 🚧 **TODO**: maybe adjust for contract's storage trie | ||
|
||
### Data Types | ||
|
||
#### Helper Data Types | ||
|
||
The helper types `Nibbles`, `AddressHash`, `TrieNode`, and `TrieProof` are defined the same way as | ||
in [Execution State Network](state-network.md#helper-data-types). | ||
|
||
##### TrieDiff | ||
|
||
The Trie-Diff is the minimal structure that represents how the MPT changed from one block to the | ||
next. Note that it is not sufficient for executing the block, because data that is only read is not | ||
present. | ||
|
||
For a node to verify that a provided trie-diff is complete and minimal, it needs both the previous | ||
and the new value of every changed trie node. | ||
|
||
```py | ||
TrieNodeList = List[TrieNode, limit=65536] | ||
TrieDiff = Container(before: TrieNodeList, after: TrieNodeList) | ||
``` | ||
|
||
One should be able to construct two partial views of the Merkle Patricia Trie: one before and one after the | ||
associated block. The parts that are present should match in both partial views (and likewise the missing | ||
parts). The trie nodes are ordered as if they are visited using a | ||
[pre-order Depth First Search](https://en.wikipedia.org/wiki/Tree_traversal#Pre-order,_NLR) | ||
traversal algorithm. | ||
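The pre-order ordering described above can be sketched on a simplified tree. The `Node` class is hypothetical and ignores the distinction between branch, extension, and leaf nodes; visiting children in ascending nibble order is also an assumption, since the text only names the traversal.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    # Only children that are part of the diff are present (nibble -> child).
    children: dict[int, "Node"] = field(default_factory=dict)


def pre_order(node: Node) -> list[str]:
    """List a partial trie parent-first, then each present child subtree."""
    ordered = [node.label]
    for nibble in sorted(node.children):
        ordered.extend(pre_order(node.children[nibble]))
    return ordered


# A diff that touches the subtries under nibbles 0x3 and 0xa of the root.
example = Node("root", {
    0xA: Node("a", {0x1: Node("a1")}),
    0x3: Node("3"),
})
order = pre_order(example)
```

With this ordering every node appears immediately before its own subtree, so a verifier can walk the list once while descending the trie.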
Review comment: @morph-dev Is there a specific reason why pre-order Depth First Search was selected as the ordering of the trie nodes in the trie diff trie node lists? I'm wondering if using Breadth First Search ordering of the trie nodes would be better because it might lead to simpler algorithms that can be defined recursively and are easier to reason about. For example when validating the trie nodes in the trie diff in place (without using additional memory/storage) we need to walk down each of the paths in the trie and check that the hashes of the child nodes match the hashes in the parent nodes. With a breadth first ordering we can define a recursive algorithm that takes a parent node, looks at how many children it should have and then check the hashes of the next n child nodes in the trie node list. I'm sure it can still be done using pre-order Depth First Search but I'm thinking that Breadth First Search ordering might be easier to understand because the trie nodes are organised in layers based on trie depth.

Review comment: There is no strong reason for picking DFS over BFS, but I think that I disagree that BFS would be easier to reason about (especially if you plan to use recursive algorithm), because: [...] All in all, I don't have strong preferences, but my intuition is that DFS is simpler. But if most people prefer one, I don't mind.

Review comment: I don't have a strong opinion either way so I don't mind sticking with DFS if that is preferred. Yes true, not all trie nodes will be present because it is just a diff not the full state subtree. This makes me wonder, for each parent node in a trie node list how can we know which child nodes are encoded in the list? It would be useful to know this. I guess we could use a trial and error method where we hash the child nodes and see which of them match the hashes in the parent nodes (not sure how well this would work). But perhaps it would be better to encode in the data structure this information somehow.

Review comment: The way I envisioned is that: [...] There is probably some tricky logic with dealing with extension nodes and branches, and/or when some subtrie is not missing, I think this approach can be adjusted for that.

Review comment: I only skimmed this conversation, but I believe Depth-First is the correct ordering here. My previous experience working with MPT proofs suggests that efficient algorithms can generally be written for either ordering, but that Depth-First is the more natural way to do this.
||
|
||
The actual usage of this type is slightly different. We use a subtrie (at depth 2) of the whole | ||
Trie-Diff. To accommodate this, we don't have to change the type. Instead, it's enough to allow the | ||
first two layers to omit parts that are different (except the subtrie that is specified by the content | ||
key). | ||
|
||
#### Account Trie-Diff | ||
|
||
This data type represents a subtrie of a block's Trie-Diff. The entire Trie-Diff is split into | ||
subtries at depth 2 (2 nibbles or 8 bits). This depth was chosen arbitrarily | ||
as a good estimate for not making subtrie-diffs too small or too big. | ||
|
||
```py | ||
selector = 0x30 | ||
account_trie_diff = Container(path: u8, block_hash: Bytes32) | ||
|
||
content_key = selector + SSZ.serialize(account_trie_diff) | ||
content_value = Container(subtrie_diff: TrieDiff) | ||
|
||
def content_id(account_trie_diff): | ||
    return account_trie_diff.path | ||
``` | ||
|
||
Review comment: Does path make more sense as a [...]

Review comment: You mean as [...] I don't have strong preference. We already bundle 2 nibbles into one byte as part of the Nibbles type, so I thought I would just assume that two nibbles are bundled into one byte.

Review comment: I meant [...]
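As a rough sketch of the Account Trie-Diff content key encoding: because both fields of the container are fixed-size, SSZ serialization reduces to concatenating the one-byte `path` and the 32-byte `block_hash`. The function and constant names below are hypothetical, and no SSZ library is used.

```python
ACCOUNT_TRIE_DIFF_SELECTOR = 0x30


def account_trie_diff_content_key(path: int, block_hash: bytes) -> bytes:
    """Selector byte followed by the SSZ encoding of (path: u8, block_hash: Bytes32)."""
    if not 0 <= path <= 0xFF:
        raise ValueError("path must fit in a u8 (two nibbles)")
    if len(block_hash) != 32:
        raise ValueError("block_hash must be 32 bytes")
    # Fixed-size SSZ containers serialize as the concatenation of their fields.
    return bytes([ACCOUNT_TRIE_DIFF_SELECTOR, path]) + block_hash


def account_trie_diff_content_id(path: int) -> int:
    """The content id is just the 8-bit subtrie path."""
    return path
```

The resulting key is always 34 bytes long: one selector byte, one path byte, and the block hash.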
|
||
The `subtrie_diff` field of the content value also includes the first 2 layers of the trie (as its first | ||
two elements). | ||
|
||
#### Account Trie Node | ||
|
||
This data type represents a node from the account trie. | ||
|
||
```py | ||
selector = 0x31 | ||
account_trie_node = Container(path: Nibbles, block_hash: Bytes32) | ||
|
||
content_key = selector + SSZ.serialize(account_trie_node) | ||
content_value = Container(proof: TrieProof) | ||
|
||
def content_id(account_trie_node): | ||
    return account_trie_node.path | ||
``` | ||
|
||
The last trie node in the `proof` MUST correspond to the trie path from the content key. | ||
|
||
#### Contract Trie-Diff | ||
|
||
This data type represents a subtrie of a contract's Trie-Diff at a specific block. The entire | ||
Trie-Diff is split into subtries at depth 2. This depth was chosen arbitrarily as a good estimate for not | ||
making subtrie-diffs too small or too big. | ||
|
||
```py | ||
selector = 0x32 | ||
contract_trie_diff = Container(path: u8, address_hash: AddressHash, block_hash: Bytes32) | ||
|
||
content_key = selector + SSZ.serialize(contract_trie_diff) | ||
content_value = Container(subtrie_diff: TrieDiff, account_proof: TrieProof) | ||
|
||
def content_id(contract_trie_diff): | ||
    return contract_trie_diff.path XOR contract_trie_diff.address_hash[:2] | ||
``` | ||
|
||
#### Contract Trie Node | ||
|
||
This data type represents a node from the contract's storage trie. | ||
|
||
```py | ||
selector = 0x33 | ||
contract_trie_node = Container(path: Nibbles, address_hash: AddressHash, block_hash: Bytes32) | ||
|
||
content_key = selector + SSZ.serialize(contract_trie_node) | ||
content_value = Container(storage_proof: TrieProof, account_proof: TrieProof) | ||
|
||
def content_id(contract_trie_node): | ||
    bits = contract_trie_node.path.length | ||
    return contract_trie_node.path XOR contract_trie_node.address_hash[:bits] | ||
``` | ||
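The content id derivation for a contract trie node can be sketched as follows. This assumes that the path is a list of nibbles and that the address hash prefix is taken nibble by nibble up to the path length; the spec's `[:bits]` slice leaves the exact unit open, so treat this as one possible reading. The function name is hypothetical.

```python
def contract_trie_node_content_id(path: list[int], address_hash: bytes) -> list[int]:
    """XOR each path nibble with the corresponding nibble of the address hash."""
    if len(address_hash) != 32:
        raise ValueError("address_hash must be 32 bytes")
    if len(path) > 64:
        raise ValueError("path can have at most 64 nibbles")
    if any(not 0 <= nibble <= 0xF for nibble in path):
        raise ValueError("path must contain nibbles (0..15)")
    # Split every byte of the hash into its high and low nibble.
    hash_nibbles = [n for byte in address_hash for n in (byte >> 4, byte & 0x0F)]
    # zip() stops at the end of the path, so only the hash prefix is used.
    return [p ^ h for p, h in zip(path, hash_nibbles)]
```

The effect is that the same storage trie path lands in different parts of the id space for different contracts, while the id keeps the length of the original path.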
|
||
The last trie node in the `storage_proof` MUST correspond to the trie path from the content key, | ||
inside the contract's storage trie. | ||
|
||
#### Contract Bytecode | ||
|
||
> 🚧 **TODO**: Evaluate if needed | ||
|
||
> 🚧 **TODO**: Write spec (should be similar to [Execution State Network](state-network.md)). | ||
|
||
### Algorithms | ||
|
||
#### Gossip | ||
|
||
Only the Trie-Diffs will be gossiped between nodes. This is an efficient mechanism for nodes | ||
to keep up to date with the chain, as they can easily update the subtrie they are responsible for. | ||
|
||
#### Storage layout | ||
|
||
Clients MUST store the entire subtree that is "close" to their node.id. They MUST also keep track of | ||
all trie nodes from the root of the trie to the root of their respective subtree. | ||
|
||
One way of storing data is to combine trie nodes that correspond to the same trie path, and keep | ||
track of the latest version and a series of reverse diffs (the block number at which the trie node changed, | ||
and its previous value). | ||
|
||
The Trie-Diff content type can be used for determining which trie nodes contain reverse diffs that fall | ||
outside the "most recent 256" window and can be purged. |
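One possible shape for the storage layout described above is sketched below. It is an illustration only: the class and method names are hypothetical, trie nodes are treated as opaque bytes, and re-org handling is left out.

```python
from dataclasses import dataclass
from typing import Optional

WINDOW = 256  # number of most recent blocks that must stay reconstructible


@dataclass
class VersionedNode:
    latest: Optional[bytes]
    # (block number at which the node changed, its previous value), ascending.
    reverse_diffs: list[tuple[int, Optional[bytes]]]


class SubtrieStore:
    def __init__(self) -> None:
        self.nodes: dict[tuple[int, ...], VersionedNode] = {}

    def update(self, path: tuple[int, ...], block: int, value: Optional[bytes]) -> None:
        """Record a new value, keeping the old one as a reverse diff."""
        entry = self.nodes.get(path)
        if entry is None:
            self.nodes[path] = VersionedNode(value, [(block, None)])
        else:
            entry.reverse_diffs.append((block, entry.latest))
            entry.latest = value

    def value_at(self, path: tuple[int, ...], block: int) -> Optional[bytes]:
        """Undo reverse diffs newer than `block` to recover the old value."""
        entry = self.nodes.get(path)
        if entry is None:
            return None
        value = entry.latest
        for changed_at, previous in reversed(entry.reverse_diffs):
            if changed_at <= block:
                break
            value = previous
        return value

    def purge(self, head_block: int) -> None:
        """Drop reverse diffs only needed for blocks older than the window."""
        oldest = head_block - WINDOW + 1
        for entry in self.nodes.values():
            entry.reverse_diffs = [d for d in entry.reverse_diffs if d[0] > oldest]
```

A reverse diff recorded at block `n` is only needed to answer queries for blocks before `n`, so it can be dropped once the oldest block in the window is `n` or later.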
Review comment: Would like to get validation of this concept done since it's a core building block of the network designs. We need test data for: