Description
Lossless syntax trees (rowan), by definition, need to represent whitespace and comments (trivia).
There are two approaches to representing them in the syntax tree (using 1 + 1 as the example):
Attach to nodes:
BIN_EXPR@0..5
  LITERAL@0..1
    INT_NUMBER@0..1 "1"
  WHITESPACE@1..2 " "
  PLUS@2..3 "+"
  WHITESPACE@3..4 " "
  LITERAL@4..5
    INT_NUMBER@4..5 "1"
Attach to tokens:
BIN_EXPR@0..5
  LITERAL@0..1
    INT_NUMBER@0..1 "1"
  PLUS@1..3 "+", leading trivia: " "
  LITERAL@3..5
    INT_NUMBER@3..5 "1", leading trivia: " "
The first approach is what's used by rust-analyzer today and by IntelliJ. Here, trivia that sits between two nodes is attached to their parent node.
The second approach is what's used by Roslyn and Swift's libSyntax. It's a bit more hacky: a piece of trivia is attached to the following non-trivia token. That is, each token conceptually stores a leading_trivia: Vec<Trivia> (but of course the encoding is optimized for common cases like a single whitespace between tokens). See this doc for a more thorough description.
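To make the "fat token" idea concrete, here is a minimal sketch of folding floating trivia into the following token. All type and function names (`RawToken`, `Token`, `attach_trivia`, `render`, `lex_example`) are hypothetical and exist only for illustration; this is not rowan's, Roslyn's, or libSyntax's actual encoding, and a real implementation would intern tokens and special-case single spaces rather than store a `Vec` per token.

```rust
// Hypothetical types for illustration only; not an existing library API.
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    IntNumber,
    Plus,
    Whitespace,
}

impl Kind {
    fn is_trivia(&self) -> bool {
        matches!(self, Kind::Whitespace)
    }
}

/// A token as a lexer might produce it, trivia included ("floating" model).
#[derive(Debug, Clone, PartialEq)]
struct RawToken {
    kind: Kind,
    text: String,
}

/// A piece of trivia stored inside a token.
#[derive(Debug, Clone, PartialEq)]
struct Trivia {
    kind: Kind,
    text: String,
}

/// A "fat" token: a non-trivia token plus the trivia that precedes it.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    leading_trivia: Vec<Trivia>,
    kind: Kind,
    text: String,
}

/// Folds each run of trivia into the next non-trivia token.
/// Trivia at the very end has no following token, so it is returned separately.
fn attach_trivia(raw: Vec<RawToken>) -> (Vec<Token>, Vec<Trivia>) {
    let mut tokens = Vec::new();
    let mut pending = Vec::new();
    for t in raw {
        if t.kind.is_trivia() {
            pending.push(Trivia { kind: t.kind, text: t.text });
        } else {
            tokens.push(Token {
                leading_trivia: std::mem::take(&mut pending),
                kind: t.kind,
                text: t.text,
            });
        }
    }
    (tokens, pending)
}

/// Reassembles the source text, showing that nothing was lost.
fn render(tokens: &[Token], trailing: &[Trivia]) -> String {
    let mut out = String::new();
    for tok in tokens {
        for tr in &tok.leading_trivia {
            out.push_str(&tr.text);
        }
        out.push_str(&tok.text);
    }
    for tr in trailing {
        out.push_str(&tr.text);
    }
    out
}

/// The five floating tokens of `1 + 1`.
fn lex_example() -> Vec<RawToken> {
    let raw = |kind, text: &str| RawToken { kind, text: text.to_string() };
    vec![
        raw(Kind::IntNumber, "1"),
        raw(Kind::Whitespace, " "),
        raw(Kind::Plus, "+"),
        raw(Kind::Whitespace, " "),
        raw(Kind::IntNumber, "1"),
    ]
}
```

For 1 + 1, the five floating tokens collapse into three fat tokens, and rendering them gives back the original text, so the tree stays lossless while holding fewer child entries.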
Why go with this strange hack with "fat" tokens? There are several benefits to it:
- fixed structure of the syntax tree. With floating tokens, any node can have any number of children. If we attach trivia to tokens, we can classify nodes into two buckets:
  - those that have a fixed number of (potentially missing) children (like struct decl)
  - those that have any number of children of the same type (like struct's list of fields)
  This in turn gives us O(1) access to a specific child.
- better programming model. I hypothesize that having trivia attached to tokens makes certain refactors "just work". Specifically, type-safe modifications are naturally trivia-preserving. For example, if you have two blocks, and you want to append the content of the first block to the second one, you can do roughly:
    for stmt in b1.stmts() { b2 = b2.append_stmt(stmt); }
  This works with attached trivia, but, with floating trivia, you'd need to transfer trivia nodes explicitly, or resort to an untyped API. Note that this is a hypothesis: I haven't worked closely with a Roslyn-style API, so I don't know how important it is in practice.
- better performance. This is also hypothetical, but, with token interning, storing trivia inside tokens probably won't increase the overall storage for tokens that much. However, we'd spend half as much memory on storing pointers to tokens, because roughly half of the tokens are trivia.
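The first two points can be sketched together. The code below is a simplified illustration under stated assumptions: `Node`, `Element`, `Token`, the slot constants, and the helpers `tok`, `stmt`, `slot`, `append`, and `text` are all hypothetical names, and trivia is stored as a plain `String` per token for brevity. It is not how rowan or Roslyn lay out their trees.

```rust
// Hypothetical, simplified types; not the API of any existing library.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    leading_trivia: String,
    text: String,
}

#[derive(Debug, Clone, PartialEq)]
enum Element {
    Token(Token),
    Node(Node),
}

#[derive(Debug, Clone, PartialEq)]
enum Node {
    /// A fixed number of slots, each possibly missing (e.g. a binary expression).
    Fixed { kind: &'static str, slots: Vec<Option<Element>> },
    /// Any number of children of the same type (e.g. the statements of a block).
    List { kind: &'static str, items: Vec<Node> },
}

// Slot positions of a binary expression: known statically, since no trivia
// children can shift them.
const LHS: usize = 0;
const OP: usize = 1;
const RHS: usize = 2;

impl Node {
    /// O(1) child access: index straight into the slot, no scan past trivia.
    fn slot(&self, index: usize) -> Option<&Element> {
        match self {
            Node::Fixed { slots, .. } => slots.get(index)?.as_ref(),
            Node::List { .. } => None,
        }
    }

    /// Items of a list node (empty for fixed nodes).
    fn items(&self) -> &[Node] {
        match self {
            Node::List { items, .. } => items,
            Node::Fixed { .. } => &[],
        }
    }

    /// Returns a list with `item` appended. The item's tokens carry their own
    /// leading trivia, so nothing has to be transferred separately.
    fn append(self, item: Node) -> Node {
        match self {
            Node::List { kind, mut items } => {
                items.push(item);
                Node::List { kind, items }
            }
            fixed => fixed,
        }
    }

    /// Reassembles the source text covered by this node.
    fn text(&self) -> String {
        let mut out = String::new();
        self.write_text(&mut out);
        out
    }

    fn write_text(&self, out: &mut String) {
        match self {
            Node::Fixed { slots, .. } => {
                for el in slots.iter().flatten() {
                    match el {
                        Element::Token(t) => {
                            out.push_str(&t.leading_trivia);
                            out.push_str(&t.text);
                        }
                        Element::Node(n) => n.write_text(out),
                    }
                }
            }
            Node::List { items, .. } => {
                for n in items {
                    n.write_text(out);
                }
            }
        }
    }
}

fn tok(trivia: &str, text: &str) -> Token {
    Token { leading_trivia: trivia.to_string(), text: text.to_string() }
}

/// An expression statement `<name>;` whose first token owns the leading trivia.
fn stmt(trivia: &str, name: &str) -> Node {
    Node::Fixed {
        kind: "EXPR_STMT",
        slots: vec![Some(Element::Token(tok(trivia, name))), Some(Element::Token(tok("", ";")))],
    }
}
```

With a statement list `b1` holding ` x;` and ` y;` and a list `b2` holding ` z;`, the loop `for s in b1.items().iter().cloned() { b2 = b2.append(s); }` produces a list whose text is ` z; x; y;`: the whitespace travels with each statement. Likewise, `slot(OP)` on a binary expression returns the operator directly, even when the right operand slot is `None`.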
I think I lean towards trying the Roslyn trivia model -- it seems like it can be better long term. I wish we could experiment with this in a simple way though :-(