RFC: transition to Roslyn's model for trivia

Lossless syntax trees (rowan), by definition, need to represent whitespace and comments (trivia).

There are two ways to represent them in the syntax tree (using `1 + 1` as the example):

**Attach to nodes:**

```
BIN_EXPR@8200..8205
  LITERAL@8200..8201
    INT_NUMBER@8200..8201 "1"
  WHITESPACE@8201..8202 " "
  PLUS@8202..8203 "+"
  WHITESPACE@8203..8204 " "
  LITERAL@8204..8205
    INT_NUMBER@8204..8205 "1"
```

**Attach to tokens:**

```
BIN_EXPR@8200..8205
  LITERAL@8200..8201
    INT_NUMBER@8200..8201 "1"
  PLUS@8202..8203 "+", leading trivia: " "
  LITERAL@8204..8205
    INT_NUMBER@8204..8205 "1", leading trivia: " "
```

The first approach is what's used by rust-analyzer today and by IntelliJ. Here, trivia that sits between two nodes is attached to their parent node.
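
To make the first model concrete, here is a minimal sketch of the `1 + 1` tree with whitespace stored as ordinary children of `BIN_EXPR`. The types (`Kind`, `Element`, `Node`) and the helper `nth_non_trivia` are hypothetical and heavily simplified; they are not rowan's actual API. The sketch shows that finding "the second meaningful child" requires scanning past trivia.

```rust
// Hypothetical, simplified types for illustration only (not rowan's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    BinExpr,
    Literal,
    IntNumber,
    Plus,
    Whitespace,
}

#[derive(Debug, Clone, PartialEq)]
enum Element {
    Node(Node),
    Token { kind: Kind, text: String },
}

#[derive(Debug, Clone, PartialEq)]
struct Node {
    kind: Kind,
    children: Vec<Element>,
}

impl Element {
    fn kind(&self) -> Kind {
        match self {
            Element::Node(n) => n.kind,
            Element::Token { kind, .. } => *kind,
        }
    }

    // Only whitespace counts as trivia in this sketch; comments are omitted.
    fn is_trivia(&self) -> bool {
        self.kind() == Kind::Whitespace
    }
}

impl Node {
    /// Returns the `n`-th non-trivia child. Trivia can sit at any position,
    /// so this has to be a linear scan over all children.
    fn nth_non_trivia(&self, n: usize) -> Option<&Element> {
        self.children.iter().filter(|c| !c.is_trivia()).nth(n)
    }
}

fn token(kind: Kind, text: &str) -> Element {
    Element::Token { kind, text: text.to_string() }
}

fn literal(text: &str) -> Element {
    Element::Node(Node {
        kind: Kind::Literal,
        children: vec![token(Kind::IntNumber, text)],
    })
}

/// Builds the "attach to nodes" tree for `1 + 1`: five children, two of them trivia.
fn one_plus_one() -> Node {
    Node {
        kind: Kind::BinExpr,
        children: vec![
            literal("1"),
            token(Kind::Whitespace, " "),
            token(Kind::Plus, "+"),
            token(Kind::Whitespace, " "),
            literal("1"),
        ],
    }
}
```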

The second approach is what's used by Roslyn and Swift's libsyntax. It's a bit more hacky -- trivia is attached to the following non-trivia token. That is, each token conceptually stores a `leading_trivia: Vec<Trivia>` (but of course the encoding is optimized for common cases like a single whitespace between tokens). See [this doc](https://github.com/apple/swift/tree/a4378151d80b7bac8fd827b03fc8db5c7906bfc8/lib/Syntax#trivia) for a more thorough description.
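
As a rough illustration of the conceptual `leading_trivia: Vec<Trivia>` layout, the sketch below lexes a tiny input (digits, whitespace, single-character punctuation) into "fat" tokens and reassembles the original text from them. Everything here is an assumption made for the example: the `Trivia` and `Token` types, the `lex` and `to_text` helpers, and the choice to hang end-of-input trivia on an empty end-of-file token. It also uses the naive `Vec` encoding, not the optimized one mentioned above.

```rust
// Hypothetical types illustrating the "attach to tokens" model; these are not
// the real definitions from Roslyn, libsyntax, or rowan.
#[derive(Debug, Clone, PartialEq)]
enum Trivia {
    Whitespace(String),
    Comment(String), // not produced by this toy lexer, kept for completeness
}

impl Trivia {
    fn text(&self) -> &str {
        match self {
            Trivia::Whitespace(s) | Trivia::Comment(s) => s,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    leading_trivia: Vec<Trivia>,
    text: String,
}

/// Splits `src` into tokens, attaching each whitespace run to the token that
/// follows it.
fn lex(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut pending = Vec::new(); // trivia waiting for the next real token
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            let mut ws = String::new();
            while let Some(&c) = chars.peek() {
                if !c.is_whitespace() {
                    break;
                }
                ws.push(c);
                chars.next();
            }
            pending.push(Trivia::Whitespace(ws));
        } else if c.is_ascii_digit() {
            let mut num = String::new();
            while let Some(&c) = chars.peek() {
                if !c.is_ascii_digit() {
                    break;
                }
                num.push(c);
                chars.next();
            }
            tokens.push(Token { leading_trivia: std::mem::take(&mut pending), text: num });
        } else {
            chars.next();
            tokens.push(Token { leading_trivia: std::mem::take(&mut pending), text: c.to_string() });
        }
    }
    // Assumption of this sketch: trivia at the very end of the input is
    // attached to an empty end-of-file token so that nothing is lost.
    if !pending.is_empty() {
        tokens.push(Token { leading_trivia: pending, text: String::new() });
    }
    tokens
}

/// Reassembles the source text, showing that the token list is lossless.
fn to_text(tokens: &[Token]) -> String {
    let mut out = String::new();
    for token in tokens {
        for trivia in &token.leading_trivia {
            out.push_str(trivia.text());
        }
        out.push_str(&token.text);
    }
    out
}
```

For `1 + 1` this yields three tokens instead of five, with the two spaces stored as leading trivia of `+` and the second `1`.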

Why go with this strange hack with "fat" tokens? There are several benefits to it:

* fixed structure of the syntax tree. With floating trivia, any node can have /*any*/ /*number*/ /*of*/ /*children*/. If we attach trivia to tokens, we can classify nodes into two buckets:
  * those that have a fixed number of (potentially missing) children (like a struct declaration)
  * those that have any number of children of the same type (like a struct's list of fields)

  This in turn gives us O(1) access to a specific child.
* better programming model. I hypothesize that having trivia attached to tokens makes certain refactors "just work". Specifically, type-safe modifications are naturally trivia-preserving. For example, if you have two blocks, and you want to append the contents of the first block to the second one, you can do roughly:

  ```rust
  for stmt in b1.stmts() { b2 = b2.append_stmt(stmt); }
  ```
  This works with attached trivia, but with floating trivia you'd need to transfer trivia nodes explicitly, or resort to an untyped API. Note that this is a hypothesis: I haven't worked closely with a Roslyn-style API, so I don't know how important it is in practice.
* better performance. This is also hypothetical, but, with token interning, storing trivia inside tokens probably won't increase the overall storage for tokens that much. However, we'd spend about half as much memory on pointers to tokens, because roughly half of the tokens are trivia.
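
The first two points can be sketched together. Below, a block has two fixed slots (its braces) and one homogeneous list of statements, so indexing a statement is a plain `Vec` lookup, and appending statements carries their comments along because the trivia lives inside the tokens. All names (`Token`, `Stmt`, `Block`, `append_stmt`, `append_all`, and the small constructors) are hypothetical and only mirror the snippet above; this is not an existing rust-analyzer or Roslyn API, and `leading_trivia` is flattened to a `String` for brevity.

```rust
// Hypothetical, simplified types: a sketch of the idea, not a real API.

/// A "fat" token: its leading trivia travels with it.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    leading_trivia: String,
    text: String,
}

/// A statement, reduced to its tokens for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Stmt {
    tokens: Vec<Token>,
}

/// A block has two fixed slots (the braces) and one homogeneous list.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    l_curly: Token,
    stmts: Vec<Stmt>,
    r_curly: Token,
}

impl Block {
    fn stmts(&self) -> impl Iterator<Item = &Stmt> + '_ {
        self.stmts.iter()
    }

    /// O(1): no trivia is interleaved with the statements.
    fn nth_stmt(&self, n: usize) -> Option<&Stmt> {
        self.stmts.get(n)
    }

    /// Returns the block with `stmt` added at the end.
    fn append_stmt(mut self, stmt: Stmt) -> Block {
        self.stmts.push(stmt);
        self
    }

    /// Reassembles the source text of the block.
    fn to_text(&self) -> String {
        let mut out = String::new();
        let tokens = std::iter::once(&self.l_curly)
            .chain(self.stmts.iter().flat_map(|s| s.tokens.iter()))
            .chain(std::iter::once(&self.r_curly));
        for token in tokens {
            out.push_str(&token.leading_trivia);
            out.push_str(&token.text);
        }
        out
    }
}

fn tok(leading_trivia: &str, text: &str) -> Token {
    Token { leading_trivia: leading_trivia.to_string(), text: text.to_string() }
}

fn make_stmt(tokens: Vec<Token>) -> Stmt {
    Stmt { tokens }
}

fn block(stmts: Vec<Stmt>) -> Block {
    Block { l_curly: tok("", "{"), stmts, r_curly: tok(" ", "}") }
}

/// Appends every statement of `b1` to `b2` using only the typed API.
fn append_all(b1: &Block, mut b2: Block) -> Block {
    for stmt in b1.stmts() {
        b2 = b2.append_stmt(stmt.clone());
    }
    b2
}
```

Appending `{ a; /* keep */ b; }` to `{ c; }` this way gives `{ c; a; /* keep */ b; }`: the comment moves with `b` without any explicit trivia handling. With floating trivia, the comment would be a separate child of the first block and would have to be copied over by hand.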

I think I lean towards trying the Roslyn trivia model -- it seems like it could be better long term. I wish we could experiment with this in a simple way though :-(

RFC: transition to Roslyn's model for trivia #6584
