Description
There seems to have been a misunderstanding in the past around proto2 vs proto3. My attempt here is to clear up the confusion, recommend proto3 in general, and explain why proto3 should be preferred.
Our main confusion is about field presence: if a field is omitted from the serialized wire format, can the user of the decoded message tell whether the field was unset or was set to its default value? The following document has a lot of good information and is worth the read: https://github.com/protocolbuffers/protobuf/blob/main/docs/field_presence.md
Origins of the confusion
Proto2 always serializes an explicitly set field, even if it is set to the default. This means the decoding side can know whether the field was set or not. This is called Explicit Presence. For example, prost, a Rust protobuf implementation, wraps these fields in `Option<T>`: https://github.com/tokio-rs/prost#field-modifiers.
The confusing thing is that the language guide for proto2 states:
A well-formed message may or may not contain an optional element. When a message is parsed, if it does not contain an optional element, accessing the corresponding field in the parsed object returns the default value for that field.
The subtlety here is that this doesn't say anything about "hasField" accessors, which an implementation may provide to check whether the field was set. This is essentially what prost is doing with `Option<T>` types.
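The difference between the getter described in the guide and a "hasField"-style accessor can be sketched in a few lines. This is only an illustration: the `Peer` class, its `port` field, and `has_port` are hypothetical names, not generated code from any protobuf library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Peer:
    """Hypothetical decoded message with one explicit-presence int field."""

    # None means "not present on the wire"; this mirrors prost's Option<T>.
    _port: Optional[int] = None

    @property
    def port(self) -> int:
        # Getter behaviour described by the proto2 guide:
        # fall back to the default when the field is absent.
        return 0 if self._port is None else self._port

    def has_port(self) -> bool:
        # Presence check: true only if the field was explicitly set.
        return self._port is not None


unset = Peer()         # field never set
zero = Peer(_port=0)   # field explicitly set to the default
```

Both `unset.port` and `zero.port` return 0, so the getter alone cannot tell the two apart; only `has_port()` can.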
Another confusing thing is that this language guide doesn't mention "presence" a single time, even though presence is what we're talking about here.
In proto3, by default, if a field is set to its default value it is not serialized. This means the decoding side doesn't know whether the field was omitted because it was unset or because it was set to the default value. This is called No Presence.
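To make the wire-level difference concrete, here is a minimal hand-rolled sketch of how a single varint field might be encoded under each discipline. It is not taken from any protobuf library; `encode_explicit` and `encode_implicit` are hypothetical helpers, and only non-negative integers are handled.

```python
from typing import Optional

VARINT = 0  # protobuf wire type for int32/int64/bool/enum


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_explicit(field_number: int, value: Optional[int]) -> bytes:
    """Explicit presence: emit the field whenever it was set, even to 0."""
    if value is None:
        return b""
    return encode_varint((field_number << 3) | VARINT) + encode_varint(value)


def encode_implicit(field_number: int, value: int) -> bytes:
    """No presence: skip the field when it holds the default value (0)."""
    if value == 0:
        return b""
    return encode_varint((field_number << 3) | VARINT) + encode_varint(value)
```

With these helpers, `encode_explicit(1, 0)` yields the two bytes `08 00`, while `encode_implicit(1, 0)` yields nothing at all, which is byte-for-byte identical to an unset field. That is why the decoding side cannot recover presence in the second case.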
Field Presence Proto2 vs Proto3
To clarify field presence in proto2 vs proto3:
Proto2
Field type | Explicit Presence |
---|---|
Singular numeric (integer or floating point) | ✔️ |
Singular enum | ✔️ |
Singular string or bytes | ✔️ |
Singular message | ✔️ |
Repeated | |
Oneofs | ✔️ |
Maps | |
Proto3
Field type | optional | Explicit Presence |
---|---|---|
Singular numeric (integer or floating point) | No | |
Singular enum | No | |
Singular string or bytes | No | |
Singular numeric (integer or floating point) | Yes | ✔️ |
Singular enum | Yes | ✔️ |
Singular string or bytes | Yes | ✔️ |
Singular message | Yes | ✔️ |
Singular message | No | ✔️ |
Repeated | N/A | |
Oneofs | N/A | ✔️ |
Maps | N/A | |
Advantages in Proto3 compared to Proto2
- No `required` modifier
  - This is generally considered an anti-pattern, since all future versions of the message will need to contain the field. Generally, users should prefer custom validation.
- Opt-in explicit presence
  - It's good to get the space advantages of no presence while still being able to opt in to explicit presence. Most of the time, passing an empty byte array is semantically the same as passing no byte array, so it's nice to avoid paying the byte cost for it.
  - But if we do want explicit presence, we can opt in to it. This is useful when we do semantically care about knowing whether nothing was set.
- Simple feature set: "The reason for removing these features is to make API designs simpler, more stable, and more performant." From https://cloud.google.com/apis/design/proto3
- Better ecosystem support. As libraries develop, it's likely they will support the latest protobuf spec rather than continue supporting proto2. This is already the case with protons, the compiler that JS-IP uses (see this bug).
- User-defined default values for fields are no longer available. These were somewhat tricky to get right.
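The custom validation suggested in place of the required modifier could look something like the sketch below. The `Record` message and its fields are hypothetical, with `None` standing in for an absent explicit-presence field; a real implementation would validate whatever type the chosen protobuf library generates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical decoded message."""

    # Explicit presence: None models "field absent from the wire".
    key: Optional[bytes] = None
    # No presence: empty bytes and "not sent" are treated the same.
    value: bytes = b""


def validate(record: Record) -> None:
    """Application-level check standing in for a schema-level required field."""
    if record.key is None:
        raise ValueError("Record.key must be set")
```

Because the rule lives in application code rather than in the schema, a later version of the message can relax or drop it without breaking older decoders.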
Next steps
- Come to mutual understanding around proto2 vs proto3.
- Come to consensus around recommending proto3.
- Make the change to `README.md`