Support for Self-Contained Protobuf Descriptors with Google Well-Known Types
Problem Description
The SDK currently accepts only a DescriptorProto in TableProperties.descriptor_proto, and when it encodes that value to bytes for the gRPC CreateIngestStreamRequest it includes only that single descriptor. The server then rejects the request with "Proto not self-contained" validation errors whenever the descriptor references Google well-known types (e.g., google.protobuf.StringValue, google.protobuf.Int64Value, google.protobuf.DoubleValue) or custom nested types from separate files.
The server validation checks that all referenced types are present in the FileDescriptorProto.message_type field. Since the SDK only encodes a single DescriptorProto, well-known types are not included, causing validation to fail.
Error Message
Proto validation error:
Proto not self-contained for field "id" with type name ".google.protobuf.StringValue". Error Code: 4033, Error State: 4.
Proto not self-contained for field "time" with type name ".google.protobuf.Int64Value". Error Code: 4033, Error State: 4.
Proto not self-contained for field "total" with type name ".google.protobuf.DoubleValue". Error Code: 4033, Error State: 4.
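The errors above are consistent with a check along the following lines. This is a rough client-side approximation, not the server's actual implementation: `Field` and `Message` are hypothetical stand-ins for the corresponding descriptor types, and type names are assumed to be fully qualified with a leading dot.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a field descriptor: the field name and, for
/// message-typed fields, the fully qualified type name.
pub struct Field {
    pub name: String,
    pub type_name: Option<String>,
}

/// Hypothetical stand-in for a message descriptor.
pub struct Message {
    pub full_name: String, // e.g. ".example.Bet"
    pub fields: Vec<Field>,
}

/// Returns (field name, type name) for every message-typed field whose type
/// is not defined among `messages`.
pub fn missing_types(messages: &[Message]) -> Vec<(String, String)> {
    let defined: BTreeSet<&str> = messages.iter().map(|m| m.full_name.as_str()).collect();
    let mut missing = Vec::new();
    for message in messages {
        for field in &message.fields {
            if let Some(type_name) = &field.type_name {
                if !defined.contains(type_name.as_str()) {
                    missing.push((field.name.clone(), type_name.clone()));
                }
            }
        }
    }
    missing
}
```

Running a check like this before creating the stream would surface the same fields the server reports, without a round trip.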
Root Cause
In sdk/src/lib.rs at lines 793-803, the SDK encodes only the DescriptorProto:
let descriptor_proto = if record_type == RecordType::Proto {
    Some(
        table_properties
            .descriptor_proto
            .as_ref()
            .unwrap()
            .encode_to_vec(),
    )
} else {
    None
};
The gRPC CreateIngestStreamRequest.descriptor_proto field accepts Option<Vec<u8>> (bytes), which can be a FileDescriptorSet encoded as bytes. However, the SDK only encodes a single DescriptorProto, not a complete FileDescriptorProto or FileDescriptorSet that includes well-known types.
Proposed Solution
Option 1 (Recommended): Add support for FileDescriptorProto or FileDescriptorSet in TableProperties:
- Add a new field file_descriptor_proto: Option<FileDescriptorProto> or file_descriptor_set: Option<FileDescriptorSet>
- When encoding, prefer FileDescriptorSet if provided, otherwise fall back to encoding DescriptorProto as before
- Encode the FileDescriptorSet to bytes using FileDescriptorSet::encode_to_vec()
Option 2: Modify the encoding logic to automatically create a FileDescriptorProto from the DescriptorProto and include well-known types:
- When encoding, wrap the DescriptorProto in a FileDescriptorProto
- Automatically detect referenced well-known types and include them in message_type
- Encode as a FileDescriptorSet containing the FileDescriptorProto
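A minimal sketch of the Option 1 fallback, assuming a new file_descriptor_set field is added to TableProperties. The `EncodeToVec` trait is a hypothetical stand-in for `prost::Message::encode_to_vec` so the snippet compiles without the SDK, and `descriptor_bytes` is an illustrative name, not an existing SDK function.

```rust
/// Hypothetical stand-in for `prost::Message::encode_to_vec`.
pub trait EncodeToVec {
    fn encode_to_vec(&self) -> Vec<u8>;
}

/// Simplified TableProperties, generic over the two descriptor types so the
/// sketch does not depend on prost_types.
pub struct TableProperties<D, S> {
    pub table_name: String,
    pub descriptor_proto: Option<D>,
    /// Proposed new field (assumption, not in the current SDK).
    pub file_descriptor_set: Option<S>,
}

/// Picks the bytes to send: the full set if present, otherwise the single
/// descriptor as before, otherwise nothing.
pub fn descriptor_bytes<D: EncodeToVec, S: EncodeToVec>(
    props: &TableProperties<D, S>,
) -> Option<Vec<u8>> {
    match (&props.file_descriptor_set, &props.descriptor_proto) {
        (Some(set), _) => Some(set.encode_to_vec()),
        (None, Some(descriptor)) => Some(descriptor.encode_to_vec()),
        (None, None) => None,
    }
}
```

Unlike the current code, this sketch returns None instead of panicking when neither field is set; whether that case should instead be an error for RecordType::Proto is a design choice left open here.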
Example Use Case
When using Protobuf messages that reference Google well-known types for nullable primitives:
syntax = "proto2";
package example;
import "google/protobuf/wrappers.proto";
message Bet {
  optional google.protobuf.StringValue id = 1;
  optional google.protobuf.Int64Value time = 2;
  optional google.protobuf.DoubleValue total = 3;
}
Note: The SDK uses proto2 syntax, so fields should be marked as optional when using well-known types.
The descriptor needs to include definitions for StringValue, Int64Value, and DoubleValue in the FileDescriptorProto.message_type field for server validation to pass.
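To work out which definitions have to be added, the field type names can be filtered by the well-known-type package prefix. The sketch below assumes fully qualified type names with a leading dot, as they appear in the error messages; `referenced_well_known_types` is an illustrative helper, not part of the SDK.

```rust
use std::collections::BTreeSet;

/// Returns the distinct type names under `.google.protobuf.` that appear in
/// `field_type_names`, in sorted order.
pub fn referenced_well_known_types<'a>(field_type_names: &[&'a str]) -> BTreeSet<&'a str> {
    field_type_names
        .iter()
        .copied()
        .filter(|type_name| type_name.starts_with(".google.protobuf."))
        .collect()
}
```

For the Bet message this would yield the three wrapper types, each listed once even if several fields share a type.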
Minimal Example
use databricks_zerobus_ingest_sdk::*;
use prost_types::{FileDescriptorProto, FileDescriptorSet};
// Current approach (fails with "Proto not self-contained"):
let table_props = TableProperties {
    table_name: "catalog.schema.table".to_string(),
    descriptor_proto: Some(main_descriptor.clone()), // Only main descriptor
};
// Desired approach (with fix):
let file_desc = FileDescriptorProto {
    name: Some("example.proto".to_string()),
    package: Some("example".to_string()),
    dependency: vec!["google/protobuf/wrappers.proto".to_string()],
    message_type: vec![
        main_descriptor.clone(),  // Main message (cloned; reused below)
        string_value_descriptor,  // Well-known type
        int64_value_descriptor,   // Well-known type
        double_value_descriptor,  // Well-known type
    ],
    ..Default::default()
};
let file_descriptor_set = FileDescriptorSet {
    file: vec![file_desc],
};
let table_props = TableProperties {
    table_name: "catalog.schema.table".to_string(),
    descriptor_proto: Some(main_descriptor), // For backward compatibility
    file_descriptor_set: Some(file_descriptor_set), // New field
};
Impact
- Breaking Change: None (backward compatible if Option 1 is implemented with an optional field)
- Performance: Minimal (encoding FileDescriptorSet is similar to encoding DescriptorProto)
- Compatibility: Works with existing code (fallback to DescriptorProto if FileDescriptorSet not provided)
Additional Context
The gRPC proto definition shows that descriptor_proto accepts bytes:
message CreateIngestStreamRequest {
  optional string table_name = 1;
  optional bytes descriptor_proto = 3; // Can be FileDescriptorSet encoded as bytes
  optional RecordType record_type = 4;
}
The server expects a FileDescriptorSet encoded as bytes, which can contain multiple FileDescriptorProto messages, each with multiple message types. This allows including well-known types and custom nested types.
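To make the nesting concrete, the sketch below wraps already-encoded DescriptorProto bytes into a FileDescriptorProto and then into a FileDescriptorSet by hand, using the field numbers from descriptor.proto (FileDescriptorProto: name = 1, package = 2, message_type = 4; FileDescriptorSet: file = 1). It is an illustration of the wire layout only. In the SDK the prost_types structs and encode_to_vec() would do this, and the helper names here are hypothetical.

```rust
/// Appends `value` as a protobuf base-128 varint.
pub fn put_varint(out: &mut Vec<u8>, mut value: u64) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Appends a length-delimited field (wire type 2): tag, length, payload.
pub fn put_len_delimited(out: &mut Vec<u8>, field_number: u32, payload: &[u8]) {
    put_varint(out, (u64::from(field_number) << 3) | 2);
    put_varint(out, payload.len() as u64);
    out.extend_from_slice(payload);
}

/// Builds FileDescriptorSet bytes containing one file with the given
/// already-encoded DescriptorProto messages.
pub fn wrap_in_file_descriptor_set(
    file_name: &str,
    package: &str,
    encoded_messages: &[Vec<u8>],
) -> Vec<u8> {
    let mut file = Vec::new();
    put_len_delimited(&mut file, 1, file_name.as_bytes()); // FileDescriptorProto.name
    put_len_delimited(&mut file, 2, package.as_bytes()); // FileDescriptorProto.package
    for message in encoded_messages {
        put_len_delimited(&mut file, 4, message); // FileDescriptorProto.message_type
    }
    let mut set = Vec::new();
    put_len_delimited(&mut set, 1, &file); // FileDescriptorSet.file
    set
}
```

Each additional message type, well-known or custom, is just another field-4 entry inside the same file, and each additional file is another field-1 entry in the set.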