[ISSUE] #24

@pixie79

Support for Self-Contained Protobuf Descriptors with Google Well-Known Types

Problem Description

The SDK currently accepts only a DescriptorProto in TableProperties.descriptor_proto. When it encodes that value to bytes for the gRPC CreateIngestStreamRequest, it includes only that single descriptor. This causes "Proto not self-contained" validation errors when the descriptor references Google well-known types (e.g., google.protobuf.StringValue, google.protobuf.Int64Value, google.protobuf.DoubleValue) or custom nested types from separate files.

The server validation checks that all referenced types are present in the FileDescriptorProto.message_type field. Since the SDK only encodes a single DescriptorProto, well-known types are not included, causing validation to fail.
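
The check described above can be approximated on the client before a stream is created. The following is a minimal sketch, not the server's actual implementation: Message and Field are simplified, hypothetical stand-ins for the corresponding prost_types descriptor types, and unresolved_references is an illustrative helper name.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a field descriptor (hypothetical, not the prost type).
struct Field {
    name: String,
    /// Fully qualified type name such as ".google.protobuf.StringValue";
    /// `None` for scalar fields.
    type_name: Option<String>,
}

/// Simplified stand-in for a message descriptor.
struct Message {
    /// Fully qualified name with a leading dot, e.g. ".example.Bet".
    full_name: String,
    fields: Vec<Field>,
}

/// Returns (field name, type name) pairs whose type is not defined
/// by any message in `messages`.
fn unresolved_references(messages: &[Message]) -> Vec<(String, String)> {
    let defined: HashSet<&str> = messages.iter().map(|m| m.full_name.as_str()).collect();
    let mut missing = Vec::new();
    for message in messages {
        for field in &message.fields {
            if let Some(type_name) = &field.type_name {
                if !defined.contains(type_name.as_str()) {
                    missing.push((field.name.clone(), type_name.clone()));
                }
            }
        }
    }
    missing
}
```

An empty result means every message-typed field points at a definition that is bundled with the descriptor; a non-empty result lists the fields that would likely trigger the validation error.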

Error Message

Proto validation error:
  Proto not self-contained for field "id" with type name ".google.protobuf.StringValue". Error Code: 4033, Error State: 4.
  Proto not self-contained for field "time" with type name ".google.protobuf.Int64Value". Error Code: 4033, Error State: 4.
  Proto not self-contained for field "total" with type name ".google.protobuf.DoubleValue". Error Code: 4033, Error State: 4.

Root Cause

In sdk/src/lib.rs at lines 793-803, the SDK encodes only the DescriptorProto:

let descriptor_proto = if record_type == RecordType::Proto {
    Some(
        table_properties
            .descriptor_proto
            .as_ref()
            .unwrap()
            .encode_to_vec(),
    )
} else {
    None
};

The gRPC CreateIngestStreamRequest.descriptor_proto field accepts Option<Vec<u8>> (bytes), which can be a FileDescriptorSet encoded as bytes. However, the SDK only encodes a single DescriptorProto, not a complete FileDescriptorProto or FileDescriptorSet that includes well-known types.

Proposed Solution

  1. Option 1 (Recommended): Add support for FileDescriptorProto or FileDescriptorSet in TableProperties:

    • Add a new field file_descriptor_proto: Option<FileDescriptorProto> or file_descriptor_set: Option<FileDescriptorSet>
    • When encoding, prefer FileDescriptorSet if provided, otherwise fall back to encoding DescriptorProto as before
    • Encode the FileDescriptorSet to bytes using FileDescriptorSet::encode_to_vec()
  2. Option 2: Modify the encoding logic to automatically create a FileDescriptorProto from the DescriptorProto and include well-known types:

    • When encoding, wrap the DescriptorProto in a FileDescriptorProto
    • Automatically detect referenced well-known types and include them in message_type
    • Encode as a FileDescriptorSet containing the FileDescriptorProto
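
The fallback order in Option 1 can be sketched as follows. This is a hedged illustration rather than the SDK's code: the Encode trait stands in for prost's Message::encode_to_vec so the sketch compiles without external crates, and descriptor_bytes, DescriptorError, the RecordType::Other variant, and the file_descriptor_set parameter name are assumptions made for the example.

```rust
/// Stand-in for `prost::Message::encode_to_vec`.
trait Encode {
    fn encode_to_vec(&self) -> Vec<u8>;
}

/// Demo implementation: bytes that are already encoded pass through unchanged.
impl Encode for Vec<u8> {
    fn encode_to_vec(&self) -> Vec<u8> {
        self.clone()
    }
}

#[derive(Debug, PartialEq)]
enum RecordType {
    Proto,
    /// Hypothetical placeholder for any non-Proto record type.
    Other,
}

#[derive(Debug, PartialEq)]
enum DescriptorError {
    /// Proto records were requested but no descriptor was supplied.
    MissingDescriptor,
}

/// Chooses the bytes to send in the request's descriptor field.
fn descriptor_bytes<S: Encode, D: Encode>(
    record_type: &RecordType,
    file_descriptor_set: Option<&S>,
    descriptor_proto: Option<&D>,
) -> Result<Option<Vec<u8>>, DescriptorError> {
    if *record_type != RecordType::Proto {
        return Ok(None);
    }
    // Prefer the self-contained set; fall back to the single descriptor.
    match (file_descriptor_set, descriptor_proto) {
        (Some(set), _) => Ok(Some(set.encode_to_vec())),
        (None, Some(descriptor)) => Ok(Some(descriptor.encode_to_vec())),
        (None, None) => Err(DescriptorError::MissingDescriptor),
    }
}
```

Returning an error when neither descriptor is present is a design choice of this sketch; it avoids the panic that an unwrap() on a missing descriptor would cause.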

Example Use Case

When using Protobuf messages that reference Google well-known types for nullable primitives:

syntax = "proto2";
package example;

import "google/protobuf/wrappers.proto";

message Bet {
  optional google.protobuf.StringValue id = 1;
  optional google.protobuf.Int64Value time = 2;
  optional google.protobuf.DoubleValue total = 3;
}

Note: The SDK uses proto2 syntax, so fields that use well-known types should be declared optional.

The descriptor needs to include definitions for StringValue, Int64Value, and DoubleValue in the FileDescriptorProto.message_type field for server validation to pass.
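
Before those definitions can be bundled, the referenced well-known types have to be traced back to the files that define them. A minimal sketch, assuming type names are fully qualified with a leading dot as in the error messages; well_known_file and required_dependencies are hypothetical helper names, and the table covers only a handful of the well-known types.

```rust
use std::collections::BTreeSet;

/// Maps a fully qualified well-known type name to the .proto file that
/// defines it. Only a subset of the well-known types is listed here.
fn well_known_file(type_name: &str) -> Option<&'static str> {
    match type_name {
        ".google.protobuf.StringValue"
        | ".google.protobuf.Int64Value"
        | ".google.protobuf.DoubleValue"
        | ".google.protobuf.BoolValue" => Some("google/protobuf/wrappers.proto"),
        ".google.protobuf.Timestamp" => Some("google/protobuf/timestamp.proto"),
        ".google.protobuf.Duration" => Some("google/protobuf/duration.proto"),
        _ => None,
    }
}

/// Collects the distinct dependency files, in sorted order, for the
/// given field type names. Names that are not well-known are ignored.
fn required_dependencies<'a, I>(type_names: I) -> Vec<&'static str>
where
    I: IntoIterator<Item = &'a str>,
{
    let files: BTreeSet<&'static str> =
        type_names.into_iter().filter_map(well_known_file).collect();
    files.into_iter().collect()
}
```

For the Bet message, all three fields resolve to the same file, so the dependency list contains a single entry.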

Minimal Example

use databricks_zerobus_ingest_sdk::*;
use prost_types::{FileDescriptorProto, FileDescriptorSet};

// Current approach (fails with "Proto not self-contained"):
let table_props = TableProperties {
    table_name: "catalog.schema.table".to_string(),
    descriptor_proto: Some(main_descriptor.clone()), // Only main descriptor
};

// Desired approach (with fix):
let file_desc = FileDescriptorProto {
    name: Some("example.proto".to_string()),
    package: Some("example".to_string()),
    dependency: vec!["google/protobuf/wrappers.proto".to_string()],
    message_type: vec![
        main_descriptor.clone(),    // Main message (cloned; reused below)
        string_value_descriptor,    // Well-known type
        int64_value_descriptor,     // Well-known type
        double_value_descriptor,    // Well-known type
    ],
    ..Default::default()
};

let file_descriptor_set = FileDescriptorSet {
    file: vec![file_desc],
};

let table_props = TableProperties {
    table_name: "catalog.schema.table".to_string(),
    descriptor_proto: Some(main_descriptor), // For backward compatibility
    file_descriptor_set: Some(file_descriptor_set), // New field
};

Impact

  • Breaking Change: None at the wire level. With Option 1 the new field is optional, but Rust callers that build TableProperties with a struct literal would need to set it (for example to None)
  • Performance: Minimal (encoding FileDescriptorSet is similar to encoding DescriptorProto)
  • Compatibility: Existing code keeps working, because encoding falls back to DescriptorProto when no FileDescriptorSet is provided

Additional Context

The gRPC proto definition shows that descriptor_proto accepts bytes:

message CreateIngestStreamRequest {
  optional string table_name = 1;
  optional bytes descriptor_proto = 3;  // Can be FileDescriptorSet encoded as bytes
  optional RecordType record_type = 4;
}

The server expects a FileDescriptorSet encoded as bytes, which can contain multiple FileDescriptorProto messages, each with multiple message types. This allows including well-known types and custom nested types.
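
Because a set can hold several files, each with its own package and nested messages, any lookup of the defined types has to walk all of them. The sketch below illustrates that walk under simplifying assumptions: FileDesc and MessageDesc are hypothetical, reduced stand-ins for FileDescriptorProto and DescriptorProto, and defined_type_names is an illustrative helper, not part of the SDK.

```rust
use std::collections::BTreeSet;

/// Reduced stand-in for a message descriptor with nested messages.
struct MessageDesc {
    name: String,
    nested: Vec<MessageDesc>,
}

/// Reduced stand-in for a file descriptor.
struct FileDesc {
    /// Package name such as "google.protobuf"; empty when the file has none.
    package: String,
    messages: Vec<MessageDesc>,
}

/// Records the fully qualified name of `message` and of everything nested in it.
fn collect(prefix: &str, message: &MessageDesc, out: &mut BTreeSet<String>) {
    let full_name = format!("{}.{}", prefix, message.name);
    for nested in &message.nested {
        collect(&full_name, nested, out);
    }
    out.insert(full_name);
}

/// Returns every message name defined across `files`, each with a leading dot.
fn defined_type_names(files: &[FileDesc]) -> BTreeSet<String> {
    let mut names = BTreeSet::new();
    for file in files {
        // Fully qualified names start with a dot, followed by the package.
        let prefix = if file.package.is_empty() {
            String::new()
        } else {
            format!(".{}", file.package)
        };
        for message in &file.messages {
            collect(&prefix, message, &mut names);
        }
    }
    names
}
```

A field's type name can then be checked against the returned set to decide whether the descriptor is self-contained.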
