feat: [velox+prestissimo][iceberg] Iceberg V3 full C++ support: deletion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (#27462) #27462

Open
apurva-meta wants to merge 1 commit into prestodb:master from apurva-meta:export-D98704718

Conversation

@apurva-meta
Contributor

@apurva-meta apurva-meta commented Mar 30, 2026

Summary:
X-link: facebookincubator/velox#16959

Combined velox/prestissimo diffs for Iceberg V3 C++ support:

  • Improve IcebergSplitReader error handling and fix test file handle leaks
  • Add Iceberg V3 deletion vector support (DeletionVectorReader)
  • Add Iceberg equality delete file reader (EqualityDeleteFileReader)
  • Add sequence number conflict resolution for equality deletes
  • Add sequence number conflict resolution for positional deletes and deletion vectors
  • Add Iceberg V3 deletion vector writer (DeletionVectorWriter)
  • Add DWRF file format support for Iceberg data sink
  • Add Manifold filesystem support with CAT token authentication
  • Reformat FileContent enum to multi-line for extensibility
  • Wire PUFFIN file format through C++ protocol and connector layer
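
The two "sequence number conflict resolution" items above can be illustrated with a minimal sketch. It assumes the PR follows the applicability rules in the Iceberg table specification: an equality delete applies only to data files with a strictly smaller data sequence number, while positional deletes and deletion vectors also apply at an equal sequence number. The names `DeleteKind` and `deleteApplies` are hypothetical and are not taken from the Velox code.

```cpp
#include <cassert>
#include <cstdint>

namespace seq_sketch {

// Hypothetical classification of delete files; not the Velox enum.
enum class DeleteKind { kPositional, kEquality, kDeletionVector };

// Returns true if a delete file with sequence number `deleteSeq` should be
// applied to a data file with data sequence number `dataSeq`.
// Assumption: rules as described in the Iceberg table spec.
inline bool deleteApplies(
    DeleteKind kind,
    std::int64_t dataSeq,
    std::int64_t deleteSeq) {
  switch (kind) {
    case DeleteKind::kEquality:
      // Equality deletes never affect rows written in the same commit.
      return dataSeq < deleteSeq;
    case DeleteKind::kPositional:
    case DeleteKind::kDeletionVector:
      // Position-based deletes may target files from the same commit.
      return dataSeq <= deleteSeq;
  }
  return false;
}

} // namespace seq_sketch
```

For example, an equality delete at sequence number 5 would be skipped for a data file that also has sequence number 5, whereas a deletion vector at sequence number 5 would still be applied to it.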

Thrift ODR Violation Blocking Native Parquet Writes in Velox
Problem
Velox's Parquet writer crashes with SIGSEGV when linked into any binary that also uses FBThrift (e.g., Prestissimo presto_server). The crash is caused by a violation of the C++ One Definition Rule (ODR).

Root Cause
Velox's Parquet writer depends on OSS Apache Thrift (third-party2/apache-thrift/) for serializing Parquet page headers and file metadata. FBThrift (fbcode/thrift/) is Meta's fork used by RPC services. Both libraries declare classes in the same namespace (apache::thrift::protocol::TProtocol, apache::thrift::transport::TTransport, etc.) but with incompatible class layouts:

| | OSS Apache Thrift (Parquet) | FBThrift (Prestissimo RPC) |
| --- | --- | --- |
| Namespace | apache::thrift | apache::thrift |
| TTransport size | ~40 bytes (has TConfiguration shared_ptr, message size fields) | ~8 bytes (vtable pointer only) |
| fd_ offset in TFDTransport | ~40+ | ~8 |

When both are linked into one binary, the linker picks one definition. Code compiled against the other layout reads wrong memory offsets → SIGSEGV.
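
The layout mismatch can be made concrete with a small sketch. The two structs below are simplified, hypothetical stand-ins placed in separate namespaces (`oss_like`, `fb_like`) so that they can coexist in one translation unit; they are not the real Thrift class definitions. The sketch only shows that the same member name ends up at different offsets when the base class carries extra fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a base class that carries extra state.
namespace oss_like {
struct TConfiguration {};
struct TTransport {
  virtual ~TTransport() = default;
  std::shared_ptr<TConfiguration> configuration_;
  std::int64_t remainingMessageSize_ = 0;
  std::int64_t knownMessageSize_ = 0;
};
struct TFDTransport : TTransport {
  int fd_ = -1;
};
} // namespace oss_like

// Hypothetical stand-in for a base class with only a vtable pointer.
namespace fb_like {
struct TTransport {
  virtual ~TTransport() = default;
};
struct TFDTransport : TTransport {
  int fd_ = -1;
};
} // namespace fb_like

// Byte offset of fd_ inside T, measured on a live object.
template <typename T>
std::ptrdiff_t fdOffset() {
  T t;
  return reinterpret_cast<const char*>(&t.fd_) -
      reinterpret_cast<const char*>(&t);
}

// The base classes differ in size, so derived members shift.
static_assert(sizeof(oss_like::TTransport) > sizeof(fb_like::TTransport));
```

If both definitions shared one fully qualified name, code compiled against one layout would read `fd_` (or the `shared_ptr` control block) at an offset that is only valid for the other layout, which is consistent with the crash signature below.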

Crash Signature
Signal 11 (SIGSEGV) (0x0)
std::_Sp_counted_base<>::_M_release_slow_last_use() ← null shared_ptr control block
apache::thrift::protocol::TProtocol::TProtocol() ← wrong TTransport layout
ThriftSerializer::ThriftSerializer() ← Parquet page header serialization
SerializedPageWriter::SerializedPageWriter()
Writer::close() → flush() → writeTable()
IcebergDataSink::closeInternal() ← triggered by any native Parquet write

Impact
  • All native Parquet writes (INSERT, CTAS) crash in any Velox binary that links FBThrift.
  • DWRF/ORC writes are unaffected (they don't use Thrift serialization).
  • Parquet reads are unaffected (reads use a different Thrift code path that happens not to trigger the ODR violation).
  • Affects Prestissimo, and potentially any Velox embedder (Gluten/Spark, etc.) that links both Parquet and another Thrift variant.

Prior Art
  • SEV 635079: the same ODR violation caused SIGSEGV crashes in Spark F3 pipelines (March 2026, SEV-2).
  • Apache Arrow already solved this by vendoring OSS Thrift in the private_parquet::apache::thrift namespace (D47918122, 2023).
  • T262970501: tracking task for addressing the Parquet Thrift dependency.
  • GitHub issue #13175: upstream tracking.
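
The namespace-vendoring approach mentioned under Prior Art can be sketched as follows. This is an illustration only: the `TTransport` structs and the `flavor()` method are hypothetical, and the sketch merely shows that wrapping one copy in an outer namespace such as `private_parquet` gives the two classes distinct fully qualified names, so the linker no longer has to pick one definition.

```cpp
#include <cassert>
#include <string>

// Stand-in for the definition that RPC code would see.
namespace apache::thrift::transport {
struct TTransport {
  virtual ~TTransport() = default;
  virtual std::string flavor() const { return "fbthrift-like"; }
};
} // namespace apache::thrift::transport

// Stand-in for a vendored copy, isolated under an outer namespace.
namespace private_parquet::apache::thrift::transport {
struct TTransport {
  virtual ~TTransport() = default;
  virtual std::string flavor() const { return "oss-like"; }
  long knownMessageSize_ = 0; // extra state changes the layout
};
} // namespace private_parquet::apache::thrift::transport

// Each caller names the definition it was compiled against.
inline std::string rpcFlavor() {
  return ::apache::thrift::transport::TTransport{}.flavor();
}

inline std::string parquetFlavor() {
  return ::private_parquet::apache::thrift::transport::TTransport{}.flavor();
}
```

Because the mangled symbol names differ, both layouts can be linked into the same binary without either caller observing the other's class layout.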

Solution:
X-link: facebookincubator/velox#16019, which is expected to help fix this issue.

Differential Revision: D98704718

@apurva-meta apurva-meta requested review from a team as code owners March 30, 2026 06:53
@sourcery-ai
Contributor

sourcery-ai bot commented Mar 30, 2026

Reviewer's Guide

Adds Iceberg V3 protocol support needed by Prestissimo/Velox for deletion vectors and PUFFIN-format files, including new protocol enum values, PUFFIN-to-DWRF mapping, and split reclassification logic so deletion vector files are correctly routed to the DeletionVectorReader instead of positional delete handling.

Sequence diagram for Iceberg V3 deletion vector routing

sequenceDiagram
  actor Coordinator
  participant PrestoWorker
  participant IcebergPrestoToVeloxConnector
  participant HiveIcebergSplit
  participant IcebergSplitReader
  participant DeletionVectorReader
  participant PositionalDeleteFileReader

  Coordinator->>PrestoWorker: Send IcebergSplit with deletes
  PrestoWorker->>IcebergPrestoToVeloxConnector: toVeloxSplit(catalogId, connectorSplit, splitContext)
  IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: dynamic_cast to IcebergSplit*

  loop For each deleteFile in icebergSplit.deletes
    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: toVeloxFileContent(deleteFile.content)
    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: toVeloxFileFormat(deleteFile.format)

    alt PUFFIN positional deletes (deletion vector)
      IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: veloxContent = kDeletionVector
    else Other delete file
      IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: veloxContent unchanged
    end

    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: Construct IcebergDeleteFile with veloxContent
  end

  IcebergPrestoToVeloxConnector->>HiveIcebergSplit: Create HiveIcebergSplit with deletes
  IcebergPrestoToVeloxConnector-->>PrestoWorker: Return HiveIcebergSplit

  PrestoWorker->>IcebergSplitReader: Read HiveIcebergSplit

  alt Delete file content is kDeletionVector
    IcebergSplitReader->>DeletionVectorReader: Read PUFFIN deletion vector
  else Delete file content is kPositionalDeletes
    IcebergSplitReader->>PositionalDeleteFileReader: Read positional delete file
  end
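
The reader-side routing at the end of the sequence diagram could look roughly like the sketch below. The enum mirrors the content values named in the reviewer's guide, but the `readerFor` function and its string return value are hypothetical and exist only to show the dispatch; the real `IcebergSplitReader` would construct reader objects instead.

```cpp
#include <cassert>
#include <string_view>

namespace dispatch_sketch {

// Mirrors the Velox-side content values named in the reviewer's guide.
enum class FileContent {
  kData,
  kPositionalDeletes,
  kEqualityDeletes,
  kDeletionVector
};

// Hypothetical helper: names the reader that would handle a delete file.
constexpr std::string_view readerFor(FileContent content) {
  switch (content) {
    case FileContent::kDeletionVector:
      return "DeletionVectorReader";
    case FileContent::kPositionalDeletes:
      return "PositionalDeleteFileReader";
    case FileContent::kEqualityDeletes:
      return "EqualityDeleteFileReader";
    case FileContent::kData:
      return ""; // data files are not delete files
  }
  return "";
}

} // namespace dispatch_sketch
```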

Class diagram for updated Iceberg file format and content enums

classDiagram
  class FileContent {
    <<enum>>
    DATA
    POSITION_DELETES
    EQUALITY_DELETES
  }

  class FileFormat {
    <<enum>>
    ORC
    PARQUET
    AVRO
    METADATA
    PUFFIN
  }

  class IcebergDeleteFile {
    +FileContent content
    +string path
    +FileFormat format
    +int64_t recordCount
    +unordered_map~int32_t, string~ lowerBounds
    +unordered_map~int32_t, string~ upperBounds
  }

  class VeloxFileFormat {
    <<enum>>
    ORC
    PARQUET
    DWRF
  }

  class VeloxFileContent {
    <<enum>>
    kData
    kPositionalDeletes
    kEqualityDeletes
    kDeletionVector
  }

  class IcebergPrestoToVeloxConnector {
    +toVeloxFileFormat(FileFormat format) VeloxFileFormat
    +toVeloxSplit(ConnectorId catalogId, ConnectorSplit connectorSplit, SplitContext splitContext) HiveIcebergSplitPtr
  }

  FileContent "1" --> "*" IcebergDeleteFile : content
  FileFormat "1" --> "*" IcebergDeleteFile : format
  VeloxFileFormat <-- IcebergPrestoToVeloxConnector : maps from FileFormat
  VeloxFileContent <-- IcebergPrestoToVeloxConnector : maps from FileContent

File-Level Changes

Change: Extend Iceberg protocol enums to recognize PUFFIN file format and prepare FileContent for future extensions.
Details:
  • Refactor FileContent enum definition into a multi-line form without changing existing values.
  • Add PUFFIN as a new value to the FileFormat enum used in the Presto Iceberg protocol.
  • Update the FileFormat JSON serialization table to serialize/deserialize the PUFFIN value.
Files:
  • presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.h
  • presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.cpp

Change: Adjust Iceberg Presto-to-Velox connector to handle PUFFIN deletion vector files and route them to the correct Velox reader.
Details:
  • Extend toVeloxFileFormat mapping to handle PUFFIN by mapping it to DWRF as a placeholder since the DeletionVectorReader consumes raw binary and ignores the format.
  • Add logic in toVeloxSplit to reclassify Iceberg V3 deletion vectors, originally tagged as positional deletes with PUFFIN format, into kDeletionVector so IcebergSplitReader uses DeletionVectorReader instead of positional delete handling.
  • Tighten types and const-qualification in split conversion (const pointer for IcebergSplit, const maps and delete file object) and make a minor style change to the infoColumns initialization.
Files:
  • presto-native-execution/presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp
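
The reclassification step described for toVeloxSplit can be sketched as below. The enums are simplified copies of the values listed in the class diagram, and `classifyDeleteFile` is a hypothetical helper name; the actual connector code may structure this differently.

```cpp
#include <cassert>
#include <stdexcept>

namespace reclassify_sketch {

// Simplified protocol-side enums (values as listed in the class diagram).
enum class FileContent { DATA, POSITION_DELETES, EQUALITY_DELETES };
enum class FileFormat { ORC, PARQUET, AVRO, METADATA, PUFFIN };

// Simplified Velox-side content enum.
enum class VeloxFileContent {
  kData,
  kPositionalDeletes,
  kEqualityDeletes,
  kDeletionVector
};

inline VeloxFileContent toVeloxFileContent(FileContent content) {
  switch (content) {
    case FileContent::DATA:
      return VeloxFileContent::kData;
    case FileContent::POSITION_DELETES:
      return VeloxFileContent::kPositionalDeletes;
    case FileContent::EQUALITY_DELETES:
      return VeloxFileContent::kEqualityDeletes;
  }
  throw std::invalid_argument("unknown FileContent");
}

// Hypothetical helper: POSITION_DELETES + PUFFIN is treated as a
// deletion vector; every other combination keeps its mapped content.
inline VeloxFileContent classifyDeleteFile(
    FileContent content,
    FileFormat format) {
  const auto veloxContent = toVeloxFileContent(content);
  if (veloxContent == VeloxFileContent::kPositionalDeletes &&
      format == FileFormat::PUFFIN) {
    return VeloxFileContent::kDeletionVector;
  }
  return veloxContent;
}

} // namespace reclassify_sketch
```

Keeping the reclassification in one helper means the protocol layer can continue to send POSITION_DELETES for both cases, and only the connector needs to know about the PUFFIN convention.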

Possibly linked issues

  • #native(Iceberg): PR implements Iceberg V3 deletion vectors and equality deletes handling requested for Presto/Prestissimo Iceberg connectors.


@sourcery-ai sourcery-ai bot left a comment

Hey - I've left some high level feedback:

  • The toVeloxFileFormat mapping of PUFFIN to DWRF relies on comments to guarantee it's only used for deletion vectors; consider adding a runtime check (e.g., asserting the associated FileContent is positional delete/DV) at the call site to fail fast if PUFFIN is ever used for other content types.
  • Given that Iceberg V3 deletion vectors are encoded as POSITION_DELETES + PUFFIN at the protocol layer and reclassified later, it may be worth adding a brief comment next to the FileContent enum definition clarifying this convention to avoid future misclassification when new content types are added.
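
The fail-fast check suggested in the first comment could take a shape like the following sketch. It uses a standard exception where Velox code would more likely use its own check macros, and `toVeloxFileFormatChecked` is a hypothetical name; treating AVRO and METADATA as unsupported is also an assumption made for this example.

```cpp
#include <cassert>
#include <stdexcept>

namespace check_sketch {

enum class FileContent { DATA, POSITION_DELETES, EQUALITY_DELETES };
enum class FileFormat { ORC, PARQUET, AVRO, METADATA, PUFFIN };
enum class VeloxFileFormat { ORC, PARQUET, DWRF };

// Hypothetical variant of the format mapping that also receives the
// content type, so PUFFIN is rejected unless it carries positional
// deletes (the protocol-level encoding of a deletion vector).
inline VeloxFileFormat toVeloxFileFormatChecked(
    FileFormat format,
    FileContent content) {
  switch (format) {
    case FileFormat::ORC:
      return VeloxFileFormat::ORC;
    case FileFormat::PARQUET:
      return VeloxFileFormat::PARQUET;
    case FileFormat::PUFFIN:
      if (content != FileContent::POSITION_DELETES) {
        throw std::invalid_argument(
            "PUFFIN is only expected for deletion vectors");
      }
      // Placeholder format; the deletion vector reader ignores it.
      return VeloxFileFormat::DWRF;
    case FileFormat::AVRO:
    case FileFormat::METADATA:
      break; // assumed unsupported in this sketch
  }
  throw std::invalid_argument("unsupported file format");
}

} // namespace check_sketch
```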

@apurva-meta apurva-meta mentioned this pull request Mar 30, 2026
18 tasks
@steveburnett
Contributor

  • Please add a release note - or NO RELEASE NOTE - following the Release Notes Guidelines to pass the failing but not required CI check.

  • Please edit the PR title to follow semantic commit style to pass the failing and required CI check. See the failure in the test for advice. If you can't edit the PR title, let us know and we can help.

@meta-codesync meta-codesync bot changed the title from "feat: [velox+prestissimo][iceberg] Iceberg V3 full C++ support: deletion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol" to "[velox+prestissimo][iceberg] Iceberg V3 full C++ support: deletion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol" Apr 3, 2026
@meta-codesync meta-codesync bot changed the title from "[velox+prestissimo][iceberg] Iceberg V3 full C++ support: deletion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol" to "feat: [velox+prestissimo][iceberg] Iceberg V3 full C++ support: deletion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (#27462)" Apr 4, 2026
apurva-meta added a commit to apurva-meta/presto that referenced this pull request Apr 4, 2026
…ion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (prestodb#27462)

apurva-meta added a commit to apurva-meta/velox that referenced this pull request Apr 4, 2026
…ion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (facebookincubator#16959)

apurva-meta added a commit to apurva-meta/presto that referenced this pull request Apr 4, 2026
…ion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (prestodb#27462)

apurva-meta added a commit to apurva-meta/velox that referenced this pull request Apr 4, 2026
…ion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (facebookincubator#16959)

apurva-meta added a commit to apurva-meta/presto that referenced this pull request Apr 4, 2026
…ion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (prestodb#27462)

apurva-meta added a commit to apurva-meta/velox that referenced this pull request Apr 4, 2026
…ion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (facebookincubator#16959)

apurva-meta added a commit to apurva-meta/presto that referenced this pull request Apr 4, 2026
apurva-meta added a commit to apurva-meta/velox that referenced this pull request Apr 4, 2026
apurva-meta added a commit to apurva-meta/velox that referenced this pull request Apr 4, 2026