Commit 4ccb647

apurva-meta authored and facebook-github-bot committed
feat: [velox+prestissimo][iceberg] Iceberg V3 full C++ support: deletion vectors, equality deletes, sequence number conflict resolution, DV writer, DWRF data sink, Manifold filesystem, PUFFIN protocol (facebookincubator#16959)
Summary:

X-link: prestodb/presto#27462

Combined velox/prestissimo diffs for Iceberg V3 C++ support:

- Improve IcebergSplitReader error handling and fix test file handle leaks
- Add Iceberg V3 deletion vector support (DeletionVectorReader)
- Add Iceberg equality delete file reader (EqualityDeleteFileReader)
- Add sequence number conflict resolution for equality deletes
- Add sequence number conflict resolution for positional deletes and deletion vectors
- Add Iceberg V3 deletion vector writer (DeletionVectorWriter)
- Add DWRF file format support for Iceberg data sink
- Add Manifold filesystem support with CAT token authentication
- Reformat FileContent enum to multi-line for extensibility
- Wire PUFFIN file format through C++ protocol and connector layer

Thrift ODR Violation Blocking Native Parquet Writes in Velox

Problem

Velox's Parquet writer crashes with SIGSEGV when linked into any binary that also uses FBThrift (e.g., Prestissimo presto_server). The crash is a C++ One Definition Rule (ODR) violation.

Root Cause

Velox's Parquet writer depends on OSS Apache Thrift (third-party2/apache-thrift/) for serializing Parquet page headers and file metadata. FBThrift (fbcode/thrift/) is Meta's fork used by RPC services. Both libraries declare classes in the same namespace (apache::thrift::protocol::TProtocol, apache::thrift::transport::TTransport, etc.) but with incompatible class layouts:

| | OSS Apache Thrift (Parquet) | FBThrift (Prestissimo RPC) |
| --- | --- | --- |
| Namespace | apache::thrift | apache::thrift |
| TTransport size | ~40 bytes (has TConfiguration shared_ptr, message size fields) | ~8 bytes (vtable pointer only) |
| fd_ offset in TFDTransport | ~40+ | ~8 |

When both are linked into one binary, the linker picks one definition. Code compiled against the other layout reads wrong memory offsets → SIGSEGV.
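The layout mismatch described above can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins, not the real Thrift classes: `oss_like` mimics a transport base that carries a configuration `shared_ptr` and message size fields, and `fb_like` mimics one that holds only a vtable pointer. They are deliberately placed in two different namespaces so that the program is well-formed; this is the same idea as the namespace vendoring attributed to Apache Arrow in the commit message. The exact sizes and offsets depend on the platform ABI, so the sketch only compares them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a transport base with extra state
// (configuration pointer plus message size bookkeeping).
namespace oss_like {
struct TConfiguration {};

class TTransport {
 public:
  virtual ~TTransport() = default;
  std::shared_ptr<TConfiguration> configuration_;
  std::int64_t remainingMessageSize_ = 0;
  std::int64_t knownMessageSize_ = 0;
};

class TFDTransport : public TTransport {
 public:
  int fd_ = -1;
};
} // namespace oss_like

// Hypothetical stand-in for a transport base that holds only a vtable pointer.
namespace fb_like {
class TTransport {
 public:
  virtual ~TTransport() = default;
};

class TFDTransport : public TTransport {
 public:
  int fd_ = -1;
};
} // namespace fb_like

// Returns the byte offset of fd_ inside an object of type T.
// Computed from a live object because offsetof is only conditionally
// supported for classes with virtual functions.
template <class T>
std::ptrdiff_t fdOffset() {
  T t;
  return reinterpret_cast<const char*>(&t.fd_) -
      reinterpret_cast<const char*>(&t);
}

// The two same-named classes disagree on size, so code compiled against one
// layout would read the wrong bytes if handed an object of the other layout.
static_assert(
    sizeof(oss_like::TTransport) > sizeof(fb_like::TTransport),
    "layouts are expected to differ");
```

If both classes lived in the same namespace under the same name, as in the situation the commit message describes, a caller built against one layout would look for `fd_` at `fdOffset<oss_like::TFDTransport>()` while the linked definition places it at `fdOffset<fb_like::TFDTransport>()`. Keeping the definitions in separate namespaces removes the collision.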
Crash Signature

    Signal 11 (SIGSEGV) (0x0)
    std::_Sp_counted_base<>::_M_release_slow_last_use()   ← null shared_ptr control block
    apache::thrift::protocol::TProtocol::TProtocol()      ← wrong TTransport layout
    ThriftSerializer::ThriftSerializer()                  ← Parquet page header serialization
    SerializedPageWriter::SerializedPageWriter()
    Writer::close() → flush() → writeTable()
    IcebergDataSink::closeInternal()                      ← triggered by any native Parquet write

Impact

- All native Parquet writes crash (INSERT, CTAS) in any Velox binary that links FBThrift
- DWRF/ORC writes are unaffected (they don't use thrift serialization)
- Parquet reads are unaffected (reads use a different thrift code path that happens to not trigger the ODR)
- Affects Prestissimo, and potentially any Velox embedder (Gluten/Spark, etc.) that links both Parquet and another thrift variant

Prior Art

- SEV 635079: same ODR caused SIGSEGV crashes in Spark F3 pipelines (March 2026, SEV-2)
- Apache Arrow already solved this by vendoring OSS thrift in private_parquet::apache::thrift namespace (D47918122, 2023)
- T262970501: tracking task for addressing the Parquet thrift dependency
- GitHub issue facebookincubator#13175: upstream tracking

Solution:

Differential Revision: D98704718
1 parent 7ea5609 commit 4ccb647

20 files changed: +3734 -57 lines changed

velox/connectors/hive/iceberg/CMakeLists.txt

Lines changed: 3 additions & 0 deletions
@@ -14,13 +14,16 @@
 
 set(
   ICEBERG_SOURCES
+  DeletionVectorReader.cpp
+  DeletionVectorWriter.cpp
   IcebergConfig.cpp
   IcebergColumnHandle.cpp
   IcebergConnector.cpp
   IcebergDataFileStatistics.cpp
   IcebergDataSink.cpp
   IcebergDataSource.cpp
   IcebergPartitionName.cpp
+  EqualityDeleteFileReader.cpp
   IcebergSplit.cpp
   IcebergSplitReader.cpp
   PartitionSpec.cpp
