Commit a2ff0c3

apurva-meta authored and facebook-github-bot committed
feat: [velox+prestissimo][iceberg] Apache Iceberg V3 support (facebookincubator#16959)
Summary:
X-link: prestodb/presto#27462

Combined velox/prestissimo diffs for Iceberg V3 C++ support:
- Improve IcebergSplitReader error handling and fix test file handle leaks
- Add Iceberg V3 deletion vector support (DeletionVectorReader)
- Add Iceberg equality delete file reader (EqualityDeleteFileReader)
- Add sequence number conflict resolution for equality deletes
- Add sequence number conflict resolution for positional deletes and deletion vectors
- Add Iceberg V3 deletion vector writer (DeletionVectorWriter)
- Add DWRF file format support for Iceberg data sink
- Reformat FileContent enum to multi-line for extensibility
- Wire PUFFIN file format through C++ protocol and connector layer

Thrift ODR Violation Blocking Native Parquet Writes in Velox

Problem

Velox's Parquet writer crashes with SIGSEGV when linked into any binary that also uses FBThrift (e.g., Prestissimo presto_server). The crash is a C++ One Definition Rule (ODR) violation.

Root Cause

Velox's Parquet writer depends on OSS Apache Thrift (third-party2/apache-thrift/) for serializing Parquet page headers and file metadata. FBThrift (fbcode/thrift/) is Meta's fork used by RPC services. Both libraries declare classes in the same namespace (apache::thrift::protocol::TProtocol, apache::thrift::transport::TTransport, etc.) but with incompatible class layouts:

| | OSS Apache Thrift (Parquet) | FBThrift (Prestissimo RPC) |
|---|---|---|
| Namespace | apache::thrift | apache::thrift |
| TTransport size | ~40 bytes (has TConfiguration shared_ptr, message size fields) | ~8 bytes (vtable pointer only) |
| fd_ offset in TFDTransport | ~40+ | ~8 |

When both are linked into one binary, the linker picks one definition. Code compiled against the other layout reads wrong memory offsets → SIGSEGV.
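The layout mismatch in the table above can be sketched with two stand-in class hierarchies. This is a minimal illustration, not code from either Thrift library: the namespaces `oss_layout` and `fb_layout`, the member names, and the helper `fdOffset` are all hypothetical, and the two layouts are kept in separate namespaces so the example itself stays well defined. In the real bug both classes share the name `apache::thrift::transport::TTransport`, so code compiled against one layout ends up operating on objects built with the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for the OSS Apache Thrift layout: the base class
// carries a shared_ptr and message size fields next to the vtable pointer.
namespace oss_layout {
struct TConfiguration {};
class TTransport {
 public:
  virtual ~TTransport() = default;
  std::shared_ptr<TConfiguration> configuration_;
  std::int64_t remainingMessageSize_ = 0;
  std::int64_t knownMessageSize_ = 0;
};
class TFDTransport : public TTransport {
 public:
  int fd_ = -1;
};
} // namespace oss_layout

// Hypothetical stand-in for the FBThrift layout: vtable pointer only.
namespace fb_layout {
class TTransport {
 public:
  virtual ~TTransport() = default;
};
class TFDTransport : public TTransport {
 public:
  int fd_ = -1;
};
} // namespace fb_layout

// Byte offset of fd_ inside a transport object, measured on a live object.
template <typename T>
std::ptrdiff_t fdOffset() {
  T transport;
  const char* base = reinterpret_cast<const char*>(&transport);
  const char* field = reinterpret_cast<const char*>(&transport.fd_);
  return field - base;
}

// The two base classes cannot be substituted for one another.
static_assert(
    sizeof(oss_layout::TTransport) > sizeof(fb_layout::TTransport),
    "stand-in layouts are expected to differ in size");

// Returns true when the same field sits at different offsets in each layout.
bool layoutsDisagree() {
  return fdOffset<oss_layout::TFDTransport>() !=
      fdOffset<fb_layout::TFDTransport>();
}
```

If a constructor compiled against the larger layout runs on storage sized for the smaller one (or the reverse), reads of `configuration_` or `fd_` land on unrelated memory, which is consistent with the null shared_ptr control block reported in the crash.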

Crash Signature

    Signal 11 (SIGSEGV) (0x0)
    std::_Sp_counted_base<>::_M_release_slow_last_use() ← null shared_ptr control block
    apache::thrift::protocol::TProtocol::TProtocol() ← wrong TTransport layout
    ThriftSerializer::ThriftSerializer() ← Parquet page header serialization
    SerializedPageWriter::SerializedPageWriter()
    Writer::close() → flush() → writeTable()
    IcebergDataSink::closeInternal() ← triggered by any native Parquet write

Impact

- All native Parquet writes crash (INSERT, CTAS) in any Velox binary that links FBThrift
- DWRF/ORC writes are unaffected (they don't use thrift serialization)
- Parquet reads are unaffected (reads use a different thrift code path that happens to not trigger the ODR)
- Affects Prestissimo, and potentially any Velox embedder (Gluten/Spark, etc.) that links both Parquet and another thrift variant

Prior Art

- SEV 635079: same ODR caused SIGSEGV crashes in Spark F3 pipelines (March 2026, SEV-2)
- Apache Arrow already solved this by vendoring OSS thrift in private_parquet::apache::thrift namespace (D47918122, 2023)
- T262970501: tracking task for addressing the Parquet thrift dependency
- GitHub issue facebookincubator#13175: upstream tracking

Solution:

Differential Revision: D98704718
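
The Prior Art entry on Apache Arrow describes vendoring OSS thrift under a `private_parquet::apache::thrift` namespace. A minimal sketch of why that avoids the clash follows, assuming C++17; the class bodies, `flavor()`, and `describeBoth()` are hypothetical and only illustrate the technique. This is not the fix applied by this commit, whose Solution section is empty.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Vendored copy: the extra outer namespace changes every mangled symbol name,
// so the linker no longer treats it as the same entity as the other variant.
namespace private_parquet::apache::thrift::transport {
class TTransport {
 public:
  std::string flavor() const { return "vendored OSS Thrift"; }
};
} // namespace private_parquet::apache::thrift::transport

// The other Thrift variant keeps the original namespace.
namespace apache::thrift::transport {
class TTransport {
 public:
  std::string flavor() const { return "other Thrift variant"; }
};
} // namespace apache::thrift::transport

using VendoredTransport =
    ::private_parquet::apache::thrift::transport::TTransport;
using OtherTransport = ::apache::thrift::transport::TTransport;

// Two distinct types: each definition is unique, so no ODR violation arises.
static_assert(
    !std::is_same_v<VendoredTransport, OtherTransport>,
    "vendored and original TTransport must be different types");

// Both definitions coexist in one program and each call resolves to its own.
std::string describeBoth() {
  return VendoredTransport{}.flavor() + " / " + OtherTransport{}.flavor();
}
```

The trade-off of this approach is that the vendored copy has to be kept in sync with upstream by hand, and every use inside the Parquet code must refer to the prefixed namespace.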
1 parent d789143 commit a2ff0c3

20 files changed, +3838 -37 lines changed

velox/connectors/hive/iceberg/CMakeLists.txt

Lines changed: 3 additions & 0 deletions
@@ -14,12 +14,15 @@
 
 set(
   ICEBERG_SOURCES
+  DeletionVectorReader.cpp
+  DeletionVectorWriter.cpp
   IcebergConfig.cpp
   IcebergColumnHandle.cpp
   IcebergConnector.cpp
   IcebergDataSink.cpp
   IcebergDataSource.cpp
   IcebergPartitionName.cpp
+  EqualityDeleteFileReader.cpp
   IcebergSplit.cpp
   IcebergSplitReader.cpp
   PartitionSpec.cpp
