Skip to content

Add support for FBThrift in Parquet and remove thrift dependency #13175

@majetideepak

Description

@majetideepak

Discussed in #6163

Originally posted by pedroerp August 18, 2023
We currently have a dependency on thrift from our native parquet reader. As we add support for remote function execution, the client <-> server communication will be done using thrift RPC as well.

FBThrift is a newer and supposedly more efficient version than thrift, and among other things provides an async transport called "cpp2" (not available in thrift), which will be used for remote function execution.

Linking thrift and fbthrift to the same binary causes symbols to conflict and end up causing crashes as the wrong version of symbols eventually get picked. There was a long discussion on Slack (check #builds-and-ci channel) on whether this could be solved using dynamic linking, but so solution was found.

With this context, the proposal is to remove the thrift dependency from Velox (created due to parquet), and consolidate all of that on FBThrift. Despite their divergence, both projects seem to support TCompactProtocol (the ones used by Parquet), so, in theory (need some testing to validate), they should be compatible. FBThrift is also claimed to be more efficient, so there might be small performance gains in doing so.

The parquet generated cpp files are also today pre-compiled and checked in as part of the codebase; as part of this change we would also move them to be compiled at build time, like how it is done for remote functions. I've tested and all of them seems to be doable.

There is also a thrift dependency that comes with Arrow (we use arrow's parquet writer), but this is nicely wrapped within arrow and doesn't leak its internal symbols.

What do you all think?

UPDATE: just doubled checked with someone from the fbthrift team and the TCompactProtocol serialization needed for Parquet should be compatible.

Cc: @assignUser @majetideepak @aditi-pandit @yingsu00 @kgpai @mbasmanova @xiaoxmeng @oerling

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions