Describe the bug
What happened:
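There are two ways to query a Delta table with DataFusion: through deltalake's QueryBuilder, or by registering the DeltaTable as a table provider on a DataFusion SessionContext. The same query is much slower on the second path. Versions used: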
deltalake==1.0.2
datafusion==47.0.0
import time

from datafusion import SessionContext
from deltalake import DeltaTable, QueryBuilder

dt = DeltaTable("./delta_traces_3/otel_traces")

sql = """
SELECT
    *
FROM tbl
WHERE
    ("MlRepoId" = 1089) AND ("TracingProjectId" = '222fde49-1f7a-4752-8ec1-06bcdbf570c5') AND ("TraceId" = '8728990bd3d11fa91a688e9d9964bca1') AND ("SpanId" = '82c0a65e80000450')
"""

# Path 1: deltalake's QueryBuilder
qb = QueryBuilder().register("tbl", dt)
start = time.monotonic()
table = qb.execute(sql).read_all()
print("Delta QueryBuilder: ", time.monotonic() - start)

# Path 2: DataFusion SessionContext with the DeltaTable as a table provider
ctx = SessionContext()
ctx.register_table_provider("tbl", dt)
start = time.monotonic()
arrow_list = ctx.sql(sql).collect()
print("DataFusion: ", time.monotonic() - start)
python perf_diff.py
Delta QueryBuilder: 1.2023070430004736
DataFusion: 136.57222191900019
As the output above shows, the same query takes about 1.2 s through QueryBuilder but about 136.6 s through the DataFusion SessionContext, a difference of more than 100x.
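To rule out warm-up or caching effects, each path could be timed over several runs instead of once. The following is a minimal sketch, not part of the original script: `time_runs` and `summarize` are hypothetical helper names, and the repeat count of 3 is an arbitrary assumption.

```python
import statistics
import time
from typing import Callable, List


def time_runs(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call fn `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # the query call under test; its result is discarded
        durations.append(time.perf_counter() - start)
    return durations


def summarize(label: str, durations: List[float]) -> str:
    """Report the first (cold) run and the median so both can be compared."""
    median = statistics.median(durations)
    return f"{label}: first={durations[0]:.3f}s median={median:.3f}s"
```

With the script above, the two paths would be passed in as `lambda: qb.execute(sql).read_all()` and `lambda: ctx.sql(sql).collect()`. If the gap persists across repeated runs, it is unlikely to be a one-time startup cost.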
To Reproduce
Steps to reproduce the behavior: run the script above (perf_diff.py) against a Delta table with the package versions listed.
Expected behavior
I expected near-identical execution times, since both paths run the same SQL against the same table.
Additional context
delta-io/delta-rs#3517 (comment)