Query execution time difference between deltatable QueryBuilder and using DataFusion directly. #1140

Open
@debajyoti-truefoundry

Describe the bug
What happened:
There are two ways of querying a Delta Table using DataFusion.

  1. Using DataFusion directly.
  2. Using the Query Builder from Delta.

Versions:
deltalake==1.0.2
datafusion==47.0.0

import time
from datafusion import SessionContext
from deltalake import DeltaTable, QueryBuilder

dt = DeltaTable("./delta_traces_3/otel_traces")
sql = """
SELECT
  *
FROM tbl
WHERE
("MlRepoId" = 1089) AND ("TracingProjectId" = '222fde49-1f7a-4752-8ec1-06bcdbf570c5') AND ("TraceId" = '8728990bd3d11fa91a688e9d9964bca1') AND ("SpanId" = '82c0a65e80000450')
"""

qb = QueryBuilder().register("tbl", dt)
start = time.monotonic()
table = qb.execute(sql).read_all()
print("Delta QueryBuilder: ", time.monotonic() - start)

ctx = SessionContext()
ctx.register_table_provider("tbl", dt)
start = time.monotonic()
arrow_list = ctx.sql(sql).collect()
print("DataFusion: ", time.monotonic() - start)

$ python perf_diff.py
Delta QueryBuilder:  1.2023070430004736
DataFusion:  136.57222191900019

For the same SQL against the same table, the direct DataFusion path took about 136.6 s while the Delta QueryBuilder took about 1.2 s, a difference of more than 100x.
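
Since the numbers above come from a single run of each path, a repeated measurement may help rule out one-off effects such as cold caches. The following is only a minimal sketch: `time_runs` and `summarize` are hypothetical helper names, not part of deltalake or datafusion, and the sketch assumes both `qb` and `ctx` can be queried more than once.

```python
import statistics
import time
from typing import Callable, List


def time_runs(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call fn `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(label: str, durations: List[float]) -> str:
    """Format the first-run and median timings for one query path."""
    return (
        f"{label}: first={durations[0]:.3f}s "
        f"median={statistics.median(durations):.3f}s over {len(durations)} runs"
    )


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a Delta table.
    print(summarize("demo", time_runs(lambda: sum(range(100_000)))))
    # With the objects from the script above, the two paths could be timed as:
    #   time_runs(lambda: qb.execute(sql).read_all())
    #   time_runs(lambda: ctx.sql(sql).collect())
```

If the gap persists across repeated runs, the difference is unlikely to be a warm-up artifact.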

To Reproduce
Steps to reproduce the behavior:

Expected behavior
I expected near-identical execution times for both paths.

Additional context
delta-io/delta-rs#3517 (comment)

Labels: bug