Describe the bug
When creating Python/PyArrow UDFs, extension types are not propagated from the output of one UDF to the input of another: the downstream UDF receives the underlying storage array instead of the extension array.
To Reproduce
from uuid import UUID

import datafusion
import pyarrow as pa


@datafusion.udf([pa.string()], pa.uuid(), "stable")
def uuid_from_string(uuid_string):
    return pa.array((UUID(s).bytes for s in uuid_string.to_pylist()), pa.uuid())


@datafusion.udf([pa.uuid()], pa.int64(), "stable")
def uuid_version(uuid):
    return pa.array(s.version for s in uuid.to_pylist())


def main():
    ctx = datafusion.SessionContext()
    batch = pa.record_batch({"idx": pa.array(range(100))})
    tab = (
        ctx.create_dataframe([[batch]])
        .with_column("uuid_string", datafusion.functions.uuid())
        .with_column("uuid", uuid_from_string(datafusion.col("uuid_string")))
        .with_column("uuid_version", uuid_version(datafusion.col("uuid")))
    )
    #> AttributeError("'bytes' object has no attribute 'version'"), since metadata doesn't make it through
    print(tab)


if __name__ == "__main__":
    main()
Expected behavior
The pyarrow.Array that is returned from uuid_from_string() is a UuidArray:
pa.array([uuid4().bytes], pa.uuid())
#> <pyarrow.lib.UuidArray object at 0x120292350>
However, the pyarrow.Array that is passed to uuid_version() is a FixedSizeBinary array. I would have expected the array passed here to have the pa.uuid() type.
Additional context
It seems like create_udf() is the mechanism being used to create the UDF; however, I believe this doesn't propagate field information, since everything goes through the DataType:
Lines 91 to 105 in 9545634