ScalarUDFs created using datafusion.udf() do not propagate extension type metadata #1172

Open
@paleolimbot

Describe the bug

When creating Python/PyArrow UDFs, extension types aren't propagated from the output of one UDF to the input of another: an extension array returned by the first UDF reaches the second as an array of its storage type.

To Reproduce

from uuid import UUID

import datafusion
import pyarrow as pa


@datafusion.udf([pa.string()], pa.uuid(), "stable")
def uuid_from_string(uuid_string):
    return pa.array((UUID(s).bytes for s in uuid_string.to_pylist()), pa.uuid())

@datafusion.udf([pa.uuid()], pa.int64(), "stable")
def uuid_version(uuid):
    return pa.array(s.version for s in uuid.to_pylist())


def main():
    ctx = datafusion.SessionContext()

    batch = pa.record_batch({"idx": pa.array(range(100))})
    tab = (
        ctx.create_dataframe([[batch]])
        .with_column("uuid_string", datafusion.functions.uuid())
        .with_column("uuid", uuid_from_string(datafusion.col("uuid_string")))
        .with_column("uuid_version", uuid_version(datafusion.col("uuid")))
    )
    #> AttributeError("'bytes' object has no attribute 'version'"), since metadata doesn't make it through

    print(tab)


if __name__ == "__main__":
    main()

Expected behavior

The pyarrow.Array that is returned from uuid_from_string() is a UuidArray:

from uuid import uuid4
import pyarrow as pa
pa.array([uuid4().bytes], pa.uuid())
#> <pyarrow.lib.UuidArray object at 0x120292350>

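However, the pyarrow.Array that is passed to uuid_version() is a plain FixedSizeBinary array, i.e. the storage of the UUID extension type, so its elements are bytes rather than UUID objects. I would have expected the array passed here to have the pa.uuid() type.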
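
Until the extension type is propagated, one possible workaround is to make the second UDF tolerate raw storage values. The following is a minimal sketch under that assumption and is not part of datafusion-python: `uuid_versions` is a hypothetical helper that accepts the elements produced by `to_pylist()`, whether they arrive as `UUID` objects (extension type preserved) or as 16-byte `bytes` values (extension type dropped).

```python
from uuid import UUID, uuid4


def uuid_versions(values):
    """Return the UUID version for each element, preserving nulls.

    Hypothetical helper: accepts UUID objects or raw 16-byte values.
    """
    versions = []
    for value in values:
        if value is None:
            versions.append(None)
        elif isinstance(value, UUID):
            # Extension type survived: the element is already a UUID.
            versions.append(value.version)
        else:
            # Extension type was dropped: rebuild the UUID from its bytes.
            versions.append(UUID(bytes=value).version)
    return versions


# Simulate what uuid_version() currently receives: raw bytes and a null.
raw_values = [uuid4().bytes, None]
print(uuid_versions(raw_values))
```

Inside `uuid_version()`, this could be used as `pa.array(uuid_versions(uuid.to_pylist()), pa.int64())`. It only hides the symptom; the extension type is still lost between the two UDFs.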

Additional context

It seems that create_udf() is the mechanism used to create the UDF. However, I believe this doesn't propagate field information (including any extension type metadata attached to the field), since only the DataType of each input and of the return value is passed through:

    fn new(
        name: &str,
        func: PyObject,
        input_types: PyArrowType<Vec<DataType>>,
        return_type: PyArrowType<DataType>,
        volatility: &str,
    ) -> PyResult<Self> {
        let function = create_udf(
            name,
            input_types.0,
            return_type.0,
            parse_volatility(volatility)?,
            to_scalar_function_impl(func),
        );
        Ok(Self { function })
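
To illustrate the suspected cause, here is a toy model in plain Python. It is not DataFusion's or PyArrow's API; `ToyField`, `through_data_type`, and `through_field` are hypothetical names. It relies on one fact from the Arrow format: an extension type is recorded in a field's custom metadata under the `ARROW:extension:name` key (`arrow.uuid` for the canonical UUID type), alongside a storage type such as FixedSizeBinary(16). A signature built only from data types therefore has nowhere to carry that key.

```python
from dataclasses import dataclass, field

# Metadata key the Arrow format uses to name an extension type.
EXTENSION_NAME_KEY = "ARROW:extension:name"


@dataclass
class ToyField:
    """Toy stand-in for an Arrow field: name, storage type, metadata."""

    name: str
    data_type: str
    metadata: dict = field(default_factory=dict)


def through_data_type(f: ToyField) -> ToyField:
    # Models a DataType-only signature: the metadata is not carried over.
    return ToyField(f.name, f.data_type)


def through_field(f: ToyField) -> ToyField:
    # Models a field-aware signature: the metadata is carried over.
    return ToyField(f.name, f.data_type, dict(f.metadata))


uuid_field = ToyField(
    "uuid", "FixedSizeBinary(16)", {EXTENSION_NAME_KEY: "arrow.uuid"}
)

print(through_data_type(uuid_field).metadata)  # {}
print(through_field(uuid_field).metadata)  # {'ARROW:extension:name': 'arrow.uuid'}
```

In this model the first path reproduces the behavior reported above: the consumer sees only the storage type and cannot tell that the values are UUIDs.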

Labels: bug
