ENH: ListDtype / ListArray #35176

@TomAugspurger (Contributor)

This issue is for adding a ListDtype. This might be useful on its own, and will be useful for #35169 when we have string operations that return a list of values per scalar element.

I think the primary points to discuss are around

  1. How the value_type of the List, the T in List[T], should be specified by the user
  2. How, if at all, to switch between the list_ and large_list types.
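
To make the second point concrete, here is a minimal sketch of one possible switching rule. It relies only on the fact that Arrow's `list_` type stores int32 offsets while `large_list` stores int64 offsets; the helper name `choose_list_type` is hypothetical and not part of pandas or pyarrow.

```python
# Largest offset that fits in the int32 offsets used by Arrow's list_ type.
INT32_MAX = 2**31 - 1


def choose_list_type(lengths):
    """Pick 'list' or 'large_list' from per-row list lengths (hypothetical helper).

    The final offset equals the total number of child values, so that
    total must fit in int32 for the regular list type to be usable.
    """
    total_child_values = sum(lengths)
    return "list" if total_child_values <= INT32_MAX else "large_list"


print(choose_list_type([2, 0, 3]))
print(choose_list_type([2**30, 2**30, 1]))
```

The first call stays within int32 offsets; the second has 2**31 + 1 child values in total and would need `large_list`. Whether pandas should switch automatically like this or expose the choice to the user is exactly the open question.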

xref rapidsai/cudf#5610, where cudf is implementing a ListDtype. Let's chime in over there if we have any thoughts.

Activity

Labels added on Jul 8, 2020:
ExtensionArray (extending pandas with custom dtypes or arrays)
Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.)

jbrockmendel (Member) commented on Jul 8, 2020

IIRC numba has a TypedList, could do something like that for the str.split-like ops

xhochy (Contributor) commented on Jul 10, 2020

One of the biggest challenges with a ListType is probably how to differentiate between scalars and list-likes in the pandas API. There are quite a few places where you can pass in either, and the API behaviour will be slightly different. In the case of a ListDtype, the scalar is also a list-like, which makes it harder to decide which code path should be taken. We can probably use the type information to decide whether we actually have a scalar or not. At the moment this information is sadly not yet present in the dispatching interfaces. For fletcher, I have so far been a bit lazy with this and xfail a lot of these cases: https://github.com/xhochy/fletcher/blob/18ac1a348fdd6ccfb096ec5e27c9dedc1e7fc837/tests/test_pandas_extension.py#L74-L86

jorisvandenbossche (Member) commented on Jul 10, 2020

Yes, this is indeed a general problem we need to solve in pandas. We have also been running into this with GeoPandas (e.g. #26333), and you already run into corner cases when using iterable elements in object dtype. Other related issues: #27911, #35131

We will probably need some mechanism to let the dtype decide if some value can be a scalar or not.
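
As a rough illustration of what such a mechanism could look like, the sketch below lets each dtype answer the "is this a scalar?" question itself. Both classes and the `is_scalar` method are hypothetical names invented for this example; no such hook exists in pandas' extension interface as described in this thread.

```python
from collections.abc import Iterable


class HypotheticalDtype:
    """Stand-in for a generic dtype (not a pandas class)."""

    def is_scalar(self, value):
        # Default rule: any iterable other than str/bytes is list-like,
        # so it is not a scalar.
        is_text = isinstance(value, (str, bytes))
        return is_text or not isinstance(value, Iterable)


class HypotheticalListDtype(HypotheticalDtype):
    """Stand-in for a list dtype whose scalars are themselves lists."""

    def is_scalar(self, value):
        # One list of non-list items is a single element of the array;
        # a list of lists is a sequence of elements.
        return isinstance(value, list) and not any(
            isinstance(item, list) for item in value
        )


generic = HypotheticalDtype()
list_dtype = HypotheticalListDtype()

print(generic.is_scalar([1, 2]))         # False: list-like for a generic dtype
print(list_dtype.is_scalar([1, 2]))      # True: one element of a list array
print(list_dtype.is_scalar([[1], [2]]))  # False: two elements
```

With a hook of this kind, code paths such as `__setitem__` or `fillna` could ask the dtype before deciding whether a passed list is one value or many. The one-level nesting rule here is a simplification and would not cover lists of lists as the value type.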

jorisvandenbossche (Member) commented on Jul 10, 2020

For storing list-like data, I think that will be relatively straightforward (either just with pyarrow, or even directly, since the raw memory layout in Arrow is "just" two arrays holding values and offsets).
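
The values-plus-offsets layout mentioned above can be sketched in plain Python. This is a simplified illustration that ignores Arrow's validity bitmap for missing values; `encode` and `get_row` are hypothetical helper names.

```python
def encode(rows):
    """Flatten a list of lists into (values, offsets).

    Row i occupies values[offsets[i]:offsets[i + 1]], so offsets has
    one more entry than there are rows.
    """
    values = []
    offsets = [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def get_row(values, offsets, i):
    """Recover row i from the flat representation."""
    return values[offsets[i]:offsets[i + 1]]


values, offsets = encode([[1, 2], [], [3, 4, 5]])
print(values)                       # [1, 2, 3, 4, 5]
print(offsets)                      # [0, 2, 2, 5]
print(get_row(values, offsets, 2))  # [3, 4, 5]
```

An empty row shows up as two equal consecutive offsets, which is why no per-row length array is needed.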

But right now there are not yet many operations or kernels included in Arrow to work on nested data, I think. In the meantime, awkward-array might be an interesting option to explore to perform more operations on such data (https://github.com/scikit-hep/awkward-array/)

ananis25 commented on Dec 7, 2020

Could I request that a pandas extension type for n-dim numpy arrays also be considered? Though it probably strays from the pandas semantics of treating a Series as an array-like of scalars.

For a lot of data analysis work, the features are generally aligned along an axis like time and thus are suited to pandas. However, with >1D features, pandas coerces them to a numpy array of subarray objects, which causes memory usage to explode. A native type for numpy arrays of arbitrary dimensions would be very helpful (and easily compatible with arrow), even if aggregation ops, etc. are not allowed.

There is a ragtag implementation here, mostly copied from other available examples of extension arrays. The failing extension array tests generally have to do with:

  1. Failed calls to the is_scalar routine in pandas internals, which seems to support only numpy/pandas scalars.
  2. Construction of an empty Series with the extension dtype. I can't quite pin down what would be a good NA value.

JulianWgs commented on Oct 24, 2021

For reference: cuDF (a GPU implementation of pandas) now has support for ListDtype (Link).

10 remaining items

Metadata

Assignees: No one assigned
Labels: Closing Candidate (may be closeable, needs more eyeballs); Enhancement; ExtensionArray (extending pandas with custom dtypes or arrays); Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.)
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
Participants: @xhochy, @jreback, @jorisvandenbossche, @TomAugspurger, @gwerbin

ENH: ListDtype / ListArray · Issue #35176 · pandas-dev/pandas