Labels: Closing Candidate, Enhancement, ExtensionArray, Nested Data
Description
This issue is for adding a ListDtype. This might be useful on its own, and will be useful for #35169 when we have string operations that return a List of values per scalar element.
I think the primary points to discuss are:
- How the `value_type` of the List, the `T` in `List[T]`, should be specified by the user.
- How, if at all, to switch between the `list_` and `large_list` types (a rough sketch of the difference follows below).
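As an illustration of those two points, the snippet below shows the two Arrow list layouts; the `ListDtype(...)` spelling in the comment is purely hypothetical, only the `pyarrow` calls actually exist today.

```python
import pyarrow as pa

value_type = pa.int64()  # the T in List[T]

# Arrow offers two physical layouts for list data:
small = pa.list_(value_type)        # 32-bit offsets into the child array
large = pa.large_list(value_type)   # 64-bit offsets, for very large child arrays

# A pandas-level dtype could conceivably be parametrized the same way,
# e.g. ListDtype(value_type=pa.int64()) -- hypothetical, does not exist yet.
```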
xref rapidsai/cudf#5610, where cudf is implementing a ListDtype. Let's chime in over there if we have any thoughts.
jbrockmendel commented on Jul 8, 2020
IIRC numba has a TypedList; we could do something like that for the str.split-like ops.
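For context, a minimal sketch of numba's typed list; how this would actually be wired into the pandas string ops is not shown and would be the speculative part.

```python
from numba import njit
from numba.typed import List


@njit
def double_all(values):
    # Build a typed list inside a jitted function; the element type is
    # inferred from the first append.
    out = List()
    for v in values:
        out.append(v * 2)
    return out


typed_input = List()
for x in (1, 2, 3):
    typed_input.append(x)

print(list(double_all(typed_input)))  # [2, 4, 6]
```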
xhochy commented on Jul 10, 2020
One of the biggest challenges with a ListType is probably how to differentiate between scalars and list-likes in the pandas API. There are quite a few places where you can pass in either, and the API behaviour will differ slightly. In the case of a ListDtype, the scalar is itself a list-like, which makes it harder to decide which code path should be taken. We could probably use the type information to decide whether we actually have a scalar or not, but at the moment this information is sadly not yet present in the dispatching interfaces. For `fletcher`, I have so far been a bit lazy with this and `xfail` a lot of these cases: https://github.com/xhochy/fletcher/blob/18ac1a348fdd6ccfb096ec5e27c9dedc1e7fc837/tests/test_pandas_extension.py#L74-L86
jorisvandenbossche commented on Jul 10, 2020
Yes, this is indeed a general problem we need to solve in pandas. We have also been running into this with GeoPandas (e.g. #26333), and you already run into corner cases when using iterable elements with object dtype. Other related issues: #27911, #35131
We will probably need some mechanism to let the dtype decide if some value can be a scalar or not.
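A rough sketch of what such a mechanism could look like; only `pd.api.types.is_scalar` exists today, while the `ListDtype` class and its `_is_list_scalar` hook below are made up for illustration.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

# Today, pandas has a single global notion of "scalar":
pd.api.types.is_scalar(1)       # True
pd.api.types.is_scalar([1, 2])  # False -- but for a list dtype this *is* one element


# Hypothetical hook letting the dtype decide (name and signature made up;
# an actual ExtensionDtype would also need construct_array_type, etc.):
class ListDtype(ExtensionDtype):
    name = "list"
    type = list  # the scalar type for this dtype is a Python list

    def _is_list_scalar(self, value) -> bool:
        # A plain Python list counts as a single element of a list column.
        return isinstance(value, list)
```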
jorisvandenbossche commented on Jul 10, 2020
For storing list-like data, I think that will be relatively straightforward (either just with pyarrow, or even the raw memory layout of Arrow, which is "just" two arrays with values and offsets).
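For illustration, that values-and-offsets layout is directly visible on a pyarrow `ListArray`:

```python
import pyarrow as pa

arr = pa.array([[1, 2], [], [3, 4, 5]], type=pa.list_(pa.int64()))

arr.values   # Int64Array: [1, 2, 3, 4, 5]   (all elements, flattened)
arr.offsets  # Int32Array: [0, 2, 2, 5]      (where each list starts and ends)
```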
But right now there are not yet many operations or kernels included in Arrow to work on nested data, I think. In the meantime, `awkward-array` might be an interesting option to explore for performing more operations on such data (https://github.com/scikit-hep/awkward-array/).
ananis25 commented on Dec 7, 2020
Could I request to also consider a pandas extension type for n-dim numpy arrays? Though it probably strays from the pandas semantics of considering a Series as an array-like of scalars.
For a lot of data analysis work, the features are generally aligned along an axis like time and thus are suited to pandas. However, with >1D features, pandas coerces them to a numpy array of subarray objects, which causes memory usage to explode. A native type for numpy arrays of arbitrary dimensions would be very helpful (and easily compatible with arrow), even if aggregation ops, etc. are not allowed.
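To make the memory point concrete, here is a small sketch contrasting today's object-dtype fallback with a flat Arrow layout (the Arrow part only illustrates the storage idea; it is not an existing pandas dtype):

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# Today: a column of fixed-length feature vectors degrades to object dtype,
# i.e. one small ndarray object per row.
features = [np.random.rand(3) for _ in range(1_000)]
s = pd.Series(features)
print(s.dtype)  # object

# A flat layout stores one contiguous child array plus a fixed list size,
# avoiding the per-row Python/ndarray object overhead.
flat = pa.FixedSizeListArray.from_arrays(pa.array(np.concatenate(features)), 3)
print(flat.type)  # fixed_size_list<item: double>[3]
```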
There is a ragtag implementation here, mostly copied from other available examples of extension arrays. The failing extension array tests generally have to do with the `is_scalar` routine in pandas internals, which seems to support only numpy/pandas scalars.
JulianWgs commented on Oct 24, 2021
For reference: cuDF (a GPU implementation of pandas) now has support for ListDtype (Link).
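A quick sketch of what that looks like in cuDF (requires a GPU and the `cudf` package; accessor names as found in recent cuDF releases):

```python
import cudf

s = cudf.Series([[1, 2], [3], [4, 5, 6]])
print(s.dtype)       # ListDtype(int64)
print(s.list.len())  # per-row list lengths via the .list accessor: 2, 1, 3
```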