API: ExtensionArrays and conversion to "native" types (eg in tolist, to_dict, iteration, ..) #29738

Description

@jorisvandenbossche
Member

We try to consistently return Python objects (instead of NumPy scalars) in certain functions such as tolist, to_dict, itertuples/items, etc. (we have had quite a few issues fixing this in several cases).

However, currently we don't do that for extension dtypes (and don't have any mechanism to ask for this):

In [33]: type(pd.Series([1, 2], dtype='int64').tolist()[0]) 
Out[33]: int

In [34]: type(pd.Series([1, 2], dtype='Int64').tolist()[0])  
Out[34]: numpy.int64

In [36]: type(pd.Series([1, 2], dtype='int64').to_dict()[0]) 
Out[36]: int

In [37]: type(pd.Series([1, 2], dtype='Int64').to_dict()[0])
Out[37]: numpy.int64

In [45]: s = pd.Series([1, 2], dtype='int64')

In [46]: type(list(s.iteritems())[0][1])  
Out[46]: int

In [47]: s = pd.Series([1, 2], dtype='Int64')      

In [48]: type(list(s.iteritems())[0][1])  
Out[48]: numpy.int64

Should we add some API to ExtensionArray to provide this? E.g., a method to iterate through the elements that returns "native" objects?
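
A minimal sketch of how such a "native" iteration could behave. The helper name `iter_native` is hypothetical and not part of the pandas API; it only assumes that NumPy scalars expose `.item()`:

```python
import numpy as np
import pandas as pd


def iter_native(values):
    """Yield each element, unwrapping NumPy scalars into Python objects.

    Hypothetical helper for illustration; not part of the pandas API.
    """
    for value in values:
        # np.generic is the base class of all NumPy scalar types.
        if isinstance(value, np.generic):
            yield value.item()
        else:
            yield value


ser = pd.Series([1, 2], dtype="Int64")
native = list(iter_native(ser))
print([type(v) for v in native])
```

Because the helper checks each element, it gives built-in ints whether or not the installed pandas version already converts them during iteration.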

Activity

jorisvandenbossche commented on Nov 20, 2019

@jorisvandenbossche
Member, Author

Actually, for Series, __iter__ also returns native types, so maybe fixing that is enough: ensure that __iter__ on IntegerArray etc. returns Python objects (i.e. by not plainly using __getitem__, which correctly returns NumPy scalars):

In [57]: s = pd.Series([1, 2], dtype='int64')  

In [58]: type(s[0])  
Out[58]: numpy.int64

In [59]: type(list(iter(s))[0])  
Out[59]: int

In [60]: s = pd.Series([1, 2], dtype='Int64')         

In [61]: type(s[0])    
Out[61]: numpy.int64

In [62]: type(list(iter(s))[0])  
Out[62]: numpy.int64   # <-- fixing this might fix the other cases?
added this to the Contributions Welcome milestone on Nov 20, 2019

jorisvandenbossche commented on Nov 20, 2019

@jorisvandenbossche
Member, Author

Let's consider this a bug in __iter__ then, because I think all mentioned cases can be solved with (for IntegerArray):

--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -456,7 +456,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
             if self._mask[i]:
                 yield self.dtype.na_value
             else:
-                yield self._data[i]
+                yield self._data[i].item()
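
The effect of that one-line change can be reproduced outside pandas. The class below is a toy stand-in (the name `TinyMaskedInts` is hypothetical, and this is not the pandas implementation); it only mirrors the `_data`/`_mask` layout shown in the diff:

```python
import numpy as np
import pandas as pd


class TinyMaskedInts:
    """Toy stand-in for a masked integer array (hypothetical, not pandas code)."""

    def __init__(self, data, mask):
        self._data = np.asarray(data, dtype="int64")
        self._mask = np.asarray(mask, dtype=bool)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        # Scalar access keeps returning the NumPy scalar (or the NA marker).
        return pd.NA if self._mask[i] else self._data[i]

    def __iter__(self):
        for i in range(len(self)):
            if self._mask[i]:
                yield pd.NA
            else:
                # .item() converts numpy.int64 to a built-in int.
                yield self._data[i].item()


arr = TinyMaskedInts([1, 2, 0], [False, False, True])
items = list(arr)
print(type(arr[0]), type(items[0]), items[2])
```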

marco-neumann-by commented on Nov 20, 2019

@marco-neumann-by

So ExtensionArray.__iter__ should return native types then? Is this also true for ExtensionArray.__getitem__ with some integer? I think it should at least be documented then so that other (external) implementations can get this right.

jorisvandenbossche commented on Nov 20, 2019

@jorisvandenbossche
Member, Author

Is this also true for ExtensionArray.__getitem__ with some integer?

If we mimic what Series with a plain numpy dtype does, then __getitem__ should keep returning the numpy scalar.
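
A short check of the plain numpy dtype behavior being mimicked, as a sketch (positional access via `.iloc` is used here in place of `s[0]` to avoid label/position ambiguity):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2], dtype="int64")

scalar = ser.iloc[0]       # scalar access: stays a NumPy scalar
first = next(iter(ser))    # iteration: converted to a built-in int

print(type(scalar), type(first))
```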

victorbr92 commented on Nov 23, 2019

@victorbr92

Hi @marco-neumann-jdas and @jorisvandenbossche, are you working on this issue? Could I help and make a PR for it (it was marked as a good first issue)?
I tested the changes proposed by @jorisvandenbossche locally, and they do fix the reported issue.

added a commit that references this issue on Dec 10, 2019

marco-neumann-by commented on Jan 21, 2020

@marco-neumann-by

I am not working on it.
Keep in mind that this issue is not only about fixing the behavior of IntegerArray but also about adjusting the docs of ExtensionArray.__iter__ to state that EVERY EA should return Python-native types.

7 remaining items

mroeschke commented on Jul 19, 2022

@mroeschke
Member

Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?

In [1]: pd.Series([1, 2, None], dtype="Int64").tolist()
Out[1]: [1, 2, <NA>] # should <NA> be None?
removed this from the Contributions Welcome milestone on Oct 13, 2022

Peque commented on Dec 11, 2022

@Peque
Contributor

Stumbled upon this issue when trying to serialize a DataFrame resulting from .isocalendar():

import json

from pandas import date_range

df = date_range("2021-01-01", freq="D", periods=7).isocalendar()
json.dumps(df.to_dict(orient="list"))
# TypeError: Object of type uint32 is not JSON serializable

Of course, this can be reduced to an example like those presented above by @jorisvandenbossche:

type(pd.Series([1, 2], dtype='UInt32').tolist()[0])
# numpy.uint32

Just leaving this comment in case someone looks for "isocalendar" or "not JSON serializable".
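
For pandas versions where the values still arrive as NumPy scalars, one possible workaround is a `default=` hook for `json.dumps`. This is a sketch; the function name `numpy_scalar_default` and the sample payload are made up for illustration:

```python
import json

import numpy as np


def numpy_scalar_default(obj):
    """Fallback for json.dumps: unwrap NumPy scalars, reject everything else."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Stand-in for a to_dict(orient="list") result that contains uint32 scalars.
payload = {"week": [np.uint32(53), np.uint32(1)]}
encoded = json.dumps(payload, default=numpy_scalar_default)
print(encoded)
```

`json.dumps` only calls the hook for objects it cannot encode itself, so built-in ints, floats and strings are unaffected.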

lukemanley commented on Dec 11, 2022

@lukemanley
Member

@Peque - your example cases have been fixed on the main branch and will be included in 2.0. On main, your first example works without raising and your second example returns int.

lukemanley commented on Dec 11, 2022

@lukemanley
Member

Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?

@mroeschke - was there ever an answer to this? pd.NA is still being returned

Peque commented on Dec 11, 2022

@Peque
Contributor

@lukemanley Great to know and thanks for sharing! 😊

mroeschke commented on Dec 12, 2022

@mroeschke
Member

@mroeschke - was there ever an answer to this? pd.NA is still being returned

Personally I think it makes sense to convert pd.NA to None as the Python native type. @jorisvandenbossche might have thoughts on this as well

lukemanley commented on Jan 28, 2023

@lukemanley
Member

#50796 changed Series.to_dict to return None in place of pd.NA which seems to be the preferred behavior as it is consistent with returning native types. I took a look at doing the same for tolist and __iter__.

I'll note that changing the value for __iter__ would cause the following list comprehension to start raising:

arr = pd.array([1, 2, pd.NA], dtype="Int64")

[v+1 for v in arr]  # <- TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

I'm curious if that is too big of a change? Since series.tolist() == list(series) is tested behavior I suspect tolist and __iter__ need to remain consistent with each other.

Any suggestions for moving this forward from here? It's probably not ideal that to_dict and tolist are inconsistent with pd.NA at the moment.
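
The difference behind that concern can be shown with plain lists, independent of what `__iter__` returns in any given pandas version. A small sketch: `pd.NA` propagates through arithmetic, while `None` raises:

```python
import pandas as pd

with_na = [1, 2, pd.NA]
with_none = [1, 2, None]

# pd.NA propagates: pd.NA + 1 evaluates to pd.NA.
incremented = [v + 1 for v in with_na]

# None does not support arithmetic, so the same comprehension raises.
try:
    [v + 1 for v in with_none]
    raised = False
except TypeError:
    raised = True

print(incremented, raised)
```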

cc @phofl @mroeschke

lukemanley commented on Jan 29, 2023

@lukemanley
Member

One more comment here. I think there may be a case for making tolist return None in place of pd.NA but not __iter__. This would make tolist consistent with to_dict. It would also be consistent with how numpy and pyarrow operate. Both numpy and pyarrow have .tolist methods that return python native types and __iter__ methods that return non-native types when iterating through an array.

If we were to take this approach, the currently tested series.tolist() == list(series) behavior would change.

Example with pyarrow:

In [1]: arr = pa.array([1, None])

In [2]: arr.tolist()
Out[2]: [1, None]

In [3]: list(arr)
Out[3]: [<pyarrow.Int64Scalar: 1>, <pyarrow.Int64Scalar: None>]
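
The NumPy side of the same comparison, as a minimal sketch:

```python
import numpy as np

arr = np.array([1, 2], dtype="int64")

from_tolist = arr.tolist()   # built-in int elements
from_iter = list(arr)        # NumPy scalar elements

print([type(v) for v in from_tolist])
print([type(v) for v in from_iter])
```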

mroeschke commented on Jan 30, 2023

@mroeschke
Member

My initial feeling is that __iter__ and list should both return Python types, with the justification that if a user calls list, I think they would expect the elements to behave like native Python objects.

lukemanley commented on Jan 30, 2023

@lukemanley
Member

My initial feeling is that __iter__ and list should both return Python types, with the justification that if a user calls list, I think they would expect the elements to behave like native Python objects.

Sorry for all the questions.

Just want to confirm that you're talking about both Series.__iter__ and EA.__iter__ (e.g. BaseMaskedArray). I ask because with non-EA, there is a native/non-native type difference between Series.__iter__ and Series.values.__iter__:

import pandas as pd

ser  = pd.Series([1])

for v in ser:
    print(type(v))      # -> <class 'int'>

for v in ser.values:
    print(type(v))      # -> <class 'numpy.int64'>

mroeschke commented on Jan 31, 2023

@mroeschke
Member

Series.values is already a numpy array, so I think there's an understanding that np.ndarray.__iter__ does not yield Python native types the way Series.__iter__ does, so yes, I would expect Series.__iter__ and EA.__iter__ to align.

Note I think Series[datetime64/timedelta64] is the only case where we return pandas objects, which makes sense because Timestamp and Timedelta objects hold more resolution than datetime.datetime and datetime.timedelta.
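
A sketch of the resolution point: a Timestamp can carry nanoseconds, which the standard-library datetime cannot represent. The timestamp value is arbitrary, and the warning pandas emits about discarded nanoseconds is silenced for brevity:

```python
import datetime
import warnings

import pandas as pd

ts = pd.Timestamp("2021-01-01 00:00:00.000000001")

with warnings.catch_warnings():
    # Converting drops the nanosecond component; ignore the resulting warning.
    warnings.simplefilter("ignore")
    as_datetime = ts.to_pydatetime()

# Iterating a datetime64 Series yields Timestamp objects, not datetime.datetime.
first = next(iter(pd.Series([ts])))

print(ts.nanosecond, as_datetime, type(first))
```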

lukemanley commented on Jan 31, 2023

@lukemanley
Member

Thanks. One concern with replacing pd.NA with None in __iter__ is that code like this will start to break:

import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")
ser = pd.Series(1, index=idx)

for v in ser.index:
    print(ser[v])       # -> KeyError: None

or simply:

import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")

[v in idx for v in idx]    # -> [True, True, False]

e.g. test_array_interface

import pandas as pd
import numpy as np

arr = pd.array([1, 2, pd.NA], dtype="Int64")

np.array(arr) == np.array(list(arr))    # -> [True, True, False]

The test suite alone has hundreds of failures when replacing pd.NA with None in __iter__. If the test suite is indicative of what users may experience, I suspect this would be a big change and maybe not a desirable one given the examples above. Might there be a case to return native types for non-na values but still return pd.NA for missing values? That happens to be the behavior of ArrowExtensionArray at the moment:

import pandas as pd

arr = pd.array([1, pd.NA], dtype='int64[pyarrow]')

for v in arr:
    print(type(v))

# <class 'int'>
# <class 'pandas._libs.missing.NAType'>

If you still think __iter__ should return None I will go through the test suite and see how much actually needs to change to get everything to pass.
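
Under the "native values, but keep pd.NA" option, callers who need `None` can still convert explicitly at the boundary. A minimal sketch, using JSON encoding as the example boundary:

```python
import json

import pandas as pd

ser = pd.Series([1, 2, None], dtype="Int64")

# Keep pd.NA during iteration and swap it for None only where plain
# Python objects are required (here: JSON encoding).
values = [None if v is pd.NA else int(v) for v in ser]
encoded = json.dumps(values)

print(values, encoded)
```

The explicit `int(v)` makes the result independent of whether the installed pandas version yields NumPy scalars or built-in ints for the non-missing elements.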

Metadata

Assignees

No one assigned

    Labels

    Bug; Dtype Conversions (Unexpected or buggy dtype conversions); ExtensionArray (Extending pandas with custom dtypes or arrays)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Participants

      @Peque @jreback @jorisvandenbossche @lukemanley @mroeschke
