Open
Labels
Bug; Dtype Conversions (Unexpected or buggy dtype conversions); ExtensionArray (Extending pandas with custom dtypes or arrays.)
Description
We try to consistently return Python objects (instead of numpy scalars) in certain functions like tolist, to_dict, itertuples/items, ... (we have had quite some issues fixing this in several cases).
However, currently we don't do that for extension dtypes (and don't have any mechanism to ask for this):
In [33]: type(pd.Series([1, 2], dtype='int64').tolist()[0])
Out[33]: int
In [34]: type(pd.Series([1, 2], dtype='Int64').tolist()[0])
Out[34]: numpy.int64
In [36]: type(pd.Series([1, 2], dtype='int64').to_dict()[0])
Out[36]: int
In [37]: type(pd.Series([1, 2], dtype='Int64').to_dict()[0])
Out[37]: numpy.int64
In [45]: s = pd.Series([1, 2], dtype='int64')
In [46]: type(list(s.iteritems())[0][1])
Out[46]: int
In [47]: s = pd.Series([1, 2], dtype='Int64')
In [48]: type(list(s.iteritems())[0][1])
Out[48]: numpy.int64
Should we add some API to ExtensionArray to provide this? E.g. a method to iterate through the elements that returns "native" objects?
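As a rough illustration of what such an element-wise conversion could look like, here is a minimal sketch using numpy only. The helper name iter_native is hypothetical and not part of the pandas or ExtensionArray API; it relies on numpy scalars deriving from np.generic and exposing .item().

```python
import numpy as np


def iter_native(values):
    """Yield each element as a built-in Python object where possible.

    Hypothetical helper for illustration only; not a pandas API.
    """
    for value in values:
        # numpy scalars (np.int64, np.float64, ...) all derive from np.generic
        yield value.item() if isinstance(value, np.generic) else value


arr = np.array([1, 2], dtype="int64")
raw_types = [type(v) for v in arr]                   # numpy scalar types
native_types = [type(v) for v in iter_native(arr)]   # built-in types

print(raw_types)
print(native_types)
```

Elements that are not numpy scalars pass through unchanged, so the helper is safe to apply to mixed input.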
Activity
jorisvandenbossche commented on Nov 20, 2019
Actually, for Series, the __iter__ also returns native types, so maybe fixing that is enough: ensure that __iter__ on IntegerArray etc. does return the Python objects (so by not plainly using __getitem__, which correctly returns numpy scalars):
jorisvandenbossche commented on Nov 20, 2019
Let's consider this a bug in __iter__ then, because I think all mentioned cases can be solved with (for IntegerArray):
marco-neumann-by commented on Nov 20, 2019
So ExtensionArray.__iter__ should return native types then? Is this also true for ExtensionArray.__getitem__ with some integer? I think it should at least be documented then so that other (external) implementations can get this right.
jorisvandenbossche commented on Nov 20, 2019
If we mimic what Series with plain numpy dtype does, then getitem should keep returning the numpy scalar.
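To make the distinction concrete, here is a minimal sketch, assuming a toy container that wraps a numpy array. TinyIntArray is a hypothetical class for illustration and not pandas' IntegerArray (it has no mask); it only shows __getitem__ handing back the numpy scalar while __iter__ yields built-in Python objects.

```python
import numpy as np


class TinyIntArray:
    """Hypothetical stand-in for an integer extension array (no mask)."""

    def __init__(self, data):
        self._data = np.asarray(data, dtype="int64")

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        # Scalar access mirrors numpy and returns a numpy scalar.
        return self._data[i]

    def __iter__(self):
        # Iteration does not reuse __getitem__; ndarray.tolist() converts
        # every element to a built-in Python object.
        return iter(self._data.tolist())


arr = TinyIntArray([1, 2])
getitem_type = type(arr[0])
iter_types = [type(v) for v in arr]

print(getitem_type)
print(iter_types)
```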
victorbr92 commented on Nov 23, 2019
Hi, @marco-neumann-jdas and @jorisvandenbossche, are you working on this issue? Could I help somehow and make a PR for it (it was marked as a good first issue)?
I tested the changes proposed by @jorisvandenbossche locally and they do fix the reported issue.
Resolving issue request pandas-dev#29738. Updated integer.py to analy…
marco-neumann-by commented on Jan 21, 2020
I am not working on it.
Keep in mind that this issue is not only about fixing the behavior of IntegerArray but also about adjusting the docs of ExtensionArray.__iter__ to state that EVERY EA should return Python-native types.
mroeschke commented on Jul 19, 2022
Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?
Peque commented on Dec 11, 2022
Stumbled upon this issue when trying to serialize a DataFrame resulting from .isocalendar():
Of course, this can be reduced to an example like those presented above by @jorisvandenbossche:
Just leaving this comment in case someone looks for "isocalendar" or "not JSON serializable".
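For anyone landing here from that error message, the underlying failure can be reproduced without pandas. A minimal sketch, assuming the value arrives as a numpy.int64 (the key name and the number are made up):

```python
import json

import numpy as np

value = np.int64(52)  # e.g. a week number held as a numpy scalar

try:
    json.dumps({"week": value})
    error = None
except TypeError as exc:
    # The json module only knows built-in types; np.int64 is not an int subclass.
    error = str(exc)

# Converting to a built-in int first makes the value serializable.
encoded = json.dumps({"week": value.item()})

print(error)
print(encoded)
```

Calling .item() on the scalar (or .tolist() on a whole numpy array) is one workaround when the container still hands back numpy scalars.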
lukemanley commented on Dec 11, 2022
@Peque - your example cases have been fixed on the main branch and will be included in 2.0. On main, your first example works without raising and your second example returns int.
lukemanley commented on Dec 11, 2022
@mroeschke - was there ever an answer to this? pd.NA is still being returned.
Peque commented on Dec 11, 2022
@lukemanley Great to know and thanks for sharing! 😊
mroeschke commented on Dec 12, 2022
Personally I think it makes sense to convert pd.NA to None as the Python native type. @jorisvandenbossche might have thoughts on this as well.
lukemanley commented on Jan 28, 2023
#50796 changed Series.to_dict to return None in place of pd.NA, which seems to be the preferred behavior as it is consistent with returning native types. I took a look at doing the same for tolist and __iter__.
I'll note that changing the value for __iter__ would cause the following list comprehension to start raising:
I'm curious if that is too big of a change? Since series.tolist() == list(series) is tested behavior, I suspect tolist and __iter__ need to remain consistent with each other.
Any suggestions for moving this forward from here? It's probably not ideal that to_dict and tolist are inconsistent with pd.NA at the moment.
cc @phofl @mroeschke
lukemanley commented on Jan 29, 2023
One more comment here. I think there may be a case for making tolist return None in place of pd.NA but not __iter__. This would make tolist consistent with to_dict. It would also be consistent with how numpy and pyarrow operate. Both numpy and pyarrow have .tolist methods that return Python native types and __iter__ methods that return non-native types when iterating through an array.
If we were to take this approach, the currently tested series.tolist() == list(series) behavior would change.
Example with pyarrow:
mroeschke commented on Jan 30, 2023
My initial feeling is that both __iter__ and list should return Python types, with the justification that if a user calls list I think they would expect the elements to behave like native Python objects.
lukemanley commented on Jan 30, 2023
Sorry for all the questions.
Just want to confirm that you're talking about both Series.__iter__ and EA.__iter__ (e.g. BaseMaskedArray). I ask because with non-EA, there is a native/non-native type difference between Series.__iter__ and Series.values.__iter__:
mroeschke commented on Jan 31, 2023
Series.values is already a numpy array, so I think there's an understanding that np.array.__iter__ does not yield Python native types the way Series.__iter__ does, so yes, I would expect Series.__iter__ and EA.__iter__ to align.
Note I think Series[datetime64/timedelta64] is the only time where we return pandas objects, which makes sense due to Timestamp and Timedelta objects holding more resolution than datetime.datetime, datetime.timedelta.
lukemanley commented on Jan 31, 2023
Thanks. One concern with replacing pd.NA with None in __iter__ is that code like this will start to break:
or simply:
e.g. test_array_iterface
The test suite alone has hundreds of failures when replacing pd.NA with None in __iter__. If the test suite is indicative of what users may experience, I suspect this would be a big change and maybe not a desirable one given the examples above. Might there be a case to return native types for non-na values but still return pd.NA for missing values? That happens to be the behavior of ArrowExtensionArray at the moment:
If you still think __iter__ should return None I will go through the test suite and see how much actually needs to change to get everything to pass.