
Series iteration and to_dict methods *sometimes* return underlying storage type vs. Python object #25969

Open
@boydgreenfield


Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

s1 = pd.Series({"a": np.int64(64), "b": 10})
for v in s1.to_dict().values():
    print(type(v))  # prints <class 'int'> 2x

s2 = pd.Series({"a": np.int64(64), "b": 10, "c": "ABC"})
for v in s2.to_dict().values():
    print(type(v))  # prints <class 'numpy.int64'> for first variable "a"

for k, v in s1.items():
    print(k, type(v))  # prints <class 'int'> 2x

for k, v in s2.items():
    print(k, type(v))  # prints <class 'numpy.int64'> again for the first variable "a"

Problem description

pd.Series.to_dict can return different types for the same values depending on the composition of the Series. This also affects iteration, e.g., for k, v in series.items(): .... This is inconsistent and, critically, leads to hard-to-debug type issues downstream, especially around JSON conversion (the built-in json module and many other serializers raise an error when they encounter NumPy scalar types such as numpy.int64).
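To make the JSON failure concrete, here is a minimal sketch that does not depend on any particular pandas version. The dict `record` stands in for what `s2.to_dict()` returned in the report above, and `to_builtin` is a hypothetical helper written for illustration, not part of pandas or NumPy. It relies only on `numpy.generic.item()`, which converts a NumPy scalar to the matching Python object.

```python
import json

import numpy as np


def to_builtin(value):
    """Return a plain Python object for NumPy scalars; pass anything else through."""
    # np.generic is the base class of all NumPy scalar types (int64, float32, ...).
    if isinstance(value, np.generic):
        return value.item()
    return value


# Stand-in for the dict described in the report: one value is still numpy.int64.
record = {"a": np.int64(64), "b": 10, "c": "ABC"}

try:
    json.dumps(record)
    failed = False
except TypeError as exc:
    # The standard-library encoder does not know how to serialize numpy.int64.
    failed = True
    print("json.dumps failed:", exc)

# Normalize every value before encoding.
clean = {key: to_builtin(value) for key, value in record.items()}
encoded = json.dumps(clean)
print(encoded)
```

This is only a downstream workaround; it does not address the inconsistency in `to_dict` itself.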

I cannot find this exact issue open in the issue tracker, though there are a number of related issues including:

Expected Output

Expected output is for type coercion to Python ints to occur regardless of the exact column composition in the Series. #24908 is a related issue for DataFrame coercions with irregular behavior happening as a result.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Activity

mroeschke (Member) commented on Apr 4, 2019

This probably occurs because s2 is object dtype and it's trying to preserve the dtype of each input argument while the arguments in s1 can both be coerced to int64.
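The dtype difference behind this explanation can be checked directly. The sketch below only inspects the inferred dtypes of the two Series from the original report. What type the stored element has for the object-dtype Series may vary between pandas versions, so it is printed rather than asserted.

```python
import numpy as np
import pandas as pd

# Same inputs as in the report above.
s1 = pd.Series({"a": np.int64(64), "b": 10})
s2 = pd.Series({"a": np.int64(64), "b": 10, "c": "ABC"})

# All-integer input is stored in a single int64 array.
print(s1.dtype)

# Adding a string forces object dtype, where each element is kept
# as an individual Python-level object.
print(s2.dtype)

# Inspect what is actually stored for "a" in the object-dtype Series.
# The result is version-dependent, so no expected output is given here.
print(type(s2.to_numpy()[0]))
```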

Investigation and PRs welcome.

added this to the Contributions Welcome milestone on Apr 4, 2019
drew-heenan (Contributor) commented on Apr 7, 2019

I'm having a go at this issue - quick note @boydgreenfield, it looks like iterating over a Series object as in the last two loops in your example results in an iteration only over the int values in the Series. Did you mean to iterate over s1.items() or similar?
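The distinction raised here can be shown in a short sketch with a small illustrative Series (not the one from the report): iterating over a Series directly yields only its values, while `.items()` yields `(index, value)` pairs.

```python
import pandas as pd

s = pd.Series({"a": 1, "b": 2})

# Direct iteration walks the values only; the index labels are not included.
values = list(s)

# .items() yields (label, value) tuples, which is what `for k, v in ...` needs.
pairs = list(s.items())
labels = [label for label, _ in pairs]

print(values)
print(labels)
```

Writing `for k, v in s:` against this Series would therefore try to unpack each integer value into two names and fail.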

boydgreenfield (Author) commented on Apr 8, 2019

@drew-heenan Yes you're right I meant .items(). Have updated the above code snippet. Thanks for taking a look at the issue!

Participants: @boydgreenfield, @jreback, @drew-heenan, @jbrockmendel, @mroeschke