BUG: groupby first()/last() change datatype of all NaN columns to float; nth() preserves datatype #33591

Closed
@jdmarino

Description

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandas.
    Version 1.0.3 from the conda distribution.

Code Sample, a copy-pastable example

import pandas as pd
print('pandas version is', pd.__version__, '\n')

df = pd.DataFrame({'id':['a','b','b','c'], 'sym':['ibm','msft','msft','goog'], 'idx':range(4), 'sectype':['E','E','E','E']})
df['osi'] = df.sym.where(df.sectype=='O')  # add a column of NaNs that are object/str
print(df)
print(df.info())

# .nth(-1) does the right thing
x = df.groupby('id').nth(-1)
print('\nnth(-1)')
print(x.info())

# .last() converts the osi col to floats
x = df.groupby('id').last()
print('\nlast()')
print(x.info())

# .nth(0) does the right thing
x = df.groupby('id').nth(0)
print('\nnth(0)')
print(x.info())

# .first() converts the osi col to floats
x = df.groupby('id').first()
print('\nfirst()')
print(x.info())

Problem description

Given a DataFrame with an all-NaN column of object (str) dtype, groupby().first() converts that column to float64. The same happens with .last(), but not with .nth(0) or .nth(-1), so those can serve as workarounds.
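One caveat about using nth() as a stand-in: as far as I understand it, first() returns the first non-null value per column in each group, while nth(0) on this version returns the first row as-is, so the two can differ when a group starts with a missing value. A minimal sketch on a toy frame (not the data from this report); drop_duplicates is used here only as a stand-in for "first row per group" so the example does not depend on how nth() indexes its result:

```python
import numpy as np
import pandas as pd

# Toy frame: group "a" starts with a missing value.
df = pd.DataFrame({"id": ["a", "a", "b"], "val": [np.nan, 1.0, 2.0]})

# first() picks the first non-null value per column within each group.
first_non_null = df.groupby("id")["val"].first()

# First row per group, missing values included (stand-in for nth(0)).
first_row = df.drop_duplicates("id", keep="first").set_index("id")["val"]

print(first_non_null)
print(first_row)
```

For data where the first row of every group is fully populated, or where a column is all-NaN as in this report, both approaches should select the same values.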

This is a problem for me because the groupby().first() call sits in general-purpose code that is called iteratively, and the results are either combined with pd.concat (producing a column with mixed types) or appended to an HDF file (causing the write to fail).
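For the iterative use case, one possible stopgap is to cast the aggregated result back to the input dtypes before concatenating or writing. This is only a sketch: first_with_input_dtypes is a hypothetical helper, not part of pandas, and the cast can fail if aggregation introduces missing values into a column whose original dtype cannot hold them (for example int64).

```python
import pandas as pd


def first_with_input_dtypes(df, key):
    """groupby(key).first() with each column cast back to its input dtype.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.groupby(key).first()
    # Look up the original dtype of every column that survived the groupby.
    wanted = {col: df[col].dtype for col in out.columns}
    return out.astype(wanted)


df = pd.DataFrame(
    {
        "id": ["a", "b", "b", "c"],
        "sym": ["ibm", "msft", "msft", "goog"],
        "idx": range(4),
        "sectype": ["E", "E", "E", "E"],
    }
)
df["osi"] = df.sym.where(df.sectype == "O")  # all-NaN column

result = first_with_input_dtypes(df, "id")
print(result.dtypes)
```

On versions where first() already preserves the dtype, the cast is a no-op, so the helper should be safe to leave in place.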

Expected Output

The DataFrame returned by groupby().first()/.last() should have the same metadata (column structure and dtypes) as the input DataFrame. The result of .first()/.last() should match that of .nth(0)/.nth(-1).
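The expectation above can be checked mechanically by comparing dtypes column by column. A minimal sketch, assuming a hypothetical helper named dtype_changes (not a pandas function); what it reports for the sample frame depends on the installed pandas version:

```python
import pandas as pd


def dtype_changes(before, after):
    """Map column -> (input dtype, output dtype) for columns whose dtype differs.

    Hypothetical helper for illustration; not part of pandas.
    """
    return {
        col: (before[col].dtype, after[col].dtype)
        for col in after.columns
        if col in before.columns and before[col].dtype != after[col].dtype
    }


df = pd.DataFrame(
    {
        "id": ["a", "b", "b", "c"],
        "sym": ["ibm", "msft", "msft", "goog"],
        "idx": range(4),
        "sectype": ["E", "E", "E", "E"],
    }
)
df["osi"] = df.sym.where(df.sectype == "O")  # all-NaN column

# An empty dict means first() preserved every dtype on this version.
print(dtype_changes(df, df.groupby("id").first()))
```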

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.48.0


dsaxton commented on Apr 17, 2020

@dsaxton
Member

I think this is fixed on master:

[ins] In [1]: import pandas as pd
         ...:
         ...: df = pd.DataFrame(
         ...:     {
         ...:         "id": ["a", "b", "b", "c"],
         ...:         "sym": ["ibm", "msft", "msft", "goog"],
         ...:         "sectype": ["E", "E", "E", "E"],
         ...:     }
         ...: )
         ...: df["osi"] = df.sym.where(df.sectype == "O")
         ...: print(df.dtypes)
         ...: print(df.groupby("id")["osi"].first())
         ...: print(df.groupby("id")["osi"].last())
         ...: print(pd.__version__)
         ...:
id         object
sym        object
sectype    object
osi        object
dtype: object
id
a    NaN
b    NaN
c    NaN
Name: osi, dtype: object
id
a    NaN
b    NaN
c    NaN
Name: osi, dtype: object
1.1.0.dev0+1288.g3a5ae505b

simonjayhawkins commented on Apr 19, 2020

@simonjayhawkins
Member

> I think this is fixed on master:

The code sample is different from the OP. The issue in OP does not appear to be fixed.

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1302.ge878fdc41'
>>>
>>> df = pd.DataFrame(
...     {
...         "id": ["a", "b", "b", "c"],
...         "sym": ["ibm", "msft", "msft", "goog"],
...         "idx": range(4),
...         "sectype": ["E", "E", "E", "E"],
...     }
... )
>>> df["osi"] = df.sym.where(df.sectype == "O")  # add a column of NaNs that are object/str
>>>
>>>
>>> x = df.groupby("id").first()
>>> x.osi.dtype
dtype('float64')
>>>
>>> x = df.groupby("id").osi.first()
>>> x.dtypes
dtype('O')
>>>

dsaxton commented on Apr 19, 2020

@dsaxton
Member

@simonjayhawkins Oops, yeah I think you're right


    Labels

    Bug; Duplicate Report (duplicate issue or pull request); Groupby; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)


        Participants

        @jdmarino, @jreback, @TomAugspurger, @dsaxton, @mroeschke
