resample drops dtype('O') columns from DataFrame (but not Series) #14084

Closed
@rcyeh

Description

Code Sample, a copy-pastable example if possible

import numpy
import pandas as pd

idx = pd.date_range('1/1/2016', periods=100, freq='d')

z = pd.DataFrame({"z": numpy.zeros(len(idx))}, index=idx)
obj = pd.DataFrame({"obj": numpy.zeros(len(idx), dtype=numpy.dtype('O'))}, index=idx)
nan = pd.DataFrame({"nan": numpy.zeros(len(idx)) + float('nan')}, index=idx)
objnan = pd.DataFrame([], columns=['objnan'], index=idx[0:0])  # dtype('O') nan after join

df_list = [z, obj, nan, objnan]

def r1w(x):
    return x.resample('1w').sum()

resample_join_result = r1w(pd.DataFrame().join(df_list, how='outer'))
print(resample_join_result.shape)  # (15, 2) --- I thought this should be (15, 4)

join_resample_result = pd.DataFrame().join([r1w(x) for x in df_list], how='outer')
print(join_resample_result.shape)  # (15, 4)

for columnname in join_resample_result.columns:
    if columnname not in resample_join_result.columns:
        print("DataFrame.resample missing column: " + str(columnname) + " (" + str(join_resample_result[columnname].dtype) + ")")

Expected Output

I expected resample_join_result to have all four columns and to equal join_resample_result, but it does not: pandas.DataFrame.resample appears to drop dtype('O') (object) columns, while pandas.Series.resample converts them to numeric dtypes.

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: 0.2.0

Activity

jreback commented on Aug 24, 2016

Contributor

xref #12537

For upsampling this would work, but how would you do downsampling? You could maybe do max/min/first/last, but other functions like sum/mean/var won't work at all, because it's impossible to select a correct row for the non-numerics. So it is better to simply exclude them. groupby does the same (except for sum, which is weird because it works on strings).
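To make the downsampling point concrete, here is a minimal sketch (not taken from the thread) that names a reduction for each column explicitly, so a string column is neither dropped nor summed. The frame, the column names `value` and `label`, and the choice of `first` for the strings are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-week frame: one numeric column and one string column.
idx = pd.date_range("2016-01-01", periods=14, freq="D")
df = pd.DataFrame(
    {"value": range(14), "label": [f"day{i}" for i in range(14)]},
    index=idx,
)

# Name a reduction per column so nothing is excluded implicitly:
# numbers are summed, strings keep the first entry of each weekly bin.
weekly = df.resample("W").agg({"value": "sum", "label": "first"})
print(weekly)
```

Choosing `first` is arbitrary; `last`, `min` or `max` would be equally defensible, which is the ambiguity described above.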

added this to the No action milestone on Aug 24, 2016

rcyeh commented on Aug 31, 2016

Author

Wow, thanks for your quick reply. I'm not sure I understand it all, but I think I will after I read it a few more times.

As a newbie, I am very surprised that a DataFrame method could ever return a different result than an index-based DataFrame.join of [df[x].method() for x in df], where the identically-named Series method has been applied to each column of the DataFrame. I haven't used pandas enough to understand why this should be. Based on the cross-referenced issue #12537 and comment about groupby, I am guessing that resample is not the only case that will exhibit this. Is there a way I can guess when that might occur?

I'd rather have an explicit NaN or a ValueError exception for an unsupported data type than a silent drop. This may be related to issues of control over the treatment of NaN. The pandas default (exclude/ignore) is usually pleasantly convenient, but there are times when I really do want NaN to propagate (and I am fine with being told to go away and use numpy). I can always explicitly drop non-numeric columns.
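One way to get that explicit behaviour in user code is sketched below. `strict_weekly_sum` is a hypothetical helper, not a pandas function: it raises ValueError when any non-numeric column is present, leaving the caller to drop or convert such columns on purpose. The conversion step assumes the object column really holds numbers.

```python
import numpy as np
import pandas as pd

def strict_weekly_sum(df):
    """Hypothetical helper: refuse non-numeric columns instead of dropping them."""
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")
    return df.resample("W").sum()

idx = pd.date_range("2016-01-01", periods=100, freq="D")
df = pd.DataFrame(
    {"z": np.zeros(len(idx)), "obj": np.zeros(len(idx), dtype=object)},
    index=idx,
)

try:
    strict_weekly_sum(df)
except ValueError as exc:
    print(exc)  # names the object column rather than losing it

# Convert the object column explicitly, then resample.
df["obj"] = pd.to_numeric(df["obj"])
result = strict_weekly_sum(df)
print(result.shape)
```

Dropping instead of converting would be `df.select_dtypes(include="number")` before the call.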

Re-reading the documentation, I see that you're careful not to encourage users to think of DataFrame as just an aligned list of Series. I need to think more about that.

jreback commented on Aug 31, 2016

Contributor

http://pandas.pydata.org/pandas-docs/stable/groupby.html#automatic-exclusion-of-nuisance-columns

this is standard practice on groupby as well (resample is just a time based grouper)

your view is incorrect and actually non obvious

join of op on Series only makes sense if the shape of the result is consistent

think about what a resample / groupby is actually doing when reducing and it will become clear
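A small sketch of the "time based grouper" remark, using an assumed two-week numeric frame: resampling to weekly bins and grouping by `pd.Grouper(freq="W")` reduce the same bins, and each bin collapses many input rows into a single output row.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric-only frame covering 14 days.
idx = pd.date_range("2016-01-01", periods=14, freq="D")
df = pd.DataFrame({"z": np.arange(14.0)}, index=idx)

by_resample = df.resample("W").sum()
by_groupby = df.groupby(pd.Grouper(freq="W")).sum()

# 14 daily rows reduce to one row per weekly bin.
print(len(df), "->", len(by_resample))
print(by_resample.equals(by_groupby))
```

Because every bin must yield exactly one value per column, a reduction such as sum or mean needs a column type for which that value is well defined.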
