Closed
Description
Code Sample, a copy-pastable example if possible
import numpy
import pandas as pd
idx = pd.date_range('1/1/2016', periods=100, freq='d')
z = pd.DataFrame({"z": numpy.zeros(len(idx))}, index=idx)
obj = pd.DataFrame({"obj": numpy.zeros(len(idx), dtype=numpy.dtype('O'))}, index=idx)
nan = pd.DataFrame({"nan": numpy.zeros(len(idx)) + float('nan')}, index=idx)
objnan = pd.DataFrame([], columns=['objnan'], index=idx[0:0]) # dtype('O') nan after join
df_list = [z, obj, nan, objnan]
def r1w(x):
    return x.resample('1w').sum()
resample_join_result = r1w(pd.DataFrame().join(df_list, how='outer'))
print(resample_join_result.shape) # (15, 2) --- I thought this should be (15, 4)
join_resample_result = pd.DataFrame().join([r1w(x) for x in df_list], how='outer')
print(join_resample_result.shape) # (15, 4)
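# Assumption: join_result, undefined in the original snippet, is the outer join before resampling
join_result = pd.DataFrame().join(df_list, how='outer')
for columnname in join_resample_result.columns:
    if columnname not in resample_join_result.columns:
        print("DataFrame.resample missing column: " + str(columnname) + " (" + str(join_result[columnname].dtype) + ")")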
Expected Output
I would have expected resample_join_result to have all four columns and be the same as join_resample_result, but they are not the same. It seems pandas.DataFrame.resample drops dtype('O') (object) columns, while pandas.Series.resample converts those columns into numeric dtypes.
output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: 0.2.0
Activity
jreback commented on Aug 24, 2016
xref #12537
For upsampling this would work, but how would you do downsampling? You could maybe do max/min/first/last, but other functions like sum/mean/var won't work at all because it's impossible to select a correct row for the non-numerics. So better to simply exclude. groupby does the same (except for sum, which is weird because it works on strings).
rcyeh commented on Aug 31, 2016
Wow, thanks for your quick reply. I'm not sure I understand it all, but I think I will after I read it a few more times.
As a newbie, I am very surprised that a DataFrame method could ever return a different result than an index-based DataFrame.join of [df[x].method() for x in df], where the identically-named Series method has been applied to each column of the DataFrame. I haven't used pandas enough to understand why this should be. Based on the cross-referenced issue #12537 and the comment about groupby, I am guessing that resample is not the only case that will exhibit this. Is there a way I can guess when that might occur?
I'd rather have an explicit NaN or a ValueError exception due to an unsupported data type instead of a silent drop. This may be related to issues of control over the treatment of NaN. The pandas default (exclude/ignore) is usually pleasantly convenient, but there are times when I really do want NaN to propagate (and I am fine with being told to go away and use numpy). I can always explicitly drop non-numeric columns.
Re-reading the documentation, I see that you're careful not to encourage users to think of DataFrame as just an aligned list of Series. I need to think more about that.
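One way to make the handling of non-numeric columns explicit, rather than relying on whatever the installed pandas version does by default, is to inspect dtypes before resampling. The following is only a sketch: the frame and the column names z, obj, and label are made up for illustration, and it assumes the object column really holds numbers that pd.to_numeric can convert.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 14 daily rows starting on Friday 2016-01-01
idx = pd.date_range("2016-01-01", periods=14, freq="D")
df = pd.DataFrame(
    {
        "z": np.ones(len(idx)),                  # plain float column
        "obj": np.ones(len(idx), dtype=object),  # numbers stored as object dtype
        "label": ["a"] * len(idx),               # genuinely non-numeric
    },
    index=idx,
)

# List the columns that a numeric-only reduction could not handle as-is
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Convert the object column that actually contains numbers
fixed = df.copy()
fixed["obj"] = pd.to_numeric(fixed["obj"])

# Drop whatever is still non-numeric on purpose, then downsample to weeks
numeric_only = fixed.select_dtypes(include="number")
weekly = numeric_only.resample("W").sum()
print(weekly)
```

With this approach, the decision about which columns survive is written in the code, so the result does not depend on silent exclusion.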
jreback commented on Aug 31, 2016
http://pandas.pydata.org/pandas-docs/stable/groupby.html#automatic-exclusion-of-nuisance-columns
This is standard practice on groupby as well (resample is just a time-based grouper).
Your view is incorrect and actually non-obvious: a join of an op on Series only makes sense if the shape of the result is consistent.
Think about what a resample / groupby is actually doing when reducing and it will become clear.
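To make the point about reducing concrete, here is a small sketch (not taken from the thread) that downsamples a frame with one numeric and one string column, choosing a reduction per column. The frame and the column names value and label are hypothetical. A sum is meaningful for the numbers, while for the strings only a row-selecting reduction such as first or last is well defined.

```python
import pandas as pd

# Hypothetical data: 14 daily rows starting on Monday 2016-01-04,
# so the rows fall into exactly two Monday-to-Sunday weeks
idx = pd.date_range("2016-01-04", periods=14, freq="D")
df = pd.DataFrame(
    {"value": range(14), "label": list("abcdefghijklmn")},
    index=idx,
)

# Each weekly bin collapses seven rows into one, so every column
# needs a rule for producing a single value from seven
weekly = df.resample("W").agg({"value": "sum", "label": "last"})
print(weekly)
```

Spelling out the reduction for each column keeps the output shape consistent across columns, which is the condition the comment above describes.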