Closed
Description
I tried pickling a very large dataframe (20GB or so) and the write to disk succeeded, but when I try to read it back, it fails with: ValueError: buffer size does not match array size
Now I did a bit of research and found the following:
http://bugs.python.org/issue13555
I am thinking this is a numpy/python issue, but it causes me quite a bit of pain when I want to back up a dataframe that took a long time to join together and keep all the dtypes (namely which columns are datetimes). Perhaps a solution would be a csv file that stores the dtypes somewhere (otherwise I'll have to figure out which columns are serialized dates). Any workarounds would be appreciated.
Activity
wesm commented on Jan 17, 2013
You should also try using HDFStore (requires PyTables)
jreback commented on Jan 17, 2013
make sure you are on 0.10.1-dev
appending to a table (store.append) is queryable (and preserves dtypes)
a plain put (store['key'] = df) preserves dtypes, but is not queryable (though it will write much faster)
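The snippets this comment refers to appear to have been lost in extraction; below is a minimal sketch of the two storage modes being contrasted, assuming an existing DataFrame (the filename and key names are purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5),
                   'b': pd.date_range('2013-01-01', periods=5)})

store = pd.HDFStore('store.h5')  # illustrative filename

# Table format (append): queryable, preserves dtypes, slower to write
store.append('df_table', df)

# Fixed format (put / store['key'] = df): preserves dtypes, not queryable, writes much faster
store.put('df_fixed', df)

store.close()
```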
daggre-gmu commented on Jan 17, 2013
I was trying to use HDFStore and my workhorse machine somehow didn't get it compiled in correctly. I can import PyTables, but pandas doesn't know about it... I'll dig in on that as my workaround though.
jostheim commented on Jan 18, 2013
I was posting under the wrong account before (I am the person who started this issue), but I got HDFStore working and immediately ran into:
File "hdf5Extension.pyx", line 884, in tables.hdf5Extension.Array._createArray (tables/hdf5Extension.c:8498)
tables.exceptions.HDF5ExtError: Problems creating the Array.
That was on 0.10.0. I tried compiling 0.10.1.dev, but I couldn't even import pandas:
So I am still stuck.
jreback commented on Jan 18, 2013
I've heard building it yourself is a little tough on a Mac.
did you try this: http://ericjang.tumblr.com/post/25096909713/annoying-pytables-build-on-mac-osx-10-7
from your error it looks like pandas and PyTables were built against different numpy versions
jostheim commented on Jan 18, 2013
I was able to save smaller dataframes to an HDF5 store with 0.10.0, so that works. Is there any reason to expect 0.10.1.dev will allow me to save larger ones?
That problem installing 0.10.1.dev actually screwed up my entire numpy install, and I had to re-install from scratch, so I'd rather not try that again.
I suppose I can just write out a csv file myself with the dtypes in a header row and cast the columns myself.
jreback commented on Jan 18, 2013
It depends on what dtypes you are trying to save. Strings are broken in 0.10.0 (they work, just very slowly). Also, it depends on whether you need query capability (i.e. you want to save as a table), which I'd recommend because strings and other data types are stored more efficiently.
can you post a df.get_dtype_counts() on your frame?
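For anyone reproducing this on a newer pandas: get_dtype_counts was removed in later releases; a small sketch of the equivalent check (the frame here is purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': np.random.randn(3),
    's': ['a', 'b', 'c'],
    'when': pd.to_datetime(['2013-01-01', '2013-01-02', None]),
})

# df.get_dtype_counts() on pandas of this era; on current pandas the same
# per-dtype breakdown comes from:
print(df.dtypes.value_counts())
```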
jreback commented on Jan 18, 2013
are you storing as a put or an append? are you anticipating reading the entire frame into memory for operations later on?
jostheim commented on Jan 18, 2013
I'll add that call, though it takes hours to get to the dataframe I want to write out (hence me wanting to save it).
I am storing as a put currently: store['blah'] = df, which the docs say is a put. I expect to pull the entire dataframe out to operate on it; this is truly just a way to store intermediate processing. The dataframe definitely has strings in it.
In all likelihood I'll simply write my own serializer that records which columns are datetimes so I can specify those columns in pd.read_csv (for my data I have to specify the datetime columns explicitly in order to get them parsed). None of this would matter if I could get pandas to set the dtype on a datetime column correctly (it is always object for me, despite the fact that I successfully parse the date). Perhaps I'll open another issue for the datetime columns to get some separate advice.
I just want to say, since I sound like I am complaining: I love pandas. It has been a huge help in getting to where I got with this data so quickly.
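A hedged sketch of the workaround jostheim describes above (the helper names and sidecar file are illustrative, not from the thread): write the dtypes alongside the csv, then pass the datetime columns back to pd.read_csv via parse_dates.

```python
import json
import pandas as pd

def save_with_dtypes(df, csv_path, dtypes_path):
    # Write the data plus a sidecar file recording each column's dtype
    df.to_csv(csv_path, index=False)
    with open(dtypes_path, 'w') as f:
        json.dump({col: str(dtype) for col, dtype in df.dtypes.items()}, f)

def load_with_dtypes(csv_path, dtypes_path):
    with open(dtypes_path) as f:
        dtypes = json.load(f)
    # Columns recorded as datetimes are parsed explicitly on read
    date_cols = [col for col, dt in dtypes.items() if dt.startswith('datetime')]
    return pd.read_csv(csv_path, parse_dates=date_cols)
```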
jreback commented on Jan 18, 2013
object type is bad! definitely try to convert
just a general recommendation: definitely try to split up frames and store them in separate hdf files
easier to debug and inspect things. especially when I am doing something new, I make lots of intermediate data steps (where I save data); you can always combine them later.
try df.convert_objects() to convert your dates. Take a look at this question as well: http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190
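A minimal sketch of the conversion being suggested. convert_objects was deprecated and later removed; on current pandas the per-column equivalent is pd.to_datetime with errors='coerce' (the frame below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'when': ['2013-01-18', 'not a date', None]})  # object dtype

# pandas of this era: df = df.convert_objects()
# current pandas: coerce the column explicitly; unparseable values become NaT
df['when'] = pd.to_datetime(df['when'], errors='coerce')
print(df.dtypes)  # when    datetime64[ns]
```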
jreback commented on Jan 18, 2013
fyi... the reason many use hdf5 is just raw speed. e.g. a random 2.5GB file (1.5M rows, 50 columns or so, all floats), which is actually a panel stored as a table and so far slower than a raw df, took about 30s. In addition, PyTables supports compression and uses multiple cores when reading (and I'm just a fan of PyTables!)
jostheim commented on Jan 18, 2013
I tried convert_objects with no success. I don't know if this is the problem, but the columns with datetimes have missing values which I am filling with np.nan (when I cannot parse the date; I've tried None too). This may make it look like a "mixed" type column, which it really isn't. I just couldn't figure out what to fill in for the missing values that would connote missing but still allow proper typing. Same situation, I think, for the string columns: the missing values are represented as np.nan, which is a float, so pandas believes they are mixed types.
Do I have this right? Are there any suggestions for handling this?
PyTables looks amazing. Unfortunately right now I am just trying to get something done, so I haven't had time to experiment, but I am glad I got it all installed so I can when I have more time.
jreback commented on Jan 18, 2013
try this for creation (works in 0.10.0 i think)....0.10.1 has a little better handling
np.nan assignment in a datetime series will automatically get you an object type with no hope of getting out
(though in 0.10.1, if you have a datetime64[ns] column and assign an np.nan, it will work)
strings have much better support in Table objects (e.g. pass table=True to put, or use append), as they are represented as individual columns and nan conversion is done (see the docs at http://pandas.pydata.org/pandas-docs/dev/io.html#storing-mixed-types-in-a-table, which also have an example of a NaT). Some of this might work in 0.10.0, but it is best to use 0.10.1-dev; several bugs specifically related to nan handling were fixed there.
np.nan is correct for strings... it's just that PyTables (in 0.10.0) doesn't deal well with this...
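A hedged sketch of the pattern described above, on a current pandas: use NaT rather than np.nan for missing datetimes so the column stays datetime64[ns], and store as a table so string columns and nan conversion are handled (the filename and key are illustrative; on 0.10.x the table form is put(..., table=True) or append, as noted above).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'when': pd.to_datetime(['2013-01-18', None]),  # missing value becomes NaT; dtype stays datetime64[ns]
    'name': ['foo', np.nan],                        # np.nan is fine for a missing string
    'value': [1.5, 2.5],
})

store = pd.HDFStore('intermediate.h5')  # illustrative filename
store.append('df', df)                  # table format: strings stored as columns, nan/NaT handled
store.close()
```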
jostheim commented on Jan 18, 2013
And I just found this: #2595
So I see this is an issue... I'll try to install 0.10.1 again sometime today and cross my fingers that numpy remains :)
jreback commented on Jan 18, 2013
if you want to hack it out: there are no cython changes in this, so you could just update the python code in order to test it (it's not that much)