Closed
Description
I tried pickling a very large dataframe (20GB or so) and the write to disk succeeded, but when I try to read it back, it fails with: ValueError: buffer size does not match array size
Now I did a bit of research and found the following:
http://bugs.python.org/issue13555
I am thinking this is a numpy/python issue, but it causes me quite a bit of pain when I want to back up a dataframe that took a long time to join together and keep all the dtypes (namely which columns are datetimes). Perhaps a solution would be a csv file that stores the dtypes somewhere (otherwise I'll have to figure out which columns are serialized dates). Any workarounds would be appreciated.
Activity
wesm commented on Jan 17, 2013
You should also try using HDFStore (requires PyTables)
jreback commented on Jan 17, 2013
make sure you are on 0.10.1-dev
appending to a table (store.append) is queryable (and preserves dtypes)
a plain put (store['key'] = df) preserves dtypes, but is not queryable (though it will write much faster)
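The snippets this comment refers to appear to have been lost in extraction; below is a minimal sketch of the two storage modes being contrasted, assuming an existing DataFrame (the filename and key names are purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5),
                   'b': pd.date_range('2013-01-01', periods=5)})

store = pd.HDFStore('store.h5')  # illustrative filename

# Table format (append): queryable, preserves dtypes, slower to write
store.append('df_table', df)

# Fixed format (put / store['key'] = df): preserves dtypes, not queryable, writes much faster
store.put('df_fixed', df)

store.close()
```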
daggre-gmu commented on Jan 17, 2013
I was trying to use HDFStore and my workhorse machine somehow didn't get it compiled in correctly. I can import PyTables, but pandas doesn't know about it... I'll dig in on that as my workaround though.
jostheim commented on Jan 18, 2013
I was posting under the wrong account before (I am the person who started this issue), but I got HDFStore working and immediately ran into:
File "hdf5Extension.pyx", line 884, in tables.hdf5Extension.Array._createArray (tables/hdf5Extension.c:8498)
tables.exceptions.HDF5ExtError: Problems creating the Array.
That was on 0.10.0. I tried compiling 0.10.1.dev, but I couldn't even import pandas:
So I am still stuck.
jreback commented on Jan 18, 2013
I've heard building it yourself is a little tough on a Mac.
did you try this: http://ericjang.tumblr.com/post/25096909713/annoying-pytables-build-on-mac-osx-10-7
from your error it looks like pandas and PyTables were built against different numpy versions
jostheim commented on Jan 18, 2013
I was able to save smaller dataframes to an HDF5 store with 0.10.0, so that works. Is there any reason to expect 0.10.1.dev will allow me to save larger ones?
That problem installing 0.10.1.dev actually screwed up my entire numpy install, and I had to re-install from scratch, so I'd rather not try that again.
I suppose I can just write out a csv file myself with the dtypes in a header row and cast the columns myself.
jreback commented on Jan 18, 2013
It depends on what dtypes you are trying to save. Strings are broken in 0.10.0 (they work, just very slowly). Also, it depends on whether you need query capability (i.e. you want to save as a table), which I'd recommend because strings and other data types are stored more efficiently.
can you post a df.get_dtype_counts() on your frame?
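For anyone reproducing this on a newer pandas: get_dtype_counts was removed in later releases; a small sketch of the equivalent check (the frame here is purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': np.random.randn(3),
    's': ['a', 'b', 'c'],
    'when': pd.to_datetime(['2013-01-01', '2013-01-02', None]),
})

# df.get_dtype_counts() on pandas of this era; on current pandas the same
# per-dtype breakdown comes from:
print(df.dtypes.value_counts())
```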
jreback commented on Jan 18, 2013
are you storing as a put or an append? are you anticipating reading the entire frame into memory for operations later on?
jostheim commented on Jan 18, 2013
I'll add that call, though it takes hours to get to the dataframe I want to write out (hence me wanting to save it).
I am storing as a put currently: store['blah'] = df, which the docs say is a put. I expect to pull the entire dataframe out to operate on it; this is truly just a way to store intermediate processing. The dataframe definitely has strings in it.
In all likelihood I'll simply write my own serializer that records which columns are datetimes so I can specify those columns in pd.read_csv (for my data I have to specify the datetime columns explicitly in order to get them parsed). None of this would matter if I could get pandas to set the dtype on a datetime column correctly (it is always object for me, despite the fact that I successfully parse the date). Perhaps I'll open another issue for the datetime columns to get some separate advice.
I just want to say, since I sound like I am complaining: I love pandas. It has been a huge help in getting to where I got with this data so quickly.
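A hedged sketch of the workaround jostheim describes above (the helper names and sidecar file are illustrative, not from the thread): write the dtypes alongside the csv, then pass the datetime columns back to pd.read_csv via parse_dates.

```python
import json
import pandas as pd

def save_with_dtypes(df, csv_path, dtypes_path):
    # Write the data plus a sidecar file recording each column's dtype
    df.to_csv(csv_path, index=False)
    with open(dtypes_path, 'w') as f:
        json.dump({col: str(dtype) for col, dtype in df.dtypes.items()}, f)

def load_with_dtypes(csv_path, dtypes_path):
    with open(dtypes_path) as f:
        dtypes = json.load(f)
    # Columns recorded as datetimes are parsed explicitly on read
    date_cols = [col for col, dt in dtypes.items() if dt.startswith('datetime')]
    return pd.read_csv(csv_path, parse_dates=date_cols)
```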
jreback commented on Jan 18, 2013
object type is bad! definitely try to convert
just a general recommendation: definitely try to split up frames and store them in separate hdf files
easier to debug and inspect things. especially when I am doing something new, I make lots of intermediate data steps (where I save data); you can always combine them later.
try df.convert_objects() to convert your dates. Take a look at this question as well: http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190
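A minimal sketch of the conversion being suggested. convert_objects was deprecated and later removed; on current pandas the per-column equivalent is pd.to_datetime with errors='coerce' (the frame below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'when': ['2013-01-18', 'not a date', None]})  # object dtype

# pandas of this era: df = df.convert_objects()
# current pandas: coerce the column explicitly; unparseable values become NaT
df['when'] = pd.to_datetime(df['when'], errors='coerce')
print(df.dtypes)  # when    datetime64[ns]
```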
jreback commented on Jan 18, 2013
fyi... the reason many use hdf5 is just raw speed. e.g. a random 2.5GB file (1.5M rows, 50 columns or so, all floats), which is actually a panel stored as a table and so far slower than a raw df, took about 30s. In addition, PyTables supports compression and uses multiple cores when reading (and I'm just a fan of PyTables!)
jostheim commented on Jan 18, 2013
I tried convert_objects with no success. I don't know if this is the problem, but the columns with datetimes have missing values which I am filling with np.nan (when I cannot parse the date; I've tried None too). This may make it look like a "mixed" type column, which it really isn't. I just couldn't figure out what to fill in for the missing values that would connote missing but still allow proper typing. Same situation, I think, for the string columns: the missing values are represented as np.nan, which is a float, so pandas believes they are mixed types.
Do I have this right? Are there any suggestions for handling this?
PyTables looks amazing. Unfortunately right now I am just trying to get something done, so I haven't had time to experiment, but I am glad I got it all installed so I can when I have more time.
jreback commented on Jan 18, 2013
try this for creation (works in 0.10.0 i think)....0.10.1 has a little better handling
np.nan assignment in a datetime series will automatically get you an object type with no hope of getting out
(though in 0.10.1, if you have a datetime64[ns] column and assign an np.nan, it will work)
strings have much better support in Table objects (e.g. pass table=True to put, or use append), as they are represented as individual columns and nan conversion is done (see the docs at http://pandas.pydata.org/pandas-docs/dev/io.html#storing-mixed-types-in-a-table, which also have an example of a NaT). Some of this might work in 0.10.0, but it is best to use 0.10.1-dev; several bugs specifically related to nan handling were fixed there.
np.nan is correct for strings... it's just that PyTables (in 0.10.0) doesn't deal well with this...
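A hedged sketch of the pattern described above, on a current pandas: use NaT rather than np.nan for missing datetimes so the column stays datetime64[ns], and store as a table so string columns and nan conversion are handled (the filename and key are illustrative; on 0.10.x the table form is put(..., table=True) or append, as noted above).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'when': pd.to_datetime(['2013-01-18', None]),  # missing value becomes NaT; dtype stays datetime64[ns]
    'name': ['foo', np.nan],                        # np.nan is fine for a missing string
    'value': [1.5, 2.5],
})

store = pd.HDFStore('intermediate.h5')  # illustrative filename
store.append('df', df)                  # table format: strings stored as columns, nan/NaT handled
store.close()
```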
jostheim commented on Jan 18, 2013
And I just found this: #2595
So I see this is an issue... I'll try to install 0.10.1 again sometime today and cross my fingers that numpy remains :)
jreback commented on Jan 18, 2013
if you want to hack it out: there are no cython changes in this, so you could just update the python code in order to test it (it's not that much)