It seems like the overall idea @dneise wants to accomplish is to dynamically create and update a set of categories based on the data from multiple dataframes.
We haven't fully explored the codebase yet, but from a cursory exploration, it seems that there are two ways to potentially accomplish this:
1. Extend Categorical with a new method that takes the same inputs as the constructor, similar to pandas.Categorical.from_codes, but accepts a potentially incomplete set of categories that can be extended later.
2. Revise the spec of pandas.factorize so that it takes an additional optional parameter: a set of existing mappings that is extended whenever new values are encountered in a new dataframe.
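As a rough sketch of what the two ideas above might look like with the pandas API that exists today: `Categorical.add_categories` and `pandas.api.types.union_categoricals` are real pandas calls, while `factorize_with_mapping` is a hypothetical helper written only for illustration and is not part of pandas. The process names are made-up sample data.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical process names taken from two separately parsed log files.
chunk_a = pd.Categorical(["daq", "trigger", "daq"])
chunk_b = pd.Categorical(["trigger", "monitor", "daq"])

# Idea 1 with existing API: grow the category set of an existing Categorical.
new_names = [c for c in chunk_b.categories if c not in chunk_a.categories]
chunk_a_extended = chunk_a.add_categories(new_names)

# Existing API for combining categoricals whose category sets differ.
combined = union_categoricals([chunk_a, chunk_b])


def factorize_with_mapping(values, mapping):
    """Hypothetical helper (not a pandas function).

    Returns integer codes for `values` and extends `mapping` in place
    whenever a previously unseen value is encountered.
    """
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes


# Idea 2: one shared mapping that is carried from chunk to chunk.
mapping = {}
codes_a = factorize_with_mapping(["daq", "trigger", "daq"], mapping)
codes_b = factorize_with_mapping(["trigger", "monitor", "daq"], mapping)

# Rebuild a Categorical from the accumulated codes and categories.
rebuilt = pd.Categorical.from_codes(codes_a + codes_b, categories=list(mapping))
```

Because the mapping only ever grows, codes assigned to earlier chunks stay valid when later chunks introduce new names, which is the property an append-only store would need.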
Activity
jreback commented on Mar 2, 2016
this was discussed in #9927
would be ok with adding this as a sub-section in the Cookbook somewhere.
would you like to do a pull request? you can point to the SO post and include a short inline version.
dneise commented on Mar 2, 2016
Thanks for the quick reply.
I'm a physicist with no experience collaborating on a project as big as pandas. I would certainly like to gain some experience by improving the docs, but I need to learn how.
Also, my problem goes a small step further than the SO question. I am parsing log files (1.5k files with 100M lines in total; 12 GB for all files together) into a dataframe, so I can get some insight into our experiment. I am parsing the log files one by one and would like to append each one to a table in an HDF5 file. Part of each log message is the name of the process that created it, and I know there are far fewer process names than lines. So I thought using Categoricals would be feasible here. (It might just be another form of efficient string storage... I don't know.)
I have no way of knowing the complete set of categories in advance. From your SO answer, I learned how to create subsequent Categoricals using an explicit set of categories. But I have not yet understood how, or whether, I can append those subsequent Categoricals to a table in an HDF5 file.
(I should add an example here)
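To make the storage argument above concrete, here is a minimal sketch (not the poster's own example) comparing the memory footprint of a column of repeated process names stored as plain Python strings versus as a categorical. The process names and the line count are invented sample values, and exact byte counts will vary by pandas version and platform.

```python
import pandas as pd

# Hypothetical log data: many lines, only a few distinct process names.
process_names = ["daq", "trigger", "monitor", "slow_control"]
n_lines = 100_000
names = [process_names[i % len(process_names)] for i in range(n_lines)]

# The same column stored as Python string objects and as a categorical.
as_object = pd.Series(names, dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the memory held by the string objects themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The categorical stores each distinct name once plus a small integer code per row, so the saving grows with the ratio of lines to distinct names.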
jreback commented on Mar 3, 2016
docs for contributing are here
Doc: Adds example of categorical data for efficient storage and
Doc: Different example using categorical data type for efficient storage
Doc: Updated example of using categorical data type to save on storag…
ngirase10 commented on Nov 28, 2023
Is this still open? Interested in working on this — thanks!
jonathanho168 commented on Dec 1, 2023
It seems like the overall idea @dneise wants to accomplish is to dynamically create and update a set of categories based on the data from multiple dataframes.
We haven't fully explored the codebase yet, but from a cursory exploration, it seems that there are two ways to potentially accomplish this: