append a categorical with different categories to the existing #12509

@dneise

Description

I just ran into the same problem as the person asking this question on SO

http://stackoverflow.com/questions/29709918/pandas-and-category-replacement

Jeff gave an excellent answer, as usual; I believe he is a pandas developer as well?

So I was wondering whether something like his answer might be planned to become the default behaviour when appending Categoricals.
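For readers without the SO thread at hand, here is a minimal sketch of the situation, not a reproduction of the linked answer. It assumes a pandas version that ships `pandas.api.types.union_categoricals`; the process names are made up for illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two Categoricals whose category sets only partly overlap (made-up names).
a = pd.Categorical(["reader", "writer", "reader"])
b = pd.Categorical(["writer", "monitor"])

# Plain concatenation does not keep the categorical dtype when categories differ.
concatenated = pd.concat([pd.Series(a), pd.Series(b)], ignore_index=True)

# union_categoricals keeps the result categorical, using the union of categories.
combined = union_categoricals([a, b])

print(concatenated.dtype)
print(combined.categories)
```

The request in this issue amounts to asking whether the second behaviour could happen automatically on append.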

Activity

added this to the 0.18.1 milestone on Mar 2, 2016
jreback

jreback commented on Mar 2, 2016

@jreback
Contributor

this was discussed in #9927

would be ok with adding this as a sub-section in the Cookbook somewhere.

would you like to do a pull-request? you can point to the SO post and do a short inline version.

dneise

dneise commented on Mar 2, 2016

@dneise
Author

Thanks for the quick reply.

I'm a physicist with no experience collaborating on a project as big as pandas. I would certainly like to gain some experience by improving the docs, but I need to learn how.

Also, my problem goes a small step further than the SO question. I am parsing log files (1.5k files with 100M lines in total, 12GB altogether) into a dataframe so I can get some insight into our experiment. I parse the log files one by one and would like to append each one to a table in an HDF5 file. Part of each log message is the name of the process that created it, and I know there are far fewer process names than lines, so I thought using Categoricals would be feasible here. (It might just be another form of efficient string storage... I don't know.)

I have no way of knowing the complete set of categories in advance. From your SO answer, I learned how to create the subsequent Categoricals using an explicit set of categories, but I have not yet understood how, or whether, I can append those Categoricals to a table in an HDF5 file.

(I should add an example here)
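A rough sketch of the in-memory part of this workflow could look as follows. The chunk contents and the names `known`, `parts`, and `table` are hypothetical stand-ins for the parsed log files. The sketch deliberately stops before any HDF5 write, because whether an HDF5 table accepts such appends is exactly the open question here.

```python
import pandas as pd

# Stand-ins for the process-name column of two parsed log files.
chunks = [["daq", "trigger", "daq"], ["trigger", "monitor"]]

known = []  # running list of categories, extended as new names show up
parts = []
for names in chunks:
    for name in names:
        if name not in known:
            known.append(name)
    # Build each chunk with the categories known so far.
    parts.append(pd.Series(pd.Categorical(names, categories=known)))

# Earlier chunks lack the later categories, so align all of them to the
# final list before concatenating; identical categories keep the dtype.
aligned = [part.cat.set_categories(known) for part in parts]
table = pd.concat(aligned, ignore_index=True)

print(table.dtype)
print(list(table.cat.codes))
```

Because new names are only ever added at the end of `known`, the integer code of a given name stays the same from chunk to chunk.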

jreback

jreback commented on Mar 3, 2016

@jreback
Contributor

docs for contributing are here

modified the milestones: 0.18.1, 0.18.2 on Apr 25, 2016
modified the milestones: 0.19.0, Next Major Release on Sep 28, 2016
added a commit that references this issue on Jan 15, 2018
5c05768
added a commit that references this issue on Feb 3, 2018
428f9af
added 3 commits that reference this issue on Feb 18, 2018
360e8a1
fdc51c2
5c2b355
removed this from the Contributions Welcome milestone on Oct 13, 2022
ngirase10

ngirase10 commented on Nov 28, 2023

@ngirase10

Is this still open? Interested in working on this — thanks!

jonathanho168

jonathanho168 commented on Dec 1, 2023

@jonathanho168

It seems like the overall idea @dneise wants to accomplish is to dynamically create and update a set of categories based on the data from multiple dataframes.

We haven't fully explored the codebase yet, but from a cursory exploration, it seems that there are two ways to potentially accomplish this:

  1. Extend Categorical with a new method that takes in the same inputs as the constructor? Kind of like pandas.Categorical.from_codes but with a potentially incomplete set of categories, which can be added to later.
  2. Revise the spec of pandas.factorize, so that it takes in an additional optional parameter -- we want to pass in a set of mappings that can be added to if we encounter new data values in the new dataframe.
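To make the second idea more concrete, here is a sketch of what such an interface might feel like. `factorize_with_mapping` is a hypothetical helper written for illustration; it is not an existing pandas function, and `pandas.factorize` takes no such mapping argument as far as this sketch assumes. Only `pandas.Categorical.from_codes` is a real pandas call.

```python
import numpy as np
import pandas as pd


def factorize_with_mapping(values, mapping):
    """Hypothetical helper: encode values as integer codes.

    `mapping` (value -> code) is updated in place whenever an unseen
    value is encountered, so it can be reused for the next dataframe.
    """
    codes = np.empty(len(values), dtype=np.int64)
    for i, value in enumerate(values):
        # New values get the next free code; known values keep theirs.
        codes[i] = mapping.setdefault(value, len(mapping))
    return codes


mapping = {}
first = factorize_with_mapping(["daq", "trigger", "daq"], mapping)
second = factorize_with_mapping(["trigger", "monitor"], mapping)

# The accumulated mapping doubles as the category list (dicts keep insertion order).
cat = pd.Categorical.from_codes(
    np.concatenate([first, second]), categories=list(mapping)
)
print(cat)
```

The design choice here is to keep the codes as plain integers until the end and build the Categorical only once the category list is complete.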


        append a categorical with different categories to the existing · Issue #12509 · pandas-dev/pandas