
PERF: groupby with many empty groups memory blowup #30552

Description

jbrockmendel (Member)

Suppose we have a Categorical with many unused categories:

cat = pd.Categorical(range(24), categories=range(10**5))
df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})
gb = df.groupby(["A", "B", "C"])

>>> gb.size()  # memory balloons to 9+ GB before I kill it

There are only 24 rows in this DataFrame, so we shouldn't be creating millions of groups.

Without the Categorical, but just a large cross-product that implies many empty groups, this works fine:

df = pd.DataFrame({n: range(12) for n in range(8)})
gb = df.groupby(list(range(7)))
gb.size()  # <-- works fine
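
For scale, a back-of-the-envelope sketch of the group counts involved (the arithmetic is derived from the examples above, not from the original report):

import pandas as pd

cat = pd.Categorical(range(24), categories=range(10**5))
df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})

# With observed=False, the result index is the Cartesian product of the
# categorical's 10**5 categories with the 24 unique values of B and of C:
print(10**5 * 24 * 24)  # 57600000 potential groups for 24 rows

# With observed=True, only the 24 observed combinations are materialized:
print(df.groupby(["A", "B", "C"], observed=True).size().shape)  # (24,)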

Activity

jreback (Contributor) commented on Dec 29, 2019

this is the point of the observed keyword

WillAyd (Member) commented on Jan 2, 2020

So yeah, to second Jeff's comment above: is this still a problem with observed=True?

jbrockmendel (Member, Author) commented on Jan 2, 2020

Passing observed=True solves the problem. I'd like to add a warning or something for users who find themselves about to hit a MemoryError.

TomAugspurger (Contributor) commented on Jan 2, 2020

I don't recall a discussion about that.

I vaguely recall that this will be somewhat solved by having a DictEncodedArray that has a similar data model to Categorical, without the unobserved / fixed categories semantics.

WillAyd (Member) commented on Jan 2, 2020

Hmm, I think that is orthogonal. IIUC the memory blowup is because by default we are generating Cartesian products. I would rather just deprecate that and switch the default value for observed.

TomAugspurger (Contributor) commented on Jan 2, 2020

I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical.

jreback (Contributor) commented on Jan 3, 2020

I suppose we would show a PerformanceWarning if we detect this would happen, which is pretty cheap to do.

Might be worthwhile.
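
A minimal sketch of what such a check could look like (warn_if_many_empty_groups is a hypothetical helper, not pandas internals, and the threshold is arbitrary):

import warnings

import pandas as pd
from pandas.errors import PerformanceWarning

def warn_if_many_empty_groups(df, keys, threshold=10**6):
    # Upper bound on the number of result groups under observed=False:
    # the product of the level counts of each grouping key.
    n_groups = 1
    for key in keys:
        col = df[key]
        if isinstance(col.dtype, pd.CategoricalDtype):
            n_groups *= len(col.cat.categories)  # counts unused categories too
        else:
            n_groups *= col.nunique()
    if n_groups > max(threshold, len(df)):
        warnings.warn(
            f"grouping may create up to {n_groups:,} groups for {len(df)} rows; "
            "consider passing observed=True",
            PerformanceWarning,
        )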

paulrougieux commented on Jan 14, 2020

I cannot reproduce the issue above with cat = pd.Categorical(range(24), categories=range(10**5)) on pandas 0.25.3.

When upgrading from pandas 0.24.2 to 0.25.3, I had a memory issue with a groupby().agg() on a data frame with 100,000 rows and 11 columns. I used 8 grouping variables with a mix of categorical and character variables, and the grouping operation was using over 8 GB of memory.

Setting the argument observed=True:

df.groupby(index, observed=True)

fixed the memory issue.

Related Stack Overflow question: Pandas v 0.25 groupby with many columns gives memory error.

Maybe observed=True should be the default? At least beyond a certain ratio of observed to all possible combinations. When the number of observed combinations of categorical values is far lower than the number of all possible combinations of those values, it clearly doesn't make sense to use observed=False. Is there a discussion of why the default was set to observed=False?
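
For the example at the top of this issue, that ratio would be extreme (numbers computed from the issue's example, as a sketch):

observed = 24               # combinations actually present in the 24 rows
possible = 10**5 * 24 * 24  # dense product of all levels
print(observed / possible)  # 4.166666666666667e-07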

[24 remaining items collapsed]

corriebar (Contributor) commented on Feb 18, 2022

In recent months, we have had multiple issues due to this. We changed one column to categorical somewhere in our pipeline and got non-obvious errors in very different parts of the pipeline. From my side, also a strong ➕ for changing the default to observed=True.

As mentioned before and elsewhere in related issues, the behaviour imho is not very intuitive: SQL does not behave like this, nor does the R tidyverse with factors.
I was especially confused that having a single categorical grouper in a list of multiple groupers turns all groupers (behaviour-wise) categorical, i.e. it outputs the Cartesian product.
I think the only example I've seen mentioned so far of where this might be useful is survey data with e.g. Likert scales. I work with survey data, but we have different answer options for different questions: some are Likert scale, some are yes/no, etc. When using groupby to get counts per question per answer option, the Cartesian product is not really what you want; see the sketch below.
It is certainly not clear how to reasonably show empty groups when using multiple groupers of which only one is categorical, but to me observed=True as the default seems like a better option than making non-intuitive assumptions.
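
A small reproduction of the mixed-grouper behaviour described above (a sketch; the exact presentation of the empty groups depends on the pandas version):

import pandas as pd

df = pd.DataFrame({
    "question": pd.Categorical(["q1", "q2"], categories=["q1", "q2", "q3"]),
    "answer": ["yes", "no"],
})

# observed=False: every category of "question" crossed with every value of
# "answer" -- 6 groups, 4 of which were never observed.
print(df.groupby(["question", "answer"], observed=False).size())

# observed=True: only the 2 combinations actually present in the data.
print(df.groupby(["question", "answer"], observed=True).size())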

I also noticed that when grouping by a categorical index column, observed=True does not drop empty groups.

rhshadrach (Member) commented on Oct 4, 2022

I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical.

I definitely think this has some truth to it, but only partially. One reason to use categorical within groupby is to take advantage of observed=False. However another reason to use categorical generally is to save memory with e.g. string columns. I would hazard a guess that the latter is much more prevalent than the former, though I have nothing to back this up with.

I do find it surprising that changing from e.g. int to categorical alters the default groupby result, and since observed=False can result in memory issues, I think there is a good case to make observed=True the default. I'm +1 here.
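
To make that surprise concrete, a sketch of the same groupby before and after converting the key to categorical (the categories [1, 2, 3] are illustrative):

import pandas as pd

df = pd.DataFrame({"key": [1, 1, 2], "val": [10, 20, 30]})
print(df.groupby("key").sum())  # 2 rows: keys 1 and 2

df["key"] = df["key"].astype("category").cat.set_categories([1, 2, 3])
# observed=False now also emits the never-observed key 3 (filled with
# 0 or NaN depending on the aggregation):
print(df.groupby("key", observed=False).sum())  # 3 rows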

cottrell (Contributor) commented on Oct 5, 2022

I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical.

I definitely think this has some truth to it, but only partially. One reason to use categorical within groupby is to take advantage of observed=False. However another reason to use categorical generally is to save memory with e.g. string columns. I would hazard a guess that the latter is much more prevalent than the former, though I have nothing to back this up with.

I do find it surprising that changing from e.g. int to categorical alters the default groupby result, and since observed=False can result in memory issues, I think there is a good case to make observed=True the default. I'm +1 here.

Replace "save memory" with "bound memory by data size and instead of merely by product of arities of dimensions of categoricals". I don't think people are understanding the cursive of dimensionality issue here. This is like taking a sparse tensor of N dimensions and going .todense() by default and calling the previous step merely "memory saving".

paulrougieux commented on Oct 5, 2022
  • Many users in this thread are strong proponents of having observed=True as the default. Alternatively, they could also simply not use categorical variables at all when groupby() operations are needed.
  • Sample code and data from actual users of observed=False could help us understand how their work would be harmed by switching to a default of observed=True.
    • Maybe the authors of Quantipy or PandaSurvey can provide sample code and data? (These are returned by a Google search for "pandas survey data package", but have not been in active development for 4 and 8 years respectively, according to their latest commits.)

Other points to clarify the different use cases of categorical variables:

  1. A distinction has to be made between
    • the case where there are only one or two grouping variables and no risk of combinatorial explosion,
    • and the case with many grouping variables, where there is a risk of combinatorial explosion.
  2. As already mentioned by @corriebar, categorical variables do not behave like this in R.
  3. The different uses of categorical variables should also be clarified: to keep survey options even if there were no replies, to keep and reorder named variables in a graph (such as ordering country names by population size instead of alphabetically), as the result of using a partition column in a parquet file df.to_parquet(partition_cols="a"), and many other uses.
  4. The observed argument was switched from True to False in pandas version 0.25. Maybe this was buggy behaviour, as @TomAugspurger wrote at the beginning of this issue. Instead of specifying it at each function call, switching observed to True or False could be dealt with globally. I don't know if that makes sense, using for example an environment variable? [UPDATE] Such a switch would probably be a bad idea.
  5. There has been recent activity related to the observed=False argument in the changelog of version 1.5.1, in interaction with a bug when using dropna=False.

[UPDATE]
@jankatins wrote in pull request 35967 that this reminds him of the stringsAsFactors argument in R. A detailed history, "stringsAsFactors: An unauthorized biography", recounts:

"The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. [...] By default, ‘stringsAsFactors’ is set to TRUE."
"Most people I talk to today who use R are completely befuddled by the fact that ‘stringsAsFactors’ is set to TRUE by default. [...]"
"In the old days, when R was primarily being used by statisticians and statistical types, this setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by ‘factor’ vectors and so character columns got converted factor. [...]
Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to treat expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix." [...]
"Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use ‘factors’ for the modeling functions."

rhshadrach (Member) commented on Mar 7, 2023

@jbrockmendel:

Passing observed=True solves the problem. I'd like to add a warning or something for users who find themselves about to hit a MemoryError.

If observed=True were the default, would you still find it beneficial to have a warning here?

jbrockmendel (Member, Author) commented on Mar 7, 2023

I think changing the default would close this.

jseabold (Contributor) commented on Mar 7, 2023

I think changing the default would close this.

Indeed. I changed and deprecated the default in #35967 but you also have to merge the changes :)

Alexia-I commented on Jan 15, 2024

It seems this performance improvement could be backported to previous versions, given its severity? Also, the code change does not look too complex to backport.

rhshadrach (Member) commented on Jan 15, 2024

The improvement here was changing the default value of observed from False to True. We should not be backporting API changes.


Metadata


Labels: Categorical (Categorical Data Type), Groupby, Performance (Memory or execution speed performance)


Participants: @cottrell, @jseabold, @WillAyd, @jreback, @jorisvandenbossche
