Closed
Labels
Categorical (Categorical Data Type), Groupby, Performance (Memory or execution speed performance)
Description
Suppose we have a Categorical with many unused categories:
cat = pd.Categorical(range(24), categories=range(10**5))
df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})
gb = df.groupby(["A", "B", "C"])
gb.size()  # memory balloons to 9+ GB before I kill it
There are only 24 rows in this DataFrame, so we shouldn't be creating millions of groups.
Without the Categorical, but just a large cross-product that implies many empty groups, this works fine:
df = pd.DataFrame({n: range(12) for n in range(8)})
gb = df.groupby(list(range(7)))
gb.size() # <-- works fine
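To make the contrast concrete, here is a sketch of the same setup with the category space shrunk to 10**3 so the observed=False case stays tractable. With a categorical grouper, observed=False materializes the full Cartesian product of grouper values, while observed=True keeps only the combinations actually present:

```python
import pandas as pd

# Same shape as the report, but a smaller category space.
cat = pd.Categorical(range(24), categories=range(10**3))
df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})

# observed=False reindexes to the full product of grouper values:
# 1000 categories of A x 24 values of B x 24 values of C.
n_all = len(df.groupby(["A", "B", "C"], observed=False).size())

# observed=True keeps only the 24 combinations present in the data.
n_obs = len(df.groupby(["A", "B", "C"], observed=True).size())

print(n_all, n_obs)
```

With the original `categories=range(10**5)` the same product is 100 times larger, which is where the multi-gigabyte blowup comes from.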
Activity
jreback commented on Dec 29, 2019
This is the point of the observed keyword.
WillAyd commented on Jan 2, 2020
So yeah, to second Jeff's comment above: is this a problem with observed=True?
jbrockmendel commented on Jan 2, 2020
Passing observed=True solves the problem. I'd like to add a warning or something for users who find themselves about to hit a MemoryError.
WillAyd commented on Jan 2, 2020
TomAugspurger commented on Jan 2, 2020
I don't recall a discussion about that.
I vaguely recall that this will be somewhat solved by having a DictEncodedArray that has a similar data model to Categorical, without the unobserved / fixed-categories semantics.
WillAyd commented on Jan 2, 2020
Hmm, I think that is orthogonal. IIUC the memory blowup is because by default we are generating Cartesian products. Would rather just deprecate that and switch the default value for observed.
TomAugspurger commented on Jan 2, 2020
I think it's a good default for the original design semantics of Categorical. It's a bad default for the memory-saving aspect of Categorical.
jbrockmendel commented on Jan 3, 2020
jreback commented on Jan 3, 2020
I suppose we would show a PerformanceWarning if we detect this would happen, which is pretty cheap to do; might be worthwhile.
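Such a pre-check is indeed cheap: the group count that observed=False would materialize is just the product of per-key cardinalities, which can be computed without doing the groupby. A rough sketch of the idea (the function name, threshold, and wiring are hypothetical, not pandas API):

```python
import warnings

import pandas as pd


def warn_if_unobserved_blowup(df, keys, threshold=10**6):
    """Hypothetical pre-check: estimate how many groups observed=False
    would materialize, and warn when that dwarfs the data size."""
    n_groups = 1
    for key in keys:
        col = df[key]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # Every category counts toward the product, observed or not.
            n_groups *= len(col.cat.categories)
        else:
            # Non-categorical keys only contribute observed values.
            n_groups *= col.nunique()
    if n_groups > max(threshold, len(df)):
        warnings.warn(
            f"groupby with observed=False would materialize ~{n_groups:,} "
            f"groups for {len(df)} rows; consider passing observed=True",
            pd.errors.PerformanceWarning,
        )
    return n_groups
```

On the original example (`categories=range(10**5)` with two 24-value keys) this estimates 57,600,000 groups for 24 rows, which is exactly the situation a warning would catch.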
paulrougieux commented on Jan 14, 2020
I cannot reproduce the issue above with
cat = pd.Categorical(range(24), categories=range(10**5))
on pandas 0.25.3. When upgrading from pandas 0.24.2 to 0.25.3, I had a memory issue with a groupby().agg() on a data frame with 100,000 rows and 11 columns. I used 8 grouping variables with a mix of categorical and character variables, and the grouping operation was using over 8 GB of memory. Setting the argument observed=True fixed the memory issue.
Related Stack Overflow question: Pandas v 0.25 groupby with many columns gives memory error.
Maybe observed=True should be the default? At least beyond a certain ratio of observed / all possible combinations. When the observed combinations of categorical values are far fewer than all possible combinations of these categorical values, it is clear that it doesn't make sense to use observed=False. Is there a discussion on why the default was set to observed=False? Observed = True as suggested here pandas-dev/pandas#30552
[24 remaining items not shown]
corriebar commented on Feb 18, 2022
In the last months, we had multiple issues due to this. We changed one column to categorical somewhere in our pipeline and got non-obvious errors in very different parts of our pipeline. From my side, also a strong ➕ for changing the default to observed=True.
As mentioned before and elsewhere in related issues, the behaviour imho is not very intuitive: SQL does not behave like this, nor does the R tidyverse with factors.
I was especially confused that having a single categorical grouper in a list of multiple groupers turns all groupers (behaviour-wise) categorical, i.e. it outputs the Cartesian product.
I think the only example I've seen mentioned so far where this might be useful would be survey data with e.g. Likert scales. I work with survey data, but we have different answer options for different questions: some are Likert scale, some are yes/no, etc. When using groupby to get counts per question per answer option, the Cartesian product is not really what you want.
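A small sketch of the survey situation just described (the question/answer data is made up for illustration): with one shared set of answer categories, observed=False pads the counts with impossible question/answer pairs.

```python
import pandas as pd

# Hypothetical survey data: two questions with disjoint answer options,
# both stored as categoricals over the union of all options.
answers = pd.DataFrame({
    "question": pd.Categorical(["q1", "q1", "q2", "q2"]),
    "answer": pd.Categorical(
        ["agree", "disagree", "yes", "no"],
        categories=["agree", "disagree", "yes", "no"],
    ),
})

# observed=False yields 2 questions x 4 answers = 8 counts, including
# impossible pairs such as ("q1", "yes"); observed=True keeps the 4
# combinations that actually occur.
padded = answers.groupby(["question", "answer"], observed=False).size()
actual = answers.groupby(["question", "answer"], observed=True).size()
print(len(padded), len(actual))
```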
It is certainly not clear how to reasonably show empty groups when using multiple groupers and only one is categorical, but to me using observed=True as the default seems like the better option than making non-intuitive assumptions. I also noticed that when grouping by a categorical index column, observed=True does not drop empty groups.
rhshadrach commented on Oct 4, 2022
I definitely think this has some truth to it, but only partially. One reason to use categorical within groupby is to take advantage of observed=False. However, another reason to use categorical generally is to save memory with e.g. string columns. I would hazard a guess that the latter is much more prevalent than the former, though I have nothing to back this up with.
I do find it surprising that changing from e.g. int to categorical alters the default groupby result, and since observed=False can result in memory issues, I think there is a good case to make observed=True the default. I'm +1 here.
cottrell commented on Oct 5, 2022
Replace "save memory" with "bound memory by data size instead of merely by the product of arities of the dimensions of the categoricals". I don't think people are understanding the curse-of-dimensionality issue here. This is like taking a sparse tensor of N dimensions and going .todense() by default and calling the previous step merely "memory saving".
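The memory-saving use of categoricals mentioned above is easy to see directly; a minimal sketch comparing deep memory usage of a repetitive string column before and after converting with astype("category"):

```python
import pandas as pd

# A repetitive string column: as a categorical, each distinct string is
# stored once plus small integer codes, so deep memory usage drops.
s = pd.Series(["alpha", "beta", "gamma"] * 10_000)
as_cat = s.astype("category")

print(s.memory_usage(deep=True), as_cat.memory_usage(deep=True))
```

This is the use case that has nothing to do with fixed-category semantics, which is why the observed=False default surprises users who adopted categoricals purely to shrink string columns.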
paulrougieux commented on Oct 5, 2022
- … observed=True as the default. Alternatively, they could also not use categorical variables at all when groupby() operations are needed.
- … observed=False could help understand how their work would be damaged by switching to a default of observed=True.
Other points to clarify different use cases of categorical variables:
- df.to_parquet(partition_cols="a"), many other uses come here ….
- The observed argument was switched from True to False in pandas version 0.25. Maybe this was a buggy behaviour, as @TomAugspurger wrote at the beginning of this issue. Instead of specifying it at each function call, switching observed to True or False could be dealt with globally. I don't know if it makes sense, using for example an environment variable? [UPDATE] Such a switch would probably be a bad idea.
- … observed=False argument in the changelog of version 1.5.1 in interaction with a bug when using dropna=False. [UPDATE]
@jankatins wrote in pull request 35967 that this reminds him of the stringsAsFactors argument in R. A detailed story: "stringsAsFactors: an unauthorized biography".
Linked issue: observed=True in DataFrame.groupby #43999
rhshadrach commented on Mar 7, 2023
@jbrockmendel
If observed=True were the default, would you still find it beneficial to have a warning here?
jbrockmendel commented on Mar 7, 2023
I think changing the default would close this.
jseabold commented on Mar 7, 2023
Indeed. I changed and deprecated the default in #35967 but you also have to merge the changes :)
Alexia-I commented on Jan 15, 2024
It seems that this performance improvement could be backported to previous versions due to its severity? Also, the code change seems not too complex to backport.
rhshadrach commented on Jan 15, 2024
The improvement here was changing the default value of observed from False to True. We should not be backporting API changes.