ENH: Add Support for GroupBy Numeric Operations #20060

Open
@WillAyd

xref some of the conversation in #20024. Right now the following is possible:

In [16]: df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=['key', 'val'])
In [18]: df - df.mean()
Out[18]: 
   key  val
0 -0.5 -1.5
1 -0.5 -0.5
2  0.5  0.5
3  0.5  1.5


In [19]: df['val'] - df['val'].mean()
Out[19]: 
0   -1.5
1   -0.5
2    0.5
3    1.5
Name: val, dtype: float64

But trying to do something similar with grouped data does not work:

In [20]: df.groupby('key') - df.groupby('key').mean()
        ...
ValueError: Unable to coerce to Series, length must be 1: given 2

I am proposing that we update the GroupBy class to allow numerical operations between a GroupBy object and the result of an aggregation or transformation of that same object. Note that this is possible today through a much more verbose and hackish approach:

In [23]: df.groupby('key').shift(0) - df.groupby('key').transform('mean')
Out[23]: 
   val
0 -0.5
1  0.5
2 -0.5
3  0.5
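
For comparison, a slightly less hackish form of today's workaround drops the `shift(0)` call and subtracts the transformed group means from the original column directly. This is only a sketch of current behavior using the same example frame; the variable names `group_means` and `demeaned` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=["key", "val"])

# transform("mean") returns one value per original row, indexed like df,
# so it lines up with the ungrouped column without an extra alignment step.
group_means = df.groupby("key")["val"].transform("mean")
demeaned = df["val"] - group_means

print(demeaned.tolist())
```

This still requires spelling out the grouping once and the value column twice, which is the verbosity the proposal aims to remove.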

The Series / DataFrame operations are all added via add_special_arithmetic_methods, with their implementations defined in ops.py. We could leverage a similar mechanism for GroupBy.
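
To make the idea concrete, here is a rough sketch of what attaching operators in a loop could look like. Everything in it is hypothetical: `GroupByArith`, `_binary_op` and `_add_arithmetic_methods` are made-up names, the wrapper is a stand-in rather than the real GroupBy class, and it assumes a single key column and a right operand that is an aggregation result indexed by group key.

```python
import operator

import pandas as pd


class GroupByArith:
    """Hypothetical stand-in for a GroupBy that supports arithmetic."""

    def __init__(self, obj, by):
        self._keys = obj[by]                  # group label for every row
        self._values = obj.drop(columns=by)   # the columns being operated on

    def _binary_op(self, op, other):
        # Assumption: `other` has one row per group, indexed by group key.
        # Look up each row's group, then restore the original row index.
        aligned = other.reindex(self._keys.to_numpy())
        aligned.index = self._values.index
        return op(self._values, aligned)


def _add_arithmetic_methods(cls):
    # Attach __add__, __sub__, ... in a loop instead of writing each by hand.
    for name in ("add", "sub", "mul", "truediv"):
        op = getattr(operator, name)

        def method(self, other, op=op):
            return self._binary_op(op, other)

        setattr(cls, f"__{name}__", method)


_add_arithmetic_methods(GroupByArith)

df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=["key", "val"])
result = GroupByArith(df, "key") - df.groupby("key").mean()
print(result)
```

A real implementation would also need reflected operators, multiple grouping keys and Series inputs, none of which this sketch attempts.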

Why is this worth doing?

  1. Consistent arithmetic ops for Series, DataFrame and GroupBy objects
  2. May enable deprecation of methods like mad (see Cythonized GroupBy mad #20024)
  3. Provides easier "demeaning" and "normalization" for grouped data
  4. Mirrors the xarray implementation, which appears to be well received by its user base

Why might it not be worth doing?

  1. Will add more complexity to a GroupBy class that is already in need of a refactor
  2. TBD

Consideration Points
With this proposal, the left operand would always be a GroupBy object and the right operand would always be the result of a function application against that same GroupBy. The result of the operation should be a Series or DataFrame like-indexed to the original object.

That said, the following operations would in theory be identical:

df.groupby('key') - df.groupby('key').mean()
# OR
df.groupby('key') - df.groupby('key').transform('mean')

I'm not sure if we care to differentiate between these and force users into choosing one or the other.
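
For reference, the two right-hand operands differ in shape today: the aggregation is indexed by group key, while the transformation is indexed like the original frame. The sketch below illustrates that difference and one possible way an aggregated result could be broadcast back to the original rows. The `broadcast` step is an assumption about how an implementation might align the operands, not something pandas does for GroupBy arithmetic today.

```python
import pandas as pd

df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=["key", "val"])
grouped = df.groupby("key")

agg = grouped.mean()                # one row per group, indexed by key
trans = grouped.transform("mean")   # one row per original row, indexed like df

print(len(agg), len(trans))

# Hypothetical alignment step: look up each row's group in the aggregated
# result, then restore the original row index.
broadcast = agg.reindex(df["key"].to_numpy())
broadcast.index = df.index

# After broadcasting, both operands carry the same values row for row.
same = (broadcast["val"] == trans["val"]).all()
print(same)
```

If both forms were accepted, the implementation would have to tell them apart, most likely by comparing the operand's index against the group keys and the original index. That check can be ambiguous when the group keys happen to coincide with the original row labels.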

Thoughts?
