API: DataFrameGroupBy column subset selection with single list? #23566

@jorisvandenbossche

I wouldn't be surprised if there is already an issue about this, but couldn't directly find one.

When selecting a subset of columns on a DataFrameGroupBy object, both a bare comma-separated sequence of names (which reaches __getitem__ as a tuple inside the [] brackets) and double square brackets (a list inside the __getitem__ [] brackets) seem to work:

In [6]: df = pd.DataFrame(np.random.randint(10, size=(10, 4)), columns=['a', 'b', 'c', 'd'])

In [8]: df.groupby('a').sum()
Out[8]: 
    b   c   d
a            
0   0   5   7
3  18   6  12
4  16   6   9
6  10  11  11
9   3   3   0

In [9]: df.groupby('a')['b', 'c'].sum()
Out[9]: 
    b   c
a        
0   0   5
3  18   6
4  16   6
6  10  11
9   3   3

In [10]: df.groupby('a')[['b', 'c']].sum()
Out[10]: 
    b   c
a        
0   0   5
3  18   6
4  16   6
6  10  11
9   3   3

Personally I find this df.groupby('a')['b', 'c'].sum() a bit strange, and inconsistent with how DataFrame indexing works.

Of course, on a DataFrameGroupBy you don't have the possible confusion with indexing multiple dimensions (rows, columns), but still.
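
To make the difference between the two spellings concrete, here is a minimal sketch in plain Python. `KeyProbe` is a hypothetical helper (not part of pandas) that simply returns whatever key its `__getitem__` receives, so it shows what Python itself passes along before any library logic runs.

```python
class KeyProbe:
    """Hypothetical helper: echoes back the key that __getitem__ receives."""

    def __getitem__(self, key):
        return key


probe = KeyProbe()

# Single brackets with a comma: Python packs the names into one tuple.
single_brackets = probe['b', 'c']

# Double brackets: the inner brackets build a list, which is passed as-is.
double_brackets = probe[['b', 'c']]

print(type(single_brackets).__name__, single_brackets)  # tuple ('b', 'c')
print(type(double_brackets).__name__, double_brackets)  # list ['b', 'c']
```

So any object that accepts both spellings has to treat a tuple key as "several columns", which is the behaviour being questioned here.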

cc @jreback @WillAyd

TomAugspurger (Contributor) commented on Nov 8, 2018

You do have ambiguity with tuples though (not that anyone should do that)

In [14]: df = pd.DataFrame(np.random.randint(10, size=(10, 4)), columns=['a', 'b', 'c', ('a', 'b')])

In [15]: df.groupby('c')['a', 'b'].sum()
Out[15]:
    a   b
c
0   1   7
1   6   9
2   7   7
5   9  11
6   8   6
7  10   6
8  11   8

In [16]: df.groupby('c')[('a', 'b')].sum()
Out[16]:
    a   b
c
0   1   7
1   6   9
2   7   7
5   9  11
6   8   6
7  10   6
8  11   8

I think both of those are incorrect. It should rather be

In [19]: df.groupby('c').sum()[('a', 'b')]
Out[19]:
c
0     7
1     3
2     5
5     7
6     8
7    16
8    11
Name: (a, b), dtype: int64
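
The ambiguity can be sketched without pandas. The function below is a toy illustration only (`candidate_readings` is a hypothetical name, and this is not how pandas resolves keys): it lists every way a key could be read against a set of column labels that includes the tuple label ('a', 'b').

```python
columns = ['a', 'b', 'c', ('a', 'b')]


def candidate_readings(key, columns):
    """Toy helper: list the plausible interpretations of a selection key."""
    readings = []
    if isinstance(key, tuple):
        # A tuple may itself be one column label...
        if key in columns:
            readings.append(('single label', key))
        # ...or it may be shorthand for several separate labels.
        if all(part in columns for part in key):
            readings.append(('several labels', list(key)))
    elif isinstance(key, list):
        # Lists are not hashable, so they can only mean "several labels".
        readings.append(('several labels', list(key)))
    else:
        readings.append(('single label', key))
    return readings


tuple_readings = candidate_readings(('a', 'b'), columns)
list_readings = candidate_readings(['a', 'b'], columns)

print(tuple_readings)  # two readings: the key is ambiguous
print(list_readings)   # one reading: the key is unambiguous
```

Requiring a list for multi-column selection leaves tuples free to mean exactly one label.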

WillAyd (Member) commented on Nov 13, 2018

I don't disagree here. There is a difference when selecting only one column (specifically, returning a Series vs. a DataFrame), but when selecting multiple columns it would be more consistent if we ALWAYS required double brackets. I assume this would also yield a simpler implementation.

Maybe a conversation piece for 1.0? It would be a breaking change for sure, so it is probably best served in a major release like that.

TomAugspurger (Contributor) commented on Nov 13, 2018

I think the hope is for 1.0 to be backwards compatible with 0.25.x.

Do we have a chance to detect this case and throw a FutureWarning (assuming we want to change)?
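
Detecting the case looks feasible in principle, because the tuple is visible inside `__getitem__`. The following is a rough sketch under stated assumptions: `WarningGroupBy` is a hypothetical stand-in class and the warning text is invented for illustration; neither is taken from the pandas code base.

```python
import warnings


class WarningGroupBy:
    """Hypothetical stand-in for a groupby object; not pandas code."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, key):
        # A tuple with more than one element is the df.groupby(...)['b', 'c'] case.
        if isinstance(key, tuple) and len(key) > 1:
            warnings.warn(
                "Selecting several columns with a tuple may stop working in a "
                "future version; pass a list instead.",
                FutureWarning,
                stacklevel=2,
            )
            key = list(key)
        return key


grouped = WarningGroupBy(['a', 'b', 'c', 'd'])

# Record warnings so the behaviour can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    from_tuple = grouped['b', 'c']    # triggers the FutureWarning
    from_list = grouped[['b', 'c']]   # stays silent

print(from_tuple, from_list, len(caught))
```

The old spelling keeps working during the deprecation period, while the list spelling never warns.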

jorisvandenbossche (Member, Author) commented on Nov 13, 2018

Yeah, if we want, I would think it should be possible with a deprecation cycle.

jreback (Contributor) commented on Nov 14, 2018

I suspect this (not using the double brackets) is actually very common in the wild.

But I agree we should deprecate it, as it is inconsistent.

yehoshuadimarsky (Contributor) commented on Dec 25, 2019

Can I take a crack at this or has it already been fixed?

Also, will this be a part of the 1.0 or other milestones?

WillAyd (Member) commented on Dec 25, 2019

yehoshuadimarsky (Contributor) commented on Dec 25, 2019

Thanks, will do.

yehoshuadimarsky (Contributor) commented on Dec 25, 2019

take

yehoshuadimarsky (Contributor) commented on Dec 25, 2019

So this is my first time working on pandas code, and I'm a little confused here, so please bear with me. I'm also new to linking to code on GitHub.

As I understand it, when you index an object with brackets, Python calls its __getitem__ method, and if you pass in several comma-separated keys, they are implicitly packed into a single tuple key. So df['a','b'] is really df[('a','b')] under the hood.

I'm having trouble tracing the code path to figure out where exactly the __getitem__ on the GroupBy is actually implemented here:

  1. DataFrame.groupby is called on the superclass NDFrame here
  2. This eventually creates the specific DataFrameGroupBy object here
  3. Which is a subclass of GroupBy
  4. Which is a subclass of _GroupBy
  5. Which has the mixin named SelectionMixin, defined here
  6. Which implements __getitem__ here
  7. Which, if the key is a list or tuple, returns self._gotitem(list(key), ndim=2)
  8. self._gotitem needs to be implemented by the respective subclasses, which in this case is the DataFrameGroupBy object, and is implemented here
  9. But all this does is simply create an instance of itself (DataFrameGroupBy) with the key (a list/tuple) passed as a slice to the selection parameter
  10. The selection parameter is implemented in the parent _GroupBy object, which sets the internal self._selection attribute to the key here
  11. This is where I'm lost. How does this actually slice the object and only return a subset of it?

Any help here would be greatly appreciated. Thanks.
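
One way to picture steps 7 to 11 is that `__getitem__` does not slice anything itself: it only records the selection, and the subset is taken later, when a result is computed. The sketch below is a toy model of that idea in plain Python. `LazyGroupBy` and its `_selected_columns` property are hypothetical names, and the assumption that pandas applies `_selection` lazily in a comparable way is not confirmed in this thread.

```python
from collections import defaultdict


class LazyGroupBy:
    """Toy model: store the selection now, apply it when aggregating."""

    def __init__(self, rows, by, selection=None):
        self.rows = rows              # list of dicts, one per row
        self.by = by                  # name of the grouping column
        self._selection = selection   # remembered, not yet applied

    def __getitem__(self, key):
        # Mirrors step 9: build a new object that only carries the key along.
        if isinstance(key, (list, tuple)):
            key = list(key)
        return LazyGroupBy(self.rows, self.by, selection=key)

    @property
    def _selected_columns(self):
        # The stored selection is consulted only here, at computation time.
        if self._selection is None:
            return [col for col in self.rows[0] if col != self.by]
        if isinstance(self._selection, list):
            return self._selection
        return [self._selection]

    def sum(self):
        totals = defaultdict(lambda: defaultdict(int))
        for row in self.rows:
            for col in self._selected_columns:
                totals[row[self.by]][col] += row[col]
        return {group: dict(cols) for group, cols in totals.items()}


rows = [
    {'a': 0, 'b': 1, 'c': 2, 'd': 3},
    {'a': 0, 'b': 4, 'c': 5, 'd': 6},
    {'a': 1, 'b': 7, 'c': 8, 'd': 9},
]

subset_sum = LazyGroupBy(rows, 'a')[['b', 'c']].sum()
full_sum = LazyGroupBy(rows, 'a').sum()
print(subset_sum)  # {0: {'b': 5, 'c': 7}, 1: {'b': 7, 'c': 8}}
```

In this model the object returned by the brackets still holds every row and column; only the aggregation step looks at `_selection`.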

yehoshuadimarsky (Contributor) commented on Dec 29, 2019

@WillAyd @jorisvandenbossche are you able to help point me in the right direction? ☝️

kusaasira commented on Jun 24, 2020

@yehoshuadimarsky, this was closed, right?

yehoshuadimarsky (Contributor) commented on Jun 25, 2020

yes

Thuoq commented on May 8, 2021

Thanks for this. In my version of pandas I am using group[[colone_name,]], which is useful and makes the code clearer.
