Bug when combining .groupby() apply with .expanding() apply #12829

Closed
@lminer

Description

@lminer

In this example, the aim is to use an expanding window to create an expanding count, by group, of the occurrences of a predetermined set of strings. There seems to be a bug in the behavior of .expanding() when combined with .groupby() and .apply().

In this case the strings are ['tito', 'bar', 'feep']

             category    group
2000-01-01   'foo'       a
2000-01-02   'tito'      a
2000-01-03   'bar'       a
2000-01-04   'zip'       b
2000-01-05   'zorp'      b
2000-01-03   'feep'      c

So this would become:

             category    group    count
2000-01-01   'foo'       a        0
2000-01-02   'tito'      a        1
2000-01-03   'bar'       a        2
2000-01-04   'zip'       b        0
2000-01-05   'zorp'      b        0
2000-01-03   'feep'      c        1

However, when I run the following code, it returns the category column itself rather than counts. The same thing happens when I use a rolling window in place of expanding.

import pandas as pd
from functools import reduce  # reduce is a builtin on Python 2, but functools.reduce works there too
from operator import or_

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))

def count_categories(ser):
    categories_to_count = ['tito', 'bar', 'feep']
    conditions = [ser == val for val in categories_to_count]
    mask = reduce(or_, conditions)
    return mask.sum()


def expanding_count_categories(s):
    return s.expanding().apply(count_categories)

df.groupby('group')['category'].apply(expanding_count_categories)

>> '2000-01-01'     foo
>> '2000-01-02'    tito
>> '2000-01-03'     bar
>> '2000-01-04'     zip
>> '2000-01-05'    zorp
>> '2000-01-03'    feep
>> dtype: object

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-76-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.1
pip: 8.0.2
setuptools: 19.1.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.2
sphinx: 1.2.2
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None

Activity

jreback

jreback commented on Apr 8, 2016

@jreback
Contributor

This is an extremely weird thing to do (and completely non-performant), keeping tuples in columns. Do something like this:

In [37]: cats = df.category.apply(Series)

In [38]: cats.columns = ['one','two']

In [39]: df2 = pd.concat([df[['group']],cats],axis=1)

In [40]: df2
Out[40]: 
           group   one     two
2000-01-01     a   foo     NaN
2000-01-02     a  tito  puente
2000-01-03     a   bar     NaN
2000-01-04     b   zip     NaN
2000-01-05     b  zorp     NaN
2000-01-03     c  feep     NaN

In [41]: df2.set_index('group').unstack().dropna()
Out[41]: 
     group
one  a           foo
     a          tito
     a           bar
     b           zip
     b          zorp
     c          feep
two  a        puente
dtype: object
lminer

lminer commented on Apr 8, 2016

@lminer
Author

Sorry, there shouldn't have been any tuples in the columns; I've changed them all to strings. The problem is the attempt to use a window method to count the occurrences of these strings. The code snippet I posted should be returning counts, not strings.

jreback

jreback commented on Apr 8, 2016

@jreback
Contributor

In [14]: df['category2'] = df['category'].astype('category').cat.codes

In [15]: df.groupby('group').category2.apply(lambda x: x.expanding().count())
Out[15]: 
2000-01-01    1.0
2000-01-02    2.0
2000-01-03    3.0
2000-01-04    1.0
2000-01-05    2.0
2000-01-03    1.0
Name: category2, dtype: float64

# this will work in 0.18.1
In [16]: df.groupby('group').category2.expanding().count().astype(int)
Out[16]: 
group            
a      2000-01-01    1
       2000-01-02    2
       2000-01-03    3
b      2000-01-04    1
       2000-01-05    2
c      2000-01-03    1
Name: category2, dtype: int64
lminer

lminer commented on Apr 8, 2016

@lminer
Author

Thanks, but I'm trying to only count rows including a string in the list ['tito', 'bar', 'feep']. It does seem like unexpected behavior that the approach I'm using isn't even returning numbers.

jreback

jreback commented on Apr 8, 2016

@jreback
Contributor

then just pre-filter first.
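A minimal sketch of the pre-filter approach (assuming the df from the original report; cumcount is used here instead of .expanding().count(), since the latter needs numeric data):

```python
import pandas as pd

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))

# Keep only the rows whose category is in the target set, then
# number the matching rows within each group (cumcount is 0-based).
filtered = df[df['category'].isin(['tito', 'bar', 'feep'])]
running = filtered.groupby('group').cumcount() + 1
```

Note this yields counts only on the matching rows; the non-matching rows are dropped by the filter.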

lminer

lminer commented on Apr 8, 2016

@lminer
Author

It seems so easy once you say it.

jreback

jreback commented on Apr 8, 2016

@jreback
Contributor

.expanding does not handle non-numerics at the moment; see #12541

jreback

jreback commented on Apr 8, 2016

@jreback
Contributor

just df[df.category.isin([.....])]
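Putting the pieces together, one way (a sketch, not taken verbatim from this thread) to get the expanding per-group count on the full original index without .apply is to turn the isin mask into an integer indicator and take a grouped cumulative sum:

```python
import pandas as pd

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))

# 1/0 indicator for membership in the target set; a running sum
# within each group is then the expanding count on every row.
is_match = df['category'].isin(['tito', 'bar', 'feep']).astype(int)
df['count'] = is_match.groupby(df['group']).cumsum()
```

This produces the count column from the expected output in the report (0, 1, 2, 0, 0, 1), and sidesteps the non-numeric limitation of .expanding() entirely.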

lminer

lminer commented on Apr 8, 2016

@lminer
Author

Ah! All is clear. thanks!
