Description
In this example, the aim is to use an expanding window to create an expanding count, by group, of the occurrences of a predetermined set of strings. It seems there may be a bug in how `expanding` behaves when combined with `groupby` and `apply`.
In this case the strings are ['tito', 'bar', 'feep']
            category  group
2000-01-01  'foo'     a
2000-01-02  'tito'    a
2000-01-03  'bar'     a
2000-01-04  'zip'     b
2000-01-05  'zorp'    b
2000-01-03  'feep'    c
So this would become:
            category  group  count
2000-01-01  'foo'     a      0
2000-01-02  'tito'    a      1
2000-01-03  'bar'     a      2
2000-01-04  'zip'     b      0
2000-01-05  'zorp'    b      0
2000-01-03  'feep'    c      1
However, when I run the following code, it's just the `category` column that gets returned as `count`. The same thing happens when I use `window` in place of `expanding`.
import pandas as pd
from functools import reduce  # builtin on Python 2, needs this import on Python 3
from operator import or_

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))

def count_categories(ser):
    # Count how many values in the window equal one of the target strings.
    categories_to_count = ['tito',
                           'bar',
                           'feep']
    conditions = [ser == val for val in categories_to_count]
    mask = reduce(or_, conditions)
    return mask.sum()

def expanding_count_categories(s):
    # Apply the count over an expanding window within one group.
    return s.expanding().apply(count_categories)

df.groupby('group')['category'].apply(expanding_count_categories)
>> '2000-01-01' foo
>> '2000-01-02' tito
>> '2000-01-03' bar
>> '2000-01-04' zip
>> '2000-01-05' zorp
>> '2000-01-03' feep
>> dtype: object
INSTALLED VERSIONS
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-76-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.0
nose: 1.3.1
pip: 8.0.2
setuptools: 19.1.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.2
sphinx: 1.2.2
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
Activity
jreback commented on Apr 8, 2016
this is an extremely weird thing to do (and completely non-performant), keeping tuples in columns. do something like this:
lminer commented on Apr 8, 2016
Sorry, there shouldn't have been any tuples in the columns. I've changed it all to strings. The problem is with the attempt to use a window method to count the occurrences of these strings. The code snippet I posted should be returning counts, not strings.
jreback commented on Apr 8, 2016
lminer commented on Apr 8, 2016
Thanks, but I'm trying to only count rows including a string in the list `['tito', 'bar', 'feep']`. It does seem like unexpected behavior that the approach I'm using isn't even returning numbers.
jreback commented on Apr 8, 2016
then just pre-filter first.
lminer commented on Apr 8, 2016
It seems so easy once you say it.
jreback commented on Apr 8, 2016
`.expanding` does not handle non-numerics ATM #12541
jreback commented on Apr 8, 2016
just `df[df.category.isin([.....])]`
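A minimal sketch of the pre-filter idea, assuming the list passed to `isin` is the `['tito', 'bar', 'feep']` set from the description (the elided list above is not spelled out in the thread). The names `wanted`, `matches`, and `running_count` are illustrative only. Note that rows that do not match are dropped entirely, so they do not appear with a count of 0.

```python
import pandas as pd

df = pd.DataFrame(
    {"category": ["foo", "tito", "bar", "zip", "zorp", "feep"],
     "group": ["a", "a", "a", "b", "b", "c"]},
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03",
                          "2000-01-04", "2000-01-05", "2000-01-03"]),
)

# Assumption: these are the strings of interest from the description.
wanted = ["tito", "bar", "feep"]

# Pre-filter: keep only the rows whose category is one of the wanted strings.
matches = df[df["category"].isin(wanted)]

# Running count within each group; cumcount() starts at 0, so add 1.
running_count = matches.groupby("group").cumcount() + 1
```

Because only matching rows survive the filter, group `b` has no rows at all in `matches`.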
lminer commented on Apr 8, 2016
Ah! All is clear. thanks!
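For readers who want the table from the description, including the rows with a count of 0, one possible sketch is to convert the match into a numeric 0/1 flag first, since `.expanding` was reported above not to handle non-numeric data. This assumes a reasonably recent pandas in which `groupby(...).expanding()` is available; the column name `is_match` and the variable `expanding_counts` are hypothetical. A per-group cumulative sum is used for the column itself, which gives the same numbers as an expanding sum.

```python
import pandas as pd

df = pd.DataFrame(
    {"category": ["foo", "tito", "bar", "zip", "zorp", "feep"],
     "group": ["a", "a", "a", "b", "b", "c"]},
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03",
                          "2000-01-04", "2000-01-05", "2000-01-03"]),
)

wanted = ["tito", "bar", "feep"]

# Numeric flag: 1 where the category is one of the wanted strings, else 0.
df["is_match"] = df["category"].isin(wanted).astype(int)

# An expanding sum within a group is the same as a cumulative sum within it.
df["count"] = df.groupby("group")["is_match"].cumsum()

# The same numbers via expanding() on the numeric flag; the result is
# indexed by (group, date) rather than by the original index.
expanding_counts = df.groupby("group")["is_match"].expanding().sum()
```

The `cumsum` form keeps the original index, which makes it straightforward to assign back as a column even though one date appears twice in the index.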