Skip to content

Bug: groupby with sort=False creates buggy MultiIndex #32259

@xin-jin

Description

@xin-jin
d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

The index of t is certainly not sorted, but t.index.is_lexsorted() returns True.

Another more subtle example is

d = [3,4,10,0,1,2,5,3]
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

This time the lexsort flag is correct. However, calling sortlevel will not sort the new MultiIndex correctly, that is, t.index.sortlevel(['d', 'a'])[0] returns

MultiIndex([( 3, 0),
            ( 3, 7),
            ( 4, 1),
            (10, 2),
            ( 0, 3),
            ( 1, 4),
            ( 2, 5),
            ( 5, 6)],
           names=['d', 'a'])

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: 0.8.1
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.45.1

Activity

MarcoGorelli

MarcoGorelli commented on Feb 26, 2020

@MarcoGorelli
Member

take

removed their assignment
on Feb 26, 2020
MarcoGorelli

MarcoGorelli commented on Feb 26, 2020

@MarcoGorelli
Member

(just some notes, will come back to this)

is_lexsorted assumes the order of .codes reflects the order of the elements in the index. But .codes is actually determined by

                codes, uniques = algorithms.factorize(self.grouper, sort=self.sort)

in pandas/core/groupby/grouper.py, where self.sort may be False

xin-jin

xin-jin commented on Feb 26, 2020

@xin-jin
Author

I found that stack/unstack has a similar bug:

d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-14'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b}).set_index(['d', 'a']).take([3,2,4,1,0,5,6,7])

If you do df.unstack('d').stack().index, the resulting index is unsorted on level 'd', but its .is_lexsorted() returns True.

I see that stack also relies on algorithms.factorize, which probably leads to the same issue.

MarcoGorelli

MarcoGorelli commented on Feb 27, 2020

@MarcoGorelli
Member

If you do df.unstack('d').stack().index, the resulting index is unsorted on level 'd', but its .is_lexsorted() returns True.

In this case, isn't the current result correct though?

In [1]: import pandas as pd; import numpy as np                                                                                                                                                                  

In [2]: d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-14']) 
   ...: a = np.arange(len(d)) 
   ...: b = np.random.rand(len(d)) 
   ...: df = pd.DataFrame({'d': d, 'a': a, 'b': b}).set_index(['d', 'a']).take([3,2,4,1,0,5,6,7])                                                                                                                

In [3]: df.unstack('d').stack().index                                                                                                                                                                            
Out[3]: 
MultiIndex([(0, '2020-11-02'),
            (1, '2019-01-02'),
            (2, '2020-01-02'),
            (3, '2020-02-04'),
            (4, '2020-11-03'),
            (5, '2019-11-03'),
            (6, '2019-11-13'),
            (7, '2019-11-14')],
           names=['a', 'd'])

The index is sorted by its first argument, and so it is correct to say that it's lexically sorted

xin-jin

xin-jin commented on Feb 27, 2020

@xin-jin
Author

If you do df.unstack('d').stack().index, the resulting index is unsorted on level 'd', but its .is_lexsorted() returns True.

In this case, isn't the current result correct though?

In [1]: import pandas as pd; import numpy as np                                                                                                                                                                  

In [2]: d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-14']) 
   ...: a = np.arange(len(d)) 
   ...: b = np.random.rand(len(d)) 
   ...: df = pd.DataFrame({'d': d, 'a': a, 'b': b}).set_index(['d', 'a']).take([3,2,4,1,0,5,6,7])                                                                                                                

In [3]: df.unstack('d').stack().index                                                                                                                                                                            
Out[3]: 
MultiIndex([(0, '2020-11-02'),
            (1, '2019-01-02'),
            (2, '2020-01-02'),
            (3, '2020-02-04'),
            (4, '2020-11-03'),
            (5, '2019-11-03'),
            (6, '2019-11-13'),
            (7, '2019-11-14')],
           names=['a', 'd'])

The index is sorted by its first argument, and so it is correct to say that it's lexically sorted

Oh you are right. I was confused a bit. Yes, this is actually correctly sorted.

MarcoGorelli

MarcoGorelli commented on Sep 7, 2020

@MarcoGorelli
Member

Decision has been to deprecate is_lexsorted as a public method - users should use is_monotonic_increasing

jorisvandenbossche

jorisvandenbossche commented on Dec 28, 2020

@jorisvandenbossche
Member

Decision has been to deprecate is_lexsorted as a public method - users should use is_monotonic_increasing

Where has this been discussed?

MarcoGorelli

MarcoGorelli commented on Dec 28, 2020

@MarcoGorelli
Member

Hey @jorisvandenbossche - it was in call when we had the sprint on the 31st of August

added this to the 1.3 milestone on Jan 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Bug: groupby with sort=False creates buggy MultiIndex · Issue #32259 · pandas-dev/pandas