-
-
Notifications
You must be signed in to change notification settings - Fork 18.8k
Description
d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()
The index of t
is certainly not sorted, but t.index.is_lexsorted()
returns True.
Another more subtle example is
d = [3,4,10,0,1,2,5,3]
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()
This time the lexsort flag is correct. However, calling sortlevel will not sort the new MultiIndex correctly, that is, t.index.sortlevel(['d', 'a'])[0]
returns
MultiIndex([( 3, 0),
( 3, 7),
( 4, 1),
(10, 2),
( 0, 3),
( 1, 4),
( 2, 5),
( 5, 6)],
names=['d', 'a'])
Output of pd.show_versions()
[paste the output of pd.show_versions()
here below this line]
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: 0.8.1
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.45.1
Activity
MarcoGorelli commentedon Feb 26, 2020
take
MarcoGorelli commentedon Feb 26, 2020
(just some notes, will come back to this)
is_lexsorted
assumes the order of.codes
reflects the order of the elements in the index. But.codes
is actually determined byin
pandas/core/groupby/grouper.py
, whereself.sort
may be Falsexin-jin commentedon Feb 26, 2020
I found that stack/unstack has a similar bug:
If you do
df.unstack('d').stack().index
, the resulting index is unsorted on level 'd', but its.is_lexsorted()
returns True.I see that stack also relies on algorithms.factorize, which probably leads to the same issue.
MarcoGorelli commentedon Feb 27, 2020
In this case, isn't the current result correct though?
The index is sorted by its first argument, and so it is correct to say that it's lexically sorted
xin-jin commentedon Feb 27, 2020
Oh you are right. I was confused a bit. Yes, this is actually correctly sorted.
MarcoGorelli commentedon Sep 7, 2020
Decision has been to deprecate
is_lexsorted
as a public method - users should useis_monotonic_increasing
jorisvandenbossche commentedon Dec 28, 2020
Where has this been discussed?
MarcoGorelli commentedon Dec 28, 2020
Hey @jorisvandenbossche - it was in call when we had the sprint on the 31st of August