Description
I do not understand the sort order for Python Pandas DataFrame merge function with how="inner". Example:
import pandas as pd
df2 = pd.DataFrame({'a': (6, 7, 8, 6), 'b': ("w", "x", "y", "z")})
print(df2)
print("left:")
dfMerge2 = pd.merge(df2, df2, on='a', how="left")
print(dfMerge2)
dfMerge = pd.merge(df2, df2, on='a', how="inner")
print("inner:")
print(dfMerge)
Result:
a b
0 6 w
1 7 x
2 8 y
3 6 z
left:
a b_x b_y
0 6 w w
1 6 w z
2 7 x x
3 8 y y
4 6 z w
5 6 z z
inner:
a b_x b_y
0 6 w w
1 6 w z
2 6 z w
3 6 z z
4 7 x x
5 8 y y
I would expect that for how="inner" the order of the resulting rows with
6 z w and
6 z z
would be the same as with how="left", as the documentation https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html says:
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
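Until the inner join itself follows the documented order, one possible workaround is to record each left row's position before merging and restore that order afterwards. This is only a sketch using the example frame above: the `_left_pos` column name is a hypothetical helper, not part of pandas, and the stable sort is an assumption about how ties (several matches for one left row) should be kept together.

```python
import pandas as pd

df2 = pd.DataFrame({'a': (6, 7, 8, 6), 'b': ("w", "x", "y", "z")})

# Tag every left row with its original position.
# "_left_pos" is a made-up helper column name for this sketch.
left = df2.assign(_left_pos=range(len(df2)))

merged = pd.merge(left, df2, on='a', how='inner')

# A stable sort on the recorded position restores the left row order
# while keeping the matches for each left row next to each other.
ordered = (
    merged.sort_values('_left_pos', kind='mergesort')
          .drop(columns='_left_pos')
          .reset_index(drop=True)
)
print(ordered)
```

With this, the rows for `6 z` come last, in the same position as in the how="left" result, regardless of the order the inner join produced internally.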
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-103-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
Activity
dhimmel commented on Jan 8, 2018
I just noticed the same issue with pandas 0.22.0:
The expected behavior would be for the row ordering of affil_map_df to be preserved. Instead, it seems that the order of affiliation_df, or perhaps the sorted affiliation column, was used. This behavior does not match the documentation:
pandas/pandas/core/frame.py
Lines 147 to 148 in a00154d
To me, the documented behavior is intuitive and the actual behavior should be updated?
jschendel commented on Jun 12, 2018
To expand on this, the issue appears to occur when the merge key is non-unique.
Setup:
Non-unique merge key causes improper ordering:
Restricting to a unique portion seems fine:
Using how='left' maintains proper order:
Title changed from "Python Pandas DataFrame Merge: Strange Sort Order for How = Inner" to "Merge with how='inner' and non-unique join key does not preserve the order of the left keys"
TartySG commented on Jun 12, 2018
@jschendel
It has more to do with the order in which the merge encounters the values rather than "non-uniqueness".
The order is the same as the first time it encounters the join key in the column.
So a unique value will only be encountered once, hence the absence of noticeable change in order.
will produce a merge with all 'A' first.
On the other hand:
will produce a merge with all the 'B' first, regardless of the "Order" of the categorical data (or any ordered type, e.g. integer).
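The ordering described in this comment can be modeled without calling merge at all. The sketch below is not pandas's internal join code; `first_encounter_order` is a hypothetical helper that groups row positions by the first appearance of each key, using `pd.factorize` (which numbers values in order of first appearance) and a stable argsort.

```python
import numpy as np
import pandas as pd

def first_encounter_order(keys):
    """Return row positions grouped by the first appearance of each key.

    Hypothetical helper for illustration; it only models the ordering
    described in the comment above, not the actual join implementation.
    """
    # factorize assigns codes 0, 1, 2, ... in order of first appearance.
    codes, _ = pd.factorize(pd.Series(keys))
    # A stable argsort keeps rows with the same key in their original order.
    return np.argsort(codes, kind="stable").tolist()

print(first_encounter_order([6, 7, 8, 6]))          # [0, 3, 1, 2]
print(first_encounter_order(["B", "A", "B", "A"]))  # [0, 2, 1, 3]
```

For the keys 6, 7, 8, 6 from the original report this gives the row order w, z, x, y, which matches the left-hand rows of the how="inner" output shown there. In the second call all the 'B' rows come first because 'B' is encountered first, independent of any sort order on the values.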
jschendel commented on Jun 12, 2018
Yes, looks like I was a bit premature attributing the issue to non-uniqueness.
Title changed from "Merge with how='inner' and non-unique join key does not preserve the order of the left keys" to "Merge with how='inner' does not always preserve the order of the left keys"
asishm commented on Dec 25, 2020
@phofl Can this be fixed with similar changes as PR #37406 ?
phofl commented on Dec 29, 2020
@asishm No, the Cython function performing the actual inner join does not support sort. In #37406 the sort keyword was simply not passed through, so that was an easy fix. Not quite sure why sorting was not implemented for the inner join.
rickbeeloo commented on Jan 20, 2022
This still appears to be an issue
jreback commented on Jan 20, 2022
@rickbeeloo hence the open status
pull requests to patch are welcome
hcaptchavitamin commented on Oct 17, 2025