Skip to content

BUG: pd.merge using "inner" with multi-index doesn't preserve left index order #55774

Closed
@f-combes

Description

@f-combes

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

left = pd.DataFrame(
    {
        "digits": [1, 2, 3, 1],
        "lowercase": ["a", "v", "f", "g"],
        "uppercase": ["A", "B", "B", "A"],
        "left": [100_000, 100_000, 100_000, 30_000],
    },
).set_index(["digits", "lowercase", "uppercase"])


right = pd.DataFrame(
    {
        "uppercase": ["A", "B", "B", "D"],
        "digits": [1, 2, 3, 5],
        "right": [20_000, 70_000, 100_000, 500],
    },
).set_index(["uppercase", "digits"]).squeeze()

result = pd.merge(left, right, how="inner", left_index=True, right_index=True)

pd.testing.assert_index_equal(left.index, result.index)

Output:
> AssertionError: MultiIndex level [0] are different

MultiIndex level [0] values are different (75.0 %)
[left]:  Index([1, 2, 3, 1], dtype='int64', name='digits')
[right]: Index([1, 1, 2, 3], dtype='int64', name='digits')

Issue Description

As the documentation states:

inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

Performing an inner merge should preserve the order of the left keys. In the above example, the left multi-index should be preserved - it is instead sorted.

In a similar fashion, even though align states that the left key order is preserved when doing inner join, it is not. However, aligning twice will result in the correct order, e.g.

_, aligned_right_once = left.align(right, join="inner", axis=0)
_, aligned_right_twice = left.align(aligned_right_once, join="inner", axis=0)

pd.testing.assert_index_equal(left.index, aligned_right_twice.index) # passes
pd.testing.assert_index_equal(left.index, aligned_right_once.index) # raises 

which seems fairly inconsistent -

Potentially relatable issues/PR

Expected Behavior

The key order of the result should the the same as the left dataframe in the example.

Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.0-35-generic
Version : #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 6 10:23:26 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.23.5
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.3.1
Cython : None
pytest : 7.4.0
hypothesis : 6.81.1
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.1
gcsfs : 2023.9.1
matplotlib : 3.6.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugClosing CandidateMay be closeable, needs more eyeballsReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions