BUG: Index.difference and Index.intersection doesn't preserve type of Index for some Index subclasses for corner cases #20040

@Dr-Irv (Contributor)

Code Sample, a copy-pastable example if possible

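import pandas as pd

# For each index, compare "difference with itself" against
# "intersection with an emptied copy of itself".

# PeriodIndex
pi1 = pd.PeriodIndex(start='2000', end='2010', freq='A')
print(pi1.difference(pi1), pi1.intersection(pi1.drop(pi1)))

# CategoricalIndex
ci = pd.CategoricalIndex(['a','b','c'], categories=['a','b','c'])
print(ci.difference(ci), ci.intersection(ci.drop(ci)))

# RangeIndex
ri = pd.RangeIndex(start=1, stop=5)
print(ri.difference(ri), ri.intersection(ri.drop(ri)))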

Problem description

For several Index subclasses, taking the difference of an index with itself returns an empty plain Index rather than an empty index of the subclass type.

From a set-algebra point of view, for any set S, S.difference(S) should equal S.intersection(nullset): both are the empty set, so both should produce the same result.
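As a point of comparison, here is a minimal sketch of the same identity using Python's built-in set types rather than pandas. The variable names are illustrative only. It shows that both operations yield the empty set, and that `frozenset` keeps its own type in both cases, which is the behaviour this report asks of the Index subclasses.

```python
# Built-in sets: S - S and S & {} are both the empty set.
S = {"a", "b", "c"}
diff = S.difference(S)
inter = S.intersection(set())
print(diff == inter)  # both are empty, so they compare equal

# frozenset preserves its own type for both operations.
FS = frozenset(S)
fdiff = FS.difference(FS)
finter = FS.intersection(frozenset())
print(type(fdiff).__name__, type(finter).__name__)
```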

The output from the above is:

Index([], dtype='object') PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC')
Index([], dtype='object') CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category')
Index([], dtype='object') Int64Index([], dtype='int64')

There is some discussion in pull request #19849, where I discovered this bug, but at the request of @jreback, I have split it out into a separate issue.

Expected Output

PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC') PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC')
CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category') CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category')
RangeIndex(start=0, stop=0, step=1) RangeIndex(start=0, stop=0, step=1)

Note that for RangeIndex, the result of the intersection operation is also incorrect: it returns an Int64Index rather than a RangeIndex.
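The expectation above could be turned into a small check along the following lines. This is only a sketch: `empty_results` and `preserves_type` are hypothetical helper names, not pandas API, and the printed flags depend on the pandas version installed. It uses `pd.period_range` with a monthly frequency as an assumption so that it runs on pandas versions newer than the one in this report.

```python
import pandas as pd

def empty_results(idx):
    """Return (idx.difference(idx), idx.intersection(<emptied idx>))."""
    emptied = idx.drop(idx)  # drop every label; assumes unique labels
    return idx.difference(idx), idx.intersection(emptied)

def preserves_type(idx):
    """One flag per result: True if its class matches the class of idx."""
    return [type(result) is type(idx) for result in empty_results(idx)]

examples = [
    pd.CategoricalIndex(["a", "b", "c"], categories=["a", "b", "c"]),
    pd.RangeIndex(start=1, stop=5),
    pd.period_range("2000-01", "2000-06", freq="M"),
]
for idx in examples:
    # A False flag marks an operation that lost the subclass type.
    print(type(idx).__name__, preserves_type(idx))
```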

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Activity

Dr-Irv (Contributor, Author) commented on Mar 7, 2018

I'm willing to work on this, but can we first discuss the implementation? The solution suggested in the discussion in #19849 is to use self._shallow_copy([]), but that method doesn't work correctly for empty indexes. I think it would be easier to add a method that creates an empty index while preserving the other properties of the index (e.g., categories for CategoricalIndex, range step for RangeIndex, freq for PeriodIndex).

Alternatively, I can make self._shallow_copy([]) work for the various Index subclasses with an empty list argument.
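To make the first option concrete, here is one possible shape for an "empty index that keeps its properties" helper. It is a sketch only: `empty_like` is a hypothetical name, it is not part of pandas, and it is not necessarily the fix that was adopted. It assumes that taking a zero-length slice of an index returns the same subclass with its metadata intact, which holds in the pandas versions I am aware of but should be verified for the version in use.

```python
import pandas as pd

def empty_like(index):
    """Hypothetical helper: zero-length index of the same class as `index`."""
    # Assumption: a zero-length slice keeps the subclass and its metadata
    # (categories, step, freq).
    return index[:0]

ci = pd.CategoricalIndex(["a", "b", "c"], categories=["a", "b", "c"])
ri = pd.RangeIndex(start=1, stop=10, step=2)
pi = pd.period_range("2000-01", "2000-06", freq="M")

empty_ci = empty_like(ci)
empty_ri = empty_like(ri)
empty_pi = empty_like(pi)

for original, empty in [(ci, empty_ci), (ri, empty_ri), (pi, empty_pi)]:
    print(type(original).__name__, "->", type(empty).__name__, len(empty))
```

A set operation that knows its result is empty could then return `empty_like(self)` instead of building a plain Index from an empty list.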

Labels added on Mar 8, 2018: Dtype Conversions (Unexpected or buggy dtype conversions), Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff)
gfyoung (Member) commented on Mar 8, 2018

@Dr-Irv: That seems like a good first attempt at patching this, though other options are of course welcome.

Dr-Irv (Contributor, Author) commented on Mar 8, 2018

@gfyoung By "That seems", do you mean having a method to create an empty index, or fixing _shallow_copy([])?

gfyoung (Member) commented on Mar 8, 2018

Oh, sorry! I was referring to fixing _shallow_copy([]).

added this to the 0.23.0 milestone on Mar 9, 2018

Metadata

Assignees: No one assigned
Labels: Algos, Bug, Dtype Conversions
Participants: @jreback, @gfyoung, @Dr-Irv