Confusing (possibly buggy) IntervalIndex behavior #16316

alexlenail (Contributor, Author)

[Image: screen shot 2017-05-09 at 10 03 25 pm]

In the above, I have a region that I'm querying for with a partially overlapping interval. The query succeeds when the interval is partially overlapping until it doesn't, throwing the key error:

KeyError                                  Traceback (most recent call last)
/Users/alex/Documents/GarNet/venv/lib/python3.6/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1433                 if not ax.contains(key):
-> 1434                     error()
   1435             except TypeError as e:

/Users/alex/Documents/GarNet/venv/lib/python3.6/site-packages/pandas/core/indexing.py in error()
   1428                 raise KeyError("the label [%s] is not in the [%s]" %
-> 1429                                (key, self.obj._get_axis_name(axis)))
   1430 

KeyError: 'the label [(5409951, 5409965]] is not in the [index]'

I think this is particularly confusing because there doesn't seem to be any prominent difference between the locs that succeed and the loc that fails as far as I can tell. I know we had discussed loc's behavior in this context but I'm not sure we came to a conclusion.

By the way, my larger question is about how to find intersections between two IntervalIndex. It seems like the find_intersections function didn't make it into this release @jreback ? Let me know! =]

Activity

jreback (Contributor) commented on May 10, 2017

Can you change the post to show code that constructs a minimal example? Pictures are not very useful.

alexlenail (Contributor, Author) commented on May 10, 2017

@jreback

Make your dataframe

x = pd.DataFrame(['a'], columns=['col'], index=pd.IntervalIndex.from_tuples([(1, 5)], closed='both'))

Test all these:

x.loc[pd.Interval(1,5)]
x.loc[pd.Interval(-10,10)]
x.loc[pd.Interval(3,5)]
x.loc[pd.Interval(3,4)]
x.loc[pd.Interval(2,4)]
x.loc[pd.Interval(1,4)]

I think that documentation as to what the behavior should be might be helpful. I can't make heads nor tails of what loc thinks it should do in this case, so I can't even tell if this is a bug.

P.S. Still curious as to how you would address my meta-question about find_intersections...

Labels added on May 11, 2017: Interval (Interval data type); Indexing (Related to indexing on series/frames, not to indexes themselves)

jreback (Contributor) commented on May 11, 2017

Well, the docs are here: http://pandas.pydata.org/pandas-docs/stable/advanced.html#intervalindex

We treat indexing with an Interval as an exact match: if it's there exactly it matches, otherwise you get a KeyError. If it's a point, it can be contained in an interval. Some of this is probably buggy because we didn't have any real examples.

what would be helpful is using this example:

In [26]: df = pd.DataFrame(['a', 'b'], columns=['col'], index=pd.IntervalIndex.from_tuples([(1, 5), (7, 8)]))

In [27]: df
Out[27]: 
       col
(1, 5]   a
(7, 8]   b

Enumerating what the cases above should do semantically on this frame would be really helpful, as would selections like df.loc[3] and, further still, what df.loc[pd.Interval(4, 7.5)] would do.
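
One way to produce the enumeration requested above is to probe each key and record whether `.loc` succeeds. This is only a sketch: `probe` is a hypothetical helper, the key list is illustrative, and most outcomes depend on the installed pandas version, so no expected results are listed.

```python
import pandas as pd

df = pd.DataFrame(
    ["a", "b"],
    columns=["col"],
    index=pd.IntervalIndex.from_tuples([(1, 5), (7, 8)]),
)


def probe(frame, key):
    """Hypothetical helper: 'ok' if frame.loc[key] succeeds, else the exception name."""
    try:
        frame.loc[key]
    except Exception as exc:  # KeyError is the failure mode discussed in this thread
        return type(exc).__name__
    return "ok"


# Exact interval, contained point, uncovered point, sub-interval, straddling interval.
keys = [pd.Interval(1, 5), 3, 6, pd.Interval(1, 4), pd.Interval(4, 7.5)]
results = {repr(key): probe(df, key) for key in keys}

for key, outcome in results.items():
    print(f"df.loc[{key}] -> {outcome}")
```

Running the same script on different pandas releases gives a quick way to compare how the semantics changed.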

jreback (Contributor) commented on May 11, 2017

Further, I still don't understand what you need from find_intersection; a complete example would be helpful. IOW, you have 2 frames as input (show the code), and show what you think the output should be.

alexlenail (Contributor, Author) commented on May 11, 2017

@jreback

Sorry, let me include the output from those .locs and explain why I think they might be buggy, to help clarify.

x = pd.DataFrame(['a'], columns=['col'], index=pd.IntervalIndex.from_tuples([(1, 5)], closed='both'))

Intuitively, I want to say that 'a' is from 1 to 5, closed interval. All good here.

x.loc[pd.Interval(1,5)]

"Do I have any intervals from 1 to 5?" Yes. No surprise

x.loc[pd.Interval(-10,10)]

Do I have any intervals from -10 to 10? Yes. Oh, okay, so it's not exact matches. Okay.

x.loc[pd.Interval(3,5)]

Do I have any intervals from 3 to 5? Yes. Okay, that makes sense, it's a partial overlap with [1,5].

x.loc[pd.Interval(3,4)]

Do I have any intervals from 3 to 4? KeyError Wait what?

x.loc[pd.Interval(2,4)]   # KeyError
x.loc[pd.Interval(1,4)]   # KeyError

I think this qualifies as "Confusing (possibly buggy) IntervalIndex behavior" but it might not -- I might just be thinking about this incorrectly. I'm happy to supplement the docs to clarify for people like me if that's the case.

Thanks!

jreback (Contributor) commented on May 12, 2017

I was looking for more discussion of whether an Interval needs to be an exact match, or matches if other intervals are fully contained.

What happens if it partially overlaps?
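
To make the three candidate rules concrete, here is a minimal pure-Python sketch that treats intervals as closed `(left, right)` tuples and does not use pandas at all. The function names are hypothetical, and "contained" is read here as "the index interval lies fully inside the query", which is an assumption about the intended meaning.

```python
def exact(index_iv, query):
    """Match only when both endpoints are identical."""
    return index_iv == query


def contained(index_iv, query):
    """Match when the index interval lies fully inside the query (assumed reading)."""
    return query[0] <= index_iv[0] and index_iv[1] <= query[1]


def overlaps(index_iv, query):
    """Match when the two closed intervals share at least one point."""
    return index_iv[0] <= query[1] and query[0] <= index_iv[1]


index_iv = (1, 5)
queries = [(1, 5), (-10, 10), (3, 5), (3, 4), (2, 4), (1, 4)]
rules = {"exact": exact, "contained": contained, "overlap": overlaps}

# For each rule, list the queries from the thread that would match.
matches = {
    name: [q for q in queries if rule(index_iv, q)] for name, rule in rules.items()
}

for name, matched in matches.items():
    print(f"{name:9s} -> {matched}")
```

Under these definitions the exact rule keeps only (1, 5), the contained rule keeps (1, 5) and (-10, 10), and the overlap rule keeps all six queries.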

alexlenail (Contributor, Author) commented on May 12, 2017

@jreback I think I might see the footprint of the bug. New dataframe: (same as before, just from 10 to 15)

x = pd.DataFrame(['a'], columns=['col'], index=pd.IntervalIndex.from_tuples([(10, 15)], closed='both'))

x.loc[pd.Interval(9,15)]  # works. 

x.loc[pd.Interval(10,15)]  # works. 

x.loc[pd.Interval(11,15)]  # works. 

x.loc[pd.Interval(9,16)]  # works. 

x.loc[pd.Interval(10,16)]  # works. 

x.loc[pd.Interval(11,16)]  # works.

x.loc[pd.Interval(9,14)]  # fails. 

x.loc[pd.Interval(10,14)]  # fails. 

x.loc[pd.Interval(11,14)]  # fails.

alexlenail (Contributor, Author) commented on May 12, 2017

So:

x.loc[pd.Interval(9,14)] # fails. | x.loc[pd.Interval(10,14)] # fails.         | x.loc[pd.Interval(11,14)] # fails.
x.loc[pd.Interval(9,15)] # works. | x.loc[pd.Interval(10,15)] # true interval. | x.loc[pd.Interval(11,15)] # works.
x.loc[pd.Interval(9,16)] # works. | x.loc[pd.Interval(10,16)] # works.         | x.loc[pd.Interval(11,16)] # works.
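
The outcomes reported in both examples in this thread can be checked against a simple hypothesis: a lookup worked exactly when the query's right endpoint was at least the index interval's right endpoint. The sketch below only tabulates the results quoted in the comments above; it is not a diagnosis of the pandas code, and the variable names are made up for illustration.

```python
# (index interval, query interval, did .loc work?) as reported in the thread.
observed = [
    ((1, 5), (1, 5), True),
    ((1, 5), (-10, 10), True),
    ((1, 5), (3, 5), True),
    ((1, 5), (3, 4), False),
    ((1, 5), (2, 4), False),
    ((1, 5), (1, 4), False),
    ((10, 15), (9, 14), False),
    ((10, 15), (10, 14), False),
    ((10, 15), (11, 14), False),
    ((10, 15), (9, 15), True),
    ((10, 15), (10, 15), True),
    ((10, 15), (11, 15), True),
    ((10, 15), (9, 16), True),
    ((10, 15), (10, 16), True),
    ((10, 15), (11, 16), True),
]


def hypothesis(index_iv, query):
    """Assumed pattern: success depends only on the right endpoints."""
    return query[1] >= index_iv[1]


# Collect every reported case that the hypothesis fails to explain.
mismatches = [
    (index_iv, query)
    for index_iv, query, worked in observed
    if hypothesis(index_iv, query) != worked
]
print("cases not explained by the hypothesis:", mismatches)
```
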
jreback (Contributor) commented on May 12, 2017

@zfrenchee I have enough examples, what I want to know is why you think this should work at all.

In [27]: df = pd.DataFrame(['a'], columns=['col'], 
          index=pd.IntervalIndex.from_tuples([(10, 15)], closed='both'))

In [28]: df
Out[28]: 
         col
[10, 15]   a

IOW, take cases and comment on if they should work or raise (KeyError or other)

  • df.loc[pd.Interval(10,15, closed='both')]
  • df.loc[pd.Interval(10, 15, closed='right')]
  • df.loc[pd.Interval(11, 14)]
  • df.loc[pd.Interval(11, 16)]
  • df.loc[[pd.Interval(11, 13):pd.Interval(14, 15)]]
  • df.loc[[pd.Interval(11, 13), pd.Interval(14, 15)]]
  • df.loc[12]
  • df.loc[9]
  • df.loc[9, 12]
  • df.loc[11:13]

alexlenail (Contributor, Author) commented on May 12, 2017

I think there are two behaviors for loc which I think would make sense: exact match, or any overlap.

So reasonable behavior 1 is:

df = pd.DataFrame(['a'], columns=['col'], 
          index=pd.IntervalIndex.from_tuples([(10, 15)], closed='both'))

df.loc[pd.Interval(10,15, closed='both')]      # the only one that succeeds
df.loc[pd.Interval(10, 15, closed='right')]     # KeyError
df.loc[pd.Interval(11, 14)]     # KeyError
df.loc[pd.Interval(11, 16)]     # KeyError
df.loc[[pd.Interval(11, 13):pd.Interval(14, 15)]]     # KeyError
df.loc[[pd.Interval(11, 13), pd.Interval(14, 15)]]     # KeyError
df.loc[12]     # KeyError
df.loc[9]     # KeyError
df.loc[9, 12]     # KeyError
df.loc[11:13]     # KeyError

As a non-pandas expert, I believe this is most in keeping with what loc currently does on other index types, but I think this also obliterates the reason for having intervals, since you've essentially reduced the interval down to a token which you're selecting for.

The other possible behavior is that loc returns all overlaps:

df = pd.DataFrame(['a'], columns=['col'], 
          index=pd.IntervalIndex.from_tuples([(10, 15)], closed='both'))

df.loc[pd.Interval(10,15, closed='both')]      # a
df.loc[pd.Interval(10, 15, closed='right')]      # a
df.loc[pd.Interval(11, 14)]      # a
df.loc[pd.Interval(11, 16)]      # a
df.loc[[pd.Interval(11, 13):pd.Interval(14, 15)]]      # a, though I'm not totally sure what the semantics of this query are. 
df.loc[[pd.Interval(11, 13), pd.Interval(14, 15)]]      # a
df.loc[12]      # a
df.loc[9]     # None, because if this returned a KeyError the next case would be really hard.
df.loc[9, 12]      # a
df.loc[11:13]      # a

I think what makes most sense is to use loc for the first of these, and define a new special overlap function to implement the second of these, which would have a signature like so:

pd.IntervalIndex().overlaps(other_intervalindex_or_dataframe_with_intervalIndex)

or more likely:

dataframe_with_intervalindex.index.overlaps(other_intervalindex_or_dataframe_with_intervalIndex)

This would return something sensible, like the indices from dataframe_with_intervalindex which overlap intervals in other_intervalindex_or_dataframe_with_intervalIndex. Actually returning the values from both should be left to merge(left_index=True, right_index=True)

Maybe it would be easier to define overlaps like so:

pd.overlaps(intervalIndex1, intervalindex2)

In that case it's a little harder to decide what to return, since you would want it to return the same thing as:

pd.overlaps(intervalIndex2, intervalindex1)

What do you think? @jreback
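
As a rough illustration of the pairwise result discussed above, the following sketch returns `(left_position, right_position)` pairs for overlapping intervals. `overlap_pairs` is a hypothetical helper, not a pandas function, and it assumes a later pandas release in which `IntervalIndex.overlaps` accepts a scalar `Interval`.

```python
import numpy as np
import pandas as pd


def overlap_pairs(left, right):
    """Hypothetical helper: sorted (i, j) pairs where left[i] overlaps right[j]."""
    pairs = []
    for j, interval in enumerate(right):
        # IntervalIndex.overlaps(Interval) gives a boolean mask over `left`.
        for i in np.flatnonzero(left.overlaps(interval)):
            pairs.append((int(i), j))
    return sorted(pairs)


left = pd.IntervalIndex.from_tuples([(1, 5), (7, 8)])
right = pd.IntervalIndex.from_tuples([(4, 7.5), (10, 12)])

forward = overlap_pairs(left, right)
backward = overlap_pairs(right, left)
print(forward)
print(backward)
```

Swapping the arguments yields the same pairs with their elements reversed, which is one way to satisfy the symmetry concern raised above.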

TomAugspurger (Contributor) commented on May 12, 2017

> As a non-pandas expert, I believe this is most in keeping with what loc currently does on other index types

The closest analogy here is probably partial string indexing into Datetimes. We accept .loc['2017'] rather than .loc[pd.Timestamp(2017, ...)]. Stretching the analogy a bit further then, .loc should be "exact" when passed Intervals, and "non-exact" when passed the elements making up the Intervals (ints, strs, whatever). So these both return the same (which is what happens currently)

In [83]: df = pd.DataFrame(['a', 'b'], columns=['col'], index=pd.IntervalIndex.from_tuples([(1, 5), (7, 8)]))
    ...: df
    ...:
Out[83]:
       col
(1, 5]   a
(7, 8]   b

In [61]: df.loc[pd.Interval(1, 5)]
Out[61]:
col    a
Name: (1, 5], dtype: object

In [62]: df.loc[3]
Out[62]:
col    a
Name: (1, 5], dtype: object

but, then

In [50]: df.loc[pd.Interval(1, 4)]

would raise a KeyError, since it doesn't match exactly.

Users wishing to do indexing by passing an IntervalIndex should use boolean indexing, and use some methods on IntervalIndex to assist.

For more flexible indexing with iterables of Intervals, I propose we enhance IntervalIndex.contains (or maybe a new method) to accept an Iterable other and return an Iterable[bool] of the same length.

from typing import Iterable

class IntervalIndex:  # proposed additions, signatures only
    def covers(self, other: Iterable) -> "Array[bool]":
        """Boolean mask for whether items of `other` overlap with anything in `self`.
        Output is the same shape as `other`."""
        # maybe enhance `.contains` to do this?

    def covered_by(self, other: Iterable) -> "Array[bool]":
        """Boolean mask for whether items in `self` overlap with anything in `other`.
        Output is the same shape as `self`."""
        # maybe modify `.isin` to do this?

So to summarize:

  • df.loc with scalars or lists of non-Intervals matches wherever the value is covered by df.index
  • df.loc with a scalar Interval, or a list / IntervalIndex of Intervals, matches only exactly
  • Users should use df[df.index.covers(values)] for non-strict matching with iterables of Intervals
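
A minimal sketch of the boolean-indexing approach in the last bullet, assuming a later pandas release where `IntervalIndex.overlaps` accepts a scalar `Interval`. `overlaps_any` is a hypothetical stand-in for the proposed method; it uses overlap rather than containment, and it returns a mask over the frame's own index so the result can be passed straight to `df[...]`.

```python
import numpy as np
import pandas as pd


def overlaps_any(index, intervals):
    """Hypothetical helper: True where an entry of `index` overlaps any query interval."""
    mask = np.zeros(len(index), dtype=bool)
    for interval in intervals:
        mask |= index.overlaps(interval)
    return mask


df = pd.DataFrame(
    ["a", "b"],
    columns=["col"],
    index=pd.IntervalIndex.from_tuples([(1, 5), (7, 8)]),
)

# Neither query is an exact key of the index, so exact-match .loc would not apply.
queries = [pd.Interval(1, 4), pd.Interval(20, 30)]
mask = overlaps_any(df.index, queries)
selected = df[mask]
print(selected)
```
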
shoyer (Member) commented on May 12, 2017

> So to summarize:
> • df.loc with scalars or lists of non-Intervals matches wherever the value is covered by df.index
> • df.loc with a scalar Interval, or a list / IntervalIndex of Intervals, matches only exactly
> • Users should use df[df.index.covers(values)] for non-strict matching with iterables of Intervals

This sounds good to me!

I think the original intent was to match fully contained intervals only, but clearly that logic is not working right. In any case it's certainly better (simpler / more explicit) to switch to requiring specific methods for this functionality. It seems like we may need at least three methods for handling IntervalIndex/IntervalIndex matches, e.g., IntervalIndex.covers, IntervalIndex.covered_by and IntervalIndex.overlaps. Possibly worth adding these methods to Interval, too (at least Interval.overlaps()).

alexlenail (Contributor, Author) commented on May 13, 2017

@TomAugspurger

I think your ideas represent a good compromise. A couple concerns though:

> (which is what happens currently)

  • The current behavior is definitely wrong with respect to what you suggest it should be, so this isn't quite right. Do you agree?

  • I'm not sure whether we need both covers and covered_by given that they seem to be perfect opposites? a.covers(b) == b.covered_by(a) right? In which case you can just flip the expression and don't need both. Let me know if I'm misreading this.
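
The question in the second bullet can be checked with a small pure-Python model. This sketch assumes containment semantics on closed `(left, right)` tuples and follows the mask shapes given in the proposed docstrings; all function names here are hypothetical.

```python
def contains(outer, inner):
    """True when `inner` lies fully inside `outer` (closed endpoints)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def covers(self_ivs, other_ivs):
    """Mask over `other_ivs`: is each item inside some interval of `self_ivs`?"""
    return [any(contains(s, o) for s in self_ivs) for o in other_ivs]


def covered_by(self_ivs, other_ivs):
    """Mask over `self_ivs`: is each item inside some interval of `other_ivs`?"""
    return [any(contains(o, s) for o in other_ivs) for s in self_ivs]


a = [(1, 5), (7, 8)]
b = [(2, 3), (4, 7), (7, 8)]

a_covers_b = covers(a, b)
b_covered_by_a = covered_by(b, a)
b_covers_a = covers(b, a)
print(a_covers_b, b_covered_by_a, b_covers_a)
```

In this model `covers(a, b)` and `covered_by(b, a)` give the same mask over `b`, while `covers(b, a)` answers a different question and has the length of `a`.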

28 remaining items

alexlenail (Contributor, Author) commented on Dec 30, 2017

@shoyer returning to

from __future__ import annotations

# Proposed signatures only; repeated names stand in for overloads.
class Interval:
    def covers(self, other: Interval) -> bool: ...
    def covers(self, other: IntervalIndex) -> IntegerArray1D: ...
    def overlaps(self, other: Interval) -> bool: ...
    def overlaps(self, other: IntervalIndex) -> IntegerArray1D: ...

class IntervalIndex:
    def covers(self, other: Interval) -> IntegerArray1D: ...
    def covers(self, other: IntervalIndex) -> Tuple[IntegerArray1D, IntegerArray1D]: ...
    def overlaps(self, other: Interval) -> IntegerArray1D: ...
    def overlaps(self, other: IntervalIndex) -> Tuple[IntegerArray1D, IntegerArray1D]: ...

What were you thinking w.r.t. the relationship between IntervalIndex.overlaps(interval) and interval.overlaps(IntervalIndex) (and same question for .covers()) ?

shoyer (Member) commented on Dec 30, 2017

> What were you thinking w.r.t. the relationship between IntervalIndex.overlaps(interval) and interval.overlaps(IntervalIndex) (and same question for .covers()) ?

It's been a while since I thought about it, but my initial thought would be that interval.overlaps(interval_index) and interval_index.overlaps(interval) would return the same thing.
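
The symmetry described here can be illustrated with the scalar and index methods that later pandas releases provide. This sketch assumes `Interval.overlaps(Interval)` and `IntervalIndex.overlaps(Interval)` are available; the interval-to-index direction is emulated with a list comprehension rather than a real method call, and it returns boolean masks rather than the integer arrays in the proposed signatures.

```python
import numpy as np
import pandas as pd

idx = pd.IntervalIndex.from_tuples([(1, 5), (7, 8), (10, 12)])
iv = pd.Interval(4, 7.5)

# Index-to-interval direction: one boolean per entry of idx.
from_index = idx.overlaps(iv)

# Interval-to-index direction, emulated element by element.
from_scalar = np.array([iv.overlaps(other) for other in idx])

print(from_index, from_scalar)
```
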

alexlenail (Contributor, Author) commented on Dec 30, 2017

@shoyer okay, that's what I did, currently in #18975. Take a look if you get a chance =)

Note @jreback that this means the function test_interval_covers_intervalIndex isn't any different from the function test_intervalIndex_covers_interval, and test_interval_overlaps_intervalIndex is the same as test_intervalIndex_overlaps_interval, in that the should_overlap and should_cover objects are identical.

jreback (Contributor) commented on Apr 11, 2018

came across this library: https://github.com/AlexandreDecan/python-intervals

looks to have some interesting interval semantics

cc @jschendel

jorisvandenbossche (Member) commented on Jun 20, 2019

@jschendel this might be a possible topic to work on during the sprint (if you are interested of course)?

It would be good to have this indexing behaviour clean-up in 0.25, before 1.0 (it will break some behaviour, I think, and I'm not sure we can do it with deprecations).

jschendel (Member) commented on Jun 20, 2019

Yeah, seems like a good topic to work on during the sprint. I've done a bit of work on this already but have been a bit lazy on finishing it up.

modified the milestones: Contributions Welcome, 0.25.0 on Jun 28, 2019
      Participants

      @jreback@jorisvandenbossche@shoyer@TomAugspurger@sinhrks

        Confusing (possibly buggy) IntervalIndex behavior · Issue #16316 · pandas-dev/pandas