Closed
Labels
Indexing: Related to indexing on series/frames, not to indexes themselves
Interval: Interval data type
Description
In the above, I have a region that I'm querying for with a partially overlapping interval. The query succeeds when the interval is partially overlapping until it doesn't, throwing the key error:
KeyError Traceback (most recent call last)
/Users/alex/Documents/GarNet/venv/lib/python3.6/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
1433 if not ax.contains(key):
-> 1434 error()
1435 except TypeError as e:
/Users/alex/Documents/GarNet/venv/lib/python3.6/site-packages/pandas/core/indexing.py in error()
1428 raise KeyError("the label [%s] is not in the [%s]" %
-> 1429 (key, self.obj._get_axis_name(axis)))
1430
KeyError: 'the label [(5409951, 5409965]] is not in the [index]'
I think this is particularly confusing because there doesn't seem to be any prominent difference between the locs that succeed and the loc that fails as far as I can tell. I know we had discussed loc's behavior in this context but I'm not sure we came to a conclusion.
By the way, my larger question is about how to find intersections between two IntervalIndex. It seems like the find_intersections function didn't make it into this release @jreback ? Let me know! =]
Activity
jreback commented on May 10, 2017
can't u change the post to show code that constructs a minimal example
pictures are not very useful
alexlenail commented on May 10, 2017
@jreback
Make your dataframe
Test all these:
I think that documentation as to what the behavior should be might be helpful. I can't make heads nor tails of what loc thinks it should do in this case, so I can't even tell if this is a bug.
P.S. Still curious as to how you would address my meta-question about find_intersections...
jreback commented on May 11, 2017
well docs are here: http://pandas.pydata.org/pandas-docs/stable/advanced.html#intervalindex
Well we treat indexing with an Interval as an exact match: if it's there exactly it matches, otherwise you get a KeyError. If it's a point, it can be contained in an interval. Some of this is probably buggy because we didn't have any real examples.
What would be helpful, using this example:
is to enumerate what those cases above should do semantically. Further, having selections like df.loc[3]. And further still, what df.loc[pd.Interval(4,7.5)] would do. That would be really helpful.
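A minimal sketch of the rule described in this comment, written in plain Python rather than pandas so that it does not depend on any pandas version. The Interval class, the lookup helper and the sample intervals are all hypothetical stand-ins; this is not how pandas implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for pd.Interval, closed on the right: (left, right]."""
    left: float
    right: float

    def __contains__(self, point):
        return self.left < point <= self.right


def lookup(intervals, key):
    """Return the positions in `intervals` selected by `key`.

    Interval key: exact match only. Scalar key: every interval containing it.
    """
    if isinstance(key, Interval):
        hits = [i for i, iv in enumerate(intervals) if iv == key]
    else:
        hits = [i for i, iv in enumerate(intervals) if key in iv]
    if not hits:
        raise KeyError(key)
    return hits


# Assumed sample index, not the frame from the original post.
index = [Interval(0, 5), Interval(5, 10)]

print(lookup(index, 3))                # [0]  (a point inside (0, 5])
print(lookup(index, Interval(5, 10)))  # [1]  (exact match)
try:
    lookup(index, Interval(4, 7.5))    # overlaps both, equals neither
except KeyError:
    print("KeyError")
```

Under this reading, a partially overlapping Interval key always raises, regardless of how much it overlaps.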
jreback commented on May 11, 2017
further I still don't understand what you need from find_intersection
a complete example would be helpful. IOW, you have 2 frames as input (show the code) and what you think the output should be.
alexlenail commented on May 11, 2017
@jreback
Sorry, let me include the output from those .locs and explain why I think they might be buggy, to help clarify.
Intuitively, I want to say that 'a' is from 1 to 5, closed interval. All good here.
"Do I have any intervals from 1 to 5?" Yes. No surprise.
Do I have any intervals from -10 to 10? Yes. Oh, okay, so it's not exact matches. Okay.
Do I have any intervals from 3 to 5? Yes. Okay, that makes sense, it's a partial overlap with [1,5].
Do I have any intervals from 3 to 4? KeyError. Wait what?
I think this qualifies as "Confusing (possibly buggy) IntervalIndex behavior" but it might not -- I might just be thinking about this incorrectly. I'm happy to supplement the docs to clarify for people like me if that's the case.
Thanks!
jreback commented on May 12, 2017
was looking for more discussion of whether an Interval needs to be an exact match, or matches if other intervals are fully contained
what happens if it partially overlaps?
alexlenail commented on May 12, 2017
@jreback I think I might see the footprint of the bug. New dataframe: (same as before, just from 10 to 15)
alexlenail commented on May 12, 2017
So:
jreback commented on May 12, 2017
@zfrenchee I have enough examples, what I want to know is why you think this should work at all.
IOW, take cases and comment on if they should work or raise (KeyError or other)
df.loc[pd.Interval(10,15, closed='both')]
df.loc[pd.Interval(10, 15, closed='right')]
df.loc[pd.Interval(11, 14)]
df.loc[pd.Interval(11, 16)]
df.loc[pd.Interval(11, 13):pd.Interval(14, 15)]
df.loc[[pd.Interval(11, 13), pd.Interval(14, 15)]]
df.loc[12]
df.loc[9]
df.loc[9, 12]
df.loc[11:13]
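For the Interval-keyed cases in this list, the answer depends on how the query relates to the stored interval. The sketch below only names those relationships; it does not decide which of them should match or raise. It assumes right-closed intervals written as (left, right) tuples, and the relation helper is hypothetical, not a pandas function.

```python
def relation(query, stored):
    """Classify how `query` relates to `stored`.

    Both arguments are (left, right) tuples read as right-closed
    intervals (left, right]. Simplifying assumption: the `closed`
    side is the same for both and is not compared.
    """
    q_left, q_right = query
    s_left, s_right = stored
    if query == stored:
        return "exact"
    if q_left <= s_left and s_right <= q_right:
        return "query covers stored"
    if s_left <= q_left and q_right <= s_right:
        return "stored covers query"
    if q_left < s_right and s_left < q_right:
        return "partial overlap"
    return "disjoint"


stored = (10, 15)
for query in [(10, 15), (11, 14), (11, 16), (9, 16), (15, 20)]:
    print(query, "->", relation(query, stored))
# (10, 15) -> exact
# (11, 14) -> stored covers query
# (11, 16) -> partial overlap
# (9, 16) -> query covers stored
# (15, 20) -> disjoint
```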
alexlenail commented on May 12, 2017
I think there are two behaviors for loc which I think would make sense: exact match, or any overlap.
So reasonable behavior 1 is:
As a non-pandas expert, I believe this is most in keeping with what loc currently does on other index types, but I think this also obliterates the reason for having intervals, since you've essentially reduced the interval down to a token which you're selecting for.
The other possible behavior is that loc returns all overlaps:
I think what makes most sense is to use loc for the first of these, and define a new special overlap function to implement the second of these, which would have a signature like so:
or more likely:
This would return something sensible, like the indices from dataframe_with_intervalindex which overlap intervals in other_intervalindex_or_dataframe_with_intervalIndex. Actually returning the values from both should be left to merge(left_index=True, right_index=True)
Maybe it would be easier to define overlaps like so:
In that case it's a little harder to decide what to return, since you would want it to return the same thing as:
What do you think? @jreback
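To make the overlap idea concrete, here is a rough plain-Python sketch of what such a function could return for two collections of intervals. The names overlapping_pairs and overlap_mask and the sample data are made up for illustration; the signatures proposed above are not reproduced here, and pandas may expose this differently.

```python
def overlapping_pairs(left_intervals, right_intervals):
    """Return (i, j) pairs where left_intervals[i] overlaps right_intervals[j].

    Intervals are (start, end) tuples read as right-closed, (start, end].
    This is a simple O(n * m) scan; a large index would want an
    interval tree or a sorted sweep instead.
    """
    pairs = []
    for i, (a_start, a_end) in enumerate(left_intervals):
        for j, (b_start, b_end) in enumerate(right_intervals):
            if a_start < b_end and b_start < a_end:
                pairs.append((i, j))
    return pairs


def overlap_mask(left_intervals, right_intervals):
    """Boolean list marking which left intervals overlap anything on the right."""
    hit = {i for i, _ in overlapping_pairs(left_intervals, right_intervals)}
    return [i in hit for i in range(len(left_intervals))]


# Hypothetical data standing in for the two interval-indexed frames.
regions = [(1, 5), (10, 15), (20, 25), (30, 35)]
queries = [(3, 4), (14, 21)]

print(overlapping_pairs(regions, queries))  # [(0, 0), (1, 1), (2, 1)]
print(overlap_mask(regions, queries))       # [True, True, True, False]
```

The pair list is the information a merge would need, while the mask is the shape that boolean indexing expects.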
TomAugspurger commented on May 12, 2017
The closest analogy here is probably partial string indexing into Datetimes. We accept .loc['2017'] rather than .loc[pd.Timestamp(2017, ...)]. Stretching the analogy a bit further then, .loc should be "exact" when passed Intervals, and "non-exact" when passed the elements making up the Intervals (ints, strs, whatever). So these both return the same (which is what happens currently)
but, then
would raise a KeyError, since it doesn't match exactly.
Users wishing to do indexing by passing an IntervalIndex should use boolean indexing, and use some methods on IntervalIndex to assist
For more flexible indexing with iterables of Intervals, I propose we enhance IntervalIndex.contains (or maybe a new method) to accept an Iterable other and return an Iterable[bool] of the same length.
So to summarize:
df.loc with scalars or lists of non-Intervals: we match wherever it's covered by df.index
df.loc with scalar Intervals, or a list / IntervalIndex of Intervals: will match only exactly
df[df.index.covers(values)] for non-strict matching with iterables of Intervals
shoyer commented on May 12, 2017
This sounds good to me!
I think the original intent was to match fully contained intervals only, but clearly that logic is not working right. In any case it's certainly better (simpler / more explicit) to switch to requiring specific methods for this functionality. It seems like we may need at least three methods for handling IntervalIndex/IntervalIndex matches, e.g., IntervalIndex.covers, IntervalIndex.covered_by and IntervalIndex.overlaps. Possibly worth adding these methods to Interval, too (at least Interval.overlaps()).
alexlenail commented on May 13, 2017
@TomAugspurger
I think your ideas represent a good compromise. A couple concerns though:
The current behavior is definitely wrong with respect to what you suggest it should be, so this isn't quite right. Do you agree?
I'm not sure whether we need both covers and covered_by given that they seem to be perfect opposites? a.covers(b) == b.covered_by(a), right? In which case you can just flip the expression and don't need both. Let me know if I'm misreading this.
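A small plain-Python check of that identity, again assuming right-closed (left, right) tuples and hypothetical free functions rather than real IntervalIndex methods. It also shows that, applied element-wise against a list of intervals, the two directions produce different masks.

```python
from itertools import product


def covers(a, b):
    """True if right-closed interval a = (left, right] fully contains b."""
    return a[0] <= b[0] and b[1] <= a[1]


def covered_by(a, b):
    """True if a lies entirely inside b; by definition the flip of covers."""
    return covers(b, a)


def overlaps(a, b):
    """True if the right-closed intervals a and b share at least one point."""
    return a[0] < b[1] and b[0] < a[1]


samples = [(10, 15), (11, 14), (11, 16), (20, 25)]
pairs = list(product(samples, repeat=2))

duality_holds = all(covers(a, b) == covered_by(b, a) for a, b in pairs)
overlaps_symmetric = all(overlaps(a, b) == overlaps(b, a) for a, b in pairs)
print(duality_holds, overlaps_symmetric)  # True True

# Element-wise against a list of intervals, the direction matters:
index = [(10, 15), (11, 14), (20, 25)]
query = (11, 15)
print([covers(iv, query) for iv in index])      # [True, False, False]
print([covered_by(iv, query) for iv in index])  # [False, True, False]
```

So for two scalars one method is enough, but with an index on one side each spelling answers a different question about every element.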
alexlenail commented on Dec 30, 2017
@shoyer returning to
What were you thinking w.r.t. the relationship between IntervalIndex.overlaps(interval) and interval.overlaps(IntervalIndex) (and same question for .covers())?
shoyer commented on Dec 30, 2017
It's been a while since I thought about it, but my initial thought would be that interval.overlaps(interval_index) and interval_index.overlaps(interval) would return the same thing.
alexlenail commented on Dec 30, 2017
@shoyer okay, that's what I did, currently in #18975. Take a look if you get a chance =)
Note @jreback that means the functions
isn't any different from the function
and
is the same as
in that the should_overlap and should_cover objects are identical.
jreback commented on Apr 11, 2018
came across this library: https://github.com/AlexandreDecan/python-intervals
looks to have some interesting interval semantics
cc @jschendel
jorisvandenbossche commented on Jun 20, 2019
@jschendel this might be a possible topic to work on during the sprint (if you are interested of course)?
It would be good to have this indexing behaviour clean-up in 0.25, before 1.0 (as it will break some behaviour, I think; not sure we can do it with deprecations).
jschendel commented on Jun 20, 2019
Yeah, seems like a good topic to work on during the sprint. I've done a bit of work on this already but have been a bit lazy on finishing it up.