Skip to content

API: __getitem__[int] vs __setitem__[int] with Float64Index #33469

Closed
@jbrockmendel

Description

@jbrockmendel
Member
ser = pd.Series([1, 2, 3, 4], index=[1.1, 2.1, 3.0, 4.1])

>>> ser[5]                    # <--we are treating 5 as a label
KeyError: 5

>>> ser[5] = 5                # < --we are treating 5 as position
IndexError: index 5 is out of bounds for axis 0 with size 4

>>> ser[3] = 5                # <-- we are treating 3 as a label
>>> ser
1.1    1
2.1    2
3.0    5
4.1    4
dtype: int64

The ser[5] = 5 case is an outlier because instead of having its label-vs-positional behavior determined by ser.index._should_fallback_to_positional, it is defermined by if is_integer(key) and not self.index.inferred_type == "integer":

Activity

added
API - ConsistencyInternal Consistency of API/Behavior
IndexingRelated to indexing on series/frames, not to indexes themselves
and removed
Needs TriageIssue that has not been reviewed by a pandas team member
on Apr 13, 2020
jorisvandenbossche

jorisvandenbossche commented on Sep 7, 2020

@jorisvandenbossche
Member

This API is certainly a mess .. ;)

The proposal is to change the setitem case where it fallbacks to positional to also be strictly label-based?

But so the integer would also be interpreted as label, and thus do enlargement in case it is not yet present?

Using your example series:

In [40]: ser
Out[40]: 
1.1    1
2.1    2
3.0    3
4.1    4
dtype: int64

In [41]: ser[2] = 100

In [42]: ser
Out[42]: 
1.1      1
2.1      2
3.0    100
4.1      4
dtype: int64

In [43]: ser[2.0] = -100

In [44]: ser
Out[44]: 
1.1      1
2.1      2
3.0    100
4.1      4
2.0   -100
dtype: int64

I was first thinking this would be a relatively safe change, since it right now already tries label-based for setitem first (so if the integer key is present as a label, it would already do label-based now). So the potential difference in behaviour is only when the label is not present. Right now that can update a value, but in the future add another value like the above? Which is kind of silently resulting in different data. We could also deprecate integer keys for setitem first or raise an error for it?

jbrockmendel

jbrockmendel commented on Sep 8, 2020

@jbrockmendel
MemberAuthor

We could also deprecate integer keys for setitem first or raise an error for it?

Do you mean "deprecate ever treating integer as a label for Float64Index"? Or "deprecate allowing ser[integer] with Float64Index"? The first seems reasonable.

I'd be OK with "integers are always considered labels" or "integers are always considered positional", so there is no question of fall-back behavior.

jbrockmendel

jbrockmendel commented on Jun 9, 2021

@jbrockmendel
MemberAuthor

Options that come to mind for fixing this:

  1. Deprecate casting ints to floats in Float64Index.get_loc, so that ser[4] is always positional when we have a Float64Index.

  2. Change Float64Index._should_fallback_to_positional to return not (self == self.astype(int)).any() (ATM it always returns False).

    • Upsides: Fixes the OP issue without breaking any tests, fixes the inconsistency of how we check for fallback in Series.__setitem__ vs Series.__getitem__
    • Downsides: perf (caching would help some), value-dependent behavior
  3. Change Float64Index._should_fallback_to_positional to always return True.

    • Breaks 5 tests, 4 of which are checking that we raise, the last is trivial to fix.
jorisvandenbossche

jorisvandenbossche commented on Jun 10, 2021

@jorisvandenbossche
Member

1. Deprecate casting ints to floats in Float64Index.get_loc, so that ser[4] is always positional when we have a Float64Index.

Currently I think ser[4] is always strictly label-based for getitem? Since that part is actually consistent / unambiguous (if you know the semantics ..) at this moment, I would maybe leave it as is. Or if we deprecate it, have it raise an error in the future instead of being positional.

2. Change Float64Index._should_fallback_to_positional to return not (self == self.astype(int)).any() (ATM it always returns False).

  • Downsides: perf (caching would help some), value-dependent behavior

For that last reason (value-dependent behavior), I wouldn't go for this. ser[2] meaning something different for pd.Series([..], index=[1.1, 2.0]) vs pd.Series([..], index=[1.0, 2.0]) seems quite confusing or surprising.

3. Change Float64Index._should_fallback_to_positional to always return True.

What's the behavioral consequence of this? So that if the key (cast to float) is not present, always use positional?

Is a 4th option to change Float64Index._should_fallback_to_positional to always return False? (like it is for Int64Index?)

jbrockmendel

jbrockmendel commented on Jun 10, 2021

@jbrockmendel
MemberAuthor

Note: in all the options described above, we are changing in Series.__setitem__

-            if is_integer(key) and self.index.inferred_type != "integer":
+            if is_integer(key) and self.index._should_fallback_to_positional():
  1. Change Float64Index._should_fallback_to_positional to always return True.

What's the behavioral consequence of this? So that if the key (cast to float) is not present, always use positional?

If we do that, and add in __getitem__ a clause so that we try label-based before falling back, then we break 5 tests e.g.

    def test_floating_misc(self):
        ser = Series(np.arange(5), index=np.arange(5) * 2.5, dtype=np.int64)

        with pytest.raises(KeyError, match=r"^4$"):
>           ser[4]
E           Failed: DID NOT RAISE <class 'KeyError'>

(without the extra clause in __getitem__ we break 9 tests, including a few that aren't with pytest.raises(...))

Is a 4th option to change Float64Index._should_fallback_to_positional to always return False? (like it is for Int64Index?)

That is the status quo for _should_fallback_to_positional. If we change the line in Series.__setitem__ and leave everything else unchanged, we break 1 test (huh, thought it was more:

    def test_setitem_index_float64(self):
        val = 5
        obj = pd.Series([1, 2, 3, 4], index=[1.1, 2.1, 3.1, 4.1])
        assert obj.index.dtype == np.float64
    
        # float + int -> int
        temp = obj.copy()
        msg = "index 5 is out of bounds for axis 0 with size 4"
        with pytest.raises(IndexError, match=msg):
>           temp[5] = 5
E           Failed: DID NOT RAISE <class 'IndexError'>
jreback

jreback commented on Jun 24, 2021

@jreback
Contributor

IIUC I think integers should always be label based for FloatIndex. so this means deprecating (and then changing) the setitem behavior.
getitem would be unchanged i think.

added this to the 1.4 milestone on Jun 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    API - ConsistencyInternal Consistency of API/BehaviorAPI DesignEnhancementIndexingRelated to indexing on series/frames, not to indexes themselves

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @jreback@jorisvandenbossche@jbrockmendel@mroeschke

      Issue actions

        API: __getitem__[int] vs __setitem__[int] with Float64Index · Issue #33469 · pandas-dev/pandas