DISCUSS/API: setitem-like operations should only update inplace and never fallback with upcast (i.e never change the dtype) #39584

Description

@jorisvandenbossche
Member

Currently, setitem-like operations (i.e. operations that change values in an existing series or dataframe such as __setitem__ and .loc/.iloc setitem, or filling methods like fillna) first try to update in place, but if there is a dtype mismatch, pandas will upcast to a common dtype (typically object dtype).

For example, setting a string into an integer Series upcasts to object:

>>> s = pd.Series([1, 2, 3])
>>> s.loc[1] = "B"
>>> s
0    1
1    B
2    3
dtype: object

or doing a fillna with an invalid fill value also upcasts instead of raising an error:

>>> s = pd.Series(["2020-01-01", "NaT"], dtype="datetime64[ns]")
>>> s
0   2020-01-01
1          NaT
dtype: datetime64[ns]
>>> s.fillna(1)
0    2020-01-01 00:00:00
1                      1
dtype: object

My general proposal would be that at some point in the future (e.g. pandas 2.0, after a deprecation period), such inherently in-place operations should be guaranteed to either happen in place or raise an error, and thus never change the dtype of the original Series/DataFrame.
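
As a rough sketch of what that guarantee would mean for a user, the helper below tries the assignment on a copy and refuses it if the dtype would change. The name `set_preserving_dtype` is made up for illustration and is not a pandas API; the sketch only assumes that, depending on the pandas version, an incompatible assignment either upcasts (possibly with a warning) or raises.

```python
import warnings

import pandas as pd


def set_preserving_dtype(series, key, value):
    """Hypothetical helper: set series.loc[key] = value, or raise if the dtype would change."""
    trial = series.copy()
    try:
        with warnings.catch_warnings():
            # Some pandas versions warn before upcasting; silence that on the trial copy.
            warnings.simplefilter("ignore")
            trial.loc[key] = value
    except (TypeError, ValueError) as err:
        # Some pandas versions already reject the assignment themselves.
        raise TypeError(f"cannot set {value!r} into dtype {series.dtype}") from err
    if trial.dtype != series.dtype:
        raise TypeError(f"cannot set {value!r} into dtype {series.dtype}")
    # The trial run kept the dtype, so the real assignment cannot upcast.
    series.loc[key] = value


s = pd.Series([1, 2, 3])
set_preserving_dtype(s, 1, 20)       # compatible value: s is updated

try:
    set_preserving_dtype(s, 1, "B")  # incompatible value: rejected, s is untouched
    rejected = False
except TypeError:
    rejected = True
```

The trial copy makes this slower than a real implementation would be; it is only meant to make the "in place or error" contract concrete.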

This is similar to eg numpy's behaviour where setitem never changes the dtype. Showing the first example from above in equivalent numpy code:

>>> arr = np.array([1, 2, 3])
>>> arr[1] = "B"
...
ValueError: invalid literal for int() with base 10: 'B'

Apart from that, I also think this is the cleaner behaviour, with fewer surprises. If a user specifically wants to allow mixed types in a column, they can manually cast to object dtype first.
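
A minimal sketch of that explicit opt-in, using only `astype(object)` and `.loc`: the original integer Series is left alone, and the mixed-type assignment happens on an object-dtype copy.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Opt in to mixed types explicitly instead of relying on an implicit upcast.
s_obj = s.astype(object)
s_obj.loc[1] = "B"
```

Here `s` keeps its integer dtype, while `s_obj` holds `[1, "B", 3]` with object dtype.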

On the other hand, this is quite a big change from how permissive we generally are right now, where we easily upcast, and such a change will certainly impact quite some user code (but AFAIK it's perfectly possible to do this with proper deprecation warnings, issued in advance for the specific cases that will error in the future).

There are certainly some more details that need to be discussed as well if we want this (e.g. which exact values are regarded as compatible with the dtype: when setting a float in an integer column, should that error or silently round the float?). But what are people's thoughts on the general idea?
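
For reference on the float-into-integer question, numpy takes the silent route: the value is cast to the array's dtype, which drops the fractional part rather than rounding. This only illustrates one of the options, not what pandas should necessarily do.

```python
import numpy as np

arr = np.array([1, 2, 3])
arr[1] = 2.7   # no error: the value is cast to the integer dtype, giving 2
```

After this, `arr` still has an integer dtype and `arr[1]` is `2`, so the `0.7` is lost without any warning.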

cc @pandas-dev/pandas-core

Activity

jorisvandenbossche (Member, Author) commented on Feb 3, 2021

Sidenote: the extension arrays are actually already more strict on this (which is also needed, otherwise setitem could change the class of the object). But the upcasting logic lives a level higher on the Series/DataFrame, where the underlying array gets swapped when an upcast happens. So in other words, the proposal is to propagate that stricter behaviour of the arrays also to Series/DataFrame.
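
A small sketch of that stricter array-level behaviour, using a nullable integer array. The exact exception type may differ between pandas versions, so both `TypeError` and `ValueError` are caught here.

```python
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int64")

try:
    arr[1] = "B"        # the array cannot hold a string, so this is rejected
    raised = False
except (TypeError, ValueError):
    raised = True
```

The array keeps its `Int64` dtype and its original values; there is no object-dtype fallback at this level.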

toobaz (Member) commented on Feb 3, 2021

I tend to agree this is a step to take sooner or later. I don't think I ever met a case in which implicitly upcasting via a setitem on individual elements was a desired feature and not a bug.

Clearly (?), this should not apply to replacing an entire column of a DataFrame (or multiple columns) with new ones. Or if we want to state it more generally: dtype should not change unless the smallest dtype-bearing block (which is the column) is entirely replaced. And just for completeness: it would apply to df.loc[:, 'col'] = s but not to df['col'] = s (notice that the former currently replaces the dtype e.g. if col had previously int dtype and s has Timestamp, something that I suspect should not happen).

jbrockmendel (Member) commented on Feb 3, 2021

jorisvandenbossche (Member, Author) commented on Feb 4, 2021

Clearly (?), this should not apply to replacing an entire column of a DataFrame (or multiple columns) with new ones.

@toobaz Yes, thanks for explicitly stating that, as I forgot to mention it. Indeed, the proposal is about the cases where (a subset of) the values are changed in place, and not where we are replacing a full column.
So df["existing_col"] = new_values will still follow the dtype of the new values and be able to change the dtype of the column.
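
For example, replacing a full column this way simply takes over the dtype of the new values (a minimal sketch; the column name is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# Full-column replacement: the integer column becomes a datetime column.
df["col"] = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"])
```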

And just for completeness: it would apply to df.loc[:, 'col'] = s but not to df['col'] = s (notice that the former currently replaces the dtype e.g. if col had previously int dtype and s has Timestamp, something that I suspect should not happen).

Indeed, that's probably the distinction we want to make. There is some discussion about this in #38896 (comment)

[@jbrockmendel] 1) Does this apply to setitem-like ops on Index? In particular, if I add a new column to a DataFrame and doing so would require casting the existing columns, does that raise?

Good point to explicitly call out. I would say: yes, Index and Series should generally be consistent. But note that we don't allow direct (by the user) __setitem__ operations for Index anyway. So then it's mostly about methods like fillna, and there I think it can follow the behaviour we decide for Series (and see comment below for more detailed list of methods that are considered).
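
To illustrate the point about Index: direct item assignment is rejected outright, so only the method-based operations are relevant there.

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

try:
    idx[0] = 10                 # Index objects are immutable
    index_is_mutable = True
except TypeError:
    index_is_mutable = False
```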

To be explicit: this proposal does not cover concat/append methods or set operations (union, intersection, etc), as that class of functions inherently creates new objects (with a potentially different shape or order) and follows the upcasting rules (_get_common_dtype / find_common_type based). So operations like reset_index() where a value gets added to the .columns Index object are not affected (as that would indeed be annoying if that would raise).
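
A short sketch of the kind of upcasting that stays as it is, since these operations build a new object from a common dtype:

```python
import pandas as pd

ints = pd.Series([1, 2])
floats = pd.Series([0.5])

# Concatenation finds a common dtype for the result (float64 here).
combined = pd.concat([ints, floats], ignore_index=True)

# Appending string labels to an integer Index falls back to object dtype.
labels = pd.Index([1, 2]).append(pd.Index(["a"]))
```

Neither input is modified; `ints` still has its integer dtype afterwards.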

The case you mention of adding a new column to a DataFrame (df[col] = ...) with a label that isn't compatible with the columns Index.dtype, is handled under the hood as an Index.insert. I would also leave that out from the discussion here, as it's not a "pure" setitem (you can't express this as a setitem operation on the underlying array), and I would rather put it in the bucket of set operations following the upcasting rules.
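
A minimal example of that case: the columns start out as an integer Index and become object dtype once a string label is added, which mirrors what `Index.insert` does on its own.

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2], 1: [3, 4]})
before = df.columns.dtype            # integer labels

df["x"] = [5, 6]                     # new label does not fit an integer Index
after = df.columns.dtype             # object

# The same upcast when inserting directly into an Index.
inserted = pd.Index([0, 1]).insert(2, "x")
```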

3) Some of the casting is a result of fallback-on-failure, but other pieces are due to downcasting-on-success. Is the proposal to change these behaviors too?

Let's leave that for a separate issue to discuss, as in theory that's orthogonal in implementation (although certainly related). AFAIK we never do that for actual setitem, but only in a few specific methods in the internals (fillna, interpolate, where, replace).
(I will open another issue for it)

toobaz (Member) commented on Feb 4, 2021

But note that we don't allow direct (by the user) __setitem__ operations for Index anyway. So then it's mostly about methods like fillna, and there I think it can follow the behaviour we decide for Series.

Not sure I follow: Index.fillna does not resemble __setitem__ to me: it does not modify data, rather it creates and returns a copy. I see it more as a sort of arithmetic operation...
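
This can be checked directly: `Index.fillna` hands back a new Index and leaves the original untouched.

```python
import numpy as np
import pandas as pd

idx = pd.Index([1.0, np.nan, 3.0])
filled = idx.fillna(0.0)    # returns a new Index; idx still contains the NaN
```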

jorisvandenbossche (Member, Author) commented on Feb 4, 2021

Can we get a complete-ish list of what constitutes setitem-like? (e.g. above I'm assuming Index.append counts, but that can't be inplace, so it might reasonably be excluded)

The append method is indeed excluded, as I would label that as a concatenation and not a setitem (it also can't be expressed in setitem operations on the array level)

But indeed good to think about a more complete list. After going through the namespace of Series/DataFrame, I think there are basically two groups:

  • Actual __setitem__ operations (directly called by the user), which includes plain __setitem__ (obj[..] = ..) and loc/iloc/at/iat's setitem
  • Methods that in theory can happen in place (in practice we mostly still return a new object by default, though, unless you specify explicitly inplace=True). These are methods that can be expressed as a setitem operation (eg fillna() can be expressed as arr[arr.isna()] = fill_value).
    I think the more or less full list is:
    • fillna and interpolate
    • replace
    • update
    • where, mask
    • clip
    • (for Index we also have putmask, but this already seems a bit inconsistent in its upcasting behaviour)
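
To make the "can be expressed as a setitem" criterion concrete, here is a sketch for two of the listed methods, `fillna` and `clip`, written as boolean-mask assignments on a copy:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

# fillna as a masked setitem
filled = s.fillna(0.0)
manual_fill = s.copy()
manual_fill[manual_fill.isna()] = 0.0

# clip (upper bound only) as a masked setitem
clipped = s.clip(upper=2.0)
manual_clip = s.copy()
manual_clip[manual_clip > 2.0] = 2.0
```

In both cases the method result and the mask-based version hold the same values, and the shape never changes.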

So any other method that potentially changes the shape is not included (eg append). Those can also typically not be expressed as setitem equivalent (setitem with arrays cannot expand).
One remark here is that our setitem on DataFrame and Series can expand. Potentially we could think about whether we want to deviate from the rule for such expanding setitem, if desired (but I am not directly arguing for it, you can start with object dtype if you want to expand with arbitrary objects).
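
The difference with plain arrays looks like this: assigning to a missing label grows a Series, while assigning past the end of a numpy array raises.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
s.loc[3] = 4                # label 3 does not exist yet, so the Series grows

arr = np.array([1, 2, 3])
try:
    arr[3] = 4              # position 3 is out of bounds for the array
    array_expanded = True
except IndexError:
    array_expanded = False
```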

For the above list, I would say that direct __setitem__ is all included in the proposal, and for the methods of the second group, we might need to decide on a case by case basis.
For example, my feeling says that for fillna it is logical to preserve the dtype, while for replace this might be less convenient

jorisvandenbossche (Member, Author) commented on Feb 4, 2021

@toobaz does my last comment clarify that? (we could also keep the two groups (actual setitem vs methods) as two separate discussions, if that helps)

toobaz (Member) commented on Feb 4, 2021

@toobaz does my last comment clarify that? (we could also keep the two groups (actual setitem vs methods) as two separate discussions, if that helps)

@jorisvandenbossche it does clarify, although as you pointed out, members of your second group of methods are more heterogeneous in their behavior and hence a general rule of (not) upcasting might be hard to enforce (harder than just saying "please be aware that every time you use inplace=True, pandas might upcast, in the sense that it will behave precisely as inplace=False except that it changes the object you're referencing rather than returning a new reference").

To the extent that we (at least in a first stage) focus on the first group, then, Indexes are excluded from the discussion.

(In general, as long as inplace=True will exist in pandas, I think we'll want to stress it's mostly about syntax, and possibly, but not necessarily, about implementation... which means the last thing we want is the implementation of inplace=False to depend on the possibility of inplace=True)

Dr-Irv (Contributor) commented on Feb 4, 2021

(In general, as long as inplace=True will exist in pandas, I think we'll want to stress it's mostly about syntax, and possibly, but not necessarily, about implementation... which means the last thing we want is the implementation of inplace=False to depend on the possibility of inplace=True)

This might be heresy, but maybe this proposal should be considered in conjunction with the idea of getting rid of inplace=True as an option (personally, I always use the default of inplace=False). Discussion in #16529. And if we were to consider getting rid of inplace=True, then does that change the nature of the list that @jorisvandenbossche provided above?

toobaz (Member) commented on Feb 4, 2021

does that change the nature of the list that @jorisvandenbossche provided above?

I stated my opinion above: if we keep inplace=True, then it should just mimic inplace=False and hence not influence our decision here; if we ever drop inplace, then even more so.

jorisvandenbossche (Member, Author) commented on Feb 4, 2021

I also don't think that the discussion on keeping the inplace keyword or not would influence this discussion much (and if we agree on that, it's probably best to leave it as a separate discussion to keep this one manageable). But I am fully supportive of rethinking the inplace keyword in #16529.

A method like fillna is one of the few methods where inplace=True actually can work. But right now even for those cases you are not sure that inplace=True actually did it inplace without making a copy, because of the upcasting behaviour. So with this proposal (if we include fillna in it), the situation improves a bit since inplace=True can now actually be guaranteed to be inplace (and the same will be true for any alternative for inplace we might want, like copy=False). That's a nice side effect, but for me personally not the main driver to argue for it. I mainly would prefer that fillna has predictable behaviour with a dtype that gets preserved (also for the default case of returning a new object).


Labels: API Design; Dtype Conversions (Unexpected or buggy dtype conversions); Indexing (Related to indexing on series/frames, not to indexes themselves); Needs Discussion (Requires discussion from core team before further action)

Participants: @jorisvandenbossche, @toobaz, @bashtage, @jbrockmendel, @mroeschke