DISCUSS/API: setitem-like operations should only update inplace and never fallback with upcast (i.e never change the dtype) #39584


Closed

Labels

API Design, Dtype Conversions, Indexing, Needs Discussion

jorisvandenbossche

opened

on Feb 3, 2021

Member

Currently, setitem-like operations (i.e. operations that change values in an existing series or dataframe such as __setitem__ and .loc/.iloc setitem, or filling methods like fillna) first try to update in place, but if there is a dtype mismatch, pandas will upcast to a common dtype (typically object dtype).

For example, setting a string into an integer Series upcasts to object:

>>> s = pd.Series([1, 2, 3])
>>> s.loc[1] = "B"
>>> s
0    1
1    B
2    3
dtype: object

or doing a fillna with an invalid fill value also upcasts instead of raising an error:

>>> s = pd.Series(["2020-01-01", "NaT"], dtype="datetime64[ns]")
>>> s
0   2020-01-01
1          NaT
dtype: datetime64[ns]
>>> s.fillna(1)
0    2020-01-01 00:00:00
1                      1
dtype: object

My general proposal would be that at some point in the future (eg pandas 2.0, after a deprecation), such inherently inplace operations should be guaranteed to either happen in place or raise an error, and thus never change the dtype of the original Series/DataFrame.

This is similar to eg numpy's behaviour where setitem never changes the dtype. Showing the first example from above in equivalent numpy code:

>>> arr = np.array([1, 2, 3])
>>> arr[1] = "B"
...
ValueError: invalid literal for int() with base 10: 'B'

Apart from that, I also think this is the cleaner behaviour with fewer surprises. If a user specifically wants to allow mixed types in a column, they can manually cast to object dtype first.
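As a minimal sketch of that explicit opt-in (the values are only illustrative), casting to object before the assignment makes the mixed-type column a deliberate choice rather than a side effect of the setitem:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Opt in to mixed types explicitly: cast to object first, then set.
s = s.astype(object)
s.loc[1] = "B"

print(s.dtype)     # object
print(s.tolist())  # [1, 'B', 3]
```

Because the Series is already object dtype when the string is set, no upcast is needed, so this works the same whether or not setitem is allowed to change the dtype.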

On the other hand, this is quite a big change from how permissive we generally are right now, where we easily upcast, and such a change will certainly impact quite some user code (but AFAIK it's perfectly possible to do this with proper deprecation warnings in advance, warning for the specific cases where it will error in the future).

There are certainly some more details that need to be discussed as well if we want this (which exact values are regarded as compatible with the dtype, eg setting a float in an integer column, should that error or silently round the float?). But what are people's thoughts on the general idea?
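To make the float-into-integer question concrete, here is a small sketch of what numpy does today, plus one possible compatibility rule. The helper name `is_lossless_int` is hypothetical and only illustrates the "accept whole numbers, reject the rest" option; it is not an existing pandas or numpy function.

```python
import numpy as np

arr = np.array([1, 2, 3])
arr[1] = 9.7          # no error: numpy silently drops the fractional part
print(arr.tolist())   # [1, 9, 3]


def is_lossless_int(value):
    """Hypothetical rule: a float fits an integer column only if it is whole."""
    return float(value).is_integer()


print(is_lossless_int(2.0))  # True
print(is_lossless_int(9.7))  # False
```

Under such a rule, setting 2.0 would be accepted and stored as 2, while setting 9.7 would raise instead of being truncated.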

cc @pandas-dev/pandas-core


Member, Author

Sidenote: the extension arrays are actually already more strict on this (which is also needed, otherwise setitem could change the class of the object). But the upcasting logic lives a level higher on the Series/DataFrame, where the underlying array gets swapped when an upcast happens. So in other words, the proposal is to propagate that stricter behaviour of the arrays also to Series/DataFrame.
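A short sketch of that stricter array-level behaviour, using the nullable Int64 array as an example. The exact exception type and message may differ between pandas versions, so the snippet catches both TypeError and ValueError:

```python
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int64")

try:
    arr[1] = "B"
    outcome = "set"
except (TypeError, ValueError) as exc:
    # The array refuses the value instead of changing its own type.
    outcome = "raised " + type(exc).__name__

print(outcome)
print(arr.dtype)  # Int64
```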

toobaz

Member

I tend to agree this is a step to take sooner or later. I don't think I ever met a case in which implicitly upcasting via a setitem on individual elements was a desired feature and not a bug.

Clearly (?), this should not apply to replacing an entire column of a DataFrame (or multiple columns) with new ones. Or if we want to state it more generally: dtype should not change unless the smallest dtype-bearing block (which is the column) is entirely replaced. And just for completeness: it would apply to df.loc[:, 'col'] = s but not to df['col'] = s (notice that the former currently replaces the dtype e.g. if col had previously int dtype and s has Timestamp, something that I suspect should not happen).

jbrockmendel

Member

I am generally positively disposed towards this idea. Some things that are not obvious:

1) Does this apply to setitem-like ops on Index? In particular, if I add a new column to a DataFrame and doing so would require casting the existing columns, does that raise?
2) Can we get a complete-ish list of what constitutes setitem-like? (e.g. above I'm assuming Index.append counts, but that can't be inplace, so it might reasonably be excluded)
3) Some of the casting is a result of fallback-on-failure, but other pieces are due to downcasting-on-success (see Block._maybe_downcast and Block.convert; affected methods include where, fillna, interpolate, replace). Is the proposal to change these behaviors too?
4) Because the fallback-on-failure only occurs after a failure, it would be fairly cheap to do a pd.get_option lookup (or an obj.flags lookup) to decide on cast-vs-raise. The major downside would be having to test/support two variants, the upside is that the It Just Works behavior is often really convenient.
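The cast-vs-raise lookup from point 4 could look roughly like the sketch below. Everything here is an assumption for illustration: `OPTIONS`, `fits_dtype` and `set_with_policy` are hypothetical names, the dict stands in for an option or flags lookup (it is not a real pandas option), and the compatibility check only handles simple integer columns.

```python
import pandas as pd

# Hypothetical stand-in for an option/flags lookup; not a real pandas option.
OPTIONS = {"on_mismatch": "raise"}  # or "upcast"


def fits_dtype(value, dtype):
    """True if value round-trips through the numpy scalar type unchanged."""
    try:
        return bool(dtype.type(value) == value)
    except (TypeError, ValueError, OverflowError):
        return False


def set_with_policy(s, key, value):
    """Set s.loc[key] = value; consult the policy only when the fast path fails."""
    if fits_dtype(value, s.dtype):
        s.loc[key] = value          # in place, dtype preserved
        return s
    if OPTIONS["on_mismatch"] == "raise":
        raise TypeError(f"{value!r} is not compatible with dtype {s.dtype}")
    s = s.astype(object)            # explicit fallback: upcast, then set
    s.loc[key] = value
    return s


s = pd.Series([1, 2, 3], dtype="int64")
print(set_with_policy(s, 1, 5).tolist())  # [1, 5, 3]
```

The policy is only read on the failure path, which is what keeps the lookup cheap; the cost is that both branches would need tests.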


jorisvandenbossche

Member, Author

Clearly (?), this should not apply to replacing an entire column of a DataFrame (or multiple columns) with new ones.

@toobaz Yes, thanks for explicitly stating that, as I forgot to mention it. Indeed, the proposal is about the cases where (a subset of) the values are changed in place, and not where we are replacing a full column.
So df["existing_col"] = new_values will still follow the dtype of the new values and be able to change the dtype of the column.
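A small sketch of the full-column case that stays out of scope (the column name and values are illustrative): assigning a whole column through `df[...] = ...` swaps in the new values together with their dtype.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
print(df["a"].dtype.kind)    # i (an integer dtype)

# Full-column replacement: the column takes the dtype of the new values.
df["a"] = [1.5, 2.5, 3.5]
print(df["a"].dtype)         # float64
```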

And just for completeness: it would apply to df.loc[:, 'col'] = s but not to df['col'] = s (notice that the former currently replaces the dtype e.g. if col had previously int dtype and s has Timestamp, something that I suspect should not happen).

Indeed, that's probably the distinction we want to make. There is some discussion about this in #38896 (comment)

[@jbrockmendel] 1) Does this apply to setitem-like ops on Index? In particular, if I add a new column to a DataFrame and doing so would require casting the existing columns, does that raise?

Good point to explicitly call out. I would say: yes, Index and Series should generally be consistent. But note that we don't allow direct (by the user) __setitem__ operations for Index anyway. So then it's mostly about methods like fillna, and there I think it can follow the behaviour we decide for Series (and see comment below for more detailed list of methods that are considered).

To be explicit: this proposal does not cover concat/append methods or set operations (union, intersection, etc), as this class of functions inherently creates new objects (with a potentially different shape or order) and follows the upcasting rules (_get_common_dtype / find_common_type based). So operations like reset_index() where a value gets added to the .columns Index object are not affected (as it would indeed be annoying if that raised).

The case you mention of adding a new column to a DataFrame (df[col] = ...) with a label that isn't compatible with the columns Index.dtype, is handled under the hood as an Index.insert. I would also leave that out from the discussion here, as it's not a "pure" setitem (you can't express this as a setitem operation on the underlying array), and I would rather put it in the bucket of set operations following the upcasting rules.
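For contrast, a sketch of the operations that are left out of the proposal: they build new objects and follow the common-dtype rules instead of raising. The labels and values are illustrative.

```python
import pandas as pd

# Index.insert returns a new Index; an incompatible label upcasts to object.
idx = pd.Index([1, 2, 3])
new_idx = idx.insert(3, "a")
print(new_idx.dtype)   # object
print(len(idx))        # 3 (the original Index is untouched)

# Concatenation likewise finds a common dtype for the new object.
combined = pd.concat([pd.Series([1, 2]), pd.Series([0.5])], ignore_index=True)
print(combined.dtype)  # float64
```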

3) Some of the casting is a result of fallback-on-failure, but other pieces are due to downcasting-on-success. Is the proposal to change these behaviors too?

Let's leave that for a separate issue to discuss, as in theory that's orthogonal in implementation (although certainly related). AFAIK we never do that for actual setitem, but only in a few specific methods in the internals (fillna, interpolate, where, replace).
(I will open another issue for it)

toobaz

Member

But note that we don't allow direct (by the user) __setitem__ operations for Index anyway. So then it's mostly about methods like fillna, and there I think it can follow the behaviour we decide for Series.

Not sure I follow: Index.fillna does not resemble __setitem__ to me: it does not modify data, rather it creates and returns a copy. I see it more as a sort of arithmetic operation...
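A quick sketch of the behaviour being described here: `Index.fillna` hands back a new Index and leaves the original alone, and direct item assignment on an Index is rejected.

```python
import numpy as np
import pandas as pd

idx = pd.Index([1.0, np.nan, 3.0])

filled = idx.fillna(0.0)
print(filled.tolist())  # [1.0, 0.0, 3.0]
print(idx.hasnans)      # True (the original still holds the NaN)

try:
    idx[1] = 0.0
    setitem_allowed = True
except TypeError as exc:
    setitem_allowed = False
    print("TypeError:", exc)
```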

jorisvandenbossche

Member, Author

Can we get a complete-ish list of what constitutes setitem-like? (e.g. above I'm assuming Index.append counts, but that cant be inplace, so it might reasonably be excluded)

The append method is indeed excluded, as I would label that as a concatenation and not a setitem (it also can't be expressed in setitem operations on the array level)

But indeed good to think about a more complete list. After going through the namespace of Series/DataFrame, I think there are basically two groups:

1) Actual __setitem__ operations (directly called by the user), which includes plain __setitem__ (obj[..] = ..) and loc/iloc/at/iat's setitem
2) Methods that in theory can happen in place (in practice we mostly still return a new object by default, though, unless you specify explicitly inplace=True). These are methods that can be expressed as a setitem operation (eg fillna() can be expressed as arr[arr.isna()] = fill_value).
I think the more or less full list is:
- fillna and interpolate
- replace
- update
- where, mask
- clip
- (for Index we also have putmask, but this already seems a bit inconsistent in its upcasting behaviour)
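To illustrate what "can be expressed as a setitem operation" means for the second group, here is a sketch on a plain numpy array. The array and thresholds are made up, and `np.isnan` plays the role of `isna` for a float array.

```python
import numpy as np

arr = np.array([1.0, np.nan, 7.0, -4.0])

# fillna(0.0) as a masked assignment
filled = arr.copy()
filled[np.isnan(filled)] = 0.0

# clip(lower=0.0, upper=5.0) as two masked assignments
clipped = filled.copy()
clipped[clipped < 0.0] = 0.0
clipped[clipped > 5.0] = 5.0

# mask(cond, other): overwrite the positions where cond is True
masked = filled.copy()
masked[masked > 5.0] = -1.0

print(filled.tolist())   # [1.0, 0.0, 7.0, -4.0]
print(clipped.tolist())  # [1.0, 0.0, 5.0, 0.0]
print(masked.tolist())   # [1.0, 0.0, -1.0, -4.0]
```

In each case the array keeps its shape and its float64 dtype, which is the property the proposal wants these methods to share with a real setitem.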

So any other method that potentially changes the shape is not included (eg append). Those can also typically not be expressed as setitem equivalent (setitem with arrays cannot expand).
One remark here is that our setitem on DataFrame and Series can expand. Potentially we could think about whether we want to deviate from the rule for such expanding setitem, if desired (but I am not directly arguing for it, you can start with object dtype if you want to expand with arbitrary objects).
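A sketch of that difference between pandas and numpy (labels and values are illustrative): setting a label that does not exist yet enlarges a Series, while the equivalent numpy assignment fails because arrays cannot grow through setitem.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
s.loc[3] = 4          # label 3 does not exist yet, so the Series grows
print(len(s))         # 4

arr = np.array([1, 2, 3])
try:
    arr[3] = 4
    expanded = True
except IndexError as exc:
    expanded = False
    print("IndexError:", exc)
```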

For the above list, I would say that direct __setitem__ is all included in the proposal, and for the methods of the second group, we might need to decide on a case by case basis.
For example, my feeling says that for fillna it is logical to preserve the dtype, while for replace this might be less convenient

jorisvandenbossche

Member, Author

@toobaz does my last comment clarify that? (we could also keep the two groups (actual setitem vs methods) as two separate discussions, if that helps)

toobaz

Member

@toobaz does my last comment clarify that? (we could also keep the two groups (actual setitem vs methods) as two separate discussions, if that helps)

@jorisvandenbossche it does clarify, although as you pointed out, members of your second group of methods are more heterogeneous in their behavior and hence a general rule of (not) upcasting might be hard to enforce (harder than just saying "please be aware that every time you use inplace=True, pandas might upcast, in the sense that it will behave precisely as inplace=False except that it changes the object you're referencing rather than returning a new reference").

To the extent that we (at least in a first stage) focus on the first group, then, Indexes are excluded from the discussion.

(In general, as long as inplace=True will exist in pandas, I think we'll want to stress it's mostly about syntax, and possibly, but not necessarily, about implementation... which means the last thing we want is the implementation of inplace=False to depend on the possibility of inplace=True)

Dr-Irv

Contributor

(In general, as long as inplace=True will exist in pandas, I think we'll want to stress it's mostly about syntax, and possibly, but not necessarily, about implementation... which means the last thing we want is the implementation of inplace=False to depend on the possibility of inplace=True)

This might be heresy, but maybe this proposal should be considered in conjunction with an idea of getting rid of inplace=True as an option (Personally, I always use the default of inplace=False). Discussion in #16529 . And if we were to consider getting rid of inplace=True, then does that change the nature of the list that @jorisvandenbossche provided above?

toobaz

Member

does that change the nature of the list that @jorisvandenbossche provided above?

I stated my opinion above: if we keep inplace=True, then it should just mimic inplace=False and hence not influence our decision here; if we ever drop inplace, then even more so.

jorisvandenbossche

mentioned this

on Feb 4, 2021

WIP [ArrayManager] API: setitem to set new columns / loc+iloc to update inplace #39578

jorisvandenbossche

Member, Author

I also don't think that the discussion on keeping the inplace keyword or not would influence this discussion much (and if we agree on that, it's probably best to leave it as a separate discussion to keep this one manageable). But I am fully supportive of rethinking the inplace keyword, in #16529.

A method like fillna is one of the few methods where inplace=True actually can work. But right now even for those cases you are not sure that inplace=True actually did it inplace without making a copy, because of the upcasting behaviour. So with this proposal (if we include fillna in it), the situation improves a bit since inplace=True can now actually be guaranteed to be inplace (and the same will be true for any alternative for inplace we might want, like copy=False). That's a nice side effect, but for me personally not the main driver to argue for it. I mainly would prefer that fillna has predictable behaviour with a dtype that gets preserved (also for the default case of returning a new object).
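As a rough illustration of a dtype-preserving fillna, here is a user-level sketch. `strict_fillna` is a hypothetical helper, not a pandas function, and it checks the dtype after the fact rather than preventing the upcast inside pandas.

```python
import numpy as np
import pandas as pd


def strict_fillna(s, value):
    """Hypothetical helper: fill missing values but refuse any dtype change."""
    result = s.fillna(value)
    if result.dtype != s.dtype:
        raise TypeError(
            f"filling with {value!r} would change dtype {s.dtype} to {result.dtype}"
        )
    return result


s = pd.Series([1.0, np.nan])
print(strict_fillna(s, 0.0).tolist())  # [1.0, 0.0]

try:
    strict_fillna(s, "x")
    rejected = False
except (TypeError, ValueError) as exc:
    rejected = True
    print("rejected:", exc)
```

With a guarantee like this the result dtype is known before the call, regardless of whether the fill happens in place or on a copy.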

jorisvandenbossche

mentioned this

on Feb 8, 2021

[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

44 remaining items


Activity

jorisvandenbossche commented on Feb 3, 2021

toobaz commented on Feb 3, 2021

jbrockmendel commented on Feb 3, 2021

jorisvandenbossche commented on Feb 4, 2021

toobaz commented on Feb 4, 2021

jorisvandenbossche commented on Feb 4, 2021

jorisvandenbossche commented on Feb 4, 2021

toobaz commented on Feb 4, 2021

Dr-Irv commented on Feb 4, 2021

toobaz commented on Feb 4, 2021

jorisvandenbossche commented on Feb 4, 2021
