Description
Currently, setitem-like operations (i.e. operations that change values in an existing series or dataframe such as __setitem__ and .loc/.iloc setitem, or filling methods like fillna) first try to update in place, but if there is a dtype mismatch, pandas will upcast to a common dtype (typically object dtype).
For example, setting a string into an integer Series upcasts to object:
>>> s = pd.Series([1, 2, 3])
>>> s.loc[1] = "B"
>>> s
0 1
1 B
2 3
dtype: object
or doing a fillna with an invalid fill value also upcasts instead of raising an error:
>>> s = pd.Series(["2020-01-01", "NaT"], dtype="datetime64[ns]")
>>> s
0 2020-01-01
1 NaT
dtype: datetime64[ns]
>>> s.fillna(1)
0 2020-01-01 00:00:00
1 1
dtype: object
My general proposal would be that in some future version (e.g. pandas 2.0, after a deprecation), such inherently in-place operations should be guaranteed to either happen in place or raise an error, and thus never change the dtype of the original Series/DataFrame.
This is similar to eg numpy's behaviour where setitem never changes the dtype. Showing the first example from above in equivalent numpy code:
>>> arr = np.array([1, 2, 3])
>>> arr[1] = "B"
...
ValueError: invalid literal for int() with base 10: 'B'
Apart from that, I also think this is the cleaner behaviour with fewer surprises. If a user specifically wants to allow mixed types in a column, they can manually cast to object dtype first.
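A minimal sketch of the explicit opt-in mentioned above: casting to object dtype first, so that the later assignment needs no upcast. The variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Opt in to mixed types explicitly by casting to object dtype first.
s_obj = s.astype(object)

# The Series can already hold arbitrary Python objects, so this
# assignment does not need to change the dtype.
s_obj.loc[1] = "B"

print(s_obj.dtype)     # object
print(s_obj.tolist())  # [1, 'B', 3]
```

Because the cast is explicit, this pattern does not rely on the implicit upcasting that the proposal would remove.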
On the other hand, this is quite a big change given how permissive we generally are right now and how easily we upcast, and such a change will certainly impact quite some user code (but AFAIK it's perfectly possible to do this with proper deprecation warnings issued in advance for the specific cases where it will error in the future).
There are certainly some more details that need to be discussed as well if we want this (which exact values are regarded as compatible with the dtype, e.g. setting a float in an integer column: should that error or silently round the float?). But what are people's thoughts on the general idea?
cc @pandas-dev/pandas-core
Activity
jorisvandenbossche commented on Feb 3, 2021
Sidenote: the extension arrays are actually already more strict on this (which is also needed, otherwise setitem could change the class of the object). But the upcasting logic lives a level higher on the Series/DataFrame, where the underlying array gets swapped when an upcast happens. So in other words, the proposal is to propagate that stricter behaviour of the arrays also to Series/DataFrame.
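To illustrate the stricter array-level behaviour described above, here is a small sketch using a nullable integer array. The choice of `Int64` is only an example, and the exact exception type and message may differ between pandas versions, so both `TypeError` and `ValueError` are caught.

```python
import pandas as pd

# A nullable-integer extension array (chosen here only as an example).
arr = pd.array([1, 2, 3], dtype="Int64")

try:
    # Incompatible value: the array refuses instead of upcasting.
    arr[1] = "B"
    raised = False
except (TypeError, ValueError) as exc:
    # Exception type and message are version dependent.
    raised = True
    print(type(exc).__name__)

print(raised)
print(arr.dtype)  # the array keeps its dtype and class
```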
toobaz commented on Feb 3, 2021
I tend to agree this is a step to take sooner or later. I don't think I ever met a case in which implicitly upcasting via a setitem on individual elements was a desired feature and not a bug.
Clearly (?), this should not apply to replacing an entire column of a `DataFrame` (or multiple columns) with new ones. Or if we want to state it more generally: dtype should not change unless the smallest dtype-bearing block (which is the column) is entirely replaced. And just for completeness: it would apply to `df.loc[:, 'col'] = s` but not to `df['col'] = s` (notice that the former currently replaces the dtype e.g. if `col` had previously `int` dtype and `s` has `Timestamp`, something that I suspect should not happen).
jbrockmendel commented on Feb 3, 2021
jorisvandenbossche commented on Feb 4, 2021
@toobaz Yes, thanks for explicitly stating that, as I forgot to mention it. Indeed, the proposal is about the cases where (a subset of) the values are changed in place, and not where we are replacing a full column.
So `df["existing_col"] = new_values` will still follow the dtype of the new values and be able to change the dtype of the column.
Indeed, that's probably the distinction we want to make. There is some discussion about this in #38896 (comment)
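The full-column replacement case, which stays outside the proposal, can be sketched as follows. The column name and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
before = df["a"].dtype

# Replacing the entire column: the new values determine the dtype.
df["a"] = [0.5, 1.5, 2.5]
after = df["a"].dtype

print(before, after)
```

Here the integer column becomes a float column because the whole column is swapped out, not because individual values were set into the existing one.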
Good point to explicitly call out. I would say: yes, Index and Series should generally be consistent. But note that we don't allow direct (by the user) `__setitem__` operations for Index anyway. So then it's mostly about methods like `fillna`, and there I think it can follow the behaviour we decide for Series (and see comment below for more detailed list of methods that are considered).
To be explicit: this proposal does not cover concat/append methods or set operations (union, intersection, etc), as that class of functions inherently creates new objects (with a potentially different shape or order) and follows the upcasting rules (`_get_common_dtype`/`find_common_type` based). So operations like `reset_index()` where a value gets added to the `.columns` Index object are not affected (as it would indeed be annoying if that raised).
The case you mention of adding a new column to a DataFrame (`df[col] = ...`) with a label that isn't compatible with the columns Index.dtype is handled under the hood as an `Index.insert`. I would also leave that out of the discussion here, as it's not a "pure" setitem (you can't express this as a setitem operation on the underlying array), and I would rather put it in the bucket of set operations following the upcasting rules.
Let's leave that for a separate issue to discuss, as in theory that's orthogonal in implementation (although certainly related). AFAIK we never do that for actual setitem, but only in a few specific methods in the internals (fillna, interpolate, where, replace).
(I will open another issue for it)
toobaz commented on Feb 4, 2021
Not sure I follow: `Index.fillna` does not resemble `__setitem__` to me: it does not modify data, rather it creates and returns a copy. I see it more as a sort of arithmetic operation...
jorisvandenbossche commented on Feb 4, 2021
The `append` method is indeed excluded, as I would label that as a concatenation and not a setitem (it also can't be expressed in setitem operations on the array level).
But indeed good to think about a more complete list. After going through the namespace of Series/DataFrame, I think there are basically two groups:
- `__setitem__` operations (directly called by the user), which includes plain `__setitem__` (`obj[..] = ..`) and `loc`/`iloc`/`at`/`iat`'s setitem
- [...] `inplace=True`). These are methods that can be expressed as a setitem operation (eg `fillna()` can be expressed as `arr[arr.isna()] = fill_value`).
I think the more or less full list is:
- `fillna` and `interpolate`
- `replace`
- `update`
- `where`, `mask`
- `clip`
- [...] `putmask`, but this already seems a bit inconsistent in its upcasting behaviour)
So any other method that potentially changes the shape is not included (eg append). Those can also typically not be expressed as setitem equivalent (setitem with arrays cannot expand).
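The equivalence mentioned above between `fillna` and a masked setitem can be sketched on a plain NumPy array. NumPy arrays have no `isna()` method, so `np.isnan` stands in for it here; the values are made up for illustration.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0])
fill_value = 0.0

# fillna written as a masked setitem on the underlying array.
arr[np.isnan(arr)] = fill_value

print(arr)        # [1. 0. 3.]
print(arr.dtype)  # float64
```

Since the fill value is compatible with the array's dtype, the assignment happens in place and the dtype is unchanged, which is the guarantee the proposal asks for.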
One remark here is that our setitem on DataFrame and Series can expand. Potentially we could think about whether we want to deviate from the rule for such expanding setitem, if desired (but I am not directly arguing for it, you can start with object dtype if you want to expand with arbitrary objects).
For the above list, I would say that direct `__setitem__` is all included in the proposal, and for the methods of the second group, we might need to decide on a case by case basis.
For example, my feeling says that for `fillna` it is logical to preserve the dtype, while for `replace` this might be less convenient.
jorisvandenbossche commented on Feb 4, 2021
@toobaz does my last comment clarify that? (we could also keep the two groups (actual setitem vs methods) as two separate discussions, if that helps)
toobaz commented on Feb 4, 2021
@jorisvandenbossche it does clarify, although as you pointed out, members of your second group of methods are more heterogeneous in their behavior and hence a general rule of (not) upcasting might be hard to enforce (harder than just saying "please be aware that every time you use `inplace=True`, pandas might upcast, in the sense that it will behave precisely as `inplace=False` except that it changes the object you're referencing rather than returning a new reference").
To the extent that we (at least in a first stage) focus on the first group, then, `Index`es are excluded from the discussion.
(In general, as long as `inplace=True` will exist in pandas, I think we'll want to stress it's mostly about syntax, and possibly, but not necessarily, about implementation... which means the last thing we want is the implementation of `inplace=False` to depend on the possibility of `inplace=True`)
Dr-Irv commented on Feb 4, 2021
This might be heresy, but maybe this proposal should be considered in conjunction with an idea of getting rid of `inplace=True` as an option (Personally, I always use the default of `inplace=False`). Discussion in #16529. And if we were to consider getting rid of `inplace=True`, then does that change the nature of the list that @jorisvandenbossche provided above?
toobaz commented on Feb 4, 2021
I stated my opinion above: if we keep `inplace=True`, then it should just mimic `inplace=False` and hence not influence our decision here; if we ever drop `inplace`, then even more so.
jorisvandenbossche commented on Feb 4, 2021
I also don't think that the discussion on keeping the `inplace` keyword or not would influence this discussion much (and if we agree on that, probably best to leave it as a separate discussion to keep this one manageable). But I am fully supportive of rethinking the `inplace` keyword, in #16529.
A method like `fillna` is one of the few methods where `inplace=True` actually can work. But right now even for those cases you are not sure that `inplace=True` actually did it inplace without making a copy, because of the upcasting behaviour. So with this proposal (if we include `fillna` in it), the situation improves a bit since `inplace=True` can now actually be guaranteed to be inplace (and the same will be true for any alternative for inplace we might want, like `copy=False`). That's a nice side effect, but for me personally not the main driver to argue for it. I mainly would prefer that `fillna` has predictable behaviour with a dtype that gets preserved (also for the default case of returning a new object).