-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
Description
dti = pd.DatetimeIndex([pd.NaT, "2016-01-01"], tz="UTC")
tdi = dti - dti[1]
pi = dti.tz_localize(None).to_period("D")
ci = pi.astype("category")
ii = pd.IntervalIndex([None])
pd.Series(dti).fillna("foo") # <- casts
pd.Series(tdi).fillna("foo") # <- raises
pd.Series(pi).fillna("foo") # <- casts, but untested
pd.Series(ci).fillna("foo") # <- raises
pd.Series(ii).fillna("foo") # <- raises, untested
ATM fillna behavior with incompatible values is pretty inconsistent. With numpy dtypes (except dt64 and td64) we always cast. With dt64 and dt64tz we always cast.
With td64 we always raise, but we only have tests for the value being an integer. Back circa 2018 we used to coerce integers to timedelta, so pd.Series(tdi).fillna(1)
would interpret the 1 as pd.Timedelta(seconds=1)
(not nanos!). We deprecated that coercion and now raise, but I don't think there was much thought given to whether to raise versus cast to object.
With Categorical we raise. We have 2 tests specific to that.
With PeriodDtype we cast and with IntervalDtype we raise, but we have tests for neither.
I don't have a particularly strong opinion on this, but would like to be consistent.
Activity
jreback commentedon Jan 1, 2022
i think we should move to a world when we raise for an incompatible by default but allow control thru an errors='raise' (default) or errors='allow'
these are equivalent (and maybe should be the same)
as the proposed cast='safe' and 'unsafe'
jbrockmendel commentedon Jan 1, 2022
i've been thinking something similar for .where and .mask... but in 1.4 we're deprecating the 'errors' keyword since it wasn't actually used. could revert that deprecation in time for the rc and then add a deprecation for actually using the keyword correctly?
jreback commentedon Jan 1, 2022
i think it's ok to leave it
we can always reverse or change it later i think
jbrockmendel commentedon Jan 21, 2022
I've spent some time figuring out what it would take to be consistent across methods/dtypes and what other issues are intertwined with this.
First a reminder of the status quo. The relevant methods are NDFrame
fillna
,where
,mask
,replace
,shift
,unstack
,reindex
and the various__setitem__
methods.Each of these methods involve setting some 'other' (or fill_value) value into an array (np.ndarray | ExtensionArray) 'values'. This discussion is about what we do when 'other' cannot be set into 'values'.
In some cases we raise. In others we coerce 'values' to a dtype that can hold 'other'. This coercion happens in
Block.coerce_to_target_dtype
.When 'values' is an np.ndarray, we always coerce.
With ExtensionArray (EA) 'values', we are inconsistent. Most cases raise, so here I'll list cases that coerce.
The options to make these consistent are roughly:
__setitem__
The big trouble with either a keyword or always-coerce is allowing EAs to specify their own coercion logic (xref #24246) and implementing Categorical coercion logic in a reasonable way.
To qualify as An Elegant Solution, we'd want the coercion logic for Categorical to be used in Categorical._replace to avoid having bespoke logic there, as well as resolve merge/concat inconsistencies #41626, #24093, #42840, #15332, ... (dont have a complete list of these, haven't fully vetted these)
The keyword option is the least opinionated option, but it would mean a significantly increased API surface to test that I'm not looking forward to.
pd.NA
byNone
has no effect #45601jbrockmendel commentedon Jul 12, 2023
@MarcoGorelli is this closed by PDEP6?
MarcoGorelli commentedon Jul 24, 2023
Not by the PRs I currently have open, but I think it'd be in scopeYes, but only for thefillna(..., inplace=True)
case