Description
Context: in the original `pd.NA` proposal (#28095) the topic of `pd.NA` vs `np.nan` was raised several times. And it also came up in the recent pandas-dev mailing list discussion on pandas 2.0 (both in the context of `np.nan` for float and `pd.NaT` for datetime-like).
With the introduction of `pd.NA`, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:
- Keep using `np.nan` as we do now, but change its behaviour (e.g. in comparison ops) to match `pd.NA`
- Start using `pd.NA` in float dtypes
Personally, I think the first one is not really an option. Keeping it as `np.nan` but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of `np.nan`.
For the second option, there are still multiple ways this could be implemented (a single array that still uses `np.nan` as the missing value sentinel but converts this to `pd.NA` towards the user, versus a masked approach like we do for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: do we want to have both `np.nan` and `pd.NA`, or only allow `pd.NA`? Should `np.nan` still be considered as "missing", or should that be optional? What to do on conversion from/to numpy? (The answer to some of those questions will also determine which of the two possible implementations is preferable.)
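To make the two implementation options a bit more concrete, here is a rough sketch (the arrays and names are illustrative only, not actual pandas internals):

```python
import numpy as np

# Option 1 (sentinel): a plain float array where np.nan itself marks the
# missing entries; that sentinel would be presented to the user as pd.NA.
values_sentinel = np.array([0.1, np.nan, 0.3])
missing_sentinel = np.isnan(values_sentinel)

# Option 2 (masked, like the nullable integer dtypes): the data array is
# accompanied by a separate boolean mask that marks missing entries, so a
# NaN in the data can remain a regular float value distinct from NA.
values_masked = np.array([0.1, 0.0, 0.3])
mask = np.array([False, True, False])
```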
Actual discussion items: assume we are going to add floating dtypes that use `pd.NA` as the missing value indicator. Then the following question comes up:
If I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?
So yes, it is technically possible to have both `np.nan` and `pd.NA` with different behaviour (`np.nan` as "normal", unmasked value in the actual data, `pd.NA` tracked in the mask). But we also need to decide if we want this.
This was touched upon a bit in the original issue, but not really further discussed. Quoting a few things from the original thread in #28095:
> [@Dr-Irv in comment] I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number" / "bad computational result".

vs

> [@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won't be ever used).
So I think those two describe nicely the two options we have on the question "do we want both `pd.NA` and `np.nan` in a float dtype and have them signify different things?" -> 1) Yes, we can have both, versus 2) No, towards the user we only have `pd.NA` and "disallow" NaN (or interpret / convert any NaN on input to NA).
A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post).
That reasoning was given by @Dr-Irv in #28095 (comment): there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically `0/0`)?
A dummy example showing how both can occur:
```python
>>> pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0     NaN
1     1.0
2    <NA>
dtype: float64
```
The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).
So, yes, it is possible and potentially desirable to allow both `pd.NA` and `np.nan` in floating dtypes. But it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like `isna`/`notna`/`dropna`/`fillna`? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (that have a `skipna` keyword, like sum, mean, etc.)?
Personally, I think we will need to keep NaN as missing, at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. a comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check for NaN in the values as well as in the mask (which can also have performance implications).
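For example, the scalar behaviour of the two markers in a comparison already differs today (independent of any container):

```python
>>> import numpy as np
>>> import pandas as pd
>>> np.nan == 1.0   # IEEE 754: any comparison with NaN returns False
False
>>> pd.NA == 1.0    # pd.NA propagates through comparisons
<NA>
```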
Some other various considerations:
- Having both `pd.NA` and NaN (`np.nan`) might actually be more confusing for users.
- If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. `pd.NA`). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array).
- How do we handle compatibility with numpy? The solution that we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and have an explicit `to_numpy(.., na_value=np.nan)` conversion. But given how `np.nan` is in practice used in the whole pydata ecosystem as a missing value indicator, this might be annoying. For conversion to numpy, see also some relevant discussion in "API: how to handle NA in conversion to numpy arrays" (#30038). (A small sketch covering this and the next item follows the list.)
- What about conversion / inference on input? E.g. creating a Series from a float numpy array with NaNs (`pd.Series(np.array([0.1, np.nan]))`): do we convert NaNs to NA automatically by default?
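A small sketch of both conversion directions, assuming a nullable "Float64" dtype with a `to_numpy(na_value=...)` escape hatch; the constructor case is exactly the open question above, so no output is shown for it:

```python
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([0.1, pd.NA], dtype="Float64")
>>> s.to_numpy(dtype="float64", na_value=np.nan)   # explicit NA -> NaN on the way out
array([0.1, nan])
>>> pd.Series(np.array([0.1, np.nan]), dtype="Float64")   # should NaN become <NA> here?
```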
Activity
jorisvandenbossche commented on Feb 26, 2020
How do other tools / languages deal with this?

- Julia has both as separate concepts.
- R also has both, but will treat NaN as missing in `is.na(..)`. The "skipna"-style `na.rm` keyword also skips NaN (na.rm docs: "logical. Should missing values (including NaN) be removed?").
- Apache Arrow also has both (NaN can be a float value, while it tracks missing values in a mask). It doesn't yet have many computational tools, but e.g. the `sum` function skips missing values by default while propagating NaN (like numpy's sum does for float NaN). (See the sketch below.)
- I think SQL also has both, but didn't yet check in more detail how it handles NaN in missing-like operations.
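A small illustration of the Arrow behaviour described above (assuming a reasonably recent pyarrow; results are shown as plain Python values rather than scalar reprs):

```python
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> pc.sum(pa.array([1.0, None, 2.0])).as_py()          # nulls are skipped by default
3.0
>>> pc.sum(pa.array([1.0, float("nan"), 2.0])).as_py()  # NaN is a regular float and propagates
nan
```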
toobaz commented on Feb 26, 2020
I still don't know the semantics of `pd.NA` enough to judge in detail, but I am skeptical on whether users do benefit from two distinct concepts. If as a user I divide 0 by 0, it's perfectly fine to me to consider the result as "missing". Even more so because when done in non-vectorized Python, it raises an error, not returning some "not a number" placeholder. I suspect the other languages (e.g. at least R) have semantics which are more driven by implementation than by user experience. And definitely I would have a hard time suggesting "natural" ways in which the propagation of `pd.NA` and `np.nan` should differ.

So ideally `pd.NA` and `np.nan` should be the same to users. If, as I understand, this is not possible given how `pd.NA` was designed and the compatibility we want to (rightfully) keep with numpy, I think the discrepancies should be limited as much as possible.
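For reference, the scalar-vs-vectorized difference mentioned here:

```python
>>> import numpy as np
>>> 0 / 0                              # plain Python raises instead of returning NaN
Traceback (most recent call last):
  ...
ZeroDivisionError: division by zero
>>> np.array([0.0]) / np.array([0.0])  # numpy produces NaN (with a RuntimeWarning)
array([nan])
```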
toobaz commented on Feb 26, 2020
Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.
TomAugspurger commented on Feb 26, 2020
Agreed.
My initial preference is for not having both. I think that having both will be confusing for users (and harder to maintain).
jreback commented on Feb 26, 2020
Dr-Irv commented on Feb 26, 2020
On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix. I've had cases where the source data (or some other calculation I did) produced a `NaN`, which pandas treats as missing, and the true source of the problem was either back in the source data (e.g., that data should not have been missing) or a bug elsewhere in my code. So in these cases, where the `NaN` was introduced due to a bug in the source data or in my code, my later calculations were perfectly happy because, to pandas, the `NaN` meant "missing". Finding this kind of bug is non-trivial.

I think we should support `np.nan` and `pd.NA`. To me, the complexity is in a few places:

1. How `np.nan` won't mean "missing" in the future needs to be carefully thought out. Maybe we consider a global option to control this behavior?
2. If `np.nan` and `pd.NA` mean "Not a number" and "missing", respectively, and numpy (or another library) treats `np.nan` as "missing", do we automate the conversions (both going from pandas to numpy/other, or ingesting from numpy/other into pandas)?

We currently also have this inconsistent (IMHO) behavior which relates to (2) above:
toobaz commented on Feb 26, 2020
Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.
Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the `skipna` argument).

Dr-Irv commented on Feb 26, 2020
When I said "such a calculation could indicate something wrong in the data that you need to identify and fix.", the thing that could be wrong in the data might not be missing data. It could be that some combination of values occurred that were not supposed to happen.
There are just two use cases here. One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a `NaN` unexpectedly, and that indicates a different kind of bug.

Yes, but `skipna=True` is the default everywhere, so your solution would mean that you have to always use `skipna=False` to detect those kinds of errors.

toobaz commented on Feb 26, 2020
My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.
toobaz commented on Feb 26, 2020
This is true. But... are there new usability insights compared to those we had back in 2017?
Dr-Irv commented on Feb 26, 2020
That's why I think having `np.nan` represent "bad calculation" and `pd.NA` represent "missing" is the preferred behavior. But I'm one voice among many.
represent "missing" is the preferred behavior. But I'm one voice among many.shoyer commentedon Feb 26, 2020
+1 for consistency with other computational tools.
On the subject of automatic conversion into NumPy arrays, returning an object dtype array seems consistent but could be a very poor user experience. Object arrays are really slow, and break many/most functions that expect numeric NumPy arrays. Float dtype with auto-conversion from NA -> NaN would probably be preferred by users.
150 remaining items
WillAyd commented on Jul 23, 2024
After reviewing this discussion here and in PDEP-16, I gather the general consensus is that there is value in distinguishing these, but there is a lot of concern around the implementation (and rightfully so, given the history of pandas).
With that being the case, maybe we can just concretely start by adding the `nan_is_null` keyword to `.fillna`, `.isna`, and `.hasna` with the default value of `True` that I think @jbrockmendel and @jorisvandenbossche landed on? That maintains backwards compatibility and has prior art in pyarrow (save the fact that pyarrow defaults to `False`).

Right now the `pd.FloatXXDtype()` data types become practically unusable the moment a `NaN` value is introduced, which can happen very easily. By at least giving users the option to `.fillna` on those types (or filter them out), they could continue to use the extension types without casting back to NumPy. Right now I think that is the biggest weakness in our extension type system that prevents you from being able to use it generically.

I think starting with a smaller scope of just those few methods is helpful; trying to solve constructors is going to open a can of worms that will deter any progress, and I think more generally it should be solved through PDEP-16 anyway (or maybe the follow-up to it that focuses on helping to better distinguish these values).
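For reference, a minimal sketch of the pyarrow prior art mentioned above (assuming a pyarrow version where `is_null` exposes the `nan_is_null` option):

```python
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> arr = pa.array([1.0, float("nan"), None])
>>> pc.is_null(arr).to_pylist()                    # default: only the null entry is "missing"
[False, False, True]
>>> pc.is_null(arr, nan_is_null=True).to_pylist()  # opt in to treating NaN as missing too
[False, True, True]
```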
a-reich commented on Jul 23, 2024
I upvote starting with something that can be improved short-term vs needing to first reach consensus on a new holistic design.
`isna` doesn't detect `pyarrow.NA` produced by 0/0 (#59891)

vkhodygo commented on Sep 27, 2024
@WillAyd
Considering the fact that pandas employs the pyarrow engine, it should have the same defaults to avoid even more confusion. Breaking backwards compatibility is not a novel thing; one can't just keep legacy code forever. Besides, pandas devs do this all the time, don't you?

WillAyd commented on Sep 27, 2024
For sure, but our history does make things complicated. Unfortunately, for over a decade pandas users have commonly been assigning `np.nan` to what they think is a "missing value" (see the sketch below). So we can't just immediately change that to literally mean `NaN` without some type of transition plan.

There is a larger discussion to that point in PDEP-0016 that you might want to chime in on and follow:
#58988 (comment)
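A sketch of the long-standing idiom referred to above (the frame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0, -1.0]})
# the common pattern: assign np.nan to mark entries as "missing"
df.loc[df["price"] < 0, "price"] = np.nan
```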
theo-brown commented on Feb 24, 2025
Has this been pushed through without ensuring the conversion works? Many of the docs that refer to NaNs have examples that include `np.nan` that no longer work. For example, the docs for `skipna=True` in cumulative operations like `df.cummin()` still have examples which do not behave as expected (see linked issues above).

(Also, as many of these linked issues relate to the same problem, it would be great if there was a clear master issue to group them together.)

@MarcoGorelli sorry for the noise, I misunderstood the source of my issue.
MarcoGorelli commented on Feb 24, 2025
@theo-brown which issue specifically are you referring to, and what behaviour would you expect for `pd.Series([2, np.nan, 5, -1, 0]).cummin(skipna=True)`? That's a Series with classical numpy-backed dtypes; this issue is about nullable types.