
API: distinguish NA vs NaN in floating dtypes #32265

Open

Description

@jorisvandenbossche
Member

Context: in the original pd.NA proposal (#28095) the topic of pd.NA vs np.nan was raised several times. It also came up in the recent pandas-dev mailing list discussion on pandas 2.0 (both in the context of np.nan for float dtypes and pd.NaT for datetime-like dtypes).

With the introduction of pd.NA, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:

  • Keep using np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
  • Start using pd.NA in float dtypes

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.
For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but which we present to the user as pd.NA, versus a masked approach like we do for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered "missing", or should that be optional? What to do on conversion from/to numpy? (And the answer to some of those questions will also determine which of the two possible implementations is preferable.)


Actual discussion items: assume we are going to add floating dtypes that use pd.NA as the missing value indicator. Then the following question comes up:

If I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

So yes, it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as "normal", unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide if we want this.
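As a rough sketch of that masked layout (illustrative only, not the actual array internals): a NaN stays as an ordinary value in the data buffer, while NA is recorded solely in the mask.

>>> import numpy as np
>>> values = np.array([np.nan, 1.0, 2.0])   # np.nan is a real, unmasked float value
>>> mask = np.array([False, False, True])   # True marks the positions that are pd.NA
>>> mask                                    # "is NA" comes from the mask alone
array([False, False,  True])
>>> np.isnan(values) & ~mask                # "is NaN" comes from the data itself
array([ True, False, False])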

This was touched upon a bit in the original issue, but not really further discussed. Quoting a few things from the original thread in #28095:

[@Dr-Irv in comment] I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number" / "bad computational result".

vs

[@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won't be ever used).

So I think those two quotes nicely describe the two options we have for the question "do we want both pd.NA and np.nan in a float dtype and have them signify different things?" -> 1) Yes, we can have both, versus 2) No: towards the user we only have pd.NA, and we "disallow" NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post).
That reasoning was given by @Dr-Irv in #28095 (comment): there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically 0/0) ?

A dummy example showing how both can occur:

>>> pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0    NaN
1    1.0
2   <NA>
dtype: float64

The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).

So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But, it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna/notna/dropna/fillna ? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (that have a skipna keyword, like sum, mean, etc)?

Personally, I think we will need to keep NaN as missing, at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. a comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check for NaN in the values as well as in the mask (which can also have performance implications).
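A sketch of what isna could look like under that choice (hypothetical helper, not pandas internals): both the mask and the data have to be inspected, which is the extra cost mentioned above.

>>> import numpy as np
>>> def isna_both(values, mask):
...     # hypothetical: treat masked entries *and* data NaNs as missing
...     return mask | np.isnan(values)
...
>>> isna_both(np.array([np.nan, 1.0, 2.0]), np.array([False, False, True]))
array([ True, False,  True])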


Some other various considerations:

  • Having both pd.NA and NaN (np.nan) might actually be more confusing for users.

  • If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array)

  • How do we handle compatibility with numpy?
    The solution that we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and have an explicit to_numpy(.., na_value=np.nan) conversion (see the sketch after this list).
    But given how np.nan is in practice used in the whole pydata ecosystem as a missing value indicator, this might be annoying.

    For conversion to numpy, see also some relevant discussion in API: how to handle NA in conversion to numpy arrays #30038

  • What about conversion / inference on input?
    E.g. creating a Series from a float numpy array with NaNs (pd.Series(np.array([0.1, np.nan]))): do we convert NaNs to NA automatically by default?
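For reference, a small sketch of what those conversions look like today for the existing nullable Int64 dtype (to_numpy and its na_value argument are existing pandas API; how a nullable float dtype should treat the NaN on input is exactly the open question):

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 2, pd.NA], dtype="Int64")
>>> s.to_numpy()                                   # default: object dtype, pd.NA preserved
array([1, 2, <NA>], dtype=object)
>>> s.to_numpy(dtype="float64", na_value=np.nan)   # explicit sentinel on conversion
array([ 1.,  2., nan])
>>> pd.Series(np.array([0.1, np.nan]))             # on input: should this NaN become <NA>?
0    0.1
1    NaN
dtype: float64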

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton

Activity

jorisvandenbossche commented on Feb 26, 2020

How do other tools / languages deal with this?

Julia has both as separate concepts:

julia> arr = [1.0, missing, NaN]
3-element Array{Union{Missing, Float64},1}:
   1.0     
    missing
 NaN       

julia> ismissing.(arr)
3-element BitArray{1}:
 false
  true
 false

julia> isnan.(arr)
3-element Array{Union{Missing, Bool},1}:
 false       
      missing
  true       

R also has both, but will treat NaN as missing in is.na(..):

> v <- c(1.0, NA, NaN)
> v
[1]   1  NA NaN
> is.na(v)
[1] FALSE  TRUE  TRUE
> is.nan(v)
[1] FALSE FALSE  TRUE

Here, the "skipna" na.rm keyword also skips NaN (na.rm docs: "logical. Should missing values (including NaN) be removed?"):

> sum(v)
[1] NA
> sum(v, na.rm=TRUE)
[1] 1

Apache Arrow also has both (NaN can be a float value, while it tracks missing values in a mask). It doesn't yet have many computational tools, but e.g. the sum function skips missing values by default while propagating NaN (like numpy's sum does for float NaN).
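A quick illustration with pyarrow's compute module (pc.sum and its skip_nulls option are existing pyarrow API; results shown as comments):

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> arr = pa.array([1.0, None, float("nan")])   # one null (missing), one NaN
>>> pc.sum(arr)                       # the null is skipped by default, but the NaN propagates -> nan
>>> pc.sum(arr, skip_nulls=False)     # now the null propagates as well -> null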

I think SQL also has both, but didn't yet check in more detail how it handles NaN in missing-like operations.

toobaz commented on Feb 26, 2020

I still don't know the semantics of pd.NA well enough to judge in detail, but I am skeptical about whether users benefit from two distinct concepts. If as a user I divide 0 by 0, it's perfectly fine to me to consider the result as "missing". Even more so because when done in non-vectorized Python, it raises an error rather than returning some "not a number" placeholder. I suspect the other languages (e.g. at least R) have semantics which are more driven by implementation than by user experience. And I would definitely have a hard time suggesting "natural" ways in which the propagation of pd.NA and np.nan should differ.

So ideally pd.NA and np.nan should be the same to users. If, as I understand, this is not possible given how pd.NA was designed and the compatibility we want to (rightfully) keep with numpy, I think the discrepancies should be limited as much as possible.

toobaz commented on Feb 26, 2020

Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.
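A minimal illustration of that scenario (column names and numbers are mine):

>>> import pandas as pd
>>> hours = pd.Series([0.0, 160.0, 150.0])     # monthly hours worked
>>> salary = pd.Series([0.0, 3200.0, 3300.0])  # monthly salary
>>> salary / hours                             # 0 / 0 for the first worker -> NaN
0     NaN
1    20.0
2    22.0
dtype: float64
>>> (salary / hours).mean()                    # NaN is skipped, exactly as if it were missing
21.0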

TomAugspurger commented on Feb 26, 2020

Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me.

Agreed.

do we want both pd.NA and np.nan in a float dtype and have them signify different things?

My initial preference is for not having both. I think that having both will be confusing for users (and harder to maintain).

jreback commented on Feb 26, 2020

Dr-Irv commented on Feb 26, 2020

Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix. I've had cases where the source data (or some other calculation I did) produced a NaN, which pandas treats as missing, and the true source of the problem was either back in the source data (e.g., data that should not have been missing) or a bug elsewhere in my code. So in these cases, where the NaN was introduced by a bug in the source data or in my code, my later calculations were perfectly happy because to pandas the NaN meant "missing". Finding this kind of bug is non-trivial.

I think we should support np.nan and pd.NA. To me, the complexity is in a few places:

  1. The transition for users so they know that np.nan won't mean "missing" in the future needs to be carefully thought out. Maybe we consider a global option to control this behavior?
  2. Going back and forth between pandas and numpy (and maybe other libraries). If we eventually have np.nan and pd.NA mean "not a number" and "missing", respectively, and numpy (or another library) treats np.nan as "missing", do we automate the conversions (both going from pandas to numpy/other and ingesting from numpy/other into pandas)?

We currently also have this inconsistent (IMHO) behavior which relates to (2) above:

>>> s=pd.Series([1,2,pd.NA], dtype="Int64")
>>> s
0       1
1       2
2    <NA>
dtype: Int64
>>> s.to_numpy()
array([1, 2, <NA>], dtype=object)
>>> s
0       1
1       2
2    <NA>
dtype: Int64
>>> s.astype(float).to_numpy()
array([ 1.,  2., nan])
toobaz commented on Feb 26, 2020

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix.

Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.

to pandas, the NaN meant "missing"

Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna argument).

Dr-Irv commented on Feb 26, 2020

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix.

Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.

When I said "such a calculation could indicate something wrong in the data that you need to identify and fix.", the thing that could be wrong in the data might not be missing data. It could be that some combination of values occurred that were not supposed to happen.

There are just two use cases here. One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN unexpectedly, and that indicates a different kind of bug.

to pandas, the NaN meant "missing"

Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna argument).

Yes, but skipna=True is the default everywhere, so your solution would mean that you have to always use skipna=False to detect those kinds of errors.
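For example, with the default skipna=True the stray NaN is silently ignored, and only skipna=False surfaces it:

>>> import numpy as np, pandas as pd
>>> s = pd.Series([1.0, np.nan, 2.0])
>>> s.sum()                # default skipna=True: the NaN is treated as missing and skipped
3.0
>>> s.sum(skipna=False)    # only this surfaces the suspicious value
nan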

toobaz commented on Feb 26, 2020

One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN unexpectedly, and that indicates a different kind of bug.

My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.

toobaz commented on Feb 26, 2020

Yes, but skipna=True is the default everywhere, so your solution would mean that you have to always use skipna=False to detect those kinds of errors.

This is true. But... are there new usability insights compared to those we had back in 2017?

Dr-Irv commented on Feb 26, 2020

My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.

That's why I think having np.nan represent "bad calculation" and pd.NA represent "missing" is the preferred behavior. But I'm one voice among many.

shoyer commented on Feb 26, 2020

That's why I think having np.nan represent "bad calculation" and pd.NA represent "missing" is the preferred behavior. But I'm one voice among many.

+1 for consistency with other computational tools.

On the subject of automatic conversion into NumPy arrays: returning an object dtype array seems consistent but could be a very poor user experience. Object arrays are really slow, and break many/most functions that expect numeric NumPy arrays. Float dtype with auto-conversion from NA -> NaN would probably be preferred by users.

(150 remaining comments not shown)

WillAyd commented on Jul 23, 2024

After reviewing this discussion here and in PDEP-16, my read is that the general consensus is that there is value in distinguishing these, but there is a lot of concern around the implementation (and rightfully so, given the history of pandas).

With that being the case, maybe we can concretely start by adding the nan_is_null keyword to .fillna, .isna, and .hasna with the default value of True that I think @jbrockmendel and @jorisvandenbossche landed on? That maintains backwards compatibility and has prior art in pyarrow (save the fact that pyarrow defaults to False).
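The pyarrow prior art looks like this (pc.is_null and its nan_is_null option are existing pyarrow API; results shown as comments):

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> arr = pa.array([1.0, None, float("nan")])
>>> pc.is_null(arr)                     # default nan_is_null=False -> [false, true, false]
>>> pc.is_null(arr, nan_is_null=True)   # NaN also counted as null  -> [false, true, true]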

Right now the pd.FloatXXDtype() data types become practically unusable the moment a NaN value is introduced, which can happen very easily. By at least giving users the option to .fillna on those types (or filter them out), they could continue to use the extension types without casting back to NumPy. I think that is currently the biggest weakness in our extension type system, preventing it from being used generically.
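A sketch of that friction, as observed on recent pandas versions (exact reprs may vary): an operation introduces a data NaN into a Float64 series, but isna/fillna only look at the mask.

>>> import pandas as pd
>>> s = pd.Series([0.0, 1.0, pd.NA], dtype="Float64")
>>> r = s / 0              # 0/0 introduces a NaN in the data, next to the existing <NA>
>>> r.isna()               # -> [False, False, True]: only the masked <NA> is reported
>>> r.fillna(-1.0)         # -> [NaN, inf, -1.0]: the NaN is left untouched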

I think starting with a smaller scope to just those few methods is helpful; trying to solve constructors is going to open a can of worms that will deter any progress, and I think more generally should be solved through PDEP-16 anyway (or maybe the follow up to it that focuses on helping to better distinguish these values).

a-reich commented on Jul 23, 2024

I upvote starting with something that can be improved short-term vs needing to first reach consensus on a new holistic design.

vkhodygo commented on Sep 27, 2024

@WillAyd

That maintains backwards compatibility and has a prior art in pyarrow (save the fact that pyarrow defaults to False).

Considering the fact that pandas employs the pyarrow engine, it should have the same defaults to avoid even more confusion. Breaking backwards compatibility is not a novel thing; one can't just keep legacy code forever. Besides, pandas devs do this all the time, don't you?

WillAyd commented on Sep 27, 2024

For sure, but our history does make things complicated. Unfortunately, for over a decade pandas users have been commonly doing:

ser.iloc[0] = np.nan

to assign what they think is a "missing value". So we can't just immediately change that to literally mean NaN without some type of transition plan.

There is a larger discussion to that point in PDEP-0016 that you might want to chime in on and follow

#58988 (comment)

theo-brown commented on Feb 24, 2025

Has this been pushed through without ensuring the conversion works? Many of the docs that refer to NaNs have examples that include np.nan that no longer work.

Example: the docs for cumulative operations like df.cummin() (which take skipna=True) still have examples like:

pd.Series([2, np.nan, 5, -1, 0]).cummin(skipna=True)

which does not behave as expected (see linked issues above).

(Also, as many of these linked issues are relating to the same problem, it would be great if there was a clear master issue to group them together)

@MarcoGorelli sorry for the noise, I misunderstood the source of my issue.

MarcoGorelli commented on Feb 24, 2025

@theo-brown which issue specifically are you referring to, and what behaviour would you expect for pd.Series([2, np.nan, 5, -1, 0]).cummin(skipna=True)? That's a Series with classical numpy-backed dtypes; this issue is about nullable types.
