
API: distinguish NA vs NaN in floating dtypes #32265

Open

Description

@jorisvandenbossche
Member

Context: in the original pd.NA proposal (#28095) the topic of pd.NA vs np.nan was raised several times. It also came up in the recent pandas-dev mailing list discussion on pandas 2.0 (both in the context of np.nan for float dtypes and pd.NaT for datetime-like dtypes).

With the introduction of pd.NA, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:

  • Keep using np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
  • Start using pd.NA in float dtypes

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.
For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but which we present to the user as pd.NA, versus a masked approach like we do for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered "missing", or should that be optional? What to do on conversion from/to numpy? (And the answer to some of those questions will also determine which of the two possible implementations is preferable.)


Actual discussion items: assume we are going to add floating dtypes that use pd.NA as the missing value indicator. Then the following question comes up:

If I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

So yes, it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as "normal", unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide if we want this.
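As a rough sketch of that masked layout (illustrative only, not the actual array internals): a NaN stays as an ordinary value in the data buffer, while NA is recorded solely in the mask.

>>> import numpy as np
>>> values = np.array([np.nan, 1.0, 2.0])   # np.nan is a real, unmasked float value
>>> mask = np.array([False, False, True])   # True marks the positions that are pd.NA
>>> mask                                    # "is NA" comes from the mask alone
array([False, False,  True])
>>> np.isnan(values) & ~mask                # "is NaN" comes from the data itself
array([ True, False, False])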

This was touched upon a bit in the original issue, but not really further discussed. Quoting a few things from the original thread in #28095:

[@Dr-Irv in comment] I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number" / "bad computational result".

vs

[@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won't be ever used).

So I think those two quotes nicely describe the two options we have for the question "do we want both pd.NA and np.nan in a float dtype and have them signify different things?" -> 1) Yes, we can have both, versus 2) No: towards the user we only have pd.NA, and we "disallow" NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post).
That reasoning was given by @Dr-Irv in #28095 (comment): there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically 0/0) ?

A dummy example showing how both can occur:

>>> pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0    NaN
1    1.0
2   <NA>
dtype: float64

The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).

So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But, it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna/notna/dropna/fillna ? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (that have a skipna keyword, like sum, mean, etc)?

Personally, I think we will need to keep NaN as missing, at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. a comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check for NaN in the values as well as in the mask (which can also have performance implications).
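A sketch of what isna could look like under that choice (hypothetical helper, not pandas internals): both the mask and the data have to be inspected, which is the extra cost mentioned above.

>>> import numpy as np
>>> def isna_both(values, mask):
...     # hypothetical: treat masked entries *and* data NaNs as missing
...     return mask | np.isnan(values)
...
>>> isna_both(np.array([np.nan, 1.0, 2.0]), np.array([False, False, True]))
array([ True, False,  True])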


Some other various considerations:

  • Having both pd.NA and NaN (np.nan) might actually be more confusing for users.

  • If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array)

  • How do we handle compatibility with numpy?
    The solution that we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and have an explicit to_numpy(.., na_value=np.nan) conversion (see the sketch after this list).
    But given how np.nan is in practice used in the whole pydata ecosystem as a missing value indicator, this might be annoying.

    For conversion to numpy, see also some relevant discussion in API: how to handle NA in conversion to numpy arrays #30038

  • What about conversion / inference on input?
    E.g. creating a Series from a float numpy array with NaNs (pd.Series(np.array([0.1, np.nan]))): do we convert NaNs to NA automatically by default?
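For reference, a small sketch of what those conversions look like today for the existing nullable Int64 dtype (to_numpy and its na_value argument are existing pandas API; how a nullable float dtype should treat the NaN on input is exactly the open question):

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 2, pd.NA], dtype="Int64")
>>> s.to_numpy()                                   # default: object dtype, pd.NA preserved
array([1, 2, <NA>], dtype=object)
>>> s.to_numpy(dtype="float64", na_value=np.nan)   # explicit sentinel on conversion
array([ 1.,  2., nan])
>>> pd.Series(np.array([0.1, np.nan]))             # on input: should this NaN become <NA>?
0    0.1
1    NaN
dtype: float64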

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton

Activity

jorisvandenbossche commented on Feb 26, 2020

How do other tools / languages deal with this?

Julia has both as separate concepts:

julia> arr = [1.0, missing, NaN]
3-element Array{Union{Missing, Float64},1}:
   1.0     
    missing
 NaN       

julia> ismissing.(arr)
3-element BitArray{1}:
 false
  true
 false

julia> isnan.(arr)
3-element Array{Union{Missing, Bool},1}:
 false       
      missing
  true       

R also has both, but will treat NaN as missing in is.na(..):

> v <- c(1.0, NA, NaN)
> v
[1]   1  NA NaN
> is.na(v)
[1] FALSE  TRUE  TRUE
> is.nan(v)
[1] FALSE FALSE  TRUE

Here, the "skipna" na.rm keyword also skips NaN (na.rm docs: "logical. Should missing values (including NaN) be removed?"):

> sum(v)
[1] NA
> sum(v, na.rm=TRUE)
[1] 1

Apache Arrow also has both (NaN can be a float value, while it tracks missing values in a mask). It doesn't yet have many computational tools, but e.g. the sum function skips missing values by default while propagating NaN (like numpy's sum does for float NaN).
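A quick illustration with pyarrow's compute module (pc.sum and its skip_nulls option are existing pyarrow API; results shown as comments):

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> arr = pa.array([1.0, None, float("nan")])   # one null (missing), one NaN
>>> pc.sum(arr)                       # the null is skipped by default, but the NaN propagates -> nan
>>> pc.sum(arr, skip_nulls=False)     # now the null propagates as well -> null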

I think SQL also has both, but didn't yet check in more detail how it handles NaN in missing-like operations.

toobaz commented on Feb 26, 2020

I still don't know the semantics of pd.NA well enough to judge in detail, but I am skeptical about whether users benefit from two distinct concepts. If as a user I divide 0 by 0, it's perfectly fine to me to consider the result as "missing". Even more so because when done in non-vectorized Python, it raises an error rather than returning some "not a number" placeholder. I suspect the other languages (e.g. at least R) have semantics which are more driven by implementation than by user experience. And I would definitely have a hard time suggesting "natural" ways in which the propagation of pd.NA and np.nan should differ.

So ideally pd.NA and np.nan should be the same to users. If, as I understand, this is not possible given how pd.NA was designed and the compatibility we want to (rightfully) keep with numpy, I think the discrepancies should be limited as much as possible.

toobaz commented on Feb 26, 2020

Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.
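A minimal illustration of that scenario (column names and numbers are mine):

>>> import pandas as pd
>>> hours = pd.Series([0.0, 160.0, 150.0])     # monthly hours worked
>>> salary = pd.Series([0.0, 3200.0, 3300.0])  # monthly salary
>>> salary / hours                             # 0 / 0 for the first worker -> NaN
0     NaN
1    20.0
2    22.0
dtype: float64
>>> (salary / hours).mean()                    # NaN is skipped, exactly as if it were missing
21.0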

TomAugspurger commented on Feb 26, 2020

Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me.

Agreed.

do we want both pd.NA and np.nan in a float dtype and have them signify different things?

My initial preference is for not having both. I think that having both will be confusing for users (and harder to maintain).

jreback commented on Feb 26, 2020

Dr-Irv commented on Feb 26, 2020

Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix. I've had cases where the source data (or some other calculation I did) produced a NaN, which pandas treats as missing, and the true source of the problem was either back in the source data (e.g., data that should not have been missing) or a bug elsewhere in my code. So in these cases, where the NaN was introduced by a bug in the source data or in my code, my later calculations were perfectly happy because to pandas the NaN meant "missing". Finding this kind of bug is non-trivial.

I think we should support np.nan and pd.NA. To me, the complexity is in a few places:

  1. The transition for users so they know that np.nan won't mean "missing" in the future needs to be carefully thought out. Maybe we consider a global option to control this behavior?
  2. Going back and forth between pandas and numpy (and maybe other libraries). If we eventually have np.nan and pd.NA mean "not a number" and "missing", respectively, and numpy (or another library) treats np.nan as "missing", do we automate the conversions (both going from pandas to numpy/other and ingesting from numpy/other into pandas)?

We currently also have this inconsistent (IMHO) behavior which relates to (2) above:

>>> s=pd.Series([1,2,pd.NA], dtype="Int64")
>>> s
0       1
1       2
2    <NA>
dtype: Int64
>>> s.to_numpy()
array([1, 2, <NA>], dtype=object)
>>> s
0       1
1       2
2    <NA>
dtype: Int64
>>> s.astype(float).to_numpy()
array([ 1.,  2., nan])
toobaz commented on Feb 26, 2020

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix.

Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.

to pandas, the NaN meant "missing"

Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna argument).

Dr-Irv commented on Feb 26, 2020

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix.

Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.

When I said "such a calculation could indicate something wrong in the data that you need to identify and fix.", the thing that could be wrong in the data might not be missing data. It could be that some combination of values occurred that were not supposed to happen.

There are just two use cases here. One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN unexpectedly, and that indicates a different kind of bug.

to pandas, the NaN meant "missing"

Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna argument).

Yes, but skipna=True is the default everywhere, so your solution would mean that you have to always use skipna=False to detect those kinds of errors.
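For example, with the default skipna=True the stray NaN is silently ignored, and only skipna=False surfaces it:

>>> import numpy as np, pandas as pd
>>> s = pd.Series([1.0, np.nan, 2.0])
>>> s.sum()                # default skipna=True: the NaN is treated as missing and skipped
3.0
>>> s.sum(skipna=False)    # only this surfaces the suspicious value
nan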

toobaz commented on Feb 26, 2020

One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN unexpectedly, and that indicates a different kind of bug.

My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.

toobaz commented on Feb 26, 2020

Yes, but skipna=True is the default everywhere, so your solution would mean that you have to always use skipna=False to detect those kinds of errors.

This is true. But... are there new usability insights compared to those we had back in 2017?

Dr-Irv commented on Feb 26, 2020

My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.

That's why I think having np.nan represent "bad calculation" and pd.NA represent "missing" is the preferred behavior. But I'm one voice among many.

shoyer commented on Feb 26, 2020

That's why I think having np.nan represent "bad calculation" and pd.NA represent "missing" is the preferred behavior. But I'm one voice among many.

+1 for consistency with other computational tools.

On the subject of automatic conversion into NumPy arrays: returning an object dtype array seems consistent but could be a very poor user experience. Object arrays are really slow, and break many/most functions that expect numeric NumPy arrays. Float dtype with auto-conversion from NA -> NaN would probably be preferred by users.

(150 remaining comments not shown)

WillAyd commented on Jul 23, 2024

After reviewing this discussion here and in PDEP-16, my read is that the general consensus is that there is value in distinguishing these, but there is a lot of concern around the implementation (and rightfully so, given the history of pandas).

With that being the case, maybe we can concretely start by adding the nan_is_null keyword to .fillna, .isna, and .hasna with the default value of True that I think @jbrockmendel and @jorisvandenbossche landed on? That maintains backwards compatibility and has prior art in pyarrow (save the fact that pyarrow defaults to False).
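The pyarrow prior art looks like this (pc.is_null and its nan_is_null option are existing pyarrow API; results shown as comments):

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> arr = pa.array([1.0, None, float("nan")])
>>> pc.is_null(arr)                     # default nan_is_null=False -> [false, true, false]
>>> pc.is_null(arr, nan_is_null=True)   # NaN also counted as null  -> [false, true, true]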

Right now the pd.FloatXXDtype() data types become practically unusable the moment a NaN value is introduced, which can happen very easily. By at least giving users the option to .fillna on those types (or filter them out), they could continue to use the extension types without casting back to NumPy. I think that is currently the biggest weakness in our extension type system, preventing it from being used generically.
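A sketch of that friction, as observed on recent pandas versions (exact reprs may vary): an operation introduces a data NaN into a Float64 series, but isna/fillna only look at the mask.

>>> import pandas as pd
>>> s = pd.Series([0.0, 1.0, pd.NA], dtype="Float64")
>>> r = s / 0              # 0/0 introduces a NaN in the data, next to the existing <NA>
>>> r.isna()               # -> [False, False, True]: only the masked <NA> is reported
>>> r.fillna(-1.0)         # -> [NaN, inf, -1.0]: the NaN is left untouched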

I think starting with a smaller scope to just those few methods is helpful; trying to solve constructors is going to open a can of worms that will deter any progress, and I think more generally should be solved through PDEP-16 anyway (or maybe the follow up to it that focuses on helping to better distinguish these values).

a-reich commented on Jul 23, 2024

I upvote starting with something that can be improved short-term vs needing to first reach consensus on a new holistic design.

vkhodygo commented on Sep 27, 2024

@WillAyd

That maintains backwards compatibility and has a prior art in pyarrow (save the fact that pyarrow defaults to False).

Considering the fact that pandas employs the pyarrow engine, it should have the same defaults to avoid even more confusion. Breaking backwards compatibility is not a novel thing; one can't just keep legacy code forever. Besides, pandas devs do this all the time, don't you?

WillAyd commented on Sep 27, 2024

For sure, but our history does make things complicated. Unfortunately, for over a decade pandas users have been commonly doing:

ser.iloc[0] = np.nan

to assign what they think is a "missing value". So we can't just immediately change that to literally mean NaN without some type of transition plan.

There is a larger discussion to that point in PDEP-0016 that you might want to chime in on and follow

#58988 (comment)

theo-brown commented on Feb 24, 2025

Has this been pushed through without ensuring the conversion works? Many of the docs that refer to NaNs have examples that include np.nan that no longer work.

Example: the docs for cumulative operations like df.cummin() (which take skipna=True) still have examples like:

pd.Series([2, np.nan, 5, -1, 0]).cummin(skipna=True)

which does not behave as expected (see linked issues above).

(Also, as many of these linked issues are relating to the same problem, it would be great if there was a clear master issue to group them together)

@MarcoGorelli sorry for the noise, I misunderstood the source of my issue.

MarcoGorelli commented on Feb 24, 2025

@theo-brown which issue specifically are you referring to, and what behaviour would you expect for pd.Series([2, np.nan, 5, -1, 0]).cummin(skipna=True)? That's a Series with classical numpy-backed dtypes; this issue is about nullable types.
