Skip to content

Deprecate na_sentinel, add use_na_sentinel #46910

@rhshadrach

Description

@rhshadrach
Member

Currently specifying na_sentinel=None in pd.factorize will use the sentinel -1 and set (internally) dropna=False. EA's factorize currently doesn't allow na_sentinel being None, and #46601 added the dropna argument to EA's factorize.

It seems best to me to change pd.factorize to match EA's factorize. Tagging as 1.5 since if we are to deprecate, I'd like to get it in.

cc @jbrockmendel @jorisvandenbossche @TomAugspurger

Activity

added
AlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
DeprecateFunctionality to remove in pandas
API - ConsistencyInternal Consistency of API/Behavior
on Apr 30, 2022
added this to the 1.5 milestone on Apr 30, 2022
jbrockmendel

jbrockmendel commented on May 1, 2022

@jbrockmendel
Member

+1

jorisvandenbossche

jorisvandenbossche commented on May 2, 2022

@jorisvandenbossche
Member

The other option could also be to allow na_sentinel to be None in EA.factorize? (so to align that with pd.factorize, instead of the other way around)

We actually added dropna at some point, and then removed it again: #35667, #35852 (I didn't yet look back in detail to understand the reasoning or whether I still agree with it ;)).

rhshadrach

rhshadrach commented on May 2, 2022

@rhshadrach
MemberAuthor

Thanks @jorisvandenbossche! When I was reasoning about it in #46601 (comment), I was focused on the code:

dropna = True
if na_sentinel is None:
na_sentinel = -1
dropna = False

thinking that na_sentinel=-1 was used. But of course, that's only temporary; after this block, the sentinel gets replace by a nonnegative code. With this in mind, na_sentinel=None as the argument now makes a lot more sense to me. It still seems a bit odd to me that na_sentinel=None adds the null value to uniques (which is where dropna seems to make more sense). But since thena_sentinel value doesn't get used when "dropna=False", it makes more sense to me to have this all be one parameter.

jbrockmendel

jbrockmendel commented on May 13, 2022

@jbrockmendel
Member

Are there known use cases where users pass na_sentinel other than -1 or None? e.g. i dont think ive ever seen na_sentinel=14.

rhshadrach

rhshadrach commented on May 14, 2022

@rhshadrach
MemberAuthor

A positive sentinel is likely to break things

ser = pd.Series([1, 2, np.nan, 3])
print(pd.factorize(ser, na_sentinel=2))

(array([0, 1, 2, 2]), Float64Index([1.0, 2.0, 3.0], dtype='float64'))

I don't know any use case for specifying na_sentinel in pd.factorize, and if there is one, a user can change the sentinel after with a single line of code.

+1 on deprecating the specifying of the sentinel value. In that case, there are two options:

  • code NAs as -1, do not include null values in uniques
  • na_sentinel is a nonnegative integer; include null values in uniques

This is then very similar to pyarrow's null_encoding argument in dictionary_encode being "mask" or "encode"; perhaps those would be good names going forward? If that is a good direction, do we prefer "null_encoding" vs "na_encoding" (thinking particularly of pd.NA here).

rhshadrach

rhshadrach commented on May 17, 2022

@rhshadrach
MemberAuthor

@jorisvandenbossche - any thoughts on the proposal to deprecate specifying the specific value of na_sentinel?

rhshadrach

rhshadrach commented on May 23, 2022

@rhshadrach
MemberAuthor

@jorisvandenbossche - friendly ping

rhshadrach

rhshadrach commented on May 24, 2022

@rhshadrach
MemberAuthor

Assuming we go forward with my suggestion in #46910 (comment), with the EA interface currently marked as experimental, we have the option of removing the ability to set na_sentinel immediately or we could have the signature (more closely) match pd.factorize and deprecate in parallel.

Proposal 1 - Deprecate na_sentinel in pd.factorize; change EA to future state immediately:

  • pd.factorize(values, sort: bool = False, na_sentinel: int | None = -1, size_hint: int | None = None, na_encoding: str = 'mask')
  • ExtensionArray.factorize(self, na_encoding: str = 'mask')

Proposal 2 - Deprecate na_sentinel in pd.factorize; have EA match (with na_sentinel also deprecated):

  • pd.factorize(values, sort: bool = False, na_sentinel: int | None = -1, size_hint: int | None = None, na_encoding: str = 'mask')
  • ExtensionArray.factorize(self, na_sentinel: int | None = -1, na_encoding: str = 'mask')

Since pd.factorize doesn't match ExtensionArray.factorize today, that makes me think Proposal 1 would be a better route. Any preference @jbrockmendel?

jorisvandenbossche

jorisvandenbossche commented on May 24, 2022

@jorisvandenbossche
Member

I suppose there are theoretical use case of other na_sentinel values as -1. For example taking a large positive number, so missing values sorts last instead of first (not sure that actually makes sense). Whether anybody actually uses this in practice, I also highly doubt. So deprecating it seems fine.

with the EA interface currently marked as experimental,

While we might still label it like that, there are lots of people using it, so I would prefer something with a smooth upgrade path for external EA implementors (so I prefer something more in the line of proposal 2).

For naming, one problem with "mask" is that this words fits less in the pandas context of using "-1" as sentinel (in pyarrow it actually returns an array with nulls (not -1's), which are implemented as a kind of a mask (not fully sure the naming comes from that, though)). Maybe na_encoding="sentinel" instead? (although not fully sure this is actually clearer)

rhshadrach

rhshadrach commented on May 24, 2022

@rhshadrach
MemberAuthor

Thanks @jorisvandenbossche - I don't have any objection to calling it a "mask", but do see how this usage is a bit of a stretch from a typical use. I'm also happy with your suggestion of na_encoding=["sentinel"|"encode"]. One other option that comes to mind is na_sentinel=[True|False]. But being the same name as the current argument might make deprecation confusing.

jbrockmendel

jbrockmendel commented on May 25, 2022

@jbrockmendel
Member

I agree with joris about a smooth deprecation path. No real preference otherwise.

rhshadrach

rhshadrach commented on May 27, 2022

@rhshadrach
MemberAuthor

After trying a few ways, I find myself preferring use_na_sentinel=[True|False]. Will be putting up a PR shortly, but can always change if desired.

changed the title [-]Deprecate na_sentinel being None in pd.factorize, add dropna[/-] [+]Deprecate na_sentinel, add use_na_sentinel[/+] on Oct 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    API - ConsistencyInternal Consistency of API/BehaviorAlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffDeprecateFunctionality to remove in pandas

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @jorisvandenbossche@jbrockmendel@rhshadrach

      Issue actions

        Deprecate na_sentinel, add use_na_sentinel · Issue #46910 · pandas-dev/pandas