-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
Closed
Labels
API - ConsistencyInternal Consistency of API/BehaviorInternal Consistency of API/BehaviorAlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffDeprecateFunctionality to remove in pandasFunctionality to remove in pandas
Milestone
Description
Currently specifying na_sentinel=None
in pd.factorize will use the sentinel -1 and set (internally) dropna=False. EA's factorize currently doesn't allow na_sentinel
being None, and #46601 added the dropna argument to EA's factorize.
It seems best to me to change pd.factorize to match EA's factorize. Tagging as 1.5 since if we are to deprecate, I'd like to get it in.
Metadata
Metadata
Assignees
Labels
API - ConsistencyInternal Consistency of API/BehaviorInternal Consistency of API/BehaviorAlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffDeprecateFunctionality to remove in pandasFunctionality to remove in pandas
Type
Projects
Relationships
Development
Select code repository
Activity
jbrockmendel commentedon May 1, 2022
+1
jorisvandenbossche commentedon May 2, 2022
The other option could also be to allow
na_sentinel
to be None in EA.factorize? (so to align that withpd.factorize
, instead of the other way around)We actually added
dropna
at some point, and then removed it again: #35667, #35852 (I didn't yet look back in detail to understand the reasoning or whether I still agree with it ;)).rhshadrach commentedon May 2, 2022
Thanks @jorisvandenbossche! When I was reasoning about it in #46601 (comment), I was focused on the code:
pandas/pandas/core/algorithms.py
Lines 715 to 718 in 60ac973
thinking that
na_sentinel=-1
was used. But of course, that's only temporary; after this block, the sentinel gets replace by a nonnegative code. With this in mind,na_sentinel=None
as the argument now makes a lot more sense to me. It still seems a bit odd to me thatna_sentinel=None
adds the null value to uniques (which is where dropna seems to make more sense). But since thena_sentinel
value doesn't get used when "dropna=False", it makes more sense to me to have this all be one parameter.jbrockmendel commentedon May 13, 2022
Are there known use cases where users pass na_sentinel other than -1 or None? e.g. i dont think ive ever seen
na_sentinel=14
.rhshadrach commentedon May 14, 2022
A positive sentinel is likely to break things
I don't know any use case for specifying na_sentinel in pd.factorize, and if there is one, a user can change the sentinel after with a single line of code.
+1 on deprecating the specifying of the sentinel value. In that case, there are two options:
This is then very similar to pyarrow's null_encoding argument in dictionary_encode being "mask" or "encode"; perhaps those would be good names going forward? If that is a good direction, do we prefer "null_encoding" vs "na_encoding" (thinking particularly of pd.NA here).
rhshadrach commentedon May 17, 2022
@jorisvandenbossche - any thoughts on the proposal to deprecate specifying the specific value of
na_sentinel
?rhshadrach commentedon May 23, 2022
@jorisvandenbossche - friendly ping
rhshadrach commentedon May 24, 2022
Assuming we go forward with my suggestion in #46910 (comment), with the EA interface currently marked as experimental, we have the option of removing the ability to set
na_sentinel
immediately or we could have the signature (more closely) matchpd.factorize
and deprecate in parallel.Proposal 1 - Deprecate na_sentinel in pd.factorize; change EA to future state immediately:
Proposal 2 - Deprecate na_sentinel in pd.factorize; have EA match (with na_sentinel also deprecated):
Since pd.factorize doesn't match ExtensionArray.factorize today, that makes me think Proposal 1 would be a better route. Any preference @jbrockmendel?
jorisvandenbossche commentedon May 24, 2022
I suppose there are theoretical use case of other
na_sentinel
values as -1. For example taking a large positive number, so missing values sorts last instead of first (not sure that actually makes sense). Whether anybody actually uses this in practice, I also highly doubt. So deprecating it seems fine.While we might still label it like that, there are lots of people using it, so I would prefer something with a smooth upgrade path for external EA implementors (so I prefer something more in the line of proposal 2).
For naming, one problem with
"mask"
is that this words fits less in the pandas context of using "-1" as sentinel (in pyarrow it actually returns an array with nulls (not -1's), which are implemented as a kind of a mask (not fully sure the naming comes from that, though)). Maybena_encoding="sentinel"
instead? (although not fully sure this is actually clearer)rhshadrach commentedon May 24, 2022
Thanks @jorisvandenbossche - I don't have any objection to calling it a "mask", but do see how this usage is a bit of a stretch from a typical use. I'm also happy with your suggestion of
na_encoding=["sentinel"|"encode"]
. One other option that comes to mind isna_sentinel=[True|False]
. But being the same name as the current argument might make deprecation confusing.jbrockmendel commentedon May 25, 2022
I agree with joris about a smooth deprecation path. No real preference otherwise.
rhshadrach commentedon May 27, 2022
After trying a few ways, I find myself preferring
use_na_sentinel=[True|False]
. Will be putting up a PR shortly, but can always change if desired.[-]Deprecate na_sentinel being None in pd.factorize, add dropna[/-][+]Deprecate na_sentinel, add use_na_sentinel[/+]na_sentinel
with null mask forSeries.label_encoding
rapidsai/cudf#5622use_na_sentinel
in factorize rapidsai/cudf#6946