
API/DEPR: numeric_only kwarg for apply/reductions #28900

Description

jbrockmendel (Member)

A lot of complexity in DataFrame._reduce/apply/groupby ops is driven by numeric_only=None. In these cases we try to apply the reduction to non-numeric dtypes and, if it fails, exclude them. This hugely complicates our exception handling.

We should consider removing that option and require users to subset the appropriate columns before doing their operations.
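The explicit workflow the proposal would require can be sketched as follows (with made-up data; `select_dtypes` is one way to do the subsetting, not the only one):

```python
import pandas as pd

# Hypothetical frame with one numeric and one non-numeric ("nuisance") column.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})

# Explicit subsetting: the user selects the numeric columns first,
# then reduces, instead of relying on silent nuisance-column dropping.
numeric = df.select_dtypes(include="number")
means = numeric.mean()
print(means)
```

Here only column `"a"` participates in the reduction, and no exception handling is needed anywhere.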

Activity

WillAyd (Member) commented on Oct 10, 2019

This is in need of some cleanup and could certainly be refactored; might be a tough sell to do away with the implicit dropping of "nuisance" columns though.

I know the OP mentions DataFrame ops, but it's worth mentioning that this is very prevalent in GroupBy as well.

TomAugspurger (Contributor) commented on Oct 10, 2019

might be a tough sell to do away with the implicit dropping of "nuisance" columns though.

Agreed with this.

Ignoring object-dtype for the moment, can we determine the valid dtypes to include / drop before doing the operation?

jbrockmendel (Member, Author) commented on Oct 10, 2019

@TomAugspurger can you provide some context for the block quote? It looks like a dask thing that might be unrelated (or could be related in a really interesting way).

Ignoring object-dtype for the moment, can we determine the valid dtypes to include / drop before doing the operation?

I think so. Will become more certain as the exception-catching PRs in groupby progress.
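Determining the valid columns up front, rather than by trying the op and catching exceptions, could look roughly like this (made-up data; `pandas.api.types.is_numeric_dtype` is the public dtype check):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [1.5, 2.5]})

# Decide which columns a numeric reduction should include by inspecting
# dtypes, instead of attempting the op and suppressing whatever it raises.
keep = [col for col in df.columns if is_numeric_dtype(df[col])]
print(keep)
```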

TomAugspurger (Contributor) commented on Oct 10, 2019

Copy-paste fail. Fixed.

jbrockmendel (Member, Author) commented on Oct 10, 2019

Related: what would you expect the behavior to be if you did df.groupby("group").mean(numeric_only=False) on a DataFrame that includes non-numeric columns?

Motivated by a case I'm troubleshooting in tests.groupby.test_function.test_arg_passthru, where it looks like it is trying the op on non-numeric (in this case categorical[str]) columns and then dropping those.

jbrockmendel (Member, Author) commented on Oct 21, 2019

@jorisvandenbossche you might be interested in weighing in here. The current behavior of suppressing exceptions and excluding columns is a big part of the reason why groupby is so hard to reason about.

jorisvandenbossche (Member) commented on Oct 22, 2019

I agree that moving away from the automatic dropping of "nuisance" columns seems like a big breakage.

what would you expect the behavior to be if you did df.groupby("group").mean(numeric_only=False) on a DataFrame that includes non-numeric columns?

I would expect it to raise, like it does for df.mean(numeric_only=False) ...
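For contrast, the numeric_only=True case is already deterministic: non-numeric columns are dropped up front by dtype, with no exception-catching involved. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["g1", "g1", "g2"],
    "x": [1.0, 3.0, 5.0],
    "s": ["a", "b", "c"],  # non-numeric ("nuisance") column
})

# With numeric_only=True, "s" is excluded by dtype before the reduction runs.
result = df.groupby("group").mean(numeric_only=True)
print(result)
```

The ambiguity this thread is about lives entirely in the numeric_only=None (default) and numeric_only=False paths.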

jbrockmendel (Member, Author) commented on Sep 5, 2020

@jreback on the call today you seemed positive on this idea, asked how many tests rely on current behavior. Looks like 50 total, roughly a third in DataFrame reductions (mostly trying to do math on strings), a third in groupby.apply/agg (some strings, some Categorical), and the rest window (mostly dt64)

I've spent a big part of the afternoon trying to get the deprecation for #36076 right, and it's a PITA; it would be a lot easier after this.


Label "Nuisance Columns" (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply) added on Sep 21, 2020
jbrockmendel (Member, Author) commented on Sep 25, 2020

Thinking about how to actually implement this deprecation, it would look something like:

if numeric_only is not False:
    warnings.warn(
        "numeric_only=whatever is deprecated and will raise in a future "
        "version. Instead, select the columns you want to operate on and "
        "use `df[columns].whatever(numeric_only=False)`"
    )

and then I guess after that is enforced in 2.0, we then do another deprecation to get rid of the kwarg altogether?

jreback (Contributor) commented on Sep 25, 2020

Don't we normally just change the default to None? Then we can easily tell if the user changed it.

And in 2.0 just drop it entirely.

jbrockmendel (Member, Author) commented on Sep 25, 2020

numeric_only=None is the default, and it's the most problem-causing of the three options (True, False, None).

jreback (Contributor) commented on Sep 25, 2020

Ahh, then can we do lib.no_default?

We want to warn if it's explicitly passed.
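The sentinel-default pattern jreback is describing can be sketched in plain Python (pandas uses `pandas._libs.lib.no_default` internally; the `mean` function here is a toy stand-in, not the real implementation):

```python
import warnings

# Unique sentinel object; stands in for pandas' lib.no_default.
no_default = object()

def mean(numeric_only=no_default):
    if numeric_only is not no_default:
        # The keyword was passed explicitly: emit the deprecation warning.
        warnings.warn("numeric_only is deprecated", FutureWarning)
        return numeric_only
    # Keyword not passed: fall back to the current default behavior (None).
    return None

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mean()                   # default used: no warning
    mean(numeric_only=True)  # explicit: warns
print(len(caught))
```

Because the sentinel is a unique object a user cannot plausibly pass, an `is` check distinguishes "not passed" from "passed None" — which a `None` default cannot, since `None` is itself one of the meaningful values here.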

jbrockmendel (Member, Author) commented on Sep 25, 2020

we want to warn if it's explicitly passed

Yes, but the behavior is going to change even if it is not explicitly passed, so we need to warn in that case too.

jorisvandenbossche (Member) commented on Sep 25, 2020

Personally, I am not yet fully convinced we actually want to deprecate this. Automatically dropping columns that are not relevant for the aggregation is also quite convenient, and I am sure a ton of code relies on this behaviour.
(and I have the impression others also are not yet convinced in the comments above)

But that doesn't mean we can't try to improve the complexity of the situation. For example echoing @TomAugspurger's comment above (#28900 (comment)), would it be possible to more deterministically know which columns will be included?
And if it is only the object dtype columns that are the problem / introduce the complexities, we could maybe also think about only changing something for object dtype?

Thinking about how to actually implement this deprecation, it would look something like:

If we deprecate, I think we should ensure we only raise a warning when something would actually change for the user. So e.g. when there are only numerical columns, the exact value of numeric_only has no impact, and you shouldn't see any warning?
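The "warn only when it matters" check could, in the simple cases, be a dtype scan before the operation. A sketch with a hypothetical helper name (`would_behavior_change` is made up for illustration):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def would_behavior_change(df):
    # Hypothetical helper: the default numeric_only handling only has an
    # effect when there is at least one non-numeric column to drop.
    return any(not is_numeric_dtype(df[c]) for c in df.columns)

all_numeric = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(would_behavior_change(all_numeric), would_behavior_change(mixed))
```

As jbrockmendel notes below, this is doable for sufficiently simple cases but gets messy quickly (object dtype, axis=1, nested ops).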

jbrockmendel (Member, Author) commented on Sep 25, 2020

AFAICT there are two options to achieve internally consistent behavior:

  1. Remove numeric_only altogether (this issue)
  2. Make DataFrame._reduce
    a) always operate block-wise (column-wise would be fine too)
    b) operate column-wise for object-dtype
    c) even then, axis=1 is a PITA

My main preference is to fix as many of these as possible without waiting for 2.0.

If we deprecate, I think we should ensure we only raise a warning when actually something would change for the user. So eg when having only numerical columns, the exact value of numeric_only has no impact, and you shouldn't see any warning?

For sufficiently simple cases this is doable, but things get pretty FUBAR pretty quickly.

jbrockmendel (Member, Author) commented on Oct 6, 2021

Metadata

Labels: Apply (Apply, Aggregate, Transform, Map), Deprecate (Functionality to remove in pandas), Needs Discussion (Requires discussion from core team before further action), Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply), Reduction Operations (sum, mean, min, max, etc.)


Participants: @WillAyd, @jreback, @jorisvandenbossche, @TomAugspurger, @jbrockmendel
