Closed
Labels
Apply: Apply, Aggregate, Transform, Map
Deprecate: Functionality to remove in pandas
Needs Discussion: Requires discussion from core team before further action
Nuisance Columns: Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply
Reduction Operations: sum, mean, min, max, etc.
Description
A lot of complexity in DataFrame._reduce/apply/groupby ops is driven by numeric_only=None. In these cases we try to apply the reduction to non-numeric dtypes and, if it fails, exclude them. This hugely complicates our exception handling.
We should consider removing that option and requiring users to subset the appropriate columns before doing their operations.
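As an illustration of the explicit alternative (a minimal sketch; the frame and column names are invented for the example):

```python
import pandas as pd

# A frame mixing numeric columns with a non-numeric ("nuisance") column.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10, 20, 30],
    "s": ["x", "y", "z"],
})

# Subset the appropriate columns up front, so the reduction never has
# to try-and-drop: no exception catching, no silent column exclusion.
result = df.select_dtypes(include="number").mean()
print(result)
```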
WillAyd commented on Oct 10, 2019
This is in need of some cleanup and could certainly be refactored; it might be a tough sell to do away with the implicit dropping of "nuisance" columns, though.
I know the OP mentions DataFrame ops, but it is worth mentioning that this is very prevalent in GroupBy as well.
TomAugspurger commented on Oct 10, 2019
Agreed with this.
Ignoring object-dtype for the moment, can we determine the valid dtypes to include / drop before doing the operation?
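One possible answer, sketched with public pandas API (using `pandas.api.types.is_numeric_dtype` is just one way to make the include/drop decision up front; the frame is invented for illustration):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "x": [1, 2, 3],
    "y": [0.5, 1.5, 2.5],
    "s": ["a", "b", "c"],
})

# Decide from the dtypes which columns a numeric reduction would keep,
# instead of trying the op on every column and catching whatever it raises.
valid = [col for col in df.columns if is_numeric_dtype(df[col])]
result = df[valid].mean()
print(valid)
```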
jbrockmendel commented on Oct 10, 2019
@TomAugspurger can you provide some context for the block quote? It looks like a dask thing that might be unrelated (or could be related in a really interesting way).
I think so. Will become more certain as the exception-catching PRs in groupby progress.
TomAugspurger commented on Oct 10, 2019
Copy-paste fail. Fixed.
jbrockmendel commented on Oct 10, 2019
Related: what would you expect the behavior to be if you did
df.groupby("group").mean(numeric_only=False)
on a DataFrame that includes non-numeric columns?
Motivated by a case I'm troubleshooting in tests.groupby.test_function.test_arg_passthru, where it looks like it is trying the op on non-numeric (in this case categorical[str]) columns and then dropping those.
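For reference, a minimal version of the explicit-subsetting workaround for such a frame (the frame below is invented for illustration; in recent pandas releases, `mean(numeric_only=False)` on it raises rather than dropping the string column):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "x": [1.0, 2.0, 3.0],
    "s": ["p", "q", "r"],
})

# Selecting the numeric column explicitly sidesteps the question of
# what numeric_only=False should do with the string column.
out = df.groupby("group")["x"].mean()
print(out)
```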
jbrockmendel commented on Oct 21, 2019
@jorisvandenbossche you might be interested in weighing in here. The current behavior of suppressing exceptions and excluding columns is a big part of the reason why groupby is so hard to reason about.
jorisvandenbossche commented on Oct 22, 2019
I agree that it would be a big breakage to move away from the automatic dropping of "nuisance" columns.
I would expect it to raise, like it does for
df.mean(numeric_only=False)
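In recent pandas versions this does indeed raise; a small illustration (the exception type shown is what current releases raise, which may differ in older versions):

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, 3], "s": ["a", "b", "c"]})

# numeric_only=False forces the reduction onto every column; with a
# string column present this raises (a TypeError in recent pandas)
# instead of silently dropping "s".
try:
    df.mean(numeric_only=False)
    raised = False
except TypeError:
    raised = True
print(raised)
```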
jbrockmendel commented on Sep 5, 2020
@jreback on the call today you seemed positive on this idea, asked how many tests rely on current behavior. Looks like 50 total, roughly a third in DataFrame reductions (mostly trying to do math on strings), a third in groupby.apply/agg (some strings, some Categorical), and the rest window (mostly dt64)
I've spent a big part of the afternoon trying to get the deprecation for #36076 right and it's a PITA; it would be a lot easier after doing this.
jbrockmendel commented on Sep 25, 2020
Thinking about how to actually implement this deprecation, it would look something like:
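The snippet that originally followed here is not preserved. As a rough, library-agnostic sketch of the sentinel-plus-FutureWarning pattern discussed below (`no_default` is a stand-in for pandas' internal sentinel, and the toy `mean` is not pandas code):

```python
import warnings

# Stand-in sentinel, playing the role of pandas' lib.no_default.
no_default = object()

def mean(values, numeric_only=no_default):
    """Toy reduction illustrating the deprecation pattern under discussion."""
    if numeric_only is no_default:
        # Warn even when the argument is not explicitly passed: the
        # default behavior itself is what will change.
        warnings.warn(
            "The default behavior of numeric_only is deprecated; "
            "pass numeric_only explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        numeric_only = None
    if numeric_only is None or numeric_only:
        # In this toy version, None (the legacy try-and-drop mode) and
        # True both end up excluding non-numeric values.
        values = [v for v in values if isinstance(v, (int, float))]
    return sum(values) / len(values)
```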
and then I guess after that is enforced in 2.0, we then do another deprecation to get rid of the kwarg altogether?
jreback commented on Sep 25, 2020
don't we normally just change the default to None? then we can easily tell if the user changed it
and in 2.0 just drop it entirely
jbrockmendel commented on Sep 25, 2020
numeric_only=None is the default, and it's the most problem-causing of the options (True, False, None)
jreback commented on Sep 25, 2020
ahh then can we do lib.no_default?
we want to warn if it's explicitly passed
jbrockmendel commented on Sep 25, 2020
Yes, but the behavior is going to change even if it is not explicitly passed, so we need to warn in that case too.
jorisvandenbossche commented on Sep 25, 2020
Personally, I am not yet fully convinced we actually want to deprecate this. Automatically dropping columns that are not relevant for the aggregation is also quite convenient, and I am sure a ton of code relies on this behaviour.
(and I have the impression others also are not yet convinced in the comments above)
But that doesn't mean we can't try to improve the complexity of the situation. For example echoing @TomAugspurger's comment above (#28900 (comment)), would it be possible to more deterministically know which columns will be included?
And if it is only the object dtype columns that are the problem / introduce the complexities, we could maybe also think about only changing something for object dtype?
If we deprecate, I think we should ensure we only raise a warning when something would actually change for the user. So e.g. when having only numerical columns, the exact value of numeric_only has no impact, and you shouldn't see any warning?

jbrockmendel commented on Sep 25, 2020
AFAICT there are two options to achieve internally consistent behavior:
a) always operate block-wise (column-wise would be fine too)
b) operate column-wise for object-dtype
Even then, axis=1 is a PITA.
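Option (b) could be sketched roughly like this (illustrative only, not pandas internals; `columnwise_reduce` is a made-up helper):

```python
import pandas as pd

def columnwise_reduce(df, op):
    """Apply a reduction per column; a column whose op raises is skipped.

    This mimics the try-and-drop behavior under discussion, but the
    exception handling is confined to one column at a time instead of
    being threaded through block-wise code paths.
    """
    results = {}
    for col in df.columns:
        try:
            results[col] = op(df[col])
        except TypeError:
            continue  # nuisance column: drop it
    return pd.Series(results)

df = pd.DataFrame({"a": [1, 2, 3], "s": ["x", "y", "z"]})
out = columnwise_reduce(df, lambda s: s.mean())
print(out)
```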
My main preference is to fix as much as possible of these without waiting for 2.0.
For sufficiently simple cases this is doable, but things get pretty FUBAR pretty quickly.
jbrockmendel commented on Oct 6, 2021
Closed by #43743, #43741, #42834, #41480, #41475