Skip to content

ENH/PERF: copy=True keyword in methods where copy=False can improve perf #48141

Closed
@jbrockmendel

Description

@jbrockmendel
Member

xref #48117, #47993, #48043, #47934, #47932

We have a number of methods that currently make copies but don't have to, mainly methods that revolve around index/columns while leaving the data untouched. e.g. DataFrame.droplevel. By adding a keyword copy=True to these methods, we give users the choice to avoid making a copy by setting copy=False, thereby improving performance.

@jorisvandenbossche makes a reasonable point #48117 (review) that adding this may interact with potential copy/view changes in upcoming versions.

Besides droplevel and drop, for which PRs are already open, the other methods I'm looking at are reset_index, replace, and fillna (there may be others I haven't found). In all three of those cases, the idea would be to add copy and deprecate inplace (xref #16529).

Activity

mroeschke

mroeschke commented on Aug 18, 2022

@mroeschke
Member

Should we revert #48043, #47934, #47932 in the meantime?

jorisvandenbossche

jorisvandenbossche commented on Aug 20, 2022

@jorisvandenbossche
Member

On the PR I said:

And that change of behaviour is quite subtle (it's not something that we can easily deprecate about for those specific cases). I would really like to avoid that more people start to rely on the mutability of dataframes returned in such a copy=False case.

On second thought, this is not fully true. We could deprecate the copy=False behaviour, by just deprecating the full keyword (or specific option). But even then it is still the case that we are adding a new keyword that we would deprecate again shortly, which IMO also wouldn't be ideal.

jreback

jreback commented on Aug 20, 2022

@jreback
Contributor

ok i agree that this could be a can of worms and we have one change to do this right so will propose

  • let's revert the changes adding copy= for 1.5
  • let's inventory (create an issue) all methods and see where we have copy= and/or inplace= keywords
  • after 2.0 i think we should deprecate inplace= and add copy=
added
PerformanceMemory or execution speed performance
Needs DiscussionRequires discussion from core team before further action
and removed
Needs TriageIssue that has not been reviewed by a pandas team member
on Aug 20, 2022
jbrockmendel

jbrockmendel commented on Aug 20, 2022

@jbrockmendel
MemberAuthor

We could deprecate the copy=False behaviour, by just deprecating the full keyword (or specific option).

I've been assuming that, since we have existing methods with copy=True keywords, that we will end up making a decorator to deprecate those if/when CoW makes them redundant. If so, the marginal cost (on the dev side) to deprecating extra methods would be low.

But even then it is still the case that we are adding a new keyword that we would deprecate again shortly, which IMO also wouldn't be ideal.

That's fair. My thought going in was a) we an offer improved performance now, while such a deprecation is both future and not-assured, b) all else equal, an explicit copy is a better API than an implicit one (CoW).

If we do add the keyword now(ish), we could also document that there is a decent chance of an upcoming change.

jbrockmendel

jbrockmendel commented on Aug 20, 2022

@jbrockmendel
MemberAuthor

@jorisvandenbossche what if we did the following if/when CoW becomes the default: keep a copy keyword but have the default be lib.no_default, so copy=True makes a copy eagerly, not-specifying-copy defaults to CoW, and copy=False does something like result._mgr.refs = None before returning result? Would result be an actual view in that case? Would that Break The World for CoW?

jorisvandenbossche

jorisvandenbossche commented on Aug 21, 2022

@jorisvandenbossche
Member

what if we did the following if/when CoW becomes the default: keep a copy keyword but have the default be lib.no_default, so copy=True makes a copy eagerly, not-specifying-copy defaults to CoW, and copy=False does something like result._mgr.refs = None before returning result? Would result be an actual view in that case? Would that Break The World for CoW?

Yeah, based on my current understanding, that would break the CoW logic. Or at least it would introduce hard to predict behaviour. For example, you can set the refs of the resulting dataframe itself to None, but there might still be other objects referencing the same underlying data. So it is not a guarantee that no CoW happens (since we also check if the data is being referenced), or if we avoid CoW, it can break the guarantee of another object not getting mutated.
See also my comment here #36195 (comment) about the shallow copy concept (i.e. copy=False or copy(deep=False)) not being possible anymore in its current meaning when using CoW.

We might want to add an advanced feature to actually do something like this, at some point if there is demand for that. But if we do so, I think it should be 1) new API (so you don't accidentally get this because you already did copy=False before), and 2) it might be better to expose this in methods that do the actual mutation (so you can say: I know what I am doing, for this mutation, do not do any CoW ref checks)

In general for the copy keyword, we still need to discuss what we actually want to do with this keyword. My current thinking was that we can indeed keep this, and have copy=True do an actual (eager) copy, while copy=False (the default) uses delayed CoW (that's what #46958 did for the few methods where I already updated this, see #46958 (comment)).
There can be some value in allowing users to do copy=True (although limited, since you can also use the copy() method afterwards if you want this). Alternatively, we could also use copy=<no_default> instead of copy=False as the default. Or we could also simply remove the copy keyword eventually.

If so, the marginal cost (on the dev side) to deprecating extra methods would be low.

On the dev side yes, but on the user side we would still be advising people to use copy=False now (eg from an inplace deprecation), which they potentially need to change again then, so in which case users would have to update their code twice.

all else equal, an explicit copy is a better API than an implicit one (CoW).

Yes, but that's for copy=True. If we are adding a new copy keyword now in certain methods, it is for users to do copy=False? (since the True case is already the current behaviour)

mroeschke

mroeschke commented on Aug 21, 2022

@mroeschke
Member

@jbrockmendel would you have time to revert #48043, #47934, #47932? I was attempting to revert but got lost in all the merge conflicts as I think inplace might or might not supposed to be deprecated somewhere as well

18 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Closing CandidateMay be closeable, needs more eyeballsEnhancementNeeds DiscussionRequires discussion from core team before further actionPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @jreback@jorisvandenbossche@jbrockmendel@datapythonista@mroeschke

        Issue actions

          ENH/PERF: copy=True keyword in methods where copy=False can improve perf · Issue #48141 · pandas-dev/pandas