Description
xref #48117, #47993, #48043, #47934, #47932
We have a number of methods that currently make copies but don't have to, mainly methods that revolve around index/columns while leaving the data untouched. e.g. DataFrame.droplevel
. By adding a keyword copy=True
to these methods, we give users the choice to avoid making a copy by setting copy=False
, thereby improving performance.
@jorisvandenbossche makes a reasonable point #48117 (review) that adding this may interact with potential copy/view changes in upcoming versions.
Besides droplevel
and drop
, for which PRs are already open, the other methods I'm looking at are reset_index
, replace
, and fillna
(there may be others I haven't found). In all three of those cases, the idea would be to add copy and deprecate inplace (xref #16529).
Activity
mroeschke commentedon Aug 18, 2022
Should we revert #48043, #47934, #47932 in the meantime?
jorisvandenbossche commentedon Aug 20, 2022
On the PR I said:
On second thought, this is not fully true. We could deprecate the
copy=False
behaviour, by just deprecating the full keyword (or specific option). But even then it is still the case that we are adding a new keyword that we would deprecate again shortly, which IMO also wouldn't be ideal.jreback commentedon Aug 20, 2022
ok i agree that this could be a can of worms and we have one change to do this right so will propose
jbrockmendel commentedon Aug 20, 2022
I've been assuming that, since we have existing methods with copy=True keywords, that we will end up making a decorator to deprecate those if/when CoW makes them redundant. If so, the marginal cost (on the dev side) to deprecating extra methods would be low.
That's fair. My thought going in was a) we an offer improved performance now, while such a deprecation is both future and not-assured, b) all else equal, an explicit copy is a better API than an implicit one (CoW).
If we do add the keyword now(ish), we could also document that there is a decent chance of an upcoming change.
jbrockmendel commentedon Aug 20, 2022
@jorisvandenbossche what if we did the following if/when CoW becomes the default: keep a copy keyword but have the default be lib.no_default, so
copy=True
makes a copy eagerly, not-specifying-copy defaults to CoW, and copy=False does something likeresult._mgr.refs = None
before returningresult
? Wouldresult
be an actual view in that case? Would that Break The World for CoW?jorisvandenbossche commentedon Aug 21, 2022
Yeah, based on my current understanding, that would break the CoW logic. Or at least it would introduce hard to predict behaviour. For example, you can set the
refs
of the resulting dataframe itself to None, but there might still be other objects referencing the same underlying data. So it is not a guarantee that no CoW happens (since we also check if the data is being referenced), or if we avoid CoW, it can break the guarantee of another object not getting mutated.See also my comment here #36195 (comment) about the shallow copy concept (i.e.
copy=False
orcopy(deep=False)
) not being possible anymore in its current meaning when using CoW.We might want to add an advanced feature to actually do something like this, at some point if there is demand for that. But if we do so, I think it should be 1) new API (so you don't accidentally get this because you already did
copy=False
before), and 2) it might be better to expose this in methods that do the actual mutation (so you can say: I know what I am doing, for this mutation, do not do any CoW ref checks)In general for the
copy
keyword, we still need to discuss what we actually want to do with this keyword. My current thinking was that we can indeed keep this, and havecopy=True
do an actual (eager) copy, whilecopy=False
(the default) uses delayed CoW (that's what #46958 did for the few methods where I already updated this, see #46958 (comment)).There can be some value in allowing users to do
copy=True
(although limited, since you can also use thecopy()
method afterwards if you want this). Alternatively, we could also usecopy=<no_default>
instead ofcopy=False
as the default. Or we could also simply remove thecopy
keyword eventually.On the dev side yes, but on the user side we would still be advising people to use
copy=False
now (eg from an inplace deprecation), which they potentially need to change again then, so in which case users would have to update their code twice.Yes, but that's for
copy=True
. If we are adding a newcopy
keyword now in certain methods, it is for users to docopy=False
? (since the True case is already the current behaviour)mroeschke commentedon Aug 21, 2022
@jbrockmendel would you have time to revert #48043, #47934, #47932? I was attempting to revert but got lost in all the merge conflicts as I think
inplace
might or might not supposed to be deprecated somewhere as well18 remaining items