Description
With the Copy-on-Write implementation (see #36195 / proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit, and overview follow up issue #48998), we can avoid doing an actual copy of the data in DataFrame and Series methods that typically return a copy / new object.
A typical example is the following:
df2 = df.rename(columns=str.lower)
By default, the rename() method returns a new object (DataFrame) with a copy of the data of the original DataFrame (and thus, mutating values in df2 never mutates df). With CoW enabled (pd.options.mode.copy_on_write = True), we can still return a new object, but now pointing to the same data under the hood (avoiding an initial copy), while preserving the observed behaviour of df2 being a copy / not mutating df when df2 is mutated (through the CoW mechanism, which only copies the data in df2 when actually needed upon mutation, i.e. a delayed or lazy copy).
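A minimal sketch of the behaviour described above, assuming a pandas version in which the mode.copy_on_write option exists (1.5 or later). The example frame and the use of np.shares_memory on to_numpy() results are illustrative choices, not taken from the pandas test suite:

```python
import numpy as np
import pandas as pd

# Enable CoW as described above (assumes the option is available).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
df2 = df.rename(columns=str.lower)

# Before any mutation, df2 is expected to point at the same data as df.
shared_before = np.shares_memory(df["A"].to_numpy(), df2["a"].to_numpy())

# Mutating df2 triggers the delayed copy, so df keeps its original values.
df2.iloc[0, 0] = 100
shared_after = np.shares_memory(df["A"].to_numpy(), df2["a"].to_numpy())

print(shared_before, shared_after)
print(df["A"].tolist(), df2["a"].tolist())
```

The point of the check is that the copy happens only at the mutation, not at the rename() call.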
The way this is done in practice for a method like rename() or reset_index() is by using the fact that copy(deep=None) will mean a true deep copy (current default behaviour) if CoW is not enabled, and this "lazy" copy when CoW is enabled. For example:
Lines 6246 to 6249 in 7bf8d6b
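As a rough illustration of the copy(deep=None) dispatch described above, here is a toy model in plain Python. It is not the pandas implementation: ToyFrame, COPY_ON_WRITE, set_value and the per-column _shared bookkeeping are all hypothetical names made up for this sketch (pandas tracks references differently).

```python
import copy

# Hypothetical global flag standing in for pd.options.mode.copy_on_write.
COPY_ON_WRITE = True


class ToyFrame:
    """Toy column container; only meant to illustrate the lazy-copy idea."""

    def __init__(self, columns, shared=()):
        self.columns = columns      # dict: column name -> list of values
        self._shared = set(shared)  # columns whose list may be referenced elsewhere

    def copy(self, deep=True):
        if deep is None:
            if COPY_ON_WRITE:
                # Lazy copy: new object, same lists; mark them as shared on both sides.
                self._shared.update(self.columns)
                return ToyFrame(dict(self.columns), shared=self.columns)
            deep = True  # without CoW, deep=None means a real deep copy
        if deep:
            return ToyFrame(copy.deepcopy(self.columns))
        return ToyFrame(dict(self.columns))

    def rename(self, func):
        result = self.copy(deep=None)
        result.columns = {func(k): v for k, v in result.columns.items()}
        result._shared = {func(k) for k in result._shared}
        return result

    def set_value(self, name, i, value):
        if name in self._shared:
            # The delayed copy happens here, only when a shared column is written to.
            self.columns[name] = list(self.columns[name])
            self._shared.discard(name)
        self.columns[name][i] = value


df = ToyFrame({"A": [1, 2, 3]})
df2 = df.rename(str.lower)
same_before = df2.columns["a"] is df.columns["A"]
df2.set_value("a", 0, 100)
same_after = df2.columns["a"] is df.columns["A"]
```

In this toy, rename() only has to call copy(deep=None), and the write path decides when a real copy is needed.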
The initial CoW implementation in #46958 only added this logic to a few methods (to ensure this mechanism was working): rename, reset_index, reindex (when reindexing the columns), select_dtypes, to_frame and copy itself.
But there are more methods that can make use of this mechanism, and this issue is meant to serve as the overview issue to summarize and keep track of the progress on this front.
There is a class of methods that perform an actual operation on the data and return newly calculated data (eg typically reductions or the methods wrapping binary operators) that don't have to be considered here. It's only methods that can (potentially, in certain cases) return the original data that could make use of this optimization.
Series / DataFrame methods to update (I added a ? for the ones I wasn't directly sure about; I have to look into what those exactly do to be sure, but left them here to keep track of them, and we can remove them from the list once we know more):
- align -> ENH: Add lazy copy to align #50432
  - Needs a follow-up, see comment -> ENH: Make shallow copy for align nocopy with CoW #50917
- astype -> ENH: Add lazy copy to astype #50802
- between_time -> ENH: Add lazy copy for take and between_time #50476
- convert_dtypes -> ENH: Implement CoW for convert_dtypes #51265
- copy (tackled in initial implementation in #46958)
- drop_duplicates (in case no duplicates are dropped) -> ENH: Add lazy copy for drop duplicates #50431
- droplevel -> ENH: test CoW for drop_level #50552
- dropna -> ENH: Use lazy copy for dropna #50429
- filter -> TST: Copy on Write for filter #50589
- infer_objects -> ENH: Use lazy copy in infer objects #50428
- insert ?
- interpolate -> ENH: Add CoW optimization to interpolate #51249
- isetitem -> TST: CoW with df.isetitem() #50692
  - this is covered by where, but could use an independent test -> TST / CoW: Add test for mask #53745
- pipe -> ENH: Add lazy copy to pipe #50567
- reindex
  - Already handled for reindexing the columns in the initial implementation (#46958), but we can still optimize row selection as well? (in case no actual reindexing takes place) -> TST: add test for reindexing rows with matching index uses shallow copy with CoW #53723
- reindex_like -> ENH: Use cow for reindex_like #50426
- rename (tackled in initial implementation in #46958)
- rename_axis -> ENH: add lazy copy (CoW) mechanism to rename_axis #50415
- reorder_levels -> ENH: add copy on write for df reorder_levels GH49473 #50016
- replace -> ENH: Add lazy copy to replace #50746
  - TODO: Optimize when column not explicitly provided in to_replace?
  - TODO: Optimize list-like
  - TODO: Add note in docs that this is not fully optimized for 2.0 (not necessary if everything is finished by then)
- reset_index (tackled in initial implementation in #46958)
- round (for columns that are not rounded) -> ENH: Add lazy copy to concat and round #50501
- select_dtypes (tackled in initial implementation in #46958)
- set_axis -> ENH/CoW: use lazy copy in set_axis method #49600
- set_flags -> TST: Test cow for set_flags #50489
- set_index -> ENH/CoW: use lazy copy in set_index method #49557
  - TODO: check what happens if parent is mutated -> shouldn't mutate the index! (is the data copied when creating the index?)
- shift -> ENH: Add lazy copy to shift #50753
- sort_index / sort_values (optimization if nothing needs to be sorted)
  - sort_index -> ENH: Add lazy copy for sort_index #50491
  - sort_values -> ENH: Add lazy copy for sort_values #50643
- squeeze -> TST: Test squeeze with CoW #50590
- style. (phofl: I don't think there is anything to do here)
- swapaxes -> ENH: Add lazy copy for swapaxes no op #50573
- swaplevel -> ENH: Add lazy copy to swaplevel #50478
- take (optimization if everything is taken?) -> ENH: Add lazy copy for take and between_time #50476
- truncate -> ENH: Add lazy copy for truncate #50477
- unstack (in optimized case where each column is a slice?)
- update -> TST: add CoW test for update() #51426
- where -> ENH: Add lazy copy to where #51336
- Series.to_frame() (tackled in initial implementation in #46958)
Top-level functions:
- pd.concat -> ENH: Add lazy copy to concat and round #50501
  - add tests for join
Want to contribute to this issue?
Pull requests tackling one of the bullet points above are certainly welcome!
- Pick one of the methods above (best to stick to one method per PR)
- Update the method to make use of a lazy copy (in many cases this might mean using copy(deep=None) somewhere, but for some methods it will be more involved)
- Add a test for it in /pandas/tests/copy_view/test_methods.py (you can mimic one of the existing ones, e.g. test_select_dtypes)
  - You can run the test with PANDAS_COPY_ON_WRITE=1 pytest pandas/tests/copy_view/test_methods.py to test it with CoW enabled (pandas will check that environment variable). The test needs to pass with both CoW disabled and enabled.
  - The tests make use of a using_copy_on_write fixture that can be used within the test function to test different expected results depending on whether CoW is enabled or not.
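The steps above can be sketched roughly as follows. This is only an approximation of the test pattern: check_select_dtypes_lazy_copy is a hypothetical name, and using_copy_on_write is passed as a plain argument here instead of being the pytest fixture. The real tests in test_methods.py may use different helpers and also assert the non-CoW memory behaviour.

```python
import numpy as np
import pandas as pd


def check_select_dtypes_lazy_copy(using_copy_on_write):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
    df_orig = df.copy()

    df2 = df.select_dtypes("int64")

    if using_copy_on_write:
        # With CoW the result is expected to share memory until it is mutated.
        assert np.shares_memory(df2["a"].to_numpy(), df["a"].to_numpy())

    # Mutating the result must never change the parent, with or without CoW.
    df2.iloc[0, 0] = 0
    pd.testing.assert_frame_equal(df, df_orig)


# Without the CoW-specific branch, the check holds regardless of the CoW setting.
check_select_dtypes_lazy_copy(using_copy_on_write=False)
```

In an actual pytest test, the function would be named test_... and take using_copy_on_write as a fixture argument.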
Activity
ntachukwu commented on Nov 6, 2022
This PR 49557 is an attempt to handle the set_index method.
seljaks commented on Nov 9, 2022
Hi, I'll take a look at drop and see how it goes.
DataFrame.drop #49689
seljaks commented on Nov 20, 2022
Took a look at head and tail. CoW is already implemented for these because they're just .iloc under the hood. I think they can be ticked off as is, but can submit a PR with explicit tests if needed.
seljaks commented on Nov 20, 2022
Found an inconsistency while looking at squeeze. When a dataframe has two or more columns df.squeeze returns the dataframe unchanged. This is done by returning df.iloc[slice(None), slice(None)]. So the inconsistency is in iloc. Example:
@jorisvandenbossche This doesn't seem like the desired result. Should I open a separate issue for this or submit a PR linking here when I have a fix?
Sidenote: not sure how to test squeeze behavior when going df -> series or df -> scalar without writing a util function like get_array(series). Link to squeeze docs.
jorisvandenbossche commented on Nov 28, 2022
Specifically for head and tail, I think we could consider changing those to actually return a hard (non-lazy) copy by default. Because those methods are typically used for interactive inspection of your data (seeing the repr of df.head()), it would be nice to avoid having each of those trigger CoW from just looking at the data.
But that for sure needs a separate issue to discuss first.
A PR to just add tests to confirm that right now they use CoW is certainly welcome.
jorisvandenbossche commented on Nov 28, 2022
Sorry for the slow reply here. That's indeed an inconsistency in iloc with full slice. I also noticed that a while ago and had already opened a PR to fix this: #49469
So when that is merged, the squeeze behaviour should also be fixed.
I think for the df -> series case you can test this with the common pattern of "mutate subset (series in this case) -> ensure parent is not mutated"? (it's OK to leave out the shares_memory checks if those are difficult to express for a certain case. In the end it is the "mutations-don't-propagate" behaviour that is the actual documented / guaranteed behaviour we certainly want to have tested)
For df -> scalar, I think that scalars are in general not mutable, so this might not need to be covered?
In numpy, squeeze returns a 0-dim array, and those are mutable (at least those are views, so if you mutate the parent, the 0-dim "scalar" also gets mutated):
but an actual scalar is not:
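A minimal sketch of the numpy behaviour described here, using an assumed one-element array (the variable names are illustrative):

```python
import numpy as np

parent = np.array([[1]])

zero_dim = parent.squeeze()  # 0-dim array, a view on parent's data
scalar = parent[0, 0]        # numpy scalar, an independent value

# Mutating the parent is visible through the 0-dim view, but not in the scalar.
parent[0, 0] = 10

print(zero_dim.ndim, int(zero_dim), int(scalar))
```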
In the case of pandas, iloc used under the hood by squeeze will return a numpy scalar (and not 0-dim array):
which I think means that the df / series -> scalar case is fine (we can still test it though, by mutating the parent and ensuring the scalar is still identical)
andrewchen1216 commented on Nov 28, 2022
I will take a look at add_prefix and add_suffix.