ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate

With the Copy-on-Write implementation (see https://github.com/pandas-dev/pandas/issues/36195, the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit, and the overview follow-up issue https://github.com/pandas-dev/pandas/issues/48998), we can avoid making an actual copy of the data in DataFrame and Series methods that typically return a copy / new object.
A typical example is the following:

```python
df2 = df.rename(columns=str.lower)
```

By default, the `rename()` method returns a new object (DataFrame) with a copy of the data of the original DataFrame, so mutating values in `df2` never mutates `df`. With CoW enabled (`pd.options.mode.copy_on_write = True`), we can still return a new object, but one that points to the same data under the hood, avoiding the initial copy. The observed behaviour is preserved: `df2` still acts as a copy, and mutating it does not mutate `df`. This works through the CoW mechanism, which copies the data in `df2` only when actually needed upon mutation, i.e. a delayed or lazy copy.
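A small sketch of that observable behaviour, assuming a pandas version in which the `mode.copy_on_write` option is available. The memory checks go through `to_numpy()` and `np.shares_memory`, which is just one convenient way to peek at the underlying data, not an official inspection API for CoW:

```python
import numpy as np
import pandas as pd

# Assumption: this option exists in the installed pandas version.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = df.rename(columns=str.lower)

# Right after rename(), both objects point at the same underlying data.
shared_before = np.shares_memory(df["A"].to_numpy(), df2["a"].to_numpy())

# The first write to df2 triggers the delayed copy of the affected data.
df2.iloc[0, 0] = 100
shared_after = np.shares_memory(df["A"].to_numpy(), df2["a"].to_numpy())

print(shared_before, shared_after)
print(df["A"].tolist(), df2["a"].tolist())
```

With CoW enabled, the data should be shared before the write and no longer shared afterwards, while `df` keeps its original values either way.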

In practice, a method like `rename()` or `reset_index()` does this by relying on the fact that `copy(deep=None)` performs a true deep copy (the current default behaviour) if CoW is not enabled, and this "lazy" copy when CoW is enabled. For example:

https://github.com/pandas-dev/pandas/blob/7bf8d6b318e0b385802e181ace3432ae73cbf79b/pandas/core/frame.py#L6246-L6249
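To make the `deep=None` convention concrete, here is a toy model in plain Python. Everything in it (`ToyColumn`, `COPY_ON_WRITE`, `_sharers`) is hypothetical and made up for illustration; it does not reflect how pandas stores data or tracks references internally, and it only models the `deep=None` and `deep=True` cases.

```python
COPY_ON_WRITE = True  # hypothetical stand-in for pd.options.mode.copy_on_write


class ToyColumn:
    """Toy container; NOT how pandas stores or tracks its data."""

    def __init__(self, values):
        self._values = list(values)
        # Every object that currently shares self._values.
        self._sharers = [self]

    def copy(self, deep=None):
        if deep is None:
            # deep=None: eager copy without CoW, lazy copy with CoW.
            deep = not COPY_ON_WRITE
        if deep:
            return ToyColumn(self._values)
        lazy = ToyColumn.__new__(ToyColumn)
        lazy._values = self._values  # share the data, no copy yet
        lazy._sharers = self._sharers
        lazy._sharers.append(lazy)
        return lazy

    def __setitem__(self, index, value):
        if len(self._sharers) > 1:
            # Another object still sees this data: detach before writing.
            self._sharers.remove(self)
            self._values = list(self._values)
            self._sharers = [self]
        self._values[index] = value

    def tolist(self):
        return list(self._values)


original = ToyColumn([1, 2, 3])
renamed = original.copy(deep=None)  # no data copied at this point
renamed[0] = 100                    # the copy happens here, on first write
print(original.tolist(), renamed.tolist())
```

The point of the sketch is that the caller cannot tell the lazy copy from an eager one by looking at the values; only the moment at which the data is duplicated differs.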

The initial CoW implementation in https://github.com/pandas-dev/pandas/pull/46958/ only added this logic to a few methods (to ensure this mechanism was working): `rename`, `reset_index`, `reindex` (when reindexing the columns), `select_dtypes`, `to_frame` and `copy` itself. 
But there are more methods that can make use of this mechanism, and this issue is meant to serve as the overview issue to summarize and keep track of the progress on this front.

There is a class of methods that perform an actual operation on the data and return newly calculated data (e.g. reductions or the methods wrapping binary operators); those don't have to be considered here. Only methods that can (potentially, in certain cases) return the original data can make use of this optimization.

**Series / DataFrame methods to update** (I added a `?` to the ones I wasn't sure about. I still have to look into what exactly those do, but left them here to keep track of them; we can remove them from the list once we know more):

- [x] `add_prefix` / `add_suffix` -> https://github.com/pandas-dev/pandas/pull/49991
- [x] `align` -> https://github.com/pandas-dev/pandas/pull/50432
  - [x] Needs a follow-up, see [comment](https://github.com/pandas-dev/pandas/pull/50432#discussion_r1060098901) -> https://github.com/pandas-dev/pandas/pull/50917
- [x] `asfreq` -> https://github.com/pandas-dev/pandas/pull/50916 
- [x] `assign` -> https://github.com/pandas-dev/pandas/pull/50010
- [x] `astype` -> https://github.com/pandas-dev/pandas/pull/50802/
- [x] `between_time` -> https://github.com/pandas-dev/pandas/pull/50476
- [x] `bfill` / `backfill` -> https://github.com/pandas-dev/pandas/pull/51249
- [x] `clip` -> https://github.com/pandas-dev/pandas/pull/51492
- [x] `convert_dtypes` -> https://github.com/pandas-dev/pandas/pull/51265
- [x] `copy` (tackled in initial implementation in [#46958](https://github.com/pandas-dev/pandas/pull/46958/))
- [x] `drop` -> https://github.com/pandas-dev/pandas/pull/49689 
- [x] `drop_duplicates` (in case no duplicates are dropped) -> https://github.com/pandas-dev/pandas/pull/50431
- [x] `droplevel` -> https://github.com/pandas-dev/pandas/pull/50552
- [x] `dropna` -> https://github.com/pandas-dev/pandas/pull/50429
- [x] `eval` -> https://github.com/pandas-dev/pandas/pull/53746
- [x] `ffill` / `pad` -> https://github.com/pandas-dev/pandas/pull/51249
- [x] `fillna` -> https://github.com/pandas-dev/pandas/pull/51279
- [x] `filter` -> #50589
- [x] `get` -> https://github.com/pandas-dev/pandas/pull/51292
- [x] `head` -> https://github.com/pandas-dev/pandas/pull/49963
- [x] `infer_objects` -> https://github.com/pandas-dev/pandas/pull/50428
- [x] `insert`?
- [x] `interpolate` -> https://github.com/pandas-dev/pandas/pull/51249
- [x] `isetitem` -> https://github.com/pandas-dev/pandas/pull/50692
- [x] `items` -> https://github.com/pandas-dev/pandas/pull/50595
- [x] `iterrows`? -> https://github.com/pandas-dev/pandas/pull/51271
- [x] `join` / `merge` -> https://github.com/pandas-dev/pandas/pull/51297
- [x] `mask` -> https://github.com/pandas-dev/pandas/pull/51336
  - [x] this is covered by `where`, but could use an independent test -> https://github.com/pandas-dev/pandas/pull/53745
- [x] `pipe` -> https://github.com/pandas-dev/pandas/pull/50567
- [x] `pop` -> https://github.com/pandas-dev/pandas/pull/50569
- [x] `reindex`
  - [x] Already handled for reindexing the columns in the initial implementation ([#46958](https://github.com/pandas-dev/pandas/pull/46958/)), but we can still optimize row selection as well? (in case no actual reindexing takes place) -> https://github.com/pandas-dev/pandas/pull/53723
- [x] `reindex_like` -> https://github.com/pandas-dev/pandas/pull/50426
- [x] `rename` (tackled in initial implementation in [#46958](https://github.com/pandas-dev/pandas/pull/46958/))
- [x] `rename_axis` -> https://github.com/pandas-dev/pandas/pull/50415
- [x] `reorder_levels` -> https://github.com/pandas-dev/pandas/pull/50016
- [x] `replace` -> https://github.com/pandas-dev/pandas/pull/50746
    - [x] https://github.com/pandas-dev/pandas/pull/50918 
    - [x] TODO: Optimize when column not explicitly provided in to_replace?
    - [x] TODO: Optimize list-like
    - [x] TODO: Add note in docs that this is not fully optimized for 2.0 (not necessary if everything is finished by then)
- [x] `reset_index` (tackled in initial implementation in [#46958](https://github.com/pandas-dev/pandas/pull/46958/))
- [x] `round` (for columns that are not rounded) -> https://github.com/pandas-dev/pandas/pull/50501
- [x] `select_dtypes` (tackled in initial implementation in [#46958](https://github.com/pandas-dev/pandas/pull/46958/))
- [x] `set_axis` -> https://github.com/pandas-dev/pandas/pull/49600
- [x] `set_flags` -> https://github.com/pandas-dev/pandas/pull/50489
- [x] `set_index` -> https://github.com/pandas-dev/pandas/pull/49557
  - [x] TODO: check what happens if parent is mutated -> shouldn't mutate the index! (is the data copied when creating the index?) 
- [x] `shift` -> https://github.com/pandas-dev/pandas/pull/50753
- [x] `sort_index` / `sort_values` (optimization if nothing needs to be sorted)
  - [x] `sort_index` -> https://github.com/pandas-dev/pandas/pull/50491
  - [x] `sort_values` -> https://github.com/pandas-dev/pandas/pull/50643
- [x] `squeeze` -> https://github.com/pandas-dev/pandas/pull/50590
- [x] `style`. (phofl: I don't think there is anything to do here)
- [x] `swapaxes` -> https://github.com/pandas-dev/pandas/pull/50573
- [x] `swaplevel` -> https://github.com/pandas-dev/pandas/pull/50478
- [x] `T` / `transpose` -> https://github.com/pandas-dev/pandas/pull/51430
- [x] `tail` -> https://github.com/pandas-dev/pandas/pull/49963
- [x] `take` (optimization if everything is taken?) -> https://github.com/pandas-dev/pandas/pull/50476
- [x] `to_timestamp` / `to_period` -> https://github.com/pandas-dev/pandas/pull/50575
- [x] `transform` -> https://github.com/pandas-dev/pandas/pull/53747
- [x] `truncate` -> https://github.com/pandas-dev/pandas/pull/50477
- [x] `tz_convert` / `tz_localize` -> https://github.com/pandas-dev/pandas/pull/50490
- [ ] `unstack` (in optimized case where each column is a slice?)
- [x] `update` -> https://github.com/pandas-dev/pandas/pull/51426
- [x] `where` -> https://github.com/pandas-dev/pandas/pull/51336
- [x] `xs` -> https://github.com/pandas-dev/pandas/pull/51292
- [x] `Series.to_frame()` (tackled in initial implementation in [#46958](https://github.com/pandas-dev/pandas/pull/46958/))

Top-level functions:

- [x] `pd.concat` -> https://github.com/pandas-dev/pandas/pull/50501
- [x] `pd.merge` et al? -> https://github.com/pandas-dev/pandas/pull/51297, https://github.com/pandas-dev/pandas/pull/51327
  - [x] add tests for `join`

### Want to contribute to this issue? 

Pull requests tackling one of the bullet points above are certainly welcome!

- Pick one of the methods above (best to stick to one method per PR)
- Update the method to make use of a lazy copy (in many cases this might mean using `copy(deep=None)` somewhere, but for some methods it will be more involved)
- Add a test for it in `/pandas/tests/copy_view/test_methods.py` (you can mimic one of the existing ones, e.g. `test_select_dtypes`)
  - You can run the test with `PANDAS_COPY_ON_WRITE=1 pytest pandas/tests/copy_view/test_methods.py` to test it with CoW enabled (pandas will check that environment variable). The test needs to pass with both CoW disabled and enabled.
  - The tests make use of a `using_copy_on_write` fixture that can be used within the test function to test different expected results depending on whether CoW is enabled or not.
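
As a rough idea of what such a test can look like, here is a self-contained sketch for `rename`, which already uses the lazy copy. It is an approximation, not a copy of an existing test: `check_rename_lazy_copy` and the local `get_array` helper are names made up for this sketch, and `using_copy_on_write` is passed as a plain argument here rather than as the pytest fixture so that the snippet runs on its own. It also assumes a pandas version in which the `mode.copy_on_write` option is available.

```python
import numpy as np
import pandas as pd


def get_array(df, col):
    # Hypothetical local helper for this sketch: the column's values as an array.
    return df[col].to_numpy()


def check_rename_lazy_copy(using_copy_on_write):
    # In the test suite `using_copy_on_write` is a fixture; here it is an argument.
    with pd.option_context("mode.copy_on_write", using_copy_on_write):
        df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
        df_orig = df.copy()
        df2 = df.rename(columns=str.lower)

        if using_copy_on_write:
            # Lazy copy: the result still shares memory with the parent.
            assert np.shares_memory(get_array(df2, "a"), get_array(df, "A"))
        else:
            # Eager copy: the result owns its data from the start.
            assert not np.shares_memory(get_array(df2, "a"), get_array(df, "A"))

        # Mutating the result must never change the parent.
        df2.iloc[0, 0] = 0
        if using_copy_on_write:
            assert not np.shares_memory(get_array(df2, "a"), get_array(df, "A"))
        pd.testing.assert_frame_equal(df, df_orig)


check_rename_lazy_copy(using_copy_on_write=True)
```

The `False` branch only applies to pandas versions where CoW can actually be switched off.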



The `frame.py` lines referenced by the permalink above:

```python
if inplace:
    new_obj = self
else:
    new_obj = self.copy(deep=None)
```
