Skip to content

ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473

Closed
@jorisvandenbossche

Description

@jorisvandenbossche
Member

With the Copy-on-Write implementation (see #36195 / proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit, and overview follow up issue #48998), we can avoid doing an actual copy of the data in DataFrame and Series methods that typically return a copy / new object.
A typical example is the following:

df2 = df.rename(columns=str.lower)

By default, the rename() method returns a new object (DataFrame) with a copy of the data of the original DataFrame (and thus, mutating values in df2 never mutates df). With CoW enabled (pd.options.mode.copy_on_write = True), we can still return a new object, but now pointing to the same data under the hood (avoiding an initial copy), while preserving the observed behaviour of df2 being a copy / not mutating df when df2 is mutated (though the CoW mechanism, only copying the data in df2 when actually needed upon mutation, i.e. a delayed or lazy copy).

The way this is done in practice for a method like rename() or reset_index() is by using the fact that copy(deep=None) will mean a true deep copy (current default behaviour) if CoW is not enabled, and this "lazy" copy when CoW is enabled. For example:

pandas/pandas/core/frame.py

Lines 6246 to 6249 in 7bf8d6b

if inplace:
new_obj = self
else:
new_obj = self.copy(deep=None)

The initial CoW implementation in #46958 only added this logic to a few methods (to ensure this mechanism was working): rename, reset_index, reindex (when reindexing the columns), select_dtypes, to_frame and copy itself.
But there are more methods that can make use of this mechanism, and this issue is meant to as the overview issue to summarize and keep track of the progress on this front.

There is a class of methods that perform an actual operation on the data and return newly calculated data (eg typically reductions or the methods wrapping binary operators) that don't have to be considered here. It's only methods that can (potentially, in certain cases) return the original data that could make use of this optimization.

Series / DataFrame methods to update (I added a ? for the ones I wasn't directly sure about, have to look into what those exactly do to be sure, but left them here to keep track of those, can remove from the list once we know more):

Top-level functions:

Want to contribute to this issue?

Pull requests tackling one of the bullet points above are certainly welcome!

  • Pick one of the methods above (best to stick to one method per PR)
  • Update the method to make use of a lazy copy (in many cases this might mean using copy(deep=None) somewhere, but for some methods it will be more involved)
  • Add a test for it in /pandas/tests/copy_view/test_methods.py (you can mimick on of the existing ones, eg test_select_dtypes)
    • You can run the test with PANDAS_COPY_ON_WRITE=1 pytest pandas/tests/copy_view/test_methods.py to test it with CoW enabled (pandas will check that environment variable). The test needs to pass with both CoW disabled and enabled.
    • The tests make use of a using_copy_on_write fixture that can be used within the test function to test different expected results depending on whether CoW is enabled or not.

Activity

ntachukwu

ntachukwu commented on Nov 6, 2022

@ntachukwu
Contributor

...

ntachukwu

ntachukwu commented on Nov 6, 2022

@ntachukwu
Contributor

[] set_index

This PR 49557 is an attempt to handle the set_index method.

seljaks

seljaks commented on Nov 9, 2022

@seljaks
Contributor

Hi, I'll take a look at drop and see how it goes.

seljaks

seljaks commented on Nov 20, 2022

@seljaks
Contributor

Took at look at head and tail. CoW is already implented for these because they're just .iloc under the hood. I think they can be ticked off as is, but can submit a PR with explicit tests if needed.

seljaks

seljaks commented on Nov 20, 2022

@seljaks
Contributor

Found an inconsistency while looking at squeeze. When a dataframe has two or more columns df.squeeze returns the dataframe unchanged. This is done by returning df.iloc[slice(None), slice(None)]. So the inconsistency is in iloc . Example:

import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2], "b": [0.4, 0.5]})
df2 = df.iloc[slice(None), slice(None)]  # or [:, :]

print("before:", df, df2, sep="\n")
df2.iloc[0, 0] = 0  # does not trigger copy_on_write
print("after:", df, df2, sep="\n")  # both df and df2 are 0 at [0, 0]

@jorisvandenbossche This doesn't seem like the desired result. Should I open a separate issue for this or submit a PR linking here when I have a fix?

Sidenote: not sure how to test squeeze behavior when going df -> series or df -> scalar without writing a util function like get_array(series). Link to squeeze docs.

jorisvandenbossche

jorisvandenbossche commented on Nov 28, 2022

@jorisvandenbossche
MemberAuthor

Took at look at head and tail. CoW is already implented for these because they're just .iloc under the hood. I think they can be ticked off as is, but can submit a PR with explicit tests if needed.

Specifically for head and tail, I think we could consider changing those to actually return a hard (non-lazy) copy by default. Because those methods are typically used for interactive inspection of your data (seeing the repr of df.head()), it would be nice to avoid each of those to trigger CoW from just looking at the data.
But that for sure needs a separate issue to discuss first.

A PR to just add tests to confirm that right now they use CoW is certainly welcome.

jorisvandenbossche

jorisvandenbossche commented on Nov 28, 2022

@jorisvandenbossche
MemberAuthor

When a dataframe has two or more columns df.squeeze returns the dataframe unchanged. This is done by returning df.iloc[slice(None), slice(None)]. So the inconsistency is in iloc . .... This doesn't seem like the desired result. Should I open a separate issue for this or submit a PR linking here when I have a fix?

Sorry for the slow reply here. That's indeed an inconsistency in iloc with full slice. I also noticed that a while ago and had already opened a PR to fix this: #49469

So when that is merged, the squeeze behaviour should also be fixed.

Sidenote: not sure how to test squeeze behavior when going df -> series or df -> scalar without writing a util function like get_array(series)

I think for the df -> series case you can test this with the common pattern of "mutate subset (series in this case) -> ensure parent is not mutated"? (it's OK to leave out the shares_memory checks if those are difficult to express for a certain case. In the end it is the "mutations-don't-propagate" behaviour that is the actual documented / guaranteed behaviour we certainly want to have tested)

For df -> scalar, I think that scalars are in general not mutable, so this might not need to be covered?

In numpy, squeeze returns a 0-dim array, and those are mutable (at least those are views, so if you mutate the parent, the 0-dim "scalar" also gets mutated):

>>> arr = np.array([1])
>>> scalar = arr.squeeze()
>>> scalar
array(1)
>>> arr[0] = 10
>>> scalar
array(10)

but an actual scalar is not:

>>> scalar = arr[0]
>>> scalar
10
>>> type(scalar)
numpy.int64
>>> arr[0] = 100
scalar
10

In the case of pandas, iloc used under the hood by squeeze will return a numpy scalar (and not 0-dim array):

>>> type(pd.Series(arr).iloc[0])
numpy.int64

which I think means that the df / series -> scalar case is fine (we can still test it though, with mutation the parent and ensuring the scalar is still identical)

andrewchen1216

andrewchen1216 commented on Nov 28, 2022

@andrewchen1216
Contributor

I will take a look at add_prefix and add_suffix.

48 remaining items

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @jorisvandenbossche@ntachukwu@seljaks@lithomas1@andrewchen1216

        Issue actions

          ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate · Issue #49473 · pandas-dev/pandas