Skip to content

PDEP-8: Inplace methods in pandas #51466

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
433 changes: 433 additions & 0 deletions web/pandas/pdeps/0008-inplace-methods-in-pandas.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,433 @@
# PDEP-8: In-place methods in pandas

- Created: 16 February 2023
- Status: Accepted
- Discussion: [PR 51466](https://github.com/pandas-dev/pandas/pull/51466)
- Authors: [Thomas Li](https://github.com/lithomas1),
[Patrick Hoefler](https://github.com/phofl),
[Joris Van den Bossche](https://github.com/jorisvandenbossche)
- Revision: 1

## Abstract

This PDEP proposes that:

- The ``inplace`` parameter will be removed from any method which can never update the
underlying values of a pandas object inplace or which alters the shape of the object,
and where the `inplace=True` option is only syntactic sugar for reassigning the result
to the calling DataFrame/Series.
- As a consequence, the `inplace` parameter is only kept for those methods that can
modify the underlying values of a pandas object inplace, such as `fillna` or `replace`.
- With the introduction of Copy-on-Write ([PDEP-7](^1)), users don't need the `inplace`
keyword to avoid a copy of the data.
- For those methods that will keep the `inplace=True` option:
- the method will do an attempt to do the operation inplace but still silently copy
when needed (for Copy-on-Write), i.e. there is no guarantee it is actually done inplace.
- the method will return the calling object (`self`), instead of the current `None`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also apply to methods that only act inplace and so have no inplace argument? update and insert come to mind.

cc @pandas-dev/pandas-core

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We second this question here #59788

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also apply to methods that only act inplace and so have no inplace argument? update and insert come to mind.

See related discussion at #21858


## Motivation and Scope

The `inplace=True` keyword has been a controversial topic for many years. It is generally seen (at least by several
pandas maintainers and educators) as bad practice and often unnecessary, but at the same time it is also widely used,
partly because of confusion around the impact of the keyword.

Generally, we assume that people use the keyword for the following reasons:

1. Because they think it is more efficient (it is faster and/or can save memory)
2. To save the result to the same variable / update the original variable (avoid the pattern of reassigning to the same
variable name)

For the first reason: efficiency is an important aspect. However, in practice it is not always the case
that `inplace=True` improves anything. Some of the methods with an `inplace` keyword can actually work inplace, but
others still make a copy under the hood anyway. In addition, with the introduction of Copy-on-Write ([PDEP-7](^1)), there are now other
ways to avoid making unnecessary copies by default (without needing to specify a keyword). The next section gives a
detailed overview of those different cases.

For the second reason: we are convinced that this is not worth it. While it might save some keystrokes (if you have a
long variable name), this code style also has sufficient disadvantages that we think it is not worth providing "two
ways" to achieve the same result:

- You can't use method chaining with `inplace=True`
- The ``inplace`` keyword complicates type annotations (because the return value depends on the value of `inplace`)
- Using `inplace=True` gives code that mutates the state of an object and thus has side-effects. That can introduce
subtle bugs and is harder to debug.

Finally, there are also methods that have a `copy` keyword instead of an `inplace` keyword (which also avoids copying
the data when `copy=False`, but returns a new object referencing the same data instead of updating the calling object),
adding to the inconsistencies. This keyword is also redundant now with the introduction of Copy-on-Write.

Given the above reasons, we are convinced that there is no need for neither the `inplace` nor the `copy` keyword, except
for a small subset of methods that can actually update data inplace. Removing those keywords will give a more
consistent and less confusing API. Removing the `copy` keyword is covered by PDEP-7 about Copy-on-Write,
and this PDEP will focus on the `inplace` keyword.

Thus, in this PDEP, we aim to standardize behavior across methods to make control of inplace-ness of methods
consistent, and compatible with Copy-on-Write.

Note: there are also operations (not methods) that work inplace in pandas, such as indexing (
e.g. `df.loc[0, "col"] = val`) or inplace operators (e.g. `df += 1`). This is out of scope for this PDEP, as we focus on
the inplace behaviour of DataFrame and Series _methods_.
Comment on lines +67 to +69
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a subsection for out of scope and pull down the out of scope related items from the abstract

Comment on lines +68 to +69
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to say "as we focus on the inplace behavior of DataFrame and Series methods, whether or not those methods have an inplace keyword."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true we also list the methods that don't have an inplace keyword, but there is not any actual change for those included in the PDEP (I think we list them more for completeness), so not sure it is needed to focus on that aspect here.


## Detailed description

### Status Quo

Many methods in pandas currently have the ability to perform an operation inplace. For example, some methods such
as ``DataFrame.insert`` only support inplace operations, while other methods use the `inplace` keyword to control
whether an operation is done inplace or not.

While we generally speak about "inplace" operations, this term is used in various context. Broadly speaking,
for this PDEP, we can distinguish two kinds of "inplace" operations:

* **"values-inplace"**: an operation that updates the underlying values of a Series or DataFrame columns inplace
(without making a copy of the array).

As illustration, an example of such a values-inplace operation without using a method:

:::python
# if the dtype is compatible, this setitem operation updates the underlying array inplace
df.loc[0, "col"] = val

* **"object-inplace"**: an operation that updates a pandas DataFrame or Series _object_ inplace, but without
updating existing column values inplace.

As illustration, an example of such an object-inplace operation without using a method:

:::python
# we replace the Index on `df` inplace, but without actually
# updating any existing array
df.index = pd.Index(...)
# we update the DataFrame inplace, but by completely replacing a column,
# not by mutating the existing column's underlying array
df["col"] = new_values

Object-inplace operations, while not actually modifying existing column values, keep
(a subset of) those columns and thus can avoid copying the data of those existing columns.

In addition, several methods supporting the ``inplace`` keyword cannot actually be done inplace (in either meaning)
because they make a copy as a
consequence of the operations they perform, regardless of whether ``inplace`` is ``True`` or not. This, coupled with the
fact that the ``inplace=True`` changes the return type of a method from a pandas object to ``None``, makes usage of
the ``inplace`` keyword confusing and non-intuitive.

To summarize the status quo of inplace behavior of methods, we have divided methods that can operate inplace or have
an ``inplace`` keyword into 4 groups:

**Group 1: Methods that always operate inplace (no user-control with ``inplace`` keyword)**

| Method Name |
|:--------------|
| ``insert`` |
| ``pop`` |
| ``update`` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is update always inplace? e.g. in

In [43]: df1 = pd.DataFrame({'numbers': [1, np.nan]})

In [44]: df2 = pd.DataFrame({'numbers': ['foo', 'bar']})

In [45]: df1.update(df2)

Copy link
Member

@phofl phofl Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It always operates inplace on the pandas object, but not necessarily on the underlying data/array. Your example would update df1, but the array would be copied before the update actually happens. Same as with replace and friends when the value to set is not compatible with the array dtype

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure but then should it belong in group 2?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no inplace keyword, so we don't have to change anything, same with insert and pop

Copy link
Member

@MarcoGorelli MarcoGorelli Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the title then

methods that always operate inplace

As in, methods that always operate inplace on the pandas object?

Because in another part of the tutorial the wording

Some of the methods with an inplace keyword can actually work inplace

is used - I find it confusing when you're talking about "actually working inplace" and "modifying the pandas object"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would make it clearer, thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the keywords do we just go all in and make those copy_pandas_object and copy_underlying_data? Or something similar.

Obviously more verbose but yea I agree with @MarcoGorelli sentiment that inplace is vague to the point of misinterpretation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy_pandas_object won't be necessary with CoW. We will only need a keyword where we actually can modify the underlying data inplace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd to be clear, my comment above was about terminology to use in the text explaining this, not for actual keyword names (from your comment I get the impression you are talking about actual keywords)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably way too abstracted from all of this, but find the terminology confusing. I didn't realize that CoW only ever referred to the pandas "container" object and didn't take a point of view on the underlying data. So if that' is the long term commitment then yea copy_pandas_object is a bit redundant, although I wonder if there is other terminology we need to cover if/when CoW extends to the underlying data

Generally it feels like there is a lot to be gained from a more explicit keyword than inplace when we want to modify the underlying data; I think is in line with @Dr-Irv points

| ``isetitem`` |

This group encompasses both kinds of inplace: `update` can be values-inplace, while the others are object-inplace
(for example, although ``isetitem`` operates on the original pandas object inplace,
it will not change any existing values inplace; rather it will remove the values of the column being set, and insert new values).

**Group 2: Methods that can modify the underlying data of the DataFrame/Series object ("values-inplace")**

| Method Name |
|:----------------|
| ``where`` |
| ``fillna`` |
| ``replace`` |
| ``mask`` |
| ``interpolate`` |
| ``ffill`` |
| ``bfill`` |
| ``clip`` |

These methods don't operate inplace by default, but can be done inplace with `inplace=True` _if_ the dtypes are compatible
(e.g. the values replacing the old values can be stored in the original array without an astype). All those methods leave
the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of
the DataFrame or Series.

**Group 3: Methods that can modify the DataFrame/Series object, but not the pre-existing values ("object-inplace")**

| Method Name |
|:----------------------------|
| ``drop`` (dropping columns) |
| ``rename`` |
| ``rename_axis`` |
| ``reset_index`` |
| ``set_index`` |

These methods can change the structure of the DataFrame or Series, such as changing the shape by adding or removing
columns, or changing the row/column labels (changing the index/columns attributes), but don't modify the existing
underlying column data of the object.

All those methods make a copy of the full data by default, but can be performed object-inplace with
avoiding copying all data (currently enabled with specifying `inplace=True`).

Note: there are also methods that have a `copy` keyword instead of an `inplace` keyword (e.g. `set_axis`). This serves
a similar purpose (avoid copying all data), but those methods don't update the original object inplace and instead
return a new object referencing the same data.

**Group 4: Methods that can never operate inplace**

| Method Name |
|:-----------------------|
| `drop` (dropping rows) |
| `dropna` |
| `drop_duplicates` |
| `sort_values` |
| `sort_index` |
| `eval` |
| `query` |

Although these methods have the `inplace` keyword, they can never operate inplace, in either meaning, because the nature of the
operation requires copying (such as reordering or dropping rows). For those methods, `inplace=True` is essentially just
syntactic sugar for reassigning the new result to the calling DataFrame/Series.

Note: in the case of a "no-op" (for example when sorting an already sorted DataFrame), some of those methods might not
need to perform a copy and could be considered as "object-inplace" in that case.
This currently happens with Copy-on-Write (regardless of `inplace`), but this is considered an
implementation detail for the purpose of this PDEP.

### Proposed changes and reasoning

The methods from **group 1 (always inplace, no keyword)** won't change behavior, and will remain always inplace.

For methods from **group 4 (never inplace)**, the `inplace` keyword has no actual effect
(except for re-assigning to the calling variable) and is effectively syntactic sugar for
manually re-assigning. For this group, we propose to remove the `inplace` keyword.

For methods from **group 3 (object-inplace)**, the `inplace=True` keyword can currently be
used to avoid a copy. However, with the introduction of Copy-on-Write, every operation
will potentially return a shallow copy of the input object by default (if the performed
operation does not require a copy of the data). This future default is therefore
equivalent to the behavior with `inplace=True` for those methods (minus the return
value).

For the above reasoning, we think there is no benefit of keeping the keyword around for
these methods. To emulate behavior of the `inplace` keyword, we can reassign the result
of an operation to the same variable:

:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
df = df.reset_index()
df.iloc[0, 1] = ...

All references to the original object will go out of scope when the result of the `reset_index` operation is re-assigned
to `df`. As a consequence, `iloc` will continue to operate inplace, and the underlying data will not be copied (with Copy-on-Write).

**Group 2 (values-inplace)** methods differ, though, since they modify the underlying
data, and therefore can actually happen inplace:

:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
df.replace(to_replace=1, value=100, inplace=True)

Currently, the above updates `df` values-inplace, without requiring a copy of the data.
For this type of method, however, we can _not_ emulate the above usage of `inplace` by
re-assigning:

:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
df = df.replace(to_replace=1, value=100)

If we follow the rules of Copy-on-Write[^1] where "any subset or returned
series/dataframe always behaves as a copy of the original, and thus never modifies the
original", then there is no way of doing this operation inplace by default, because the
original object `df` would be modified before the reference goes out of scope (pandas
does not know whether you will re-assign it to `df` or assign it to another variable).
That would violate the Copy-on-Write rules, and therefore the `replace()` method in the
example always needs to make a copy of the underlying data by default

For this case, an `inplace=True` option can have an actual benefit, i.e. allowing to
avoid a data copy. Therefore, we propose to keep the `inplace` argument for this
group of methods.

Summarizing for the `inplace` keyword, we propose to:

- Keep the `inplace` keyword for this subset of methods (group 2) that can update the
underlying values inplace ("values-inplace")
- Remove the `inplace` keyword from all other methods that either can never work inplace
(group 4) or only update the object (group 3, "object-inplace", which can be emulated
with reassigning).

### Other design questions

#### With `inplace=True`, should we silently copy or raise an error if the data has references?

For those methods where we would keep the `inplace=True` option (group 2), there is a complication that actually operating inplace
is not always possible.

For example,

:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
df.replace(to_replace=1, value=100, inplace=True)

can be performed inplace.

This is only true if `df` does not share the values it stores with another pandas object. For example, the following
operations

:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
view = df[:]
# We can't operate inplace, because view would also be modified!
df.replace(to_replace=1, value=100, inplace=True)

would be incompatible with the Copy-on-Write rules when actually done inplace. In this case we can either

- copy the shared values before performing the operation to avoid modifying another object (i.e. follow the standard
Copy-on-Write procedure),
- raise an error to indicate that more than one object would be changed and the inplace operation is not possible.

Raising an error here is problematic since oftentimes users do not have control over whether a method would cause a "
lazy copy" to be triggered under Copy-on-Write. It is also hard to fix, adding a `copy()` before calling a method
with `inplace=True` might actually be worse than triggering the copy under the hood. We would only copy columns that
share data with another object, not the whole object like `.copy()` would.

**Therefore, we propose to silently copy when needed.** The `inplace=True` option would thus mean "try inplace whenever possible", and not guarantee it is actually done inplace.

In the future, if there is demand for it, it could still be possible to add to option to raise a warning whenever this happens.
This would be useful in an IPython shell/Jupyter Notebook setting, where the user would have the opportunity to delete
unused references that are causing the copying to be triggered.

For example:

:::ipython
In [1]: import pandas as pd

In [2]: pd.set_option("mode.copy_on_write", True)

In [3]: ser = pd.Series([1,2,3])

In [4]: ser_vals = ser.values # Save values to check inplace-ness

In [5]: ser
Out[5]:
0 1
1 2
2 3
dtype: int64

In [6]: ser = ser[:] # Original series should go out of scope

In [7]: ser.iloc[0] = -1 # This should be inplace

In [8]: ser
Out[8]:
0 -1
1 2
2 3
dtype: int64

In [9]: ser_vals
Out[9]: array([1, 2, 3]) # It's not modified!

In [10]: Out[5] # IPython kept our series alive since we displayed it!

While there are ways to mitigate this[^5], it may be helpful to let the user know that an operation that they performed
was not inplace, since it is possible to go out of memory because of this.

#### Return the calling object (`self`) also when using `inplace=True`?

One of the downsides of the `inplace=True` option is that the return type of those methods
depends on the value of `inplace`, and that method chaining does not work.
Those downsides are still relevant for the cases where we keep `inplace=True`.
To address this, we can have those methods return the object that was operated on
inplace when `inplace=True`.

Advantages:

- It enables to use inplace operations in a method chain
- It simplifies type annotations

Disadvantages:

- In general, when a pandas method returns an object, this is a _new_ object, and thus following the Copy-on-Write rules
of behaving as a copy. This would introduce a special case where an _identical_ object would be
returned (`df2 = df.method(inplace=True); assert df2 is df`)
- It would change the behaviour of the current `inplace=True`

We generally assume that changing to return `self` should not give much problems for
existing usage (typically, the current return value of `None` is not actively used).
Further, we think the advantages of simplifying return types and enabling methods chains
outweighs the special case of returning an identical object.
**Therefore, we propose that for those methods with an `inplace=True` option, the calling object (`self`) gets returned.**

## Backward compatibility

Removing the `inplace` keyword is a breaking change, but since the affected behaviour is `inplace=True`, the default
behaviour when not specifying the keyword (i.e. `inplace=False`) will not change and the keyword itself can first be
deprecated before it is removed.

## Rejected alternatives

### Remove the `inplace` keyword altogether

In the past, it was considered to remove the `inplace` keyword entirely. This was because many methods with
the `inplace` keyword did not actually operate inplace, but made a copy and re-assigned the underlying values under
the hood, causing confusion and providing no real benefit to users.

Because a majority of the methods supporting `inplace` did not operate inplace, it was considered at the time to
deprecate and remove inplace from all methods, and add back the keyword as necessary.[^3]

For methods where the operation actually _can_ be done inplace (group 2), however, removing the `inplace`
keyword could give a significant performance regression when currently using this keyword with large
DataFrames. Therefore, we decided to keep the `inplace` keyword for this small subset of methods.

### Standardize on the `copy` keyword instead of `inplace`

It may seem more natural to standardize on the `copy` keyword instead of the `inplace` keyword, since the `copy`
keyword already returns a new object instead of None (enabling method chaining) and avoids a copy when it is set to `False`.

However, the `copy` keyword is not supported in any of the values-mutating methods listed in Group 2 above
unlike `inplace`, so semantics of future inplace mutation of values align better with the current behavior of
the `inplace` keyword, than with the current behavior of the `copy` keyword.

Furthermore, with the approved Copy-on-Write proposal, the `copy` keyword also has become superfluous. With Copy-on-Write
enabled, methods that return a new pandas object will always try to avoid a copy whenever possible, regardless of
a `copy=False` keyword. Thus, the Copy-on-Write PDEP proposes to actually remove the `copy` keyword from the methods
where it is currently used (so it would be strange to add this as a new keyword to the Group 2 methods).

Currently, when using `copy=False` in methods where it is supported, a new pandas object is returned as the result
of a method call (same as with `copy=True`), but with the values backing this object being shared with the calling
object when possible (but the calling object is never modified). With the proposed inplace behavior for Group 2 methods,
a potential `copy=False` option would return a new pandas object with identical values as the original object (that
was modified inplace, in contrast to current usage of `copy=False`), which may be confusing for users, and lead to
ambiguity with Copy on Write rules.

## History

The future of the `inplace` keyword is something that has been debated a lot over the years.

It may be helpful to review those discussions (see links) [^2] [^3] [^4] to better understand this PDEP.

## Timeline

The `inplace` keyword is widely used, and thus we need to take considerable time to
deprecate and remove this feature.

- For those methods where the `inplace` keyword will be removed, we add a
DeprecationWarning in the first release after acceptance (2.2 if possible, otherwise
3.0)
- Together with enabling Copy-on-Write in the pandas 3.0 major release, we already
update those methods that will keep the `inplace` keyword with the new behaviour
(returning `self`, working inplace when possible)
- Somewhere during the 3.x release cycle (e.g. in 3.1, depending on when the deprecation
was started), we change the DeprecationWarning to a more visible FutureWarning.
- The deprecated keyword is removed in pandas 4.0.

When introducing the warning in 2.2 (or 3.0), users will already have the ability to
enable Copy-on-Write so they can rewrite their code in a way that avoids the deprecation
warning (remove the usage of `inplace`) while keeping the no-copy behaviour (which will
be the default with Copy-on-Write).

## PDEP History

- 16 February 2023: Initial draft

## References

[^1]: [Copy on Write Specification](https://pandas.pydata.org/pdeps/0007-copy-on-write.html)
[^2]: <https://github.com/pandas-dev/pandas/issues/48141>
[^3]: <https://github.com/pandas-dev/pandas/issues/16529>
[^4]: <https://github.com/pandas-dev/pandas/issues/50535>
[^5]: <https://stackoverflow.com/questions/37808904/prevent-ipython-from-storing-outputs-in-out-variable>