Skip to content

BUG: CoW and np.asarray leads to surprising issues #52823

Open
@bashtage

Description

@bashtage
Contributor

Reproducible Example

import pandas as pd
import numpy as np

x = pd.Series([1,2,3.0])
y = np.asarray(x)
y[1:] = np.nan

pd.options.mode.copy_on_write = True
w = pd.Series([1,2,3.0])
z = np.asarray(w)
z[1:] = np.nan

Issue Description

This probably isn't a bug but it makes interaction outside of pandas considerably harder.

It seems that it would be good to somewhere have a formal way too ensure that casts to numpy arrays are always writable. Perhaps a writeable keyword in to_numpy?

Expected Behavior

Maybe leave the CoW behavior behind when moving outside of pandas.

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.11.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.0.0
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.0.dev0+1817.98c51a6
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Activity

added
Needs TriageIssue that has not been reviewed by a pandas team member
on Apr 21, 2023
bashtage

bashtage commented on Apr 21, 2023

@bashtage
ContributorAuthor

As a side note, enabling CoW on statsmodels leads to ~300 test failures.

bashtage

bashtage commented on Apr 21, 2023

@bashtage
ContributorAuthor
phofl

phofl commented on Apr 21, 2023

@phofl
Member

This is intended for now. There was some discussion in another issue. We can reopen if you like.

Can you share a stats models build?

bashtage

bashtage commented on Apr 21, 2023

@bashtage
ContributorAuthor

I"m just testing locally after statsmodels/statsmodels#8811 was reported. I am planning on setting up a canary build with CoW enabled, and I'll post that when it is done.

jbrockmendel

jbrockmendel commented on Apr 21, 2023

@jbrockmendel
Member

Maybe leave the CoW behavior behind when moving outside of pandas.

I'm leaning towards this (also on the front-end)

jorisvandenbossche

jorisvandenbossche commented on May 2, 2023

@jorisvandenbossche
Member

As a side note, enabling CoW on statsmodels leads to ~300 test failures.

@bashtage To understand what was going on in this specific case: the issue was not that you were actually relying on being able to modify those arrays, but rather that you were passing those arrays to cython code that couldn't handle read-only (const) memoryviews (statsmodels/statsmodels#8820). Is that correct?
And what that the only/main source of failures, or are there more?

I think the cython issue is going to be a very common one for code passing data from DataFrames to cython algos.
(but also: making your code robust for const arrays can be useful in general, as there are other reasons you can end up with read-only arrays, for example when converted from pyarrow in certain ways)

bashtage

bashtage commented on May 15, 2023

@bashtage
ContributorAuthor

A lot of them where caes where const was needed in Cython. There were others where a DataFrame was intentionally created, e.g., to use a groupby, and then asarray was used on that. The output of the asarray when then further modified in place (or attempted), which failed.

There are bigger question as to what asarray should return when getting a DataFrame, especially when extension arrays become much more common (i.e., strings).

jorisvandenbossche

jorisvandenbossche commented on May 15, 2023

@jorisvandenbossche
Member

There were others where a DataFrame was intentionally created, e.g., to use a groupby, and then asarray was used on that. The output of the asarray when then further modified in place (or attempted), which failed.

For cases like this, where you know that you created the DataFrame yourself (not passed in by the user) or is result of a calculation that is never a view of original data (an aggregation), it might be nice if we can provide an easier way to get a writeable array.

I think the current recommendation would be: if you know you can safely write to the array, "just" change the flag:

# ... code that creates a dataframe
df = ...
arr = np.asarray(df)
# I know it is safe to ignore the readonly flag
df.flags.readonly = False
# ... code that modifies `arr` in place

On the one hand, this is not super nice, and we could think about some keyword in to_numpy() to override this? (so you could do arr = df.to_numpy(keep_writeable_i_know_what_im_doing=True) but with a decent name)
On the other hand, doing this more manually also makes it very explicit that this is just overwriting that flag (and not "ensuring" the return value is writeable by making a copy of it, as one would typically expect).

jorisvandenbossche

jorisvandenbossche commented on May 15, 2023

@jorisvandenbossche
Member

Note that you can already do arr = df.to_numpy(copy=True) if you want to be sure to have a writeable result, and don't care about making the extra copy (that setting the flag like the above avoids).
So I think this already should give you what a potential arr = np.asarray(df, writable=True) would do as proposed in numpy/numpy#23625

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @jorisvandenbossche@bashtage@jbrockmendel@DeaMariaLeon@phofl

        Issue actions

          BUG: CoW and np.asarray leads to surprising issues · Issue #52823 · pandas-dev/pandas