Description
Reproducible Example
import pandas as pd
import numpy as np
x = pd.Series([1,2,3.0])
y = np.asarray(x)
y[1:] = np.nan
pd.options.mode.copy_on_write = True
w = pd.Series([1,2,3.0])
z = np.asarray(w)
z[1:] = np.nan
Issue Description
This probably isn't a bug but it makes interaction outside of pandas considerably harder.
It seems that it would be good to somewhere have a formal way too ensure that casts to numpy arrays are always writable. Perhaps a writeable
keyword in to_numpy
?
Expected Behavior
Maybe leave the CoW behavior behind when moving outside of pandas.
Installed Versions
INSTALLED VERSIONS
commit : 478d340
python : 3.11.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 2.0.0
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.0.dev0+1817.98c51a6
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Activity
bashtage commentedon Apr 21, 2023
As a side note, enabling CoW on statsmodels leads to ~300 test failures.
bashtage commentedon Apr 21, 2023
xref numpy/numpy#23625
phofl commentedon Apr 21, 2023
This is intended for now. There was some discussion in another issue. We can reopen if you like.
Can you share a stats models build?
bashtage commentedon Apr 21, 2023
I"m just testing locally after statsmodels/statsmodels#8811 was reported. I am planning on setting up a canary build with CoW enabled, and I'll post that when it is done.
jbrockmendel commentedon Apr 21, 2023
I'm leaning towards this (also on the front-end)
jorisvandenbossche commentedon May 2, 2023
@bashtage To understand what was going on in this specific case: the issue was not that you were actually relying on being able to modify those arrays, but rather that you were passing those arrays to cython code that couldn't handle read-only (const) memoryviews (statsmodels/statsmodels#8820). Is that correct?
And what that the only/main source of failures, or are there more?
I think the cython issue is going to be a very common one for code passing data from DataFrames to cython algos.
(but also: making your code robust for const arrays can be useful in general, as there are other reasons you can end up with read-only arrays, for example when converted from pyarrow in certain ways)
bashtage commentedon May 15, 2023
A lot of them where caes where
const
was needed in Cython. There were others where aDataFrame
was intentionally created, e.g., to use agroupby
, and thenasarray
was used on that. The output of theasarray
when then further modified in place (or attempted), which failed.There are bigger question as to what
asarray
should return when getting aDataFrame
, especially when extension arrays become much more common (i.e., strings).jorisvandenbossche commentedon May 15, 2023
For cases like this, where you know that you created the DataFrame yourself (not passed in by the user) or is result of a calculation that is never a view of original data (an aggregation), it might be nice if we can provide an easier way to get a writeable array.
I think the current recommendation would be: if you know you can safely write to the array, "just" change the flag:
On the one hand, this is not super nice, and we could think about some keyword in
to_numpy()
to override this? (so you could doarr = df.to_numpy(keep_writeable_i_know_what_im_doing=True)
but with a decent name)On the other hand, doing this more manually also makes it very explicit that this is just overwriting that flag (and not "ensuring" the return value is writeable by making a copy of it, as one would typically expect).
jorisvandenbossche commentedon May 15, 2023
Note that you can already do
arr = df.to_numpy(copy=True)
if you want to be sure to have a writeable result, and don't care about making the extra copy (that setting the flag like the above avoids).So I think this already should give you what a potential
arr = np.asarray(df, writable=True)
would do as proposed in numpy/numpy#23625