DLPack and readonly/immutable arrays #191

Closed

Description

@seberg (Contributor)

Sorry if there is some more discussion (or just clarification) in the pipeline, but I still think this is important, so here is an issue :). I realize that there was some related discussion in JAX before, but I am not sure how deep it really was. (I have talked with Ralf about this – we disagree about how important it is – but I wonder what others think, and we need to at least have an issue/clear decision somewhere.)


The problem (and the current state, including issues with the buffer protocol, which supports "readonly")

NumPy (and the buffer protocol) has readonly (writeable=False) arrays. NumPy actually draws a slight distinction (I do not think it matters much, though; the sketch after the list illustrates it):

  • Readonly arrays for which NumPy owns the memory. These can be set to writeable by a (power) user.
  • Arrays that are truly readonly.
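
A minimal sketch of that distinction (the second case uses an immutable bytes buffer; the variable names are just illustrative):

import numpy as np

# Readonly array whose memory NumPy owns: a (power) user can flip the flag back.
owned = np.arange(3)
owned.flags.writeable = False
owned.flags.writeable = True  # allowed, because NumPy owns the buffer

# Truly readonly: a view of an immutable bytes object cannot be made writeable.
frozen = np.frombuffer(b"\x00" * 12, dtype=np.float32)
frozen.flags.writeable = True  # raises ValueError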

The current situation in the "buffer protocol" (and array protocols) world is that NumPy supports both writeable and readonly arrays, JAX is immutable (and thus readonly), while PyTorch is always writeable:

import jax
import numpy as np
import torch

JAX to NumPy works correctly:

jax_tensor = jax.numpy.array([1., 2., 3.], dtype="float32")
numpy_array = np.asarray(jax_tensor)
assert not numpy_array.flags.writeable  # JAX exports as readonly.

NumPy to JAX: JAX ignores that the NumPy array is writeable when importing (same for memoryview):

numpy_arr = np.arange(3).astype(np.float32)
jax_tensor = jax.numpy.asarray(numpy_arr)  # JAX imports mutable array though
numpy_arr[0] = 5.
print(repr(jax_tensor))
# DeviceArray([5., 1., 2.], dtype=float32)

PyTorch also breaks JAX's immutability when talking to it (although you have to go via ndarray, I guess):

jax_tensor = jax.numpy.array([1., 2., 3.], dtype="float32")
torch_tensor = torch.Tensor(np.asarray(jax_tensor))
# UserWarning: The given NumPy array is not writeable, ... (scary warning!)
torch_tensor[0] = 5  # modifies the "immutable" jax tensor
print(repr(jax_tensor))
# DeviceArray([5., 2., 3.], dtype=float32)

And NumPy arrays can be backed by truly read-only memory:

arr = np.arange(20).astype(np.float32)  # torch default dtype
np.save("/tmp/test-array.npy", arr)
# The following is memory mapped read-only:
arr = np.load("/tmp/test-array.npy", mmap_mode="r")
torch_tensor = torch.Tensor(arr)  # (scary warning here)
torch_tensor[0] = 5.
# segmentation fault

This is in a world where "readonly" information exists, but JAX and PyTorch don't support it, and we leave it up to the user to know about these limitations. Because of that pretty severe limitation, PyTorch decides to give that scary (and ugly) warning. JAX tries to protect the user during export – although not in DLPack – but silently accepts that it is the user's problem during import.

I do realize that within the "array API", you are not supposed to write to an array. So within the "array API" world everything is read-only and that would solve the problem. But that feels like a narrow view to me: we want DLPack to be widely adopted, and most Array API users are also NumPy, PyTorch, JAX, etc. users (or interact with them). So they should expect to interact with NumPy, where mutability is common. Also, __dlpack__ could very much be more generally useful than the Array API itself.

Both JAX (immutable) and PyTorch (always writeable) have limitations that their users currently must be aware of when exchanging data. But it feels strange to me to force these limitations on NumPy. In particular, I do not like them in np.asarray(dlpack_object). To get around the limitations in np.asarray we would have to:

  • Import all DLPack arrays as readonly or always copy? This could be part of the standard. In that case it would make the second point unnecessary. But: that prevents any in-place algorithms from accepting DLPack directly.
  • Possibly, export only writeable arrays (to play it safe with PyTorch). Seems fine to me, at least for now (a bit weird if combined with the first point, and it doesn't round-trip); a minimal sketch follows below.
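
For the second option, a minimal sketch of what an exporter could do (a hypothetical method body, not NumPy's actual implementation):

def __dlpack__(self, stream=None):
    if not self.flags.writeable:
        # refuse to hand readonly memory to consumers that may write to it
        raise BufferError("cannot export a readonly array via DLPack")
    ...  # otherwise build and return the DLPack capsule as usual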

Clarifying something like that is probably sufficient, at least for now.


But IMO, the proper fix is to add a read-only bit to DLPack (or the __dlpack__ protocol). Adding that (not awkwardly) requires either extending the API (e.g. having a new struct to query metadata) or breaking ABI. I don't know what the solution is, but whatever it is, I would expect DLPack to be future-proofed anyway.
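
If such a bit existed, a consumer that does support readonly arrays could honor it at the Python level roughly like this (the readonly query on the producer is hypothetical; no such API exists today):

import numpy as np

def from_dlpack_readonly_aware(producer):
    arr = np.from_dlpack(producer)  # zero-copy view of the producer's memory
    if getattr(producer, "readonly", False):  # hypothetical metadata query
        arr.flags.writeable = False  # enforce the flag on the consumer side
    return arr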

Once DLPack is future-proofed, the decision of adding a flag could also be deferred to a future version…

As is, I am not sure NumPy should aim to make __dlpack__ its preferred exchange protocol in the future (whether that is likely to happen or not). Rather, it feels like NumPy should support DLPack mainly for the sake of those who choose to use it. (Unlike the buffer protocol, which users use without even knowing it, e.g. when writing Cython typed memoryviews.)

Neither JAX nor PyTorch currently quite supports "readonly" properly (and maybe never will). But I do not think that limitation is an argument against supporting it properly in __dlpack__. NumPy, Dask, CuPy, and Cython (typed memoryviews) do support it properly, after all. It seems almost like turning a PyTorch "user problem" into an ecosystem-wide "user problem"?

Of course, NumPy can continue talking the buffer protocol with Cython and many others (and likely will in any case). And I can live with the limitations, at least in an np.from_dlpack. But I don't like them in np.asarray(), and they just seem like unnecessary issues to me. (In that case, I may still prefer not to export readonly arrays.)


Or am I the only person who thinks this is an important problem that we have to solve for the user, rather than expecting the user to be aware of the limitations?


EDIT: xref a discussion at CuPy, which mentioned that nobody supports the readonly flag that __cuda_array_interface__ actually includes, and asked CuPy to support it. (I am not sure why the C level matters – unless it is a high-level C API – the Python API is what matters most? NumPy has a PyArray_FailUnlessWriteable() function for this in the public API.)

Activity

@leofang (Contributor) commented on Jun 6, 2021

Just a drop-by comment (on my cell): CuPy currently does not support the readonly flag and does not have any plan to, afaik.

@rgommers (Member) commented on Jun 9, 2021

Thanks for the summary, Sebastian!

Arrays that are truly readonly.

Let me add one other example you gave to make it more concrete: an ndarray that's backed by an HDF5 file opened in read-only mode.

There are other ways to get read-only arrays (memmap, broadcast_arrays, diag), but I think the HDF5 one will be the most widely used.

I do realize that within the "array API", you are not supposed to write to an array. So within the "array API" world everything is read-only and that would solve the problem.

I think you wrote this with a certain scenario in mind, but just to be sure: this is not the case in general. There are in-place operators, and item and slice assignment is supported. Also, DLPack is a memory-sharing protocol, and there's in principle nothing that forbids writing to that shared memory.

The only thing the array API says is:

  • if you are trying to write portable implementation-agnostic code,
  • and you use operations that may produce views (such as calling from_dlpack),
  • then in such cases avoid in-place operations (because the semantics are ill-defined; a short illustration follows the list).
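
A short illustration of that last point, assuming a NumPy-to-NumPy exchange where from_dlpack happens to be zero-copy:

import numpy as np

producer = np.zeros(3, dtype=np.float32)
consumer = np.from_dlpack(producer)  # a view here, but another library might copy
consumer += 1                        # in-place update
print(producer)                      # [1. 1. 1.] -- the write is visible to the producer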

This is in a world where "readonly" information exists, but JAX and PyTorch don't support it, and we leave it up to the user to know about these limitations. Because of that pretty severe limitation PyTorch decides to give that scary (and ugly) warning. JAX tries to protect the user during export – although not in DLPack – but silently accepts that it is the users problem during import.

I agree, this is not ideal. It's also very much nontrivial to implement (see, e.g., pytorch/pytorch#44027). For PyTorch it may still come at some point; for JAX and TensorFlow I don't think there's any chance of that. So given that this doesn't exist in most libraries, the options are:

  1. emit a warning and continue
  2. raise an exception, force the user to make a copy
  3. silently make a copy before exporting memory, to be safe
  4. just continue, and document what the user should/shouldn't do

  • Import all DLPack arrays as readonly or always copy? This could be part of the standard. In that case it would make the second point unnecessary. But: That prevents any in-place algorithms from accepting DLPack directly.

Import as read-only is possible for NumPy, but not for PyTorch – so it's not a general solution. Always copy goes against the main purpose of the protocol, I think.

  • Possibly, export only writeable arrays (to play safe with PyTorch). Seems fine to me, at least for now (a bit weird if combined with first point, and doesn't round-trip)

This makes sense to me, in particular for arrays that are backed by memory-mapped files. There are other cases here for other libraries, like lazy data structures. If the protocol is for in-memory data sharing, the data should be in memory. That still leaves the question whether the copy should be made by the library, or whether the library should raise and make the user do it.

@rgommers (Member) commented on Jun 9, 2021

NumPy, dask, cupy, cython (typed memoryviews) do support it properly after all. It seems almost like turning a PyTorch "user problem" into an ecosystem wide "user problem"?

Dask also doesn't quite support it, right? It's only able to hold NumPy arrays with the writeable flag set, but there's no support at the dask.array level. So really, out of all array/tensor libraries, it's NumPy that supports both modes, and PyTorch/JAX/TensorFlow/CuPy/MXNet/Dask don't.

And even in NumPy it's a fairly niche feature imho – I'd say "open a file in read-only mode" is the main use case. It's important enough for people who do a lot of their work with HDF5/NetCDF files to consider carefully, but I'd not say this is an "ecosystem-wide user problem".

But IMO, the proper fix is to add a read-only bit to DLPack (or the __dlpack__ protocol).

This would not help much compared to the situation now. Take a library like PyTorch: what is it supposed to do when it sees a read-only bit? It has no way to create read-only tensors, so it still has exactly the same set of choices 1-4 as I listed above. Same for JAX – it only has one kind of array, so nothing changes.
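
Concretely, an importer without readonly tensors can only pick among those choices; a minimal sketch of choices 1 and 3 (to_torch is a hypothetical helper, not part of the PyTorch API):

import numpy as np
import torch

def to_torch(np_arr, copy_readonly=True):
    if not np_arr.flags.writeable and copy_readonly:
        # choice 3: defensive copy, so writes to the tensor cannot touch readonly memory
        return torch.from_numpy(np_arr.copy())
    # choice 1: PyTorch warns about the non-writeable input and shares the memory anyway
    return torch.from_numpy(np_arr)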

added this to the v2021 milestone on Oct 4, 2021
@kgryte (Contributor) commented on Nov 4, 2021

DLPack support was removed from the specification for asarray in #301.

Concerns are still applicable for the buffer protocol and its support for "readonly".

However, it would not seem that there's anything further we need to address in the spec. If there is, feel free to reopen this issue to continue the discussion.

@rgommers (Member) commented on Nov 8, 2021

DLPack support was removed from the specification for asarray in #301.

We had some more recent discussion in a NumPy community meeting, and I think it aligned with the outcome here – removing DLPack support from asarray and leaving it to a specific separate function (from_dlpack) makes the introduction safer (since asarray is already widely used).
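
For illustration, from the user side that split looks roughly like this (np.from_dlpack exists in NumPy since 1.22; the PyTorch CPU tensor is just an example producer):

import numpy as np
import torch

t = torch.arange(3)
a = np.from_dlpack(t)  # explicit opt-in to the DLPack-based, zero-copy import
b = np.asarray(t)      # generic conversion path; DLPack is not involved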

Concerns are still applicable for the buffer protocol and its support for "readonly".

However, it would not seem that there's anything further we need to address in the spec. If there is, feel free to reopen this issue to continue the discussion.

Good point. Since it's older, there are no/fewer concerns about the readonly behavior of the buffer protocol, but yes, it's still a potential footgun. Anyway that existed for a long time before, so I'm happy to leave that as is and leave it up to libraries to deal with the readonly flag by raising a warning, an exception or by ignoring it as they see fit.

@leofang (Contributor) commented on Jan 9, 2022

Reopening this issue to address readonly tensors. As discussed in the Jan 6, 2022 meeting, it is preferable to still export readonly tensors and let the consumer decide how to handle them. So, the above comment

Anyway that existed for a long time before, so I'm happy to leave that as is and leave it up to libraries to deal with the readonly flag by raising a warning, an exception or by ignoring it as they see fit.

would not be fully applicable. The exporter should not raise.

This issue will close numpy/numpy#20742.

reopened this on Jan 9, 2022
@rgommers (Member) commented on Jan 9, 2022

The exporter should not raise.

Agreed, with "libraries" I meant importers. The buffer protocol has a readonly flag, so there's no problem for exporters.
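
For example, the flag is visible to any buffer-protocol consumer:

import numpy as np

arr = np.arange(3)
arr.flags.writeable = False
m = memoryview(arr)
print(m.readonly)  # True -- the importer can decide to warn, raise, or copy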

[32 remaining comments not shown]
