Description
Apache Arrow supports natively storing UTF-8 data, and work is ongoing to add kernels (e.g. `str.isupper()`) that operate on that data. This issue
is to discuss how we can expose the native string dtype to pandas users.
There are several things to discuss:
- How do users opt into this behavior?
- A fallback mode for kernels that are not yet implemented.
How do users opt into Arrow-backed StringArray?
The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).
StringArray is marked as experimental, so our usual restrictions on API-breaking changes don't apply. But we want to do this in a way that's not too disruptive.
There are three ways to get a `StringDtype`-dtype array today:
- Infer: `pd.array(['a', 'b', None])`
- Explicit dtype: `dtype=pd.StringDtype()`
- String alias: `dtype="string"`
My preference is for all of these to stay consistent: they should all give either a
StringArray backed by an object-dtype ndarray or a StringArray backed by Arrow
memory.
I also have a preference for not keeping our old implementation around for too
long. So I don't think we want something like `pd.PythonStringDtype()` as a
way to get the StringArray backed by an object-dtype ndarray.
The easiest way to support this is, I think, an option:

```python
>>> pd.options.mode.use_arrow_string_dtype = True
```

Then all of those would create an Arrow-backed StringArray.
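A minimal sketch of how such an option could be consulted at construction time; the option name is the one proposed above, while `_string_array_backend` and `ArrowStringArray` are hypothetical, not existing pandas internals:

```python
import pandas as pd

def _string_array_backend():
    # Hypothetical dispatch: all three construction paths (inference,
    # pd.StringDtype(), "string") would route through something like this.
    if pd.get_option("mode.use_arrow_string_dtype"):
        from pandas.core.arrays import ArrowStringArray  # hypothetical class
        return ArrowStringArray
    from pandas.arrays import StringArray  # object-backed implementation
    return StringArray
```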
Fallback Mode
It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does
```python
>>> Series(['a', 'b'], dtype="string").str.normalize()  # no arrow kernel
```
we have a few options:
- Raise, stating that there's no kernel for `normalize`.
- Issue a PerformanceWarning, astype to object, do the operation, and convert back.
I'm not sure which is best. My preference for now is probably to raise, but I could see doing either.
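For reference, a rough sketch of what the warn-and-fall-back option could look like; `_fallback_str_op` is illustrative only, while `PerformanceWarning` is the existing pandas warning class:

```python
import warnings

import pandas as pd
from pandas.errors import PerformanceWarning

def _fallback_str_op(ser, op_name, *args, **kwargs):
    # Hypothetical fallback: no Arrow kernel for op_name, so warn,
    # round-trip through object dtype, and cast back to "string".
    warnings.warn(
        f"falling back to object dtype: no Arrow kernel for {op_name!r}",
        PerformanceWarning,
    )
    result = getattr(ser.astype(object).str, op_name)(*args, **kwargs)
    return result.astype("string")

# _fallback_str_op(pd.Series(["a", "b"], dtype="string"), "normalize", "NFC")
```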
Activity
jreback commented on Jul 7, 2020
Why make this complicated? I would just make arrow a required import for StringArray and call it a day; it's already experimental. Then bump the required version of arrow as needed. We did exactly this with parquet.
xhochy commented on Jul 8, 2020
Points that come to mind having worked a bit on that in fletcher:

- Decide what the `pyarrow` type of the StringArray is: a `pyarrow.Array` (one continuous block of memory) or a `pyarrow.ChunkedArray` (concatenation is free).
- Arrow has two string types: `string` (32bit offsets for start/end) and `large_string` (64bit offsets for start/end). Do we support both? Or limit to one?
- What dtype do operations return that don't produce strings (e.g. `bool` and `List[str]`), always `object`?
- `pandas` supports in-place modifications of arrays. The `pyarrow.(Chunked)Array` structures don't support this. There are two ways to approach this, but for string arrays only the first makes sense due to the structure of how strings are stored in Arrow. If other Arrow types were used at some future time, then this is actually a more critical decision. Keeping this here mainly to raise awareness of the immutability.
  - Copy (part of) the data on each modification.
  - Implement a `pandas.ArrowArray` class that adheres to the Arrow memory layout (and thus construction of a `pyarrow.(Chunked)Array` is zero-copy) but allows in-place edits.
- For kernels that aren't implemented, a fallback to `object` mode will be triggered. In my test, this has a roughly 2x performance penalty compared to a pure-`object`-typed `StringArray`. This makes such a class less desirable. Although I would expect that the development of an Arrow-backed StringArray in `pandas` might take some time and we can probably work on this and merge to master once the next release post-1.0 is published.
- I would not add `pd.options.mode.use_arrow_string_dtype` but rather keep the switch with `dtype=object` and `dtype=string` so that one can decide on a per-column basis whether to use Arrow. This makes it easier if you implement algorithms on top of the arrays using e.g. `numba` but aren't able to convert them all at once to the new dtype.

Also noting here: I have spent quite some time (sadly not sufficient time) prototyping things around this in fletcher. I'm happy to just contribute them here to get things going. My feeling is that the parts that aren't yet working nicely mostly need work on the Arrow side and not in `pandas`. But having a Draft PR up here would probably help people understand what is needed.
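To make the first two points above concrete, a small pyarrow illustration (assuming a pyarrow version that ships `large_string`, i.e. >= 0.15):

```python
import pyarrow as pa

# string uses 32bit offsets, large_string uses 64bit offsets.
small = pa.array(["a", "bc", None], type=pa.string())
large = pa.array(["a", "bc", None], type=pa.large_string())

# A ChunkedArray is a list of Arrays acting as one logical array;
# concatenation just extends the list of chunks instead of copying.
chunked = pa.chunked_array([small, pa.array(["d"], type=pa.string())])
assert chunked.num_chunks == 2 and len(chunked) == 4
```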
xhochy commented on Jul 8, 2020

Also note that I track the algorithm coverage of the current `pandas` API vs what is implemented in `pyarrow` here: xhochy/fletcher#121

TomAugspurger commented on Jul 8, 2020
That's the simplest way from our end. Are we willing to require arrow to opt into the new string dtype?

Thanks for that list @xhochy, that's extremely valuable. In particular the return-type question: I think we should implement a `ListDtype`, based on arrow memory, at the same time. This would support things like `.str.split(..., expand=False)`. I'll open a separate issue for that.

xhochy commented on Jul 8, 2020
Regarding the immutability, I posted an explanation in #8640 (comment) in case a reader of this thread is unclear why we aren't just making a mutable type here instead.
jorisvandenbossche commented on Jul 8, 2020
@TomAugspurger thanks for getting this discussion started!
I would personally prefer not to tie the use of "string" dtype to the presence of pyarrow. If pyarrow is optional (and I personally prefer to keep it that way for now), I would prefer keeping it optional for the new dtypes as well.

As long as pyarrow is optional, I think we need to keep the "old" implementation around (related to the above). But I agree we probably don't want a `pd.PythonStringDtype()`.

In case we keep the use of the pyarrow string array optional for "string" dtype, I think there are basically two options: have different dtypes (like `StringDtype` and `PythonStringDtype` or other names), or have a single "parameterized" dtype (e.g. `StringDtype(use_objects=True/False)`, where the default could be None and detect whether pyarrow is installed or not).

Of course, such a single dtype might also give corner cases with losing that information etc.
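A sketch of what the parameterized variant could look like; `use_objects` follows the naming floated above, and the class is hypothetical rather than a complete ExtensionDtype implementation:

```python
import pandas as pd

class ParameterizedStringDtype(pd.api.extensions.ExtensionDtype):
    # Hypothetical: a single "string" dtype parameterized by its storage.
    name = "string"

    def __init__(self, use_objects=None):
        if use_objects is None:
            # Default: pick Arrow-backed storage when pyarrow is installed.
            try:
                import pyarrow  # noqa: F401
                use_objects = False
            except ImportError:
                use_objects = True
        self.use_objects = use_objects
```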
jorisvandenbossche commented on Jul 8, 2020
As in general with our new nullable dtypes, operations should as much as possible also return nullable dtypes (so e.g. the nullable boolean dtype). For List that's of course not yet possible.
@xhochy based on that experience, what's your current take on the Array vs ChunkedArray point you raise above?
I think fletcher right now handles this very explicitly and uses different EAs/dtypes for both? But I think for pandas we want to "hide" this somewhat from the user (so we either need to choose one or handle both in the same array under the hood)?
xhochy commented on Jul 8, 2020
"I have no idea anymore".
From an Arrow standpoint,
ChunkedArray
is more natural as it provides the full flexibility and operation. It is just that chunking makes implementing algorithms on top more complicated.spatialpandas
thus uses anArray
as this makes the implementation simpler. Nowadays I lean a bit more towardsArray
as this is simpler for the end-user. But I'm totally undecided here and interested in other people's view.jorisvandenbossche commentedon Jul 9, 2020
We discussed this a bit yesterday on the call. Will try to summarize some of the points that were brought up here (based on our notes and my recollection) with some additional background:

Array vs ChunkedArray

- `Array` is a single contiguous array; a `ChunkedArray` can be seen as a list of arrays that represents a single logical array.
- The `String` type in pyarrow uses int32 offsets into the contiguous binary array of characters (the offsets denote the start/stop of each "scalar" string in the contiguous array). This means that the maximum number of bytes in a single string array is limited to `np.iinfo(np.int32).max` -> `np.iinfo(np.int32).max / 1024 / 1024 / 1024 == ~2GB` (max for the full array, but hence also for a single element). To overcome this limitation (without chunking), there is also a `LargeString` type using int64 offsets (so increasing the memory use, but giving basically unlimited size (~8000 PB)).

Immutability

- Any mutation (e.g. `__setitem__`) leads to a copy of (part of) the data. This will however give a different performance experience for this operation (especially when doing assignments in a loop), and this might also change the API regarding "views" (when mutating leads to a copy, other arrays that are a view on the original one are not updated as you would expect?).
- This affects the various APIs through which users currently mutate values (e.g. a `replace` method where you provide a replacement mapping, assignment with a list / boolean mask, ...).
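A minimal sketch of the copy-on-mutation behaviour described in the first bullet (a hypothetical wrapper; pyarrow arrays themselves expose no `__setitem__`):

```python
import pyarrow as pa

class CopyOnWriteStrings:
    # Hypothetical wrapper: every __setitem__ rebuilds the Arrow array,
    # since the layout (contiguous character buffer + offsets) cannot be
    # edited in place for variable-length strings.
    def __init__(self, values):
        self._data = pa.array(values, type=pa.string())

    def __setitem__(self, i, value):
        copied = self._data.to_pylist()                  # copy out
        copied[i] = value
        self._data = pa.array(copied, type=pa.string())  # full rebuild

arr = CopyOnWriteStrings(["a", "b", None])
arr[1] = "z"  # an O(n) copy rather than an O(1) in-place write
```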
toobaz commented on Jul 9, 2020

As a user, this is what has always worried me most since we started talking about arrow as a backend for strings. But I think asking users to choose between mutability and efficiency would be acceptable, as opposed to (I think) making efficient, non-mutable strings the default and (definitely) entirely replacing the current mutable type with one that isn't.
xhochy commented on Jul 10, 2020
As this will probably need more than the string algorithms that are exposed by the `str` accessor (covered by ARROW-555), I have set up an umbrella issue on the Arrow side for the remaining parts: ARROW-9401; notably, we probably need to implement some custom pandas take operations.

I plan to start a PR with the basic scaffolding for the data type next week.
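On the custom take point: pandas `ExtensionArray.take` marks missing values with `-1` when `allow_fill=True`, whereas `pyarrow.Array.take` treats null indices as missing, so one plausible shape for such an operation is a thin translation layer (a sketch, not the eventual implementation):

```python
import pyarrow as pa

def pandas_style_take(arr: pa.Array, indices):
    # Translate the pandas convention (-1 -> missing) into pyarrow's
    # convention (null index -> null result).
    pa_indices = pa.array([i if i != -1 else None for i in indices])
    return arr.take(pa_indices)

result = pandas_style_take(pa.array(["a", "b", "c"]), [2, -1, 0])
# -> ["c", null, "a"]
```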