
Plan for a native string dtype #35169

Closed

Description

TomAugspurger (Contributor)

Apache Arrow has support for natively storing UTF-8 data, and work is ongoing to
add kernels (e.g. str.isupper()) that operate on that data. This issue is to
discuss how we can expose the native string dtype to pandas' users.

There are several things to discuss:

  1. How do users opt into this behavior?
  2. A fallback mode for kernels that are not implemented.

How do users opt into Arrow-backed StringArray?

The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).

StringArray is marked as experimental, so our usual restrictions on API-breaking
changes don't apply. But we want to do this in a way that's not too disruptive.

There are three ways to get a StringDtype-dtype array today:

  1. Infer: pd.array(['a', 'b', None])
  2. Explicit dtype=pd.StringDtype()
  3. String alias dtype="string"
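
For illustration, all three spellings currently give the same object-backed StringArray (output as of pandas 1.0):

>>> pd.array(['a', 'b', None])                          # 1. inferred
<StringArray>
['a', 'b', <NA>]
Length: 3, dtype: string
>>> pd.array(['a', 'b', None], dtype=pd.StringDtype())  # 2. same result
>>> pd.Series(['a', 'b', None], dtype="string").array   # 3. same result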

My preference is for all of these to stay consistent: they should all give
either a StringArray backed by an object-dtype ndarray or a StringArray backed
by Arrow memory.

I also have a preference for not keeping around our old implementation for too
long. So I don't think we want something like pd.PythonStringDtype() as a
way to get the StringArray backed by an object-dtype ndarray.

The easiest way to support this is, I think, an option.

>>> pd.options.mode.use_arrow_string_dtype = True

Then all of those would create an Arrow-backed StringArray.
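
For concreteness, a minimal sketch (not pandas internals; the option flag and both class names are stand-ins) of how construction could dispatch on such an option:

# Stand-ins for the two implementations; illustrative only.
class ObjectStringArray:
    """StringArray backed by an object-dtype ndarray (today's implementation)."""

class ArrowStringArray:
    """Hypothetical StringArray backed by Arrow memory."""

# Stands in for the proposed pd.options.mode.use_arrow_string_dtype option.
USE_ARROW_STRING_DTYPE = False

def construct_string_array_type():
    # All three construction paths (inference, pd.StringDtype(), "string")
    # would funnel through a single dispatch point like this one.
    return ArrowStringArray if USE_ARROW_STRING_DTYPE else ObjectStringArray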

Fallback Mode

It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does

>>> Series(['a', 'b'], dtype="string").str.normalize()  # no arrow kernel

we have a few options:

  1. Raise, stating that there's no kernel for normalize.
  2. PerformanceWarning, astype to object, do the operation, and convert back

I'm not sure which is best. My preference for now is probably to raise, but I could see doing either.
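
For concreteness, a sketch of option 2; has_arrow_kernel is a hypothetical stand-in for a lookup against the kernels pyarrow actually ships:

import warnings
import pandas as pd

def has_arrow_kernel(op_name):
    # Stand-in: a real implementation would consult pyarrow.compute.
    return False

def str_op_with_fallback(series, op_name, *args, **kwargs):
    if has_arrow_kernel(op_name):
        return getattr(series.str, op_name)(*args, **kwargs)
    warnings.warn(
        f"no Arrow kernel for {op_name!r}; falling back to object dtype",
        pd.errors.PerformanceWarning,
    )
    result = getattr(series.astype(object).str, op_name)(*args, **kwargs)
    return result.astype("string")  # convert back to the string dtype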

Activity

jreback (Contributor) commented on Jul 7, 2020

why make this complicated? I would just make arrow an import of StringArray and call it a day; it's already experimental. Then bump the version of arrow required as needed. We did exactly this with parquet.

xhochy (Contributor) commented on Jul 8, 2020
Points that come to mind, having worked a bit on this in fletcher:

  • Arrow's data structures aren't as straightforward as numpy's. There needs to be a decision on what the backing pyarrow type of the StringArray is.
    • Should it be a pyarrow.Array (one contiguous block of memory) or a pyarrow.ChunkedArray (where concatenation is free)?
    • Arrow has two string types: string (32-bit offsets for start/end) and large_string (64-bit offsets for start/end). Do we support both? Or limit to one?
  • What should be the return type of algorithms that return non-string types (mostly bool and List[str]), always object?
  • pandas supports in-place modification of arrays; the pyarrow.(Chunked)Array structures don't. There are two ways to approach this, but for string arrays only the first makes sense due to how strings are stored in Arrow. If other Arrow types are used at some future time, this becomes a more critical decision. Keeping this here mainly to raise awareness of the immutability.
    • Always copy the whole array, even on 1-element modifications.
    • Implement a separate pandas.ArrowArray class that adheres to the Arrow memory layout (so that construction of a pyarrow.(Chunked)Array is zero-copy) but allows in-place edits.
  • Arrow 1.0 will probably only implement ~20% of the algorithms, so in most cases a fallback to object mode will be triggered. In my tests, this carries roughly a 2x performance penalty compared to a pure object-typed StringArray, which makes such a class less desirable. Although I would expect that developing an Arrow-backed StringArray in pandas will take some time, we can probably work on this and merge to master once the next post-1.0 release is published.
  • Additionally, I would strongly prefer not to have a global option pd.options.mode.use_arrow_string_dtype, but rather to keep the switch between dtype=object and dtype="string" so that one can decide on a per-column basis whether to use Arrow (sketched below). This makes it easier if you implement algorithms on top of the arrays using e.g. numba but aren't able to convert them all at once to the new dtype.
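
The per-column switch suggested in the last point already works with today's dtypes:

>>> df = pd.DataFrame({
...     "a": pd.array(["x", "y"], dtype="string"),  # new nullable string dtype
...     "b": pd.array(["x", "y"], dtype=object),    # classic object-dtype column
... })
>>> df.dtypes
a    string
b    object
dtype: object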

Also noting here: I have spent quite some time (sadly not sufficient time) prototyping things around this in fletcher. I'm happy to contribute them here to get things going. My feeling is that the parts that aren't yet working nicely mostly need work on the Arrow side, not in pandas. But having a draft PR up here would probably help people understand what is needed.

xhochy (Contributor) commented on Jul 8, 2020

Also note that I track the algorithm coverage of the current pandas API vs what is implemented in pyarrow here: xhochy/fletcher#121

TomAugspurger (Contributor, Author) commented on Jul 8, 2020

[jreback] why make this complicated? I would just make arrow an import of StringArray and call it a day; it's already experimental. Then bump the version of arrow required as needed. We did exactly this with parquet.

That's the simplest way from our end. Are we willing to require arrow to opt into the new string dtype?

Thanks for that list @xhochy, that's extremely valuable. In particular:

  1. It makes this feel like less of a drop-in replacement for our current string implementation (especially around in-place mutation).
  2. We might want to consider implementing a ListDtype, based on arrow memory, at the same time. This would support things like .str.split(..., expand=False). I'll open a separate issue for that.
xhochy (Contributor) commented on Jul 8, 2020

Regarding the immutability: I posted an explanation in #8640 (comment), in case a reader of this thread is unclear about why we aren't just making a mutable type here instead.

jorisvandenbossche (Member) commented on Jul 8, 2020

@TomAugspurger thanks for getting this discussion started!

[jreback] I would just make arrow an import of StringArray and call it a day, it's already experimental.

I would personally prefer not to tie the use of "string" dtype to the presence of pyarrow. If pyarrow is optional (and I personally prefer to keep it that way for now), I would prefer keeping it optional for the new dtypes as well.

[TomAugspurger] I also have a preference for not keeping around our old implementation for too
long. So I don't think we want something like pd.PythonStringDtype() as a
way to get the StringArray backed by an object-dtype ndarray.

As long as pyarrow is optional, I think we need to keep the "old" implementation around (related to the above). But I agree we probably don't want a pd.PythonStringDtype().
In case we keep the use of the pyarrow string array optional for "string" dtype, I think there are basically two options: have different dtypes (like StringDtype and PythonStringDtype, or other names), or have a single "parameterized" dtype (eg StringDtype(use_objects=True/False), where the default could be None and detect whether pyarrow is installed or not).
Of course, such a single dtype might also give corner cases with losing that information, etc.
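
A rough sketch of the single-parameterized-dtype option; the use_objects keyword and the detection logic are illustrative, not a settled API:

class StringDtype:
    """Illustrative parameterized dtype, not the real pandas class."""

    def __init__(self, use_objects=None):
        if use_objects is None:
            # Default: use the Arrow backend when pyarrow is installed.
            try:
                import pyarrow  # noqa: F401
                use_objects = False
            except ImportError:
                use_objects = True
        self.use_objects = use_objects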

jorisvandenbossche (Member) commented on Jul 8, 2020

[xhochy] What should be the return type of algorithms that return non-string types (mostly bool and List[str]), always object?

As in general with our new nullable dtypes, operations should as much as possible also return nullable dtypes (so eg the nullable boolean dtype). For List that's of course not yet possible.
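
For example, boolean-returning string methods on the nullable string dtype already return the nullable boolean dtype (output as of pandas 1.0):

>>> pd.Series(["a", None], dtype="string").str.isupper()
0    False
1     <NA>
dtype: boolean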

[xhochy] I have spent quite some time (sadly not sufficient time) prototyping things around this in fletcher.

@xhochy based on that experience, what's your current take on the Array vs ChunkedArray point you raise above?
I think fletcher right now handles this very explicitly, using different EAs/dtypes for both. But I think for pandas we want to "hide" this somewhat from the user (so we either need to choose one, or handle both in the same array under the hood)?

xhochy (Contributor) commented on Jul 8, 2020

[jorisvandenbossche] @xhochy based on that experience, what's your current take on the Array vs ChunkedArray point you raise above?

"I have no idea anymore".

From an Arrow standpoint, ChunkedArray is more natural, as it provides the full flexibility. It is just that chunking makes implementing algorithms on top more complicated; spatialpandas thus uses an Array, as this makes the implementation simpler. Nowadays I lean a bit more towards Array, as this is simpler for the end user. But I'm totally undecided here and interested in other people's views.

jorisvandenbossche (Member) commented on Jul 9, 2020

We discussed this a bit yesterday on the call. I'll try to summarize some of the points that were brought up (based on our notes and my recollection), with some additional background:

  • Array vs ChunkedArray
    • Context: pyarrow provides two possible data structures to use under the hood for a column in a DataFrame. Array is a single contiguous array, ChunkedArray can be seen as a list of arrays that represents a single logical array.
    • In fletcher, support for both is implemented, but this is made explicit (different dtype for both: FletcherContinuousDtype and FletcherChunkedDtype).
    • When you can call out to pyarrow functions, it is mostly smooth to use either (most kernels in pyarrow will accept both). But when needing to implement some custom functionality in pandas, ChunkedArrays add complexity (eg finding a position in a ChunkedArray first needs to find the chunk and then the position within the chunk -> "double" indexing; see eg the fletcher setitem example and the sketch after this list).
    • When reading large binary/string data (> 2GB of data in total for a single column), pyarrow can automatically return a ChunkedArray instead of an Array.
    • ChunkedArray can give a cheap concat (as the arrays don't need to be copied into a single contiguous array).
    • Conclusion? No definitive one, but a general inclination towards starting with ChunkedArray. Since this is mostly hidden from the user, we can always re-evaluate along the way (and if we decide to switch from ChunkedArray to Array, the code only gets easier later on).
  • String vs LargeString type
    • Context: the default string type in pyarrow uses int32 offsets into the contiguous binary array of characters (offsets denote the start/stop of each "scalar" string in the contiguous array). This means the maximum number of bytes in a single string array is limited to np.iinfo(np.int32).max, i.e. ~2 GiB (a maximum for the full array, and hence also for a single element). To overcome this limitation (without chunking), there is also a large_string type using int64 offsets (increasing the memory use, but giving a basically unlimited size of ~8000 PB).
    • We didn't discuss this much, but I think there are multiple options: a) choose a single one (eg string, and rely on chunking to support larger data); b) support both behind a single pandas dtype (a dtype parametrized on the offset type); c) support both with separate pandas dtypes (following the string / large_string distinction of Arrow).
  • Mutability
    • Context: since the array of strings is stored in one contiguous chunk of bytes, assigning a different string to a single element is in general not possible (with the exception of eg strings of the exact same length in bytes). See also API/ENH: dtype='string' / pd.String #8640 (comment).
    • There are workarounds to still provide the same end-user experience of mutability, but this means that basically each assignment (__setitem__) leads to a copy of (part of) the data. This gives a different performance profile for that operation (especially when doing assignments in a loop), and it might also change the API regarding "views" (when mutating leads to a copy, other arrays that are a view on the original one are not updated as you would expect).
    • This might be a reason to keep the “object python string” dtype for people who want to do a lot of mutations (apart from keeping it in case we don't require pyarrow for StringDtype).
    • We need to ensure we provide efficient "bulk" assignment/replacement operations (eg the replace method where you provide a replacement mapping, assignment with a list / boolean mask, ...).
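
To make the "double indexing" point above concrete, a small sketch (assuming pyarrow is installed):

import pyarrow as pa

chunked = pa.chunked_array([["a", "b"], ["c", "d", "e"]])

def locate(chunked_arr, i):
    """Translate a flat position into (chunk number, position within chunk)."""
    for chunk_no, chunk in enumerate(chunked_arr.chunks):
        if i < len(chunk):
            return chunk_no, i
        i -= len(chunk)
    raise IndexError(i)

chunk_no, pos = locate(chunked, 3)                   # -> (1, 1)
assert chunked.chunks[chunk_no][pos].as_py() == "d"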
toobaz (Member) commented on Jul 9, 2020

[jorisvandenbossche] • Mutability

As a user, this is what has always worried me most since we started talking about Arrow as a backend for strings. But I think asking users to choose between mutability and efficiency would be acceptable, as opposed to (I think) making efficient, non-mutable strings the default, and (definitely) as opposed to entirely replacing the current mutable type with one that isn't.
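
For concreteness, a minimal sketch (illustrative class, not a proposed API) of the "always copy on 1-element modification" workaround discussed above:

import pyarrow as pa

class CopyOnSetStringArray:
    """Presents a mutable __setitem__ on top of immutable Arrow memory."""

    def __init__(self, values):
        self._data = pa.array(values, type=pa.string())

    def __setitem__(self, i, value):
        # Arrow buffers are immutable, so every assignment rebuilds the
        # array; a cheap-looking single-element write becomes an O(n) copy.
        values = self._data.to_pylist()
        values[i] = value
        self._data = pa.array(values, type=pa.string())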

xhochy (Contributor) commented on Jul 10, 2020

As this will probably need more than the string algorithms exposed by the str accessor (covered by ARROW-555), I have set up an umbrella issue on the Arrow side for the remaining parts: ARROW-9401. Notably, we probably need to implement some custom take operations for pandas.

I plan to start a PR with the basic scaffolding for the data type next week.
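
For context, pyarrow already exposes a basic take kernel, but pandas' semantics (e.g. allow_fill=True, where -1 marks a missing value) don't map onto it directly; in Arrow, null indices produce null outputs instead. A quick illustration, assuming pyarrow is installed:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a", "b", "c"])
# Null indices yield nulls; pandas instead uses -1 together with allow_fill.
print(pc.take(arr, pa.array([2, 0, None])))  # -> ["c", "a", null]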

