
API/ENH: dtype='string' / pd.String #8640

Description

@jreback (Contributor)

update for 2019-10-07: We have a StringDtype extension dtype. Its memory model is the same as the old implementation's: an object-dtype ndarray of strings. The next step is to store and process it natively.
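As a minimal sketch of what the update refers to (assuming pandas >= 1.0, where `StringDtype` landed): the `"string"` alias constructs the extension dtype, and missing values become `pd.NA` rather than `np.nan` as with object dtype.

```python
import pandas as pd

# Construct a string-dtype Series; storage is still an object-dtype
# ndarray under the hood, but the dtype is a real string dtype.
s = pd.Series(["a", "b", None], dtype="string")

print(s.dtype)         # string
print(s.isna().sum())  # 1 -- the None became pd.NA
```

This is the "first step" described above; native (non-object) storage is the remaining work.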


xref #8627
xref #8643, #8350

Since we introduced Categorical in 0.15.0, I think we have found 2 main uses.

  1. as a 'real' Categorical/Factor type to represent a limited subset of values that the column can take on
  2. as a memory saving representation for object dtypes.

I could see introducing a dtype='string', where String would be a slightly specialized sub-class of Categorical, with two differences compared to a 'regular' Categorical:

  • it allows unions of arbitrary other string types, currently Categorical will complain if you do this:
In [1]: df = DataFrame({'A' : Series(list('abc'),dtype='category')})
In [2]: df2 = DataFrame({'A' : Series(list('abd'),dtype='category')})
In [3]: pd.concat([df,df2])
ValueError: incompatible levels in categorical block merge

Note that this works if they are Series (and probably should raise as well; side issue).

But, if these were both 'string' dtypes, then it's a simple matter to combine them (efficiently).

  • you can restrict the 'sub-dtype' (e.g. the dtype of the categories) to string/unicode (in other words, don't allow numbers / arbitrary objects). This makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.
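To illustrate the first point with today's API (a sketch, assuming a pandas version with `StringDtype`): with `dtype='string'` the equivalent of the failing categorical concat above combines without complaint.

```python
import pandas as pd

# The same two frames as in the categorical example, but as string dtype.
df = pd.DataFrame({"A": pd.Series(list("abc"), dtype="string")})
df2 = pd.DataFrame({"A": pd.Series(list("abd"), dtype="string")})

out = pd.concat([df, df2])
print(out["A"].dtype)  # string
print(len(out))        # 6
```

No category reconciliation is needed because the dtype carries no per-column category set.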

I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string', e.g. on reading/conversion/etc. This might be a perf issue for some things, but I think the memory savings greatly outweigh the cost.

We would then have a 'real' string dtype (and object would be relegated to actual Python object types, so would be used much less).

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?

Activity

added this to the 0.16.0 milestone on Oct 26, 2014
jorisvandenbossche (Member) commented on Oct 26, 2014

I think it would be a very nice improvement to have a real 'string' dtype in pandas.
It would end the confusion of the object dtype actually holding strings in most cases, and only sometimes 'real' objects.

However, I don't know if this should be 'coupled' to categorical. Maybe that is only a technical implementation detail, but for me it should just be a string dtype, a dtype that holds string values, and has in essence nothing to do with categorical.

If I think about a string dtype, I am more thinking of numpy's string types (though they of course also have impracticalities, such as fixed sizes), or CHAR/VARCHAR in SQL.

shoyer (Member) commented on Oct 26, 2014

I'm of two minds about this. This could be quite useful, but on the other hand, it would be way better if this could be done upstream in numpy or dynd. Pandas specific array types are not great for compatibility with the broader ecosystem.

I understand there are good reasons it may not be feasible to implement this upstream (#8350), but these solutions do feel very stop-gap. For example, if @teoliphant is right that dynd could be hooked up in the near future to replace numpy in pandas internals, I would be much more excited about exploring that possibility.

As for this specific proposal:

  1. Would we really use this in place of object dtype for almost all string data in pandas? If so, this needs to meet a much higher standard than if it's merely an option.
  2. It would be premature to call this the dtype "string" rather than "interned_string", unless we're sure interning is always a good idea. Also, libraries like dynd do implement a true variable length string type (unlike numpy), and I think it is a good long term goal to align pandas dtypes with dtypes on the ndarray used for storage.
  3. The worst of the performance consequences might be avoided if we do not guarantee that the string "categories" are unique. Otherwise every str op requires a call to factorize.
  4. Especially if this is the default/standard, I really think we should try to make it work for N-dimensional data (I still need to finish up my patch for categorical).
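The factorize cost mentioned in point 3 can be sketched concretely: recovering unique "categories" and integer codes from a string column is what `pd.factorize` does, and it is an O(n) hash-table pass per call, so doing it before every string op would be expensive.

```python
import pandas as pd

values = pd.Series(["a", "b", "a", "c", "b"])

# One full pass over the data: builds a hash table of uniques
# and emits an integer code per element.
codes, uniques = pd.factorize(values)

print(list(codes))    # [0, 1, 0, 2, 1]
print(list(uniques))  # ['a', 'b', 'c']
```

If uniqueness of the categories is not guaranteed, this pass can be skipped or deferred, which is the performance relief shoyer is pointing at.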
jreback (Contributor, Author) commented on Oct 26, 2014

So I have tagged a related issue about including integer NA support by using libdynd (#8643). This will actually be the first thing I do (as it's new and cool, and I think a slightly more straightforward path to include dynd as an optional dep).

@mwiebe

can you maybe explain a bit about the tradeoffs involved in representing strings in two ways using libdynd:

  • as a libdynd categorical (like proposed above, but using the native categorical type, which DOES exist in libdynd currently)
  • as vlen strings (another libdynd feature that DOES exist).

cc @teoliphant

mwiebe (Contributor) commented on Oct 31, 2014

I've been intending to tweak the string representation in dynd slightly, and have now written that up in libdynd/libdynd#158. The vlen string in dynd does work presently, but it has slightly different properties than what I'm writing there.

This vlen string has a 16-byte representation using the small string optimization. This means strings whose UTF-8 encoding is <= 15 bytes will fit in that memory. Bigger strings involve a dynamic memory allocation per string, a little like Python's string, but with the UTF-8 encoding and the knowledge that it is a string, instead of having to go through dynamic dispatch as in numpy object arrays of strings.
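The inline-vs-heap rule can be sketched as a hypothetical helper (this is an illustration of the described layout, not dynd API; `fits_inline` and the 16-byte figure come straight from the comment above):

```python
def fits_inline(s: str, repr_size: int = 16) -> bool:
    """Hypothetical sketch of the small-string optimization rule described
    above: a string fits inline in the 16-byte representation when its
    UTF-8 encoding is at most 15 bytes; anything longer would need a
    per-string heap allocation."""
    return len(s.encode("utf-8")) <= repr_size - 1

print(fits_inline("hello world!!!!"))   # True  (15 UTF-8 bytes, inline)
print(fits_inline("hello world!!!!!"))  # False (16 UTF-8 bytes, heap)
```

Note the rule is in bytes of the UTF-8 encoding, not characters, so multi-byte code points use up the inline budget faster.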

Representing strings as a dynd categorical is a bit more complicated, and wouldn't be dynamically updatable in the same way. The types in dynd are immutable, so a categorical type, once created, has a fixed memory layout, etc. This allows for optimized storage, e.g. if the total number of categories is <= 256, each element can be stored as one byte in the array, but it does not allow the assignment of a new string that was not already in the array of categories.
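pandas' own Categorical makes the analogous storage optimization: with a small number of categories, the per-element codes are stored as a single signed byte (a sketch against the current pandas API):

```python
import pandas as pd

cat = pd.Categorical(["red", "green", "blue", "red", "green"])

# Three categories fit comfortably in int8 codes: one byte per element.
print(cat.codes.dtype)  # int8
print(list(cat.codes))  # [2, 1, 0, 2, 1] -- categories sort to blue, green, red
```

The difference mwiebe describes is that dynd's categorical fixes this layout in the (immutable) type itself, whereas pandas keeps the categories on the array and can grow them.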

jankatins (Contributor) commented on Mar 3, 2015

The issue mentioned in the last comment is now at libdynd/libdynd#158

modified the milestones: 0.16.0, Next Major Release on Mar 6, 2015

[73 intervening comments omitted]

WillAyd (Member) commented on Nov 11, 2019

closed via #27949

jorisvandenbossche (Member) commented on Nov 11, 2019

There is still relevant discussion here on the second part of this enhancement: native storage (Tom also updated the top comment to reflect this).

maartenbreddels commented on Nov 12, 2019

After learning more about the goal of Apache Arrow, vaex will happily depend on it in the (near?) future.

I want to set aside the discussion of where the C++ string library code should live (in or outside Arrow), so as not to get sidetracked.

I'm happy to spend a bit of my time to see if I can move algorithms and unit tests to Apache Arrow, but it would be good if some pandas/arrow devs could assist me a bit (I believe @xhochy offered me help once, does that offer still stand?).

Vaex's string API is modeled on pandas' (80-90% compatible), so my guess is that pandas would be able to benefit from this move to Arrow, since it could simply forward many of its string method calls directly to Arrow once the algorithms are moved.

In short:

  • Is Arrow interested in string contributions from vaex' codebase (with cleanups), and willing to assist me?
  • Would pandas benefit from this, i.e. would it use Arrow for string processing if all of the vaex algorithms are in Arrow?
TomAugspurger (Contributor) commented on Nov 12, 2019

Thanks for the update @maartenbreddels.

Speaking for myself (not pandas-dev) I don't have a strong opinion on where these algorithms should live. I think pandas will find a way to use them regardless. Putting them in Arrow is probably convenient since we're dancing around a hard dependency on pyarrow in a few places.

I may be wrong, but I don't think any of the core pandas maintainers has C++ experience. One of us could likely help with the Python bindings though, if that'd be helpful.

TomAugspurger (Contributor) commented on Jul 7, 2020

I opened #35169 for discussing how we can expose an Arrow-backed StringArray to users.

jbrockmendel (Member) commented on Oct 17, 2022

@mroeschke closable?

mroeschke (Member) commented on Oct 17, 2022

Yeah, I believe the current StringDtype(storage="pyarrow"|"python") has satisfied the goal of this issue, so closing. We can open more specific issues if there are follow-ups.
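The resolution referred to, sketched against the current pandas API (the `"pyarrow"` variant additionally requires pyarrow to be installed, so only the always-available `"python"` storage is constructed here):

```python
import pandas as pd

# Object-backed storage: always available, no extra dependency.
s_py = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))
print(s_py.dtype.storage)  # python

# Arrow-backed native storage: requires pyarrow.
# s_pa = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="pyarrow"))
```

The pyarrow storage is what finally delivers the "store & process it natively" step from the top comment.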


Metadata

Labels: Enhancement; ExtensionArray (extending pandas with custom dtypes or arrays); Performance (memory or execution speed); Strings (string extension data type and string data)

Participants: @xhochy, @wesm, @mwiebe, @WillAyd, @jankatins