Description
update for 2019-10-07: We have a StringDtype extension dtype. Its memory model is the same as the old implementation: an object-dtype ndarray of strings. The next step is to store and process it natively.
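For reference, a minimal example of that StringDtype as it exists today (pandas >= 1.0); it validates values as strings and uses pd.NA for missing data, while still storing an object-dtype ndarray underneath:

```python
import pandas as pd

# The StringDtype extension dtype: string-only values, pd.NA for missing data,
# but (as noted above) the default storage is still an object-dtype ndarray.
s = pd.Series(["apple", "banana", None], dtype="string")
print(s.dtype)        # the 'string' extension dtype
print(s.str.upper())  # the usual .str accessor works unchanged
```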
Since we introduced Categorical in 0.15.0, I think we have found 2 main uses:
- as a 'real' Categorical/Factor type to represent a limited subset of values that the column can take on
- as a memory-saving representation for object dtypes (see the sketch just below)
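A quick sketch of the memory-saving use with current pandas (exact numbers vary by platform and pandas version):

```python
import pandas as pd

# A low-cardinality object column vs. the same data as 'category': the
# categorical stores one small integer code per element plus a single copy
# of each distinct string, which is where the memory saving comes from.
s_obj = pd.Series(["red", "green", "blue"] * 100_000)
s_cat = s_obj.astype("category")

print(s_obj.memory_usage(deep=True))  # counts every per-element Python string
print(s_cat.memory_usage(deep=True))  # integer codes + 3 category strings
```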
I could see introducing a dtype='string', where String is a slightly specialized sub-class of Categorical, with 2 differences compared to a 'regular' Categorical:
- it allows unions of arbitrary other string types; currently Categorical will complain if you do this:
In [1]: import pandas as pd; from pandas import DataFrame, Series
In [2]: df = DataFrame({'A': Series(list('abc'), dtype='category')})
In [3]: df2 = DataFrame({'A': Series(list('abd'), dtype='category')})
In [4]: pd.concat([df, df2])
ValueError: incompatible levels in categorical block merge
Note that this works if they are Series (and probably should raise as well; side issue).
But if these were both 'string' dtypes, then it's a simple matter to combine them (efficiently), as sketched after this list.
- you can restrict the 'sub-dtype' (i.e. the dtype of the categories) to string/unicode (in other words, don't allow numbers or arbitrary objects). This makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.
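As a point of comparison with current pandas (the string dtype proposed here did not exist in 2014), the union case from the example above is straightforward once both sides are string dtype:

```python
import pandas as pd

# Two columns with different string values: no category reconciliation is
# needed, so concatenation is a simple (and efficient) append.
a = pd.Series(list("abc"), dtype="string")
b = pd.Series(list("abd"), dtype="string")
print(pd.concat([a, b], ignore_index=True).dtype)  # stays string, not object
```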
I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string', e.g. on reading/conversion/etc. This might be a perf issue for some things, but I think the memory savings greatly outweigh the cost.
We would then have a 'real'-looking string dtype (and object would be relegated to actual Python object types, so it would be used much less).
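For later readers: the object-to-string conversion described above is roughly what DataFrame.convert_dtypes (pandas >= 1.0) does today; a sketch:

```python
import pandas as pd

# convert_dtypes() re-infers extension dtypes: an object column that holds
# only strings (and missing values) comes back as the 'string' dtype.
df = pd.DataFrame({"name": ["a", "b", None]})
print(df.dtypes)                   # name: object
print(df.convert_dtypes().dtypes)  # name: the 'string' extension dtype
```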
cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?
Activity
jorisvandenbossche commented on Oct 26, 2014
I think it would be a very nice improvement to have a real 'string' dtype in pandas.
So no longer having the confusion in pandas of object dtype being in most cases actually a string, and sometimes a 'real' object.
However, I don't know if this should be 'coupled' to categorical. Maybe that is only a technical implementation detail, but for me it should just be a string dtype: a dtype that holds string values and has in essence nothing to do with categorical.
If I think about a string dtype, I am more thinking about numpy's string types (though those of course also have impracticalities, e.g. their fixed sizes), or CHAR/VARCHAR in SQL.
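A small illustration of that fixed-size impracticality with numpy's native string dtypes (current numpy):

```python
import numpy as np

# numpy's unicode string dtype is fixed-width: every element reserves space
# for the declared width (4 bytes per character), and longer values are
# silently truncated when cast to a narrower width.
arr = np.array(["ab", "abcdef"])   # dtype('<U6'): 6 chars * 4 bytes per element
narrow = arr.astype("U3")          # the width is part of the dtype
print(arr.dtype, narrow.tolist())  # <U6 ['ab', 'abc']  -- 'abcdef' was truncated
```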
shoyer commented on Oct 26, 2014
I'm of two minds about this. This could be quite useful, but on the other hand, it would be way better if this could be done upstream in numpy or dynd. Pandas specific array types are not great for compatibility with the broader ecosystem.
I understand there are good reasons it may not be feasible to implement this upstream (#8350), but these solutions do feel very stop-gap. For example, if @teoliphant is right that dynd could be hooked up in the near future to replace numpy in pandas internals, I would be much more excited about exploring that possibility.
As for this specific proposal:
- I would prefer "string" rather than "interned_string", unless we're sure interning is always a good idea. Also, libraries like dynd do implement a true variable length string type (unlike numpy), and I think it is a good long term goal to align pandas dtypes with dtypes on the ndarray used for storage.
- factorize.
jreback commented on Oct 26, 2014
So I have tagged a related issue about including integer NA support by using libdynd (#8643). This will actually be the first thing I do (as it's new and cool, and I think a slightly more straightforward path to include dynd as an optional dep).
@mwiebe can you maybe explain a bit about the tradeoffs involved with representing strings in 2 ways using libdynd?
cc @teoliphant
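Integer NA support eventually landed in pandas itself as the nullable 'Int64' extension dtype (rather than via libdynd); a minimal example with current pandas:

```python
import pandas as pd

# Nullable integer dtype (pandas >= 0.24): integers plus pd.NA, no float upcast.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)  # Int64
print(s)        # 1, 2, <NA>
```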
mwiebe commented on Oct 31, 2014
I've been intending to tweak the string representation in dynd slightly, and have now written that up: libdynd/libdynd#158. The vlen string in dynd does work presently, but it has slightly different properties than what I'm describing there.
The current vlen string has a 16-byte representation using the small string optimization: strings that are <= 15 bytes when encoded as UTF-8 fit entirely in that memory. Bigger strings involve a dynamic memory allocation per string, a little like Python's string, but with UTF-8 encoding and the knowledge that it is a string, instead of having to go through dynamic dispatch as in numpy object arrays of strings.
Representing strings as a dynd categorical is a bit more complicated, and wouldn't be dynamically updatable in the same way. The types in dynd are immutable, so a categorical type, once created, has a fixed memory layout, etc. This allows for optimized storage, e.g. if the total number of categories is <= 256, each element can be stored as one byte in the array, but it does not allow assignment of a new string that is not already in the array of categories.
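pandas' own Categorical shows the same trade-off, for comparison (dynd aside):

```python
import pandas as pd

# With few categories the per-element codes fit in one byte (int8), but
# assigning a value that is not already a category is rejected rather than
# growing the category set on the fly.
s = pd.Series(["a", "b", "a"], dtype="category")
print(s.cat.codes.dtype)  # int8

try:
    s[0] = "z"            # 'z' is not an existing category
except (TypeError, ValueError) as err:
    print("rejected:", err)
```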
jankatins commented on Mar 3, 2015
The issue mentioned in the last comment is now at libdynd/libdynd#158
73 remaining items
WillAyd commented on Nov 11, 2019
closed via #27949
jorisvandenbossche commented on Nov 11, 2019
There is still relevant discussion here on the second part of this enhancement: a native storage (Tom also updated the top comment to reflect this)
maartenbreddels commented on Nov 12, 2019
After learning more about the goal of Apache Arrow, vaex will happily depend on it in the (near?) future.
I want to set aside the discussion on where the C++ string library code should live (inside or outside Arrow), so as not to get sidetracked.
I'm happy to spend a bit of my time to see if I can move algorithms and unit tests to Apache Arrow, but it would be good if some pandas/arrow devs could assist me a bit (I believe @xhochy offered me help once, does that offer still stand?).
Vaex's string API is modeled on Pandas (80-90% compatible), so my guess is that Pandas should be able to make use of this move to Arrow, since it could simply forward many of the string method calls directly to Arrow once the algorithms are moved.
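A sketch of what that forwarding can look like with pyarrow's compute kernels (these particular kernels exist in pyarrow; how much pandas routes through them depends on the version and dtype backend):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Arrow ships vectorized UTF-8 string kernels; a pandas .str method can in
# principle delegate to these instead of looping over Python objects.
arr = pa.array(["foo", "Bar", None])
print(pc.utf8_upper(arr))   # ["FOO", "BAR", null]
print(pc.utf8_length(arr))  # [3, 3, null]
```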
TomAugspurger commented on Nov 12, 2019
Thanks for the update @maartenbreddels.
Speaking for myself (not pandas-dev) I don't have a strong opinion on where these algorithms should live. I think pandas will find a way to use them regardless. Putting them in Arrow is probably convenient since we're dancing around a hard dependency on pyarrow in a few places.
I may be wrong, but I don't think any of the core pandas maintainers has C++ experience. One of us could likely help with the Python bindings though, if that'd be helpful.
TomAugspurger commented on Jul 7, 2020
I opened #35169 for discussing how we can expose an Arrow-backed StringArray to users.
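For later readers, the Arrow-backed string array discussed there is reachable in current pandas (1.3+) as the 'string[pyarrow]' dtype (requires pyarrow):

```python
import pandas as pd

# Arrow-backed string dtype: the familiar .str API, Arrow storage underneath.
s = pd.Series(["a", "b", None], dtype="string[pyarrow]")
print(s.dtype)        # StringDtype with pyarrow storage
print(s.str.upper())  # same accessor, Arrow-backed array
```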
jbrockmendel commented on Oct 17, 2022
@mroeschke closable?
mroeschke commented on Oct 17, 2022
Yeah, I believe the current StringDtype(storage="pyarrow"|"python") has satisfied the goal of this issue, so closing. We can open up more specific issues if there are follow-ups.
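A minimal example of the dtype referenced here, selecting each storage backend explicitly (the pyarrow variant requires pyarrow to be installed):

```python
import pandas as pd

# The two storage backends behind the same 'string' extension dtype.
python_backed = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))
arrow_backed = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="pyarrow"))

print(type(python_backed.array).__name__)  # StringArray
print(type(arrow_backed.array).__name__)   # ArrowStringArray
```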