
API/ENH: dtype='string' / pd.String #8640

Description

@jreback (Contributor)

update for 2019-10-07: We have a StringDtype extension dtype. Its memory model is the same as the old implementation's: an object-dtype ndarray of strings. The next step is to store and process it natively.
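As a minimal sketch of what the update refers to (assuming pandas >= 1.0, where `StringDtype` landed): the `"string"` alias constructs the extension dtype, and missing values become `pd.NA` rather than `np.nan` as with object dtype.

```python
import pandas as pd

# Construct a string-dtype Series; storage is still an object-dtype
# ndarray under the hood, but the dtype is a real string dtype.
s = pd.Series(["a", "b", None], dtype="string")

print(s.dtype)         # string
print(s.isna().sum())  # 1 -- the None became pd.NA
```

This is the "first step" described above; native (non-object) storage is the remaining work.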


xref #8627
xref #8643, #8350

Since we introduced Categorical in 0.15.0, I think we have found 2 main uses.

  1. as a 'real' Categorical/Factor type to represent a limited subset of values that the column can take on
  2. as a memory saving representation for object dtypes.

I could see introducing a dtype='string', where String would be a slightly specialized sub-class of Categorical, with two differences compared to a 'regular' Categorical:

  • it allows unions of arbitrary other string types, currently Categorical will complain if you do this:
In [1]: df = DataFrame({'A' : Series(list('abc'),dtype='category')})
In [2]: df2 = DataFrame({'A' : Series(list('abd'),dtype='category')})
In [3]: pd.concat([df,df2])
ValueError: incompatible levels in categorical block merge

Note that this works if they are Series (and probably should raise as well; side issue).

But, if these were both 'string' dtypes, then it's a simple matter to combine them (efficiently).

  • you can restrict the 'sub-dtype' (e.g. the dtype of the categories) to string/unicode (in other words, don't allow numbers / arbitrary objects). This makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.
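To illustrate the first point with today's API (a sketch, assuming a pandas version with `StringDtype`): with `dtype='string'` the equivalent of the failing categorical concat above combines without complaint.

```python
import pandas as pd

# The same two frames as in the categorical example, but as string dtype.
df = pd.DataFrame({"A": pd.Series(list("abc"), dtype="string")})
df2 = pd.DataFrame({"A": pd.Series(list("abd"), dtype="string")})

out = pd.concat([df, df2])
print(out["A"].dtype)  # string
print(len(out))        # 6
```

No category reconciliation is needed because the dtype carries no per-column category set.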

I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string', e.g. on reading/conversion/etc. This might be a perf issue for some things, but I think the memory savings greatly outweigh the cost.

We would then have a 'real' string dtype (and object would be relegated to actual Python object types, so would be used much less).

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?

Activity

added this to the 0.16.0 milestone on Oct 26, 2014
jorisvandenbossche (Member) commented on Oct 26, 2014

I think it would be a very nice improvement to have a real 'string' dtype in pandas.
It would end the confusion of the object dtype actually holding strings in most cases, and only sometimes 'real' objects.

However, I don't know if this should be 'coupled' to categorical. Maybe that is only a technical implementation detail, but for me it should just be a string dtype, a dtype that holds string values, and has in essence nothing to do with categorical.

If I think about a string dtype, I am more thinking of numpy's string types (though they of course also have impracticalities, such as fixed sizes), or CHAR/VARCHAR in SQL.

shoyer (Member) commented on Oct 26, 2014

I'm of two minds about this. This could be quite useful, but on the other hand, it would be way better if this could be done upstream in numpy or dynd. Pandas specific array types are not great for compatibility with the broader ecosystem.

I understand there are good reasons it may not be feasible to implement this upstream (#8350), but these solutions do feel very stop-gap. For example, if @teoliphant is right that dynd could be hooked up in the near future to replace numpy in pandas internals, I would be much more excited about exploring that possibility.

As for this specific proposal:

  1. Would we really use this in place of object dtype for almost all string data in pandas? If so, this needs to meet a much higher standard than if it's merely an option.
  2. It would be premature to call this the dtype "string" rather than "interned_string", unless we're sure interning is always a good idea. Also, libraries like dynd do implement a true variable length string type (unlike numpy), and I think it is a good long term goal to align pandas dtypes with dtypes on the ndarray used for storage.
  3. The worst of the performance consequences might be avoided if we do not guarantee that the string "categories" are unique. Otherwise every str op requires a call to factorize.
  4. Especially if this is the default/standard, I really think we should try to make it work for N-dimensional data (I still need to finish up my patch for categorical).
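The factorize cost mentioned in point 3 can be sketched concretely: recovering unique "categories" and integer codes from a string column is what `pd.factorize` does, and it is an O(n) hash-table pass per call, so doing it before every string op would be expensive.

```python
import pandas as pd

values = pd.Series(["a", "b", "a", "c", "b"])

# One full pass over the data: builds a hash table of uniques
# and emits an integer code per element.
codes, uniques = pd.factorize(values)

print(list(codes))    # [0, 1, 0, 2, 1]
print(list(uniques))  # ['a', 'b', 'c']
```

If uniqueness of the categories is not guaranteed, this pass can be skipped or deferred, which is the performance relief shoyer is pointing at.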
jreback (Contributor, Author) commented on Oct 26, 2014

So I have tagged a related issue about including integer NA support by using libdynd (#8643). This will actually be the first thing I do (as it's new and cool, and I think a slightly more straightforward path to include dynd as an optional dep).

@mwiebe

can you maybe explain a bit about the tradeoffs involved in representing strings in two ways using libdynd:

  • as a libdynd categorical (like proposed above, but using the native categorical type, which DOES exist in libdynd currently)
  • as vlen strings (another libdynd feature that DOES exist).

cc @teoliphant

mwiebe (Contributor) commented on Oct 31, 2014

I've been intending to tweak the string representation in dynd slightly, and have now written that up in libdynd/libdynd#158. The vlen string in dynd does work presently, but it has slightly different properties than what I'm writing there.

This vlen string has a 16-byte representation using the small string optimization. This means strings whose UTF-8 encoding is <= 15 bytes will fit in that memory. Bigger strings involve a dynamic memory allocation per string, a little like Python's string, but with the UTF-8 encoding and the knowledge that it is a string, instead of having to go through dynamic dispatch as in numpy object arrays of strings.
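The inline-vs-heap rule can be sketched as a hypothetical helper (this is an illustration of the described layout, not dynd API; `fits_inline` and the 16-byte figure come straight from the comment above):

```python
def fits_inline(s: str, repr_size: int = 16) -> bool:
    """Hypothetical sketch of the small-string optimization rule described
    above: a string fits inline in the 16-byte representation when its
    UTF-8 encoding is at most 15 bytes; anything longer would need a
    per-string heap allocation."""
    return len(s.encode("utf-8")) <= repr_size - 1

print(fits_inline("hello world!!!!"))   # True  (15 UTF-8 bytes, inline)
print(fits_inline("hello world!!!!!"))  # False (16 UTF-8 bytes, heap)
```

Note the rule is in bytes of the UTF-8 encoding, not characters, so multi-byte code points use up the inline budget faster.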

Representing strings as a dynd categorical is a bit more complicated, and wouldn't be dynamically updatable in the same way. The types in dynd are immutable, so a categorical type, once created, has a fixed memory layout, etc. This allows for optimized storage, e.g. if the total number of categories is <= 256, each element can be stored as one byte in the array, but it does not allow the assignment of a new string that was not already in the array of categories.
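pandas' own Categorical makes the analogous storage optimization: with a small number of categories, the per-element codes are stored as a single signed byte (a sketch against the current pandas API):

```python
import pandas as pd

cat = pd.Categorical(["red", "green", "blue", "red", "green"])

# Three categories fit comfortably in int8 codes: one byte per element.
print(cat.codes.dtype)  # int8
print(list(cat.codes))  # [2, 1, 0, 2, 1] -- categories sort to blue, green, red
```

The difference mwiebe describes is that dynd's categorical fixes this layout in the (immutable) type itself, whereas pandas keeps the categories on the array and can grow them.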

jankatins (Contributor) commented on Mar 3, 2015

The issue mentioned in the last comment is now at libdynd/libdynd#158

modified the milestones: 0.16.0, Next Major Release on Mar 6, 2015

[73 intervening comments omitted]

WillAyd (Member) commented on Nov 11, 2019

closed via #27949

jorisvandenbossche (Member) commented on Nov 11, 2019

There is still relevant discussion here on the second part of this enhancement: native storage (Tom also updated the top comment to reflect this).

maartenbreddels commented on Nov 12, 2019

After learning more about the goal of Apache Arrow, vaex will happily depend on it in the (near?) future.

I want to set aside the discussion of where the C++ string library code should live (in or outside Arrow), so as not to get sidetracked.

I'm happy to spend a bit of my time to see if I can move algorithms and unit tests to Apache Arrow, but it would be good if some pandas/arrow devs could assist me a bit (I believe @xhochy offered me help once, does that offer still stand?).

Vaex's string API is modeled on pandas' (80-90% compatible), so my guess is that pandas would be able to benefit from this move to Arrow, since it could simply forward many of its string method calls directly to Arrow once the algorithms are moved.

In short:

  • Is Arrow interested in string contributions from vaex' codebase (with cleanups), and willing to assist me?
  • Would pandas benefit from this, i.e. would it use Arrow for string processing if all of the vaex algorithms are in Arrow?
TomAugspurger (Contributor) commented on Nov 12, 2019

Thanks for the update @maartenbreddels.

Speaking for myself (not pandas-dev) I don't have a strong opinion on where these algorithms should live. I think pandas will find a way to use them regardless. Putting them in Arrow is probably convenient since we're dancing around a hard dependency on pyarrow in a few places.

I may be wrong, but I don't think any of the core pandas maintainers has C++ experience. One of us could likely help with the Python bindings though, if that'd be helpful.

TomAugspurger (Contributor) commented on Jul 7, 2020

I opened #35169 for discussing how we can expose an Arrow-backed StringArray to users.

jbrockmendel (Member) commented on Oct 17, 2022

@mroeschke closable?

mroeschke (Member) commented on Oct 17, 2022

Yeah, I believe the current StringDtype(storage="pyarrow"|"python") has satisfied the goal of this issue, so closing. We can open more specific issues if there are follow-ups.
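The resolution referred to, sketched against the current pandas API (the `"pyarrow"` variant additionally requires pyarrow to be installed, so only the always-available `"python"` storage is constructed here):

```python
import pandas as pd

# Object-backed storage: always available, no extra dependency.
s_py = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))
print(s_py.dtype.storage)  # python

# Arrow-backed native storage: requires pyarrow.
# s_pa = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="pyarrow"))
```

The pyarrow storage is what finally delivers the "store & process it natively" step from the top comment.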


Metadata

Labels: Enhancement; ExtensionArray (extending pandas with custom dtypes or arrays); Performance (memory or execution speed); Strings (string extension data type and string data)

Participants: @xhochy, @wesm, @mwiebe, @WillAyd, @jankatins