Description
Code Sample, a copy-pastable example if possible
pd.Series(range(5, 10), dtype="Int64").astype("string")
raises TypeError: data type not understood
while
pd.Series(range(5, 10)).astype("string")
raises ValueError: StringArray requires a sequence of strings or missing values.
If you first do astype(str):
pd.Series(range(5, 10)).astype(str).astype("string")
and
pd.Series(range(5, 10), dtype="Int64").astype(str).astype("string")
work as expected:
0 5
1 6
2 7
3 8
4 9
dtype: string
Going through astype(object) instead raises, in both cases, ValueError: StringArray requires a sequence of strings or missing values.
Problem description
I can understand the ValueError, since you don't feed strings to the StringArray. Ideally, astype("string") would convert the values to strings, or astype(str) would return a StringArray. In any case, I would expect both pd.Series(range(5, 10), dtype="Int64").astype("string") and pd.Series(range(5, 10)).astype("string") to raise the same error.
Expected Output
0 5
1 6
2 7
3 8
4 9
dtype: string
or
ValueError: StringArray requires a sequence of strings or missing values.
Activity
TomAugspurger commented on Jan 22, 2020
Overlaps with #22384, which is trying to solve this problem in general.
In the meantime, we can add support for this in IntegerArray.astype.
Dr-Irv commented on Jan 24, 2020
@TomAugspurger Was skimming issues and saw this one.
I'm wondering if Series.astype('string') should be treated as a special case, independent of the underlying dtype of the Series. That's because the following code should always work, assuming s is a Series:
So since we know that the underlying objects in the EA have to support str(), there is a straightforward way of doing that conversion.
If you agree, I can look into doing a PR.
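As a rough sketch of the str()-based conversion described above: the helper name to_string_dtype is hypothetical and not part of pandas, and the element-wise loop is only one way to do it. It assumes a pandas version that provides the "string" dtype and pd.NA (1.0 or later).

```python
import pandas as pd


def to_string_dtype(s: pd.Series) -> pd.Series:
    """Hypothetical helper: convert any Series to the nullable "string" dtype."""
    # Call str() on every non-missing element and keep missing ones as pd.NA,
    # so the StringArray only ever receives strings or missing values.
    values = [pd.NA if pd.isna(v) else str(v) for v in s]
    return pd.Series(values, index=s.index, name=s.name, dtype="string")


# Works the same way for the nullable and the NumPy-backed integer Series.
nullable_ints = pd.Series(range(5, 10), dtype="Int64")
numpy_ints = pd.Series(range(5, 10))
print(to_string_dtype(nullable_ints))
print(to_string_dtype(numpy_ints))
```

Because it loops over the elements in Python, this is a workaround rather than a substitute for a proper astype implementation.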
TomAugspurger commented on Jan 24, 2020
Dr-Irv commented on Jan 24, 2020
Well, I'm suggesting that we do want a special case just for StringDtype in NDFrame.astype. It seems natural that the 2 operations below should produce the same result, modulo having an object vs string dtype, for a given Series s (and returning np.nan vs. pd.NA for missing values), independent of the dtype of the Series s:
TomAugspurger commented on Jan 24, 2020
Right, that's definitely desirable. But I don't think NDFrame.astype is the place for the fix.
TomAugspurger commented on Jan 24, 2020
For example, consider SparseArray. It implements astype such that the result is also a SparseArray and preserves the sparsity. So Series[SparseArray].astype("string") would return a Series[SparseArray[string]]. But NDFrame.astype has no awareness of that.
Dr-Irv commented on Jan 24, 2020
I guess this depends on the semantics of astype() in the following sense. If I have an EA of type "current_dtype" and I write s.astype("target_dtype"), is it either:
I think you are saying that the design we have supports (1), and I'm suggesting a design corresponding to (2).
Now, the reason that I prefer (2) is that when I construct a Series and provide the dtype as an argument, pandas figures out how to convert the data passed to the Series to the corresponding dtype if it can. That is behavior corresponding to (2). So s=pd.Series([1,2,3], dtype="category") and s=pd.Series([1,0,pd.NA], dtype="boolean") both work, but s=pd.Series([1,2,3], dtype="string") does not.
Another possible design would be to have a property of EAs called something like can_convert_anydtype, being True or False. If True, then astype knows it can ask the EA to convert any dtype, and if False, it then asks the target dtype to do the conversion. So, for StringDtype, we set can_convert_anydtype to True, and for other dtypes set it to False.
TomAugspurger commented on Jan 24, 2020
We have another issue for an astype dispatch mechanism.
tritemio commented on Feb 10, 2020
I want to chime in just to give another use-case from the duplicated issue #31839.
In addition to conversion to "string", converting "string" to "Int8/16/64" when the initial series contains pd.NA is currently quite tricky:
It should be possible to do simply s.astype('Int8').
vadella commented on Feb 12, 2020
Your solution can give rounding errors when dealing with large integers. (I've been bitten by this when importing production data. The batch numbers were too large to fit exactly in a float64.)
This explicitly loops over the column, so it is not ideal performance-wise.
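As a rough illustration of the kind of conversion discussed here, a loop-based helper that turns a "string" Series into a nullable integer dtype without ever passing through float64 might look like the following. The name string_to_nullable_int is hypothetical, and the per-element loop carries the performance cost mentioned above.

```python
import pandas as pd


def string_to_nullable_int(s: pd.Series, dtype: str = "Int64") -> pd.Series:
    """Hypothetical helper: parse strings to a nullable integer dtype."""
    # int() parses each string exactly, so large values are not rounded the
    # way they would be if they went through float64 first.
    values = [pd.NA if pd.isna(v) else int(v) for v in s]
    return pd.Series(values, index=s.index, name=s.name, dtype=dtype)


# 2**53 + 1 cannot be represented exactly as a float64.
s = pd.Series([str(2**53 + 1), pd.NA, "7"], dtype="string")
print(string_to_nullable_int(s))
```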
tritemio commented on Feb 12, 2020
@vadella, thanks for the code example. I had the same problem too and I currently side-step it by loading the data directly in string format. Another reason why it is important that this conversion is handled by pandas internally.
test_astype, test_neg for old pandas versions apache/spark#33250