convert numeric column to dedicated pd.StringDtype() #31204

Closed
@vadella

Description

Code Sample, a copy-pastable example if possible

pd.Series(range(5, 10), dtype="Int64").astype("string")

raises TypeError: data type not understood

while

pd.Series(range(5, 10)).astype("string")

raises ValueError: StringArray requires a sequence of strings or missing values.

If you first do astype(str):

pd.Series(range(5, 10)).astype(str).astype("string")

and

pd.Series(range(5, 10), dtype="Int64").astype(str).astype("string")

work as expected:

0    5
1    6
2    7
3    8
4    9
dtype: string

Going through astype(object) instead of astype(str) raises in both cases: ValueError: StringArray requires a sequence of strings or missing values.

Problem description

I can understand the ValueError, since you don't feed strings to the StringArray. Ideally, astype("string") would convert the values to strings, or astype(str) would return a StringArray. In any case, I would expect both pd.Series(range(5, 10), dtype="Int64").astype("string") and pd.Series(range(5, 10)).astype("string") to raise the same error.

Expected Output

0    5
1    6
2    7
3    8
4    9
dtype: string

or

ValueError: StringArray requires a sequence of strings or missing values.

Activity

TomAugspurger

TomAugspurger commented on Jan 22, 2020

@TomAugspurger
Contributor

Overlaps with #22384, which is trying to solve this problem in general.

In the meantime, we can add support for this in IntegerArray.astype.

added this to the Contributions Welcome milestone on Jan 22, 2020
Dr-Irv

Dr-Irv commented on Jan 24, 2020

@Dr-Irv
Contributor

@TomAugspurger Was skimming issues and saw this one.

I'm wondering if Series.astype('string') should be treated as a special case, independent of the dtype of the underlying Series. That's because the following code should always work, assuming s is a Series:

pd.Series([str(x) if not pd.isna(x) else pd.NA for x in s], dtype="string")

So since we know that the underlying objects in the EA have to support str(), there is a straightforward way of doing that conversion.
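The element-wise conversion above can be wrapped in a small helper. This is only a sketch: to_string_dtype is a hypothetical name, not a pandas API, and it loops in Python, so it will be slow on large columns.

```python
import pandas as pd


def to_string_dtype(s: pd.Series) -> pd.Series:
    """Convert any Series to the dedicated "string" dtype, element by element.

    Hypothetical helper (not part of pandas): missing values become pd.NA,
    every other value goes through str().
    """
    values = [pd.NA if pd.isna(x) else str(x) for x in s]
    return pd.Series(values, index=s.index, name=s.name, dtype="string")


# Nullable integer input with one missing value.
ints = pd.Series([5, 6, None], dtype="Int64")
converted = to_string_dtype(ints)
print(converted.dtype)
print(converted.tolist())
```

Because the helper only relies on str() and pd.isna(), it does not need to know anything about the source dtype, which is the point being made above.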

If you agree, I can look into doing a PR

TomAugspurger

TomAugspurger commented on Jan 24, 2020

@TomAugspurger
Contributor
Dr-Irv

Dr-Irv commented on Jan 24, 2020

@Dr-Irv
Contributor

And code-wise, I don't think we'd want a special case for this in NDFrame.astype.

Well, I'm suggesting that we do want a special case just for StringDtype in NDFrame.astype. It seems natural that the 2 operations below should produce the same result, modulo having an object vs string dtype, for a given Series s (and returning np.nan vs. pd.NA for missing values), independent of the dtype of the Series s:

s.astype(str)
s.astype('string')
TomAugspurger

TomAugspurger commented on Jan 24, 2020

@TomAugspurger
Contributor

Right, that's definitely desirable. But I don't think NDFrame.astype is the place for the fix.

TomAugspurger

TomAugspurger commented on Jan 24, 2020

@TomAugspurger
Contributor

For example, consider SparseArray. It implements astype such that the result is also a SparseArray and preserves the sparsity. So Series[SparseArray].astype("string") would return a Series[SparseArray[string]]. But NDFrame.astype has no awareness of that.

Dr-Irv

Dr-Irv commented on Jan 24, 2020

@Dr-Irv
Contributor

For example, consider SparseArray. It implements astype such that the result is also a SparseArray and preserves the sparsity. So Series[SparseArray].astype("string") would return a Series[SparseArray[string]]. But NDFrame.astype has no awareness of that.

I guess this depends on the semantics of astype() in the following sense.

If I have an EA of type "current_dtype" and I write s.astype("target_dtype"), is it either:

  1. the responsibility of the EA with type "current_dtype" to know how to convert to every possible "target_dtype", or
  2. the responsibility of EAs of type "target_dtype" to know how to convert whatever types it can to "target_dtype"?

I think you are saying that the design we have supports (1), and I'm suggesting a design corresponding to (2).

Now, the reason that I prefer (2) is that when I construct a Series and provide the dtype as an argument, pandas figures out how to convert the data passed to the Series to the corresponding dtype if it can. That is behavior corresponding to (2). So s=pd.Series([1,2,3], dtype="category") and s=pd.Series([1,0,pd.NA], dtype="boolean") both work, but s=pd.Series([1,2,3], dtype="string") does not.

Another possible design would be to give EAs a property called something like can_convert_anydtype that is either True or False. If True, astype knows it can ask the EA to convert any dtype; if False, it asks the target dtype to do the conversion. So for StringDtype we would set can_convert_anydtype to True, and for other dtypes to False.
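To make the two options concrete, here is a toy sketch in plain Python. Nothing in it is pandas code: the registries, the astype function, and the CAN_CONVERT_ANYDTYPE set are all hypothetical, and it follows just one reading of the flag idea, namely that the flag marks target dtypes that accept any input.

```python
from typing import Any, Callable, Dict, List, Tuple

Converter = Callable[[List[Any]], List[Any]]

# Option (1): conversions owned by the *source* dtype, keyed by (source, target).
SOURCE_CONVERTERS: Dict[Tuple[str, str], Converter] = {
    ("int", "float"): lambda values: [None if v is None else float(v) for v in values],
}

# Option (2): the *target* dtype knows how to build itself from arbitrary values.
TARGET_CONSTRUCTORS: Dict[str, Converter] = {
    "string": lambda values: [None if v is None else str(v) for v in values],
}

# Hypothetical flag: targets that can be built from any source dtype.
CAN_CONVERT_ANYDTYPE = {"string"}


def astype(values: List[Any], source: str, target: str) -> List[Any]:
    """Ask the target first if it accepts anything, else fall back to the source."""
    if target in CAN_CONVERT_ANYDTYPE:
        return TARGET_CONSTRUCTORS[target](values)
    try:
        return SOURCE_CONVERTERS[(source, target)](values)
    except KeyError:
        raise TypeError(f"cannot convert {source!r} to {target!r}") from None


print(astype([1, None, 3], "int", "string"))  # handled by the target
print(astype([1, None, 3], "int", "float"))   # handled by the source
```

With option (1) alone, every source dtype would need its own entry for "string"; with the flag, a single target-side constructor covers all of them.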

TomAugspurger

TomAugspurger commented on Jan 24, 2020

@TomAugspurger
Contributor

We have another issue for an astype dispatch mechanism.

tritemio

tritemio commented on Feb 10, 2020

@tritemio

I want to chime in just to give another use-case from the duplicated issue #31839.

In addition to conversion to "string", converting "string" to "Int8/16/64" when the initial series contains pd.NA is currently quite tricky:

s = pd.Series(['0', pd.NA], dtype='string')
s.astype('object').replace(pd.NA, np.nan).astype('float64').astype('Int8')

It should be possible to do simply s.astype('Int8')

vadella

vadella commented on Feb 12, 2020

@vadella
Author

I want to chime in just to give another use-case from the duplicated issue #31839.

In addition to conversion to "string", converting "string" to "Int8/16/64" when the initial series contains pd.NA is currently quite tricky:

s = pd.Series(['0', pd.NA], dtype='string')
s.astype('object').replace(pd.NA, np.nan).astype('float64').astype('Int8')

It should be possible to do simply s.astype('Int8')

Your solution can give rounding errors when dealing with large integers. (I've been bitten by this when importing production data: the batch numbers were too large to fit exactly in a float64.)

s = pd.Series(["0", pd.NA, str(2 ** 60 + 2)], dtype="string")
s.to_frame().assign(
    a=s.astype("object")
    .replace(pd.NA, np.nan)
    .astype("float64")
    .astype("Int64"),
    b=s.apply(lambda x: int(x) if pd.notnull(x) else x).astype("Int64"),
)
                     0                    a                    b
0                    0                    0                    0
1                 <NA>                 <NA>                 <NA>
2  1152921504606846978  1152921504606846976  1152921504606846978

This explicitly loops over the column, so it is not ideal performance-wise.
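The rounding can be reproduced without pandas. The sketch below uses only built-in Python types and the same value as the example above; the variable names are just for illustration.

```python
big = 2 ** 60 + 2  # same value as in the example above

# float64 has a 53-bit significand, so not every integer above 2**53
# can be represented exactly; the float round trip silently rounds.
via_float = int(float(big))

# Parsing the string directly as an integer keeps every digit.
direct = int(str(big))

print(big)        # 1152921504606846978
print(via_float)  # 1152921504606846976
print(direct)     # 1152921504606846978
```

Any conversion path that passes through float64 will show the same loss for integers of this size.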

tritemio

tritemio commented on Feb 12, 2020

@tritemio

@vadella, thanks for the code example. I had the same problem too and I currently side-step it by loading the data directly in string format. Another reason why it is important that this conversion is handled by pandas internally.

modified the milestones: Contributions Welcome, 1.1 on May 25, 2020

Metadata

Assignees

No one assigned

    Labels

    Bug
    Dtype Conversions: Unexpected or buggy dtype conversions
    ExtensionArray: Extending pandas with custom dtypes or arrays.
    NA - MaskedArrays: Related to pd.NA and nullable extension arrays

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Participants

      @jreback @jorisvandenbossche @TomAugspurger @vadella @tritemio
