
API: make CategoricalIndex._concat consistent with pd.concat #41626

Open
@jbrockmendel

Description

@jbrockmendel
Member

Index._concat (used by Index.append) is a thin wrapper around concat_compat. It is overridden by CategoricalIndex so that CategoricalDtype is retained more often than it is in concat_compat. We should make these match.

If we just rip out CategoricalIndex._concat, we break 6 tests, all of which boil down to:

    def test_append_category_objects(self, ci):
        # with objects
        result = ci.append(Index(["c", "a"]))
        expected = CategoricalIndex(list("aabbcaca"), categories=ci.categories)
>       tm.assert_index_equal(result, expected, exact=True)

If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g. (edited for legibility):

    def test_concat_empty_series_dtype_category_with_array(self):
        # GH#18515
        left = Series(np.array([]), dtype="category")
        right = Series(dtype="float64")
        result = concat([left, right])
>       assert result.dtype == "float64"


    def test_concat_categorical_coercion(self):
        # GH 13524
    
        # category + not-category => not-category
        s1 = Series([1, 2, np.nan], dtype="category")
        s2 = Series([2, 1, 2])
    
        exp = Series([1, 2, np.nan, 2, 1, 2], dtype="object")
>       tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), exp)
E       AssertionError: Attributes of Series are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=[1, 2], ordered=False)
E       [right]: object

Changing concat_compat results in much more convenient behavior, but it is a textbook case of "values-dependent behavior", which in general we want to avoid (cc @jorisvandenbossche).
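
For illustration, the two code paths can be compared side by side. This is a minimal sketch rather than a test from the pandas suite: the `ci` values are an assumption chosen to match the expected result in the first test above, and the printed dtypes may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the `ci` fixture, which is not shown in this issue.
ci = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))

# Index.append goes through CategoricalIndex._concat. Every appended value
# is already one of ci.categories, so the categorical dtype is retained.
appended = ci.append(pd.Index(["c", "a"]))
print(type(appended).__name__, appended.dtype)

# pd.concat goes through concat_compat. Mixing a categorical Series with a
# non-categorical one gives a non-categorical result, even though every
# value of s2 is one of s1's categories.
s1 = pd.Series([1, 2, np.nan], dtype="category")
s2 = pd.Series([2, 1, 2])
concatenated = pd.concat([s1, s2], ignore_index=True)
print(concatenated.dtype)
```

The same kind of input (a categorical plus values that fit its categories) therefore keeps the dtype on the Index path and loses it on the Series path.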

Activity

TomAugspurger commented on Jun 2, 2021

@TomAugspurger
Contributor

I vaguely recall some discussions around changing the default behavior of pd.concat to union categories when provided with multiple CategoricalDtype objects, rather than casting to object. IMO, we should address that first (through a deprecation cycle). IIUC it'd then be easier to make the two consistent.

jbrockmendel commented on Jun 3, 2021

@jbrockmendel
Member, Author

That looks similar, but I think it may be orthogonal. In all of the affected test cases I think we're dealing with one Categorical and one non-Categorical.

added the API - Consistency and Reshaping labels and removed the Needs Triage label on Jun 6, 2021

jreback commented on Jun 24, 2021

@jreback
Contributor

IIUC we should strive to improve concat_compat so that it does better inference, e.g.

"If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g. (edited for legibility)"

is what I would do. I think it is a strict improvement.

jbrockmendel commented on Jun 26, 2021

@jbrockmendel
Member, Author

@jorisvandenbossche want to weigh in here (before i get started on a PR)? one of the options here is value-dependent behavior

jorisvandenbossche commented on Jul 11, 2021

@jorisvandenbossche
Member

I think I would opt for preserving the strict behaviour of Series. It is certainly tempting to make an exception, but having the behavior depend on which numbers are present (e.g. in the last test example) really doesn't sound ideal. The user can always cast to the dtype of the first object for doing the concat.
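
To make the concern concrete, here is a rough sketch of what a value-dependent rule would look like. `value_dependent_concat` is a hypothetical function written only for this illustration; it is not pandas behavior and not a proposed implementation.

```python
import pandas as pd


def value_dependent_concat(cat_ser, other):
    """Hypothetical rule: keep the categorical dtype only when every value
    of `other` is missing or already one of `cat_ser`'s categories."""
    fits = other.isna() | other.isin(cat_ser.cat.categories)
    if fits.all():
        # Lossless cast, so the result stays categorical.
        other = other.astype(cat_ser.dtype)
    return pd.concat([cat_ser, other], ignore_index=True)


s1 = pd.Series([1, 2], dtype="category")

# Same input dtypes in both calls; only the values of the second Series differ.
kept = value_dependent_concat(s1, pd.Series([2, 1, 2]))
lost = value_dependent_concat(s1, pd.Series([2, 1, 3]))

print(kept.dtype)  # categorical, because 2, 1, 2 are all categories of s1
print(lost.dtype)  # not categorical, because 3 is not a category of s1
```

Both calls pass an int64 Series as the second argument, yet the result dtype changes with a single value, which is the kind of behavior that is hard to reason about from dtypes alone.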

(the case of concatting with an empty other Series is something that could be addressed separately, IMO, eg by having a "null" dtype for empty Series)

Other idea: if we find it onerous for the user to cast all arguments passed to concat/append themselves to ensure consistent dtypes, we could also add a keyword argument to concat/append that would do that for you. But this would then be a more general solution (for all dtypes), instead of adding a special case only for categorical dtype.
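
As a rough sketch of that idea, the cast can be wrapped in a small helper. `concat_as_first_dtype` is a hypothetical name used only for this illustration; no such function or keyword exists in pandas.

```python
import pandas as pd


def concat_as_first_dtype(objs, **kwargs):
    """Hypothetical helper: cast every Series to the dtype of the first one,
    then defer to pd.concat. Works for any dtype, not only categorical."""
    objs = list(objs)
    target = objs[0].dtype
    return pd.concat([obj.astype(target) for obj in objs], **kwargs)


s1 = pd.Series([1, 2], dtype="category")
s2 = pd.Series([2, 1, 2])

# All inputs share s1's dtype after the cast, so the result keeps it.
result = concat_as_first_dtype([s1, s2], ignore_index=True)
print(result.dtype)

# The cast is explicit and can be lossy: 3 is not a category of s1,
# so it becomes missing after astype.
lossy = concat_as_first_dtype([s1, pd.Series([3])], ignore_index=True)
print(lossy.isna().tolist())
```

The result dtype here depends only on the dtype of the first argument, never on the values, and any loss of information comes from a cast the caller asked for.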

jbrockmendel commented on Apr 15, 2022

@jbrockmendel
Member, Author

Metadata

Assignees: No one assigned
Labels: API - Consistency (Internal Consistency of API/Behavior), Categorical (Categorical Data Type), Needs Discussion (Requires discussion from core team before further action), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Participants: @jreback, @jorisvandenbossche, @TomAugspurger, @jbrockmendel, @mroeschke