Open
Description
Index._concat (used by Index.append) is a thin wrapper around concat_compat. It is overridden by CategoricalIndex so that CategoricalDtype is retained more often than it is in concat_compat. We should make these match.
If we just rip out CategoricalIndex._concat, we break 6 tests, all of which boil down to:
def test_append_category_objects(self, ci):
    # with objects
    result = ci.append(Index(["c", "a"]))
    expected = CategoricalIndex(list("aabbcaca"), categories=ci.categories)
>   tm.assert_index_equal(result, expected, exact=True)
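A minimal sketch of the call this test exercises, for anyone who wants to poke at it locally. The categories passed to CategoricalIndex here are an assumption made for illustration; the real `ci` fixture may be built differently. The resulting dtype depends on which of the two code paths is in effect, so the sketch prints it rather than asserting it.

```python
import pandas as pd

# Assumed stand-in for the `ci` fixture; the category order is arbitrary.
ci = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))
other = pd.Index(["c", "a"])  # plain object-dtype Index

result = ci.append(other)

# The values are the same under either code path. What differs is whether
# the result keeps CategoricalDtype or falls back to object dtype.
print(list(result))
print(result.dtype)
```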
If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g. (edited for legibility):
def test_concat_empty_series_dtype_category_with_array(self):
    # GH#18515
    left = Series(np.array([]), dtype="category")
    right = Series(dtype="float64")
    result = concat([left, right])
>   assert result.dtype == "float64"

def test_concat_categorical_coercion(self):
    # GH 13524
    # category + not-category => not-category
    s1 = Series([1, 2, np.nan], dtype="category")
    s2 = Series([2, 1, 2])
    exp = Series([1, 2, np.nan, 2, 1, 2], dtype="object")
>   tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), exp)
E   AssertionError: Attributes of Series are different
E
E   Attribute "dtype" are different
E   [left]:  CategoricalDtype(categories=[1, 2], ordered=False)
E   [right]: object
Changing concat_compat results in much more convenient behavior, but it is textbook "values-dependent behavior" that in general we want to avoid (cc @jorisvandenbossche)
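To make the "values-dependent" concern concrete, here is a rough sketch of the kind of check a lossless-cast rule would imply. The helper `fits_categories` is hypothetical and is not how concat_compat is implemented; it only illustrates that the answer, and therefore the result dtype, would change with the data rather than with the input dtypes.

```python
import pandas as pd


def fits_categories(values, cat_dtype):
    """Hypothetical helper: True if every non-null value is an existing category."""
    ser = pd.Series(values)
    non_null = ser[ser.notna()]
    return bool(non_null.isin(cat_dtype.categories).all())


cat_dtype = pd.CategoricalDtype([1, 2])

# Same input dtype (int64) both times, but a different answer depending on
# which numbers happen to be present.
same_dtype_fits = fits_categories([2, 1, 2], cat_dtype)
same_dtype_does_not_fit = fits_categories([2, 1, 3], cat_dtype)

print(same_dtype_fits, same_dtype_does_not_fit)
```

Under such a rule, two int64 Series would concatenate to different dtypes purely because one contains a 3.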
Activity
TomAugspurger commented on Jun 2, 2021
I vaguely recall some discussions around changing the default behavior of pd.concat to union categories when provided with multiple CategoricalDtype objects, rather than casting to object. IMO, we should address that first (through a deprecation cycle). IIUC it'd then be easier to make the two consistent.
jbrockmendel commented on Jun 3, 2021
that looks similar but i think may be orthogonal. in all of the affected test cases i think we're dealing with one Categorical and one non-Categorical
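For context on the category-union idea raised earlier in the thread: pandas already offers that behavior as an explicit opt-in through pandas.api.types.union_categoricals. The sketch below only shows that existing helper on two Categoricals with different categories; it says nothing about what the pd.concat default should be, and it does not cover the Categorical plus non-Categorical case discussed here.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["a", "b"])
b = pd.Categorical(["b", "c"])

# Explicitly combine the two, unioning their categories instead of
# falling back to object dtype.
combined = union_categoricals([a, b])

print(combined.categories.tolist())
print(list(combined))
```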
jreback commented on Jun 24, 2021
IIUC we should strive to improve concat_compat to make this do better inference, e.g. is what would do. I think is a strict improvement.
jbrockmendel commented on Jun 26, 2021
@jorisvandenbossche want to weigh in here (before i get started on a PR)? one of the options here is value-dependent behavior
jorisvandenbossche commented on Jul 11, 2021
I think I would opt for preserving the strict behaviour of Series. Although it is certainly tempting to make an exception. But having the behavior depend on which numbers are present (eg in the last test example) really doesn't sound ideal. The user can always cast to the dtype of the first object for doing the concat.
(the case of concatting with an empty other Series is something that could be addressed separately, IMO, eg by having a "null" dtype for empty Series)
Other idea: if we find it onerous for the user to cast all arguments passed to concat/append themselves to ensure consistent dtypes, we could also add a keyword argument to concat/append that would do that for you. But this would then be a more general solution (for all dtypes), instead of adding a special case only for categorical dtype.
jbrockmendel commented on Apr 15, 2022
Possibly related: #12509, #14016, #15332, #24093, #24845, #25019, #37480, #44099, #42840
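The cast-before-concat workaround suggested earlier in the thread can be sketched in user code today. The function name `concat_cast_to_first` is hypothetical and is not a pandas API or a proposed keyword; it just casts every input to the dtype of the first one before calling pd.concat. Note that casting to a CategoricalDtype turns values outside the categories into NaN, so this is only lossless when the other inputs fit.

```python
import numpy as np
import pandas as pd


def concat_cast_to_first(objs, **kwargs):
    """Hypothetical helper: cast each Series to the first one's dtype, then concat."""
    target = objs[0].dtype
    return pd.concat([obj.astype(target) for obj in objs], **kwargs)


s1 = pd.Series([1, 2, np.nan], dtype="category")
s2 = pd.Series([2, 1, 2])

# All values in s2 are existing categories of s1, so nothing is lost here.
result = concat_cast_to_first([s1, s2], ignore_index=True)

print(result.dtype)
```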