Skip to content

Categorical Arithmetic/Comparison Rules #19513

Closed
@jbrockmendel

Description

@jbrockmendel
Member

Series[not-categorical] > CategoricalIndex is inconsistent with the reversed operation. Which one is canonical?

ser = pd.Series([1, 2, 3])
idx = pd.CategoricalIndex(['A', 'B', 'A'])

>>> ser > idx
0    False
1    False
2    False

>>> idx < ser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/indexes/category.py", line 752, in _evaluate_compare
    return getattr(self.values, opname)(other)
  File "pandas/core/arrays/categorical.py", line 56, in f
    raise TypeError("Unordered Categoricals can only compare "
TypeError: Unordered Categoricals can only compare equality or not

>>> ser > idx.values
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/ops.py", line 819, in wrapper
    .format(op=op, typ=self.dtype))
TypeError: Cannot compare a Categorical for op <built-in function gt> with Series of dtype int64.
If you want to compare values, use 'series <op> np.asarray(other)'.

I'm guessing the right thing to do is to a) have Series[categorical].__op__ wrap CategoricalIndex.__op__, and b) have Series[non-categorical] to dispatch to reversed-op for is_categorical_dtype(other), want to confirm before making a PR.

Activity

jbrockmendel

jbrockmendel commented on Feb 3, 2018

@jbrockmendel
MemberAuthor

A case derived from tests.categorical.test_operators.TestCategoricalOps.test_comparisons, where it has the comment

# categorical cannot be compared to Series or numpy array, and also
# not the other way around
cat = Series(Categorical([1, 2, 3], ordered=True))
s = Series([2, 2, 2])

>>> s < cat
Traceback (most recent call last):
[...]
TypeError: Cannot compare a Categorical for op __lt__ with type <type 'numpy.ndarray'>.
If you want to compare values, use 'np.asarray(cat) <op> other'.

>>> s < pd.CategoricalIndex(cat)
0    False
1    False
2     True
dtype: bool
jreback

jreback commented on Feb 6, 2018

@jreback
Contributor

there are some issues about this already. see if you can find them.

jbrockmendel

jbrockmendel commented on Feb 6, 2018

@jbrockmendel
MemberAuthor

Closest issues with categorical label are #18050, #8995, neither of which seem to differentiate between the behavior of Categorical.__op__ vs CategoricalIndex.__op__ vs Series[category].__op__

@TomAugspurger any thoughts on the One True Implementation for these operations?

TomAugspurger

TomAugspurger commented on Feb 6, 2018

@TomAugspurger
Contributor

No strong thoughts, but I suppose a decent heuristic is "what would the result be if NumPy had a categorical dtype?" In that case, let's just do the same as Series[int].op(Index[int]) and Series[int].op(ndarray[int]), etc. So

self other return
series ndarray series
series index series
index series ndarray / extension array
index series ndarray / extension array
ndarray series series
ndarray index ndarray / extension array

Does that make sense?

jbrockmendel

jbrockmendel commented on Feb 6, 2018

@jbrockmendel
MemberAuthor

I should clarify. The question isn't what the return type should be, but which operations are allowed at all.

The easy (i.e. pretty obviously wrong ATM) part is cases where x < y is allowed but y > x raises:

cat = pd.Categorical([1, 2, 3], ordered=True)
idx = pd.CategoricalIndex(cat)
ser = pd.Series(cat)

>>> cat < ser   # raises TypeError
>>> ser > cat   # returns Series
>>> cat < idx   # raises TypeError
>>> idx > cat   # returns array
>>> idx < ser   # raises TypeError
>>> ser > idx   # raises TypeError

The less obvious part is what casting rules are allowed. ATM CategoricalIndex.__op__ is much more forgiving about input dtypes and will cast other before dispatching to Categorical.__op__.

>>> ser < ser.astype('i8')   # raises TypeError
>>> idx < idx.astype('i8')   # returns array
>>> cat < cat.astype('i8')   # raises TypeError

>>> ser == ser.astype('i8')  # returns Series
>>> idx == idx.astype('i8')   # returns array
>>> cat == cat.astype('i8')  # returns array

The goal here is to have one implementation of each of these methods (ideally in Categorical) and then have the other classes wrap that instead of re-implementing and risking having the logic diverge.

jreback

jreback commented on Feb 10, 2018

@jreback
Contributor

Categorical is prob a bit strict about these checks ATM. Categorial and CI should behave exactly the same.

These should all (Series, Cat, CI) should all respect ordered (IOW they have to be the same). for comparison purposes.

@jschendel

jbrockmendel

jbrockmendel commented on Feb 20, 2018

@jbrockmendel
MemberAuthor

It looks like Series.__cmp__(CategoricalIndex) effectively calls other = np.asarray(other) on the CategoricalIndex, which I expect isn't the desired behavior.

jbrockmendel

jbrockmendel commented on Mar 10, 2018

@jbrockmendel
MemberAuthor

One more:

What should (DatetimeIndex|TimedeltaIndex|PeriodIndex).__(add|sub)__(Categorical|CategoricalIndex) do? I think right now they return NotImplemented which seems reasonable, but they do it in a catch-all block instead of intentionally and I don't think the case is tested.

@jschendel are you the appropriate person to ping on this?

jschendel

jschendel commented on Mar 10, 2018

@jschendel
Member

@jschendel are you the appropriate person to ping on this?

I don't think I have any type of definitive say on this; mostly worked on categorical vs. categorical comparisons.

I agree that this probably shouldn't be supported. I suppose you could perform the operation on the underlying category values, but it seems like you'd run into ambiguous corner cases pretty quickly. For example, would you actually want to allow add/sub if the categories were ordered with a non-standard ordering, e.g. Timestamp('2017-01-01') > Timestamp('2018-01-01') ? Not sure why anyone would ever do that, but could conceivably happen.

Also doesn't look like add/sub at the scalar level is implemented for Timestamp|Timedelta|Period vs. Categorical|CategoricalIndex, which I would expect to be implemented if we were to support the same operations on the index equivalent.

14 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @jreback@TomAugspurger@jschendel@jbrockmendel@mroeschke

        Issue actions

          Categorical Arithmetic/Comparison Rules · Issue #19513 · pandas-dev/pandas