Skip to content

NaN values impact binary or operations asymmetrically #6528

Closed
@behzadnouri

Description

@behzadnouri
Contributor

I am not sure if this is on purpose, but as far as I know, or operation should be symmetric ( I mean a | b == b | a ); not in pandas:

>>> a = pd.Series( [True] )
>>> b = pd.Series( [np.nan] )
>>> a | b
0    True
dtype: bool
>>> b | a
0    False
dtype: bool

is there any reason pandas is treating NaN values asymmetrically in here?

also, this is a weird outcome as well:

>>> ( b == True ) | ( a == True )
0    True
dtype: bool

and this:

 >>> b.astype( bool )
 0    True
 dtype: bool

this even has internal inconsistency; so before in b | a, NaN was treated as False, but now here it is True;

Activity

jreback

jreback commented on Mar 3, 2014

@jreback
Contributor

this could be a bug, but you boolean comparison of True == np.nan is very odd in the first place
should the result be np.nan because its missing, False because True is equivalent of np.nan?

See the long and compliacted tests here:
https://github.com/pydata/pandas/blob/master/pandas/tests/test_series.py#L3142

if you see a bug, pls submit a PR

will mark it as a bug, but come back to us with your thoughts

added this to the 0.14.0 milestone on Mar 3, 2014
behzadnouri

behzadnouri commented on Mar 3, 2014

@behzadnouri
ContributorAuthor

@jreback there are 3 issues above:

  1. a | b is not equal to b | a;
  2. b.astype( bool ) converts NaN values to True; this is worst of 3 possible values, namely NaN, False, True; and also not consistent with [3] below
  3. b == True is not consistent with [2] in above;

this is really bad; because outcome of | should not depend on order; especially when NaN values are implicitly generated:

>>> ts = pd.Series( [True, False] )
>>> ts | ts.shift( 1 )
0    True
1    True
dtype: bool
>>> ts.shift( 1 ) | ts
0    False
1     True
dtype: bool
jreback

jreback commented on Mar 3, 2014

@jreback
Contributor

@behzadnouri ok.....pls make a PR to fix! Keep in mind their may be some reasons buried for this; e.g. put in tests for the 'correct' behavior and see what breaks.

  1. symmetry, I don't think it tested
  2. This is filling behavior, I am not really sure what to do here; the astype actually should fail IMHO. it is really unclear what to do with np.nan, you normally b.fillna(False).astype(bool).
  3. test as well
jreback

jreback commented on Mar 3, 2014

@jreback
Contributor

@hayd pls chime in here when you can....IIRC you and I discussed these types of things at length

hayd

hayd commented on Mar 3, 2014

@hayd
Contributor

some discussion was here: #4953.

A lot of this is as we're fighting "strange" truthiness for numpy's and python's (!) nan, so things like astype we can't/shouldn't fix... kinda a "gotcha".

IMO asymmetry is a bug but only because we're forcing a boolean return value (the reason this happens is cos in numpy and in general a|b is not symmetric). Now I think about it, I do remember this coming up...

modified the milestones: 0.15.0, 0.14.0 on Apr 9, 2014
seth-p

seth-p commented on Sep 28, 2014

@seth-p
Contributor

My 2c: I would separate this out to two issues:

Issue 1: Any boolean operations should behave the same as if .astype(bool) were first invoked on the structure(s). So, for example, a | b should return the same as a.astype(bool) | b.astype(bool), and b == True should return the same as b.astype(bool) == True. If this is the case, this should address the commutativity problem (as I'm assuming the boolean operations on boolean structures are commutative).

Issue 2: Should .astype(bool) convert a NaN to True or False? Currently it converts it to True, which is consistent with the fact that bool(NaN) is True, as defined by numpy. Personally, I would have defined bool(NaN) as False, and this may be a situation where we want the Pandas behavior (of .astype(bool)) to differ from numpy's. But this should be considered an API change, not a bug fix. I also just raised this issue on https://groups.google.com/forum/#!topic/pydata/U7gQyI_gLMc.

At any rate, I view these as two separate (but obviously related) issues, which I think are being slightly conflated in this issue (and #8151). I would suggest first addressing Issue 1, which seems to me clearly to be a bug -- keeping the existing behavior of treating NaN as True (so in the first example above, a | b, b | a, and b == True would all return Series([True])); and then separately, after further discussion and reflection, consider changing the behavior of .astype(bool) (and related boolean operations) to convert NaN to False.

behzadnouri

behzadnouri commented on Sep 28, 2014

@behzadnouri
ContributorAuthor

@seth-p

Any boolean operations should behave the same as if .astype(bool) were first invoked on the structure(s)

as long as the logical operations are bound to return dtype='bool'. that is exactly what i am testing here

and b == True should return the same as b.astype(bool) == True

that would not be achievable. for example, 2 != True and 2 != False, so you cannot type cast a series unless it is only zeros and ones.

this may be a situation where we want the Pandas behavior (of .astype(bool)) to differ from numpy's

strongly discourage that, because:

-1- it is not only numpy, but as you also mentioned, and much more importantly that is how python works:

>>> np.where([np.nan], True, False)
array([ True], dtype=bool)
>>> True if np.nan else False
True

so, it can easily break things as simple as ts.map(bool) == ts.astype(bool).

-2- the gain of breaking that consistency, if any, is less than minimal. because, if a user wants to deviate from python's convention that is more than easy: ts.fillna(False), or ts.fillna(0).

seth-p

seth-p commented on Sep 28, 2014

@seth-p
Contributor

D'oh! @behzadnouri, of course you're right about == and !=. I guess I should have said that any logical operations that expect booleans should produce the same results as for .astype(bool).

Re: .astype(bool) converting NaN to True, fair enough. And you're right, of course, that what I'm suggesting is simply the same as .fillna(False).astype(bool).

changed the title [-]`NaN` values impact binary or operations asymmetrically[/-] [+]NaN values impact binary or operations asymmetrically[/+] on Sep 30, 2014

1 remaining item

modified the milestones: Next Major Release, 0.16.0 on Mar 6, 2015
h-vetinari

h-vetinari commented on Dec 6, 2017

@h-vetinari
Contributor

Tried to find the most current open issue on this - the others are closed or merged, but this is open even though being milestoned since 0.14?

I asked a question about this on stackexchange, and someone took the time to identify the apparent source of the issue (as well as a work-around): https://stackoverflow.com/a/47664356/2965879

The main issue seems that in pandas.core.ops.wrapper (https://github.com/pandas-dev/pandas/blob/master/pandas/core/ops.py#L928), after aligning the series, only other is filled (transforming NaN to False), but self isn't, which introduces this symmetry break.

TomAugspurger

TomAugspurger commented on Dec 2, 2019

@TomAugspurger
Contributor

I don't expect the behavior of np.nan in boolean operations to change, but #29842 is implementing a new BooleanArray extension type that I think addressees most things. Let us know if there are additional problems in new issues.

modified the milestones: Contributions Welcome, 1.0 on Dec 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugDtype ConversionsUnexpected or buggy dtype conversionsNumeric OperationsArithmetic, Comparison, and Logical operations

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @jreback@behzadnouri@TomAugspurger@hayd@seth-p

      Issue actions

        NaN values impact binary or operations asymmetrically · Issue #6528 · pandas-dev/pandas