Skip to content

Equality between DataFrames misbehaves if columns contain NaN #18455

Closed
@toobaz

Description

@toobaz
Member

Code Sample, a copy-pastable example if possible

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2,],
   ...:                  columns=[3, np.nan, 1])
   ...: 

In [3]: s + s # good
Out[3]: 
       3.0  NaN    1.0
 1.0    -2    -2    -2
NaN     -2    -2    -2
 2.0    -2    -2    -2

In [4]: s == s # bad
Out[4]: 
       3.0 NaN    1.0
 1.0  True  NaN  True
NaN   True  NaN  True
 2.0  True  NaN  True

Problem description

While it is true that np.nan != np.nan, pandas disregards this in indexes (indeed, s.loc[:, np.nan] works), so it should be coherent.

Expected Output

In [4]: s == s
Out[4]: 
       3.0  NaN   1.0
 1.0  True  True  True
NaN   True  True  True
 2.0  True  True  True

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b45325e
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+201.gb45325e28.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

Activity

toobaz

toobaz commented on Nov 23, 2017

@toobaz
MemberAuthor

By the way: it works with MultiIndex columns:

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2,],
   ...:                  columns=pd.MultiIndex.from_product([[2, np.nan], [3, np.nan]]))
   ...:                  

In [3]: s == s
Out[3]: 
         2         NaN      
         3   NaN     3   NaN
 1.0  True  True  True  True
NaN   True  True  True  True
 2.0  True  True  True  True
toobaz

toobaz commented on Nov 23, 2017

@toobaz
MemberAuthor

I spoke too soon:

In [3]: s == s.copy()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-d24aa33f67ac> in <module>()
----> 1 s == s.copy()

/home/nobackup/repo/pandas/pandas/core/ops.py in f(self, other)
   1324     def f(self, other):
   1325         if isinstance(other, ABCDataFrame):  # Another DataFrame
-> 1326             return self._compare_frame(other, func, str_rep)
   1327         elif isinstance(other, ABCSeries):
   1328             return self._combine_series_infer(other, func, try_cast=False)

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare_frame(self, other, func, str_rep, try_cast)
   4008                              'DataFrame objects')
   4009         return self._compare_frame_evaluate(other, func, str_rep,
-> 4010                                             try_cast=try_cast)
   4011 
   4012     def _flex_compare_frame(self, other, func, str_rep, level, try_cast=True):

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare_frame_evaluate(self, other, func, str_rep, try_cast)
   3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
-> 3989             new_data = expressions.evaluate(_compare, str_rep, self, other)
   3990             return self._constructor(data=new_data, index=self.index,
   3991                                      columns=self.columns, copy=False)

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    202     use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)
    203     if use_numexpr:
--> 204         return _evaluate(op, op_str, a, b, **eval_kwargs)
    205     return _evaluate_standard(op, op_str, a, b)
    206 

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    117 
    118     if result is None:
--> 119         result = _evaluate_standard(op, op_str, a, b)
    120 
    121     return result

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     62         _store_test_result(False)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 
     66 

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare(a, b)
   3985 
   3986             def _compare(a, b):
-> 3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
   3989             new_data = expressions.evaluate(_compare, str_rep, self, other)

/home/nobackup/repo/pandas/pandas/core/frame.py in <genexpr>(.0)
   3985 
   3986             def _compare(a, b):
-> 3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
   3989             new_data = expressions.evaluate(_compare, str_rep, self, other)

/home/nobackup/repo/pandas/pandas/core/frame.py in __getitem__(self, key)
   2130             return self._getitem_frame(key)
   2131         elif is_mi_columns:
-> 2132             return self._getitem_multilevel(key)
   2133         else:
   2134             return self._getitem_column(key)

/home/nobackup/repo/pandas/pandas/core/frame.py in _getitem_multilevel(self, key)
   2174 
   2175     def _getitem_multilevel(self, key):
-> 2176         loc = self.columns.get_loc(key)
   2177         if isinstance(loc, (slice, Series, np.ndarray, Index)):
   2178             new_columns = self.columns[loc]

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in get_loc(self, key, method)
   2119             key = _values_from_object(key)
   2120             key = tuple(map(_maybe_str_to_time_stamp, key, self.levels))
-> 2121             return self._engine.get_loc(key)
   2122 
   2123         # -- partial selection or non-unique index

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.MultiIndexObjectEngine.get_loc (pandas/_libs/index.c:14965)()
    616         return super(MultiIndexObjectEngine, self).get_indexer(values)
    617 
--> 618     cpdef get_loc(self, object val):
    619 
    620         # convert a MI to an ndarray

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.MultiIndexObjectEngine.get_loc (pandas/_libs/index.c:14886)()
    621         if hasattr(val, 'values'):
    622             val = val.values
--> 623         return super(MultiIndexObjectEngine, self).get_loc(val)
    624 
    625 

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5832)()
    137             util.set_value_at(arr, loc, value)
    138 
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):
    141             raise TypeError("'{val}' is an invalid key".format(val=val))

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5678)()
    159 
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):
    163             raise KeyError(val)

/home/nobackup/repo/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:21018)()
   1263                                        sizeof(uint32_t)) # flags
   1264 
-> 1265     cpdef get_item(self, object val):
   1266         cdef khiter_t k
   1267         if val != val or val is None:

/home/nobackup/repo/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20972)()
   1271             return self.table.vals[k]
   1272         else:
-> 1273             raise KeyError(val)
   1274 
   1275     cpdef set_item(self, object key, Py_ssize_t val):

KeyError: (2.0, nan)

jreback

jreback commented on Nov 24, 2017

@jreback
Contributor

hmm this should work, we already use array_equivalent to compare (inside Index.equals).

In [1]: from pandas.core.dtypes.missing import array_equivalent

In [2]: i = pd.Index([1.0, np.nan, 3])

In [3]: i
Out[3]: Float64Index([1.0, nan, 3.0], dtype='float64')

In [4]: array_equivalent(i.values, i.values)
Out[4]: True
added this to the 0.22.0 milestone on Nov 24, 2017
toobaz

toobaz commented on Nov 24, 2017

@toobaz
MemberAuthor

More general question: is it a desired feature or a limitation that equality works only on objects with similarly ordered index/columns? E.g. compare Series(index=[1,2]) + Series(index=[2,1]) (works) with Series(index=[1,2]) == Series(index=[2,1]) (ValueError): the latter could in principle get an indexer, find out the index actually contains the same elements, and hence compare values (clearly at a cost, which however could be easily avoided in those cases in which the index is indeed the same - that is, the change wouldn't hinder performance for current correct use).

jreback

jreback commented on Nov 24, 2017

@jreback
Contributor

see long discussion here: #1134

toobaz

toobaz commented on Nov 24, 2017

@toobaz
MemberAuthor

see long discussion here: #1134

Interesting, but my understanding is that it does not consider the specific issue of having the same labels but in a different order. I understand the reason not to support comparison between different indexes is to avoid NaNs (or dropping elements/rows). What I suggest instead is just to check if labels are equal after sorting.

jreback

jreback commented on Nov 25, 2017

@jreback
Contributor

how could different orderings be considered equal?

toobaz

toobaz commented on Nov 25, 2017

@toobaz
MemberAuthor

how could different orderings be considered equal?

My idea would be something like

if (self.index == oth.index).all():
    # compare self and oth, return result
if len(self.index.intersection(oth.index)) == len(self.index):
    # compare self and oth.loc[self.index], return result
raise ValueError
jreback

jreback commented on Nov 25, 2017

@jreback
Contributor

i am asking why you think this is a good idea to ignore ordering in an ordered array

14 remaining items

jorisvandenbossche

jorisvandenbossche commented on Jan 16, 2018

@jorisvandenbossche
Member

By the way, I'm not at all against having equality in your category (1) (dropping the exception I just described), and inserting NaNs... I didn't propose it just because the change would be bigger and casting bools to objects is sad

I am not sure this was the reason. Because if comparison operations would align, you would 1) align introducing NaNs in the values and 2) compare and where there are NaNs you just get False (just as you would now get with already aligned objects that contains NaNs).
So even if comparisons do alignment you can still get a normal functioning boolean result.

I think one of the reasons to not let the comparisons align was 1) make series behaviour consistent with dataframe (but of course, we could also have changed the dataframe behaviour to align as well) and 2) people liked the error as a sanity check (as often, when doing a comparison you want to use it for boolean indexing, and then if you get alignment, that might give unexpected results). One example use case that Wes gave: s1[1:] == s2[:1].

toobaz

toobaz commented on Jan 16, 2018

@toobaz
MemberAuthor

So even if comparisons do alignment you can still get a normal functioning boolean result.

Good point: comparison of NaNs is well defined.

make series behaviour consistent with dataframe (but of course, we could also have changed the dataframe behaviour to align as well)

Exactly

people liked the error as a sanity check

True. My idea of introducing NaNs would have provided this sanity check... but it's just too inconsistent. And while I would rather not have this sanity check, changing it now would be too disruptive.

I still think we could just allow for different order of indexes, in unique indexes with same elements, not to matter.

added 7 commits that reference this issue on Feb 4, 2018
e85b207
6f9b502
9a65e8a
07c9ba5
0ddd89e
6397859
ce8f000
added 4 commits that reference this issue on Mar 31, 2018
0df64d2
bcd6328
fe556d6
92fd37f
added a commit that references this issue on Apr 1, 2018
4efb39f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Compatpandas objects compatability with Numpy or Python functionsDtype ConversionsUnexpected or buggy dtype conversions

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @jreback@jorisvandenbossche@shoyer@toobaz

      Issue actions

        Equality between DataFrames misbehaves if columns contain NaN · Issue #18455 · pandas-dev/pandas