TestCategoricalIndex.test_reindexing failing on Travis / Mac build #17323

Closed
@jorisvandenbossche

Description

@jorisvandenbossche
Member

https://travis-ci.org/pandas-dev/pandas/jobs/267760269

Master and some PRs (strangely, not all) are failing on the same test, but only on Mac:

=================================== FAILURES ===================================
_____________________ TestCategoricalIndex.test_reindexing _____________________
[gw1] darwin -- Python 3.5.4 /Users/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.indexes.test_category.TestCategoricalIndex object at 0x111d5e860>
    def test_reindexing(self):
    
        ci = self.create_index()
        oidx = Index(np.array(ci))
    
        for n in [1, 2, 5, len(ci)]:
            finder = oidx[np.random.randint(0, len(ci), size=n)]
            expected = oidx.get_indexer_non_unique(finder)[0]
    
            actual = ci.get_indexer(finder)
>           tm.assert_numpy_array_equal(expected, actual)
pandas/tests/indexes/test_category.py:389: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/util/testing.py:1174: in assert_numpy_array_equal
    _raise(left, right, err_msg)
pandas/util/testing.py:1157: in _raise
    .format(obj=obj), left.shape, right.shape)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
obj = 'numpy array', message = 'numpy array shapes are different', left = (14,)
right = (6,), diff = None
    def raise_assert_detail(obj, message, left, right, diff=None):
        if isinstance(left, np.ndarray):
            left = pprint_thing(left)
        if isinstance(right, np.ndarray):
            right = pprint_thing(right)
    
        msg = """{obj} are different
    
    {message}
    [left]:  {left}
    [right]: {right}""".format(obj=obj, message=message, left=left, right=right)
    
        if diff is not None:
            msg += "\n[diff]: {diff}".format(diff=diff)
    
>       raise AssertionError(msg)
E       AssertionError: numpy array are different
E       
E       numpy array shapes are different
E       [left]:  (14,)
E       [right]: (6,)
pandas/util/testing.py:1105: AssertionError

Activity

added the Testing and Unreliable Test labels on Aug 24, 2017
gfyoung (Member) commented on Aug 24, 2017

Marking as unreliable given that only some PRs are failing on this test.

gfyoung (Member) commented on Aug 24, 2017

I think the np.random.randint call is the problem.

jorisvandenbossche (Member, Author) commented on Aug 24, 2017

But whatever result of randint I can think of, oidx.get_indexer_non_unique(finder)[0] and ci.get_indexer(finder) should still behave the same.

gfyoung (Member) commented on Aug 24, 2017

I suppose, though I don't know how else you get such a difference of 14 and 6...

added this to the 0.21.0 milestone on Aug 24, 2017
TomAugspurger (Contributor) commented on Aug 24, 2017

FYI, pytest-repeat is helpful for debugging these.

pytest pandas/tests/indexes/test_category.py::TestCategoricalIndex::test_reindexing --count=1000 --pdb
=================================================================== test session starts ====================================================================
platform darwin -- Python 3.6.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas, inifile: setup.cfg
plugins: xdist-1.15.0, rerunfailures-2.2, repeat-0.4.1, cov-2.4.0
collected 1000 items

pandas/tests/indexes/test_category.py ...............................................................................................................................F

jorisvandenbossche (Member, Author) commented on Aug 24, 2017

Indeed, good tip :-)

So this happens when the random generation results in exactly the same index as the original one:

(Pdb) p n
6
(Pdb) finder
Index(['a', 'a', 'b', 'b', 'c', 'a'], dtype='object')
(Pdb) oidx
Index(['a', 'a', 'b', 'b', 'c', 'a'], dtype='object')
(Pdb) expected
array([0, 1, 5, 0, 1, 5, 2, 3, 2, 3, 4, 0, 1, 5])
(Pdb) ci
CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category')
(Pdb) ci.get_indexer(finder)
array([0, 1, 2, 3, 4, 5])

But I would say this is a bug in get_indexer rather than in the test?
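
The two lengths in the assertion error (14 versus 6) follow directly from the label counts. A minimal sketch of that arithmetic in plain Python, assuming only the label values shown in the pdb session above (the variable names are illustrative, not pandas internals):

```python
from collections import Counter

index_labels = list("aabbca")   # values of oidx / ci in the pdb session
target_labels = list("aabbca")  # finder, which happened to be identical

counts = Counter(index_labels)  # a: 3, b: 2, c: 1

# The non-unique lookup expands each target label to every position
# where that label occurs in the index.
expanded_length = sum(counts[label] for label in target_labels)

# The fast path returns one position per index element.
fast_path_length = len(index_labels)

print(expanded_length)   # 14
print(fast_path_length)  # 6
```

Four 'a' lookups... rather, three 'a' lookups contribute 3 positions each, two 'b' lookups contribute 2 each, and the single 'c' lookup contributes 1, which gives 9 + 4 + 1 = 14.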

(It is still a bit strange that it turned up in so many builds now, and only on Mac, since this failed in only 3 out of 1000 runs for me.)
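
A rough estimate of how often the random draw reproduces the original index, as a sketch that assumes the six positions are drawn independently and uniformly (which is what np.random.randint(0, len(ci), size=n) does) and that only the n = len(ci) iteration can trigger the fast path:

```python
from collections import Counter
from fractions import Fraction

labels = list("aabbca")
counts = Counter(labels)
n = len(labels)

# Position i reproduces its original label whenever the draw lands on
# any of the positions that hold the same label.
p_same = Fraction(1)
for label in labels:
    p_same *= Fraction(counts[label], n)

print(p_same)                 # 1/432
print(float(p_same) * 1000)   # roughly 2.3 expected failures per 1000 runs
```

This is in line with the 3 failures in 1000 runs, but it does not explain why the failure showed up only on the Mac builds.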

jorisvandenbossche (Member, Author) commented on Aug 24, 2017

So, a reproducible example:

In [4]: ci = pd.CategoricalIndex(list('aabbca'), categories=list('cab'))

In [5]: ci.get_indexer(['a', 'b', 'c'])
Out[5]: array([0, 1, 5, 2, 3, 4])

In [6]: ci.get_indexer(['a', 'a', 'b', 'b', 'c', 'a'])
Out[6]: array([0, 1, 2, 3, 4, 5])

In [7]: ci.get_indexer(['a', 'a', 'b', 'b', 'c', 'a', 'c'])
Out[7]: array([0, 1, 5, 0, 1, 5, 2, 3, 2, 3, 4, 0, 1, 5, 4])

In [8]: ci.get_indexer_non_unique(['a', 'a', 'b', 'b', 'c', 'a'])[0]
Out[8]: array([0, 1, 5, 0, 1, 5, 2, 3, 2, 3, 4, 0, 1, 5])

Unlike normal indices, get_indexer can handle a non-unique index for Categoricals (for example ci.get_indexer(['a', 'b', 'c']) in [5]). However, a 'fast path' was added inside get_indexer for the case where the passed index equals the calling one. In that case ([6]) it does not duplicate the duplicates, as it does in [5] and [7]. This contrasts with how get_indexer_non_unique works ([8]).
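
For comparison, the test computes its expected value from a plain object Index through get_indexer_non_unique, which returns a tuple of (indexer, missing), hence the [0] in the test. A small sketch, assuming a pandas version where this method behaves as in the pdb session earlier in this thread:

```python
import pandas as pd

# Plain object-dtype index with duplicate labels, like oidx in the test.
oidx = pd.Index(list("aabbca"), dtype=object)

# Returns the expanded positions plus the target positions that were not found.
indexer, missing = oidx.get_indexer_non_unique(list("aabbca"))

print(indexer.tolist())  # [0, 1, 5, 0, 1, 5, 2, 3, 2, 3, 4, 0, 1, 5]
print(len(missing))      # 0
```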

The 'fast path' was added in the IntervalIndex PR, so this also seems to be a change compared to 0.19:

In [1]: pd.__version__
Out[1]: '0.19.2'

In [2]: ci = pd.CategoricalIndex(list('aabbca'), categories=list('cab'))

In [3]: ci.get_indexer(['a', 'a', 'b', 'b', 'c', 'a'])
Out[3]: array([0, 1, 5, 0, 1, 5, 2, 3, 2, 3, 4, 0, 1, 5])

(But I am not sure what the actual consequences in user code could be.)

jorisvandenbossche (Member, Author) commented on Aug 24, 2017

A possible fix: add a check for uniqueness of the index to the fast path here:

def get_indexer(self, target, method=None, limit=None, tolerance=None):
    method = missing.clean_reindex_fill_method(method)
    target = ibase._ensure_index(target)
    if self.equals(target):
        return np.arange(len(self), dtype='intp')
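
To illustrate the idea, here is a toy model in plain Python of a fast path guarded by a uniqueness check. It is only a sketch: general_indexer and get_indexer_model are hypothetical names for illustration, they work on plain lists of labels, and this is not the actual pandas implementation or the fix that was eventually committed.

```python
def general_indexer(index, target):
    """For each target label, list every position where it occurs in index."""
    positions = {}
    for pos, label in enumerate(index):
        positions.setdefault(label, []).append(pos)
    result = []
    for label in target:
        # -1 marks a label that is not present in the index.
        result.extend(positions.get(label, [-1]))
    return result


def get_indexer_model(index, target):
    """Toy model: take the fast path only when the index is unique."""
    index, target = list(index), list(target)
    is_unique = len(set(index)) == len(index)
    if is_unique and index == target:
        return list(range(len(index)))  # fast path
    return general_indexer(index, target)


print(get_indexer_model("aabbca", "aabbca"))
# [0, 1, 5, 0, 1, 5, 2, 3, 2, 3, 4, 0, 1, 5]
print(get_indexer_model("abc", "abc"))
# [0, 1, 2]
```

With the guard in place, the fast path is taken only when it gives the same answer as the general path, so an index with duplicates that is passed to itself is expanded the same way as any other target.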

added a commit that references this issue on Aug 28, 2017
98876c8

