Skip to content

MultiIndex.__contains__ try-catch exception types too narrow #24570

Closed
@xwang777

Description

@xwang777

Code Sample

import pandas as pd 
import numpy as np  
tx = pd.timedelta_range('09:30:00','16:00:00', freq='30 min')       
# ys is Series with MultiIndex, first level is Timedelta
ys = pd.DataFrame(np.random.randn(len(tx), 2), index=tx, columns = ('c0','c1')).set_index('c0', append=True)['c1']  
# this is what sklearn.metrics.r2_score would do
hasattr(ys, 'fit')

Problem description

sklearn tries to handle numpy array-like args, including pd.Series and pd.DataFrame. In trying to validate the argument, sklearn calls hasattr(ys, 'fit'), which calls into the _can_hold_identifiers_and_holds_name function of the MultiIndex object. In line 2111 of pandas/core/indexes/base.py, it tries to check whether 'fit' is in the MultiIndex, which in turns calls into .
pandas/core/indexes/multi.py line 539 for 'self.get_loc(key)'. The get_loc is surrounded by try-catch, but only for LookupError and TypeError, whereas here it's ValueError that's thrown.

The old version 0.22.0 works fine.

For now I added a try-catch _can_hold_identifiers_and_holds_name function and return False. It works around the problem. It looks like this:

    def _can_hold_identifiers_and_holds_name(self, name):
        if self.is_object() or self.is_categorical():
            try:
                return name in self
            except Exception:
                return False
        return False

Expected Output

In [4]: hasattr(ys, 'fit')

ValueError Traceback (most recent call last)
in
----> 1 hasattr(ys, 'fit')

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in getattr(self, name)
4372 return object.getattribute(self, name)
4373 else:
-> 4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
4376 return object.getattribute(self, name)

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
2109 """
2110 if self.is_object() or self.is_categorical():
-> 2111 return name in self
2112 return False
2113

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in contains(self, key)
547 hash(key)
548 try:
--> 549 self.get_loc(key)
550 return True
551 except (LookupError, TypeError):

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2235
2236 if not isinstance(key, tuple):
-> 2237 loc = self._get_level_indexer(key, level=0)
2238
2239 # _get_level_indexer returns an empty slice if the key has

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
2494 else:
2495
-> 2496 loc = level_index.get_loc(key)
2497 if isinstance(loc, slice):
2498 return loc

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
783
784 if _is_convertible_to_td(key):
--> 785 key = Timedelta(key)
786 return Index.get_loc(self, key, method, tolerance)
787

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.new()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()

ValueError: unit abbreviation w/o a number

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

, and sklearn: 0.20.1

Activity

changed the title [-]Index._can_hold_identifiers_and_holds_name should try-catch[/-] [+]MultiIndex.__contains__ try-catch exception types too narrow[/+] on Jan 2, 2019
jreback

jreback commented on Jan 2, 2019

@jreback
Contributor

can you provide an example that doesn't use an external package

xwang777

xwang777 commented on Jan 3, 2019

@xwang777
Author

Sure. Here's the reduced example that only uses pandas and numpy. The exception is the same.

import pandas as pd 
import numpy as np  
tx = pd.timedelta_range('09:30:00','16:00:00', freq='30 min')       
# ys is Series with MultiIndex, first level is Timedelta
ys = pd.DataFrame(np.random.randn(len(tx), 2), index=tx, columns = ('c0','c1')).set_index('c0', append=True)['c1']  
# this is what sklearn.metrics would do
hasattr(ys, 'fit')

Output

In [4]: hasattr(ys, 'fit')

ValueError Traceback (most recent call last)
in
----> 1 hasattr(ys, 'fit')

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in getattr(self, name)
4372 return object.getattribute(self, name)
4373 else:
-> 4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
4376 return object.getattribute(self, name)

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
2109 """
2110 if self.is_object() or self.is_categorical():
-> 2111 return name in self
2112 return False
2113

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in contains(self, key)
547 hash(key)
548 try:
--> 549 self.get_loc(key)
550 return True
551 except (LookupError, TypeError):

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2235
2236 if not isinstance(key, tuple):
-> 2237 loc = self._get_level_indexer(key, level=0)
2238
2239 # _get_level_indexer returns an empty slice if the key has

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
2494 else:
2495
-> 2496 loc = level_index.get_loc(key)
2497 if isinstance(loc, slice):
2498 return loc

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
783
784 if _is_convertible_to_td(key):
--> 785 key = Timedelta(key)
786 return Index.get_loc(self, key, method, tolerance)
787

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.new()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()

ValueError: unit abbreviation w/o a number

jreback

jreback commented on Jan 3, 2019

@jreback
Contributor

can you edit the example at the top with this

jreback

jreback commented on Jan 3, 2019

@jreback
Contributor

yeah this should be handled here:

~/pandas/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
    546 
    547         if _is_convertible_to_td(key):
--> 548             key = Timedelta(key)
    549             return Index.get_loc(self, key, method, tolerance)
    550 

~/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.__new__()

where if the conversion fails (e.g. put a try/except ValueError) around this.

PRs welcome.

xwang777

xwang777 commented on Jan 3, 2019

@xwang777
Author

OK, I updated the example and output in the first comment.

Regarding the fix, why not try-catch Exception in the _can_hold_identifiers_and_holds_name function? This example is caused by Timedelta, but it could very well be caused by Timestamp, or any other types that fail to be constructed from a str.

jreback

jreback commented on Jan 3, 2019

@jreback
Contributor

yes u can try that see if anything breaks

xwang777

xwang777 commented on Jan 10, 2019

@xwang777
Author

The patch is very simple, I could create pull request easily. However, the tests (python setup.py test) failed right away, but not due to my changes. Anyone please tell me where the problem might be? Thanks

run test output

(pandas-dev) [02:19:47|xwang0-el7 pandas-xw]$ python setup.py test
running test
running egg_info
writing pandas.egg-info/PKG-INFO
writing dependency_links to pandas.egg-info/dependency_links.txt
writing requirements to pandas.egg-info/requires.txt
writing top-level names to pandas.egg-info/top_level.txt
reading manifest file 'pandas.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'doc/build'
warning: no previously-included files matching '.pyd' found anywhere in distribution
warning: no previously-included files matching '
~' found anywhere in distribution
warning: no previously-included files matching '.DS_Store' found anywhere in distribution
warning: no previously-included files matching '.git*' found anywhere in distribution
warning: no previously-included files matching '#*' found anywhere in distribution
writing manifest file 'pandas.egg-info/SOURCES.txt'
running build_ext
/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py:154: FutureWarning: 'pandas.core' is private. Use 'pandas.Categorical'
module = import(module_name)
Traceback (most recent call last):
File "setup.py", line 743, in
**setuptools_kwargs)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/init.py", line 143, in setup
return distutils.core.setup(**attrs)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 228, in run
self.run_tests()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 250, in run_tests
exit=False,
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 100, in init
self.parseArgs(argv)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 124, in parseArgs
self._do_discovery(argv[2:])
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 244, in _do_discovery
self.createTests(from_discovery=True, Loader=Loader)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 154, in createTests
self.test = loader.discover(self.start, self.pattern, self.top)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 347, in discover
tests = list(self._find_tests(start_dir, pattern))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 404, in _find_tests
full_path, pattern, namespace)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 481, in _find_test_path
tests = self.loadTestsFromModule(package, pattern=pattern)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
module = import(module_name)
File "/home/xwang/dev/pandas-xw/pandas/tests/indexes/datetimes/test_misc.py", line 91, in
class TestDatetime64(object):
File "/home/xwang/dev/pandas-xw/pandas/tests/indexes/datetimes/test_misc.py", line 246, in TestDatetime64
None] if tm.get_locales() is None else [None] + tm.get_locales())
File "/home/xwang/dev/pandas-xw/pandas/util/testing.py", line 516, in get_locales
x, encoding=pd.options.display.encoding))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 4: invalid continuation byte

I forked from master, here's the top of my 'git log --name-status' output:

git log --name-status

commit 6f9ab14
Author: xwang777 30608423+xwang777@users.noreply.github.com
Date: Tue Jan 8 13:52:31 2019 -0500

Address issue 24570

M pandas/core/indexes/base.py

commit f57bd72
Author: Simon Hawkins simonjayhawkins@gmail.com
Date: Tue Jan 8 13:07:49 2019 +0000

REF: io/formats/html.py (and io/formats/format.py) (#24651)

M pandas/io/formats/format.py
M pandas/io/formats/html.py

optilude

optilude commented on Feb 7, 2019

@optilude

I was about to raise an issue for a problem with pivot_table(), but I believe it's the same bug. The relevant code is:

        chart_data.pivot_table(
            index='age',
            columns='priority',
            values='key',
            aggfunc='count'
        )

Here chart_data is a DataFrame with columns age (a Timedelta), priority (a string), and key (another string). There are no N/A values in any of those columns.

Before upgrading to Pandas 0.24.1 this worked. It now throws:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/jira_agile_metrics/calculator.py", line 56, in run_calculators
    c.write()
  File "/usr/lib/python3.6/site-packages/jira_agile_metrics/calculators/debt.py", line 83, in write
    self.write_debt_age_chart(chart_data, self.settings['debt_age_chart'])
  File "/usr/lib/python3.6/site-packages/jira_agile_metrics/calculators/debt.py", line 132, in write_debt_age_chart
    aggfunc='count'
  File "/usr/lib/python3.6/site-packages/pandas/core/frame.py", line 5759, in pivot_table
    margins_name=margins_name)
  File "/usr/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 84, in pivot_table
    agged = agged.dropna(how='all')
  File "/usr/lib/python3.6/site-packages/pandas/core/frame.py", line 4598, in dropna
    result = self.loc(axis=axis)[mask]
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1500, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1859, in _getitem_axis
    if is_iterator(key):
  File "/usr/lib/python3.6/site-packages/pandas/core/dtypes/inference.py", line 157, in is_iterator
    return hasattr(obj, '__next__')
  File "/usr/lib/python3.6/site-packages/pandas/core/generic.py", line 5065, in __getattr__
    if self._info_axis._can_hold_identifiers_and_holds_name(name):
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3984, in _can_hold_identifiers_and_holds_name
    return name in self
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 841, in __contains__
    self.get_loc(key)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2397, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2652, in _get_level_indexer
    code = level_index.get_loc(key)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/timedeltas.py", line 548, in get_loc
    key = Timedelta(key)
  File "pandas/_libs/tslibs/timedeltas.pyx", line 1167, in pandas._libs.tslibs.timedeltas.Timedelta.__new__
  File "pandas/_libs/tslibs/timedeltas.pyx", line 463, in pandas._libs.tslibs.timedeltas.parse_timedelta_string
ValueError: unit abbreviation w/o a number

If I change the pivot_table call by swapping index and columns (so index="priority", columns="age"), it no longer throws.

jorisvandenbossche

jorisvandenbossche commented on Feb 7, 2019

@jorisvandenbossche
Member

@optilude can you try to give a small reproducible example? (eg by creating a dummy dataframe with the specific dtypes you have that shows the problem)

8 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @optilude@jreback@jorisvandenbossche@xwang777

      Issue actions

        MultiIndex.__contains__ try-catch exception types too narrow · Issue #24570 · pandas-dev/pandas