Description
Code Sample
import pandas as pd
import numpy as np
tx = pd.timedelta_range('09:30:00','16:00:00', freq='30 min')
# ys is Series with MultiIndex, first level is Timedelta
ys = pd.DataFrame(np.random.randn(len(tx), 2), index=tx, columns = ('c0','c1')).set_index('c0', append=True)['c1']
# this is what sklearn.metrics.r2_score would do
hasattr(ys, 'fit')
Problem description
sklearn tries to handle numpy array-like args, including pd.Series and pd.DataFrame. In trying to validate the argument, sklearn calls hasattr(ys, 'fit'), which calls into the _can_hold_identifiers_and_holds_name function of the MultiIndex object. In line 2111 of pandas/core/indexes/base.py, it tries to check whether 'fit' is in the MultiIndex, which in turns calls into .
pandas/core/indexes/multi.py line 539 for 'self.get_loc(key)'. The get_loc is surrounded by try-catch, but only for LookupError and TypeError, whereas here it's ValueError that's thrown.
The old version 0.22.0 works fine.
For now I added a try-catch _can_hold_identifiers_and_holds_name function and return False. It works around the problem. It looks like this:
def _can_hold_identifiers_and_holds_name(self, name):
if self.is_object() or self.is_categorical():
try:
return name in self
except Exception:
return False
return False
Expected Output
In [4]: hasattr(ys, 'fit')
ValueError Traceback (most recent call last)
in
----> 1 hasattr(ys, 'fit')
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in getattr(self, name)
4372 return object.getattribute(self, name)
4373 else:
-> 4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
4376 return object.getattribute(self, name)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
2109 """
2110 if self.is_object() or self.is_categorical():
-> 2111 return name in self
2112 return False
2113
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in contains(self, key)
547 hash(key)
548 try:
--> 549 self.get_loc(key)
550 return True
551 except (LookupError, TypeError):
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2235
2236 if not isinstance(key, tuple):
-> 2237 loc = self._get_level_indexer(key, level=0)
2238
2239 # _get_level_indexer returns an empty slice if the key has
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
2494 else:
2495
-> 2496 loc = level_index.get_loc(key)
2497 if isinstance(loc, slice):
2498 return loc
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
783
784 if _is_convertible_to_td(key):
--> 785 key = Timedelta(key)
786 return Index.get_loc(self, key, method, tolerance)
787
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.new()
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()
ValueError: unit abbreviation w/o a number
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
, and sklearn: 0.20.1
Activity
[-]Index._can_hold_identifiers_and_holds_name should try-catch[/-][+]MultiIndex.__contains__ try-catch exception types too narrow[/+]jreback commentedon Jan 2, 2019
can you provide an example that doesn't use an external package
xwang777 commentedon Jan 3, 2019
Sure. Here's the reduced example that only uses pandas and numpy. The exception is the same.
Output
In [4]: hasattr(ys, 'fit')
ValueError Traceback (most recent call last)
in
----> 1 hasattr(ys, 'fit')
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in getattr(self, name)
4372 return object.getattribute(self, name)
4373 else:
-> 4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
4376 return object.getattribute(self, name)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
2109 """
2110 if self.is_object() or self.is_categorical():
-> 2111 return name in self
2112 return False
2113
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in contains(self, key)
547 hash(key)
548 try:
--> 549 self.get_loc(key)
550 return True
551 except (LookupError, TypeError):
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2235
2236 if not isinstance(key, tuple):
-> 2237 loc = self._get_level_indexer(key, level=0)
2238
2239 # _get_level_indexer returns an empty slice if the key has
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
2494 else:
2495
-> 2496 loc = level_index.get_loc(key)
2497 if isinstance(loc, slice):
2498 return loc
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
783
784 if _is_convertible_to_td(key):
--> 785 key = Timedelta(key)
786 return Index.get_loc(self, key, method, tolerance)
787
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.new()
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()
ValueError: unit abbreviation w/o a number
jreback commentedon Jan 3, 2019
can you edit the example at the top with this
jreback commentedon Jan 3, 2019
yeah this should be handled here:
where if the conversion fails (e.g. put a try/except ValueError) around this.
PRs welcome.
xwang777 commentedon Jan 3, 2019
OK, I updated the example and output in the first comment.
Regarding the fix, why not try-catch Exception in the _can_hold_identifiers_and_holds_name function? This example is caused by Timedelta, but it could very well be caused by Timestamp, or any other types that fail to be constructed from a str.
jreback commentedon Jan 3, 2019
yes u can try that see if anything breaks
xwang777 commentedon Jan 10, 2019
The patch is very simple, I could create pull request easily. However, the tests (python setup.py test) failed right away, but not due to my changes. Anyone please tell me where the problem might be? Thanks
run test output
(pandas-dev) [02:19:47|xwang0-el7 pandas-xw]$ python setup.py test
running test
running egg_info
writing pandas.egg-info/PKG-INFO
writing dependency_links to pandas.egg-info/dependency_links.txt
writing requirements to pandas.egg-info/requires.txt
writing top-level names to pandas.egg-info/top_level.txt
reading manifest file 'pandas.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'doc/build'
warning: no previously-included files matching '.pyd' found anywhere in distribution
warning: no previously-included files matching '~' found anywhere in distribution
warning: no previously-included files matching '.DS_Store' found anywhere in distribution
warning: no previously-included files matching '.git*' found anywhere in distribution
warning: no previously-included files matching '#*' found anywhere in distribution
writing manifest file 'pandas.egg-info/SOURCES.txt'
running build_ext
/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py:154: FutureWarning: 'pandas.core' is private. Use 'pandas.Categorical'
module = import(module_name)
Traceback (most recent call last):
File "setup.py", line 743, in
**setuptools_kwargs)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/init.py", line 143, in setup
return distutils.core.setup(**attrs)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 228, in run
self.run_tests()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 250, in run_tests
exit=False,
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 100, in init
self.parseArgs(argv)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 124, in parseArgs
self._do_discovery(argv[2:])
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 244, in _do_discovery
self.createTests(from_discovery=True, Loader=Loader)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 154, in createTests
self.test = loader.discover(self.start, self.pattern, self.top)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 347, in discover
tests = list(self._find_tests(start_dir, pattern))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 404, in _find_tests
full_path, pattern, namespace)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 481, in _find_test_path
tests = self.loadTestsFromModule(package, pattern=pattern)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
module = import(module_name)
File "/home/xwang/dev/pandas-xw/pandas/tests/indexes/datetimes/test_misc.py", line 91, in
class TestDatetime64(object):
File "/home/xwang/dev/pandas-xw/pandas/tests/indexes/datetimes/test_misc.py", line 246, in TestDatetime64
None] if tm.get_locales() is None else [None] + tm.get_locales())
File "/home/xwang/dev/pandas-xw/pandas/util/testing.py", line 516, in get_locales
x, encoding=pd.options.display.encoding))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 4: invalid continuation byte
I forked from master, here's the top of my 'git log --name-status' output:
git log --name-status
commit 6f9ab14
Author: xwang777 30608423+xwang777@users.noreply.github.com
Date: Tue Jan 8 13:52:31 2019 -0500
M pandas/core/indexes/base.py
commit f57bd72
Author: Simon Hawkins simonjayhawkins@gmail.com
Date: Tue Jan 8 13:07:49 2019 +0000
M pandas/io/formats/format.py
M pandas/io/formats/html.py
optilude commentedon Feb 7, 2019
I was about to raise an issue for a problem with
pivot_table()
, but I believe it's the same bug. The relevant code is:Here
chart_data
is aDataFrame
with columnsage
(aTimedelta
),priority
(a string), andkey
(another string). There are no N/A values in any of those columns.Before upgrading to Pandas 0.24.1 this worked. It now throws:
If I change the
pivot_table
call by swappingindex
andcolumns
(soindex="priority", columns="age"
), it no longer throws.jorisvandenbossche commentedon Feb 7, 2019
@optilude can you try to give a small reproducible example? (eg by creating a dummy dataframe with the specific dtypes you have that shows the problem)
8 remaining items