Description
It seems to me we should either fix ExcelFile.parse or deprecate it entirely, and I lean toward the latter. pandas originally started out with just ExcelFile but now has the top-level read_excel. The signatures started the same, but now read_excel has gained and modified parameters that have not been added/changed in ExcelFile.parse. For example:
- ExcelFile.parse lacks a dtype parameter
- ExcelFile.parse has a **kwds argument that is passed on to pandas internals with no documentation on what can be included. Invalid arguments are just ignored (e.g. BUG: xl.parse index_col ignoring skiprows #50953)
It appears to me that pd.ExcelFile(...).parse(...) offers no advantage over pd.read_excel(pd.ExcelFile(...)), and so rather than fixing parse we can deprecate it and make it internal.
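To illustrate the comparison above, here is a minimal sketch that reads the same sheet both ways. The file name, sheet name, and column names are made up for the example, and it assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Build a small workbook to read back (hypothetical file and sheet names).
path = os.path.join(tempfile.mkdtemp(), "example.xlsx")
pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_excel(path, sheet_name="data", index=False)

with pd.ExcelFile(path) as xl:
    # The two spellings being compared in this issue.
    via_parse = xl.parse("data")
    via_read_excel = pd.read_excel(xl, sheet_name="data")
    # read_excel accepts dtype; the issue reports that parse lacks it.
    as_float = pd.read_excel(xl, sheet_name="data", dtype={"a": "float64"})

print(via_parse.equals(via_read_excel))
print(as_float["a"].dtype)
```

Passing the open ExcelFile to read_excel reuses the already-opened workbook, so several sheets can be read inside one with block without reopening the file.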
Edit: I no longer think deprecating ExcelFile entirely as mentioned below is a good option. See #58247 (comment).
Another option is to deprecate ExcelFile entirely. The one thing ExcelFile still provides that isn't available elsewhere is access to the underlying book or sheet_names without reading the entire file.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100, 100)))
# Write ten 100x100 sheets into a single workbook.
with pd.ExcelWriter("test.xlsx") as writer:
    for e in range(10):
        df.to_excel(writer, sheet_name=str(e))
%timeit pd.ExcelFile("test.xlsx").sheet_names
# 14.1 ms ± 76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_excel("test.xlsx", sheet_name=None)
# 411 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One can somewhat work around this by using nrows, but it's clunky.
%timeit pd.read_excel("test.xlsx", sheet_name=None, nrows=0).keys()
# 57.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
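For reference, a small sketch of the usage pattern that ExcelFile enables here: list the sheet names first, then read only the sheet that is needed. The workbook is a scaled-down, hypothetical version of the one above, and an installed Excel engine such as openpyxl is assumed. No timings are claimed for it.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Smaller stand-in for the benchmark workbook (hypothetical path and sizes).
path = os.path.join(tempfile.mkdtemp(), "test_small.xlsx")
df = pd.DataFrame(np.zeros((5, 5)))
with pd.ExcelWriter(path) as writer:
    for e in range(3):
        df.to_excel(writer, sheet_name=str(e))

with pd.ExcelFile(path) as xl:
    # Sheet names are available without parsing any sheet into a DataFrame.
    names = xl.sheet_names
    # Read just one sheet from the already-open workbook.
    first = pd.read_excel(xl, sheet_name=names[0], index_col=0)

print(names)
print(first.shape)
```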