Description
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
In [1]:
import pandas as pd
import numpy as np
print(pd.__version__)
non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
def for_loop(series):
    return pd.Series([str(zipcode).zfill(5) for zipcode in series])
def vectorized(series):
    return series.astype(str).str.zfill(5)
%timeit for_loop(non_padded)
%timeit vectorized(non_padded)
Out [1]:
0.25.1
3.32 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.18 ms ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem description
In most cases, using a for-loop in pandas is much slower than its vectorized equivalent. However, the above operation takes over twice as long when vectorized (7.18 ms versus 3.32 ms for the for-loop). I have replicated this issue on macOS and Ubuntu.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 20.0.2
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : 0.14.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
Activity
dsaxton commented on Aug 23, 2020
This is an old version of pandas; if you upgrade, the difference doesn't seem to be as large.
It's worth pointing out though that the string accessor method is doing a bit more work than the pure Python version, e.g., trying to handle any missing values that it might encounter.
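As a rough illustration of that extra work, here is a minimal sketch (not taken from the thread) of how the two approaches behave when a missing value is present. The sample values are made up for illustration. The `.str` accessor leaves the missing entry missing, while the plain Python version turns it into text and pads it.

```python
import pandas as pd

# Hypothetical sample data: one zip-code string and one missing value.
zipcodes = pd.Series(["501", None], dtype=object)

# The .str accessor skips the missing entry and keeps it missing.
vectorized = zipcodes.str.zfill(5)

# The pure Python loop stringifies everything, including the missing entry.
looped = pd.Series([str(z).zfill(5) for z in zipcodes])

print(vectorized.isna().tolist())
print(looped.tolist())
```

So the two functions in the report are only equivalent when the input has no missing values, which is the case for the benchmark data.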
Title changed from "BUG: Vectorized string operations are slower than for-loops" to "PERF: Vectorized string operations are slower than for-loops"
asishm commented on Aug 24, 2020
I'm on 1.1.0 and still see a big perf difference with the same test code.
After upgrading to 1.1.1, I get results similar to @dsaxton's, so there might have been some changes.
dsaxton commented on Aug 24, 2020
I think #35519 may have caused the performance boost. It makes sense since the main bottleneck looks like it was the astype (which was patched in that change).
bashtage commented on Aug 24, 2020
If you remove the astype, there appears to be no issue with string operations.
asishm commented on Sep 12, 2020
@bashtage I don't think that is an equivalent comparison. You are comparing str.zfill on a string series with a for loop that also does the string conversion. If you convert to str first, the for loop still beats str.zfill (albeit by a small margin).
I get similar results in master as well.
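A like-for-like comparison of the kind described here could be set up roughly as follows. This is only a sketch: the function names `loop_zfill` and `accessor_zfill` are made up for illustration, the data mirrors the shape used in the original report, and the standard-library `timeit` module stands in for the IPython `%timeit` magic. Timings will vary by machine and pandas version, so no numbers are claimed.

```python
import timeit

import numpy as np
import pandas as pd

# Convert to strings up front so only the zero-padding step is timed.
rng = np.random.default_rng(0)
as_str = pd.Series(rng.integers(100, 99999, size=10_000)).astype(str)

def loop_zfill(series):
    # Hypothetical helper: pad each string in a Python list comprehension.
    return pd.Series([s.zfill(5) for s in series])

def accessor_zfill(series):
    # Hypothetical helper: pad using the pandas string accessor.
    return series.str.zfill(5)

# Both approaches should produce identical values on this data.
assert loop_zfill(as_str).tolist() == accessor_zfill(as_str).tolist()

for func in (loop_zfill, accessor_zfill):
    seconds = timeit.timeit(lambda: func(as_str), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```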
topper-123 commented on Sep 13, 2020
I can verify OP's results.
However, doing just the type conversion (not zfill) is faster vectorized (in master; it's probably slower in v1.1.1).
So it must be something in the pandas Series.str.zfill method, not Series.astype, that should be optimized.
PRs welcome, of course.
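For anyone wanting to check the conversion step on their own pandas version, a sketch along these lines times the string conversion alone, with no padding involved. The helper names `loop_to_str` and `astype_to_str` are hypothetical, and the result depends on the installed version, so treat it as a starting point rather than a reference benchmark.

```python
import timeit

import numpy as np
import pandas as pd

# Integer data with the same range and size as the original report.
rng = np.random.default_rng(0)
non_padded = pd.Series(rng.integers(100, 99999, size=10_000))

def loop_to_str(series):
    # Hypothetical helper: convert each value with Python's str().
    return pd.Series([str(v) for v in series])

def astype_to_str(series):
    # Hypothetical helper: convert with the vectorized astype call.
    return series.astype(str)

# Sanity check that both conversions agree before timing them.
assert loop_to_str(non_padded).tolist() == astype_to_str(non_padded).tolist()

for func in (loop_to_str, astype_to_str):
    seconds = timeit.timeit(lambda: func(non_padded), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```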