PERF: Vectorized string operations are slower than for-loops #35864

Open
@Nick-Morgan

Description

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In [1]:

import pandas as pd
import numpy as np
print(pd.__version__)

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))

def for_loop(series):
    return pd.Series([str(zipcode).zfill(5) for zipcode in series])

def vectorized(series):
    return series.astype(str).str.zfill(5)

%timeit for_loop(non_padded)
%timeit vectorized(non_padded)

Out [1]:

0.25.1
3.32 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.18 ms ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

In most cases, a for-loop in pandas is much slower than its vectorized equivalent. Here, however, the vectorized version takes more than twice as long as the for-loop. I have reproduced this on macOS and Ubuntu.
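
For anyone reproducing this outside IPython, here is a minimal sketch using the standard-library timeit module instead of the %timeit magic. The best_ms helper, the fixed seed, and the repeat counts are assumptions added for illustration and are not part of the original report. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def for_loop(series):
    return pd.Series([str(zipcode).zfill(5) for zipcode in series])


def vectorized(series):
    return series.astype(str).str.zfill(5)


def best_ms(func, series, repeat=3, number=10):
    """Return the best per-call time in milliseconds (hypothetical helper)."""
    timings = timeit.repeat(lambda: func(series), repeat=repeat, number=number)
    return min(timings) / number * 1000


np.random.seed(0)  # fixed seed: an assumption, only to make the input repeatable
non_padded = pd.Series(np.random.randint(100, 99999, size=10000))

# The two approaches should produce the same strings before timings are compared.
assert for_loop(non_padded).tolist() == vectorized(non_padded).tolist()

print(pd.__version__)
for func in (for_loop, vectorized):
    print(f"{func.__name__}: {best_ms(func, non_padded):.2f} ms per call")
```

Checking that both functions return the same values first guards against timing two operations that are not actually equivalent.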

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 20.0.2
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : 0.14.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

Activity

added the Performance (memory or execution speed performance) and Strings (string extension data type and string data) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label on Aug 23, 2020

dsaxton (Member) commented on Aug 23, 2020

This is an old version of pandas. If you upgrade, the difference doesn't seem to be as large.

In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
   ...:
   ...: %timeit pd.Series([str(zipcode).zfill(5) for zipcode in non_padded])
   ...: %timeit non_padded.astype(str).str.zfill(5)
   ...:
1.1.1
3.44 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.44 ms ± 85.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It's worth pointing out, though, that the string accessor method does a bit more work than the pure Python version, e.g. handling any missing values it encounters.

In [3]: ser = pd.Series([None, "1", "2"])

In [4]: ser.str.zfill(5)
Out[4]:
0     None
1    00001
2    00002
dtype: object

In [5]: pd.Series([x.zfill(5) for x in ser])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-35047bd45c0a> in <module>
----> 1 pd.Series([x.zfill(5) for x in ser])

<ipython-input-5-35047bd45c0a> in <listcomp>(.0)
----> 1 pd.Series([x.zfill(5) for x in ser])

AttributeError: 'NoneType' object has no attribute 'zfill'
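
To tolerate missing values the way the accessor does, the pure Python version needs its own check, which adds per-element work to the loop as well. A minimal sketch, where zfill_skip_missing is a hypothetical helper for illustration and not something pandas provides:

```python
import pandas as pd


def zfill_skip_missing(series, width):
    """Hypothetical helper: pad strings, pass anything else through unchanged."""
    return pd.Series(
        [x.zfill(width) if isinstance(x, str) else x for x in series],
        index=series.index,
        dtype=object,
    )


ser = pd.Series([None, "1", "2"])
padded = zfill_skip_missing(ser, 5)
print(padded)
```

With the isinstance check in place, the comprehension no longer raises AttributeError on the missing entry.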

changed the title from "BUG: Vectorized string operations are slower than for-loops" to "PERF: Vectorized string operations are slower than for-loops" on Aug 23, 2020

asishm (Contributor) commented on Aug 24, 2020

I'm on 1.1.0 and still see a big performance difference with the same test code:

In [65]: pd.__version__
Out[65]: '1.1.0'

In [66]: import pandas as pd
    ...: import numpy as np

In [67]: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
    ...: 
    ...: def for_loop(series):
    ...:     return pd.Series([str(zipcode).zfill(5) for zipcode in series])
    ...: 
    ...: def vectorized(series):
    ...:     return series.astype(str).str.zfill(5)
    ...:

In [68]: %timeit for_loop(non_padded)
5.96 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [69]: %timeit vectorized(non_padded)
12.1 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

After upgrading to 1.1.1, I get results similar to @dsaxton's, so something may have changed between the two releases:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
   ...:
   ...: def for_loop(series):
   ...:     return pd.Series([str(zipcode).zfill(5) for zipcode in series])
   ...:
   ...: def vectorized(series):
   ...:     return series.astype(str).str.zfill(5)
   ...:
   ...: %timeit for_loop(non_padded)
   ...: %timeit vectorized(non_padded)
1.1.1
6.06 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.4 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

dsaxton (Member) commented on Aug 24, 2020

I think #35519 may have caused the performance boost. It makes sense since the main bottleneck looks like it was the astype (which was patched in that change):

In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000)).astype(str)  # Do the astype first
   ...:
   ...: %timeit non_padded.str.zfill(5)
   ...: %timeit pd.Series([z.zfill(5) for z in non_padded])
   ...:
1.1.0
2.14 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 84.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

bashtage (Contributor) commented on Aug 24, 2020

If you remove the astype

def vectorized2(series):
    return series.str.zfill(5)

np2 = non_padded.astype(str)

%timeit vectorized2(np2)
1.68 ms ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit for_loop(non_padded)
2.63 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

so there appears to be no issue with string operations.

added the Dtype Conversions (unexpected or buggy dtype conversions) label and removed the Strings (string extension data type and string data) label on Aug 24, 2020

asishm (Contributor) commented on Sep 12, 2020

@bashtage I don't think that is an equivalent comparison. You are comparing str.zfill on a string series with a for loop that also does the string conversion. If you convert to str first, the for loop still beats str.zfill, although the difference is small.

I get similar results on master as well.

In [20]: %timeit pd.Series([k.zfill(5) for k in np2])
    ...: %timeit np2.str.zfill(5)
2.35 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.75 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

topper-123 (Contributor) commented on Sep 13, 2020

I can verify OP's results.

However, doing just the type conversion (not zfill) is faster vectorized (on master; it's probably slower in v1.1.1):

import pandas as pd
import numpy as np
print(pd.__version__)

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))

# type conversion and zfill, vectorized slower
%timeit pd.Series([str(zipcode).zfill(5) for zipcode in non_padded])
5.2 ms ± 91.5 µs per loop
%timeit non_padded.astype(str).str.zfill(5)
6.52 ms ± 78.5 µs per loop

# only type conversion, vectorized faster
%timeit pd.Series([str(zipcode) for zipcode in non_padded])
4.11 ms ± 58.4 µs per loop
%timeit non_padded.astype(str)
2.47 ms ± 21.6 µs per loop

# only zfill, vectorized slower
non_padded_str = pd.Series(np.random.randint(100, 99999, size=10000)).astype(str)
%timeit pd.Series([zipcode.zfill(5) for zipcode in non_padded_str])
2.71 ms ± 22.4 µs per loop
%timeit non_padded_str.str.zfill(5)
3.18 ms ± 27.4 µs per loop

So it must be something in the pandas Series.str.zfill method, not Series.astype, that should be optimized.

PRs welcome, of course.
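
One experiment for anyone investigating: check whether per-element missing-value handling accounts for the gap by computing a missing-value mask once and running a plain comprehension over the rest. The sketch below is not pandas' actual implementation, and zfill_masked is a hypothetical helper written only to make the idea concrete.

```python
import pandas as pd


def zfill_masked(series, width):
    """Hypothetical helper: pad only the non-missing entries of a string Series."""
    values = series.to_numpy(dtype=object)
    missing = pd.isna(values)  # one vectorized pass to find missing entries
    out = values.copy()
    # Plain comprehension over the non-missing values only.
    out[~missing] = [s.zfill(width) for s in values[~missing]]
    return pd.Series(out, index=series.index, dtype=object)


example = pd.Series(["7", None, "12345", "42"])
result = zfill_masked(example, 5)
print(result)
```

Timing this against Series.str.zfill and the bare comprehension on the same input would show how much of the difference the missing-value check explains, if any.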
