> **Context:** A `Regex` uses internal mutable space (called a `Cache`)
> while executing a search. Since a `Regex` really wants to be easily
> shared across multiple threads simultaneously, it follows that a
> `Regex` either needs to provide search functions that accept a `&mut
> Cache` (thereby pushing synchronization to a problem for the caller
> to solve) or it needs to do synchronization itself. While there are
> lower level APIs in `regex-automata` that do the former, they are
> less convenient. The higher level APIs, especially in the `regex`
> crate proper, need to do some kind of synchronization to give a
> search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a
> `Mutex<Vec<Cache>>` with an optimization for the "owning" thread
> that lets it bypass the `Mutex`. The owning thread optimization
> makes it so the single threaded use case essentially doesn't pay for
> any synchronization overhead, and that all works fine. But once the
> `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>`
> gets hit. And if you're doing a lot of regex searches on short
> haystacks in parallel, that `Mutex` comes under extremely heavy
> contention. To the point that a program can slow down by enormous
> amounts.
>
> This PR attempts to address that problem.
>
> Note that this issue can be worked around.
>
> The simplest work-around is to clone a `Regex` and send it to other
> threads instead of sharing a single `Regex`. This won't use any
> additional memory (a `Regex` is reference counted internally),
> but it will force each thread to use the "owner" optimization
> described above. This does mean, for example, that you can't
> share a `Regex` across multiple threads conveniently with a
> `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a
> `meta::Regex` in the `regex-automata` crate. Those APIs accept a
> `&mut Cache` explicitly. In that case, you can use the `thread_local`
> crate or even an actual `thread_local!` or something else entirely.
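
For the second work-around, a minimal sketch of the `thread_local!` pattern might look like the following. `Searcher`, `Cache`, `create_cache` and `count_with` are hypothetical stand-ins written for this illustration, not the `regex-automata` API. The point is only the shape: one lazily created cache per thread, handed to a search routine that takes `&mut Cache`, with no lock involved.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for the mutable scratch space a search needs.
struct Cache {
    positions: Vec<usize>,
}

/// Hypothetical stand-in for a searcher whose low-level API takes the
/// cache explicitly instead of synchronizing internally.
struct Searcher {
    needle: u8,
}

impl Searcher {
    fn create_cache(&self) -> Cache {
        Cache { positions: Vec::new() }
    }

    /// Counts occurrences of `needle`, using `cache` as reusable scratch.
    fn count_with(&self, cache: &mut Cache, haystack: &[u8]) -> usize {
        cache.positions.clear();
        for (i, b) in haystack.iter().enumerate() {
            if *b == self.needle {
                cache.positions.push(i);
            }
        }
        cache.positions.len()
    }
}

/// One immutable searcher shared by every thread.
static SEARCHER: Searcher = Searcher { needle: b'a' };

thread_local! {
    /// Each thread lazily builds its own cache on first use.
    static CACHE: RefCell<Cache> = RefCell::new(SEARCHER.create_cache());
}

/// Convenience wrapper: no lock is taken, only a thread-local borrow.
fn count(haystack: &[u8]) -> usize {
    CACHE.with(|c| SEARCHER.count_with(&mut c.borrow_mut(), haystack))
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| std::thread::spawn(|| count(b"banana")))
        .collect();
    for h in handles {
        println!("matches: {}", h.join().unwrap());
    }
}
```

The trade-off is that every thread that ever calls `count` keeps its own cache alive until that thread exits.
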

I wish I could say this PR was a home run that fixed the contention
issues with `Regex` once and for all, but it's not. It just makes
things a little better by switching from one stack to eight stacks for
the pool. The stack is chosen by doing `self.stacks[thread_id % 8]`.
It's a pretty dumb strategy, but it limits extra memory usage while at
least reducing contention. Obviously, it works a lot better for the
8-16 thread case, and while it helps with the 64-128 thread case too,
things are still pretty slow there.
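
To make the sharding concrete, here is a rough sketch of a pool split across eight mutex-guarded stacks and indexed by `thread_id % 8`. It is an illustration under assumptions, not the code in this PR: `ShardedPool`, `NEXT_ID`, `THREAD_ID` and the `create` function pointer are names made up here, the thread ID is assumed to come from a per-thread counter, and the owner-thread fast path is left out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Number of independent stacks (eight, as in the description above).
const NUM_STACKS: usize = 8;

/// Hypothetical process-wide counter used to hand out small thread IDs.
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    /// Each thread is assigned its ID lazily, on first use.
    static THREAD_ID: usize = NEXT_ID.fetch_add(1, Ordering::Relaxed);
}

/// A simplified pool of reusable values, sharded across several stacks.
struct ShardedPool<T> {
    stacks: Vec<Mutex<Vec<T>>>,
    create: fn() -> T,
}

impl<T> ShardedPool<T> {
    fn new(create: fn() -> T) -> Self {
        let stacks = (0..NUM_STACKS).map(|_| Mutex::new(Vec::new())).collect();
        ShardedPool { stacks, create }
    }

    /// Index of the stack used by the calling thread.
    fn stack_index(&self) -> usize {
        THREAD_ID.with(|id| *id % self.stacks.len())
    }

    /// Pops a value from this thread's stack, or creates one if it is empty.
    fn get(&self) -> T {
        // The lock guard is dropped at the end of this statement.
        let popped = self.stacks[self.stack_index()].lock().unwrap().pop();
        popped.unwrap_or_else(|| (self.create)())
    }

    /// Returns a value to this thread's stack so it can be reused.
    fn put(&self, value: T) {
        self.stacks[self.stack_index()].lock().unwrap().push(value);
    }
}

fn main() {
    let pool: ShardedPool<Vec<u8>> = ShardedPool::new(Vec::new);
    std::thread::scope(|s| {
        for _ in 0..16 {
            s.spawn(|| {
                let mut scratch = pool.get();
                scratch.clear();
                scratch.extend_from_slice(b"scratch space");
                pool.put(scratch);
            });
        }
    });
    let pooled: usize = pool.stacks.iter().map(|s| s.lock().unwrap().len()).sum();
    println!("values left in the pool: {}", pooled);
}
```

Threads whose IDs collide modulo eight still contend on the same mutex, which fits with contention remaining at higher thread counts.
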

A benchmark for this problem is described in #934. We compare 8 and 16
threads, and for each thread count, we compare a `cloned` and `shared`
approach. The `cloned` approach clones the regex before sending it to
each thread, whereas the `shared` approach shares a single regex across
multiple threads. The `cloned` approach is expected to be fast (and
it is) because it forces each thread into the owner optimization. The
`shared` approach, however, hits the shared stack behind a mutex and
suffers majorly from contention.
Here's what that benchmark looks like before this PR.
```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro
Time (mean ± σ): 2.3 ms ± 0.4 ms [User: 9.4 ms, System: 3.1 ms]
Range (min … max): 1.8 ms … 3.5 ms 823 runs
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro
Time (mean ± σ): 161.6 ms ± 8.0 ms [User: 472.4 ms, System: 477.5 ms]
Range (min … max): 150.7 ms … 176.8 ms 18 runs
Summary
'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro' ran
70.06 ± 11.43 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro'
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
Time (mean ± σ): 3.5 ms ± 0.5 ms [User: 26.1 ms, System: 5.2 ms]
Range (min … max): 2.8 ms … 5.7 ms 576 runs
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
Time (mean ± σ): 433.9 ms ± 7.2 ms [User: 1402.1 ms, System: 4377.1 ms]
Range (min … max): 423.9 ms … 444.4 ms 10 runs
Summary
'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
122.25 ± 15.80 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```
And here's what it looks like after this PR:
```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro
Time (mean ± σ): 2.2 ms ± 0.4 ms [User: 8.5 ms, System: 3.7 ms]
Range (min … max): 1.7 ms … 3.4 ms 781 runs
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro
Time (mean ± σ): 24.6 ms ± 1.8 ms [User: 141.0 ms, System: 1.2 ms]
Range (min … max): 20.8 ms … 27.3 ms 116 runs
Summary
'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro' ran
10.94 ± 2.05 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro'
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
Time (mean ± σ): 3.6 ms ± 0.4 ms [User: 26.8 ms, System: 4.4 ms]
Range (min … max): 2.8 ms … 5.4 ms 574 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
Time (mean ± σ): 99.4 ms ± 5.4 ms [User: 935.0 ms, System: 133.0 ms]
Range (min … max): 85.6 ms … 109.9 ms 27 runs
Summary
'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
27.95 ± 3.48 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```

So instead of things getting over 122x slower in the 16 thread case, it
"only" gets 28x slower.

Other ideas for future work:
* Instead of a `Vec<Mutex<Vec<Cache>>>`, use a
`Vec<LockFreeStack<Cache>>`. I'm not sure this will fully resolve the
problem, but it's likely to make it better I think. AFAIK, the main
technical challenge here is coming up with a lock-free stack in the
first place that avoids the ABA problem. Crossbeam in theory provides
some primitives to help with this (epochs), but I don't want to add any
new dependencies.
* Think up a completely different approach to the problem. I'm drawing
a blank. (The `thread_local` crate is one such avenue, and the regex
crate actually used to use `thread_local` for exactly this. But
it led to huge memory usage in environments with lots of threads.
Specifically, I believe its memory usage scales with the total number
of threads that run a regex search, whereas I want memory usage to
scale with the total number of threads *simultaneously* running a regex
search.)

Ref #934. If folks have insights or opinions, I'd appreciate if they
shared them in #934 instead of this PR. :-) Thank you!