Commit 2aa6744

automata: reduce regex contention somewhat
> **Context:** A `Regex` uses internal mutable space (called a `Cache`) while executing a search. Since a `Regex` really wants to be easily shared across multiple threads simultaneously, it follows that a `Regex` either needs to provide search functions that accept a `&mut Cache` (thereby making synchronization the caller's problem) or it needs to do synchronization itself. While there are lower level APIs in `regex-automata` that do the former, they are less convenient. The higher level APIs, especially in the `regex` crate proper, need to do some kind of synchronization to give a search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a `Mutex<Vec<Cache>>` with an optimization for the "owning" thread that lets it bypass the `Mutex`. The owning thread optimization makes it so the single threaded use case essentially doesn't pay for any synchronization overhead, and that all works fine. But once the `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>` gets hit. And if you're doing a lot of regex searches on short haystacks in parallel, that `Mutex` comes under extremely heavy contention, to the point that a program can slow down by enormous amounts.
>
> This PR attempts to address that problem.
>
> It's worth pointing out that this issue can be worked around.
>
> The simplest work-around is to clone a `Regex` and send it to other threads instead of sharing a single `Regex`. This won't use any additional memory (a `Regex` is reference counted internally), but it will force each thread to use the "owner" optimization described above. This does mean, for example, that you can't share a `Regex` across multiple threads conveniently with a `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a `meta::Regex` in the `regex-automata` crate. Those APIs accept a `&mut Cache` explicitly. In that case, you can use the `thread_local` crate, an actual `thread_local!`, or something else entirely.

I wish I could say this PR was a home run that fixed the contention issues with `Regex` once and for all, but it's not. It just makes things a little better by switching the pool from one stack to eight stacks, where a thread's stack is chosen by `self.stacks[thread_id % 8]`. It's a pretty dumb strategy, but it limits extra memory usage while at least reducing contention. Obviously, it works a lot better for the 8-16 thread case, and while it helps with the 64-128 thread case too, things are still pretty slow there.

A benchmark for this problem is described in #934. We compare 8 and 16 threads, and for each thread count, we compare a `cloned` and a `shared` approach. The `cloned` approach clones the regex before sending it to each thread, whereas the `shared` approach shares a single regex across multiple threads. The `cloned` approach is expected to be fast (and it is) because it forces each thread into the owner optimization. The `shared` approach, however, hits the shared stack behind a mutex and suffers majorly from contention.
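To make the two modes concrete: the actual repro program lives in #934, but a minimal harness of the same shape might look like the sketch below. This is an illustration, not the benchmark from that issue; the pattern, haystack, and iteration count are made up, while `REGEX_BENCH_WHICH` and `REGEX_BENCH_THREADS` mirror the environment variables used in the hyperfine runs.

```rust
// Hypothetical minimal repro in the spirit of the #934 benchmark (not the
// actual program): spawn N threads that each run many short searches, either
// on clones of one Regex ("cloned") or on a single shared Regex ("shared").
use std::{env, thread};

use regex::Regex;

fn main() {
    let which = env::var("REGEX_BENCH_WHICH").unwrap_or_else(|_| "shared".into());
    let threads: usize = env::var("REGEX_BENCH_THREADS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(8);
    let re = Regex::new(r"[a-zA-Z]+ing").unwrap();
    let haystack = "a short haystack with words ending in ing";

    thread::scope(|s| {
        for _ in 0..threads {
            if which == "cloned" {
                // Each thread gets its own clone (cheap: the internals are
                // reference counted), which forces each thread into the
                // "owner" optimization described above.
                let re = re.clone();
                s.spawn(move || {
                    for _ in 0..1_000_000 {
                        assert!(re.is_match(haystack));
                    }
                });
            } else {
                // All threads share one Regex, so every non-owner search
                // goes through the mutex-guarded pool.
                let re = &re;
                s.spawn(move || {
                    for _ in 0..1_000_000 {
                        assert!(re.is_match(haystack));
                    }
                });
            }
        }
    });
}
```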
Here's what that benchmark looks like before this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):       2.3 ms ±   0.4 ms    [User: 9.4 ms, System: 3.1 ms]
  Range (min … max):     1.8 ms …   3.5 ms    823 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):     161.6 ms ±   8.0 ms    [User: 472.4 ms, System: 477.5 ms]
  Range (min … max):   150.7 ms … 176.8 ms    18 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro' ran
   70.06 ± 11.43 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro'

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.5 ms ±   0.5 ms    [User: 26.1 ms, System: 5.2 ms]
  Range (min … max):     2.8 ms …   5.7 ms    576 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     433.9 ms ±   7.2 ms    [User: 1402.1 ms, System: 4377.1 ms]
  Range (min … max):   423.9 ms … 444.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
  122.25 ± 15.80 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```

And here's what it looks like after this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):       2.2 ms ±   0.4 ms    [User: 8.5 ms, System: 3.7 ms]
  Range (min … max):     1.7 ms …   3.4 ms    781 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):      24.6 ms ±   1.8 ms    [User: 141.0 ms, System: 1.2 ms]
  Range (min … max):    20.8 ms …  27.3 ms    116 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro' ran
   10.94 ± 2.05 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro'

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.6 ms ±   0.4 ms    [User: 26.8 ms, System: 4.4 ms]
  Range (min … max):     2.8 ms …   5.4 ms    574 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):      99.4 ms ±   5.4 ms    [User: 935.0 ms, System: 133.0 ms]
  Range (min … max):    85.6 ms … 109.9 ms    27 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   27.95 ± 3.48 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```

So instead of things getting about 122x slower in the 16 thread case, it "only" gets about 28x slower.
Other ideas for future work:

* Instead of a `Vec<Mutex<Vec<Cache>>>`, use a `Vec<LockFreeStack<Cache>>`. I'm not sure this will fully resolve the problem, but it's likely to make it better, I think. AFAIK, the main technical challenge here is coming up with a lock-free stack in the first place that avoids the ABA problem. Crossbeam in theory provides some primitives to help with this (epochs), but I don't want to add any new dependencies.
* Think up a completely different approach to the problem. I'm drawing a blank. (The `thread_local` crate is one such avenue, and the regex crate actually used to use `thread_local` for exactly this. But it led to huge memory usage in environments with lots of threads. Specifically, I believe its memory usage scales with the total number of threads that run a regex search, whereas I want memory usage to scale with the total number of threads *simultaneously* running a regex search. The lower level per-thread `Cache` route is sketched after this list.)
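For completeness, here is a hedged sketch of the lower level work-around mentioned in the quoted context: `meta::Regex` in `regex-automata` exposes search routines that take a `&mut Cache` explicitly, so a caller can keep one cache per thread and never touch the shared pool. The `create_cache` and `search_with` names below reflect my reading of the `regex-automata` 0.4 meta API; the helper, pattern, and inputs are purely illustrative.

```rust
// Sketch only: a worker that owns its Cache and reuses it for many searches,
// bypassing the Regex's internal pool. The Cache must come from the same
// Regex it is used with.
use regex_automata::{meta::Regex, Input};

fn count_matching(re: &Regex, haystacks: &[&str]) -> usize {
    // One Cache per thread/call site, created once and reused for every
    // search below.
    let mut cache = re.create_cache();
    haystacks
        .iter()
        .filter(|h| re.search_with(&mut cache, &Input::new(*h)).is_some())
        .count()
}
```

This trades synchronization for per-thread memory, which is exactly the scaling concern raised in the second bullet above.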
Ref #934. If folks have insights or opinions, I'd appreciate it if they shared them in #934 instead of this PR. :-) Thank you!

1 parent 9a505a1 commit 2aa6744

regex-automata/src/util/pool.rs

Lines changed: 75 additions & 7 deletions
```diff
@@ -268,6 +268,64 @@ mod inner {
     /// do.
     static THREAD_ID_DROPPED: usize = 2;
 
+    /// The number of stacks we use inside of the pool. These are only used for
+    /// non-owners. That is, these represent the "slow" path.
+    ///
+    /// In the original implementation of this pool, we only used a single
+    /// stack. While this might be okay for a couple threads, the prevalence of
+    /// 32, 64 and even 128 core CPUs has made it untenable. The contention
+    /// such an environment introduces when threads are doing a lot of searches
+    /// on short haystacks (a not uncommon use case) is palpable and leads to
+    /// huge slowdowns.
+    ///
+    /// This constant reflects a change from using one stack to the number of
+    /// stacks that this constant is set to. The stack for a particular thread
+    /// is simply chosen by `thread_id % MAX_POOL_STACKS`. The idea behind
+    /// this setup is that there should be a good chance that accesses to the
+    /// pool will be distributed over several stacks instead of all of them
+    /// converging to one.
+    ///
+    /// This is not a particularly smart or dynamic strategy. Fixing this to a
+    /// specific number has at least two downsides. First is that it will help,
+    /// say, an 8 core CPU more than it will a 128 core CPU. (But, crucially,
+    /// it will still help the 128 core case.) Second is that this may wind
+    /// up being a little wasteful with respect to memory usage. Namely, if a
+    /// regex is used on one thread and then moved to another thread, then it
+    /// could result in creating a new copy of the data in the pool even though
+    /// only one is actually needed.
+    ///
+    /// And that memory usage bit is why this is set to 8 and not, say, 64.
+    /// Keeping it at 8 limits, to an extent, how much unnecessary memory can
+    /// be allocated.
+    ///
+    /// In an ideal world, we'd be able to have something like this:
+    ///
+    /// * Grow the number of stacks as the number of concurrent callers
+    /// increases. I spent a little time trying this, but even just adding an
+    /// atomic addition/subtraction for each pop/push for tracking concurrent
+    /// callers led to a big perf hit. Since even more work would seemingly be
+    /// required than just an addition/subtraction, I abandoned this approach.
+    /// * The maximum amount of memory used should scale with respect to the
+    /// number of concurrent callers and *not* the total number of existing
+    /// threads. This is primarily why the `thread_local` crate isn't used,
+    /// as some environments spin up a lot of threads. This led to multiple
+    /// reports of extremely high memory usage (often described as memory
+    /// leaks).
+    /// * Even more ideally, the pool could contract in size. That is, it
+    /// should grow with bursts and then shrink. But this is a pretty thorny
+    /// issue to tackle and it might be better to just not.
+    /// * It would be nice to explore the use of, say, a lock-free stack
+    /// instead of using a mutex to guard a `Vec` that is ultimately just
+    /// treated as a stack. The main thing preventing me from exploring this
+    /// is the ABA problem. The `crossbeam` crate has tools for dealing with
+    /// this sort of problem (via its epoch based memory reclamation strategy),
+    /// but I can't justify bringing in all of `crossbeam` as a dependency of
+    /// `regex` for this.
+    ///
+    /// See this issue for more context and discussion:
+    /// https://github.com/rust-lang/regex/issues/934
+    const MAX_POOL_STACKS: usize = 8;
+
     thread_local!(
         /// A thread local used to assign an ID to a thread.
         static THREAD_ID: usize = {
@@ -299,12 +357,12 @@ mod inner {
     /// This makes the common case of running a regex within a single thread
     /// faster by avoiding mutex unlocking.
     pub(super) struct Pool<T, F> {
-        /// A stack of T values to hand out. These are used when a Pool is
-        /// accessed by a thread that didn't create it.
-        stack: Mutex<Vec<Box<T>>>,
         /// A function to create more T values when stack is empty and a caller
         /// has requested a T.
         create: F,
+        /// A stack of T values to hand out. These are used when a Pool is
+        /// accessed by a thread that didn't create it.
+        stacks: Vec<Mutex<Vec<Box<T>>>>,
         /// The ID of the thread that owns this pool. The owner is the thread
         /// that makes the first call to 'get'. When the owner calls 'get', it
         /// gets 'owner_val' directly instead of returning a T from 'stack'.
@@ -375,9 +433,13 @@ mod inner {
             // 'Pool::new' method as 'const' too. (The alloc-only Pool::new
             // is already 'const', so that should "just work" too.) The only
             // thing we're waiting for is Mutex::new to be const.
+            let mut stacks = Vec::with_capacity(MAX_POOL_STACKS);
+            for _ in 0..MAX_POOL_STACKS {
+                stacks.push(Mutex::new(vec![]));
+            }
             let owner = AtomicUsize::new(THREAD_ID_UNOWNED);
             let owner_val = UnsafeCell::new(None); // init'd on first access
-            Pool { stack: Mutex::new(vec![]), create, owner, owner_val }
+            Pool { create, stacks, owner, owner_val }
         }
     }
 
@@ -401,6 +463,9 @@ mod inner {
             let caller = THREAD_ID.with(|id| *id);
             let owner = self.owner.load(Ordering::Acquire);
             if caller == owner {
+                // N.B. We could also do a CAS here instead of a load/store,
+                // but ad hoc benchmarking suggests it is slower. And a lot
+                // slower in the case where `get_slow` is common.
                 self.owner.store(THREAD_ID_INUSE, Ordering::Release);
                 return self.guard_owned(caller);
             }
@@ -444,7 +509,8 @@ mod inner {
                     return self.guard_owned(caller);
                 }
             }
-            let mut stack = self.stack.lock().unwrap();
+            let stack_id = caller % MAX_POOL_STACKS;
+            let mut stack = self.stacks[stack_id].lock().unwrap();
             let value = match stack.pop() {
                 None => Box::new((self.create)()),
                 Some(value) => value,
@@ -456,7 +522,9 @@ mod inner {
     /// Once the guard that's returned by 'get' is dropped, it is put back
     /// into the pool automatically.
     fn put_value(&self, value: Box<T>) {
-        let mut stack = self.stack.lock().unwrap();
+        let caller = THREAD_ID.with(|id| *id);
+        let stack_id = caller % MAX_POOL_STACKS;
+        let mut stack = self.stacks[stack_id].lock().unwrap();
         stack.push(value);
     }
 
@@ -474,7 +542,7 @@ mod inner {
     impl<T: core::fmt::Debug, F> core::fmt::Debug for Pool<T, F> {
         fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
             f.debug_struct("Pool")
-                .field("stack", &self.stack)
+                .field("stacks", &self.stacks)
                 .field("owner", &self.owner)
                 .field("owner_val", &self.owner_val)
                 .finish()
```
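Stepping back from the diff, the data-structure change amounts to sharding the pool's stack by thread ID. A toy standalone version of the idea is sketched below; the names are illustrative and it omits the owner-thread fast path that the real `Pool` layers on top.

```rust
// A toy illustration of the sharding idea in this commit (not the crate's
// actual Pool): values live in several mutex-guarded stacks and a thread
// picks a stack by `thread_id % SHARDS`, so unrelated threads usually
// contend on different locks.
use std::sync::Mutex;

const SHARDS: usize = 8;

struct ShardedPool<T> {
    stacks: Vec<Mutex<Vec<T>>>,
    create: fn() -> T,
}

impl<T> ShardedPool<T> {
    fn new(create: fn() -> T) -> Self {
        let stacks = (0..SHARDS).map(|_| Mutex::new(Vec::new())).collect();
        ShardedPool { stacks, create }
    }

    fn get(&self, thread_id: usize) -> T {
        let shard = thread_id % SHARDS;
        // Only threads that map to the same shard contend on this lock.
        self.stacks[shard]
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| (self.create)())
    }

    fn put(&self, thread_id: usize, value: T) {
        let shard = thread_id % SHARDS;
        self.stacks[shard].lock().unwrap().push(value);
    }
}
```

The real change additionally keeps the owner-thread fast path, so the single threaded case still skips the mutexes entirely.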
