
Commit 5797b1c

workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a worker_pool. One of the roles that a pwq plays is enforcement of the max_active concurrency limit. Before 636b927 ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU for per-cpu workqueues and per each NUMA node for unbound workqueues, which was a natural result of per-cpu workqueues being served by per-cpu pools and unbound workqueues by per-NUMA pools.

In terms of max_active enforcement, this was, while not perfect, workable. For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great in that NUMA machines would get max_active multiplied by the number of nodes, but it didn't cause huge problems because NUMA machines are relatively rare and the node count is usually pretty low.

However, cache layouts are more complex now and sharing a worker pool across a whole node didn't really work well for unbound workqueues. Thus, a series of commits culminating in 8639ece ("workqueue: Implement non-strict affinity scope for unbound workqueues") implemented a more flexible affinity mechanism for unbound workqueues which enables using e.g. last-level-cache aligned pools. In the process, 636b927 ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") made unbound workqueues use per-cpu pwqs like per-cpu workqueues.

While the change was necessary to enable more flexible affinity scopes, it came with the side effect of blowing up the effective max_active for unbound workqueues. Before, the effective max_active for unbound workqueues was multiplied by the number of nodes. After, by the number of CPUs.

636b927 ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") claims that this should generally be okay. It is okay for users which self-regulate their concurrency level, which are the vast majority; however, there are enough use cases which actually depend on max_active to prevent the level of concurrency from going bonkers, including several IO handling workqueues that can issue a work item for each in-flight IO. With targeted benchmarks, the misbehavior can easily be exposed as reported in http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Unfortunately, there is no way to express what these use cases need using per-cpu max_active. A single CPU may issue most of the in-flight IOs, so we don't want to set max_active too low; but as soon as we increase max_active a bit, we can end up with an unreasonable number of in-flight work items when many CPUs issue IOs at the same time. i.e. the lowest acceptable max_active is higher than the highest acceptable max_active (rough numbers illustrating this follow at the end of this description).

Ideally, max_active for an unbound workqueue should be system-wide so that users can regulate the total level of concurrency regardless of node and cache layout. The reasons workqueue hasn't implemented that yet are:

- Once max_active enforcement is decoupled from pool boundaries, chaining execution after a work item finishes requires inter-pool operations, which would require lock dancing, which is nasty.

- Sharing a single nr_active count across the whole system can be pretty expensive on NUMA machines.

- Per-pwq enforcement had been more or less okay while we were using per-node pools.

It looks like we can no longer avoid decoupling max_active enforcement from pool boundaries.
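As a rough illustration of the dilemma above, with made-up numbers (the machine shape here is not from the commit): on a 2-node, 128-CPU system with max_active = 8, per-node enforcement caps the workqueue at 8 * 2 = 16 in-flight work items system-wide, while per-cpu enforcement allows up to 8 * 128 = 1024. Lowering max_active to tame that total would, however, also choke a single CPU that happens to issue most of the IOs.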
This patch implements a system-wide nr_active mechanism with the following design characteristics:

- To avoid sharing a single counter across multiple nodes, the configured max_active is split across nodes according to the proportion of each workqueue's online effective CPUs per node. e.g. a node with twice as many online effective CPUs will get twice the portion of max_active (an illustrative sketch of this split follows the tags below).

- Workqueue used to be able to process a chain of interdependent work items which is as long as max_active. We can't do this anymore as max_active is distributed across the nodes. Instead, a new parameter min_active is introduced which determines the minimum level of concurrency within a node regardless of how the max_active distribution comes out to be. It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE, which is 8. This can lead to a higher effective max_active than configured and also to deadlocks if a workqueue was depending on being able to handle chains of interdependent work items that are longer than 8. I believe these should be fine given that the number of CPUs in each NUMA node is usually higher than 8 and a work item chain longer than 8 is pretty unlikely. However, if these assumptions turn out to be wrong, we'll need to add an interface to adjust min_active.

- Each unbound wq has an array of struct wq_node_nr_active which tracks per-node nr_active. When its pwq wants to run a work item, it has to obtain the matching node's nr_active. If over the node's max_active, the pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish, the completion path round-robins the pending pwqs, activating the first inactive work item of each, which involves some pool lock dancing and kicking other pools. It's not the simplest code but doesn't look too bad.

v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
    - wq_adjust_max_active() is now protected by wq->mutex instead of wq_pool_mutex.

v3: - wq_node_max_active() used to calculate the per-node max_active on the fly based on system-wide CPU online states. Lai pointed out that this can lead to skewed distributions for workqueues with restricted cpumasks. Update the max_active distribution to use per-workqueue effective online CPU counts instead of system-wide and cache the calculation results in node_nr_active->max.

v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Naohiro Aota <[email protected]>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927 ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <[email protected]>
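To make the per-node split described in the first design point concrete, here is a minimal userspace sketch. It is not the kernel implementation (the real logic lives in wq_update_node_max_active() and differs in detail); the helper name, the rounding choice, and the example CPU counts are all invented for illustration.

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Proportional share of max_active for one node, never below min_active. */
static int node_max_active(int max_active, int min_active,
			   int node_cpus, int total_cpus)
{
	int share = total_cpus ? DIV_ROUND_UP(max_active * node_cpus, total_cpus)
			       : max_active;

	return share > min_active ? share : min_active;
}

int main(void)
{
	/* e.g. max_active = 16, WQ_DFL_MIN_ACTIVE = 8, two nodes with 12 and 4 CPUs */
	int cpus[] = { 12, 4 };

	for (int n = 0; n < 2; n++)
		printf("node %d: max_active %d\n", n,
		       node_max_active(16, 8, cpus[n], 16));
	return 0;
}

With these made-up numbers the per-node limits come out to 12 and 8; their sum (20) exceeds the configured 16, matching the note above that min_active can push the effective total above max_active.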
1 parent 91ccc6e commit 5797b1c

2 files changed: +341 −35 lines changed

include/linux/workqueue.h

Lines changed: 32 additions & 3 deletions
@@ -398,6 +398,13 @@ enum wq_consts {
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_UNBOUND_MAX_ACTIVE	= WQ_MAX_ACTIVE,
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
+
+	/*
+	 * Per-node default cap on min_active. Unless explicitly set, min_active
+	 * is set to min(max_active, WQ_DFL_MIN_ACTIVE). For more details, see
+	 * workqueue_struct->min_active definition.
+	 */
+	WQ_DFL_MIN_ACTIVE	= 8,
 };

 /*
@@ -440,11 +447,33 @@ extern struct workqueue_struct *system_freezable_power_efficient_wq;
 * alloc_workqueue - allocate a workqueue
 * @fmt: printf format for the name of the workqueue
 * @flags: WQ_* flags
-* @max_active: max in-flight work items per CPU, 0 for default
+* @max_active: max in-flight work items, 0 for default
 * remaining args: args for @fmt
 *
-* Allocate a workqueue with the specified parameters. For detailed
-* information on WQ_* flags, please refer to
+* For a per-cpu workqueue, @max_active limits the number of in-flight work
+* items for each CPU. e.g. @max_active of 1 indicates that each CPU can be
+* executing at most one work item for the workqueue.
+*
+* For unbound workqueues, @max_active limits the number of in-flight work items
+* for the whole system. e.g. @max_active of 16 indicates that there can be
+* at most 16 work items executing for the workqueue in the whole system.
+*
+* As sharing the same active counter for an unbound workqueue across multiple
+* NUMA nodes can be expensive, @max_active is distributed to each NUMA node
+* according to the proportion of the number of online CPUs and enforced
+* independently.
+*
+* Depending on online CPU distribution, a node may end up with per-node
+* max_active which is significantly lower than @max_active, which can lead to
+* deadlocks if the per-node concurrency limit is lower than the maximum number
+* of interdependent work items for the workqueue.
+*
+* To guarantee forward progress regardless of online CPU distribution, the
+* concurrency limit on every node is guaranteed to be equal to or greater than
+* min_active which is set to min(@max_active, %WQ_DFL_MIN_ACTIVE). This means
+* that the sum of per-node max_active's may be larger than @max_active.
+*
+* For detailed information on %WQ_* flags, please refer to
 * Documentation/core-api/workqueue.rst.
 *
 * RETURNS:
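As a brief usage illustration of the documented semantics (the workqueue name, the value 16, and the init function are arbitrary choices for this sketch, not part of the commit): for an unbound workqueue, the max_active argument now caps in-flight work items system-wide rather than per CPU.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *io_wq;

static int __init io_wq_example_init(void)
{
	/* unbound: at most 16 work items in flight across the whole system */
	io_wq = alloc_workqueue("io_wq", WQ_UNBOUND, 16);
	if (!io_wq)
		return -ENOMEM;
	return 0;
}

The same call without WQ_UNBOUND would instead allow up to 16 in-flight work items on each CPU, per the per-cpu paragraph in the comment above.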
