
[26.0] Batch-prefetch HDCA job_state_summary in history contents listing#22256

Merged
mvdbeek merged 1 commit intogalaxyproject:release_26.0from
mvdbeek:fix-n-plus-one-history-contents
Mar 27, 2026

Conversation

@mvdbeek
Member

@mvdbeek mvdbeek commented Mar 25, 2026

When listing history contents, each HDCA triggered an individual SQL query to aggregate job states during serialization. This adds a batch query that prefetches all job state summaries in a single query before serialization begins, injecting results into the HDCA instances' cache.

Before:
  HDCA object loading:   1 query   (batch IN)
  Job state summaries:   N queries (one per HDCA)
  Total for HDCAs:       1 + N queries

After:
  HDCA object loading:   1 query   (batch IN)
  Job state summaries:   1 query   (batch IN + GROUP BY)
  Total for HDCAs:       2 queries
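The batch IN + GROUP BY query can be sketched roughly as follows. This is an illustrative toy using sqlite3, not Galaxy's actual model layer: the table and column names are assumptions, and only the function name `batch_fetch_job_state_summaries` comes from the PR itself.

```python
# Illustrative sketch of the "batch IN + GROUP BY" prefetch, using a toy
# sqlite schema (table/column names are assumptions, not Galaxy's model).
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, hdca_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO job (hdca_id, state) VALUES (?, ?)",
    [(1, "ok"), (1, "ok"), (1, "error"), (2, "running"), (3, "ok")],
)

def batch_fetch_job_state_summaries(conn, hdca_ids):
    """Aggregate job states for every HDCA id in a single query."""
    placeholders = ",".join("?" * len(hdca_ids))
    rows = conn.execute(
        f"SELECT hdca_id, state, COUNT(*) FROM job "
        f"WHERE hdca_id IN ({placeholders}) GROUP BY hdca_id, state",
        hdca_ids,
    )
    summaries = defaultdict(dict)
    for hdca_id, state, count in rows:
        summaries[hdca_id][state] = count
    return dict(summaries)
```

Per the description, the results of such a query would then be written into each HDCA instance's `_job_state_summary` cache before serialization, so the serializer never issues its own per-HDCA query.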

Benchmark test results:

  • Old per-HDCA path emits at least N queries for N HDCAs
  • batch_fetch_job_state_summaries emits exactly 1 query regardless of HDCA count
  • Batch results are identical to per-HDCA results
  • Injecting batch results into _job_state_summary cache prevents any further queries
  • Wall-clock benchmark across 1/10/50/100 HDCAs confirms batch path uses fewer queries
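The query-count claims in the list above can be verified with a statement-level trace. A minimal sketch of that idea, using sqlite3's trace callback on a toy schema (the real test in the gist presumably hooks SQLAlchemy's event system instead):

```python
# Sketch of counting emitted SQL statements to contrast the per-HDCA (N+1)
# path with the single batch statement. Toy schema; not Galaxy's model.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, hdca_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO job (hdca_id, state) VALUES (?, ?)",
    [(i, "ok") for i in range(1, 101)],
)

statements = []
conn.set_trace_callback(statements.append)  # invoked once per executed statement

# Old path: one aggregate query per HDCA -> N statements for N HDCAs.
for hdca_id in range(1, 101):
    conn.execute(
        "SELECT state, COUNT(*) FROM job WHERE hdca_id = ? GROUP BY state",
        (hdca_id,),
    ).fetchall()
old_count = len(statements)

# Batch path: a single IN + GROUP BY statement regardless of HDCA count.
statements.clear()
ids = list(range(1, 101))
conn.execute(
    "SELECT hdca_id, state, COUNT(*) FROM job WHERE hdca_id IN (%s) "
    "GROUP BY hdca_id, state" % ",".join("?" * len(ids)),
    ids,
).fetchall()
batch_count = len(statements)

print(old_count, batch_count)  # 100 1
```

Counting statements rather than timing makes the assertion deterministic, which is why the bullet points above are phrased in terms of query counts.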

Gist: https://gist.github.com/mvdbeek/6806938201eef20470285bc18267747d

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@mvdbeek mvdbeek force-pushed the fix-n-plus-one-history-contents branch from 204e7c4 to 929deb9 Compare March 25, 2026 11:51
@github-actions github-actions Bot added this to the 26.1 milestone Mar 25, 2026
@github-actions github-actions Bot changed the title Batch-prefetch HDCA job_state_summary in history contents listing [26.0] Batch-prefetch HDCA job_state_summary in history contents listing Mar 25, 2026
@jmchilton
Member

Wall-clock benchmark across 1/10/50/100 HDCAs confirms batch path uses fewer queries

You're replacing code I understand with code I don't - which I think is normal for performance tweaks, and given that I'm old and dense - but there is no evidence of a problem or that this fixes it. That sentence doesn't say whether it reduces contention or increases it, or whether it actually runs faster - it just verifies fewer queries. I suspect it might improve things - but I assume there are situations where locking over a longer database operation would actually make things worse - though I don't know if such locking even occurs in this query.

Maybe this is just a sales-pitch issue... this feels like a stab in the dark, and I would take a Marius "I think this might make a problem less bad - I want to deploy to production and test things" over a Claude "Here I did wall-clock testing and found... suspiciously not a wall-clock result but a query-counting result".

I trust you implicitly to monitor the situation and fix it if things go wrong - so I'm going to approve either way, obviously. I don't get my hands dirty with deployment stuff, so I'm eternally grateful that you do, and that you do such an amazing job at it.

@mvdbeek
Member Author

mvdbeek commented Mar 26, 2026

but there is no evidence of a problem

Fair - I didn't make this explicit in the PR description.

[Screenshot: Sentry route metrics, 2026-03-26 09:35]

It is among the most heavily used routes and has the worst p95 response time among them, so that is a data-driven target to optimize. You can see the overhead and the N+1 pattern quite clearly in https://sentry.galaxyproject.org/organizations/galaxy/insights/backend/summary/trace/d3d0b63c60fe4e068d1a3c1376aa6af6/?node=txn-1d57c06219c549ac8aa8aa9b98c1bb25&project=3&query=http.method%3AGET%20transaction.op%3Ahttp.server&referrer=performance-transaction-summary&source=performance_transaction_summary&statsPeriod=12h&timestamp=1774474456&transaction=%2Fapi%2Fhistories%2F%7Bhistory_id%7D%2Fcontents&unselectedSeries=p100%28%29&unselectedSeries=avg%28%29:
[Screenshot: Sentry trace showing the N+1 query pattern, 2026-03-26 09:38]

or that this fixes it

The gist has a unit test that prints some data points that I think I can believe (but production is of course always a different story, I concede that):

test/unit/data/model/test_batch_job_state_summary.py::test_timing_comparison[1]
  HDCAs=  1  |  Old:   1 queries 0.0011s  |  Batch:   1 queries 0.0009s  |  Speedup: 1.2x
PASSED
test/unit/data/model/test_batch_job_state_summary.py::test_timing_comparison[10]
  HDCAs= 10  |  Old:  10 queries 0.0076s  |  Batch:   1 queries 0.0013s  |  Speedup: 5.8x
PASSED
test/unit/data/model/test_batch_job_state_summary.py::test_timing_comparison[50]
  HDCAs= 50  |  Old:  50 queries 0.0369s  |  Batch:   1 queries 0.0022s  |  Speedup: 16.7x
PASSED
test/unit/data/model/test_batch_job_state_summary.py::test_timing_comparison[100]
  HDCAs=100  |  Old: 100 queries 0.0754s  |  Batch:   1 queries 0.0038s  |  Speedup: 19.7x
PASSED

@mvdbeek
Member Author

mvdbeek commented Mar 26, 2026

I'll deploy and report back

@mvdbeek
Member Author

mvdbeek commented Mar 27, 2026

It isn't a silver bullet, since there are remaining N+1 queries in this endpoint, but it seems to have dropped the p95 response time from 1.2 seconds to 950 ms in comparable 12-hour windows (my morning, no load peaks). The generated SQL is fairly readable imo (compared to the caching queries, for instance).

@mvdbeek mvdbeek merged commit 0cc0b06 into galaxyproject:release_26.0 Mar 27, 2026
56 of 59 checks passed
@github-project-automation github-project-automation Bot moved this from Needs Review to Done in Galaxy Dev - weeklies Mar 27, 2026
@ahmedhamidawan ahmedhamidawan modified the milestones: 26.1, 26.0 Mar 31, 2026
