
[26.0] Add debug middleware and regression tests for blocked main event loop #22207

Merged
jmchilton merged 4 commits into galaxyproject:release_26.0 from mvdbeek:debug_event_loop_and_add_testing on Apr 21, 2026

@mvdbeek (Member) commented Mar 21, 2026

Add middleware that monitors the asyncio event loop for blocking calls:

  • EventLoopWatchdog: daemon thread that probes the event loop and logs
    stack traces (at ERROR level, sent to Sentry) when it's unresponsive
  • EventLoopWatchdogMiddleware: ASGI middleware emitting per-request
    Server-Timing headers with event loop lag and nginx queue time
  • Enabled via galaxy.yml: event_loop_watchdog_threshold:
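
To illustrate the watchdog idea described above, here is a minimal sketch of a daemon thread that pings an event loop and records the loop thread's stack when the ping goes unanswered. It is not the PR's implementation: the class name `LoopWatchdog`, its parameters, and the `reports` list are hypothetical, and a real watchdog would log at ERROR level rather than collect strings.

```python
import asyncio
import sys
import threading
import time
import traceback


class LoopWatchdog(threading.Thread):
    """Hypothetical watchdog: pings the loop and records a stack when it stalls."""

    def __init__(self, loop, loop_thread_id, threshold=0.1, interval=0.01):
        super().__init__(daemon=True)
        self.loop = loop
        self.loop_thread_id = loop_thread_id
        self.threshold = threshold  # seconds the loop may take to answer a ping
        self.interval = interval    # pause between pings
        self.reports = []           # captured stack traces (stand-in for logging)
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            answered = threading.Event()
            try:
                # The callback only runs once the loop gets back to its scheduler.
                self.loop.call_soon_threadsafe(answered.set)
            except RuntimeError:
                return  # the loop has been closed
            if not answered.wait(self.threshold):
                # No answer in time: snapshot what the loop thread is executing.
                frame = sys._current_frames().get(self.loop_thread_id)
                if frame is not None:
                    self.reports.append("".join(traceback.format_stack(frame)))
                answered.wait(timeout=5.0)  # let the loop recover before re-probing
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()


async def demo():
    dog = LoopWatchdog(asyncio.get_running_loop(), threading.get_ident())
    dog.start()
    await asyncio.sleep(0.05)  # healthy: pings are answered promptly
    time.sleep(0.5)            # blocking call on the event loop thread
    await asyncio.sleep(0.05)
    dog.stop()
    return dog.reports


reports = asyncio.run(demo())
print(f"captured {len(reports)} stall report(s)")
```

Probing from a separate thread matters here: a coroutine-based probe cannot run while the loop is blocked, so it could only report the stall after the fact and without the offending stack.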

Add /api/debug/block and /api/debug/ok diagnostic endpoints to
simulate and observe event loop blocking.

Add test infrastructure to fail API tests on event loop blocking:

  • ApiTestInteractor checks Server-Timing header on every response
  • Test server auto-enables watchdog when env var is set
  • Enable in CI: GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD=0.05
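
The response check in the test interactor could look roughly like the sketch below. It parses a `Server-Timing` header (entries separated by commas, parameters by semicolons, `dur` in milliseconds) and raises when a lag metric exceeds a threshold given in seconds. The metric name `event_loop_lag` and both function names are assumptions made for illustration; the PR text does not state the actual metric names.

```python
def parse_server_timing(header: str) -> dict[str, float]:
    """Return {metric name: duration in ms} for entries that carry a dur= value."""
    metrics = {}
    for entry in header.split(","):
        parts = [p.strip() for p in entry.split(";")]
        name = parts[0]
        if not name:
            continue
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "dur":
                try:
                    metrics[name] = float(value.strip().strip('"'))
                except ValueError:
                    pass  # ignore malformed durations
    return metrics


def assert_loop_not_blocked(headers, threshold_seconds, metric="event_loop_lag"):
    """Fail if the (hypothetical) lag metric exceeds the threshold."""
    timings = parse_server_timing(headers.get("Server-Timing", ""))
    lag_ms = timings.get(metric)
    if lag_ms is not None and lag_ms / 1000.0 > threshold_seconds:
        raise AssertionError(
            f"event loop blocked for {lag_ms:.1f} ms "
            f"(threshold {threshold_seconds * 1000:.1f} ms)"
        )


# A response under the 0.05 s threshold passes silently.
assert_loop_not_blocked({"Server-Timing": "event_loop_lag;dur=12.5, queue;dur=3"}, 0.05)
```

Responses without the header are accepted, so the check stays a no-op when the watchdog is disabled.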

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions (bot) changed the title from "Debug event loop and add testing" to "[26.0] Debug event loop and add testing" on Mar 21, 2026
Comment thread lib/galaxy_test/api/test_event_loop_blocking.py Outdated
@mvdbeek mvdbeek force-pushed the debug_event_loop_and_add_testing branch from 9003f8b to a0f3685 Compare March 23, 2026 15:12
@mvdbeek changed the title from "[26.0] Debug event loop and add testing" to "[26.0] Add debug middleware and regression tests for blocked main event loop" on Mar 23, 2026
@mvdbeek (Member, Author) commented Mar 24, 2026

This testing is really worth something! See encode/httpx#3707

@mvdbeek mvdbeek force-pushed the debug_event_loop_and_add_testing branch from 99215a2 to 1b9e4a1 Compare March 24, 2026 08:39
@mvdbeek mvdbeek force-pushed the debug_event_loop_and_add_testing branch 4 times, most recently from 5b9c0aa to 3be16b0 Compare April 20, 2026 21:25
mvdbeek added 4 commits April 21, 2026 10:55
Integrate aiocop (https://github.com/Feverup/aiocop), which uses
sys.audit hooks to catch specific blocking syscalls (socket.connect,
getaddrinfo, subprocess, open, etc.) from inside async tasks and report
the exact call site.
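
The underlying mechanism can be sketched with the standard library alone. The example below registers a `sys.addaudithook` hook that records selected audit events only when they fire on a thread with a running asyncio loop, together with the calling function and line. This is an illustration of the technique, not aiocop's API: `BLOCKING_EVENTS`, `violations`, and `_hook` are hypothetical names, and the event set is a small sample.

```python
import asyncio
import os
import sys

# Audit events raised by CPython for calls that can block.
BLOCKING_EVENTS = {"open", "socket.connect", "socket.getaddrinfo", "subprocess.Popen"}
violations = []  # (event, calling function, line number)


def _hook(event, args):
    if event not in BLOCKING_EVENTS:
        return
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return  # no loop running on this thread, so nothing is being blocked
    # The hook's parent frame is the Python code that triggered the event.
    caller = sys._getframe(1)
    violations.append((event, caller.f_code.co_name, caller.f_lineno))


# Audit hooks cannot be removed once added, so this runs once per process.
sys.addaudithook(_hook)


async def handler():
    with open(os.devnull) as fh:  # blocking file I/O inside a coroutine
        fh.read()


def sync_caller():
    with open(os.devnull) as fh:  # same call, but outside any event loop
        fh.read()


sync_caller()
asyncio.run(handler())
print(violations)
```

Only the call made from `handler` is recorded, because `asyncio.get_running_loop()` succeeds only on a thread that is currently running a loop.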

aiocop is pinned to the event-loop thread explicitly because Galaxy's
test harness runs uvicorn in a non-main thread; activation happens on
the ASGI lifespan startup event so the first request is monitored. An
AiocopMiddleware surfaces captured events per-request via an
X-Aiocop-Violations header (count/max-severity/first) so the test
interactor can fail requests on high-severity blocking I/O.
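
As a rough picture of how captured events might be surfaced per request, here is a dependency-free ASGI middleware sketch that appends a header when new violations were recorded while the request was handled. The header name `x-demo-violations`, its `count=...; first=...` value format, and the shared `violations` list are all assumptions for illustration and do not reproduce the actual `X-Aiocop-Violations` format.

```python
import asyncio


class ViolationHeaderMiddleware:
    """Hypothetical middleware: reports violations recorded during one request."""

    def __init__(self, app, violations):
        self.app = app
        self.violations = violations  # list that some detector appends to

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        # Note: a shared list mixes concurrent requests; this is only a sketch.
        start = len(self.violations)

        async def send_with_header(message):
            if message["type"] == "http.response.start":
                seen = self.violations[start:]
                if seen:
                    value = f"count={len(seen)}; first={seen[0]}"
                    headers = list(message.get("headers", []))
                    headers.append((b"x-demo-violations", value.encode("latin-1")))
                    message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_header)


violations = []


async def app(scope, receive, send):
    violations.append("open")  # pretend a detector flagged blocking I/O
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call_once():
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    await ViolationHeaderMiddleware(app, violations)({"type": "http"}, receive, send)
    return sent


sent = asyncio.run(call_once())
print(sent[0]["headers"])
```

Because headers go out with `http.response.start`, only violations recorded before the response begins can be reported this way.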

Integration tests spin up a fresh event loop on a new thread per test
module via driver_util.uvicorn_serve, and aiocop has process-global
state (audit hook registration, patch_audit_functions side effects,
detect_slow_tasks's _detect_slow_tasks_configured guard) that must not
be repeated. install_aiocop() is therefore split into a once-per-process
block (audit patching, audit hook registration) and a per-Galaxy-instance
re-arm (main-thread rebinding, callback clear+register, detect_slow_tasks
reset) so each new Galaxy instance actually gets its loop monitored.

Enabled by default under GALAXY_TEST_AIOCOP=1 in run_tests.sh; set to 0
to disable.

TabularToolDataField.to_dict() performs os.path.isdir, glob, and
os.path.getsize (called twice via get_fingerprint), which blocked the
event loop in the async show_field handler and tripped aiocop.

Previously resolved fresh per request via lagom, which re-ran
UrlHeadersConfigFactory.from_app_config and reopened the YAML config
from disk on the async event loop for every /api/tools/fetch and
/api/workflows/* call. All its peer managers with the same dependency
shape are already singletons.

CitationsManager.__init__ constructs a DoiCache, which instantiates a
beaker CacheManager that creates cache data/lock directories
(os.makedirs) and opens lock files. Previously this ran on the event
loop for every /api/tools/*/citations etc. request. All its peer
managers constructed alongside it in _configure_toolbox are singletons.
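
The cost difference between per-request resolution and a singleton, as described in the last two commit messages, can be shown with a small stand-alone sketch. It does not use lagom or Galaxy's container; `ExpensiveManager`, `resolve_per_request`, and `resolve_singleton` are hypothetical names, and the counter stands in for disk access such as reading a YAML file or creating cache directories.

```python
from functools import lru_cache

construct_count = 0  # stands in for blocking disk I/O done in __init__


class ExpensiveManager:
    def __init__(self):
        global construct_count
        construct_count += 1


def resolve_per_request():
    # A fresh instance per request repeats the expensive setup every time.
    return ExpensiveManager()


@lru_cache(maxsize=None)
def resolve_singleton():
    # The first call constructs the instance; later calls reuse it.
    return ExpensiveManager()


for _ in range(3):
    resolve_per_request()
per_request_cost = construct_count

for _ in range(3):
    resolve_singleton()
singleton_cost = construct_count - per_request_cost

print(per_request_cost, singleton_cost)
```

With a singleton the blocking setup runs once at first use, or at application startup if resolved eagerly, instead of on the event loop for every request.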
@mvdbeek mvdbeek force-pushed the debug_event_loop_and_add_testing branch from 3be16b0 to 90fa520 Compare April 21, 2026 08:57
@mvdbeek mvdbeek marked this pull request as ready for review April 21, 2026 13:19
@mvdbeek (Member, Author) commented Apr 21, 2026

Test failures should all be unrelated; the last 3 commits fix actual bugs.

@github-actions github-actions Bot added this to the 26.1 milestone Apr 21, 2026
@jmchilton jmchilton merged commit a2396b0 into galaxyproject:release_26.0 Apr 21, 2026
56 of 63 checks passed
@github-project-automation github-project-automation Bot moved this from Needs Review to Done in Galaxy Dev - weeklies Apr 21, 2026
@github-actions

This PR was merged without a "kind/" label, please correct.

@nsoranzo nsoranzo deleted the debug_event_loop_and_add_testing branch April 21, 2026 15:14