
[V1][Frontend] Improve Shutdown And Logs #11737


Merged: 172 commits into vllm-project:main on Apr 17, 2025

Conversation

@robertgshaw2-redhat (Collaborator) commented Jan 4, 2025

SUMMARY:

  • Prior to this PR, if we encountered an error in a background process, we killed the process tree immediately, which meant that we could not clean up resources or return good status codes to clients. This PR overhauls the error handling to instead shut down the background processes and raise errors that let us return proper HTTP status codes to users (a rough sketch of this behavior follows this list).
  • Prior to this PR, we did not shut down properly when errors occurred during startup, especially in the TP case.
  • Prior to this PR, we used signals to catch errors from background processes. Due to limitations of Python, this prevented us from running outside the main thread, which is a problem for deployments in Triton Inference Server.
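A minimal sketch of the intended client-facing behavior, assuming a FastAPI app and an illustrative EngineDeadError exception type (neither name is taken from the actual vLLM code): an engine failure surfaces as a well-formed HTTP response instead of a killed process tree.

```python
# Hypothetical sketch, not the actual vLLM code: map an "engine dead" failure
# to a proper HTTP status code so clients see an error response rather than a
# dropped connection, and the server can then shut down cleanly.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


class EngineDeadError(RuntimeError):
    """Illustrative error type raised when the background engine process has died."""


@app.exception_handler(EngineDeadError)
async def engine_dead_handler(request: Request, exc: EngineDeadError):
    # Return a well-formed 500 to the client instead of killing the process tree.
    return JSONResponse(status_code=500, content={"error": str(exc)})


@app.post("/v1/completions")
async def completions(request: Request):
    # Simulated failure path for the sketch.
    raise EngineDeadError("engine background process failed")
```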

DESIGN:

  • For errors during startup, we wrap the __init__ code in try/except and push FAILED over the ready pipe. This works well since the parent processes are already waiting for confirmation (see the sketch after this list).
  • For errors during runtime, we wrap the busy loops in try/except and push failure messages over the existing IPC mechanisms.
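
A minimal sketch of the two error paths using the standard multiprocessing module; the READY/FAILED markers, worker_main, and the simulated failure are illustrative names, not the actual vLLM identifiers.

```python
import multiprocessing as mp

READY, FAILED = b"READY", b"FAILED"


def worker_main(ready_writer, outputs):
    # Startup path: wrap the __init__-style work in try/except and report the
    # result over the ready pipe, since the parent is blocked waiting for it.
    try:
        state = {"step": 0}  # stand-in for expensive engine/worker initialization
        ready_writer.send_bytes(READY)
    except Exception:
        ready_writer.send_bytes(FAILED)
        return
    finally:
        ready_writer.close()

    # Runtime path: wrap the busy loop in try/except and push a failure message
    # over the existing IPC channel instead of dying silently.
    try:
        while True:
            state["step"] += 1
            if state["step"] == 3:
                raise RuntimeError("simulated engine failure")
            outputs.put(("OK", state["step"]))
    except Exception as e:
        outputs.put(("ENGINE_DEAD", repr(e)))


if __name__ == "__main__":
    reader, writer = mp.Pipe(duplex=False)
    outputs = mp.Queue()
    proc = mp.Process(target=worker_main, args=(writer, outputs))
    proc.start()
    writer.close()

    # Parent blocks on the ready pipe; a FAILED marker surfaces startup errors.
    if reader.recv_bytes() != READY:
        proc.join()
        raise RuntimeError("background process failed during startup")

    # Drain outputs; an ENGINE_DEAD message becomes an error the frontend can
    # translate into a proper HTTP status code before shutting down.
    while True:
        kind, payload = outputs.get(timeout=10)
        if kind == "ENGINE_DEAD":
            proc.join()
            raise RuntimeError(f"background process failed: {payload}")
```

The key property is that the parent waits on the ready pipe during startup and on the output queue during runtime, so both kinds of failure are observed promptly and can be raised as errors rather than handled by killing the process tree.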

One weakness is that failures of the IPC mechanisms themselves are not handled explicitly.

  • Curious if anyone has ideas on this.
  • This can be a follow-on task.

TEST MATRIX:

  • AsyncLLM, TP=1 and TP>1 --- runtime and startup
  • LLM (MP), TP=1 and TP>1 --- runtime and startup
  • LLM (no-MP), TP=1 and TP>1 --- runtime and startup

Fixes: #12690


github-actions bot commented Jan 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the frontend label Jan 4, 2025

mergify bot commented Jan 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @robertgshaw2-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 4, 2025
@mergify mergify bot removed the needs-rebase label Jan 4, 2025
@robertgshaw2-redhat robertgshaw2-redhat marked this pull request as ready for review January 4, 2025 16:29
@robertgshaw2-redhat robertgshaw2-redhat changed the title [Frontend] Improve API Server Error Messages [Frontend] Improve API Server Error Logs Jan 4, 2025
@robertgshaw2-redhat robertgshaw2-redhat changed the title [Frontend] Improve API Server Error Logs [V1][Frontend] Improve Error Handling Shutdown And Logs Jan 4, 2025
@robertgshaw2-redhat (Collaborator, Author) commented:

Here is what the server logs look like for:

  • TP=2, 1000 concurrent streaming requests
  • Simulated illegal memory access on rank 1 after 200 engine steps (a rough reconstruction of the injected fault appears after the log)
...
INFO:     127.0.0.1:45354 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:45360 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:45368 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:45372 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:45388 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:45394 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-04 17:21:02 core.py:247] RUNNING: 306 | WAITING: 628
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] WorkerProc hit an exception: %s
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] Traceback (most recent call last):
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]   File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 397, in worker_busy_loop
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]     output = getattr(self.worker, method)(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]   File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]   File "/home/rshaw/vllm/vllm/v1/worker/gpu_worker.py", line 204, in execute_model
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]   File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]   File "/home/rshaw/vllm/vllm/v1/worker/gpu_model_runner.py", line 615, in execute_model
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]     hidden_states = self.model(
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]                     ^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]   File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]   File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]     return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]   File "/home/rshaw/vllm/vllm/model_executor/models/llama.py", line 571, in forward
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401]     raise RuntimeError("ERROR IN LLAMA!")
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] RuntimeError: ERROR IN LLAMA!
ERROR 01-04 17:21:04 core.py:200] EngineCore hit an exception: Traceback (most recent call last):
ERROR 01-04 17:21:04 core.py:200]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 193, in run_engine_core
ERROR 01-04 17:21:04 core.py:200]     engine_core.run_busy_loop()
ERROR 01-04 17:21:04 core.py:200]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 231, in run_busy_loop
ERROR 01-04 17:21:04 core.py:200]     outputs = self.step()
ERROR 01-04 17:21:04 core.py:200]               ^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 124, in step
ERROR 01-04 17:21:04 core.py:200]     output = self.model_executor.execute_model(scheduler_output)
ERROR 01-04 17:21:04 core.py:200]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200]   File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 167, in execute_model
ERROR 01-04 17:21:04 core.py:200]     model_output = self.collective_rpc("execute_model",
ERROR 01-04 17:21:04 core.py:200]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200]   File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 161, in collective_rpc
ERROR 01-04 17:21:04 core.py:200]     raise e
ERROR 01-04 17:21:04 core.py:200]   File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 150, in collective_rpc
ERROR 01-04 17:21:04 core.py:200]     raise result
ERROR 01-04 17:21:04 core.py:200] RuntimeError: ERROR IN LLAMA!
ERROR 01-04 17:21:04 core.py:200] 
CRITICAL 01-04 17:21:04 async_llm.py:65] AsyncLLM got fatal signal from worker process, shutting down. See stack trace for root cause.
CRITICAL 01-04 17:21:05 launcher.py:91] Engine failed, terminating server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1067793]
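
The injected fault in the log above is a RuntimeError("ERROR IN LLAMA!") raised from forward in llama.py. A rough, hypothetical reconstruction of that kind of fault injection follows; the step counter and rank check are assumptions, not the actual test code.

```python
# Hypothetical fault-injection helper: raise after 200 forward passes on
# rank 1 to exercise the shutdown path demonstrated in the log above.
import torch.distributed as dist

_step_count = 0


def maybe_inject_failure() -> None:
    """Raise a deliberate error on rank 1 once enough steps have elapsed."""
    global _step_count
    _step_count += 1
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 1 and _step_count > 200:
        raise RuntimeError("ERROR IN LLAMA!")
```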

njhill added 3 commits April 11, 2025 18:42
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
@njhill njhill added the ready label Apr 12, 2025
@njhill njhill mentioned this pull request Apr 14, 2025
@njhill (Member) commented Apr 14, 2025

I think this is ready to land now, with an issue to be opened for some remaining follow-on tasks. The currently failing CI tests (kernel-related, etc.) are, I'm fairly certain, unrelated and are issues on main. Let's agree to merge as soon as the main branch issues are fixed and the tests are green again. Thanks for all of the great work @robertgshaw2-redhat @afeldman-nm

@njhill (Member) commented Apr 16, 2025

Kernel CI test failures are unrelated.

@vllm-bot vllm-bot merged commit 2b05b8c into vllm-project:main Apr 17, 2025
64 of 69 checks passed
@njhill njhill deleted the api-server-error-handling branch April 17, 2025 03:23
lionelvillard pushed a commit to lionelvillard/vllm that referenced this pull request Apr 17, 2025
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
@JaheimLee commented:

I encountered a timeout error when using torch.compile. Why use timeout: Optional[float] = 180.0?

@njhill (Member) commented Apr 22, 2025

@JaheimLee sorry about this. Yes, the default timeout here is too low in some cases; we will fix it shortly.

Fix: #17000
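
For reference, a hypothetical sketch of making such a startup timeout overridable so that long torch.compile warmups can finish; the environment variable name and behavior here are assumptions, and the actual change is in #17000.

```python
import os
from typing import Optional


def get_startup_timeout(default: Optional[float] = 180.0) -> Optional[float]:
    """Return the startup timeout, allowing an override of the 180s default.

    Setting the (hypothetical) STARTUP_TIMEOUT_SECONDS variable to "none"
    disables the timeout entirely.
    """
    value = os.environ.get("STARTUP_TIMEOUT_SECONDS")
    if value is None:
        return default
    return None if value.lower() == "none" else float(value)
```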

jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
Labels: ci/build, frontend, ready, v1
Development

Successfully merging this pull request may close these issues.

[Bug]: V1 cannot be run in Triton Inference Server Backend