[V1][Frontend] Improve Shutdown And Logs #11737


Merged: 172 commits merged on Apr 17, 2025

Changes from 85 commits

Commits
eb16239
checkpoint prototype
Jan 3, 2025
8549fdd
Issue currently is with streaming. The HTTP exception handlers do not…
Jan 3, 2025
77801cd
switch from ValueError -> Exception.
Jan 4, 2025
1bbc3a4
merged
Jan 4, 2025
8eca864
updated
Jan 4, 2025
b8c77b3
stash
Jan 4, 2025
ce9b8ef
stash
Jan 4, 2025
3a760a7
add watchdog
Jan 4, 2025
3024da0
updated
Jan 4, 2025
5af8189
revert spurious changes
Jan 4, 2025
3cb21bb
updated
Jan 4, 2025
7c97308
updated
Jan 4, 2025
ea6824a
updated
Jan 4, 2025
b278065
remove cruft
Jan 4, 2025
c004bd4
cruft
Jan 4, 2025
2556bc4
stash
Jan 4, 2025
db0b9e6
fix llama
Jan 4, 2025
f722589
updated
Jan 4, 2025
de75cc4
cruft
Jan 4, 2025
ba5ca87
cruft
Jan 4, 2025
4f6b68a
updated
Jan 4, 2025
949d425
updated
Jan 4, 2025
f67398b
updated
Jan 4, 2025
b3d2994
updated
Jan 4, 2025
34a997a
update comment
Jan 4, 2025
32cf91b
update comment
Jan 4, 2025
c73801c
fix more
Jan 4, 2025
1188845
updated
Jan 4, 2025
706782c
udpatd
Jan 4, 2025
1cc0915
added exception file
Jan 4, 2025
8db0eee
updated
Jan 4, 2025
2fc8af6
fixt
Jan 4, 2025
de39af1
reduce cruft
Jan 5, 2025
732ba64
reduce cruft
Jan 5, 2025
4372094
cleanup
Jan 5, 2025
b9144a3
updated
Jan 5, 2025
d90e122
cruft
Jan 5, 2025
2bbac31
updated
Jan 5, 2025
c40542a
revert changes to server
Jan 5, 2025
46734eb
revert debug cruft
Jan 5, 2025
f0baffb
fix error
Jan 5, 2025
8a7f18e
added tests
Jan 5, 2025
a662940
revert
Jan 5, 2025
4ee6390
fixed
Jan 5, 2025
3e23ee2
updated
Jan 5, 2025
45456f9
fixed error
Jan 5, 2025
6128b1a
update test coverage
Jan 5, 2025
de24559
stash
Jan 5, 2025
7adf26e
added tests
Jan 6, 2025
bf92854
stash
Jan 7, 2025
8dae5c6
updated
Feb 7, 2025
6b4fe88
updated
Feb 7, 2025
efe85ee
updared
Feb 7, 2025
6195795
fix typo
Feb 7, 2025
0b25586
updated
Feb 7, 2025
0b77b79
updated
Feb 8, 2025
61f3dd7
stash
Feb 8, 2025
fbf19ad
updated
Feb 8, 2025
d25ce5c
updated
Feb 8, 2025
23342d7
remove signal handler
Feb 8, 2025
ebdf8f9
remove signal handler
Feb 8, 2025
6a37020
update comment
Feb 8, 2025
2ed3349
avoid sigusr1
Feb 8, 2025
f9ef3d8
cleanup
Feb 8, 2025
95c249f
cleanup
Feb 8, 2025
030c671
cleanup
Feb 8, 2025
1bdb212
cleanup
Feb 8, 2025
25412a0
updated
Feb 8, 2025
7cf0647
updated
Feb 8, 2025
352da94
it starts?
Feb 8, 2025
a69e040
updated
Feb 8, 2025
8dddc20
updated
Feb 8, 2025
7b48b87
updated
Feb 8, 2025
7400852
updated
Feb 8, 2025
80317a0
updated
Feb 8, 2025
ca37960
nits
Feb 8, 2025
2d41499
fix test for bunched streaming
Feb 8, 2025
4a39d39
tweak typing
Feb 8, 2025
43360f0
Update tests/v1/shutdown/test_forward_error.py
robertgshaw2-redhat Feb 10, 2025
4d0f44f
Merge branch 'main' into api-server-error-handling
robertgshaw2-redhat Feb 10, 2025
218d095
pre commit
Feb 10, 2025
c395634
Update tests/v1/shutdown/test_forward_error.py
robertgshaw2-redhat Feb 10, 2025
042c486
Update vllm/v1/engine/core.py
robertgshaw2-redhat Feb 10, 2025
b5a7b6f
Update vllm/v1/engine/core.py
robertgshaw2-redhat Feb 10, 2025
dab77cf
Update tests/v1/shutdown/test_forward_error.py
robertgshaw2-redhat Feb 10, 2025
f36305d
afeldman merge first-pass
afeldman-nm Mar 24, 2025
c99567e
afeldman merge
afeldman-nm Mar 24, 2025
a9219b0
Merge branch 'main' into aseh
afeldman-nm Mar 24, 2025
a010281
intermed tensors
afeldman-nm Mar 25, 2025
4a733c9
wip
afeldman-nm Mar 25, 2025
64dcb24
Merge branch 'main' into aseh
afeldman-nm Mar 25, 2025
3971d92
Merge branch 'main' into aseh
afeldman-nm Mar 25, 2025
adebbe3
added multiproc on/off tests
afeldman-nm Mar 25, 2025
f23bc25
wip sync
afeldman-nm Mar 25, 2025
188d929
Merge branch 'main' into aseh
afeldman-nm Mar 25, 2025
ae1dc32
check for correct exception
afeldman-nm Mar 25, 2025
33a7926
Merge branch 'main' into aseh
afeldman-nm Mar 25, 2025
c2afedc
wip llm tests
afeldman-nm Mar 25, 2025
4648d85
Merge branch 'main' into aseh
afeldman-nm Mar 26, 2025
4d5d280
Merge branch 'main' into aseh
afeldman-nm Mar 26, 2025
1422551
Merge branch 'main' into aseh
afeldman-nm Mar 26, 2025
59e2e29
Merge branch 'main' into aseh
afeldman-nm Mar 27, 2025
4e6ca2d
Merge branch 'main' into aseh
afeldman-nm Mar 27, 2025
89a5461
removed tests of LLM engine without MP
afeldman-nm Mar 27, 2025
f60c8b5
SyncMPClient & MPClient finalizers works
afeldman-nm Mar 28, 2025
9aed319
wip delete tests
afeldman-nm Mar 31, 2025
be1a23d
rollback
afeldman-nm Mar 31, 2025
7d85fc5
first merge attempt
afeldman-nm Mar 31, 2025
7a3a5c2
Merge branch 'main' into aseh_merge
afeldman-nm Mar 31, 2025
9f672d8
async fix
afeldman-nm Mar 31, 2025
79c4e19
remove strong refs
afeldman-nm Apr 1, 2025
5b332a9
Merge branch 'main' into aseh_merge
afeldman-nm Apr 1, 2025
07824d5
add back strong refs
afeldman-nm Apr 1, 2025
781dfcc
Merge branch 'main' into aseh_merge
afeldman-nm Apr 2, 2025
74d8e8f
removed async forward error test
afeldman-nm Apr 2, 2025
d66844f
removed sync delete dummy request
afeldman-nm Apr 2, 2025
953db41
Merge branch 'main' into aseh_merge
afeldman-nm Apr 2, 2025
c4a7606
Merge branch 'main' into aseh_merge
afeldman-nm Apr 2, 2025
62f2c3e
Merge branch 'main' into aseh_merge
afeldman-nm Apr 4, 2025
2ee74b6
temporarily removed test case
afeldman-nm Apr 4, 2025
f229a86
Merge branch 'main' into aseh_merge
afeldman-nm Apr 4, 2025
86263dc
test load weights failure
afeldman-nm Apr 4, 2025
7b78cde
Merge branch 'main' into aseh_merge
afeldman-nm Apr 4, 2025
f824c15
Update vllm/v1/engine/exceptions.py
afeldman-nm Apr 4, 2025
40b0e15
Merge remote-tracking branch 'origin/main' into api-server-error-hand…
njhill Apr 8, 2025
7dc02fa
Post main-merge cleanup/fixes
njhill Apr 9, 2025
f1bce10
Some updates to MultiprocExecutor
njhill Apr 9, 2025
038aa31
Merge remote-tracking branch 'refs/remotes/origin/main' into api-serv…
njhill Apr 9, 2025
d014a6b
More multiproc_executor.py streamlining
njhill Apr 9, 2025
c9941da
core_client.py streamlining
njhill Apr 9, 2025
72740ca
timeout
afeldman-nm Apr 10, 2025
9983d30
Merge branch 'api-server-error-handling' of https://github.com/neural…
afeldman-nm Apr 10, 2025
93c2001
Merge branch 'main' into aseh_merge
afeldman-nm Apr 10, 2025
1a76f36
refactor
afeldman-nm Apr 10, 2025
5bde29d
refactor
afeldman-nm Apr 10, 2025
1a0a217
Process monitor for TP workers
njhill Apr 10, 2025
e64c7c9
Merge branch 'main' into aseh_merge
afeldman-nm Apr 10, 2025
26005b0
Merge branch 'api-server-error-handling' of https://github.com/neural…
afeldman-nm Apr 10, 2025
1abcac3
ValueError exception
afeldman-nm Apr 10, 2025
1a4b6a0
added llm 2-rank forward error test back
afeldman-nm Apr 10, 2025
863aa08
added back async test
afeldman-nm Apr 10, 2025
766338e
Merge remote-tracking branch 'origin/main' into api-server-error-hand…
njhill Apr 10, 2025
be9d356
Adjust per request failure log messages
njhill Apr 10, 2025
92916a8
Merge remote-tracking branch 'origin/main' into api-server-error-hand…
njhill Apr 10, 2025
f02185d
Merge branch 'api-server-error-handling' of https://github.com/neural…
afeldman-nm Apr 10, 2025
95a45ba
Move output queue task ref / cleanup to BackgroundResource
njhill Apr 10, 2025
cb70c37
added tests back
afeldman-nm Apr 10, 2025
6215c00
Merge remote-tracking branch 'nm/api-server-error-handling' into api-…
njhill Apr 10, 2025
775e0c3
knobs for tests
afeldman-nm Apr 10, 2025
b309b45
Merge branch 'api-server-error-handling' of https://github.com/neural…
afeldman-nm Apr 10, 2025
3524115
Fix rebase bug
njhill Apr 10, 2025
de51ec1
Fix AsyncLLM garbage collection cleanup issue
njhill Apr 10, 2025
a0536c4
Re-enable failing test (seems to work now)
njhill Apr 11, 2025
76494dc
Re-enable other failing test (also seems to work now)
njhill Apr 11, 2025
b5d8702
CUDA_VISIBLE_DEVICES for shutdown tests in buildkite
afeldman-nm Apr 11, 2025
b067f8d
temporarily enabled v1 fastcheck test
afeldman-nm Apr 11, 2025
e94c89e
moved shutdown tests to 2 GPU section
afeldman-nm Apr 11, 2025
29912d5
Merge branch 'main' into aseh
afeldman-nm Apr 11, 2025
6de94aa
Fix breakage to DP case
njhill Apr 11, 2025
060ecd9
Properly fix DP breakage
njhill Apr 11, 2025
4228bb4
Add timeout to TP execute_model, reply only from rank0
njhill Apr 12, 2025
b5acee3
Merge remote-tracking branch 'origin/main' into api-server-error-hand…
njhill Apr 12, 2025
6c540c3
Cancel shm dequeue on shutdown
njhill Apr 12, 2025
da8c253
fix
njhill Apr 12, 2025
27d7d82
Fix exception message
njhill Apr 12, 2025
444a446
Cleanup
njhill Apr 12, 2025
e33000e
revert
afeldman-nm Apr 14, 2025
b1977ac
Merge branch 'main' into aseh_merge
afeldman-nm Apr 14, 2025
0d0071a
Merge branch 'aseh_merge' into aseh
afeldman-nm Apr 14, 2025
4ce2771
Merge remote-tracking branch 'refs/remotes/origin/main' into api-serv…
njhill Apr 15, 2025
e8672e8
Merge remote-tracking branch 'refs/remotes/origin/main' into api-serv…
njhill Apr 16, 2025
7cf6b6f
Address review comments from @DarkLight1337
njhill Apr 16, 2025
1 change: 1 addition & 0 deletions .buildkite/test-pipeline.yaml
@@ -187,6 +187,7 @@ steps:
commands:
# split the test to avoid interference
- VLLM_USE_V1=1 pytest -v -s v1/core
- VLLM_USE_V1=1 pytest -v -s v1/shutdown
Review comment (Member):

Just a side note: it seems like updating these commands would be really easy to miss when adding new tests in a new directory.

- VLLM_USE_V1=1 pytest -v -s v1/engine
- VLLM_USE_V1=1 pytest -v -s v1/sample
- VLLM_USE_V1=1 pytest -v -s v1/worker
122 changes: 122 additions & 0 deletions tests/v1/shutdown/test_forward_error.py
@@ -0,0 +1,122 @@
# SPDX-License-Identifier: Apache-2.0
"""Test that we handle an Error in model forward and shutdown."""

import asyncio

import pytest

from tests.utils import wait_for_gpu_memory_to_clear
from vllm import LLM, SamplingParams
from vllm.distributed import get_tensor_model_parallel_rank
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.model_executor.models.llama import LlamaForCausalLM
from vllm.utils import cuda_device_count_stateless
from vllm.v1.engine.async_llm import AsyncLLM
from vllm.v1.engine.exceptions import EngineDeadError


def evil_forward(self, *args, **kwargs):
"""Evil forward method that raise an exception after 10 calls."""
NUMBER_OF_GOOD_PASSES = 10

if not hasattr(self, "num_calls"):
self.num_calls = 0

if (self.num_calls == NUMBER_OF_GOOD_PASSES
and get_tensor_model_parallel_rank() == 0):
raise Exception("Simulated illegal memory access on Rank 0!")
self.num_calls += 1

return self.model(*args, **kwargs, intermediate_tensors=None)


@pytest.mark.asyncio
@pytest.mark.parametrize("tensor_parallel_size", [2, 1])
async def test_async_llm_model_error(monkeypatch, tensor_parallel_size):

if cuda_device_count_stateless() < tensor_parallel_size:
pytest.skip(reason="Not enough CUDA devices")

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "1")

# Monkeypatch an error in the model.
monkeypatch.setattr(LlamaForCausalLM, "forward", evil_forward)
Review comment (Member):

Since you created a monkeypatch.context(), did you mean to use m.setattr(...) here?

From reading the docs, it sounds like both result in the same behavior, but m.setattr(...) makes it clearer that the change is limited to this context.

Suggested change:
-        # Monkeypatch an error in the model.
-        monkeypatch.setattr(LlamaForCausalLM, "forward", evil_forward)
+        # Monkeypatch an error in the model.
+        m.setattr(LlamaForCausalLM, "forward", evil_forward)
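
A minimal, hypothetical sketch of that scoping behavior (the Dummy class and test name below are made up for illustration and are not part of this PR): attributes patched through the context's MonkeyPatch object are restored automatically when the with block exits.

# Hypothetical illustration of monkeypatch.context() scoping; not PR code.
class Dummy:

    def forward(self):
        return "original"


def evil_forward(self):
    raise Exception("Simulated error!")


def test_scoped_patch(monkeypatch):
    with monkeypatch.context() as m:
        # Patched through the context's MonkeyPatch object...
        m.setattr(Dummy, "forward", evil_forward)
        assert Dummy.forward is evil_forward
    # ...and restored automatically when the with block exits.
    assert Dummy().forward() == "original"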


engine_args = AsyncEngineArgs(
model="meta-llama/Llama-3.2-1B",
enforce_eager=True,
tensor_parallel_size=tensor_parallel_size)
async_llm = AsyncLLM.from_engine_args(engine_args)

async def generate(request_id: str):
generator = async_llm.generate("Hello my name is",
request_id=request_id,
sampling_params=SamplingParams())
try:
async for _ in generator:
pass
except Exception as e:
return e

NUM_REQS = 3
tasks = [generate(f"request-{idx}") for idx in range(NUM_REQS)]
outputs = await asyncio.gather(*tasks)

# Every request should get an EngineDeadError.
for output in outputs:
assert isinstance(output, EngineDeadError)

# AsyncLLM should be errored.
assert async_llm.errored

# We should not be able to make another request.
with pytest.raises(EngineDeadError):
async for _ in async_llm.generate(
"Hello my name is",
request_id="abc",
sampling_params=SamplingParams()):
raise Exception("We should not get here.")

# Confirm all the processes are cleaned up.
wait_for_gpu_memory_to_clear(
devices=list(range(tensor_parallel_size)),
threshold_bytes=2 * 2**30,
Review comment (Member):

This number looks a bit like magic. It would be great to put it in a constant with a comment explaining it somewhere.

Review comment (Member):

We do have a GiB_bytes constant defined in vllm.utils; you could consider using that.
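
A minimal sketch of that suggestion, assuming GiB_bytes (== 2**30) is importable from vllm.utils as the comment says; tensor_parallel_size is shown here as a placeholder for the test's parametrized value.

# Sketch only: replace the 2 * 2**30 literal with a named constant.
from tests.utils import wait_for_gpu_memory_to_clear
from vllm.utils import GiB_bytes  # assumed to equal 2**30

# Allow up to 2 GiB of residual GPU memory while waiting for cleanup.
MAX_RESIDUAL_GPU_BYTES = 2 * GiB_bytes

tensor_parallel_size = 2  # placeholder for the test's parametrized value

wait_for_gpu_memory_to_clear(
    devices=list(range(tensor_parallel_size)),
    threshold_bytes=MAX_RESIDUAL_GPU_BYTES,
    timeout_s=60,
)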

timeout_s=60,
)

        # NOTE: In a deployment, the API server handles shutdown when an
        # exception occurs; there is no API server in this test, so we call
        # shutdown() explicitly here.
async_llm.shutdown()


@pytest.mark.parametrize("enable_multiprocessing", [True, False])
@pytest.mark.parametrize("tensor_parallel_size", [2, 1])
def test_llm_model_error(monkeypatch, tensor_parallel_size,
enable_multiprocessing):

if cuda_device_count_stateless() < tensor_parallel_size:
pytest.skip(reason="Not enough CUDA devices")

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "1")

MP_VALUE = "1" if enable_multiprocessing else "0"
m.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", MP_VALUE)

# Monkeypatch an error in the model.
m.setattr(LlamaForCausalLM, "forward", evil_forward)

llm = LLM(model="meta-llama/Llama-3.2-1B",
enforce_eager=True,
tensor_parallel_size=tensor_parallel_size)

with pytest.raises(EngineDeadError):
llm.generate("Hello my name is Robert and I")

# Confirm all the processes are cleaned up.
wait_for_gpu_memory_to_clear(
devices=list(range(tensor_parallel_size)),
threshold_bytes=2 * 2**30,
timeout_s=60,
)
65 changes: 65 additions & 0 deletions tests/v1/shutdown/test_processor_error.py
@@ -0,0 +1,65 @@
# SPDX-License-Identifier: Apache-2.0
"""Test error handling in Processor. Should not impact other reqs."""

import asyncio

import pytest

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.inputs.data import TokensPrompt
from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine.async_llm import AsyncLLM
from vllm.v1.engine.exceptions import EngineGenerateError


@pytest.mark.asyncio
async def test_async_llm_processor_error(monkeypatch):

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "1")

engine_args = AsyncEngineArgs(model="meta-llama/Llama-3.2-1B",
enforce_eager=True)
async_llm = AsyncLLM.from_engine_args(engine_args)

async def generate(request_id: str):
# [] is not allowed and will raise a ValueError in Processor.
            generator = async_llm.generate(TokensPrompt(prompt_token_ids=[]),
request_id=request_id,
sampling_params=SamplingParams())
try:
async for _ in generator:
pass
except Exception as e:
return e

NUM_REQS = 3
tasks = [generate(f"request-{idx}") for idx in range(NUM_REQS)]
outputs = await asyncio.gather(*tasks)

        # Every request should get an EngineGenerateError.
for output in outputs:
with pytest.raises(EngineGenerateError):
raise output

        # AsyncLLM should still be healthy (not errored).
assert not async_llm.errored

# This should be no problem.
EXPECTED_TOKENS = 5
outputs = []
async for out in async_llm.generate(
"Hello my name is",
request_id="abc",
sampling_params=SamplingParams(
max_tokens=EXPECTED_TOKENS,
output_kind=RequestOutputKind.DELTA)):
outputs.append(out)

generated_tokens = []
for out in outputs:
generated_tokens.extend(out.outputs[0].token_ids)
assert len(generated_tokens) == EXPECTED_TOKENS

async_llm.shutdown()
88 changes: 88 additions & 0 deletions tests/v1/shutdown/test_startup_error.py
@@ -0,0 +1,88 @@
# SPDX-License-Identifier: Apache-2.0
"""Test that we handle a startup Error and shutdown."""

import pytest

from tests.utils import wait_for_gpu_memory_to_clear
from vllm import LLM
from vllm.distributed import get_tensor_model_parallel_rank
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.model_executor.models.llama import LlamaForCausalLM
from vllm.utils import cuda_device_count_stateless
from vllm.v1.engine.async_llm import AsyncLLM


def evil_forward(self, *args, **kwargs):
"""Evil forward method that raise an exception."""

if get_tensor_model_parallel_rank() == 0:
raise Exception("Simulated Error in startup!")

return self.model(*args, **kwargs, intermediate_tensors=None)


MODELS = [
"meta-llama/Llama-3.2-1B", # Raises on first fwd pass.
"mistralai/Mixtral-8x22B-Instruct-v0.1" # Causes OOM.
Review comment (Member):
Is this going to download the model and get a real OOM? From a quick look, it doesn't look like this is used elsewhere, so that'd be a net-new model to download during tests? If so, that doesn't seem worth the cost, especially given how unreliable HF has been in CI lately. Maybe I'm misunderstanding, though!

Reply (Collaborator, Author):
Good idea.

I do think it is important to exercise both cases here, since there is a subtle difference:

  • "meta-llama/Llama-3.2-1B" (raises on first fwd pass): the error happens during profiling, after the IPC mechanisms are created
  • "mistralai/Mixtral-8x22B-Instruct-v0.1" (causes OOM): the error happens during weight loading, before the IPC mechanisms are set up

I will instead do a monkeypatch to raise an error on load_weights for that case (see the sketch after the MODELS list below).

Review comment (Member):
monkeypatch sounds good if the error encountered is clear enough!

]
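
A hypothetical sketch of the load_weights monkeypatch mentioned in the review thread above; the function and test names are illustrative, and this is not the code that was ultimately merged.

# Hypothetical sketch of a load_weights failure injection; not PR code.
import pytest

from vllm import LLM
from vllm.model_executor.models.llama import LlamaForCausalLM


def evil_load_weights(self, *args, **kwargs):
    """Simulate a failure during weight loading (before IPC setup)."""
    raise Exception("Simulated failure during load_weights!")


def test_llm_load_weights_error(monkeypatch):
    with monkeypatch.context() as m:
        m.setenv("VLLM_USE_V1", "1")
        # Fail during weight loading instead of relying on a real OOM.
        m.setattr(LlamaForCausalLM, "load_weights", evil_load_weights)

        with pytest.raises(Exception, match="initialization failed"):
            _ = LLM(model="meta-llama/Llama-3.2-1B", enforce_eager=True)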


@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("tensor_parallel_size", [2, 1])
def test_async_llm_startup_error(monkeypatch, model, tensor_parallel_size):

if cuda_device_count_stateless() < tensor_parallel_size:
pytest.skip(reason="Not enough CUDA devices")

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "1")

# Monkeypatch an error in the model.
monkeypatch.setattr(LlamaForCausalLM, "forward", evil_forward)

engine_args = AsyncEngineArgs(
model=model,
enforce_eager=True,
tensor_parallel_size=tensor_parallel_size)

# Confirm we get an exception.
with pytest.raises(Exception, match="initialization failed"):
_ = AsyncLLM.from_engine_args(engine_args)

# Confirm all the processes are cleaned up.
wait_for_gpu_memory_to_clear(
devices=list(range(tensor_parallel_size)),
threshold_bytes=2 * 2**30,
timeout_s=60,
)


@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("tensor_parallel_size", [2, 1])
@pytest.mark.parametrize("enable_multiprocessing", [True, False])
def test_llm_startup_error(monkeypatch, model, tensor_parallel_size,
enable_multiprocessing):

if cuda_device_count_stateless() < tensor_parallel_size:
pytest.skip(reason="Not enough CUDA devices")

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "1")

MP_VALUE = "1" if enable_multiprocessing else "0"
m.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", MP_VALUE)

# Monkeypatch an error in the model.
monkeypatch.setattr(LlamaForCausalLM, "forward", evil_forward)

with pytest.raises(Exception, match="initialization failed"):
_ = LLM(model="meta-llama/Llama-3.2-1B",
enforce_eager=True,
tensor_parallel_size=tensor_parallel_size)

# Confirm all the processes are cleaned up.
wait_for_gpu_memory_to_clear(
devices=list(range(tensor_parallel_size)),
threshold_bytes=2 * 2**30,
timeout_s=60,
)