
Commit 34bf3bc

Authored and committed by DarkLight1337 and dbyoung18
[Doc] Add more tips to avoid OOM (vllm-project#16765)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent db24dd9 commit 34bf3bc

File tree: 2 files changed (+33, −0 lines)


docs/source/serving/offline_inference.md

Lines changed: 25 additions & 0 deletions
```diff
@@ -28,6 +28,8 @@ Please refer to the above pages for more details about each API.
 [API Reference](/api/offline_inference/index)
 :::
 
+(configuration-options)=
+
 ## Configuration Options
 
 This section lists the most common options for running the vLLM engine.
```
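For context: `(configuration-options)=` is a MyST target label. It is what allows the tip added to `docs/source/serving/openai_compatible_server.md` below to cross-reference this section via `[here](configuration-options)`.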
````diff
@@ -184,6 +186,29 @@ llm = LLM(model="google/gemma-3-27b-it",
           limit_mm_per_prompt={"image": 0})
 ```
 
+#### Multi-modal processor arguments
+
+For certain models, you can adjust the multi-modal processor arguments to
+reduce the size of the processed multi-modal inputs, which in turn saves memory.
+
+Here are some examples:
+
+```python
+from vllm import LLM
+
+# Available for Qwen2-VL series models
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          mm_processor_kwargs={
+              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
+          })
+
+# Available for InternVL series models
+llm = LLM(model="OpenGVLab/InternVL2-2B",
+          mm_processor_kwargs={
+              "max_dynamic_patch": 4,  # Default is 12
+          })
+```
+
 ### Performance optimization and tuning
 
 You can potentially improve the performance of vLLM by finetuning various options.
````
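Since this commit is about avoiding OOM, the new processor arguments can also be stacked with the other memory-saving options this page already documents. A minimal sketch, assuming illustrative values (the context length, image limit, and memory fraction below are not recommendations from the patch):

```python
from vllm import LLM

# Sketch: combining several memory-saving options. The specific values
# are illustrative assumptions, not recommendations from this commit.
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    max_model_len=4096,                 # shorter context -> smaller KV cache
    gpu_memory_utilization=0.8,         # leave headroom for other processes
    limit_mm_per_prompt={"image": 1},   # at most one image per prompt
    mm_processor_kwargs={"max_pixels": 768 * 768},  # shrink processed images
)
```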

docs/source/serving/openai_compatible_server.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -33,11 +33,13 @@ print(completion.choices[0].message)
 vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
 You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, e.g. `extra_body={"top_k": 50}` for `top_k`.
 :::
+
 :::{important}
 By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
 
 To disable this behavior, please pass `--generation-config vllm` when launching the server.
 :::
+
 ## Supported APIs
 
 We currently support the following OpenAI APIs:
```
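The `extra_body` mechanism referenced in the hunk above can be exercised as follows. The base URL, API key, and model name are illustrative assumptions; `extra_body` is a standard escape hatch in the OpenAI Python client for passing parameters outside the OpenAI schema:

```python
from openai import OpenAI

# Point the standard OpenAI client at a local vLLM server.
# Base URL, API key, and model name are illustrative assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
    extra_body={"top_k": 50},  # vLLM-specific sampling parameter
)
print(completion.choices[0].message)
```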
```diff
@@ -172,6 +174,12 @@ print(completion._request_id)
 
 The `vllm serve` command is used to launch the OpenAI-compatible server.
 
+:::{tip}
+The vast majority of command-line arguments are based on those for offline inference.
+
+See [here](configuration-options) for some common options.
+:::
+
 :::{argparse}
 :module: vllm.entrypoints.openai.cli_args
 :func: create_parser_for_docs
```
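To illustrate the new tip, here is how the offline-inference options from the first file would translate to server flags. A hedged sketch: it assumes the usual mapping from engine arguments to kebab-case CLI flags, with dict-valued options passed as JSON strings:

```bash
# Sketch: launching the server with the memory-saving options shown above.
# Flag spellings assume the standard engine-arg to CLI-flag mapping.
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.8 \
    --mm-processor-kwargs '{"max_pixels": 589824}'  # 589824 = 768 * 768
```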
