vault. Refer to the [Intel Gaudi documentation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#pull-prebuilt-containers)
for more details.
Use the following commands to run a Docker image:
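
For reference, an invocation typically looks like the sketch below. The image tag is only a placeholder: pick the tag from the Intel Gaudi vault that matches the Gaudi software stack and PyTorch version installed on your host.

```bash
# Pull a prebuilt Gaudi PyTorch container from the Habana vault
# (replace the tag with the release matching your installed Gaudi software stack).
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

# Run the container with the Habana runtime and all HPUs visible to it.
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  --cap-add=sys_nice --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```
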
You can also configure the strategy for capturing HPU Graphs for the prompt and decode stages separately. The strategy affects the order in which graphs are captured. There are two strategies implemented:

- `max_bs` - the graph capture queue is sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`); this is the default strategy for decode.
- `min_tokens` - the graph capture queue is sorted in ascending order by the number of tokens each graph processes (`batch_size*sequence_length`); this is the default strategy for prompt.

When there is a large number of requests pending, the vLLM scheduler attempts to fill the maximum decode batch size as soon as possible. When a request finishes, the decode batch size decreases. When that happens, vLLM attempts to schedule a prefill iteration for requests in the waiting queue, to bring the decode batch size back to its previous state. This means that in a full-load scenario the decode batch size is often at its maximum, which makes capturing large-batch-size HPU Graphs crucial, as reflected by the `max_bs` strategy. Prefills, on the other hand, are executed most frequently with very low batch sizes (1-4), which is reflected in the `min_tokens` strategy.
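
As a concrete illustration, the capture strategy is usually selected per stage through environment variables set before the server starts. The knob names below (`VLLM_GRAPH_PROMPT_STRATEGY`, `VLLM_GRAPH_DECODE_STRATEGY`) and the model name are assumptions for this sketch; check the performance tuning knobs list for the exact names in your version.

```bash
# Hedged sketch: keep the default min_tokens ordering for prompt graphs,
# and capture decode graphs largest-batch-first (max_bs).
# Knob names assumed from the HPU backend's performance tuning knobs.
export VLLM_GRAPH_PROMPT_STRATEGY=min_tokens
export VLLM_GRAPH_DECODE_STRATEGY=max_bs

# Example model only; substitute your own.
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf
```
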
- We recommend running inference on Gaudi 2 with a `block_size` of 128 for the BF16 data type. Using the default values (16, 32) might lead to sub-optimal performance due to under-utilization of the Matrix Multiplication Engine (see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html)).
- For max throughput on Llama 7B, we recommend running with a batch size of 128 or 256 and a max context length of 2048 with HPU Graphs enabled (see the example launch after this list). If you encounter out-of-memory issues, see the troubleshooting section.
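
A launch along the following lines applies the recommendations above; the flags are the standard vLLM server CLI options and the model name is a placeholder, so treat this as a sketch rather than a canonical command.

```bash
# Hedged sketch applying the recommendations above on Gaudi 2:
# BF16, block size of 128, max context length of 2048, batch size capped at 128,
# HPU Graphs left enabled (the default).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --dtype bfloat16 \
  --block-size 128 \
  --max-model-len 2048 \
  --max-num-seqs 128
```
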
**Diagnostic and profiling knobs:**

- `VLLM_PROFILER_ENABLED`: if `true`, the high-level profiler is enabled. The resulting JSON traces can be viewed in [perfetto.habana.ai](https://perfetto.habana.ai/#!/viewer). `false` by default.
- `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION`: if `true`, logs graph compilations for each vLLM engine step, but only when any occurred. Highly recommended to use together with `PT_HPU_METRICS_GC_DETAILS=1` (see the combined example after this list). `false` by default.
- `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL`: if `true`, always logs graph compilations for each vLLM engine step, even if none occurred. `false` by default.
- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS`: if `true`, logs CPU fallbacks for each vLLM engine step, but only when any occurred. `false` by default.
- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL`: if `true`, always logs CPU fallbacks for each vLLM engine step, even if none occurred. `false` by default.
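
For instance, to investigate unexpected recompilations during serving, one might combine the compilation-logging knob with the bridge-level metric details, roughly as sketched below (the model name is a placeholder).

```bash
# Hedged sketch: log graph compilations per engine step (only when they happen),
# with detailed compilation metrics from the HPU PyTorch bridge.
export VLLM_HPU_LOG_STEP_GRAPH_COMPILATION=true
export PT_HPU_METRICS_GC_DETAILS=1

# Optionally also capture a high-level profiler trace viewable in perfetto.habana.ai.
export VLLM_PROFILER_ENABLED=true

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf
```
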
**Performance tuning knobs:**
Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:

- `PT_HPU_LAZY_MODE`: if `0`, the PyTorch Eager backend for Gaudi is used; if `1`, the PyTorch Lazy backend for Gaudi is used. `1` is the default.
- `PT_HPU_ENABLE_LAZY_COLLECTIVES`: required to be `true` for tensor parallel inference with HPU Graphs (see the example below).
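
For example, a multi-card tensor parallel run with HPU Graphs would typically export the collectives knob before launching. The sketch below assumes an 8-HPU node and a placeholder model name.

```bash
# Hedged sketch: lazy collectives are required for tensor parallel inference with HPU Graphs.
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPU_LAZY_MODE=1   # Lazy backend (the default)

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 8
```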