
nm vllm ent 0.8.5 sync #139


Merged: 328 commits, May 15, 2025

Commits (328)
7eb4255
[BugFix] Accuracy fix for llama4 int4 - improperly casted scales (#16…
LucasWilkinson Apr 18, 2025
e78587a
Improve-mm-and-pooler-and-decoding-configs (#16789)
hmellor Apr 18, 2025
7bdfd29
[Misc] add collect_env to cli and docker image (#16759)
lengrongfu Apr 18, 2025
aaec845
[ROCm] [Attention] Cleanup ROCm output passing (#16431)
ProExpertProg Apr 18, 2025
e31045f
[Bugfix] fix pp for llama4 (#16746)
luccafong Apr 18, 2025
9c1d5b4
[Doc] add podman setup instructions for official image (#16796)
nathan-weinberg Apr 18, 2025
26507f8
[Docs] Fix a link and grammar issue in production-stack.md (#16809)
windsonsea Apr 18, 2025
87e067d
[Model] use AutoWeightsLoader for BigCode, GPT-J (#16823)
jonghyunchoe Apr 18, 2025
aadb656
[Misc] Clean up Kimi-VL (#16833)
DarkLight1337 Apr 18, 2025
686623c
Fix `nullable_kvs` fallback (#16837)
hmellor Apr 18, 2025
3d3ab36
[New Model]: Snowflake Arctic Embed (Family) (#16649)
noooop Apr 18, 2025
5a5e29d
[Misc] refactor examples series - Chat Completion Client With Tools (…
reidliu41 Apr 18, 2025
490b169
[Doc] Updated Llama section in tool calling docs to have llama 3.2 co…
jmho Apr 18, 2025
5c91212
[release] Publish neuron docker image (#16733)
omrishiv Apr 19, 2025
2c1bd84
[Model][VLM] Add Qwen2.5-Omni model support (thinker only) (#15130)
fyabc Apr 19, 2025
1d4680f
[rocm][MI300] llama4 maverick fp8 moe config tp8 (#16847)
divakar-amd Apr 19, 2025
2ef0dc5
[Frontend] Add sampling params to `v1/audio/transcriptions` endpoint …
NickLucche Apr 19, 2025
9d4ca19
[Misc] Benchmarks for audio models (#16505)
NickLucche Apr 19, 2025
d9737ca
[V1][Misc] stop update prefix cache stats when logs_stats is disabled…
vie-serendipity Apr 19, 2025
83f3c3b
[Model] Refactor Phi-4-multimodal to use merged processor and support…
Isotr0py Apr 19, 2025
5124f5b
[Model] Qwen2.5-Omni Cleanup (#16872)
ywang96 Apr 19, 2025
205d84a
[VLM] Clean up models (#16873)
DarkLight1337 Apr 19, 2025
d6195a7
[doc] update hyperlink (#16877)
reidliu41 Apr 19, 2025
682e0b6
Log how much time loading a compiled artifact takes (#16848)
zou3519 Apr 19, 2025
87aaade
Serialize tensors using int8 views (#16866)
p88h Apr 19, 2025
4b07d36
Improve configs - `CacheConfig` (#16835)
hmellor Apr 20, 2025
fe742ae
[easy] Pass compile_fx only the config patches (#16845)
zou3519 Apr 20, 2025
bb3605d
[Bugfix] Fix v1/spec_decode/test_ngram.py (#16895)
zixi-qi Apr 21, 2025
4c41278
[CI/CD][V1] Add spec decode tests to CI (#16900)
WoosukKwon Apr 21, 2025
26c0406
[Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni (#16907)
fyabc Apr 21, 2025
b34f334
[Doc] Split dummy_processor_inputs() in Multimodal Docs (#16915)
alex-jw-brooks Apr 21, 2025
d41faaf
Restore buffers when wake up from level 2 sleep (#16564) (#16889)
fingertap Apr 21, 2025
d9ac9e3
[Misc] fix collect_env version parse (#15267)
wangxiyuan Apr 21, 2025
7272bfa
[Misc] Refactor platform to get device specific stream and event (#14…
shen-shanshan Apr 21, 2025
55d6d3f
[Bugfix] Fix GLM rotary_dim issue and support v1 (#16912)
Isotr0py Apr 21, 2025
3b34fd5
Raise error for data-parallel with benchmark_throughput (#16737)
kartikx Apr 21, 2025
fe3462c
[XPU][Bugfix] minor fix for XPU (#15591)
yma11 Apr 21, 2025
63e26ff
[doc] install required python3-dev apt package (#16888)
davidxia Apr 21, 2025
f728ab8
[Doc] mention how to install in CPU editable mode (#16923)
davidxia Apr 21, 2025
299ebb6
[Core] Speed up decode by remove synchronizing operation in sampler (…
chanh Apr 21, 2025
3a0fba5
[V1][Spec Decode] Handle draft tokens beyond max_model_len (#16087)
WoosukKwon Apr 21, 2025
471fe65
[TPU][V1] Implicitly adjust page size when there's SMEM OOM (#16871)
yaochengji Apr 21, 2025
71eda0b
Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml (#16946)
mgoin Apr 22, 2025
2102075
[TPU][V1] Capture multimodal encoder during model compilation (#15051)
NickLucche Apr 22, 2025
986537f
[V1] V1 FlashInfer Attention (#16684)
mgoin Apr 22, 2025
fa3bba2
[TPU][V1] Enable Top-P (#16843)
NickLucche Apr 22, 2025
29f395c
[Doc] Remove unnecessary V1 flag (#16924)
DarkLight1337 Apr 22, 2025
1311913
[BugFix][Spec Decode] No in-place update to draft probs (#16952)
WoosukKwon Apr 22, 2025
0e42544
[Bugfix]: fix issue with n>1 sampling on v1 requests overriding each …
jeffrey-dot-li Apr 22, 2025
5b794ca
[ROCm] Add aiter tkw1 kernel for Llama4 fp8 (#16727)
kliuae Apr 22, 2025
c9acbf1
[Misc] Remove the chunked prefill warning for LoRA (#16925)
jeejeelee Apr 22, 2025
7b8a2ab
[Kernel] Add expert_map support to Cutlass FP8 MOE (#16861)
varun-sundar-rabindranath Apr 22, 2025
b9b4746
[V1] Remove additional_config check (#16710)
wangxiyuan Apr 22, 2025
188b7f9
[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm (…
charlifu Apr 22, 2025
71ce440
Support S3 Sharded loading with RunAI Model Streamer (#16317)
omer-dayan Apr 22, 2025
d6da932
[Bugfix] Fix f-string for Python 3.9-3.11 (#16962)
DarkLight1337 Apr 22, 2025
3097ce3
[Doc] Update ai_accelerator/hpu-gaudi.inc.md (#16956)
windsonsea Apr 22, 2025
a114bf2
[Perf] Optimize `_update_states` for GPU model runner (#16910)
SnowCharmQ Apr 22, 2025
acba33a
[Bugfix] Fix the issue where llm.generate cannot be called repeatedly…
chaunceyjiang Apr 22, 2025
2689d5c
[Model] Use autoweightloader for mamba (#16950)
sfeng33 Apr 22, 2025
c4ab9f3
[V1] Remove pre-allocation for KV cache (#16941)
WoosukKwon Apr 22, 2025
8d32dc6
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision C…
LeiWang1999 Apr 22, 2025
e4d6144
[BugFix] Fix incremental detokenization perf issue (#16963)
njhill Apr 22, 2025
8f7bace
[Doc] Improve documentation for multimodal CLI args (#16960)
DarkLight1337 Apr 22, 2025
0e237f0
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER (#15001)
vllmellm Apr 22, 2025
4b91c92
[Misc] refactor example series (#16972)
reidliu41 Apr 22, 2025
571e8dd
[Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni (#16974)
fyabc Apr 22, 2025
d059110
Improve configs - `SpeculativeConfig` (#16971)
hmellor Apr 22, 2025
f961d7f
[BugFix] Pass in correct VLLM config in FlashInfer backend (#13207) (…
timzsu Apr 22, 2025
68d4c33
[Misc] Add S3 environment variables for better support of MinIO. (#16…
chaunceyjiang Apr 22, 2025
f344107
[frontend] enhance tool_calls type check (#16882)
reidliu41 Apr 22, 2025
30bc3e0
[FEAT][ROCm]: Support AITER MLA (#15893)
vllmellm Apr 22, 2025
7f58fb9
Add assertion for no objects while hashing hf_config (#16930)
zou3519 Apr 22, 2025
5536b30
Fencing Kernels Tests for enabling on AMD (#16929)
Alexei-V-Ivanov-AMD Apr 22, 2025
5175b88
[BugFix] Remove default multiproc executor `collective_rpc` timeout (…
njhill Apr 22, 2025
83d9337
[Core][V1][TPU] Enable structured decoding on TPU V1 (#16499)
Chenyaaang Apr 23, 2025
36fe787
[Bugfix] validate urls object for multimodal content parts (#16990)
gcalmettes Apr 23, 2025
f67e9e9
add Dockerfile build vllm against torch nightly (#16936)
yangw-dev Apr 23, 2025
bc7c4d2
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 (#13305)
maleksan85 Apr 23, 2025
1e013fa
[V1][DP] More robust DP/EP dummy request coordination (#16277)
njhill Apr 23, 2025
7e081ba
[BugFix] Revert ROCm Custom Paged Attention Env Flag Check (#17022)
vllmellm Apr 23, 2025
6bc1e30
Revert "[Misc] Add S3 environment variables for better support of Min…
chaunceyjiang Apr 23, 2025
e1cf90e
[misc] tune some env vars for GB200 (#16992)
youkaichao Apr 23, 2025
56a7352
[INTEL-HPU][v0] Port delayed sampling to upstream (#16949)
xuechendi Apr 23, 2025
eb8ef42
[doc] add download path tips (#17013)
reidliu41 Apr 23, 2025
047797e
[Bugfix] Triton FA function takes no keyword arguments (#16902)
vllmellm Apr 23, 2025
b2f195c
[V1] Avoid socket errors during shutdown when requests are in in-flig…
njhill Apr 23, 2025
d0da99f
[BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have …
LucasWilkinson Apr 23, 2025
ec69124
[Misc] Improve readability of get_open_port function. (#17024)
gitover22 Apr 23, 2025
8c87a9a
[Bugfix] Fix AssertionError: skip_special_tokens=False is not support…
chaunceyjiang Apr 23, 2025
ce17db8
[CI] Run v1/test_serial_utils.py in CI (#16996)
russellb Apr 23, 2025
aa72d9a
Mistral-format support for compressed-tensors (#16803)
mgoin Apr 23, 2025
6317a51
Categorize `tests/kernels/` based on kernel type (#16799)
mgoin Apr 23, 2025
f7912cb
[Doc] Add top anchor and a note to quantization/bitblas.md (#17042)
windsonsea Apr 23, 2025
53c0fa1
Ensure that `pid` passed to `kill_process_tree` is `int` for `mypy` (…
hmellor Apr 23, 2025
af869f6
[CI] Update structured-output label automation (#17055)
russellb Apr 23, 2025
8e630d6
Improve Transformers backend model loading QoL (#17039)
hmellor Apr 23, 2025
f3a21e9
`CacheConfig.block_size` should always be `int` when used (#17052)
hmellor Apr 23, 2025
bdb3660
Use `@property` and private field for `data_parallel_rank_local` (#17…
hmellor Apr 23, 2025
3cde34a
[Frontend] Support guidance:no-additional-properties for compatibilit…
tjohnson31415 Apr 23, 2025
32d4b66
[BugFix][V1] Fix int32 token index overflow when preparing input ids …
sarckk Apr 23, 2025
41fb013
[V1][Spec Decode] Always use argmax for sampling draft tokens (#16899)
WoosukKwon Apr 23, 2025
b07d741
[CI/Build] workaround for CI build failure (#17070)
csy1204 Apr 23, 2025
6b2427f
[Quantization]add prefix for commandA quantized model (#17017)
CXIAAAAA Apr 24, 2025
46e678b
[Minor] Use larger batch sizes for A100/B100/B200/MI300x (#17073)
WoosukKwon Apr 24, 2025
ed50f46
[Bugfix] Enable V1 usage stats (#16986)
mgoin Apr 24, 2025
2c8ed8e
More informative error when using Transformers backend (#16988)
hmellor Apr 24, 2025
ed2e464
Addendum Fix to support FIPS enabled machines with MD5 hashing (#17043)
sydarb Apr 24, 2025
6167c0e
[Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak …
zhangyuygss Apr 24, 2025
db2f8d9
[V1] Update structured output (#16812)
reidliu41 Apr 24, 2025
9c1244d
[doc] update to hyperlink (#17096)
reidliu41 Apr 24, 2025
2bc0f72
Add docs for runai_streamer_sharded (#17093)
omer-dayan Apr 24, 2025
b411418
[Chore] Remove Sampler from Model Code (#17084)
WoosukKwon Apr 24, 2025
14288d1
Disable enforce_eager for V1 TPU sampler and structured output tests …
mgoin Apr 24, 2025
0a05ed5
Simplify `TokenizerGroup` (#16790)
hmellor Apr 24, 2025
a9138e8
Fix OOT registration test (#17099)
hmellor Apr 24, 2025
c0dfd97
[V1][PP] Optimization: continue scheduling prefill chunks (#17080)
ruisearch42 Apr 24, 2025
b0c1f62
[Misc] Remove OLMo2 config copy (#17066)
Isotr0py Apr 24, 2025
21f4f1c
Improve static type checking in `LoRAModelRunnerMixin` (#17104)
hmellor Apr 24, 2025
b724afe
[V1][Structured Output] Clear xgrammar compiler object when engine co…
shen-shanshan Apr 24, 2025
67309a1
[Frontend] Using matryoshka_dimensions control the allowed output dim…
noooop Apr 24, 2025
82e43b2
Add missing rocm_skinny_gemms kernel test to CI (#17060)
mgoin Apr 24, 2025
1bcbcbf
[Misc] refactor example series - structured outputs (#17040)
reidliu41 Apr 24, 2025
340d7b1
[V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_positi…
markmc Apr 24, 2025
4115f19
[CI] Add automation for the `tool-calling` github label (#17118)
russellb Apr 24, 2025
5adf6f6
Updating builkite job for IBM Power (#17111)
AaruniAggarwal Apr 24, 2025
49f1894
existing torch installation pip command fix for docs (#17059)
atilla00 Apr 24, 2025
47bdee4
Molmo Requirements (#17026)
Eyshika Apr 24, 2025
0422ce1
Add `:markdownhelp:` to `EngineArgs` docs so markdown docstrings rend…
hmellor Apr 24, 2025
0fa939e
Improve configs - `LoRAConfig` + `PromptAdapterConfig` (#16980)
hmellor Apr 24, 2025
6d0df0e
[Docs] Generate correct github links for decorated functions (#17125)
russellb Apr 24, 2025
fe92176
Add collective_rpc to llm engine (#16999)
yinghai Apr 24, 2025
05e1fbf
Add chat template for Llama 4 models (#16428)
maxdebayser Apr 24, 2025
583e900
[Misc] Add example to run DeepSeek with Ray Serve LLM (#17134)
ruisearch42 Apr 24, 2025
9420a1f
Better error message for missing mistral params.json (#17132)
mgoin Apr 24, 2025
0d6e187
Use custom address for listening socket (#15988)
jglaser Apr 25, 2025
eef3647
[FEAT] [ROCm]: AITER Fused MOE V1 Support (#16752)
vllmellm Apr 25, 2025
41ca7eb
[Attention] FA3 decode perf improvement - single mma warp group suppo…
LucasWilkinson Apr 25, 2025
69bff9b
fix float16 support for kimi-vl (#17156)
zhouzaida Apr 25, 2025
7a0a9da
[Doc] V1 : Update LoRA status (#17133)
varun-sundar-rabindranath Apr 25, 2025
6498189
[Docs] Fix True->true in supported_models.md (#17141)
mgoin Apr 25, 2025
6ca0234
Move missed `SchedulerConfig` args into scheduler config group in `En…
hmellor Apr 25, 2025
5aa6efb
[Misc] Clean up redundant code in uniproc_executor.py (#16762)
lifuhuang Apr 25, 2025
2f54045
[Bugfix][Misc] Use TritonPlaceholderModule to defensively import trit…
MengqingCao Apr 25, 2025
881f735
[Misc] Benchmark Serving Script Support Appending Results (#17028)
LucasWilkinson Apr 25, 2025
b22980a
[Perf]Optimize rotary_emb implementation to use Triton operator for i…
cynthieye Apr 25, 2025
6aae216
[Bugfix] remove fallback in guided_json (int range, patterns) (#16725)
csy1204 Apr 25, 2025
a41351f
[Quantization][FP8] Add support for FP8 models with input_scale for o…
rasmith Apr 25, 2025
ef19e67
[Doc] Add headings to improve gptqmodel.md (#17164)
windsonsea Apr 25, 2025
fc966e9
Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 (#1…
houseroad Apr 25, 2025
f851b84
[Doc] Add two links to disagg_prefill.md (#17168)
windsonsea Apr 25, 2025
7feae92
[Doc] Move todo out of beam search docstring (#17183)
alex-jw-brooks Apr 25, 2025
19dcc02
[Bugfix] Fix mistral model tests (#17181)
DarkLight1337 Apr 25, 2025
d5615af
[Bugfix] Fix Mistral ChatCompletionRequest Body Exception (#16769)
JasmondL Apr 25, 2025
0bd7f8f
Bump Transformers to 4.51.3 (#17116)
hmellor Apr 25, 2025
423e9f1
Use Transformers helper `get_text_config()` instead of checking for `…
hmellor Apr 25, 2025
df5c879
[doc] update wrong hf model links (#17184)
reidliu41 Apr 25, 2025
9d98ab5
[Misc] Inline Molmo requirements (#17190)
DarkLight1337 Apr 25, 2025
a5450f1
[Security] Use safe serialization and fix zmq setup for mooncake pipe…
russellb Apr 25, 2025
48cb210
[V1] Move usage stats to worker and start logging TPU hardware (#16211)
dyli-google Apr 25, 2025
43faa04
[Bugfix] Fix hybrid model tests (#17182)
DarkLight1337 Apr 25, 2025
65e262b
Fix Python packaging edge cases (#17159)
tiran Apr 25, 2025
7011645
[BugFix][Frontend] Fix `LLM.chat()` tokenization (#16081)
njhill Apr 25, 2025
a0e619e
[V1][Spec Decode] EAGLE-3 Support (#16937)
benchislett Apr 25, 2025
c53e073
[Misc] Refine ray_serve_deepseek example (#17204)
ruisearch42 Apr 25, 2025
8de2901
[Bugfix] gemma[2,3] interleaved attention when sliding window is disa…
heheda12345 Apr 26, 2025
68af5f6
[AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it i…
rasmith Apr 26, 2025
5e83a72
[v1] [P/D] Adding LMCache KV connector for v1 (#16625)
ApostaC Apr 26, 2025
a6e72e1
[Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env (#17142)
jamesjwu Apr 26, 2025
c8e5be3
[MISC][AMD] Add unused annotation to rocm kernel file (#17097)
houseroad Apr 26, 2025
537d5ee
[doc] add Anything LLM integration (#17216)
reidliu41 Apr 26, 2025
1cf0719
[Minor][Spec Decode] Add use_eagle to SpeculativeConfig (#17213)
WoosukKwon Apr 26, 2025
7bd0c77
[Doc] Minor fix for the vLLM TPU setup page (#17206)
yarongmu-google Apr 26, 2025
b278911
[Minor][Models] Fix Return Types of Llama & Eagle (#17220)
WoosukKwon Apr 26, 2025
9e96f56
Allocate kv_cache with stride order (#16605)
wenscarl Apr 26, 2025
54271bb
[ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. (#17011)
charlifu Apr 26, 2025
53e8cf5
[V1][Metrics] Allow V1 AsyncLLM to use custom logger (#14661)
liuzijing2014 Apr 26, 2025
b07bf83
[BugFix] Avoid race conditions in zero-copy tensor transmission (#17203)
njhill Apr 26, 2025
513f074
[CI/test] Fix Eagle Correctness Test (#17209)
WoosukKwon Apr 26, 2025
df6f3ce
[Core] Remove prompt string from engine core data structures (#17214)
njhill Apr 26, 2025
8c1c926
[Bugfix] Fix missing int type for `-n` in multi-image example (#17223)
Isotr0py Apr 26, 2025
909fdaf
[Bugfix] Fix standard models tests (#17217)
DarkLight1337 Apr 26, 2025
c48334d
[Hardware][Intel-Gaudi] Update hpu-extension and update bucketing sys…
adobrzyn Apr 26, 2025
f8acd01
[V1] Add `structural_tag` support using xgrammar (#17085)
russellb Apr 26, 2025
dc2ceca
[BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set (#…
andyxning Apr 26, 2025
e782e0a
[Chore] added stubs for `vllm_flash_attn` during development mode (#1…
aarnphm Apr 26, 2025
52b4f4a
[Docs] Update structured output doc for V1 (#17135)
russellb Apr 26, 2025
10fd1d7
[Bugfix] fix error due to an uninitialized tokenizer when using `skip…
junstar92 Apr 26, 2025
4d17e20
Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACH…
houseroad Apr 26, 2025
fd11a32
[MISC] rename interval to max_recent_requests (#14285)
andyxning Apr 26, 2025
de7eb10
[Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation (#16878)
imkero Apr 26, 2025
43eea29
[Minor] Fix lint error in main branch (#17233)
WoosukKwon Apr 26, 2025
3642c59
[CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh (#16271)
reidliu41 Apr 26, 2025
9869453
Update test_flash_attn.py (#17102)
ShuaibinLi Apr 26, 2025
8e4b351
[Kernel][Triton][FP8] Adding fp8 and variable length sequence support…
rasmith Apr 27, 2025
93a126f
[Misc] Make cached tokenizer pickle-compatible (#17048)
DarkLight1337 Apr 27, 2025
4283a28
[Bugfix] Fix QWen2 VL multimodal mapping (#17240)
jeejeelee Apr 27, 2025
838ceda
[Bugfix] Get a specific type of layer from forward context (#17222)
heheda12345 Apr 27, 2025
30215ca
[MISC] Use string annotation types for class definitions (#17244)
jianzs Apr 27, 2025
18445ed
[Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32,…
sfc-gh-zhwang Apr 27, 2025
756848e
[Bugfix] Fix Lora Name Parsing (#17196)
alex-jw-brooks Apr 27, 2025
ed7a29d
[NVIDIA] Support Cutlass MLA for Blackwell GPUs (#16032)
kaixih Apr 27, 2025
690fe01
[Feature] support sequence parallelism using compilation pass (#16155)
cascade812 Apr 27, 2025
d92879b
[doc] Add feature status legend (#17257)
reidliu41 Apr 27, 2025
4213475
[Metrics] Fix minor inconsistencies in bucket progression (#17262)
DarkLight1337 Apr 27, 2025
20e489e
[V1][Spec Decode] Make eagle compatible with prefix caching. (#17137)
LiuXiaoxuanPKU Apr 27, 2025
d8bccde
[BugFix] Fix vllm_flash_attn install issues (#17267)
LucasWilkinson Apr 28, 2025
d1aeea7
[Bugfix] Fix missing ARG in Dockerfile for arm64 platforms (#17261)
lkm-schulz Apr 28, 2025
c12df53
[Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c…
Ther-LF Apr 28, 2025
cb3f2d8
[Bugfix] Fix Mistral3 spatial merge error (#17270)
mgoin Apr 28, 2025
9053d0b
[Doc] Fix wrong github link in LMCache examples (#17274)
KuntaiDu Apr 28, 2025
f211331
[Doc] small fix (#17277)
reidliu41 Apr 28, 2025
8262a3e
[Misc] Validate `stop_token_ids` contents (#17268)
njhill Apr 28, 2025
7fcc422
[Minor][Models] Pass partial_rotary_factor parameter to rope (#17266)
Eviannn Apr 28, 2025
aec9674
[Core] Remove legacy input mapper/processor from V0 (#15686)
DarkLight1337 Apr 28, 2025
fa93cd9
[Model] Add Granite Speech Support (#16246)
alex-jw-brooks Apr 28, 2025
72c5b97
Update tpu_worker.py 's typo (#17288)
idouba Apr 28, 2025
fb1c933
Add missing class docstring for `PromptAdapterConfig` (#17302)
hmellor Apr 28, 2025
344e193
[Bugfix] Add missing `get_language_model` to new MLLMs (#17300)
DarkLight1337 Apr 28, 2025
3ad986c
[doc] update wrong model id (#17287)
reidliu41 Apr 28, 2025
889ebb2
[Misc] Minor typo/grammar in `platforms/interface.py` (#17307)
NickLucche Apr 28, 2025
8b464d9
[Misc] Clean up Qwen2.5-Omni code (#17301)
DarkLight1337 Apr 28, 2025
bf36270
add Dockerfile.rocm.ubi
dtrifiro Mar 24, 2025
ae418bd
Dockerfile.rocm.ubi: improvements
dtrifiro Mar 24, 2025
72dfe4c
[Docs] Add a security guide (#17230)
russellb Apr 28, 2025
1e358ff
add ROCm dockerfile (#205)
dtrifiro Apr 28, 2025
f948869
Improve conversion from dataclass configs to argparse arguments (#17303)
hmellor Apr 28, 2025
b6dd32a
Make name of `compressed-tensors` quant method consistent across vLLM…
hmellor Apr 28, 2025
c7941cc
Explicitly explain quant method override ordering and ensure all over…
hmellor Apr 28, 2025
a0304dc
[Security] Don't bind tcp zmq socket to all interfaces (#17197)
russellb Apr 28, 2025
2c89cd9
[Chore] cleanup license indicators in light of SPDX (#17259)
aarnphm Apr 28, 2025
cc5befb
[BugFix] Fix cascade attention - RuntimeError: scheduler_metadata mus…
LucasWilkinson Apr 28, 2025
ed24620
[Bugfix] Fix moe weight losing all extra attrs after `process_weights…
charlifu Apr 28, 2025
dcbac4c
[Model] Qwen3 Dense FP8 Compat Fixes (#17318)
simon-mo Apr 28, 2025
ba41cc9
[Model] Add tuned triton fused_moe configs for Qwen3Moe (#17328)
mgoin Apr 28, 2025
cc463fe
Merge branch 'tag-upstream-v0.8.5' into upstream-v0.8.5
heyselbi Apr 29, 2025
88c65ab
add snyk security scan (#217)
andy-neuma Apr 29, 2025
b357f89
Dockerfile.rocm.ubi: use default torch index ROCm (do not use nightli…
dtrifiro Apr 30, 2025
5cc9ebf
Dockerfile.rocm.ubi: fix torch extra index url
dtrifiro Apr 30, 2025
e335c34
[BugFix] Fix Memory Leak (#17567)
robertgshaw2-redhat May 2, 2025
f8db0bd
[BugFix][Attention] Fix sliding window attention in V1 giving incorre…
LucasWilkinson May 2, 2025
31c73ba
[Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_…
chaunceyjiang Apr 30, 2025
1fe447d
Bump Compressed Tensors version to 0.9.4 (#17478)
rahul-tuli Apr 30, 2025
ba1713a
[model] make llama4 compatible with pure dense layers (#17315)
luccafong Apr 29, 2025
278cc0f
[Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument (#17677)
varun-sundar-rabindranath May 6, 2025
9bbd1ab
Merge remote-tracking branch 'nm-fork/main' into upstream-v0.8.5
heyselbi May 7, 2025
7db80a0
Sync midstream to upstream v0.8.5.post1 (#218)
andy-neuma May 7, 2025
f2481f8
bump cuda to 12-8
wseaton May 7, 2025
998a180
bump cuda to 12-8 (#223)
wseaton May 7, 2025
b023d4f
Merge tag 'midstream-v0.8.5.0' into nm-vllm-ent-0.8.5-sync
ckhordiasma May 14, 2025
Files changed
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
 model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
 tasks:
@@ -1,3 +1,4 @@
+# For hf script, without -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5
 model_name: "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform"
 tasks:
@@ -1,3 +1,4 @@
+# For hf script, without -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
 model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
 model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
 tasks:
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
+# For hf script, without -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5
 model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
 tasks:
 - name: "gsm8k"
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
 model_name: "HandH1998/QQQ-Llama-3-8b-g128"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
 model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
 model_name: "mgoin/Minitron-4B-Base-FP8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
 model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
 model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
 tasks:
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
+# For hf script, without -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5
 model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
 tasks:
 - name: "gsm8k"
@@ -1,11 +1,12 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 -b auto -l 1319 -f 5 -t 1
 model_name: "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16"
 tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.31
+    value: 0.30
   - name: "exact_match,flexible-extract"
-    value: 0.47
+    value: 0.465
   limit: 1319
   num_fewshot: 5
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
 model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
 model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
 model_name: "Qwen/Qwen2-57B-A14B-Instruct"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
 model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
 tasks:
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -16,7 +16,7 @@
 import pytest
 import yaml

-RTOL = 0.05
+RTOL = 0.08
 TEST_DATA_FILE = os.environ.get(
     "LM_EVAL_TEST_DATA_FILE",
     ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
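For context, this `RTOL` bump loosens every baseline comparison driven by the configs above: a measured score now passes if it lands within 8% (relative) of the recorded `value`, up from 5%. Below is a minimal sketch of that gating logic; `RTOL`, `TEST_DATA_FILE`, and the `tasks`/`metrics`/`value` layout come straight from these diffs, while `within_tolerance`, `check_results`, and the shape of the `results` dict are hypothetical stand-ins for however the harness actually reports scores.

```python
import os

import yaml

RTOL = 0.08  # relative tolerance used by the correctness test (bumped from 0.05 here)
TEST_DATA_FILE = os.environ.get(
    "LM_EVAL_TEST_DATA_FILE",
    ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")


def within_tolerance(expected: float, measured: float, rtol: float = RTOL) -> bool:
    # A measured score passes if it deviates from the recorded baseline
    # by at most rtol * |expected| (numpy.isclose semantics with atol=0).
    return abs(measured - expected) <= rtol * abs(expected)


def check_results(results: dict) -> None:
    # `results` maps task name -> metric name -> measured score; how those
    # scores are produced (an lm-eval-harness run) is outside this sketch.
    with open(TEST_DATA_FILE) as f:
        config = yaml.safe_load(f)
    for task in config["tasks"]:
        for metric in task["metrics"]:
            expected = metric["value"]
            measured = results[task["name"]][metric["name"]]
            assert within_tolerance(expected, measured), (
                f"{task['name']}/{metric['name']}: {measured} vs {expected}")
```

Under this tolerance the Qwen1.5-MoE strict-match baseline of 0.30, for instance, would accept measured scores roughly between 0.276 and 0.324.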
15 changes: 15 additions & 0 deletions .buildkite/release-pipeline.yaml
@@ -86,3 +86,18 @@ steps:
       - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
     env:
       DOCKER_BUILDKIT: "1"
+
+  - block: "Build Neuron release image"
+    key: block-neuron-release-image-build
+    depends_on: ~
+
+  - label: "Build and publish Neuron release image"
+    depends_on: block-neuron-release-image-build
+    agents:
+      queue: neuron-postmerge
+    commands:
+      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:latest --progress plain -f docker/Dockerfile.neuron ."
+      - "docker push public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:$(buildkite-agent meta-data get release-version)"
+    env:
+      DOCKER_BUILDKIT: "1"
7 changes: 7 additions & 0 deletions .buildkite/scripts/hardware_ci/run-amd-test.sh
@@ -98,6 +98,13 @@ if [[ $commands == *" kernels "* ]]; then
   --ignore=kernels/test_machete_mm.py \
   --ignore=kernels/test_mha_attn.py \
   --ignore=kernels/test_block_fp8.py \
+  --ignore=kernels/test_cutlass_moe.py \
+  --ignore=kernels/test_mamba_ssm_ssd.py \
+  --ignore=kernels/test_attention.py \
+  --ignore=kernels/test_block_int8.py \
+  --ignore=kernels/test_fused_quant_layernorm.py \
+  --ignore=kernels/test_int8_kernel.py \
+  --ignore=kernels/test_triton_moe_ptpc_fp8.py \
   --ignore=kernels/test_permute_cols.py"
 fi

35 changes: 33 additions & 2 deletions .buildkite/scripts/hardware_ci/run-cpu-test-ppc64le.sh
@@ -5,10 +5,41 @@
 set -ex

 # Setup cleanup
-remove_docker_container() { docker rm -f cpu-test || true; docker system prune -f; }
+remove_docker_container() {
+  if [[ -n "$container_id" ]]; then
+    podman rm -f "$container_id" || true
+  fi
+  podman system prune -f
+}
 trap remove_docker_container EXIT
 remove_docker_container

 # Try building the docker image
-docker build -t cpu-test -f docker/Dockerfile.ppc64le .
+podman build -t cpu-test-ubi9-ppc -f docker/Dockerfile.ppc64le .
+
+# Run the image
+container_id=$(podman run -itd --entrypoint /bin/bash -v /tmp/:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN cpu-test-ubi9-ppc)
+
+function cpu_tests() {
+
+  # offline inference
+  podman exec -it "$container_id" bash -c "
+    set -e
+    python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m"
+
+  # Run basic model test
+  podman exec -it "$container_id" bash -c "
+    set -e
+    pip install pytest pytest-asyncio einops peft Pillow soundfile transformers_stream_generator matplotlib
+    pip install sentence-transformers datamodel_code_generator
+    pytest -v -s tests/models/embedding/language/test_cls_models.py::test_classification_models[float-jason9693/Qwen2.5-1.5B-apeach]
+    pytest -v -s tests/models/embedding/language/test_embedding.py::test_models[half-BAAI/bge-base-en-v1.5]
+    pytest -v -s tests/models/encoder_decoder/language -m cpu_model"
+}
+
+# All of CPU tests are expected to be finished less than 40 mins.
+
+export container_id
+export -f cpu_tests
+timeout 40m bash -c cpu_tests
11 changes: 9 additions & 2 deletions .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh
@@ -17,10 +17,13 @@ source /etc/environment
 docker run --privileged --net host --shm-size=16G -it \
     -e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
     vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
-    && python3 -m pip install pytest \
+    && python3 -m pip install pytest pytest-asyncio tpu-info \
     && python3 -m pip install lm_eval[api]==0.4.4 \
     && export VLLM_XLA_CACHE_PATH= \
     && export VLLM_USE_V1=1 \
+    && export VLLM_XLA_CHECK_RECOMPILATION=1 \
+    && echo HARDWARE \
+    && tpu-info \
     && echo TEST_0 \
     && pytest -v -s /workspace/vllm/tests/v1/tpu/test_perf.py \
     && echo TEST_1 \
@@ -40,7 +43,11 @@
     && echo TEST_8 \
     && pytest -s -v /workspace/vllm/tests/v1/tpu/test_topk_topp_sampler.py \
     && echo TEST_9 \
-    && pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py" \
+    && pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py \
+    && echo TEST_10 \
+    && pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py \
+    && echo TEST_11 \
+    && pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py" \

 # TODO: This test fails because it uses RANDOM_SEED sampling