
Commit bc7d46c

Authored by: msinnha1, Edenzzzz, Alcanderian, guoyuhong, saltyfish66
Rebase 4_6_post_4 to master_next (sgl-project#47)
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)
* [fix] fix potential bumpy throughput with deepgemm (sgl-project#5722)
* Resolve the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)
* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)
* Fix the non-existent access of `decrypted_config_file` (sgl-project#5685)
* CI: rewrite test_vision_chunked_prefill to speed it up (sgl-project#5682)
* Fuse MLA set kv cache kernel (sgl-project#5748)
* Update AMD docker image to `sglang:v0.4.5.post3-rocm630` (sgl-project#5697)
* [feature] support for roberta embedding models (sgl-project#5730)
* [fix] fix bench_one_batch_server (sgl-project#5607)
* Support the DeepSeek model by enabling streaming response parsing (sgl-project#5592)
* fix: use `is not None` instead of `!= None` for None checks (sgl-project#5687)
* Add Llama 4 to FA3 test (sgl-project#5509)
* [misc] more decode step logging for batch_one_batch (sgl-project#5565)
* Handle JSONDecodeError while processing request data (sgl-project#5599)
* fix(srt): check if sample_indices is not None before usage (sgl-project#5633)
* Update llguidance to 0.7.11; add StructTag (sgl-project#4870)
* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)
* Add memory_saver check (sgl-project#4986) (Signed-off-by: Kebe)
* Add switch to disable OpenAPI doc (sgl-project#3744) (Signed-off-by: congcongke)
* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)
* Fix eagle test case (sgl-project#5776)
* Split local attention test from fa3 test (sgl-project#5774)
* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)
* Simplify FA3 tests (sgl-project#5779)
* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)
* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)
* [CI] Tune threshold (sgl-project#5787)
* [CI] Fix port conflicts (sgl-project#5789)
* [CI] Fix CI tests (sgl-project#5769)
* [PD] Reduce kv transfer threads (sgl-project#5791)
* [CI] Fix test case (sgl-project#5790)
* Add 8-GPU test for Deepseek-V3 (sgl-project#5691) (Co-authored-by: Lianmin Zheng)
* Release v0.4.6 (sgl-project#5795)
* Update nightly-test.yml (sgl-project#5797)
* [CI] Improve GitHub summary & enable fa3 for more models (sgl-project#5796)
* [Docs] Update Grafana setup guide in production metrics (sgl-project#5643) (Co-authored-by: NoahM)
* [Misc] Add structured logging, write-to-file, and log tracing for SGL Router
* Improve overlap scheduling (sgl-project#5788)
* Add Cutlass MLA attention backend (sgl-project#5390)
* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)
* Dockerfile.dev: pip install scikit_build_core (sgl-project#5807)
* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)
* Turn on overlap scheduler for multimodal models (sgl-project#5771)
* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)
* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)
* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)
* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551) (Co-authored-by: shuaills, Chayenne, Lianmin Zheng, Yineng Zhang)
* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)
* feat: add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)
* fused moe triton tuning script: support qwen3 (sgl-project#5842)
* feat: add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)
* [PD] Support pd fake transfer for warmup (sgl-project#5726)
* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)
* [Doc] Recover history of server_arguments.md (sgl-project#5851)
* feat: add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)
* [CI] Test chunked prefill more (sgl-project#5798)
* ROCm: update AITER (sgl-project#5816)
* [Feat] QWen-1M context support [1/2]: update block sparse attention backend utils kernel (sgl-project#5847) (Co-authored-by: sighingnow)
* [Fix] Missing bootstrap_port field (sgl-project#5823)
* feat: update is_fa3_default_architecture (sgl-project#5854)
* Add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)
* chore: bump v0.4.6.post1 (sgl-project#5845)
* Support `max_completion_tokens` for OpenAIChatCompletions (sgl-project#5857)
* Simplify fused_moe config logging (sgl-project#5801)
* [CI] Tune the test order to warm up the server (sgl-project#5860)
* Cutlass MLA decode: fix dtype error (sgl-project#5868)
* cutlass 3.9 supported to improve fp8_blockwise_gemm (sgl-project#5820)
* [Feature] Support auto chat template (sgl-project#4949)
* Feat: support cuda graph for LoRA (sgl-project#4115) (Co-authored-by: Beichen Ma)
* Add qwen3 30b fused moe config (sgl-project#5859)
* [Fix] Fix a bug for flashmla to run R1 model (sgl-project#5875) (Co-authored-by: pengcuo)
* Add A800 fused moe config for qwen3 30b (sgl-project#5880)
* [Misc] Add service discovery for sgl router
* [fix] PyO3 macOS linking and consolidate on tracing for logging
* chore: update Dockerfile (sgl-project#5894)
* [Docs] Update docs for Qwen3 and Qwen3MoE (sgl-project#5836)
* [Doc] Tables instead of bullet points for sampling doc (sgl-project#5841)
* chore: update CODEOWNERS (sgl-project#5895)
* [FEATURE] Enhance platform compatibility for ARM (sgl-project#5746)
* [CI] Add test_function_calling.py to run_suite.py (sgl-project#5896)
* Auto set draft model path for MTP (sgl-project#5793)
* [fix] Relax mem_fraction_static for h200 (sgl-project#5893) (Co-authored-by: alcanerian)
* feat: support pythonic tool call and index in tool call streaming (sgl-project#5725)
* [Bugfix] Fix missing queue_time_start for requests from grammar_queue (sgl-project#5696)
* Add AMD MI300x nightly testing (sgl-project#5861)
* chore: use torch 2.6 for sgl-kernel build (sgl-project#5898)
* Fix check_env script (sgl-project#5901)
* [PD] Fix "Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels" sgl-project#134 (sgl-project#5830)
* Bump Flashinfer to 0.2.5 (sgl-project#5870) (Co-authored-by: Yuhao Chen)
* [Fix] Unload lora in HF_Runner if needed (sgl-project#5899)
* Add A800 fused moe config for qwen3 235b (sgl-project#5900)
* Add sm_120 for blackwell (sgl-project#5903)
* [Feature] Add support for kimi vl model (sgl-project#5383) (Co-authored-by: wenju.li)
* Support vlm benchmark profile (sgl-project#5905)
* [fix] kimi-vl test in test_vision_openai_server.py (sgl-project#5910)
* [Misc] Use parallel build for cmake in sgl-kernel (sgl-project#5919)
* [qwen3] Support qwen3 ep moe (sgl-project#5917) (Co-authored-by: sleepcoo)
* Add TP2 MOE benchmarks for AMD (sgl-project#5909)
* [Feat] Scale up fa3 kernel to sm8x arch (sgl-project#5912) (Co-authored-by: zhyncs)
* chore: bump sgl-kernel 0.1.1 (sgl-project#5932)
* chore: upgrade sgl-kernel 0.1.1 (sgl-project#5933)
* Remove unused method `calculate_num_image_tokens` from qwen2_vl.py (sgl-project#5783)
* [PP] Add pipeline parallelism (sgl-project#5724)
* Fix lora batch processing when input lora_path contains None (sgl-project#5930)
* Add Thor & Spark (sgl-project#5915)
* fix: correct stream response when enable_thinking is set to false (sgl-project#5881)
* fix: update model runner (sgl-project#5934)
* chore: bump v0.4.6.post2 (sgl-project#5939)
* Support XiaomiMiMo/MiMo model inference (sgl-project#5921)
* [PD] Vectorise group_concurrent_contiguous in NumPy (sgl-project#5834) (Co-authored-by: luoyuan.luo)
* Remove extra contiguous (sgl-project#5953)
* Update CI test and doc for MTP api change (sgl-project#5952)
* docs: fix Qwen model typo (sgl-project#5944) (Signed-off-by: JiangJiaWei1103)
* Optimize a pad operation to save 25us (sgl-project#5945)
* Properly return error response in vertex_generate HTTP endpoint (sgl-project#5956)
* feat: add concurrency evaluation logic in mmmu benchmark (sgl-project#5782)
* Add 1-GPU perf and 2-GPU accuracy tests for AMD MI300x CI (sgl-project#5960)
* feat: refactor DeepSeekV3 function call (sgl-project#5908)
* Remove token-in-token-out in Native API (sgl-project#5967)
* Support InternVL3 (sgl-project#5350) (Co-authored-by: Mick, Chayenne)
* Support MMMU benchmark for InternVL (sgl-project#5968)
* FA3 speedup: skip len operation and get batch size directly from forward batch (sgl-project#5969) (Signed-off-by: Lifu Huang)
* [PD] NIXL backend Prefill TP & Decode TP+DP (sgl-project#5681)
* Fix set kv cache multi-stream (sgl-project#5975)
* Overlap qk norm with two streams (sgl-project#5977)
* fix: only upgrade nccl for cu128 (sgl-project#5986)
* Fix Phi3 serving, which was broken by an earlier change (sgl-project#5991) (Co-authored-by: Lifu Huang)
* [perf] H100 DeepSeek-V3 fused moe tuned config (sgl-project#5998)
* [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (sgl-project#5992)
* [Minor] Fix duplicate method definitions in conversation.py (sgl-project#6012) (Signed-off-by: Lifu Huang)
* Fix flaky issues of lora and add multi-batch tests (sgl-project#5957)
* Tool Call: add `chat_template_kwargs` documentation (sgl-project#5679)
* fix: fix broadcast_pyobj breaking VerlEngine (sgl-project#5997)
* [PD] Allow customizing reserved tokens to avoid KV cache waste (sgl-project#6002)
* Update dev container config to support live code sync and improve docker setup guide (sgl-project#6018) (Signed-off-by: Lifu Huang)
* [PD] Optimize disaggregation ib device help info (sgl-project#5781)
* [Test] Add flashmla attention backend test (sgl-project#5587)
* Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (sgl-project#5555)
* feat: add a unified merge_state API (sgl-project#5428)
* feat: append more comprehensive fields in messages instead of merely role and content (sgl-project#5996)
* [Security][Bug] Prevent binding to all TCP interfaces (sgl-project#5752)
* Fix prefill OOM error in the case of large page size (sgl-project#5081)
* Fix problem of large page size with chunked prefill (sgl-project#6046)
* docs: add Google Cloud Vertex AI in Adoption and Sponsorship (sgl-project#6047)
* docs: add new blog (sgl-project#6048)
* Fix missing "import os" (sgl-project#6057)
* Better PD initialization (sgl-project#5751)
* fix: deepep dockerfile, use pip install deepep (sgl-project#5885)
* [Fix] Fix and rename flashmla CI test (sgl-project#6045)
* chore: upgrade cutlass 3.9.2 (sgl-project#6004) (Co-authored-by: yizhang2077)
* Fix sgl-kernel build on aarch64 platforms (sgl-project#6062)
* Add DeepEP to CI PR test (sgl-project#5655) (Co-authored-by: Jinyan Chen)
* Fix custom_allreduce namespace (sgl-project#6039)
* feat: add release workflow for SGLang kernels on aarch64 (sgl-project#6010) (Co-authored-by: Qiaolin-Yu, Yineng Zhang)
* [Feature] Support for Ascend NPU backend (sgl-project#3853) (Signed-off-by: Song Zhang; Co-authored-by: 22dimensions)
* Fix the timeout for 8-GPU tests (sgl-project#6084)
* Hint users that DeepEP normal mode is incompatible with CUDA Graph (sgl-project#5014)
* Super tiny doc fix (sgl-project#5233)
* [Doc] Fix description for dp_size argument (sgl-project#6063)
* feat(engine): add bootstrap parameters to generate methods (dynamo) (sgl-project#6075)
* [refactor] Slightly tidy fp8 module (sgl-project#5993)
* Clean up fa3 test from 8 gpus (sgl-project#6105)
* Defer 8-GPU test (sgl-project#6102)
* Update doc for MLA attention backends (sgl-project#6034)
* Clean logs for DeepSeek-V3 launching (sgl-project#6079)
* [CI] Add performance CI for VLM (sgl-project#6038) (Signed-off-by: Xinyuan Tong)
* Add Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (sgl-project#6111)
* Optimize pad operations in fa3 to save 100+us (sgl-project#6077)
* Overlap shared expert and routed expert computations (sgl-project#5121)
* Tiny refactor ModelConfig.from_server_args (sgl-project#5219)
* Tiny refactor weight loading logic (sgl-project#5232)
* [PD] Add control to slow down a server (sgl-project#5572)
* Change AMD test threshold (sgl-project#6091)
* DeepEP normal support deepgemm-contiguous (sgl-project#5626) (Co-authored-by: Yingyi Huang, Cheng Wan, Xuting Zhou, ZhengHSI)
* [fix] Fix pyproject.toml dependencies (sgl-project#6119)
* [Feature] Add FlashAttention3 as a backend for VisionAttention (sgl-project#5764) (Co-authored-by: othame, Mick, Yi Zhang)
* [perf] dsv3 bmm fallback to bf16 (sgl-project#5662)
* [AMD] Switch to custom allreduce regardless of MSCCL setting on ROCm (sgl-project#6097)
* [sgl-kernel] fix: fix cu118 compile error (sgl-project#6123) (Co-authored-by: zhyncs)
* Upgrade xgrammar to 0.1.19 (sgl-project#6129)
* Remove unnecessary is_fa3_supported check (sgl-project#6112)
* chore: bump sgl-kernel 0.1.2 (sgl-project#6131)
* docs: update README (sgl-project#6132)
* [Fix] Incorrect memory allocation on CUDA:0 by non-zero CUDA processes in TP/DP (sgl-project#5745)
* Cutlass MLA: disable split kv due to NVIDIA/cutlass#2274 (sgl-project#6101)
* Opt flashinfer mla cat (sgl-project#5822) (Co-authored-by: xuyongfei.xyf)
* Update AMD nightly concurrency (sgl-project#6141)
* feat: add thinking_budget (sgl-project#6089)
* [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (sgl-project#6162)
* Fix bug where gpu0 occupies more memory when hicache is turned on (sgl-project#5778) (Co-authored-by: Zhiqiang Xie)
* chore: bump v0.4.6.post3 (sgl-project#6165)
* KV-Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (sgl-project#6016) (Co-authored-by: 继优, chus-chus, Zhiqiang Xie)
* [fix] Fix determine_n_share_experts_fusion (sgl-project#6118)
* Fix and clean up chat-template requirement for VLM (sgl-project#6114) (Signed-off-by: Xinyuan Tong)
* [Docs] Delete duplicate content (sgl-project#6146) (Co-authored-by: ximing.wxm)
* Revert "feat: add thinking_budget (sgl-project#6089)" (sgl-project#6181)
* Add async_encode method to Engine (sgl-project#4701)
* Fix data parallel perf regression (sgl-project#6183)
* Fix request abortion (sgl-project#6184)
* Add typo checker in pre-commit (sgl-project#6179) (Co-authored-by: Brayden Zhong)
* Remove duplicate IO Struct test (sgl-project#6180) (Signed-off-by: Emmanuel Ferdman)
* [PD] Add simple unit test for disaggregation feature (sgl-project#5654) (Signed-off-by: Shangming Cai)
* [CI] Temporarily disable deepep tests because they take too much time (sgl-project#6186)
* feat: support loogle eval (sgl-project#6190)
* [fix] Remove mixtral from is_fa3_default_architecture (sgl-project#6191)
* fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (sgl-project#6169)
* chore: upgrade deepgemm (sgl-project#6073)
* chore: bump sgl-kernel v0.1.2.post1 (sgl-project#6195)
* chore: upgrade sgl-kernel v0.1.2.post1 (sgl-project#6196) (Co-authored-by: alcanderian)
* Handle empty input string for embedding models (sgl-project#5621) (Co-authored-by: Ravi Theja Desetty)
* doc: fix the erroneous documents and example code for Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (sgl-project#6199)
* [Docs] Minor Qwen3 and reasoning parser docs fix (sgl-project#6032)
* Improve structured outputs: fix race condition, server crash, metrics and style (sgl-project#6188)
* [CI] Reorganize the 8-GPU tests (sgl-project#6192)
* Add dev-deepep docker image (sgl-project#6198)
* Replace time.time() with time.perf_counter() for benchmarking (sgl-project#6178) (Signed-off-by: Lifu Huang)
* Update README.md (sgl-project#6202)
* Fix release-docs.yml to not use python 3.9 (sgl-project#6204)
* Fix start_profile not supporting with_stack and record_shapes (sgl-project#6043)
* [doc] Add a note for the --n-share-experts-fusion arg (sgl-project#6154)
* Perform vocabulary parallelism for LM head across attention TP groups (sgl-project#5558) (Co-authored-by: liusy58)
* Update AMD CI docker to v0.4.6.post3-rocm630 (sgl-project#6213)
* Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (sgl-project#6201) (Co-authored-by: SangBin Cho)
* [CI] Fix PD mooncake dependency error (sgl-project#6212) (Signed-off-by: Shangming Cai)
* [CI] Re-enable pd disaggregation test (sgl-project#6231) (Signed-off-by: Shangming Cai)
* Fix some typos (sgl-project#6209) (Co-authored-by: Brayden Zhong)
* [Docs] Add docs for `SGLANG_` and `SGL_` environment variables (sgl-project#6206)
* [PP] Fix init_memory_pool desync & add PP for mixtral (sgl-project#6223)
* Revert "fix some typos" (sgl-project#6244)
* chore: add hf_xet dep (sgl-project#6243)
* Update AMD nightly deps (sgl-project#6241)
* [PD] Add support for different TP sizes per DP rank (sgl-project#5922) (Signed-off-by: Shangming Cai)
* Support incremental streaming of logprob/token_ids between scheduler and detokenizer (sgl-project#6225) (Co-authored-by: SangBin Cho)
* Fix typo (sgl-project#6248)
* Support tuning moe for llama 4 model (sgl-project#6042)
* Skip the flaky test_stateful_custom_logit_processor (sgl-project#6251)
* [Llama4] Add docs note about enabling multimodal (sgl-project#6235)
* [VERL Use Case] Add torch_memory_saver into deps (sgl-project#6247)
* Fix two issues related to `--moe-dense-tp-size=1` (sgl-project#5657) (Co-authored-by: liusy58, 颉沆)
* model(vlm): pixtral (sgl-project#5084)
* [misc] deep_gemm fallback to NVRTC when NVCC not found (sgl-project#6252)
* Enable MI325X AMD CI (sgl-project#6259)
* chore: bump v0.4.6.post4 (sgl-project#6245)
* Formatting fix for the rebased commit for 4.6.0_post4 (Signed-off-by: Mohit Sinha)
* Fix issues in model runner and python packages; fixes for the following issues: vLLM dependency for xgrammar==0.1.17; 'Scheduler' object has no attribute 'device'; 'pp_proxy_tensors' unexpected arg in HPUGraphRunner; TODO: add pipeline parallelism support in HPUGraphRunner (Signed-off-by: Mohit Sinha)
* Fix formatting in model runner (Signed-off-by: Mohit Sinha)
* Base grammar fix for the is_terminated case ('OutlinesGrammar' object has no attribute 'is_terminated') (Signed-off-by: Mohit Sinha)

---------

Signed-off-by: Kebe, congcongke, JiangJiaWei1103, Lifu Huang, Song Zhang, Xinyuan Tong, Emmanuel Ferdman, Shangming Cai, Mohit Sinha

Co-authored-by: Wenxuan Tan, JieXin Liang, Yuhong Guo, saltyfish66, vzed, Mick, Ke Bao, saienduri, DavidBao, Frankey_8080, Stefan He, yan97ao, aoshen524, Michał Moskal, lambert0312, Kebe, zhanweidu, Lianmin Zheng, Baizhou Zhang, Liangsheng Yin, Huapeng Zhou, NoahM, Simo Lin, Trevor Morris, Yineng Zhang, Xiaoyu Zhang, fzyzcjy, Michael Yao, mlmz, shuaills, Chayenne, XinyuanTong, yhyang201, ybyang, JiLi, HAI, PGFLMG, sighingnow, XTY, Yi Zhang, Chang Su, woodx, Qiaolin Yu, Beichen Ma, pengcuo, Adarsh Shirawalmath, simveit, Johnny, alcanerian, Yuhao Chen, zhjunqin, liwenju0, wenju.li, laixin, sleepcoo, Ying Sheng, ryang, Yuan Luo, luoyuan.luo, 江家瑋, KCFindstr, xm:D, Lifu Huang, Yongtong Wu, Junrong Lin, shangmingc, DefTruth, Zhiqiang Xie, Hank Han, Jinyan Chen, Song Zhang, 22dimensions, ishandhanani, Cheng Wan, Minglei Zhu, lukec, Yingyi Huang, Xuting Zhou, ZhengHSI, Zhu Chen, othame, Hubert Lu, Yixin Dong, xu-yfei, xuyongfei.xyf, thyecust, huangtingwei, Simon (Jiyou) Li, 继优, chus-chus, Ximingwang-09, ximing.wxm, Steven Shimizu, applesaucethebun, Brayden Zhong, Emmanuel Ferdman, Yusong Gao, alcanderian, Ravi Theja, Ravi Theja Desetty, liusy58, SangBin Cho, 颉沆, Kiv Chen
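One of the fixes above (sgl-project#5687) swaps `!= None` comparisons for `is not None`. A minimal sketch of why the identity check is the safer idiom, using a hypothetical class whose comparison operators misbehave:

```python
class AlwaysEqual:
    """Hypothetical class whose comparisons claim to match anything."""

    def __eq__(self, other):
        return True

    def __ne__(self, other):
        return False


obj = AlwaysEqual()

# An equality check can be fooled by a custom __eq__/__ne__:
looks_like_none = (obj != None)   # False, even though obj is a real object
# An identity check cannot, because it bypasses __eq__/__ne__ entirely:
really_is_none = (obj is None)    # False: obj and None are distinct objects

print(looks_like_none, really_is_none)
```

Identity checks also run slightly faster, since `is` compares object identity without any method dispatch.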
1 parent: 3e60d54 / commit: bc7d46c

File tree: 376 files changed (+18,558 / -4,465 lines)


.devcontainer/devcontainer.json

Lines changed: 7 additions & 1 deletion
@@ -20,5 +20,11 @@
   "runArgs": [
     "--gpus",
     "all"
-  ]
+  ],
+  // The two lines below ensures that your local changes in the sglang
+  // repo is automatically synced to the sglang pip package installed
+  // in the dev docker container. You can remove / comment out these
+  // two lines if you prefer to sync code changes manually.
+  "workspaceMount": "source=${localWorkspaceFolder},target=/sgl-workspace/sglang,type=bind",
+  "workspaceFolder": "/sgl-workspace/sglang"
 }

.github/CODEOWNERS

Lines changed: 4 additions & 4 deletions
@@ -4,16 +4,16 @@
 /python/sglang/lang @merrymercy @Ying1123 @hnyls2002 @ByronHsu
 /python/sglang/srt @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
 /python/sglang/srt/constrained @hnyls2002
-/python/sglang/srt/layers @merrymercy @Ying1123 @zhyncs @ispobock @HaiShaw
-/python/sglang/srt/lora @Ying1123
+/python/sglang/srt/layers @merrymercy @Ying1123 @zhyncs @ispobock @HaiShaw @ch-wan
+/python/sglang/srt/lora @Ying1123 @Fridge003
 /python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
 /python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
 /python/sglang/srt/model_executor @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock
 /python/sglang/srt/models @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
-/python/sglang/srt/openai_api @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
+/python/sglang/srt/openai_api @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu @CatherineSue
 /python/sglang/srt/sampling @merrymercy @hnyls2002
 /python/sglang/srt/speculative @Ying1123 @merrymercy @rkooo567 @kssteven418
 /test/lang @merrymercy @Ying1123 @ByronHsu
 /test/srt @merrymercy @Ying1123 @zhyncs
-/sgl-router @ByronHsu @Ying1123
+/sgl-router @ByronHsu @Ying1123 @slin1237
 /sgl-kernel @zhyncs @ispobock @HandH1998 @BBuf @yizhang2077 @merrymercy @yinfan98
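For readers unfamiliar with the CODEOWNERS format edited above: each line maps a path pattern to default reviewers, and when several patterns match a file, the last matching line takes precedence. A hypothetical fragment illustrating this:

```
# Order matters: for files under srt/lora, only @lora-owner is requested,
# because the later, more specific pattern overrides the earlier one.
/python/sglang/srt       @team-srt
/python/sglang/srt/lora  @lora-owner
```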

.github/workflows/execute-notebook.yml

Lines changed: 2 additions & 5 deletions
@@ -22,11 +22,6 @@ jobs:
       - name: Checkout code
         uses: actions/checkout@v4
 
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.9'
-
       - name: Install dependencies
         run: |
           bash scripts/ci_install_dependency.sh
@@ -35,6 +30,8 @@ jobs:
           apt-get install -y pandoc
           apt-get update && apt-get install -y parallel retry
 
+          ln -sf "$(which python3)" /usr/bin/python
+
       - name: Setup Jupyter Kernel
         run: |
           python -m ipykernel install --user --name python3 --display-name "Python 3"
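The second hunk above drops the `setup-python` step and instead symlinks the container's `python3` to `/usr/bin/python`, so later steps that invoke a bare `python` keep working. A sketch of the same trick, using a temporary directory instead of `/usr/bin` so that no root access is required:

```python
import os
import shutil
import subprocess
import sys
import tempfile

# Equivalent of: ln -sf "$(which python3)" /usr/bin/python
bindir = tempfile.mkdtemp()
python3 = shutil.which("python3") or sys.executable
os.symlink(python3, os.path.join(bindir, "python"))

# A bare `python` (resolved through our symlink) now runs python3.
result = subprocess.run(
    [os.path.join(bindir, "python"), "-c", "import sys; print(sys.version_info.major)"],
    capture_output=True, text=True,
)
print(result.stdout.strip())  # prints "3"
```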
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+name: Nightly Test (AMD)
+
+on:
+  schedule:
+    - cron: '0 0 * * *'
+  push:
+    branches:
+      - main
+    paths:
+      - "python/sglang/version.py"
+  workflow_dispatch:
+
+concurrency:
+  group: nightly-test-amd-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  nightly-test:
+    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
+    strategy:
+      matrix:
+        runner: [linux-mi300-gpu-2, linux-mi325-gpu-2-nightly]
+    runs-on: ${{matrix.runner}}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Setup docker
+        run: |
+          # Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG.
+          if [ -f "/etc/podinfo/gha-render-devices" ]; then
+            DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+          else
+            DEVICE_FLAG="--device /dev/dri"
+          fi
+          touch github_summary.md
+          docker pull lmsysorg/sglang:v0.4.6.post3-rocm630
+          docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
+            -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
+            --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
+            -w /sglang-checkout --name ci_sglang \
+            lmsysorg/sglang:v0.4.6.post3-rocm630
+
+      - name: Install dependencies
+        run: |
+          docker exec ci_sglang pip install --upgrade pip
+          docker exec ci_sglang pip uninstall sgl-kernel -y || true
+          docker exec -w /sglang-checkout/sgl-kernel ci_sglang bash -c "rm -f pyproject.toml && mv pyproject_rocm.toml pyproject.toml && python3 setup_rocm.py install"
+          docker exec ci_sglang pip install -e "python[dev_hip]"
+
+          docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
+          docker exec -w /human-eval ci_sglang pip install -e .
+          docker exec ci_sglang pip install huggingface_hub[hf_xet]
+
+      - name: Nightly Test
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" ci_sglang python3 run_suite.py --suite nightly-amd --timeout-per-file 7200
+          echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY
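The final step of this workflow collects the test suite's markdown report (written inside the container to `github_summary.md`) and appends it to the job's step summary. A minimal sketch of that hand-off, with temp files standing in for `github_summary.md` and `$GITHUB_STEP_SUMMARY`:

```python
import tempfile

# The suite writes its markdown report to a scratch file...
report = tempfile.NamedTemporaryFile("w", suffix=".md", delete=False)
report.write("| suite | status |\n| --- | --- |\n| nightly-amd | pass |\n")
report.close()

# ...and the workflow appends that file to the step-summary file,
# which is what `echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY` does.
summary = tempfile.NamedTemporaryFile("a", suffix=".md", delete=False)
with open(report.name) as src:
    summary.write(src.read())
summary.close()

with open(summary.name) as f:
    summary_text = f.read()
print("nightly-amd" in summary_text)  # prints True
```

Appending (rather than overwriting) matters because earlier steps may already have written their own sections to the same summary file.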

.github/workflows/pr-test-amd.yml

Lines changed: 195 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,10 @@ jobs:
2525
accuracy-test-1-gpu-amd:
2626
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
2727
github.event.pull_request.draft == false
28-
runs-on: linux-mi300-gpu-1
28+
strategy:
29+
matrix:
30+
runner: [linux-mi300-gpu-1, linux-mi325-gpu-1]
31+
runs-on: ${{matrix.runner}}
2932
steps:
3033
- name: Checkout code
3134
uses: actions/checkout@v4
@@ -38,12 +41,12 @@ jobs:
3841
else
3942
DEVICE_FLAG="--device /dev/dri"
4043
fi
41-
docker pull ghcr.io/saienduri/sglang-aiter-v0.1.1:428
44+
docker pull lmsysorg/sglang:v0.4.6.post3-rocm630
4245
docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
4346
-v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
4447
--cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
4548
-w /sglang-checkout --name ci_sglang \
46-
ghcr.io/saienduri/sglang-aiter-v0.1.1:428
49+
lmsysorg/sglang:v0.4.6.post3-rocm630
4750
4851
- name: Install dependencies
4952
run: |
@@ -66,10 +69,54 @@ jobs:
           docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 test_eval_fp8_accuracy.py
           docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 models/test_qwen_models.py

+  accuracy-test-2-gpu-amd:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+      github.event.pull_request.draft == false
+    strategy:
+      matrix:
+        runner: [linux-mi300-gpu-2, linux-mi325-gpu-2]
+    runs-on: ${{matrix.runner}}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Setup docker
+        run: |
+          # Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG.
+          if [ -f "/etc/podinfo/gha-render-devices" ]; then
+            DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+          else
+            DEVICE_FLAG="--device /dev/dri"
+          fi
+          docker pull lmsysorg/sglang:v0.4.6.post3-rocm630
+          docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
+            -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
+            --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
+            -w /sglang-checkout --name ci_sglang \
+            lmsysorg/sglang:v0.4.6.post3-rocm630
+
+      - name: Install dependencies
+        run: |
+          docker exec ci_sglang pip install --upgrade pip
+          docker exec ci_sglang pip uninstall sgl-kernel -y || true
+          docker exec -w /sglang-checkout/sgl-kernel ci_sglang bash -c "rm -f pyproject.toml && mv pyproject_rocm.toml pyproject.toml && python3 setup_rocm.py install"
+          docker exec ci_sglang pip install -e "python[dev_hip]"
+
+          docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
+          docker exec -w /human-eval ci_sglang pip install -e .
+
+      - name: Evaluate accuracy (TP=2)
+        timeout-minutes: 20
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 test_moe_eval_accuracy_large.py
+
   mla-test-1-gpu-amd:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
       github.event.pull_request.draft == false
-    runs-on: linux-mi300-gpu-1
+    strategy:
+      matrix:
+        runner: [linux-mi300-gpu-1, linux-mi325-gpu-1]
+    runs-on: ${{matrix.runner}}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -82,12 +129,12 @@ jobs:
           else
             DEVICE_FLAG="--device /dev/dri"
           fi
-          docker pull ghcr.io/saienduri/sglang-aiter-v0.1.1:428
+          docker pull lmsysorg/sglang:v0.4.6.post3-rocm630
           docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
             -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
             --cap-add=SYS_PTRACE -e HF_TOKEN=${{ secrets.AMD_HF_TOKEN }} --security-opt seccomp=unconfined \
             -w /sglang-checkout --name ci_sglang \
-            ghcr.io/saienduri/sglang-aiter-v0.1.1:428
+            lmsysorg/sglang:v0.4.6.post3-rocm630

       - name: Install dependencies
         run: |
@@ -104,10 +151,126 @@ jobs:
         run: |
           docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 test_mla.py

+  performance-test-1-gpu-part-1-amd:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+      github.event.pull_request.draft == false
+    strategy:
+      matrix:
+        runner: [linux-mi300-gpu-1, linux-mi325-gpu-1]
+    runs-on: ${{matrix.runner}}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Setup docker
+        run: |
+          # Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG.
+          if [ -f "/etc/podinfo/gha-render-devices" ]; then
+            DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+          else
+            DEVICE_FLAG="--device /dev/dri"
+          fi
+          docker pull lmsysorg/sglang:v0.4.6.post3-rocm630
+          docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
+            -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
+            --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
+            -w /sglang-checkout --name ci_sglang \
+            lmsysorg/sglang:v0.4.6.post3-rocm630
+
+      - name: Install dependencies
+        run: |
+          docker exec ci_sglang pip install --upgrade pip
+          docker exec ci_sglang pip uninstall sgl-kernel -y || true
+          docker exec -w /sglang-checkout/sgl-kernel ci_sglang bash -c "rm -f pyproject.toml && mv pyproject_rocm.toml pyproject.toml && python3 setup_rocm.py install"
+          docker exec ci_sglang pip install -e "python[dev_hip]"
+
+          docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
+          docker exec -w /human-eval ci_sglang pip install -e .
+
+      - name: Benchmark single latency
+        timeout-minutes: 10
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_bs1_small
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_bs1_default
+
+      - name: Benchmark online latency
+        timeout-minutes: 10
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_online_latency_default
+
+      - name: Benchmark offline throughput
+        timeout-minutes: 10
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_offline_throughput_default
+
+      - name: Benchmark offline throughput (Non-streaming, small batch size)
+        timeout-minutes: 10
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_offline_throughput_non_stream_small_batch_size
+
+      - name: Benchmark online latency (EAGLE)
+        timeout-minutes: 10
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_online_latency_eagle
+
+  performance-test-1-gpu-part-2-amd:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+      github.event.pull_request.draft == false
+    strategy:
+      matrix:
+        runner: [linux-mi300-gpu-1, linux-mi325-gpu-1]
+    runs-on: ${{matrix.runner}}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Setup docker
+        run: |
+          # Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG.
+          if [ -f "/etc/podinfo/gha-render-devices" ]; then
+            DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+          else
+            DEVICE_FLAG="--device /dev/dri"
+          fi
+          docker pull lmsysorg/sglang:v0.4.6.post3-rocm630
+          docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
+            -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
+            --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
+            -w /sglang-checkout --name ci_sglang \
+            lmsysorg/sglang:v0.4.6.post3-rocm630
+
+      - name: Install dependencies
+        run: |
+          docker exec ci_sglang pip install --upgrade pip
+          docker exec ci_sglang pip uninstall sgl-kernel -y || true
+          docker exec -w /sglang-checkout/sgl-kernel ci_sglang bash -c "rm -f pyproject.toml && mv pyproject_rocm.toml pyproject.toml && python3 setup_rocm.py install"
+          docker exec ci_sglang pip install -e "python[dev_hip]"
+
+          docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
+          docker exec -w /human-eval ci_sglang pip install -e .
+
+      - name: Benchmark offline throughput (w/o RadixAttention)
+        timeout-minutes: 10
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_offline_throughput_without_radix_cache
+
+      - name: Benchmark offline throughput (w/ Triton)
+        timeout-minutes: 10
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_offline_throughput_with_triton_attention_backend
+
+      - name: Benchmark offline throughput (w/ FP8)
+        timeout-minutes: 10
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_AMD_CI=1 -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_offline_throughput_default_fp8
+
   bench-test-2-gpu-amd:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
       github.event.pull_request.draft == false
-    runs-on: linux-mi300-gpu-2
+    strategy:
+      matrix:
+        runner: [linux-mi300-gpu-2, linux-mi325-gpu-2]
+    runs-on: ${{matrix.runner}}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -120,12 +283,12 @@ jobs:
           else
             DEVICE_FLAG="--device /dev/dri"
           fi
-          docker pull ghcr.io/saienduri/sglang-aiter-v0.1.1:428
+          docker pull lmsysorg/sglang:v0.4.6.post3-rocm630
           docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
             -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
             --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
             -w /sglang-checkout --name ci_sglang \
-            ghcr.io/saienduri/sglang-aiter-v0.1.1:428
+            lmsysorg/sglang:v0.4.6.post3-rocm630

       - name: Install dependencies
         run: |
@@ -141,15 +304,36 @@ jobs:
           mkdir -p dummy-grok && wget https://sharkpublic.blob.core.windows.net/sharkpublic/sglang/dummy_grok.json -O dummy-grok/config.json
           docker cp ./dummy-grok ci_sglang:/

-      - name: Evaluate Benchmark
+      - name: Benchmark dummy grok (TP=2)
         timeout-minutes: 20
         run: |
           docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 models/test_dummy_grok_models.py

+      - name: Benchmark single latency (TP=2)
+        timeout-minutes: 20
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 -e SGLANG_AMD_CI=1 ci_sglang python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_moe_tp2_bs1
+
+      - name: Benchmark single latency + torch.compile (TP=2)
+        timeout-minutes: 20
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_torch_compile_tp2_bs1
+
+      - name: Benchmark offline throughput (TP=2)
+        timeout-minutes: 20
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 -e SGLANG_AMD_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_default
+
+      - name: Benchmark offline throughput (w/o RadixAttention) (TP=2)
+        timeout-minutes: 20
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 -e SGLANG_AMD_CI=1 ci_sglang python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_without_radix_cache
+
   finish:
     if: always()
     needs: [
-      accuracy-test-1-gpu-amd, mla-test-1-gpu-amd, bench-test-2-gpu-amd
+      accuracy-test-1-gpu-amd, mla-test-1-gpu-amd, bench-test-2-gpu-amd,
+      accuracy-test-2-gpu-amd, performance-test-1-gpu-part-1-amd, performance-test-1-gpu-part-2-amd
     ]
     runs-on: ubuntu-latest
     steps: