
Commit 3ecb4e3

Authored by msinnha1, fzyzcjy, merrymercy, vhain, and qingquansong
rebase sglang to tag v0.4.5.post1 (sgl-project#13)
* Support with_stack and record_shapes in profiler (sgl-project#4740) Co-authored-by: Lianmin Zheng <[email protected]>
* test: reduce `mem_fraction_static` for gemma3 vision test (sgl-project#4840)
* Fix CI tests (sgl-project#4853)
* Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (sgl-project#4855)
* Revert "get the python version from env (sgl-project#4729)" (sgl-project#4863)
* [Feature] add multi-rank support for Lora (sgl-project#4492) Co-authored-by: rudy152 <[email protected]>
* Clean up `import vllm` in quantization/__init__.py (sgl-project#4834)
* Fix wrong variable name when stopping memory profile (sgl-project#4772)
* [Feat] support deepgemm for cmake (sgl-project#4864)
* Make torch compile configurable for biased_grouped_topk (sgl-project#4749)
* update sgl-kernel test ci (sgl-project#4866)
* fix sampling issue (sgl-project#4871)
* bump sgl-kernel 0.0.5.post4 (sgl-project#4768)
* fix sgl-kernel cu118 build (sgl-project#4872)
* [Feature] Support FA3 backend for MLA (sgl-project#4831)
* upgrade sgl-kernel 0.0.5.post4 (sgl-project#4873)
* update torch compile doc (sgl-project#4874)
* bump v0.4.4.post3 (sgl-project#4878)
* Fix BadRequestError wrong arguments and remove openai dependency (sgl-project#4882)
* Improve stack trace of retry errors (sgl-project#4845)
* Tiny fix doc error (sgl-project#4795)
* [Docs] Update DeepGEMM at README.md (sgl-project#4886)
* Update CODEOWNERS (sgl-project#4889)
* Delete test_deep_gemm.py (sgl-project#4891)
* Add deepseek style fused moe group gate selection kernel (sgl-project#4530)
* quick fix: add default for new kernel (sgl-project#4898)
* remove setup for sgl-kernel (sgl-project#4899)
* [Misc] Clean m.def and add Development Tips (sgl-project#4890)
* fix allreduce test (sgl-project#4909)
* Support page size > 1 + eagle (sgl-project#4908)
* Fix retract for page size > 1 (sgl-project#4914)
* [Feature] use pytest for sgl-kernel (sgl-project#4896)
* fix bmm fp8 (sgl-project#4926)
* Fix the timeout for unit-test-2-gpu in pr-test.yml (sgl-project#4927)
* Fix 2-gpu CI test and suppress some warnings (sgl-project#4930)
* [feat] add fa3 in sgl-kernel (sgl-project#4902) Co-authored-by: Sleepcoo <[email protected]>
* Fix sglang frontend's incorrect dependency on torch (sgl-project#4931)
* [Fix] avoid stream sync and torch compile in prefill for fa3 backend (sgl-project#4932)
* cleanup sgl-kernel (sgl-project#4933)
* [Fix] Improve Lora tests and reduce CI runtime (sgl-project#4925)
* Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (sgl-project#4883) Co-authored-by: ch-wan <[email protected]>
* [Fix] Add torch compile for torch.clamp back (sgl-project#4936)
* Fix oom error for large page size (sgl-project#4913) Co-authored-by: Lianmin Zheng <[email protected]>
* [feat] interface for platforms abstraction (sgl-project#4928)
* [Fix] revert clean m.def for cudagraph (sgl-project#4944)
* refactor: multimodal data (sgl-project#4754)
* bump sgl-kernel v0.0.6 (sgl-project#4950)
* [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (sgl-project#4953)
* use fa3 in sgl-kernel (sgl-project#4954)
* Revert PR 4764 & 4813 related to R1 RoPE (sgl-project#4959)
* [Feature] Support DeepEP Low Latency (sgl-project#4767) Co-authored-by: sleepcoo <[email protected]> Co-authored-by: laixinn <[email protected]> Co-authored-by: ch-wan <[email protected]>
* update bench_serving (sgl-project#4958)
* Prevent memory leak of retract_decode when page_size > 1 (sgl-project#4977)
* [VLM RLHF] Take Image input for verl vlm rollout (sgl-project#4915) Signed-off-by: Xinyuan Tong <[email protected]> Co-authored-by: GeLee <[email protected]>
* Large page size aligned hierarchical caching (sgl-project#4581)
* bug fix for hicache host eviction (sgl-project#4989)
* sgl scaled_fp8_quant support output padding (sgl-project#4861)
* Add Eagle Speculative Decoding to FA3 Backend (sgl-project#4951) Co-authored-by: hebiao064 <[email protected]> Co-authored-by: Baizhou Zhang <[email protected]> Co-authored-by: zcnrex <[email protected]>
* Update tokenizer_manager.py (sgl-project#5008)
* [sgl-kernel] per token group quant support COLUMN MAJOR (sgl-project#4817)
* update cutlass tag (sgl-project#5011)
* Feature/revise docs ci (sgl-project#5009)
* fix: fix illegal cuda memory access at fused_moe_kernel (sgl-project#4727) Co-authored-by: yuethe <[email protected]>
* [Build] Support build sgl-kernel with ccache (sgl-project#5020)
* fix deepgemm as well (sgl-project#5030)
* try to fix ci oserror (sgl-project#5024)
* Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5005)
* Small refactor DeepEPMode to clean up code a bit (sgl-project#4992)
* [Fix] fix fa3 build at cu118 (sgl-project#5036)
* Revert "Replace enable_flashinfer_mla argument with attention_backend" (sgl-project#5048)
* bump sgl-kernel v0.0.7 (sgl-project#5046)
* update eagle-3 docs (sgl-project#4796) Co-authored-by: Yifan Zhang <[email protected]>
* Add LlavaLlamaForCausaLM in MultiModal Processors (sgl-project#5039) Co-authored-by: Ravi Theja Desetty <[email protected]>
* Update the retry count (sgl-project#5051)
* upgrade sgl-kernel v0.0.7 (sgl-project#5049)
* [2/3] fix dsv3 awq issue (sgl-project#4625) Co-authored-by: 晟海 <[email protected]> Co-authored-by: laixinn <[email protected]>
* Feature/revise docs ci (sgl-project#5056)
* Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5057)
* [fix] remove `cuda_device_count_stateless` (sgl-project#5060)
* Small refactor DeepEPDispatcher into subclasses (sgl-project#4994)
* Support async DeepEP by splitting into two stages (sgl-project#4995)
* Cleanup unused resources after DeepEP operation (sgl-project#4996)
* Add DeepSeek V3/R1 shared experts fusion (sgl-project#4918)
* [deepep] fix: shared experts are not initialized when shared experts fusion is enabled (sgl-project#5072)
* fix dummy-load deepseekv2 (sgl-project#4535)
* support sgl-kernel on blackwell (sgl-project#5074)
* FA3 Spec Decoding to support top k = 1 and add cuda graph support (sgl-project#5050) Co-authored-by: Qingquan Song <[email protected]> Co-authored-by: Chunan Zeng <[email protected]>
* [Revision] Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5052)
* upgrade transformers 4.51.0 (sgl-project#5088)
* sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5079)
* bump sgl-kernel 0.0.8 (sgl-project#5089)
* python transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5080)
* bump v0.4.4.post4 (sgl-project#5091)
* Fix: Reduce the number of document ci attempts to avoid long ci running (sgl-project#5097) Co-authored-by: shuaills <[email protected]>
* Add Llama4 support (sgl-project#5092) Co-authored-by: Cheng Wan <[email protected]> Co-authored-by: fzyzcjy <[email protected]> Co-authored-by: ispobock <[email protected]>
* Fix refactor error - fp8.py (sgl-project#5106) Co-authored-by: Lianmin Zheng <[email protected]>
* bump v0.4.5 (sgl-project#5117)
* [ci] fix llama4 ci error (sgl-project#5126)
* Refactor and Optimize FA3 Code (sgl-project#5090) Co-authored-by: Qingquan Song <[email protected]>
* Add Llama4 user guide (sgl-project#5133) Co-authored-by: Cheng Wan <[email protected]>
* [Misc] Use pytest.mark.skipif in sgl-kernel test (sgl-project#5137)
* feat: disable grammar restrictions within reasoning sections (sgl-project#4984) Co-authored-by: tianhaoyu <[email protected]> Co-authored-by: DarkSharpness <[email protected]>
* [modelopt] automatically inspect if model is ModelOpt quantized and set quantization method (sgl-project#5145)
* [AMD] Fix missing per_token_group_quant_fp8 for ROCm (sgl-project#5140)
* fix multimodal hash feature (sgl-project#5083)
* Fix run time error in ROCm platform (sgl-project#5147) Co-authored-by: wunhuang <[email protected]> Co-authored-by: root <[email protected]>
* [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (sgl-project#5103)
* Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (sgl-project#4760)
* Use public model for FA3 speculative decode testing (sgl-project#5152)
* Add dummy grok test to amd CI. (sgl-project#5115)
* fix empty_cache error in pt_weights_iterator (sgl-project#5151) Co-authored-by: dangkai.dk <[email protected]>
* Fix torch compile errors (sgl-project#5158)
* Fix loading KV quantization scale; Enable modelopt kv cache (sgl-project#4686) Co-authored-by: qingquansong <[email protected]>
* [PD] Fix unclosed prefill connection warning of mini_lb (sgl-project#5155) Signed-off-by: Shangming Cai <[email protected]>
* Add optimized native kernels in sgl-kernel (sgl-project#5150) Co-authored-by: Chunyuan WU <[email protected]> Co-authored-by: YanbingJiang <[email protected]> Co-authored-by: blzheng <[email protected]>
* [PD] Simplify mini LB (sgl-project#4911) Co-authored-by: Liangsheng Yin <[email protected]>
* Small improvement of native api docs (sgl-project#5139) Co-authored-by: zhaochenyang20 <[email protected]>
* [feat&refactor] Enhance multimodal input support with refactor io_struct (sgl-project#4938) Signed-off-by: Xinyuan Tong <[email protected]>
* Support 2x8xH100 for Llama 4 (sgl-project#5159)
* FP4 weight loading and inference (2/2) (sgl-project#3972)
* Fix multimodal hashing error (sgl-project#5174)
* Tiny disable model that does not work (sgl-project#5175)
* [Bugfix] Fix index out of bounds in local attention with large sequences (sgl-project#5173)
* [Fix] DeepEP Compatibility with Low Latency (sgl-project#5068) Co-authored-by: ch-wan <[email protected]>
* docs: remove the use of Downward API for LWS_WORKER_INDEX (sgl-project#5110) Signed-off-by: Kay Yan <[email protected]>
* feat: add DeepGEMM build warning (sgl-project#5176) Co-authored-by: grimoire <[email protected]>
* fix: use DeepEPDispatcher on CUDA (sgl-project#5180)
* [DeepEP] fix: import buffer error (sgl-project#5179)
* Let `bench_one_batch` support `enable_dp_attention` (sgl-project#4058)
* [Misc] clean up vllm in sgl-kernel test (sgl-project#5189)
* Fix ci test "test_eval_fp8_accuracy" failed (sgl-project#5185) Co-authored-by: wunhuang <[email protected]>
* Optimize topk operation in llama4 (sgl-project#5128)
* Support Llama4 fp8 inference (sgl-project#5194) Co-authored-by: laixinn <[email protected]> Co-authored-by: sleepcoo <[email protected]> Co-authored-by: zhyncs <[email protected]>
* [ci] fix ci test fused_moe op (sgl-project#5102)
* model: support mllama4 (sgl-project#5144)
* update grok test (sgl-project#5171)
* sgl-kernel use cutlass latest version for fp8 blockwise gemm (sgl-project#5207)
* Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5196)
* fix: log warning when disable cuda graph (sgl-project#5209)
* [metrics] Add in queue metrics (sgl-project#4444)
* Fix DeepSeek error when using DeepEP mode (sgl-project#5190)
* reduce moe_align_block_size_kernel small batch mode overhead (sgl-project#5086)
* [PD] Support KV transfer with mooncake (sgl-project#4880) Signed-off-by: Shangming Cai <[email protected]> Co-authored-by: Shangming Cai <[email protected]> Co-authored-by: Xuchun Shang <[email protected]> Co-authored-by: shangmingc <[email protected]>
* [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (sgl-project#5204)
* Update deps for mllama4 (sgl-project#5215)
* Fix deepseek-v3 with torch.compile in PyTorch 2.6. (sgl-project#5213)
* ROCm sgl-kernel: compatible to later torch (sgl-project#5167)
* [Misc] Clean sgl-kernel test (sgl-project#5216)
* Update Makefile / build script to avoid installing incompatible torch dependency (sgl-project#5245)
* Fix torch.compile cacheing (sgl-project#5259) Co-authored-by: zhyncs <[email protected]>
* ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations (sgl-project#5228)
* Optimize attention in llama4 (sgl-project#5127)
* Optimize GPU memory usage in FlashAttentionBackend's strided indexing (sgl-project#5262) Co-authored-by: ch-wan <[email protected]>
* Support `--enable-llama4-multimodal` (sgl-project#5254)
* [fix] fix mrope positions not picked up (sgl-project#5265)
* doc: nested loop code for offline engine (sgl-project#5244)
* fix: examples for token_in_token_out_vlm (sgl-project#5193)
* Fix a 404 link in send_request.ipynb (sgl-project#5280) Signed-off-by: windsonsea <[email protected]>
* fix: enable fp4 compilation on cu128 (sgl-project#5286)
* feat: add cu128 identifier for sgl-kernel (sgl-project#5287)
* chore: relax the torch version restriction for sgl-kernel compilation (sgl-project#5288)
* chore: bump sgl-kernel v0.0.8.post1 (sgl-project#5289)
* [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout (sgl-project#5292)
* [Docs] Supported Model Docs - Major restructuring (sgl-project#5290) Co-authored-by: zhaochenyang20 <[email protected]>
* fix: update update_wheel_index for cu128 (sgl-project#5300)
* [Docs] Remove the older supported docs section (sgl-project#5301)
* remove moe_align_block_size torch.zeros in small batch/expert mode (sgl-project#5298)
* feat: add blackwell Dockerfile (sgl-project#5302)
* feat: add blackwell workflow (sgl-project#5303)
* fix: use fa3 unit test on hopper only (sgl-project#5304)
* misc: update blackwell Dockerfile (sgl-project#5306)
* fix: remove cublas_grouped_gemm (sgl-project#5307)
* fix: update flash attn (sgl-project#5308)
* fix: use deepgemm only on hopper (sgl-project#5310)
* [VLM] Adopt fast image processor by default (sgl-project#5065)
* Adjust ci test threshold (sgl-project#5271)
* Blackwell Cutlass MLA kernel (sgl-project#5142)
* misc: cleanup 3rdparty (sgl-project#5311)
* update variable naming and comments for rocm (sgl-project#5299)
* Fix w8a8_int8 model shared experts fusion load weights error (sgl-project#5120)
* Add flash_attn_varlen_func to sgl-kernel (sgl-project#5315)
* Fix fa3 window size setup (sgl-project#5316)
* chore: bump sgl-kernel v0.0.8.post2 (sgl-project#5317)
* feat: use fa3 mla by default on hopper (sgl-project#5210) Co-authored-by: yundai424 <[email protected]> Co-authored-by: hebiao064 <[email protected]>
* Fix: docs/backend/structured_outputs.ipynb (sgl-project#4884)
* Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… (sgl-project#5321)
* refine fused_moe tuning docs (sgl-project#5294)
* Support server based rollout in Verlengine (sgl-project#4848) Co-authored-by: Jin Pan <[email protected]> Co-authored-by: Chayenne <[email protected]> Co-authored-by: Jinn <[email protected]>
* [Feat] Add sparse attn to sgl-kernel (sgl-project#5327)
* fix: solve cu118 issue for cutlass mla (sgl-project#5331)
* chore: bump sgl-kernel v0.0.8.post3 (sgl-project#5332)
* ci: update release node (sgl-project#5333)
* fix: determine if flashinfer is installed (sgl-project#5336)
* feat: adapt merge_state (sgl-project#5337)
* misc: update sagemaker Dockerfile (sgl-project#5341)
* Fix: Ensure tensors for dist.broadcast match NCCL backend device (sgl-project#5322)
* docs: update adoption and sponsorship list with Oracle (sgl-project#5343)
* chore: upgrade sgl-kernel 0.0.8.post3 (sgl-project#5342)
* Fix typo: infight -> inflight (sgl-project#5357)
* [PD] Add transfer backend abstraction (sgl-project#5328)
* fix MLATokenToKVPoolHost get_size_per_token bug (sgl-project#5161) Co-authored-by: AniZpZ <[email protected]>
* fix sgl-project#5322 (sgl-project#5359)
* feat: update experiment_runner (sgl-project#5360)
* [DeepEP] Reduce routed scaling overhead (sgl-project#5277) Co-authored-by: Cheng Wan <[email protected]>
* Free metadata_buffer_index after transfer finished (sgl-project#5364)
* Fix DeepSeek DP Attention + torch compile (sgl-project#5367) Co-authored-by: ispobock <[email protected]>
* Support for Qwen2.5-VL Model in bitsandbytes Format (sgl-project#5003)
* Fix PD disaggregation bugs (sgl-project#5326)
* [PD Bug] fix MLA get_contiguous_buf_infos error (sgl-project#5384)
* [perf] experimental enhance fp8 per-tensor quant (sgl-project#5370)
* Apply deepseek cuda rope (sgl-project#5385) Co-authored-by: Yineng Zhang <[email protected]>
* apply fused moe gate in ds v3/r1 (sgl-project#5371) Co-authored-by: Yineng Zhang <[email protected]>
* fix: update test config (sgl-project#5392)
* [Fix] Turn off DeepGEMM by default (sgl-project#5263)
* minor clean up of sgl-kernel/CMakeLists.txt (sgl-project#5393)
* Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5368)
* Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5291) Co-authored-by: ximing.wxm <[email protected]>
* [fix/misc] remove duplicate row in deepseek v2 model (sgl-project#5279)
* chore: upgrade DeepGEMM (sgl-project#5395)
* fix: update pr-test-sgl-kernel (sgl-project#5399)
* kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381)
* chore: bump sgl-kernel 0.0.9 (sgl-project#5400)
* chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401)
* Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406)
* Fix bench_serving with random-ids (sgl-project#5214)
* [misc] fix ci flaky case (sgl-project#5352)
* [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (sgl-project#5412)
* Support dynamic connection and TP 16 (sgl-project#5351) Co-authored-by: luoyuan.luo <[email protected]>
* Fix broadcast use cuda device lead to memory capacity unbalanced (sgl-project#5416)
* [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415) Signed-off-by: Shangming Cai <[email protected]> Co-authored-by: ybyang <[email protected]>
* Distinguish bootstrap key only in decode server (sgl-project#5422)
* [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423)
* [minor] cleanup cmakelists.txt (sgl-project#5420)
* bugfix: fix merge_state_v2 cuda graph (sgl-project#5419)
* chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430)
* fix: solve release issue (sgl-project#5434)
* BLackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431)
* feat: update model_specific_adjustment (sgl-project#5344) Co-authored-by: hebiao064 <[email protected]>
* chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436)
* Fix ignore_eos parameter when loading a chat template (sgl-project#5264)
* add attention backend supporting matrix in the doc (sgl-project#5211) Co-authored-by: Stefan He <[email protected]>
* Support BNB quantization for llama/mllama (sgl-project#5038) Co-authored-by: Yuhao Yang <[email protected]>
* [Docs] Update start/install.md (sgl-project#5398)
* [Minor] Move torch.compile patch to a better place (sgl-project#5397)
* [Bug fix] need record start time in pd mode (sgl-project#5425)
* Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113)
* chore: bump v0.4.5.post1 (sgl-project#5445)
* Revert "[SW-226289] rebase sglang to tag v0.4.5 (sgl-project#12)". This reverts commit 0eac714.

--------

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Juwan Yoo <[email protected]>
Co-authored-by: Qingquan Song <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: chaobo jia <[email protected]>
Co-authored-by: rudy152 <[email protected]>
Co-authored-by: Fr4nk1in <[email protected]>
Co-authored-by: yinfan98 <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Adarsh Shirawalmath <[email protected]>
Co-authored-by: Sleepcoo <[email protected]>
Co-authored-by: SEPLOS <[email protected]>
Co-authored-by: ch-wan <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: laixinn <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: GeLee <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: hebiao064 <[email protected]>
Co-authored-by: zcnrex <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: renxin <[email protected]>
Co-authored-by: saltyfish66 <[email protected]>
Co-authored-by: yuethe <[email protected]>
Co-authored-by: simveit <[email protected]>
Co-authored-by: Yifan Zhang <[email protected]>
Co-authored-by: Ravi Theja <[email protected]>
Co-authored-by: Ravi Theja Desetty <[email protected]>
Co-authored-by: AniZpZ <[email protected]>
Co-authored-by: 晟海 <[email protected]>
Co-authored-by: Tommy Yang <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: tianhaoyu <[email protected]>
Co-authored-by: DarkSharpness <[email protected]>
Co-authored-by: Yun Dai <[email protected]>
Co-authored-by: Hubert Lu <[email protected]>
Co-authored-by: huangtingwei <[email protected]>
Co-authored-by: kk <[email protected]>
Co-authored-by: wunhuang <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: saienduri <[email protected]>
Co-authored-by: DangKai <[email protected]>
Co-authored-by: dangkai.dk <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: Ma Mingfei <[email protected]>
Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: blzheng <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: zhaochenyang20 <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: Kay Yan <[email protected]>
Co-authored-by: grimoire <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: Zhaoyang Hao <[email protected]>
Co-authored-by: Teng Ma <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>
Co-authored-by: Xuchun Shang <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Elfie Guo <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Yusong Gao <[email protected]>
Co-authored-by: Zhaoyi Li <[email protected]>
Co-authored-by: lambert0312 <[email protected]>
Co-authored-by: tianlian yi <[email protected]>
Co-authored-by: Jin Pan <[email protected]>
Co-authored-by: Jinn <[email protected]>
Co-authored-by: yulei <[email protected]>
Co-authored-by: Yongtong Wu <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: Yangcheng Li <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: luoyuan.luo <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: mRSun15 <[email protected]>
Co-authored-by: ryang <[email protected]>
Co-authored-by: Yuhao Yang <[email protected]>
1 parent: 0eac714 · commit: 3ecb4e3

File tree: 254 files changed (+19973 / -3716 lines)


.github/workflows/pr-test-amd.yml

Lines changed: 52 additions & 4 deletions
@@ -7,12 +7,14 @@ on:
       - "python/sglang/**"
       - "test/**"
       - "sgl-kernel/**"
+      - ".github/workflows/pr-test-amd.yml"
   pull_request:
     branches: [ main ]
     paths:
       - "python/sglang/**"
       - "test/**"
       - "sgl-kernel/**"
+      - ".github/workflows/pr-test-amd.yml"
   workflow_dispatch:

 concurrency:
@@ -36,12 +38,12 @@ jobs:
           else
             DEVICE_FLAG="--device /dev/dri"
           fi
-          docker pull lmsysorg/sglang:v0.4.3.post4-rocm630
+          docker pull lmsysorg/sglang:v0.4.5-rocm630
           docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
             -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
             --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
             -w /sglang-checkout --name ci_sglang \
-            lmsysorg/sglang:v0.4.3.post4-rocm630
+            lmsysorg/sglang:v0.4.5-rocm630

       - name: Install dependencies
         run: |
@@ -53,6 +55,10 @@ jobs:
           docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
           docker exec -w /human-eval ci_sglang pip install -e .

+          docker exec -w / ci_sglang mkdir -p /dummy-grok
+          mkdir -p dummy-grok && wget https://sharkpublic.blob.core.windows.net/sharkpublic/sglang/dummy_grok.json -O dummy-grok/config.json
+          docker cp ./dummy-grok ci_sglang:/
+
       - name: Evaluate Accuracy
         timeout-minutes: 20
         run: |
@@ -76,12 +82,12 @@ jobs:
           else
             DEVICE_FLAG="--device /dev/dri"
           fi
-          docker pull lmsysorg/sglang:v0.4.3.post4-rocm630
+          docker pull lmsysorg/sglang:v0.4.5-rocm630
           docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
             -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
             --cap-add=SYS_PTRACE -e HF_TOKEN=${{ secrets.AMD_HF_TOKEN }} --security-opt seccomp=unconfined \
             -w /sglang-checkout --name ci_sglang \
-            lmsysorg/sglang:v0.4.3.post4-rocm630
+            lmsysorg/sglang:v0.4.5-rocm630

       - name: Install dependencies
         run: |
@@ -98,6 +104,48 @@ jobs:
         run: |
           docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 test_mla.py

+  bench-test-2-gpu-amd:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+      github.event.pull_request.draft == false
+    runs-on: linux-mi300-gpu-2
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Setup docker
+        run: |
+          # Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG.
+          if [ -f "/etc/podinfo/gha-render-devices" ]; then
+            DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+          else
+            DEVICE_FLAG="--device /dev/dri"
+          fi
+          docker pull lmsysorg/sglang:v0.4.5-rocm630
+          docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
+            -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
+            --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
+            -w /sglang-checkout --name ci_sglang \
+            lmsysorg/sglang:v0.4.5-rocm630
+
+      - name: Install dependencies
+        run: |
+          docker exec ci_sglang pip install --upgrade pip
+          docker exec ci_sglang pip uninstall sgl-kernel -y || true
+          docker exec -w /sglang-checkout/sgl-kernel ci_sglang bash -c "rm -f pyproject.toml && mv pyproject_rocm.toml pyproject.toml && python3 setup_rocm.py install"
+          docker exec ci_sglang pip install -e "python[dev_hip]"
+
+          docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
+          docker exec -w /human-eval ci_sglang pip install -e .
+
+          docker exec -w / ci_sglang mkdir -p /dummy-grok
+          mkdir -p dummy-grok && wget https://sharkpublic.blob.core.windows.net/sharkpublic/sglang/dummy_grok.json -O dummy-grok/config.json
+          docker cp ./dummy-grok ci_sglang:/
+
+      - name: Evaluate Benchmark
+        timeout-minutes: 20
+        run: |
+          docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 models/test_dummy_grok_models.py
+
   finish:
     if: always()
     needs: [

.github/workflows/pr-test-sgl-kernel.yml

Lines changed: 14 additions & 7 deletions
@@ -35,9 +35,14 @@ jobs:
     runs-on: sgl-kernel-build-node
     strategy:
       matrix:
-        python-version: ['3.9']
-        cuda-version: ['12.4']
-
+        include:
+          - python-version: '3.9'
+            cuda-version: '11.8'
+          - python-version: '3.9'
+            cuda-version: '12.4'
+          - python-version: '3.9'
+            cuda-version: '12.8'
+
     name: Build Wheel (CUDA ${{ matrix.cuda-version }})
     steps:
       - name: Cleanup
         run: |
@@ -52,13 +57,15 @@ jobs:
         with:
           python-version: ${{ matrix.python-version }}

-      - name: Build wheels for Python ${{ matrix.python-version }} and CUDA ${{ matrix.cuda-version }}
+      - name: Build wheel for Python ${{ matrix.python-version }} and CUDA ${{ matrix.cuda-version }}
+        if: github.event_name != 'push' || (matrix.cuda-version != '11.8' && matrix.cuda-version != '12.8')
        run: |
          cd sgl-kernel
          chmod +x ./build.sh
          ./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}"

-      - name: Upload artifacts
+      - name: Upload artifacts (only for CUDA 12.4)
+        if: ${{ matrix.cuda-version == '12.4' }}
        uses: actions/upload-artifact@v4
        with:
          name: wheel-python${{ matrix.python-version }}-cuda${{ matrix.cuda-version }}
@@ -81,7 +88,7 @@ jobs:
       - name: Install
         run: |
           bash scripts/ci_install_dependency.sh
-          pip3 install torch==2.5.1 && pip3 install pytest && pip3 install vllm==0.7.2
+          pip3 install torch==2.5.1 && pip3 install pytest
           pip3 uninstall sgl-kernel -y || true
           pip3 install sgl-kernel/dist/*whl --force-reinstall --no-deps
           pip3 list | grep sgl-kernel
@@ -128,7 +135,7 @@ jobs:
           pip3 uninstall sgl-kernel -y

   finish:
-    needs: [unit-test, mla-test, lint]
+    needs: [unit-test, mla-test, lint, build-wheels]
     runs-on: ubuntu-latest
     steps:
       - name: Check all dependent job statuses
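
Taken together, the two new `if` guards mean the three explicit matrix entries behave differently by trigger: pull requests build all three wheels, push events build only the CUDA 12.4 wheel, and only the CUDA 12.4 wheel is uploaded as an artifact. A sketch of the post-change `strategy` block, reconstructed from the hunks above (surrounding keys assumed unchanged):

# Reconstructed post-change matrix (from the diff above); each `include`
# entry is one explicit Python/CUDA combination rather than a cross-product.
strategy:
  matrix:
    include:
      - python-version: '3.9'
        cuda-version: '11.8'
      - python-version: '3.9'
        cuda-version: '12.4'
      - python-version: '3.9'
        cuda-version: '12.8'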

.github/workflows/pr-test.yml

Lines changed: 0 additions & 2 deletions
@@ -187,8 +187,6 @@ jobs:
         timeout-minutes: 10
         run: |
           cd test/srt
-          USE_VLLM_CUSTOM_ALLREDUCE=1 python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_moe_tp2_bs1
-
           python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_moe_tp2_bs1

       - name: Benchmark single latency + torch.compile (TP=2)
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+name: Build Blackwell Docker Image
+
+on:
+  workflow_dispatch:
+  schedule:
+    - cron: '0 0 * * *'
+
+jobs:
+  build-dev:
+    if: ${{ github.repository == 'sgl-project/sglang' }}
+    runs-on: ubuntu-22.04
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Free disk space
+        uses: jlumbroso/free-disk-space@main
+        with:
+          tool-cache: false
+          docker-images: false
+          android: true
+          dotnet: true
+          haskell: true
+          large-packages: true
+          swap-storage: false
+
+      - name: Login to Docker Hub
+        uses: docker/login-action@v2
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Build and Push Blackwell Image
+        run: |
+          docker build . -f docker/Dockerfile.blackwell -t lmsysorg/sglang:blackwell --no-cache
+          docker push lmsysorg/sglang:blackwell
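
Because this new workflow declares `workflow_dispatch` alongside the nightly cron, the image build can also be started by hand. A sketch using the GitHub CLI, assuming an authenticated `gh`; the workflow is addressed by its display name here since the file's path is not shown in this view:

# Hypothetical manual trigger of the nightly Blackwell image build.
gh workflow run "Build Blackwell Docker Image" --repo sgl-project/sglang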

.github/workflows/release-pypi-kernel.yml

Lines changed: 0 additions & 44 deletions
This file was deleted.

.github/workflows/release-whl-kernel-cu128.yml renamed to .github/workflows/release-whl-kernel-cu118.yml

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-name: Release SGLang Kernel Wheel (cu128)
+name: Release SGLang Kernel Wheel (cu118)

 on:
   workflow_dispatch:
@@ -14,11 +14,11 @@ on:
 jobs:
   build-wheels:
     if: github.repository == 'sgl-project/sglang'
-    runs-on: ubuntu-latest
+    runs-on: sgl-kernel-release-node
     strategy:
       matrix:
         python-version: ['3.9']
-        cuda-version: ['12.8']
+        cuda-version: ['11.8']

     steps:
       - uses: actions/checkout@v4
@@ -80,7 +80,7 @@ jobs:
           WHL_TOKEN: ${{ secrets.WHL_TOKEN }}

       - name: Update wheel index
-        run: python3 scripts/update_kernel_whl_index.py --cuda 128
+        run: python3 scripts/update_kernel_whl_index.py

       - name: Push wheel index
         run: |

.github/workflows/release-whl-kernel.yml

Lines changed: 45 additions & 11 deletions
@@ -1,25 +1,59 @@
-name: Release SGLang Kernel Wheel (cu118)
+name: Release SGLang Kernels

 on:
-  workflow_dispatch:
-    inputs:
-      tag_name:
-        type: string
   push:
     branches:
       - main
     paths:
       - sgl-kernel/python/sgl_kernel/version.py
+  workflow_dispatch:
+    inputs:
+      tag_name:
+        type: string
+        required: false
+
+concurrency:
+  group: release-sglang-kernels-${{ github.ref }}
+  cancel-in-progress: true

 jobs:
-  build-wheels:
+  build-cu124:
     if: github.repository == 'sgl-project/sglang'
-    runs-on: ubuntu-latest
+    runs-on: sgl-kernel-release-node
     strategy:
       matrix:
         python-version: ['3.9']
-        cuda-version: ['11.8']
+        cuda-version: ['12.4']
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          submodules: 'recursive'
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}

+      - name: Build wheels
+        run: |
+          cd sgl-kernel
+          chmod +x ./build.sh
+          ./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}"
+
+      - name: Upload to PyPI
+        working-directory: sgl-kernel
+        run: |
+          pip install twine
+          python3 -m twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}
+
+  build-cu128:
+    if: github.repository == 'sgl-project/sglang'
+    needs: build-cu124
+    runs-on: sgl-kernel-release-node
+    strategy:
+      matrix:
+        python-version: ['3.9']
+        cuda-version: ['12.8']
     steps:
       - uses: actions/checkout@v4
         with:
@@ -30,7 +64,7 @@ jobs:
         with:
           python-version: ${{ matrix.python-version }}

-      - name: Build wheels for Python ${{ matrix.python-version }} and CUDA ${{ matrix.cuda-version }}
+      - name: Build wheels
         run: |
           cd sgl-kernel
           chmod +x ./build.sh
@@ -43,7 +77,7 @@ jobs:
           path: sgl-kernel/dist/*

   release:
-    needs: build-wheels
+    needs: build-cu128
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
@@ -80,7 +114,7 @@ jobs:
           WHL_TOKEN: ${{ secrets.WHL_TOKEN }}

       - name: Update wheel index
-        run: python3 scripts/update_kernel_whl_index.py
+        run: python3 scripts/update_kernel_whl_index.py --cuda 128

       - name: Push wheel index
         run: |
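
Pieced together from the hunks, the renamed workflow now chains three jobs: the CUDA 12.4 wheel builds first and is pushed straight to PyPI, the CUDA 12.8 wheel builds next and is uploaded as workflow artifacts, and the release job runs last and updates the cu128 wheel index. A skeleton of the resulting job graph (a sketch only; step bodies elided and unchanged sections assumed intact):

# Sketch of the post-change job ordering, not the full file.
jobs:
  build-cu124:            # builds the CUDA 12.4 wheel, uploads it to PyPI via twine
    if: github.repository == 'sgl-project/sglang'
    runs-on: sgl-kernel-release-node
  build-cu128:            # builds the CUDA 12.8 wheel, uploads workflow artifacts
    needs: build-cu124
    runs-on: sgl-kernel-release-node
  release:                # creates the release and updates the cu128 wheel index
    needs: build-cu128
    runs-on: ubuntu-latest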

.github/workflows/vllm-dependency-test.yml

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ jobs:
       run: |
         bash scripts/ci_install_dependency.sh
         pip install "vllm>=0.6.4.post1,<=0.7.2"
+        pip install "bitsandbytes>=0.44.0"

     - name: Run VLLM dependency tests
       timeout-minutes: 60

.gitmodules

Lines changed: 0 additions & 4 deletions
@@ -1,4 +0,0 @@
-[submodule "sgl-kernel/3rdparty/flashinfer"]
-	path = sgl-kernel/3rdparty/flashinfer
-	url = https://github.com/sgl-project/flashinfer.git
-	branch = sgl-kernel

README.md

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s

 ## Adoption and Sponsorship
 The project has been deployed to large-scale production, generating trillions of tokens every day.
-It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, Iflytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
+It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, Iflytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.

 <img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>
