
Commit a19e2e2

PullRequest: 52 sgl_20250610_sync_tag047
Merge branch 'sgl_20250610_sync_tag047' of [email protected]:Theta/SGLang.git into main

https://code.alipay.com/Theta/SGLang/pull_requests/52

Reviewed-by: 剑川 <[email protected]>

* [Bugfix] Fix slice operation when chunk size mismatch (sgl-project#6697)
* [Bugfix] Fix ChatCompletion endpoint of mini_lb when stream is set (sgl-project#6703)
* [CI] Fix setup of disaggregation with different tp (sgl-project#6706)
* [PD] Remove Unnecessary Exception Handling for FastQueue.get() (sgl-project#6712)
* Fuse routed_scaling_factor in DeepSeek (sgl-project#6710)
* Overlap two kernels in DeepSeek with communication (sgl-project#6711)
* Minor refactor two-batch overlap (sgl-project#6682)
* Speed up when having padding tokens two-batch overlap (sgl-project#6668)
* [Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell (sgl-project#6479)
* Fix LoRA bench (sgl-project#6719)
* temp
* Fix PP for Qwen3 MoE (sgl-project#6709)
* [feat] triton kernel for get_last_loc (sgl-project#6676)
* [fix] more mem for draft_extend cuda_graph (sgl-project#6726)
* [PD] bug fix: Update status if nixl receiver sends a dummy req. (sgl-project#6720)
* Tune memory arguments on B200 (sgl-project#6718)
* Add DeepSeek-R1-0528 function call chat template (sgl-project#6725)
* refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor `parse_streaming_increment` (sgl-project#6715)
* Add draft extend CUDA graph for Triton backend (sgl-project#6705)
* refactor apply_w8a8_block_fp8_linear in fp (sgl-project#6545)
* [PD] Support completion endpoint (sgl-project#6729)
* PD Rust LB (PO2) (sgl-project#6437)
* Super tiny enable sole usage of expert distribution metrics and update doc (sgl-project#6680)
* Support picking variants of EPLB algorithms (sgl-project#6728)
* Support tuning DeepEP configs (sgl-project#6742)
* [test] add ut and bm for get_last_loc (sgl-project#6746)
* Fix mem_fraction_static for AMD CI (sgl-project#6748)
* [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight (sgl-project#6265)
* Improve EPLB logical to physical dispatch map (sgl-project#6727)
* Update DeepSeek-R1-0528 function call chat template (sgl-project#6765)
* [PD] Optimize time out logic and add env var doc for mooncake (sgl-project#6761)
* Fix aiohttp 'Chunk too big' in bench_serving (sgl-project#6737)
* Support sliding window in triton backend (sgl-project#6509)
* Fix shared experts fusion error (sgl-project#6289)
* Fix one bug in the grouped-gemm triton kernel (sgl-project#6772)
* update llama4 chat template and pythonic parser (sgl-project#6679)
* feat(tool call): Enhance Llama32Detector for improved JSON parsing in non-stream (sgl-project#6784)
* Support token-level quantization for EP MoE (sgl-project#6782)
* Temporarily lower mmlu threshold for triton sliding window backend (sgl-project#6785)
* ci: relax test_function_call_required (sgl-project#6786)
* Add intel_amx backend for Radix Attention for CPU (sgl-project#6408)
* Fix incorrect LoRA weight loading for fused gate_up_proj (sgl-project#6734)
* fix(PD-disaggregation): Can not get local ip (sgl-project#6792)
* [FIX] mmmu bench serving result display error (sgl-project#6525) (sgl-project#6791)
* Bump torch to 2.7.0 (sgl-project#6788)
* chore: bump sgl-kernel v0.1.5 (sgl-project#6794)
* Improve profiler and integrate profiler in bench_one_batch_server (sgl-project#6787)
* chore: upgrade sgl-kernel v0.1.5 (sgl-project#6795)
* [Minor] Always append newline after image token when parsing chat message (sgl-project#6797)
* Update CI tests for Llama4 models (sgl-project#6421)
* [Feat] Enable PDL automatically on Hopper architecture (sgl-project#5981)
* chore: update blackwell docker (sgl-project#6800)
* misc: cache is_hopper_arch (sgl-project#6799)
* Remove contiguous before Flashinfer groupwise fp8 gemm (sgl-project#6804)
* Correctly abort the failed grammar requests & Improve the handling of abort (sgl-project#6803)
* [EP] Add cuda kernel for moe_ep_pre_reorder (sgl-project#6699)
* Add draft extend CUDA graph for flashinfer backend (sgl-project#6805)
* Refactor CustomOp to avoid confusing bugs (sgl-project#5382)
* Tiny log prefill time (sgl-project#6780)
* Tiny fix EPLB assertion about rebalancing period and recorder window size (sgl-project#6813)
* Add simple utility to dump tensors for debugging (sgl-project#6815)
* Fix profiles do not have consistent names (sgl-project#6811)
* Speed up rebalancing when using non-static dispatch algorithms (sgl-project#6812)
* [1/2] Add Kernel support for Cutlass based Fused FP4 MoE (sgl-project#6093)
* [Router] Fix k8s Service Discovery (sgl-project#6766)
* Add CPU optimized kernels for topk and rope fusions (sgl-project#6456)
* fix new_page_count_next_decode (sgl-project#6671)
* Fix wrong weight reference in dynamic EPLB (sgl-project#6818)
* Minor add metrics to expert location updater (sgl-project#6816)
* [Refactor] Rename `n_share_experts_fusion` as `num_fused_shared_experts` (sgl-project#6735)
* [FEAT] Add transformers backend support (sgl-project#5929)
* [fix] recover auto-dispatch for rmsnorm and rope (sgl-project#6745)
* fix ep_moe_reorder kernel bugs (sgl-project#6858)
* [Refactor] Multimodal data processing for VLM (sgl-project#6659)
* Decoder-only Scoring API (sgl-project#6460)
* feat: add dp-rank to KV events (sgl-project#6852)
* Set `num_fused_shared_experts` as `num_shared_experts` when shared_experts fusion is not disabled (sgl-project#6736)
* Fix one missing arg in DeepEP (sgl-project#6878)
* Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug (sgl-project#6861)
* support 1 shot allreduce in 1-node and 2-node using mscclpp (sgl-project#6277)
* Fix Qwen3MoE missing token padding optimization (sgl-project#6820)
* Tiny update error hints (sgl-project#6846)
* Support layerwise rebalancing experts (sgl-project#6851)
* Tiny allow profiler API to auto create directory (sgl-project#6865)
* Support Blackwell DeepEP docker images (sgl-project#6868)
* [EP] Add cuda kernel for moe_ep_post_reorder (sgl-project#6837)
* [theta]merge 0605
* oai: fix openAI client error with single request via batch api (sgl-project#6170)
* [PD] Fix potential perf spike caused by tracker gc and optimize doc (sgl-project#6764)
* Use deepgemm instead of triton for fused_qkv_a_proj_with_mqa (sgl-project#6890)
* [CUTLASS-FP4-MOE] Introduce CutlassMoEParams class for easy initialization of Cutlass Grouped Gemms Metadata (sgl-project#6887)
* bugfix(OAI): Fix image_data processing for jinja chat templates (sgl-project#6877)
* [CPU] enable CI for PRs, add Dockerfile and auto build task (sgl-project#6458)
* AITER backend extension and workload optimizations (sgl-project#6838)
* [theta]merge
* [theta]merge
* [Feature] Support Flashinfer fmha on Blackwell (sgl-project#6930)
* Fix a bug in abort & Improve docstrings for abort (sgl-project#6931)
* Tiny support customize DeepEP max dispatch tokens per rank (sgl-project#6934)
* Sync the changes on cuda graph runners (sgl-project#6932)
* [PD] Optimize transfer queue forward logic for dummy rank (sgl-project#6922)
* [Refactor] image data process in bench_serving (sgl-project#6879)
* [fix] logical_to_all_physical_map index 256 is out of bounds in EP parallel (sgl-project#6767)
* Add triton fused moe kernel config for E=257 on B200 (sgl-project#6939)
* [sgl-kernel] update deepgemm (sgl-project#6942)
* chore: bump sgl-kernel v0.1.6 (sgl-project#6943)
* Minor compile fused topk (sgl-project#6944)
* [Bugfix] pipeline parallelism and Eagle Qwen2 (sgl-project#6910)
* Tiny re-introduce profile id logging (sgl-project#6912)
* Add triton version as a fused_moe_triton config search key to avoid performance decrease in different Triton versions (sgl-project#5955)
* reduce torch.zeros overhead in moe align block size kernel (sgl-project#6369)
* chore: upgrade sgl-kernel v0.1.6 (sgl-project#6945)
* add fbgemm moe grouped gemm kernel benchmark (sgl-project#6924)
* [Docker] Add docker file for SGL Router (sgl-project#6915)
* Disabling mixed chunked prefill when eagle is enabled (sgl-project#6874)
* Add canary for EPLB rebalancing (sgl-project#6895)
* Refactor global_server_args_dict (sgl-project#6866)
* Fuse routed scaling factor in topk_reduce kernel (sgl-project#6220)
* Update server timeout time in AMD CI (sgl-project#6953)
* [misc] add is_cpu() (sgl-project#6950)
* Add H20 fused MoE kernel tuning configs for DeepSeek-R1/V3 (sgl-project#6885)
* Add a CUDA kernel for fusing mapping and weighted sum for MoE (sgl-project#6916)
* chore: bump sgl-kernel v0.1.6.post1 (sgl-project#6955)
* chore: upgrade sgl-kernel v0.1.6.post1 (sgl-project#6957)
* [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model (sgl-project#6853)
* Revert "Fuse routed scaling factor in topk_reduce kernel (sgl-project#6220)" (sgl-project#6968)
* [AMD] Add more tests to per-commit-amd (sgl-project#6926)
* chore: bump sgl-kernel v0.1.7 (sgl-project#6963)
* Slightly improve the sampler to skip unnecessary steps (sgl-project#6956)
* rebase h20 fused_moe config (sgl-project#6966)
* Fix CI and triton moe Configs (sgl-project#6974)
* Remove unnecessary kernels of num_token_non_padded (sgl-project#6965)
* Extend cuda graph capture bs for B200 (sgl-project#6937)
* Fuse routed scaling factor in deepseek (sgl-project#6970)
* Sync cuda graph runners (sgl-project#6976)
* Fix draft extend ut stability with flush cache (sgl-project#6979)
* Fix triton sliding window test case (sgl-project#6981)
* Fix expert distribution dumping causes OOM (sgl-project#6967)
* Minor remove one kernel for DeepSeek (sgl-project#6977)
* [perf][sgl-kernel] extend cutlass_mla_decode to support num_head < 128 (sgl-project#6929)
* Enable more unit tests for AMD CI (sgl-project#6983)
* Use torch.compile to fuse flash attention decode metadata preparation (sgl-project#6973)
* Eliminate stream sync to speed up LoRA batch init (sgl-project#6960)
* support qwen3 embedding (sgl-project#6990)
* Fix torch profiler bugs for bench_offline_throughput.py (sgl-project#6557)
* chore: upgrade flashinfer v0.2.6.post1 jit (sgl-project#6958)
* cleanup tmp dir (sgl-project#7007)
* chore: update pr test xeon (sgl-project#7008)
* Fix cutlass MLA gets almost zero accuracy (sgl-project#6998)
* Update amd nightly models CI (sgl-project#6992)
* feat: add direct routing strategy to DP worker (sgl-project#6884)
* Fallback to lower triton version for unfound fused moe configs (sgl-project#7013)
* Fix torchvision version for Blackwell (sgl-project#7015)
* Simplify prepare_extend_after_decode (sgl-project#6987)
* Migrate to assertEqual (sgl-project#6741)
* Fix torch version in blackwell dockerfile (sgl-project#7017)
* chore: update pr test xeon (sgl-project#7018)
* Update default settings for blackwell (sgl-project#7023)
* Support both approximate and exact expert distribution collection (sgl-project#6964)
* Add decode req pool (sgl-project#6980)
* [theta]merge 0610
* [theta]merge 0610
* [CI] Add CI workflow for sgl-router docker build (sgl-project#7027)
* Fix fused_moe triton configs (sgl-project#7029)
* CPU: map changes from developing branch in sgl-kernel (sgl-project#6833)
* chore: bump v0.4.7 (sgl-project#7038)
* Update README.md (sgl-project#7040)
1 parent 3f6db06 commit a19e2e2

File tree

462 files changed: +26024 −3128 lines

(Large commits have some content hidden by default; some file names below are not shown.)

.github/workflows/pr-test-amd.yml

Lines changed: 30 additions & 4 deletions
```diff
@@ -72,7 +72,7 @@ jobs:
       - name: Evaluate accuracy (TP=2)
         timeout-minutes: 30
         run: |
-          bash scripts/amd_ci_exec.sh python3 test_moe_eval_accuracy_large.py
+          bash scripts/amd_ci_exec.sh -e SGLANG_USE_AITER=0 python3 test_moe_eval_accuracy_large.py
 
   mla-test-1-gpu-amd:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
@@ -220,8 +220,10 @@ jobs:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
       github.event.pull_request.draft == false
     strategy:
+      fail-fast: false
       matrix:
         runner: [linux-mi300-gpu-1, linux-mi325-gpu-1]
+        part: [0, 1, 2, 3, 4, 5]
     runs-on: ${{matrix.runner}}
     steps:
       - name: Checkout code
@@ -238,7 +240,7 @@ jobs:
       - name: Run test
         timeout-minutes: 40
         run: |
-          bash scripts/amd_ci_exec.sh python3 run_suite.py --suite per-commit-amd
+          bash scripts/amd_ci_exec.sh python3 run_suite.py --suite per-commit-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 6
 
   unit-test-backend-2-gpu-amd:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
@@ -264,6 +266,30 @@ jobs:
       run: |
         bash scripts/amd_ci_exec.sh python3 run_suite.py --suite per-commit-2-gpu-amd
 
+  unit-test-backend-4-gpu-amd:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+      github.event.pull_request.draft == false
+    strategy:
+      matrix:
+        runner: [linux-mi300-gpu-4]
+    runs-on: ${{matrix.runner}}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Start CI container
+        run: bash scripts/amd_ci_start_container.sh
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies
+        run: bash scripts/amd_ci_install_dependency.sh
+
+      - name: Run test
+        timeout-minutes: 40
+        run: |
+          bash scripts/amd_ci_exec.sh python3 run_suite.py --suite per-commit-4-gpu-amd
+
   unit-test-backend-8-gpu-amd:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
       github.event.pull_request.draft == false
@@ -284,9 +310,9 @@ jobs:
       run: bash scripts/amd_ci_install_dependency.sh
 
       - name: Run test
-        timeout-minutes: 40
+        timeout-minutes: 60
         run: |
-          bash scripts/amd_ci_exec.sh python3 run_suite.py --suite per-commit-8-gpu-amd
+          bash scripts/amd_ci_exec.sh python3 run_suite.py --suite per-commit-8-gpu-amd --timeout-per-file 3600
 
   finish:
     if: always()
```
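The interesting change here is the suite sharding: `part: [0, 1, 2, 3, 4, 5]` in the matrix plus `--auto-partition-id ${{ matrix.part }} --auto-partition-size 6` splits the per-commit suite across six runners per GPU type. A minimal sketch of how such deterministic partitioning can work (hypothetical helper; the real logic lives in `run_suite.py` and may instead balance by estimated runtime):

```python
# Hypothetical sketch of CI suite auto-partitioning, not the actual
# run_suite.py implementation. Round-robin over a stable ordering so
# every job computes the same split without any coordination.
def auto_partition(files: list[str], part_id: int, num_parts: int) -> list[str]:
    assert 0 <= part_id < num_parts
    return [f for i, f in enumerate(sorted(files)) if i % num_parts == part_id]

suite = ["test_moe.py", "test_lora.py", "test_mla.py", "test_eagle.py"]  # made-up names
for part in range(6):
    print(part, auto_partition(suite, part, 6))
```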

.github/workflows/pr-test-sgl-kernel.yml

Lines changed: 2 additions & 3 deletions
```diff
@@ -36,8 +36,6 @@ jobs:
     strategy:
       matrix:
         include:
-          - python-version: '3.9'
-            cuda-version: '11.8'
           - python-version: '3.9'
             cuda-version: '12.4'
           - python-version: '3.9'
@@ -88,7 +86,7 @@ jobs:
       - name: Install
         run: |
           bash scripts/ci_install_dependency.sh
-          pip3 install torch==2.6.0 torchvision && pip3 install pytest
+          pip3 install torch==2.7.1 torchvision && pip3 install pytest
           pip3 uninstall sgl-kernel -y || true
           pip3 install sgl-kernel/dist/*whl --force-reinstall --no-deps
           pip3 list | grep sgl-kernel
@@ -120,6 +118,7 @@ jobs:
       - name: Install
         run: |
           bash scripts/ci_install_dependency.sh
+          pip3 install torch==2.7.1 torchvision
           pip3 uninstall sgl-kernel -y || true
           pip3 install sgl-kernel/dist/*whl --force-reinstall --no-deps
           pip3 list | grep sgl-kernel
```

.github/workflows/pr-test-xeon.yml

Lines changed: 81 additions & 0 deletions
New file:

```yaml
name: PR Test (Xeon)

on:
  pull_request:
    branches:
      - main
  workflow_dispatch:

concurrency:
  group: pr-test-xeon
  cancel-in-progress: false

jobs:
  build-test:
    if: github.event_name == 'pull_request'
    runs-on: sgl-kernel-release-node
    strategy:
      matrix:
        build_type: ['all']
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Build and Push
        run: |
          version=$(cat python/sglang/version.py | cut -d'"' -f2)
          tag=v${version}-xeon

          docker build . -f docker/Dockerfile.xeon -t sglang_xeon --no-cache

      - name: Run container
        run: |
          docker run -dt \
            -v ${{ github.workspace }}:/sglang-checkout/ --ipc=host \
            --name ci_sglang_xeon \
            sglang_xeon

      - name: Install Dependency
        timeout-minutes: 20
        run: |
          docker exec ci_sglang_xeon bash -c "python3 -m pip install --upgrade pip"
          docker exec ci_sglang_xeon pip uninstall sgl-kernel -y || true
          docker exec -w /sglang-checkout/sgl-kernel ci_sglang_xeon bash -c "cp pyproject_cpu.toml pyproject.toml && pip install -v ."
          docker exec -w /sglang-checkout/ ci_sglang_xeon bash -c "pip install -e "python[all_cpu]""
          docker exec ci_sglang_xeon bash -c "python3 -m pip install pytest expecttest"

      - name: Check AMX Support
        id: check_amx
        timeout-minutes: 5
        run: |
          docker exec -w /sglang-checkout/ ci_sglang_xeon \
            bash -c "python3 -c 'import torch; import sgl_kernel; assert torch._C._cpu._is_amx_tile_supported(); assert hasattr(torch.ops.sgl_kernel, \"convert_weight_packed\"); '"
        continue-on-error: true

      - name: Run UT Cases
        if: steps.check_amx.outcome == 'success'
        timeout-minutes: 20
        run: |
          docker exec -w /sglang-checkout/ ci_sglang_xeon \
            bash -c "cd ./test/srt && python3 run_suite.py --suite per-commit-cpu"

      - name: Cleanup container
        if: always()
        run: |
          docker rm -f ci_sglang_xeon || true

  finish:
    if: always()
    needs: [build-test]
    runs-on: ubuntu-24.04
    steps:
      - name: Check all dependent job statuses
        run: |
          results=(${{ join(needs.*.result, ' ') }})
          for result in "${results[@]}"; do
            if [ "$result" = "failure" ] || [ "$result" = "cancelled" ]; then
              echo "Job failed with result: $result"
              exit 1
            fi
          done
          echo "All jobs completed successfully"
          exit 0
```
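The AMX probe runs with `continue-on-error: true`, and the UT step is gated on `steps.check_amx.outcome == 'success'`, so runners without AMX skip the suite instead of failing the workflow. The probe pulled out of the workflow into a standalone script (same torch/sgl_kernel calls as above; the printed verdict is added here):

```python
# Standalone version of the CI AMX probe above. The two checks mirror the
# workflow step: AMX tile support in the CPU, and the packed-weight kernel
# registered by sgl_kernel.
import torch
import sgl_kernel  # noqa: F401  (importing registers torch.ops.sgl_kernel.*)

has_amx = torch._C._cpu._is_amx_tile_supported()
has_kernel = hasattr(torch.ops.sgl_kernel, "convert_weight_packed")
print(f"AMX tile support: {has_amx}; convert_weight_packed available: {has_kernel}")
```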

.github/workflows/release-docker-deepep.yml

Lines changed: 14 additions & 3 deletions
```diff
@@ -9,6 +9,17 @@ jobs:
   build-dev:
     if: ${{ github.repository == 'sgl-project/sglang' }}
     runs-on: ubuntu-22.04
+
+    strategy:
+      matrix:
+        variant:
+          - base: lmsysorg/sglang:latest
+            tag: deepep
+          - base: lmsysorg/sglang:dev
+            tag: dev-deepep
+          - base: lmsysorg/sglang:blackwell
+            tag: blackwell-deepep
+
     steps:
       - name: Checkout repository
         uses: actions/checkout@v4
@@ -30,7 +41,7 @@ jobs:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_TOKEN }}
 
-      - name: Build and Push DeepEP Image
+      - name: Build and Push Docker Image
         run: |
-          docker build . -f docker/Dockerfile.deepep -t lmsysorg/sglang:deepep --no-cache
-          docker push lmsysorg/sglang:deepep
+          docker build . -f docker/Dockerfile.deepep --build-arg BASE_IMAGE=${{ matrix.variant.base }} -t lmsysorg/sglang:${{ matrix.variant.tag }} --no-cache
+          docker push lmsysorg/sglang:${{ matrix.variant.tag }}
```
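With the matrix in place, one workflow run now produces three images instead of one, and `Dockerfile.deepep` presumably starts from a parameterized `BASE_IMAGE` build-arg. A small sketch of the build/push pairs the matrix expands to (variant values copied from the workflow above):

```python
# Sketch: the three build/push command pairs generated by the matrix above.
variants = [
    {"base": "lmsysorg/sglang:latest", "tag": "deepep"},
    {"base": "lmsysorg/sglang:dev", "tag": "dev-deepep"},
    {"base": "lmsysorg/sglang:blackwell", "tag": "blackwell-deepep"},
]
for v in variants:
    print(
        f"docker build . -f docker/Dockerfile.deepep "
        f"--build-arg BASE_IMAGE={v['base']} -t lmsysorg/sglang:{v['tag']} --no-cache"
    )
    print(f"docker push lmsysorg/sglang:{v['tag']}")
```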

.github/workflows/release-docker-dev-deepep.yml

Lines changed: 0 additions & 36 deletions
This file was deleted.
(new file; name hidden in this view)

Lines changed: 30 additions & 0 deletions
```yaml
name: Release SGLang Router Docker Image

on:
  push:
    branches:
      - main
    paths:
      - "sgl-router/py_src/sglang_router/version.py"
  workflow_dispatch:

jobs:
  publish:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-24.04
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build and Push
        run: |
          version=$(cat sgl-router/py_src/sglang_router/version.py | cut -d'"' -f2)
          tag=v${version}

          docker build . -f docker/Dockerfile.router -t lmsysorg/sglang-router:${tag} --no-cache
          docker push lmsysorg/sglang-router:${tag}
```
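Both release workflows derive the image tag from a `version.py` with `cut -d'"' -f2`: splitting on double quotes makes field 2 the version literal. A Python equivalent, assuming a conventional `__version__ = "..."` file (the version value in the comment is hypothetical):

```python
# Python equivalent of: version=$(cat .../version.py | cut -d'"' -f2)
# '__version__ = "0.1.4"\n'.split('"') -> ['__version__ = ', '0.1.4', '\n']
from pathlib import Path

text = Path("sgl-router/py_src/sglang_router/version.py").read_text()
version = text.split('"')[1]
print(f"lmsysorg/sglang-router:v{version}")
```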
(new file; name hidden in this view)

Lines changed: 35 additions & 0 deletions
```yaml
name: Release Docker Images

on:
  push:
    branches:
      - main
    paths:
      - "python/sglang/version.py"
  workflow_dispatch:

jobs:
  publish:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-24.04
    environment: 'prod'
    strategy:
      matrix:
        build_type: ['all']
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build and Push
        run: |
          version=$(cat python/sglang/version.py | cut -d'"' -f2)
          tag=v${version}-xeon

          docker build . -f docker/Dockerfile.xeon -t lmsysorg/sglang:${tag} --no-cache
          docker push lmsysorg/sglang:${tag}
```

.github/workflows/vllm-dependency-test.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -30,7 +30,7 @@ jobs:
       - name: Install dependencies
         run: |
           bash scripts/ci_install_dependency.sh
-          pip install "vllm==0.8.4"
+          pip install "vllm==0.9.0.1"
           pip install "bitsandbytes>=0.44.0"
 
       - name: Run VLLM dependency tests
```

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -233,3 +233,5 @@ compile_commands.json
 
 # Rust lib
 Cargo.lock
+
+lmms-eval
```

Makefile

Lines changed: 1 addition & 1 deletion
```diff
@@ -19,7 +19,7 @@ format: check-deps ## Format modified Python files using isort and black
 FILES_TO_UPDATE = docker/Dockerfile.rocm \
                   python/pyproject.toml \
                   python/sglang/version.py \
-                  docs/developer/setup_github_runner.md \
+                  docs/references/setup_github_runner.md \
                   docs/start/install.md \
                   benchmark/deepseek_v3/README.md
```

README.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -12,7 +12,7 @@
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/)
+| [**Blog**](https://lmsys.org/blog/2025-05-05-large-scale-ep/)
 | [**Documentation**](https://docs.sglang.ai/)
 | [**Join Slack**](https://slack.sglang.ai/)
 | [**Join Bi-Weekly Development Meeting**](https://meeting.sglang.ai/)
@@ -44,7 +44,7 @@ SGLang is a fast serving framework for large language models and vision language
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
 The core features include:
 
-- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
+- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor parallelism, pipeline parallelism, expert parallelism, structured outputs, chunked prefill, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
 - **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
 - **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
 - **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
@@ -63,7 +63,7 @@ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
 [Development Roadmap (2025 H1)](https://github.com/sgl-project/sglang/issues/4042)
 
 ## Adoption and Sponsorship
-SGLang has been deployed at large scale, serving trillions of tokens in production every day. It is trusted and adopted by a broad range of leading enterprises and institutions, including xAI, NVIDIA, AMD, Google Cloud, Oracle Cloud, LinkedIn, Cursor, Voltage Park, Atlas Cloud, DataCrunch, Baseten, Nebius, Novita, InnoMatrix, RunPod, Stanford, UC Berkeley, UCLA, ETCHED, Jam & Tea Studios, Hyperbolic, as well as major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto standard in the industry, with production deployments running on over 100,000 GPUs worldwide.
+SGLang has been deployed at large scale, generating trillions of tokens in production every day. It is trusted and adopted by a broad range of leading enterprises and institutions, including xAI, NVIDIA, AMD, Google Cloud, Oracle Cloud, LinkedIn, Cursor, Voltage Park, Atlas Cloud, DataCrunch, Baseten, Nebius, Novita, InnoMatrix, RunPod, Stanford, UC Berkeley, UCLA, ETCHED, Jam & Tea Studios, Hyperbolic, as well as major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto standard in the industry, with production deployments running on over 100,000 GPUs worldwide.
 
 <img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/refs/heads/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>
```
benchmark/deepseek_v3/README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -33,7 +33,7 @@ Add [performance optimization options](#performance-optimization-options) as nee
 
 ```bash
 # Installation
-pip install "sglang[all]>=0.4.6.post5"
+pip install "sglang[all]>=0.4.7"
 
 # Launch
 python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
````
