Support controlling nsys start and end range programmatically #4688

fzyzcjy · 2025-03-23T01:14:50Z

Motivation

With this change, the `/start_profile` and `/stop_profile` endpoints can be used to control the start and end of the nsys capture range.

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.


fzyzcjy added 6 commits March 23, 2025 09:12

ab6c783 more
ed5d169 more
cc64d08 more
a2ae2a5 more
5c82101 more
82c241a fmt

fzyzcjy requested review from merrymercy, Ying1123, hnyls2002 and xiezhq-hermann as code owners March 23, 2025 01:14

fzyzcjy and others added 2 commits March 23, 2025 09:16

3e2caf6 more
9e635f8 Merge branch 'main' into feat/profiler_nsys

fzyzcjy mentioned this pull request Mar 25, 2025: Support with_stack and record_shapes in profiler #4740 (Merged)

merrymercy merged commit 53a2c3b into sgl-project:main Mar 28, 2025
14 of 21 checks passed

yuhsuan-t mentioned this pull request Apr 2, 2025: Add validation for nsys and torch profiling #5004 (Open)

jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025:

4ab2f68 Support controlling nsys start and end range programmatically (sgl-project#4688)
