Support dynamic LoRA loading / unloading in engine/server API #7446

lifuhuang · 2025-06-22T21:21:25Z

Co-author: @Fridge003. This PR builds on the ideas and previous work in PR #2891.

Motivation

This is the second PR to support dynamic LoRA loading / unloading (issue: #2686).

Please refer to the 1st PR #7412 to see all the refactoring work behind this change.

Future Work / Known Gaps
During implementation and testing, we identified several features and fixes that are not directly related to dynamic LoRA itself but affect its overall usability to varying degrees. They will be addressed in future PRs and are listed here for tracking:

Graceful error handling for unfound LoRA: [Feature] Graceful handling of non-existing lora_path in inference request #7447
Support handling LoRA adapters with different target weights: [Bug] LoRA buffer eviction does not correctly handle adapters with different target weights #7426
Support starting server without initial lora_paths: [Feature] Add server arg enable-lora to allow starting up with empty lora-paths #7463
Test and enable dynamic LoRA support for data parallelism with dp_size > 1 (TODO: add issue to track)

Usage

Server

# Start the server.
# Note: in this iteration, at least one entry must be provided in --lora-paths. This will be addressed in a future PR (see Known Gaps above).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --disable-radix-cache \
  --lora-paths lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora 

# load lora
curl -X POST http://localhost:30000/load_lora_adapter \
     -H "Content-Type: application/json" \
     -d '{
           "lora_name": "YOUR_LORA_NAME",
           "lora_path": "YOUR_LORA_PATH"
         }'

# unload lora
curl -X POST http://localhost:30000/unload_lora_adapter \
     -H "Content-Type: application/json" \
     -d '{
           "lora_name": "YOUR_LORA_NAME"
         }'
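The two curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the endpoint paths and JSON keys are taken from the curl examples, while the helper names (`build_lora_request`, `load_request`, `unload_request`, `send`) are made up for illustration and are not part of SGLang. The response schema is not shown in this PR description, so `send` returns the raw response body as text.

```python
import json
import urllib.request


def build_lora_request(base_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the LoRA endpoints (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_request(base_url: str, lora_name: str, lora_path: str) -> urllib.request.Request:
    """Request equivalent to the `load_lora_adapter` curl call above."""
    return build_lora_request(
        base_url, "load_lora_adapter", {"lora_name": lora_name, "lora_path": lora_path}
    )


def unload_request(base_url: str, lora_name: str) -> urllib.request.Request:
    """Request equivalent to the `unload_lora_adapter` curl call above."""
    return build_lora_request(base_url, "unload_lora_adapter", {"lora_name": lora_name})


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request to a running server and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running as shown above, `send(load_request("http://localhost:30000", "YOUR_LORA_NAME", "YOUR_LORA_PATH"))` would perform the load, and `send(unload_request("http://localhost:30000", "YOUR_LORA_NAME"))` the unload.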

Engine

# Likewise, at least one initial entry in lora_paths is required because of the known gap described above.
engine = Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    lora_paths=["lora1=algoprog/fact-generation-llama-3.1"],
    ...
)

# load lora
engine.load_lora_adapter("YOUR_LORA_NAME", "YOUR_LORA_PATH")

# unload lora
engine.unload_lora_adapter("YOUR_LORA_NAME")
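One way to pair the two engine calls is a small context manager that loads an adapter for the duration of a block and unloads it afterwards. This is only a sketch: `temporary_lora` is a hypothetical helper, not an SGLang API, and it assumes nothing about `engine` beyond the `load_lora_adapter` and `unload_lora_adapter` methods shown above. How those methods report failure (exception or return value) is not specified in this description, so the sketch does not inspect their results.

```python
from contextlib import contextmanager


@contextmanager
def temporary_lora(engine, lora_name: str, lora_path: str):
    """Load an adapter on entry and unload it on exit (hypothetical helper).

    `engine` is any object exposing load_lora_adapter(name, path) and
    unload_lora_adapter(name), as in the usage example above.
    """
    engine.load_lora_adapter(lora_name, lora_path)
    try:
        # Hand the adapter name to the caller for use in its requests.
        yield lora_name
    finally:
        # Unload even if the body of the with-block raised.
        engine.unload_lora_adapter(lora_name)
```

Usage would look like `with temporary_lora(engine, "YOUR_LORA_NAME", "YOUR_LORA_PATH") as name: ...`, with requests that reference `name` issued inside the block.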

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

Co-authored-by: Baizhou Zhang <[email protected]>

gemini-code-assist

Summary of Changes

Hello @lifuhuang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant new functionality to the HTTP server API, enabling the dynamic loading and unloading of LoRA (Low-Rank Adaptation) adapters at runtime. This enhancement allows users to manage LoRA models without needing to restart the server, providing greater flexibility and efficiency in resource utilization. The changes involve adding new API endpoints, defining corresponding request/response data structures, and integrating these operations across various components of the system, including the LoRA manager, scheduler, and worker processes.

Highlights

Dynamic LoRA Management API: New HTTP API endpoints (/load_lora_adapter and /unload_lora_adapter) have been added to allow dynamic loading and unloading of LoRA adapters without requiring a server restart.
Request/Response Structures: Dedicated data structures (LoadLoRAAdapterReqInput, LoadLoRAAdapterReqOutput, UnloadLoRAAdapterReqInput, UnloadLoRAAdapterReqOutput) are introduced to standardize the communication for these new LoRA operations.
Core LoRA Manager Enhancements: The LoRAManager now provides more robust load_lora_adapters functionality, including detailed success/failure reporting for each adapter and improved error handling during the loading process.
System Integration: The new LoRA management capabilities are integrated throughout the system's request handling pipeline, from the HTTP server entrypoint through the tokenizer manager, scheduler, and worker components, down to the ModelRunner and LoRAManager.
Data Parallelism Constraint: Dynamic LoRA loading and unloading is currently restricted to configurations where the data parallelism size (dp_size) is 1.
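The per-adapter success/failure reporting and the dp_size restriction listed in the highlights can be illustrated with a toy registry. This is a sketch under stated assumptions: `AdapterResult` and `ToyAdapterRegistry` are hypothetical names with invented fields, not the PR's `LoadLoRAAdapterReqOutput` or `LoRAManager`, whose actual definitions are not shown in this conversation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class AdapterResult:
    """Outcome of one load/unload attempt (hypothetical shape, not SGLang's)."""
    lora_name: str
    success: bool
    error_message: Optional[str] = None


@dataclass
class ToyAdapterRegistry:
    """Toy stand-in for an adapter manager; `loader` maps a path to weights."""
    loader: Callable[[str], object]
    dp_size: int = 1
    adapters: Dict[str, object] = field(default_factory=dict)

    def _check_dp(self, lora_name: str) -> Optional[AdapterResult]:
        # Mirrors the stated constraint: dynamic updates only when dp_size is 1.
        if self.dp_size != 1:
            return AdapterResult(lora_name, False, "dynamic LoRA updates require dp_size == 1")
        return None

    def load(self, lora_name: str, lora_path: str) -> AdapterResult:
        rejected = self._check_dp(lora_name)
        if rejected is not None:
            return rejected
        if lora_name in self.adapters:
            return AdapterResult(lora_name, False, f"adapter '{lora_name}' is already loaded")
        try:
            self.adapters[lora_name] = self.loader(lora_path)
        except Exception as exc:
            # Report the failure to the caller instead of propagating the exception.
            return AdapterResult(lora_name, False, str(exc))
        return AdapterResult(lora_name, True)

    def unload(self, lora_name: str) -> AdapterResult:
        rejected = self._check_dp(lora_name)
        if rejected is not None:
            return rejected
        if lora_name not in self.adapters:
            return AdapterResult(lora_name, False, f"adapter '{lora_name}' is not loaded")
        del self.adapters[lora_name]
        return AdapterResult(lora_name, True)
```

Returning a result object rather than raising lets an HTTP handler turn each outcome into a response without a server-side crash, which is the kind of graceful reporting the summary describes.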


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces dynamic LoRA loading and unloading capabilities to the HTTP server API. It includes changes to several files to support the new functionality, including adding new API endpoints, data structures, and modifications to the LoRA manager. The code appears well-structured and includes appropriate logging and error handling.

python/sglang/srt/lora/lora_manager.py

python/sglang/srt/model_executor/model_runner.py

Fridge003 · 2025-06-22T21:28:24Z

Thank you for your awesome work! @lifuhuang
I feel we need to add a test for this feature, under test/srt/models/lora folder~

lifuhuang · 2025-06-22T21:38:25Z

> Thank you for your awesome work! @lifuhuang
> I feel we need to add a test for this feature, under test/srt/models/lora folder~

Absolutely! This PR is still WIP, will let you know once it's ready for review :)

Fridge003 · 2025-06-23T23:04:04Z

@lifuhuang Documentation for this feature can be added in docs/backend/lora.ipynb and docs/backend/native_api.ipynb. Adding the documentation can be left for a future PR.

Fridge003 · 2025-06-24T01:57:30Z

@lifuhuang Can you please add a usage description for /load_lora_adapter and /unload_lora_adapter in this PR, so other users can start using this feature as soon as the PR is merged?

test/srt/models/lora/test_lora_update.py

python/sglang/srt/managers/scheduler.py

lifuhuang · 2025-06-24T06:55:16Z

> @lifuhuang Can you please add a usage description for /load_lora_adapter and /unload_lora_adapter in this PR, so other users can start using this feature as soon as the PR is merged?

Good idea, I added a basic usage description to the PR. I will also look into the ipynb doc tomorrow in a follow-up PR.

test/srt/test_update_lora_adapters.py

whybeyoung · 2025-06-24T09:44:49Z

LGTM

Fridge003

Impressive work!

test/srt/models/lora/test_lora_update.py

lifuhuang · 2025-06-27T05:56:15Z

The CI failures are unrelated to this PR.

Created 2 issues for tracking:
#7587 #7586

Fridge003 · 2025-06-27T06:41:21Z

> The CI failures are unrelated to this PR.
>
> Created 2 issues for tracking: #7587 #7586

Yes, I think these are just flaky tests

whybeyoung · 2025-07-04T14:47:16Z

Great work!

Support dynamic LoRA loading / unloading in http server API.

3d528a2

Co-authored-by: Baizhou Zhang <[email protected]>

lifuhuang requested review from merrymercy, Ying1123, hnyls2002, zhyncs, ispobock, xiezhq-hermann, Fridge003 and zhaochenyang20 as code owners June 22, 2025 21:21

gemini-code-assist bot reviewed Jun 22, 2025

python/sglang/srt/lora/lora_manager.py (outdated, resolved)

python/sglang/srt/model_executor/model_runner.py (outdated, resolved)

Fridge003 mentioned this pull request Jun 22, 2025

[Feature] Support dynamic loading and unloading of Lora adapters #2891

Closed

3 tasks

Fridge003 mentioned this pull request Jun 21, 2025

[Feature] Lora Development Roadmap #2929

Open

16 tasks

lifuhuang added 2 commits June 22, 2025 21:46

Update data schema and error handling.

a7d0ddc

Merge remote-tracking branch 'origin/main' into lifuhuang/dyn-lora

80965ff

lifuhuang mentioned this pull request Jun 23, 2025

[Feature] Add server arg enable-lora to allow starting up with empty lora-paths #7463

Open

2 tasks

lifuhuang added 2 commits June 23, 2025 13:30

Add dynamic update tests.

5da4dbf

Merge branch 'main' into lifuhuang/dyn-lora

41dc004

lifuhuang changed the title ~~[WIP] Support dynamic LoRA loading / unloading in HTTP server API.~~ Support dynamic LoRA loading / unloading in HTTP server API. Jun 23, 2025

lifuhuang changed the title ~~Support dynamic LoRA loading / unloading in HTTP server API.~~ Support dynamic LoRA loading / unloading in engine and http server. Jun 23, 2025

lifuhuang added the ready-for-review label Jun 23, 2025

lifuhuang changed the title ~~Support dynamic LoRA loading / unloading in engine and http server.~~ Support dynamic LoRA loading / unloading in engine/server API Jun 23, 2025

lifuhuang mentioned this pull request Jun 23, 2025

Development Roadmap (2025 H1) #4042

Open

67 tasks

lifuhuang and others added 2 commits June 23, 2025 14:16

Enable test in run_suite.

2f51dfe

Merge branch 'main' into lifuhuang/dyn-lora

1b33871

Fridge003 mentioned this pull request Jun 23, 2025

[Feature] Dynamic Lora Support in SGLang #2686

Closed

2 tasks

Fridge003 reviewed Jun 24, 2025

Address PR comments.

74e8b2d

lifuhuang added 2 commits June 23, 2025 23:57

Fix Lint.

a39fdf6

Merge branch 'main' into lifuhuang/dyn-lora

b030273

lifuhuang requested a review from Fridge003 June 24, 2025 06:59

Fridge003 reviewed Jun 24, 2025

test/srt/test_update_lora_adapters.py (outdated, resolved)

lifuhuang added 3 commits June 24, 2025 16:01

Refactor to merge engine & server test.

3e927b4

Merge branch 'main' into lifuhuang/dyn-lora

806af52

Refactor.

0be68b7

lifuhuang requested a review from Fridge003 June 24, 2025 23:43

lifuhuang added 2 commits June 24, 2025 16:59

Fix

4162640

minor.

ab1e0a9

Fridge003 approved these changes Jun 26, 2025

test/srt/models/lora/test_lora_update.py (outdated, resolved)

lifuhuang added the ready-to-merge label and removed the ready-for-review label Jun 26, 2025

Merge branch 'main' into lifuhuang/dyn-lora

82f7494

lifuhuang enabled auto-merge (squash) June 27, 2025 00:21

Merge branch 'main' into lifuhuang/dyn-lora

86ffb18

lifuhuang assigned zhyncs Jun 27, 2025

lifuhuang added 2 commits June 27, 2025 15:03

Merge branch 'main' into lifuhuang/dyn-lora

45e31db

Merge remote-tracking branch 'origin/main' into lifuhuang/dyn-lora

e1f1b10

zhyncs disabled auto-merge June 28, 2025 03:59

zhyncs merged commit 49538d1 into main Jun 28, 2025
1 of 47 checks passed

zhyncs deleted the lifuhuang/dyn-lora branch June 28, 2025 04:00
