
Support dynamic LoRA loading / unloading in engine/server API #7446


Merged

19 commits merged into main from lifuhuang/dyn-lora on Jun 28, 2025

Conversation

lifuhuang
Collaborator

@lifuhuang lifuhuang commented Jun 22, 2025

Co-author: @Fridge003. This PR builds on the idea and prior work from PR #2891.

Motivation

This is the second PR toward supporting dynamic LoRA loading / unloading (issue: #2686).

Please refer to the 1st PR #7412 to see all the refactoring work behind this change.

Future Work / Known Gaps
During implementation and testing, we identified several features / fixes that are not directly part of dynamic LoRA itself but affect its overall usability to varying degrees. They will be addressed in future PRs and are listed here for tracking:

  1. Graceful error handling for unfound LoRA: [Feature] Graceful handling of non-existing lora_path in inference request #7447
  2. Support handling LoRA adapters with different target weights: [Bug] LoRA buffer eviction does not correctly handle adapters with different target weights #7426
  3. Support starting server without initial lora_paths: [Feature] Add server arg enable-lora to allow starting up with empty lora-paths #7463
  4. Test and enable dynamic LoRA support for data parallelism (dp_size > 1) (TODO: add issue to track)

Usage

Server

# Start server 
# Note: in this iteration, at least one entry must be provided in --lora-paths.
# This limitation will be addressed in future PRs (see Known Gaps above).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --disable-radix-cache \
  --lora-paths lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora 

# load lora
curl -X POST http://localhost:30000/load_lora_adapter \
     -H "Content-Type: application/json" \
     -d '{
           "lora_name": "YOUR_LORA_NAME",
           "lora_path": "YOUR_LORA_PATH"
         }'

# unload lora
curl -X POST http://localhost:30000/unload_lora_adapter \
     -H "Content-Type: application/json" \
     -d '{
           "lora_name": "YOUR_LORA_NAME"
         }'
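
As an illustrative alternative to curl, the same endpoints can be called from Python using only the standard library. This is a sketch, not part of the PR itself; the host/port match the server launched above, and the adapter name/path are placeholders:

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # matches the --port used above

def lora_request(endpoint: str, **payload) -> urllib.request.Request:
    """Build a POST request for a LoRA management endpoint."""
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Load an adapter (send with urllib.request.urlopen(load_req) against a running server).
load_req = lora_request("load_lora_adapter",
                        lora_name="YOUR_LORA_NAME",
                        lora_path="YOUR_LORA_PATH")

# Unload it again.
unload_req = lora_request("unload_lora_adapter", lora_name="YOUR_LORA_NAME")
```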

Engine

# Likewise, at least one initial entry in lora_paths is required due to the known gap noted above.
engine = Engine(
      model_path="meta-llama/Llama-3.1-8B-Instruct",
      lora_paths=["lora1=algoprog/fact-generation-llama-3.1"], 
      ...
)

# load lora
engine.load_lora_adapter("YOUR_LORA_NAME", "YOUR_LORA_PATH")

# unload lora
engine.unload_lora_adapter("YOUR_LORA_NAME")
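
One convenience pattern on top of these two methods is a context manager that guarantees the adapter is unloaded even if the enclosed work raises. This is a sketch using only the two Engine methods shown above, not part of the PR; note that graceful handling of non-existent lora_path is a known gap (see issue #7447), so a bad path may currently fail ungracefully:

```python
from contextlib import contextmanager

@contextmanager
def temporary_lora(engine, name: str, path: str):
    """Load a LoRA adapter for the duration of a block, then unload it.

    Relies only on the documented load_lora_adapter / unload_lora_adapter
    methods; the unload runs in a finally block so it happens even on error.
    """
    engine.load_lora_adapter(name, path)
    try:
        yield name
    finally:
        engine.unload_lora_adapter(name)
```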

Modifications

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @lifuhuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant new functionality to the HTTP server API, enabling the dynamic loading and unloading of LoRA (Low-Rank Adaptation) adapters at runtime. This enhancement allows users to manage LoRA models without needing to restart the server, providing greater flexibility and efficiency in resource utilization. The changes involve adding new API endpoints, defining corresponding request/response data structures, and integrating these operations across various components of the system, including the LoRA manager, scheduler, and worker processes.

Highlights

  • Dynamic LoRA Management API: New HTTP API endpoints (/load_lora_adapter and /unload_lora_adapter) have been added to allow dynamic loading and unloading of LoRA adapters without requiring a server restart.
  • Request/Response Structures: Dedicated data structures (LoadLoRAAdapterReqInput, LoadLoRAAdapterReqOutput, UnloadLoRAAdapterReqInput, UnloadLoRAAdapterReqOutput) are introduced to standardize the communication for these new LoRA operations.
  • Core LoRA Manager Enhancements: The LoRAManager now provides more robust load_lora_adapters functionality, including detailed success/failure reporting for each adapter and improved error handling during the loading process.
  • System Integration: The new LoRA management capabilities are integrated throughout the system's request handling pipeline, from the HTTP server entrypoint through the tokenizer manager, scheduler, and worker components, down to the ModelRunner and LoRAManager.
  • Data Parallelism Constraint: Dynamic LoRA loading and unloading is currently restricted to configurations where the data parallelism size (dp_size) is 1.
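
Based on the endpoint payloads shown in the Usage section, the request/response structures named above can be pictured roughly as follows. This is an inferred sketch, not the actual SGLang definitions; the field names on the output side (success, error_message) are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoadLoRAAdapterReqInput:
    lora_name: str  # name the adapter will be referenced by in requests
    lora_path: str  # local path or Hugging Face repo of the adapter

@dataclass
class LoadLoRAAdapterReqOutput:
    success: bool                        # assumed field: per-adapter load result
    error_message: Optional[str] = None  # assumed field: populated on failure

@dataclass
class UnloadLoRAAdapterReqInput:
    lora_name: str  # name of the adapter to remove

@dataclass
class UnloadLoRAAdapterReqOutput:
    success: bool
    error_message: Optional[str] = None
```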


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces dynamic LoRA loading and unloading capabilities to the HTTP server API. It includes changes to several files to support the new functionality, including adding new API endpoints, data structures, and modifications to the LoRA manager. The code appears well-structured and includes appropriate logging and error handling.

@Fridge003
Collaborator

Thank you for your awesome work! @lifuhuang
I feel we need to add a test for this feature, under test/srt/models/lora folder~

@lifuhuang
Collaborator Author

Thank you for your awesome work! @lifuhuang I feel we need to add a test for this feature, under test/srt/models/lora folder~

Absolutely! This PR is still WIP, will let you know once it's ready for review :)

@lifuhuang lifuhuang changed the title [WIP] Support dynamic LoRA loading / unloading in HTTP server API. Support dynamic LoRA loading / unloading in HTTP server API. Jun 23, 2025
@lifuhuang lifuhuang changed the title Support dynamic LoRA loading / unloading in HTTP server API. Support dynamic LoRA loading / unloading in engine and http server. Jun 23, 2025
@lifuhuang lifuhuang changed the title Support dynamic LoRA loading / unloading in engine and http server. Support dynamic LoRA loading / unloading in engine/server API Jun 23, 2025
@lifuhuang lifuhuang mentioned this pull request Jun 23, 2025
67 tasks
@Fridge003
Collaborator

Fridge003 commented Jun 23, 2025

@lifuhuang Documents about this feature can be added in docs/backend/lora.ipynb and docs/backend/native_api.ipynb. Adding documents can be left for future PR

@Fridge003
Collaborator

Fridge003 commented Jun 24, 2025

@lifuhuang Can you please add a usage description for /load_lora_adapter and /unload_lora_adapter in this PR, so other users can start using this feature directly after it is merged?

@lifuhuang
Collaborator Author

@lifuhuang Can you please add a usage description for /load_lora_adapter and /unload_lora_adapter in this PR, so other users can start using this feature directly after it is merged?

Good idea, I added a basic usage description to the PR. Will also look into the ipynb docs tomorrow in a follow-up PR.

@lifuhuang lifuhuang requested a review from Fridge003 June 24, 2025 06:59
@whybeyoung
Collaborator

LGTM

@lifuhuang lifuhuang requested a review from Fridge003 June 24, 2025 23:43
Collaborator

@Fridge003 Fridge003 left a comment


Impressive work!

@lifuhuang lifuhuang added the ready-to-merge label ("The PR is ready to merge after the CI is green.") and removed the ready-for-review label Jun 26, 2025
@lifuhuang lifuhuang enabled auto-merge (squash) June 27, 2025 00:21
@lifuhuang
Collaborator Author

The CI failures are unrelated to this PR.

Created 2 issues for tracking:
#7587 #7586

@Fridge003
Collaborator

The CI failures are unrelated to this PR.

Created 2 issues for tracking: #7587 #7586

Yes, I think these are just flaky tests

@zhyncs zhyncs disabled auto-merge June 28, 2025 03:59
@zhyncs zhyncs merged commit 49538d1 into main Jun 28, 2025
1 of 47 checks passed
@zhyncs zhyncs deleted the lifuhuang/dyn-lora branch June 28, 2025 04:00
@whybeyoung
Collaborator

Great Work!
