Support dynamic LoRA loading / unloading in engine/server API #7446
Co-authored-by: Baizhou Zhang <[email protected]>
Summary of Changes
Hello @lifuhuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces significant new functionality to the HTTP server API, enabling the dynamic loading and unloading of LoRA (Low-Rank Adaptation) adapters at runtime. This enhancement allows users to manage LoRA models without needing to restart the server, providing greater flexibility and efficiency in resource utilization. The changes involve adding new API endpoints, defining corresponding request/response data structures, and integrating these operations across various components of the system, including the LoRA manager, scheduler, and worker processes.
Highlights
- Dynamic LoRA Management API: New HTTP API endpoints (`/load_lora_adapter` and `/unload_lora_adapter`) have been added to allow dynamic loading and unloading of LoRA adapters without requiring a server restart.
- Request/Response Structures: Dedicated data structures (`LoadLoRAAdapterReqInput`, `LoadLoRAAdapterReqOutput`, `UnloadLoRAAdapterReqInput`, `UnloadLoRAAdapterReqOutput`) are introduced to standardize the communication for these new LoRA operations.
- Core LoRA Manager Enhancements: The `LoRAManager` now provides more robust `load_lora_adapters` functionality, including detailed success/failure reporting for each adapter and improved error handling during the loading process.
- System Integration: The new LoRA management capabilities are integrated throughout the system's request handling pipeline, from the HTTP server entrypoint through the tokenizer manager, scheduler, and worker components, down to the `ModelRunner` and `LoRAManager`.
- Data Parallelism Constraint: Dynamic LoRA loading and unloading is currently restricted to configurations where the data parallelism size (`dp_size`) is 1.
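To make the behavior described in the highlights concrete, here is a minimal, self-contained sketch of a toy adapter registry. It is not the PR's `LoRAManager`: the class `ToyLoRARegistry`, the result type `AdapterOpResult`, and all field names are hypothetical and chosen only for illustration. It shows two ideas from the summary: every load or unload reports success or failure explicitly, and the operations are rejected when `dp_size` is not 1.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AdapterOpResult:
    """Hypothetical result type; not the PR's actual output structure."""
    success: bool
    error_message: str = ""
    loaded_adapters: Dict[str, str] = field(default_factory=dict)


class ToyLoRARegistry:
    """Toy stand-in for an adapter manager (illustration only)."""

    def __init__(self, dp_size: int = 1) -> None:
        self.dp_size = dp_size
        self.adapters: Dict[str, str] = {}  # adapter name -> adapter path

    def _result(self, success: bool, error_message: str = "") -> AdapterOpResult:
        # Always return a snapshot of what is currently loaded.
        return AdapterOpResult(success, error_message, dict(self.adapters))

    def load(self, name: str, path: str) -> AdapterOpResult:
        if self.dp_size != 1:
            return self._result(False, "dynamic LoRA updates require dp_size == 1")
        if name in self.adapters:
            return self._result(False, f"adapter {name!r} is already loaded")
        self.adapters[name] = path
        return self._result(True)

    def unload(self, name: str) -> AdapterOpResult:
        if self.dp_size != 1:
            return self._result(False, "dynamic LoRA updates require dp_size == 1")
        if name not in self.adapters:
            return self._result(False, f"adapter {name!r} is not loaded")
        del self.adapters[name]
        return self._result(True)


registry = ToyLoRARegistry(dp_size=1)
first = registry.load("adapter_a", "/path/to/adapter_a")
duplicate = registry.load("adapter_a", "/path/to/adapter_a")
removed = registry.unload("adapter_a")
rejected = ToyLoRARegistry(dp_size=2).load("adapter_a", "/path/to/adapter_a")
```

Returning a result object instead of raising lets a caller relay the failure reason back to the client, which is one plausible reading of the "detailed success/failure reporting" mentioned above.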
Code Review
This pull request introduces dynamic LoRA loading and unloading capabilities to the HTTP server API. It includes changes to several files to support the new functionality, including adding new API endpoints, data structures, and modifications to the LoRA manager. The code appears well-structured and includes appropriate logging and error handling.
Thank you for your awesome work! @lifuhuang
Absolutely! This PR is still WIP, will let you know once it's ready for review :)
@lifuhuang Documents about this feature can be added in
@lifuhuang Can you please add usage description for
Good idea, added some basic PR description. Will also look into the ipynb doc tmr in a follow-up PR.
LGTM
Impressive work!
Great Work!
Motivation
This is the 2nd PR to support dynamic LoRA loading / unloading (issue: #2686)
Please refer to the 1st PR #7412 to see all the refactoring work behind this change.
Future Work / Known Gaps
During implementation and testing, we identified several features and fixes that are not directly related to dynamic LoRA itself but may affect its overall usability to varying degrees. They need to be addressed in future PRs and are listed here for tracking purposes:
- `dp_size` > 1 (TODO: add issue to track)
Usage
Server
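The original usage snippet did not survive on this page. As a rough sketch only, the two endpoints named in the summary (`/load_lora_adapter` and `/unload_lora_adapter`) could be called as below. The JSON field names `lora_name` and `lora_path`, the use of POST, the base URL, and the helper `build_lora_request` are assumptions for illustration, not confirmed by the text above. The example only builds the requests, so it runs without a live server.

```python
import json
import urllib.request


def build_lora_request(base_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request. Hypothetical helper."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


BASE_URL = "http://localhost:30000"  # assumed server address

# Assumed payload fields: an adapter name plus a path to its weights.
load_req = build_lora_request(
    BASE_URL,
    "/load_lora_adapter",
    {"lora_name": "my_adapter", "lora_path": "/path/to/adapter"},
)

# Assumed payload field: the name of the adapter to remove.
unload_req = build_lora_request(
    BASE_URL,
    "/unload_lora_adapter",
    {"lora_name": "my_adapter"},
)

# To send against a running server:
#   with urllib.request.urlopen(load_req) as resp:
#       print(resp.read().decode("utf-8"))
```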
Engine
Modifications
Checklist