Commit 9941b35

Fridge003 authored and tarinkk committed
Add document for LoRA serving (sgl-project#5521)
1 parent 31f335f commit 9941b35

2 files changed, +205 -0 lines changed


docs/backend/lora.ipynb

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LoRA Serving"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Arguments for LoRA Serving"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following server arguments are relevant for multi-LoRA serving:\n",
    "\n",
    "* `lora_paths`: A mapping from each adapter's name to its path, in the form of `{name}={path} {name}={path}`.\n",
    "\n",
    "* `max_loras_per_batch`: Maximum number of adapters used in each batch. This argument affects the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. Defaults to 8.\n",
    "\n",
    "* `lora_backend`: The backend for running GEMM kernels for LoRA modules. It can be one of `triton` or `flashinfer`, and is set to `triton` by default. For better performance and stability, we recommend the Triton LoRA backend. In the future, faster backends built upon Cutlass or CUDA kernels will be added.\n",
    "\n",
    "* `tp_size`: SGLang supports LoRA serving together with tensor parallelism. `tp_size` controls the number of GPUs used for tensor parallelism. More details on the tensor sharding strategy can be found in the [S-LoRA](https://arxiv.org/pdf/2311.03285) paper. A combined launch command is sketched below.\n",
    "\n",
    "On the client side, the user provides a list of input strings as the batch, together with a list of adapter names, one per input sequence."
   ]
  },
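  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For illustration, a launch command that combines these arguments might look like the following. This is a minimal sketch that is not executed in this notebook: the adapter names and paths are placeholders, and `--tp-size 2` assumes two GPUs are available.\n",
    "\n",
    "```bash\n",
    "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "    --lora-paths lora0=path/to/adapter0 lora1=path/to/adapter1 \\\n",
    "    --max-loras-per-batch 4 --lora-backend triton --tp-size 2 \\\n",
    "    --disable-cuda-graph --disable-radix-cache\n",
    "```"
   ]
  },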
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usage\n",
    "\n",
    "### Serving a Single Adapter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sglang.test.test_utils import is_in_ci\n",
    "\n",
    "if is_in_ci():\n",
    "    from patch import launch_server_cmd\n",
    "else:\n",
    "    from sglang.utils import launch_server_cmd\n",
    "\n",
    "from sglang.utils import wait_for_server, terminate_process\n",
    "\n",
    "import json\n",
    "import requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CUDA graph and radix cache are not yet compatible with LoRA, so both are disabled\n",
    "server_process, port = launch_server_cmd(\n",
    "    \"\"\"\n",
    "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
    "    --max-loras-per-batch 1 --lora-backend triton \\\n",
    "    --disable-cuda-graph --disable-radix-cache\n",
    "\"\"\"\n",
    ")\n",
    "\n",
    "wait_for_server(f\"http://localhost:{port}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = f\"http://127.0.0.1:{port}\"\n",
    "json_data = {\n",
    "    \"text\": [\n",
    "        \"List 3 countries and their capitals.\",\n",
    "        \"AI is a field of computer science focused on\",\n",
    "    ],\n",
    "    \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n",
    "    # The first input uses lora0, and the second input uses the base model\n",
    "    \"lora_path\": [\"lora0\", None],\n",
    "}\n",
    "response = requests.post(\n",
    "    url + \"/generate\",\n",
    "    json=json_data,\n",
    ")\n",
    "print(f\"Output 0: {response.json()[0]['text']}\")\n",
    "print(f\"Output 1: {response.json()[1]['text']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "terminate_process(server_process)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Serving Multiple Adapters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "server_process, port = launch_server_cmd(\n",
    "    \"\"\"\n",
    "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
    "    lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
    "    --max-loras-per-batch 2 --lora-backend triton \\\n",
    "    --disable-cuda-graph --disable-radix-cache\n",
    "\"\"\"\n",
    ")\n",
    "\n",
    "wait_for_server(f\"http://localhost:{port}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = f\"http://127.0.0.1:{port}\"\n",
    "json_data = {\n",
    "    \"text\": [\n",
    "        \"List 3 countries and their capitals.\",\n",
    "        \"AI is a field of computer science focused on\",\n",
    "    ],\n",
    "    \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n",
    "    # The first input uses lora0, and the second input uses lora1\n",
    "    \"lora_path\": [\"lora0\", \"lora1\"],\n",
    "}\n",
    "response = requests.post(\n",
    "    url + \"/generate\",\n",
    "    json=json_data,\n",
    ")\n",
    "print(f\"Output 0: {response.json()[0]['text']}\")\n",
    "print(f\"Output 1: {response.json()[1]['text']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "terminate_process(server_process)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Future Works\n",
    "\n",
    "The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently, CUDA graph and radix attention are not compatible with LoRA and must be manually disabled. Other features, including Unified Paging, a Cutlass backend, and dynamic loading/unloading, are still under development."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -54,6 +54,7 @@ The core features include:
    backend/structured_outputs_for_reasoning_models.ipynb
    backend/custom_chat_template.md
    backend/quantization.md
+   backend/lora.ipynb

 .. toctree::
    :maxdepth: 1
