LMDeploy Distserve #3304


Merged: 86 commits, May 8, 2025
Changes shown are from 27 commits.

Commits
97d6d5d
sync main
JimyMa Apr 1, 2025
3241c1a
typo correct
JimyMa Apr 2, 2025
1788a28
1. typo 2. add migration event
JimyMa Apr 2, 2025
03b363f
1. move slime to 'https://github.com/JimyMa/DLSlime.git' and init rea…
JimyMa Apr 3, 2025
aabb72b
Update disagg README
JimyMa Apr 3, 2025
3ba605f
mute slime when disable distserve
JimyMa Apr 3, 2025
2e6ee7a
remove build_migration.sh
JimyMa Apr 3, 2025
cdf55c1
revert debug code
JimyMa Apr 3, 2025
ace6ece
1. identify interface. 2. add multi backend registry
JimyMa Apr 6, 2025
481052e
add dlslime max transfer batch
JimyMa Apr 6, 2025
f9b7409
add an infinistore interface
JimyMa Apr 6, 2025
60032b6
add load/store
JimyMa Apr 7, 2025
aa43faa
conditional register of Multi Migration Backend
JimyMa Apr 8, 2025
97e4430
merge router to proxy
JimyMa Apr 11, 2025
1e6c4da
remove redundant print
JimyMa Apr 11, 2025
290e606
Merge branch 'main' of github.com:JimyMa/lmdeploy into distserve-update
JimyMa Apr 11, 2025
b530384
1. remove redundant print 2. revert safe_run
JimyMa Apr 11, 2025
efcb72c
dsv3 kvtransfer support (bypass v cache)
JimyMa Apr 12, 2025
a3d973b
dsv3 debug, 1. change log info to log debug of log resp. 2. add num_c…
JimyMa Apr 12, 2025
31fd9f3
DSV3 Debug, known issue:
JimyMa Apr 14, 2025
48d791a
revert match to if,else
JimyMa Apr 14, 2025
2f02e05
[bugfix] rename typo
JimyMa Apr 14, 2025
ae959a0
[refactor] refactor pd_conn
JimyMa Apr 14, 2025
11d9961
1. format code. 2. add engine_role for passing ut test
JimyMa Apr 14, 2025
18da0fb
1. format code 2. parse dp, ep, and dp rank to DisaggEngineConfig
JimyMa Apr 14, 2025
a478c77
1. add pd conn timeout, 2. add default EngineRole to Hybrid, 3. fix d…
JimyMa Apr 15, 2025
c490de4
1. refactor PDConnection Pool
JimyMa Apr 17, 2025
df3f9ef
refactor debug
JimyMa Apr 18, 2025
61ad2a7
fix migration loop bug
JimyMa Apr 18, 2025
ad27c3a
add proxy arguments about distserve
JimyMa Apr 18, 2025
1c3b20c
bugfix
JimyMa Apr 18, 2025
119059f
debug interface
JimyMa Apr 18, 2025
1f220d4
remove unnecessary EngineRole check.
JimyMa Apr 18, 2025
0a58979
add v1/chat/completions support
JimyMa Apr 18, 2025
83838d8
remove redundant print
JimyMa Apr 18, 2025
b108752
async free cache
JimyMa Apr 18, 2025
74d9256
async free cache
JimyMa Apr 18, 2025
39b2c4f
Merge branch 'main' of github.com:JimyMa/lmdeploy into distserve-micr…
JimyMa Apr 19, 2025
65ba59f
1. add some comments.
JimyMa Apr 19, 2025
3af751b
1. bugfix
JimyMa Apr 21, 2025
6028ec2
[proxy] add connection_warmup api
JimyMa Apr 21, 2025
3047e7b
1. bugfix (warmup_connection_typo and wrong args) 2. preserve cache b…
JimyMa Apr 21, 2025
649b51e
[disagg] update readme, 1. fault tolerance and 2. replace router to p…
JimyMa Apr 21, 2025
531524a
bugfix
JimyMa Apr 21, 2025
ce660ca
fix decode back pressure bug
JimyMa Apr 21, 2025
957bd68
1. add migration_request to chat/completions for correctly cache free
JimyMa Apr 21, 2025
f6de868
2. free cache bugfix
JimyMa Apr 22, 2025
7437bfa
1. fix lock running bug
JimyMa Apr 22, 2025
b0a8f1f
1. fix dist.broadcast deadlock
JimyMa Apr 23, 2025
a7bb7c4
[lint] 1. fix lint
JimyMa Apr 24, 2025
d488d87
rename Ethernet to RoCE
JimyMa Apr 24, 2025
b626d9e
change enum.Enum.__members__[elem] to enum.Enum[elem] directly
JimyMa Apr 24, 2025
2d6f8c1
update readme
JimyMa Apr 24, 2025
fec61ba
update migration-backend
JimyMa Apr 24, 2025
2637091
1. update readme 2. move module to string for conditional import
JimyMa Apr 24, 2025
3dedc69
1. update readme
JimyMa Apr 24, 2025
c09a06b
1. remove magic number and handle long assignments in dlslime. 2. add…
JimyMa Apr 25, 2025
160cb3c
fix error migration in dummy situation
JimyMa Apr 25, 2025
e97a486
1. bugfix when token is not a decodable utf-8 (in test)
JimyMa Apr 25, 2025
0eb588a
1. overlapping migration and forward.
JimyMa Apr 26, 2025
a048dfd
bump dlslime to v0.0.1.post5
JimyMa Apr 29, 2025
506bdb2
remove print
JimyMa Apr 29, 2025
4e0f31d
remove free in decode engine because already freed in proxy
JimyMa Apr 29, 2025
3f53e64
1. bump dlslime to 0.0.1.post7
JimyMa May 6, 2025
b70fc44
1. [proxy] revert self.nodes to nodes 2. [api_server] remove redundan…
JimyMa May 6, 2025
6498133
Merge branch 'main' of https://github.com/JimyMa/LMDeploy into distse…
JimyMa May 6, 2025
8d89f55
1. [cli] remove available_nic args
JimyMa May 6, 2025
4ac8f37
format comments
JimyMa May 6, 2025
d858e81
[pytorch paging] remove redundant logger
JimyMa May 6, 2025
6741c48
[model_agent] bugfix caused by merge
JimyMa May 6, 2025
10a70c9
[model agent] bypass model agent migrate
JimyMa May 7, 2025
c9d9e13
revert migrate to sync mode
JimyMa May 7, 2025
d292bf5
bypass model agent migrate in uni_executor
JimyMa May 7, 2025
70dc438
[proxy] set default serving strategy to DistServe
JimyMa May 7, 2025
2c54627
1. [disagg] update readme
JimyMa May 7, 2025
82a0a58
info -> debug
JimyMa May 7, 2025
ab4a5b9
remove unused code
JimyMa May 7, 2025
c8212e3
lazily initialize migration event
JimyMa May 7, 2025
0e83d26
add nvlink support
JimyMa May 7, 2025
5312fac
mute TCP support by now
JimyMa May 7, 2025
53091e3
update readme for exception
JimyMa May 7, 2025
4af8d3d
set migration token_ids output to numpy array
JimyMa May 7, 2025
76c3a04
update readme
JimyMa May 7, 2025
5f10df9
In PD Disaggregation Mode, fallback next token ids to CPU
JimyMa May 7, 2025
25f3488
1. [disagg] update readme
JimyMa May 8, 2025
2c70c55
move disagg to pytorch backend
JimyMa May 8, 2025
31 changes: 28 additions & 3 deletions lmdeploy/cli/serve.py
@@ -1,5 +1,5 @@
# Copyright (c) OpenMMLab. All rights reserved.

from lmdeploy.disagg.messages import EngineRole, MigrationBackend, MigrationTransportProtocol
Review comment (Collaborator): We can put this after line 307 to avoid unnecessary importing time.

from lmdeploy.utils import get_max_batch_size

from .cli import CLI
@@ -125,6 +125,23 @@ def add_parser_api_server():
'engine’s tasks once the maximum number of concurrent requests is '
'reached, regardless of any additional requests sent by clients '
'concurrently during that time. Default to None.')
parser.add_argument('--role',
type=str,
default='Hybrid',
choices=['Hybrid', 'Prefill', 'Decode'],
help='Hybrid for Non-Disaggregated Engine;'
'Prefill for Disaggregated Prefill Engine;'
'Decode for Disaggregated Decode Engine;')
parser.add_argument('--migration-backend',
type=str,
default='DLSlime',
choices=['DLSlime', 'Mooncake', 'InfiniStore'],
help='kvcache migration management backend when PD disaggregation')
parser.add_argument('--migration-protocol',
type=str,
default='RDMA',
choices=['TCP', 'RDMA', 'NVLINK'],
help='kvcache migration protocol')
# common args
ArgumentHelper.backend(parser)
ArgumentHelper.log_level(parser)
@@ -215,7 +232,12 @@ def add_parser_proxy():
parser.set_defaults(run=SubCliServe.proxy)
parser.add_argument('--server-name', type=str, default='0.0.0.0', help='Host ip for proxy serving')
parser.add_argument('--server-port', type=int, default=8000, help='Server port of the proxy')
parser.add_argument('--strategy',
parser.add_argument('--serving-strategy',
type=str,
choices=['Disaggregated', 'NonDisaggregated'],
default='NonDisaggregated',
help='the strategy to dispatch requests to nodes')
Review comment (Collaborator): May clarify the help info. It is the same as --routing-strategy.

Review comment (Collaborator):
The api_server is assigned a specific "role" in this PR. I propose updating the communication protocol between the api_server and the proxy_server to include this role information.

Benefits:

  • The proxy server can make decisions based on the api_server's role.
  • This change would allow us to remove the --serving-strategy option, simplifying the argument list.

Would this approach be feasible? I’d appreciate any feedback or suggestions.

Review comment (Collaborator):
Agree. --serving-strategy can be removed and pd mode can be inferred from engine role in backend config.
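
If the role were carried in the api_server's registration with the proxy, the proxy could infer PD mode on its own and the flag could be dropped. A purely illustrative sketch of such a registration payload (the endpoint and field names here are hypothetical, not the current proxy API):

```python
import requests

# Hypothetical registration payload; the actual proxy API may differ.
node_info = {
    'url': 'http://prefill-host:23333',
    'role': 'Prefill',  # Hybrid / Prefill / Decode, mirroring --role on the api_server
}
requests.post('http://proxy-host:8000/nodes/add', json=node_info)
```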

parser.add_argument('--routing-strategy',
type=str,
choices=['random', 'min_expected_latency', 'min_observed_latency'],
default='min_expected_latency',
@@ -307,7 +329,10 @@ def api_server(args):
device_type=args.device,
quant_policy=args.quant_policy,
eager_mode=args.eager_mode,
max_prefill_token_num=args.max_prefill_token_num)
max_prefill_token_num=args.max_prefill_token_num,
role=EngineRole.__members__[args.role],
Review comment (Collaborator): Can we use EngineRole[args.role]? (A short note on this follows after this file's diff.)

migration_backend=MigrationBackend.__members__[args.migration_backend],
migration_protocol=MigrationTransportProtocol.__members__[args.migration_protocol])
else:
from lmdeploy.messages import TurbomindEngineConfig
backend_config = TurbomindEngineConfig(dtype=args.dtype,
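
On the EngineRole[args.role] question above: for a standard Python Enum, subscripting the class by member name is equivalent to going through __members__, so the shorter form should behave the same here. A minimal sketch with a stand-in enum (the real definition lives in lmdeploy.disagg.messages):

```python
from enum import Enum

class EngineRole(Enum):
    # Stand-in members mirroring the CLI choices; not the actual lmdeploy definition.
    Hybrid = 1
    Prefill = 2
    Decode = 3

role = 'Prefill'
# Lookup by member name: both expressions return the same member.
assert EngineRole[role] is EngineRole.__members__[role]
```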
58 changes: 58 additions & 0 deletions lmdeploy/disagg/README.md
@@ -0,0 +1,58 @@
# LMDeploy-DistServe

## Key Components
1. **Router Service**: Coordinates between prefill/decode engines
2. **Migration Manager**: Facilitates high-performance memory sharing

## Installation
```
# Inference Engine
pip install 'lmdeploy[all]>=0.7.0'

# Transfer Engine
pip install dlslime==0.0.1.post1
```

## Quick Start
### 1. Configure Endpoints
First deploy your prefill and decode engines.

``` shell
# Prefill Engine
CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333 --role Prefill --tp 2 --cache-block-seq 32
# Decode Engine
CUDA_VISIBLE_DEVICES=2,3 lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23334 --role Decode --tp 2 --cache-block-seq 32
```

### 2. Launch Router Service

``` shell
python -m lmdeploy.disagg.router \
--host 0.0.0.0 \
--port 5000 \
--prefill-endpoint http://prefill-host:port1 http://prefill-host:port2 \
--decode-endpoint http://decode-host:port3 http://decode-host:port4
```

## API Usage

```shell
# API Invoke
curl -X POST "http://localhost:5000/v1/completions" \
-H "Content-Type: application/json" \
-d '{"model": "internlm/internlm2_5-7b-chat", "temperature":0, "prompt": "Shanghai is a city that ", "max_tokens": 16, "stream": false}'
# Output
{"id":"2","object":"text_completion","created":1743662400,"model":"/nvme1/majinming/hub/models--internlm--internlm2_5-7b-chat/snapshots/4434a5ffc2582f9d5ac45085043ed3e3264f0a9b","choices":[{"index":0,"text":" is very famous for its skyscrapers. It is also a city","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":7,"total_tokens":23,"completion_tokens":16}}
```

## Troubleshooting

### RDMA Connection Failed:

``` bash
ibstatus # Verify IB device status
ibv_devinfo # Check device capabilities
```

### Check NVSHMEM Configuration
Verify that NVSHMEM is installed and configured correctly.
Review comment (Collaborator):
Could you kindly provide the checking method or related url links?

1 change: 1 addition & 0 deletions lmdeploy/disagg/__init__.py
@@ -0,0 +1 @@
# Copyright (c) OpenMMLab. All rights reserved.
24 changes: 24 additions & 0 deletions lmdeploy/disagg/backend/__init__.py
@@ -0,0 +1,24 @@
from typing import Dict
from lmdeploy.logger import get_logger

logger = get_logger("lmdeploy")


try:
logger.debug("Registering DLSlime Backend")
Review comment (Collaborator): I think we can use the INFO log when trying to register the kv transfer engine backend. If an exception occurs, it's better to log a WARNING or an ERROR message. (A sketch of this pattern follows after this file's diff.)

from .dlslime import DLSlimeBackend
except ImportError as e:
logger.debug("Disable DLSlime Backend")

try:
logger.debug("Registering Mooncake Backend")
from .mooncake import MooncakeBackend
except ImportError as e:
logger.debug("Disable Mooncake Backend")


try:
logger.debug("Registering InfiniStoreBackend Backend")
from .infinistore import InfiniStoreBackend
except ImportError as e:
logger.debug("Disable InfiniStoreBackend Backend")
12 changes: 12 additions & 0 deletions lmdeploy/disagg/backend/backend.py
@@ -0,0 +1,12 @@
from lmdeploy.disagg.messages import MigrationBackend


MIGRATION_BACKENDS = {}
Review comment (@lvhan028, Apr 23, 2025): I think we can use the mmengine registry instead of making a new one, like we did in lmdeploy/model.py. (A sketch of this alternative follows after this file's diff.)



def register_migration_backend(backend_name: MigrationBackend):
def register(cls):
MIGRATION_BACKENDS[backend_name] = cls
return cls

return register
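
Regarding the mmengine suggestion above, a rough sketch of what a Registry-based replacement for this module might look like (assuming mmengine is available; the registered class below is only a placeholder):

```python
from mmengine.registry import Registry

# Plays the same role as the MIGRATION_BACKENDS dict above.
MIGRATION_BACKENDS = Registry('migration_backend')


@MIGRATION_BACKENDS.register_module(name='DLSlime')
class DLSlimeBackend:  # placeholder; the real backend lives in dlslime.py
    pass


# Lookup by name, e.g. MigrationBackend.DLSlime.name
backend_cls = MIGRATION_BACKENDS.get('DLSlime')
```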
40 changes: 40 additions & 0 deletions lmdeploy/disagg/backend/base.py
@@ -0,0 +1,40 @@
from abc import abstractmethod

from lmdeploy.disagg.messages import (
MigrationInitRequest,
MigrationConnectionRequest,
MigrationAssignment,
MigrationRegisterMemoryRequest,
MigrationTransportProtocol
)


class MigrationBackendImpl:
@abstractmethod
def p2p_initialize(self, init_request: MigrationInitRequest):
raise NotImplementedError

@abstractmethod
def register_memory_region(self, register_mr_request:MigrationRegisterMemoryRequest):
raise NotImplementedError

@abstractmethod
def endpoint_info(self, remote_engine_id: int, protocol: MigrationTransportProtocol):
return NotImplementedError

@abstractmethod
def p2p_connect(self, connect_request: MigrationConnectionRequest):
raise NotImplementedError

@abstractmethod
async def p2p_migrate(self, assignment: MigrationAssignment):
raise NotImplementedError

@abstractmethod
async def store(self, assignment: MigrationAssignment):
raise NotImplementedError

@abstractmethod
async def load(self, assignment: MigrationAssignment):
raise NotImplementedError

84 changes: 84 additions & 0 deletions lmdeploy/disagg/backend/dlslime.py
@@ -0,0 +1,84 @@
from typing import Dict

from lmdeploy.disagg.messages import (
MigrationBackend,
MigrationInitRequest,
MigrationTransportProtocol,
DisaggEngineConfig,
MigrationConnectionRequest,
MigrationAssignment,
MigrationRegisterMemoryRequest
)

from lmdeploy.disagg.backend.base import MigrationBackendImpl
from lmdeploy.disagg.backend.backend import register_migration_backend

from dlslime import RDMAEndpoint, available_nic


class DLSlimeMigrationManagement:
def __init__(self, init_request: MigrationInitRequest):
self.rank = init_request.rank
self.tp_rank = init_request.tp_rank
self.remote_engine_config: DisaggEngineConfig = init_request.remote_engine_config
self.endpoint: Dict[str, RDMAEndpoint] = {
MigrationTransportProtocol.TCP: None,
MigrationTransportProtocol.RDMA: None,
MigrationTransportProtocol.NVLINK: None,
}
if init_request.rdma_init_request:
if not init_request.rdma_init_request.device_name:
nics = available_nic()
init_request.rdma_init_request.device_name = nics[self.rank % len(nics)]
self.endpoint[MigrationTransportProtocol.RDMA] = RDMAEndpoint(
device_name=init_request.rdma_init_request.device_name,
ib_port=init_request.rdma_init_request.ib_port,
link_type=init_request.rdma_init_request.link_type
)

def register_memory_region(self, register_mr_request: MigrationRegisterMemoryRequest):
self.endpoint[register_mr_request.protocol].register_memory_region(
register_mr_request.mr_key,
register_mr_request.addr,
register_mr_request.length
)

def connect_to(self, connect_request: MigrationConnectionRequest):
self.endpoint[connect_request.protocol].connect_to(connect_request.remote_endpoint_info)

async def p2p_migrate(self, assignment: MigrationAssignment):
max_batch = 4096 + 2048
Review comment (@lvhan028, Apr 23, 2025): What do the two magic numbers represent? (A sketch naming the constant follows after this file's diff.)

for i in range(0, len(assignment.target_offset), max_batch):
await self.endpoint[assignment.protocol].read_batch_async(
assignment.mr_key,
assignment.target_offset[i: i+max_batch],
assignment.source_offset[i: i+max_batch],
assignment.length
)


@register_migration_backend(MigrationBackend.DLSlime)
class DLSlimeBackend(MigrationBackendImpl):
def __init__(self):
self.links: Dict[int, DLSlimeMigrationManagement] = {}

def p2p_initialize(self, init_request: MigrationInitRequest):
self.links[init_request.remote_engine_id] = DLSlimeMigrationManagement(init_request)

def register_memory_region(self, register_mr_request:MigrationRegisterMemoryRequest):
self.links[register_mr_request.remote_engine_id].register_memory_region(register_mr_request)

def endpoint_info(self, remote_engine_id: int, protocol: MigrationTransportProtocol):
return self.links[remote_engine_id].endpoint[protocol].local_endpoint_info

def p2p_connect(self, connect_request: MigrationConnectionRequest):
self.links[connect_request.remote_engine_id].connect_to(connect_request)

async def p2p_migrate(self, assignment: MigrationAssignment):
await self.links[assignment.remote_engine_id].p2p_migrate(assignment)

async def store(self, assignment: MigrationAssignment):
raise NotImplementedError

async def load(self, assignment: MigrationAssignment):
raise NotImplementedError
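
On the magic-number question above (addressed later by the commit "remove magic number and handle long assignments in dlslime"): the sum appears to cap how many block offsets are handed to a single read_batch_async call, with longer assignments split across calls. A sketch of the same loop with the bound named, where the value and its interpretation are assumptions rather than something this PR states:

```python
# Assumed per-call cap on KV-block descriptors; the diff uses 4096 + 2048.
MAX_TRANSFER_BATCH = 4096 + 2048


async def migrate_in_chunks(endpoint, assignment):
    """Stand-in for DLSlimeMigrationManagement.p2p_migrate: split a long
    assignment so no single batched RDMA read exceeds the assumed cap."""
    for i in range(0, len(assignment.target_offset), MAX_TRANSFER_BATCH):
        await endpoint.read_batch_async(
            assignment.mr_key,
            assignment.target_offset[i:i + MAX_TRANSFER_BATCH],
            assignment.source_offset[i:i + MAX_TRANSFER_BATCH],
            assignment.length,
        )
```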
35 changes: 35 additions & 0 deletions lmdeploy/disagg/backend/infinistore.py
@@ -0,0 +1,35 @@
from lmdeploy.disagg.messages import (
MigrationBackend,
MigrationInitRequest,
MigrationConnectionRequest,
MigrationAssignment,
MigrationRegisterMemoryRequest,
MigrationTransportProtocol
)

from lmdeploy.disagg.backend.backend import register_migration_backend
from lmdeploy.disagg.backend.base import MigrationBackendImpl


@register_migration_backend(MigrationBackend.InfiniStore)
class InfiniStoreBackend(MigrationBackendImpl):
def p2p_initialize(self, init_request: MigrationInitRequest):
raise NotImplementedError

def register_memory_region(self, register_mr_request:MigrationRegisterMemoryRequest):
raise NotImplementedError

def endpoint_info(self, remote_engine_id: int, protocol: MigrationTransportProtocol):
return NotImplementedError

def p2p_connect(self, connect_request: MigrationConnectionRequest):
raise NotImplementedError

async def p2p_migrate(self, assignment: MigrationAssignment):
raise NotImplementedError

async def store(self, assignment: MigrationAssignment):
raise NotImplementedError

async def load(self, assignment: MigrationAssignment):
raise NotImplementedError
35 changes: 35 additions & 0 deletions lmdeploy/disagg/backend/mooncake.py
@@ -0,0 +1,35 @@
from lmdeploy.disagg.messages import (
MigrationBackend,
MigrationInitRequest,
MigrationConnectionRequest,
MigrationAssignment,
MigrationRegisterMemoryRequest,
MigrationTransportProtocol
)

from lmdeploy.disagg.backend.backend import register_migration_backend
from lmdeploy.disagg.backend.base import MigrationBackendImpl


@register_migration_backend(MigrationBackend.Mooncake)
class MooncakeBackend(MigrationBackendImpl):
def p2p_initialize(self, init_request: MigrationInitRequest):
raise NotImplementedError

def register_memory_region(self, register_mr_request:MigrationRegisterMemoryRequest):
raise NotImplementedError

def endpoint_info(self, remote_engine_id: int, protocol: MigrationTransportProtocol):
return NotImplementedError

def p2p_connect(self, connect_request: MigrationConnectionRequest):
raise NotImplementedError

async def p2p_migrate(self, assignment: MigrationAssignment):
raise NotImplementedError

async def store(self, assignment: MigrationAssignment):
raise NotImplementedError

async def load(self, assignment: MigrationAssignment):
raise NotImplementedError