[PD] Handle P/D failure and reconnect without affecting other instances #6263

Merged
merged 63 commits into from
May 27, 2025
63 commits
4a2e09a
[PD] Handle prefill failure and reconnect without affecting decode in…
ShangmingCai May 13, 2025
0daefa4
minor fix
ShangmingCai May 13, 2025
d47962b
tmp diable pd tests
ShangmingCai May 13, 2025
6ada537
tmp fix dependency
ShangmingCai May 13, 2025
358a1e2
tmp
ShangmingCai May 13, 2025
d5fa24f
more
ShangmingCai May 13, 2025
3ce36ad
more
ShangmingCai May 13, 2025
f616417
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 14, 2025
101e3fd
fix
ShangmingCai May 14, 2025
18cf989
Merge branch 'handle_prefill_failure' of github.com:kvcache-ai/sglang…
ShangmingCai May 14, 2025
bbfe2ef
fix lint
ShangmingCai May 14, 2025
8c08ba3
test skip device 0-4
ShangmingCai May 14, 2025
d7c0fd6
revert test
ShangmingCai May 14, 2025
4d9cf49
tmp fix env
ShangmingCai May 14, 2025
31c645d
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 14, 2025
37852f5
revert change env
ShangmingCai May 15, 2025
3263b0d
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 15, 2025
c442578
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 15, 2025
df229eb
Not throwing exception when transfer fail to improve availability
ShangmingCai May 19, 2025
4c89b74
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 19, 2025
7ec6530
tmp fix ci
ShangmingCai May 19, 2025
4262964
revert script changes
ShangmingCai May 21, 2025
7e11533
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 21, 2025
48196fc
Add addr tracker and fix request status and release resource when it …
ShangmingCai May 21, 2025
82a449c
Optimize tracker init location
ShangmingCai May 21, 2025
182d31c
revert prefill/decode changes
ShangmingCai May 22, 2025
fe28341
fix
ShangmingCai May 22, 2025
5a10457
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 22, 2025
dea956e
Add failure records
ShangmingCai May 22, 2025
c14437f
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 22, 2025
1802bb4
minor
ShangmingCai May 22, 2025
a7e7c6b
Fix memory leak
ShangmingCai May 22, 2025
adbda95
add timeout
ShangmingCai May 22, 2025
003676a
revert and add failure_exception impl
ShangmingCai May 23, 2025
2260735
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 23, 2025
8ad77e3
minor
ShangmingCai May 23, 2025
50964e2
typo
ShangmingCai May 23, 2025
a8af131
revert decode changes
ShangmingCai May 23, 2025
e1845de
Add env var for heartbeat
ShangmingCai May 23, 2025
c6fece7
optimize error log
ShangmingCai May 23, 2025
a9f5c70
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 23, 2025
f708c4c
Add log level for bootstrap server
ShangmingCai May 23, 2025
13fc57b
add debug info
ShangmingCai May 23, 2025
88104fb
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 23, 2025
434300f
add heartbeat session
ShangmingCai May 23, 2025
730604a
Add session retry max
ShangmingCai May 23, 2025
a1d1c89
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 23, 2025
2f2cf9a
fix timeout
ShangmingCai May 23, 2025
4119339
remove timeout hack
ShangmingCai May 24, 2025
3228372
optimize log and reduce failure threshold
ShangmingCai May 24, 2025
71299f3
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 24, 2025
3f347a9
Fix potential error caused by clear
ShangmingCai May 24, 2025
39a81a4
add note
ShangmingCai May 24, 2025
263b814
Update timeout mechanism
ShangmingCai May 25, 2025
3c40bb0
reduce heartbeat interval
ShangmingCai May 25, 2025
d7ebdb4
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 25, 2025
b49f835
revert timeout threshold to 1
ShangmingCai May 25, 2025
6fee91e
minor fix for logs
ShangmingCai May 26, 2025
fec544d
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 26, 2025
0d36dba
remove indice length assert to prevent transfer thread fails
ShangmingCai May 26, 2025
9d7cc6e
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 26, 2025
51db6a6
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 26, 2025
9d806e2
Merge branch 'main' into handle_prefill_failure
ShangmingCai May 27, 2025
5 changes: 3 additions & 2 deletions python/sglang/srt/disaggregation/decode.py
@@ -361,7 +361,7 @@ def pop_transferred(self) -> List[DecodeRequest]:
         indices_to_remove = set()
         for i, (decode_req, poll) in enumerate(zip(self.queue, polls)):
             if poll == KVPoll.Failed:
-                error_message = f"Decode transfer failed for request {decode_req.req.rid=} {decode_req.req.bootstrap_room=}"
+                error_message = f"Decode transfer failed for request rank={self.scheduler.tp_rank} {decode_req.req.rid=} {decode_req.req.bootstrap_room=}"
                 try:
                     decode_req.kv_receiver.failure_exception()
                 except Exception as e:
@@ -409,7 +409,8 @@ def pop_transferred(self) -> List[DecodeRequest]:
                             : decode_req.req.top_logprobs_num
                         ].tolist()
                     )
-
+                if hasattr(decode_req.kv_receiver, "clear"):
+                    decode_req.kv_receiver.clear()
                 transferred_reqs.append(decode_req.req)
                 indices_to_remove.add(i)
             elif poll in [
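
The two hunks above follow one pattern: tag a failed transfer with the rank that saw it, and release receiver-side state after a successful transfer only if the receiver offers a `clear` method. The sketch below shows that pattern in isolation. It is a simplified stand-in, not the sglang code: `KVPoll` here is a local enum, `BaseReceiver`, `ClearableReceiver`, and the tuple-based queue are hypothetical, and appending the exception text to the message is an assumption, since the body of the `except` branch is not visible in the diff.

```python
from enum import Enum, auto
from typing import List, Tuple


class KVPoll(Enum):
    # Local stand-in for the poll states named in the diff (assumption).
    Failed = auto()
    Success = auto()
    Pending = auto()


class BaseReceiver:
    """Hypothetical receiver without a clear() method."""

    def __init__(self, state: KVPoll) -> None:
        self.state = state

    def poll(self) -> KVPoll:
        return self.state

    def failure_exception(self) -> None:
        # Mirrors the method name in the diff; the message is made up.
        raise RuntimeError("simulated transfer failure")


class ClearableReceiver(BaseReceiver):
    """Hypothetical receiver that holds state worth releasing."""

    def __init__(self, state: KVPoll) -> None:
        super().__init__(state)
        self.cleared = False

    def clear(self) -> None:
        self.cleared = True


def pop_transferred(
    queue: List[Tuple[str, BaseReceiver]], tp_rank: int
) -> Tuple[List[str], List[str], List[Tuple[str, BaseReceiver]]]:
    """Split the queue into finished request ids, error messages, and pending items."""
    done: List[str] = []
    errors: List[str] = []
    remaining: List[Tuple[str, BaseReceiver]] = []
    for rid, receiver in queue:
        poll = receiver.poll()
        if poll == KVPoll.Failed:
            # Include the rank so logs from several ranks can be told apart.
            message = f"Decode transfer failed for request rank={tp_rank} rid={rid}"
            try:
                receiver.failure_exception()
            except Exception as e:
                # Assumption: fold the underlying cause into the message.
                message += f" with exception {e}"
            errors.append(message)
        elif poll == KVPoll.Success:
            # Not every receiver backend defines clear(), so probe first.
            if hasattr(receiver, "clear"):
                receiver.clear()
            done.append(rid)
        else:
            remaining.append((rid, receiver))
    return done, errors, remaining
```

Probing with `hasattr` keeps the call site compatible with receiver backends that never added `clear`, at the cost of silently skipping cleanup if the method is later renamed.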