Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
In the load_lora_weight_to_buffer function, we zero out A_buffer when uid == None (code reference) to prevent leftover weights from previously evicted LoRA adapters from interfering with subsequent computations.
However, I suspect we should do the same even when uid != None, because in theory different adapters can target different modules (e.g., some adapters do not target k_proj). Our code might not handle this case correctly. For example, suppose we have two adapters: lora1 targets k_proj and lora2 does not. If lora2 reuses the memory buffer slot left by lora1 after its eviction, lora1's k_proj weights would remain in the buffer and could contaminate lora2's computation (illustrated in the sketch below). I discussed this with @Fridge003 and @Qiaolin-Yu offline and they share the same suspicion.
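To make the suspected failure mode concrete, here is a minimal, self-contained sketch. The buffer layout, module ordering, and tensor shapes are hypothetical and not the actual sglang code; it only shows how a stale k_proj slice from an evicted adapter could leak into the forward pass of a new adapter that does not target k_proj:

```python
import torch

rank, in_dim = 8, 16
# Hypothetical buffer slot holding LoRA-A weights for [q_proj, k_proj, v_proj].
A_buffer = torch.zeros(3, rank, in_dim)

# lora1 targets q_proj and k_proj: both slices are written.
A_buffer[0] = torch.randn(rank, in_dim)  # q_proj (lora1)
A_buffer[1] = torch.randn(rank, in_dim)  # k_proj (lora1)

# lora1 is evicted and lora2 reuses the same slot. lora2 only targets q_proj,
# so only the q_proj slice is overwritten; the k_proj slice still holds lora1's weights.
A_buffer[0] = torch.randn(rank, in_dim)  # q_proj (lora2)

# The k_proj LoRA contribution for lora2 should be zero, but it is not:
x = torch.randn(in_dim)
k_proj_delta = A_buffer[1] @ x           # stale lora1 weights leak into lora2's forward pass
print(k_proj_delta.abs().max())          # > 0, i.e., contamination
```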
As this is a rare corner case, I have not had a chance to construct a test to verify it. I am creating this issue to track this potential bug. We need to:
- verify: construct a test case to reproduce the issue, e.g., set max-loras-per-batch = 1 but load 2 adapters with different target modules.
- fix: always zero out the buffer during GPU buffer eviction (a rough sketch follows this list).
- benchmark: measure the perf overhead introduced by the zero-out operation.
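For the fix, one possible shape is to zero the destination slot unconditionally before copying the new adapter's weights, so any module the new adapter does not target reads zeros. The function signature, MODULE_INDEX mapping, and argument names below are illustrative assumptions, not the real load_lora_weight_to_buffer interface:

```python
import torch
from typing import Dict, Optional

# Hypothetical module -> slice mapping within a buffer slot.
MODULE_INDEX = {"q_proj": 0, "k_proj": 1, "v_proj": 2}

def load_lora_weight_to_buffer(
    uid: Optional[str],
    buffer_slot: torch.Tensor,                              # e.g., one slot of A_buffer
    adapter_weights: Optional[Dict[str, torch.Tensor]] = None,
) -> None:
    # Always clear the slot first (previously this happened only when uid is None).
    buffer_slot.zero_()
    if uid is None or adapter_weights is None:
        return
    # Copy only the modules this adapter actually targets; untouched modules stay zero.
    for module_name, weight in adapter_weights.items():
        buffer_slot[MODULE_INDEX[module_name]].copy_(weight)
```

The unconditional zero_() touches the whole slot on every load, which is exactly the overhead the benchmark item above is meant to quantify.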
Reproduction
See first comment.
Environment
The bug is environment-agnostic.