[API] [Misc]: Support LRU cache with TTL for prefix cache indexer #905
Conversation
test: add ut for lru and modify hash's ut Signed-off-by: vie-serendipity <[email protected]>
I will review it today.
Pull Request Overview
This PR implements an LRU cache with TTL to limit the size of the prefix cache and prevent memory exhaustion. Key changes include removing the manual eviction logic in the prefix cache indexer in favor of using an LRU store, updating tests to integrate with the new store interface, and introducing the LRUStore implementation and its accompanying tests in the cache package.
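For context, here is a minimal, self-contained sketch of what a generic store interface plus an LRU-with-TTL implementation can look like. The Store, LRUStore, Get, Put, and freeTable names follow the snippets quoted later in this thread; the internals (container/list, the entry fields, NewLRUStore) are illustrative assumptions, not the PR's actual code.

```go
package cache

import (
	"container/list"
	"sync"
	"time"
)

// Store mirrors the generic store interface this PR adds in
// pkg/plugins/gateway/cache/store.go; the exact method set there may differ.
type Store[K comparable, V any] interface {
	Get(key K) (V, bool)
	Put(key K, value V)
}

// LRUStore sketches capacity- plus TTL-based eviction. The freeTable field
// name comes from the diff quoted later in this thread; using container/list
// (and therefore *list.Element values) is an assumption of this sketch.
type LRUStore[K comparable, V any] struct {
	sync.Mutex
	capacity  int
	ttl       time.Duration
	freeTable map[K]*list.Element // key -> element whose Value is *entry[K, V]
	lruList   *list.List          // front = most recently used
}

type entry[K comparable, V any] struct {
	key        K
	value      V
	lastAccess time.Time
}

func NewLRUStore[K comparable, V any](capacity int, ttl time.Duration) *LRUStore[K, V] {
	return &LRUStore[K, V]{
		capacity:  capacity,
		ttl:       ttl,
		freeTable: make(map[K]*list.Element),
		lruList:   list.New(),
	}
}

func (s *LRUStore[K, V]) Get(key K) (V, bool) {
	s.Lock()
	defer s.Unlock()
	var zero V
	elem, ok := s.freeTable[key]
	if !ok {
		return zero, false
	}
	e := elem.Value.(*entry[K, V])
	if time.Since(e.lastAccess) > s.ttl { // lazily drop expired entries on access
		s.lruList.Remove(elem)
		delete(s.freeTable, key)
		return zero, false
	}
	e.lastAccess = time.Now()
	s.lruList.MoveToFront(elem)
	return e.value, true
}

func (s *LRUStore[K, V]) Put(key K, value V) {
	s.Lock()
	defer s.Unlock()
	if elem, ok := s.freeTable[key]; ok { // update existing entry in place
		e := elem.Value.(*entry[K, V])
		e.value, e.lastAccess = value, time.Now()
		s.lruList.MoveToFront(elem)
		return
	}
	if s.lruList.Len() >= s.capacity { // at capacity: evict the LRU entry
		if oldest := s.lruList.Back(); oldest != nil {
			s.lruList.Remove(oldest)
			delete(s.freeTable, oldest.Value.(*entry[K, V]).key)
		}
	}
	s.freeTable[key] = s.lruList.PushFront(&entry[K, V]{key: key, value: value, lastAccess: time.Now()})
}
```

With a shape like this, the indexer can replace its manual eviction with something like `store := NewLRUStore[uint64, Block](capacity, ttl)` and rely on the store's own expiry (the key and value types here are assumptions based on the hash.go snippets quoted below).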
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/plugins/gateway/prefixcacheindexer/indexer.go | Removed unused time import and the Evict method from the interface to align with the new LRU store-based eviction strategy. |
| pkg/plugins/gateway/prefixcacheindexer/hash_test.go | Updated tests to use the new store and replaced manual eviction calls with a sleep delay to allow TTL expiration. |
| pkg/plugins/gateway/prefixcacheindexer/hash.go | Replaced direct block map operations with store Get/Put calls to integrate the LRU store implementation. |
| pkg/plugins/gateway/cache/store.go | Added a generic Store interface. |
| pkg/plugins/gateway/cache/lru_store.go | Introduced a new LRUStore implementing cache eviction based on capacity and TTL. |
| pkg/plugins/gateway/cache/lru_store_test.go | Added unit tests for LRUStore covering basic put/get behavior, TTL expiration, updates, and concurrent evictions. |
Files not reviewed (1)
- go.mod: Language not supported
Comments suppressed due to low confidence (2)
pkg/plugins/gateway/prefixcacheindexer/hash_test.go:67
- [nitpick] Using a long sleep duration in tests can slow down the test suite. Consider using a shorter TTL in the test configuration or mocking time to simulate TTL expiration more efficiently.
time.Sleep(25 * time.Second)
pkg/plugins/gateway/prefixcacheindexer/indexer.go:29
- [nitpick] The manual Evict method has been removed from the interface and its documentation, relying on the LRUStore's internal eviction mechanism. Ensure that this new approach is adequately covered by tests and that its behavior meets the application's eviction requirements.
Evict method and its comments removed
	defaultPrefixCacheBlockSize              = 16
-	defaultPrefixCacheEvictionInternalInMS   = 50
	defaultPrefixCacheEvictionDurationInMins = 60
+	defaultPrefixCacheEvictionInternalInMS   = 1000 // 1 second
You can rename the variable to defaultPrefixCacheEvictionInternalInSec.
okay
I think we will always use seconds for this parameter. Let's change it to XXXInSec.
type LRUStore[K comparable, V any] struct {
	sync.Mutex
	freeTable map[K]*entry[K, V]
Right now the structure is as follows:

hash1 -> model1 -> pod1 + pod1 last access time
      -> model1 -> pod2 + pod2 last access time
      -> model2 -> pod3 + pod3 last access time
      -> blocks last access time

Right now eviction is based on the last block access time. I kept the eviction like this for now to keep it simple. Eventually we want to move to deleting only pod1 + its last access time rather than being restricted to deleting the entire block.
Another requirement is that one request creates 100 hashes. How do we make sure all hashes for that request are deleted together? That's how vLLM implements block eviction.
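For reference, the per-hash value behind this diagram corresponds roughly to the Block type in hash.go (quoted later in this thread). A minimal sketch, where everything beyond the modelToPods field is an assumption:

```go
package prefixcacheindexer

import "time"

// Block is the per-hash value kept in the LRU store, mirroring the layout
// sketched above: hash -> model -> pod -> last access time. modelToPods is
// taken from the hash.go snippet quoted later in this thread; the comments
// are assumptions of this sketch.
type Block struct {
	// model name -> pod name -> that pod's last access time
	modelToPods map[string]map[string]time.Time
	// Note: the block-level "last access time" used for eviction today lives
	// in the LRU store entry, not on the Block itself.
}
```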
The ask is: can you share some design ideas on how to extend it with a new implementation?
Sorry for the late reply.

> Eventually we want to move to deleting only pod1 + its last access time rather than being restricted to deleting the entire block.

I think this is reasonable.

> Another requirement is that one request creates 100 hashes. How do we make sure all hashes for that request are deleted together? That's how vLLM implements block eviction.

Just to confirm: I understand that after vLLM processes a request, it frees all of its blocks together, so you mean the router implementation should also free them together.
But I think this involves two layers, and in reality the router and the kv cache don't actually interact. Even if the lastAccessedTime values are very close, it doesn't guarantee that there is already a cache for the current request on the node. Hence, I understand that the router here implements a best-effort approach. In the current implementation, selecting the pod with the most recent access is the most likely to have the cached KV.
I think for precise routing, integrating with a distributed kv cache could be a solution, but that would come with a significant cost. What do you think? cc @Jeffwan
> Hence, I understand that the router here implements a best-effort approach.

Yes, exactly.

> I think for precise routing, integrating with a distributed kv cache could be a solution, but that would come with a significant cost.

This is not in the current scope, for a few reasons: 1. Neither vLLM nor the current distributed kv cache solution (like Vineyard, which we are using) provides interfaces to expose kv details; the distributed kv cache has a metadata server, but that is for internal communication. 2. Even if they provided a way, fetching the source of truth from the engine or the kv cache service would be costly. I agree that we can revisit this feature later.

> Another requirement is that one request creates 100 hashes. How do we make sure all hashes for that request are deleted together?

Cache hits follow the common-prefix pattern, so the earlier blocks always have a fresh "last access time". In that case, if one block is deleted, the blocks after it will be deleted for sure. That means even a plain map would support such a case out of the box? Correct me if I understand incorrectly.
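A toy illustration of that invariant, with made-up block names and timings (not PR code): a later request that matches only a prefix of the blocks refreshes exactly those blocks, so the tail blocks always expire first.

```go
package main

import (
	"fmt"
	"time"
)

// Under the common-prefix access pattern, a partial hit refreshes the leading
// blocks only, so access times are non-increasing along the chain and TTL
// eviction removes the tail before it ever removes the prefix.
func main() {
	ttl := 3 * time.Minute
	now := time.Now()
	lastAccess := map[string]time.Time{
		"block1": now.Add(-1 * time.Minute), // refreshed by the latest partial hit
		"block2": now.Add(-1 * time.Minute), // refreshed by the latest partial hit
		"block3": now.Add(-5 * time.Minute), // only touched by the original request
	}
	for _, b := range []string{"block1", "block2", "block3"} {
		expired := now.Sub(lastAccess[b]) > ttl
		fmt.Printf("%s expired=%v\n", b, expired) // block3 expires while its prefix survives
	}
}
```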
I agree with your point.
	e.Unlock()

	for _, key := range keysToEvict {
		e.Lock()
Here, I am curious whether it needs a key-level lock. Could it be at the for-loop level?
I don't quite understand your question. Let me clarify: the main reason for first traversing to collect the keys without performing any operation is to avoid holding the lock for too long and blocking the router.
There was a typo in my original message; I meant lock inside the loop vs. lock before the loop. I think the lock inside the loop helps in scenarios where there are many eviction keys and a high volume of concurrent requests modifying freeTable and lruList; in other cases, locking before the loop looks better. I do not have a strong opinion at this moment. Since a long eviction window will accumulate more keys, it looks good to me.
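To make the trade-off concrete, a rough sketch of the two options discussed here, building on the LRUStore sketch earlier in this thread; evictPerKey and evictBatch are hypothetical names, not methods from the PR.

```go
// Option A (what the quoted diff does): re-acquire the lock per key, so
// concurrent Get/Put calls can interleave between evictions at the cost of
// extra lock churn.
func (s *LRUStore[K, V]) evictPerKey(keysToEvict []K) {
	for _, key := range keysToEvict {
		s.Lock()
		if elem, ok := s.freeTable[key]; ok {
			s.lruList.Remove(elem)
			delete(s.freeTable, key)
		}
		s.Unlock()
	}
}

// Option B: hold the lock across the whole loop; cheaper for small batches,
// but readers are blocked for the entire sweep.
func (s *LRUStore[K, V]) evictBatch(keysToEvict []K) {
	s.Lock()
	defer s.Unlock()
	for _, key := range keysToEvict {
		if elem, ok := s.freeTable[key]; ok {
			s.lruList.Remove(elem)
			delete(s.freeTable, key)
		}
	}
}
```

Per-key locking bounds how long any single reader is blocked, while the batch lock minimizes total locking overhead when the eviction list is short.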
I think most of the feedback is addressed. Please double-check the remaining issues, then we should be good to go. @vie-serendipity @varungup90
	c.hash.ResetWithSeed(c.seed)

-	block, ok = c.blocks[prefixHash]
+	block, ok = c.store.Get(prefixHash)
Store is thread-safe, so we can remove the lock from line 134 now, right?
Store is thread-safe, but the hash is currently not thread-safe. The current lock scope was too broad, so I narrowed it down.
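A hypothetical sketch of that narrowed scope, with minimal stand-in types so it compiles on its own. Only ResetWithSeed, store.Get, and the mu/hash/store/seed fields come from the quoted code; Block is the type sketched earlier in this thread, and the rest is assumed.

```go
package prefixcacheindexer

import "sync"

// Minimal stand-in types; they only approximate the PR's real PrefixHashTable.
type seededHasher interface {
	ResetWithSeed(seed uint64)
	Write(p []byte) (n int, err error)
	Sum64() uint64
}

type blockStore interface {
	Get(key uint64) (Block, bool)
}

type PrefixHashTable struct {
	mu    sync.Mutex
	seed  uint64
	hash  seededHasher
	store blockStore
}

// matchOneBlock sketches the narrowed lock scope described above: mu guards
// only the shared, non-thread-safe hasher, and is released before the
// thread-safe store is queried.
func (c *PrefixHashTable) matchOneBlock(chunk []byte) (Block, bool) {
	c.mu.Lock()
	c.hash.ResetWithSeed(c.seed)
	_, _ = c.hash.Write(chunk)
	prefixHash := c.hash.Sum64()
	c.mu.Unlock() // drop the lock before the store call; the store locks internally

	return c.store.Get(prefixHash)
}
```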
Overall LGTM. Some nits; once you fix them, ping me. I will merge the change. Thanks!
@varungup90 It seems all comments are addressed. I do not have further feedback either.
// returns matchedTokens, unMatchedTokens, matchedPods
func (c *PrefixHashTable) MatchPrefix(tokens []byte, model string, pods []*v1.Pod) ([]byte, []byte, []*v1.Pod) {
	c.mu.Lock()
@vie-serendipity I assumed the LRU store was thread-safe and that removing this lock was fine. While working on PR #933 I came across a bug with concurrent reads/writes of blocks.
I have added the lock back; if you get a chance, can you make the LRU store thread-safe? Thank you.
okay, I will locate the bug and fix it.
I'm not sure whether you encountered the bug after completely removing all locks from the PrefixHashTable implementation.
If so, I found that the map obtained from the LRU store is not thread-safe and should be locked when modifying it.
aibrix/pkg/utils/prefixcacheindexer/hash.go
Lines 106 to 122 in 6c32582
	block, ok := c.store.Get(prefixHash)
	if !ok {
		block = Block{
			modelToPods: map[string]map[string]time.Time{
				model: {
					pod: time.Now(),
				},
			},
		}
	} else {
		blockPods, ok := block.modelToPods[model]
		if !ok {
			blockPods = map[string]time.Time{}
		}
		blockPods[pod] = time.Now()
		block.modelToPods[model] = blockPods
	}
In PR #905, the lock was removed only from MatchPrefix (AddPrefix kept the lock). I encountered the same issue: the block map is not cloned when it is read from the LRU store.
I propose that we make the LRU store thread-safe. There are two options (see the sketch after this list):
- The block map returned from LRUStore is cloned, so locks can be removed from the hash table implementation.
- Use a sync.Map in the LRU store plus cloning the block map in the Get function. This is preferable because it also removes the lock from LRUStore.
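A rough sketch of what option 1 could look like, reusing the Block type sketched earlier in this thread; cloneBlock is a hypothetical helper, not code from this PR.

```go
// cloneBlock deep-copies the nested modelToPods map so a Block handed out by
// Get never shares inner maps with concurrent writers.
func cloneBlock(b Block) Block {
	cloned := Block{modelToPods: make(map[string]map[string]time.Time, len(b.modelToPods))}
	for model, pods := range b.modelToPods {
		podCopy := make(map[string]time.Time, len(pods))
		for pod, lastAccess := range pods {
			podCopy[pod] = lastAccess
		}
		cloned.modelToPods[model] = podCopy
	}
	return cloned
}
```

The cost is an extra allocation per lookup, in exchange for letting the hash table (and, with option 2, the LRUStore itself) drop explicit locking around reads.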
Thanks for your explanation, I will submit a PR to fix it.
Pull Request Description
Implement an LRU cache with TTL to bound the prefix cache and prevent OOM.
Related Issues
Resolves: #892
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
- [Bug]: Corrections to existing functionality
- [CI]: Changes to build process or CI pipeline
- [Docs]: Updates or additions to documentation
- [API]: Modifications to aibrix's API or interface
- [CLI]: Changes or additions to the Command Line Interface
- [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.