
Conversation

@vie-serendipity (Contributor)

Pull Request Description

Implement an LRU cache with TTL to bound the size of the prefix cache and prevent OOM.

Related Issues

Resolves: #892
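
For illustration, here is a minimal sketch of an LRU store with TTL in Go. This is not the PR's actual implementation: the names LRUStore, entry, and freeTable mirror snippets quoted later in this thread, while the container/list-based layout, the lruList field, the constructor, and the method bodies are assumptions.

package cache

import (
	"container/list"
	"sync"
	"time"
)

type entry[K comparable, V any] struct {
	key        K
	value      V
	lastAccess time.Time
}

type LRUStore[K comparable, V any] struct {
	sync.Mutex
	capacity  int
	ttl       time.Duration
	freeTable map[K]*list.Element // key -> element whose Value is *entry[K, V]
	lruList   *list.List          // front = most recently used
}

func NewLRUStore[K comparable, V any](capacity int, ttl time.Duration) *LRUStore[K, V] {
	return &LRUStore[K, V]{
		capacity:  capacity,
		ttl:       ttl,
		freeTable: make(map[K]*list.Element),
		lruList:   list.New(),
	}
}

func (s *LRUStore[K, V]) Get(key K) (V, bool) {
	s.Lock()
	defer s.Unlock()
	var zero V
	el, ok := s.freeTable[key]
	if !ok {
		return zero, false
	}
	ent := el.Value.(*entry[K, V])
	if time.Since(ent.lastAccess) > s.ttl {
		// Lazily evict an entry whose TTL has expired.
		s.lruList.Remove(el)
		delete(s.freeTable, key)
		return zero, false
	}
	ent.lastAccess = time.Now()
	s.lruList.MoveToFront(el)
	return ent.value, true
}

func (s *LRUStore[K, V]) Put(key K, value V) {
	s.Lock()
	defer s.Unlock()
	if el, ok := s.freeTable[key]; ok {
		ent := el.Value.(*entry[K, V])
		ent.value = value
		ent.lastAccess = time.Now()
		s.lruList.MoveToFront(el)
		return
	}
	if s.lruList.Len() >= s.capacity {
		// At capacity: evict the least recently used entry first.
		if oldest := s.lruList.Back(); oldest != nil {
			s.lruList.Remove(oldest)
			delete(s.freeTable, oldest.Value.(*entry[K, V]).key)
		}
	}
	ent := &entry[K, V]{key: key, value: value, lastAccess: time.Now()}
	s.freeTable[key] = s.lruList.PushFront(ent)
}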

test: add ut for lru and modify hash' ut

Signed-off-by: vie-serendipity <[email protected]>
@varungup90 (Collaborator)

I will review it today.

@Jeffwan requested review from Copilot and varungup90 and removed the request for Copilot (March 27, 2025, 21:56)
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements an LRU cache with TTL to limit the size of the prefix cache and prevent memory exhaustion. Key changes include removing the manual eviction logic in the prefix cache indexer in favor of using an LRU store, updating tests to integrate with the new store interface, and introducing the LRUStore implementation and its accompanying tests in the cache package.

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated no comments.

Summary per file:

  • pkg/plugins/gateway/prefixcacheindexer/indexer.go: Removed the unused time import and the Evict method from the interface, aligning with the new LRU store-based eviction strategy.
  • pkg/plugins/gateway/prefixcacheindexer/hash_test.go: Updated tests to use the new store and replaced manual eviction calls with a sleep delay to allow TTL expiration.
  • pkg/plugins/gateway/prefixcacheindexer/hash.go: Replaced direct block map operations with store Get/Put calls to integrate the LRU store implementation.
  • pkg/plugins/gateway/cache/store.go: Added a generic Store interface (a sketch of a possible shape follows the file list).
  • pkg/plugins/gateway/cache/lru_store.go: Introduced a new LRUStore implementing cache eviction based on capacity and TTL.
  • pkg/plugins/gateway/cache/lru_store_test.go: Added unit tests for LRUStore covering basic put/get behavior, TTL expiration, updates, and concurrent evictions.
Files not reviewed (1)
  • go.mod: Language not supported
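
For reference, a sketch of what the generic Store interface in store.go might look like, inferred only from the Get/Put calls quoted elsewhere in this thread; the actual interface may declare additional methods.

// Store is a minimal generic key-value store contract; the LRUStore
// discussed in this PR would be one implementation.
type Store[K comparable, V any] interface {
	Get(key K) (V, bool)
	Put(key K, value V)
}
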
Comments suppressed due to low confidence (2)

pkg/plugins/gateway/prefixcacheindexer/hash_test.go:67

  • [nitpick] Using a long sleep duration in tests can slow down the test suite. Consider using a shorter TTL in the test configuration or mocking time to simulate TTL expiration more efficiently.
time.Sleep(25 * time.Second)

pkg/plugins/gateway/prefixcacheindexer/indexer.go:29

  • [nitpick] The manual Evict method has been removed from the interface and its documentation, relying on the LRUStore's internal eviction mechanism. Ensure that this new approach is adequately covered by tests and that its behavior meets the application's eviction requirements.
Evict method and its comments removed

 defaultPrefixCacheBlockSize = 16
-defaultPrefixCacheEvictionInternalInMS = 50
 defaultPrefixCacheEvictionDurationInMins = 60
+defaultPrefixCacheEvictionInternalInMS = 1000 // 1 second
Collaborator:

You can rename the variable to defaultPrefixCacheEvictionInternalInSec.

Contributor Author:

okay

Collaborator:

I think we will always use seconds for this parameter. Let's change it to XXXInSec.


type LRUStore[K comparable, V any] struct {
	sync.Mutex
	freeTable map[K]*entry[K, V]
Collaborator:

Right now the structure is as follows:

hash1 -> model1 -> pod1 + pod1 last access time
      -> model1 -> pod2 + pod2 last access time
      -> model2 -> pod3 + pod3 last access time
      -> block's last access time

Right now eviction is based on the block's last access time. I kept the eviction like this for now to keep it simple. Eventually we want to move to deleting only pod1 + its last access time rather than being restricted to deleting the entire block.

Another requirement is that one request creates on the order of 100 hashes. How do we make sure all hashes for that request are deleted together? That's how vLLM implements block eviction. (See the block-structure sketch below.)
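
For reference, a sketch of the per-hash block structure being described; the Block type and modelToPods field appear in a hash.go snippet quoted later in this thread, while the package line and comments here are illustrative.

package prefixcacheindexer

import "time"

// Block is the value stored per prefix hash: model name -> pod name ->
// that pod's last access time. Eviction currently removes a whole Block
// based on its last access time, not individual pod entries.
type Block struct {
	modelToPods map[string]map[string]time.Time
}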

Collaborator:

The ask is: can you share some design ideas on how to extend this with the new implementation?

Contributor Author:

Sorry for the late reply.

> Eventually we want to move to deleting only pod1 + its last access time rather than being restricted to deleting the entire block.

I think this is reasonable.

> Another requirement is that one request creates on the order of 100 hashes. How do we make sure all hashes for that request are deleted together? That's how vLLM implements block eviction.

Just to confirm: after vLLM finishes processing a request, it frees all of the request's blocks together, so you mean the router implementation should also free them together?

But I think this involves two layers, and in reality the router and the KV cache have no actual interaction. Even if the lastAccessedTime values are very close, that doesn't guarantee there is already a cache for the current request on the node. Hence, I understand that the router here implements a best-effort approach. In the current implementation, selecting the pod with the most recent access makes it the most likely to have the cached KV.

I think for precise routing, integrating with a distributed KV cache could be a solution, but that would come at a significant cost. What do you think? cc @Jeffwan

Collaborator:

> Hence, I understand that the router here implements a best-effort approach.

Yes, exactly.

> I think for precise routing, integrating with a distributed KV cache could be a solution, but that would come at a significant cost.

This is not in the current scope, for a few reasons. 1. Neither vLLM nor the current distributed KV cache solution (like Vineyard, which we are using) provides interfaces to expose KV details; the distributed KV cache has a metadata server, but that's for internal communication. 2. Even if they provided a way, fetching the source of truth from the engine or KV cache service would be costly. I agree that we can revisit this feature later.

> Another requirement is that one request creates on the order of 100 hashes. How do we make sure all hashes for that request are deleted together?

Cache hits follow the common-prefix pattern, so the earlier blocks always have a fresh "last access time". In that case, if one block is evicted, the blocks after it will be evicted for sure. That means even a plain map would support such a case out of the box? Correct me if I understand incorrectly.

Contributor Author:

I agree with your point.

	e.Unlock()

	for _, key := range keysToEvict {
		e.Lock()
@Jeffwan (Collaborator), Mar 28, 2025:

Here, I am curious whether it needs a key-level lock. Could it be at the for-loop level?

Contributor Author:

I don't quite understand your question. Let me clarify: the main reason for first traversing to collect the keys without performing any operation is to avoid holding the lock too long and blocking the router.

Collaborator:

There was a typo in my original message. I meant the lock inside the loop versus the lock before the loop. I think the lock inside the loop helps in scenarios where there are many eviction keys and a high volume of concurrent requests modifying freeTable and lruList; in other cases, locking before the loop looks better. I do not have a strong opinion at this moment. Since a long eviction window will accumulate more keys, this looks good to me. (A sketch of both options follows.)
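
A sketch of the two locking strategies being compared, assuming the eviction loop quoted above and the freeTable/lruList fields from the LRUStore snippet; the element-removal details are illustrative.

// Option A: lock inside the loop, as in this PR. Each eviction
// re-acquires the mutex, so concurrent Get/Put calls can interleave
// between evictions, at the cost of more lock churn.
for _, key := range keysToEvict {
	e.Lock()
	if el, ok := e.freeTable[key]; ok {
		e.lruList.Remove(el)
		delete(e.freeTable, key)
	}
	e.Unlock()
}

// Option B: lock before the loop. One acquisition, but readers and
// writers are blocked for the entire eviction pass.
e.Lock()
for _, key := range keysToEvict {
	if el, ok := e.freeTable[key]; ok {
		e.lruList.Remove(el)
		delete(e.freeTable, key)
	}
}
e.Unlock()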

@Jeffwan (Collaborator) commented Mar 31, 2025:

I think most feedback is addressed. Please double-check the remaining issues, then we should be good to go. @vie-serendipity @varungup90

 c.hash.ResetWithSeed(c.seed)

-block, ok = c.blocks[prefixHash]
+block, ok = c.store.Get(prefixHash)
Collaborator:

The store is thread-safe, so we can remove the lock from line 134 now, right?

Contributor Author:

The store is thread-safe, but the hash is currently not. The current lock scope was too broad, so I narrowed it down. (A sketch follows.)
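
A sketch of what the narrowed scope might look like, assuming the stateful hasher from the snippet above; ResetWithSeed is quoted from this PR, while the Write/Sum64 calls are assumptions following the standard hash.Hash64 interface.

// Only the non-thread-safe hasher needs the mutex; the store
// synchronizes internally, so Get happens outside the lock.
c.mu.Lock()
c.hash.ResetWithSeed(c.seed)
_, _ = c.hash.Write(tokens)
prefixHash := c.hash.Sum64()
c.mu.Unlock()

block, ok := c.store.Get(prefixHash)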

@varungup90 (Collaborator)

Overall LGTM. Some nits; once you fix them, ping me and I will merge the change. Thanks!

Signed-off-by: vie-serendipity <[email protected]>
@Jeffwan (Collaborator) commented Apr 1, 2025:

@varungup90 It seems all comments are addressed. I do not have further feedback either.

varungup90 merged commit 0d998e1 into vllm-project:main on Apr 1, 2025. 11 checks passed.

// returns matchedTokens, unMatchedTokens, matchedPods
func (c *PrefixHashTable) MatchPrefix(tokens []byte, model string, pods []*v1.Pod) ([]byte, []byte, []*v1.Pod) {
	c.mu.Lock()
Collaborator:

@vie-serendipity I assumed the LRU store was thread-safe and that removing this lock was fine. While working on PR #933, I came across a concurrent read/write bug on the blocks map.

I have added the lock back. If you get a chance, can you make the LRU store thread-safe? Thank you.

Contributor Author:

okay, I will locate the bug and fix it.

Contributor Author:

I'm uncertain whether you encountered the bug after completely removing all locks from the PrefixHashTable implementation. If so, I found that the map obtained from the LRU store is not thread-safe and should be locked when modifying it:

block, ok := c.store.Get(prefixHash)
if !ok {
	block = Block{
		modelToPods: map[string]map[string]time.Time{
			model: {
				pod: time.Now(),
			},
		},
	}
} else {
	blockPods, ok := block.modelToPods[model]
	if !ok {
		blockPods = map[string]time.Time{}
	}
	blockPods[pod] = time.Now()
	block.modelToPods[model] = blockPods
}

Collaborator:

In PR #905, the lock was removed only from MatchPrefix (AddPrefix kept its lock). I encountered the same issue: the block map is not cloned when read from the LRU store.

I propose that we make the LRU store thread-safe. There are two options (see the sketch after this list):

  1. Clone the block map returned from LRUStore, so that locks can be removed from the hash table implementation.
  2. Use a sync.Map in the LRU store plus clone the block map in the Get function. This is preferable because it removes the lock from LRUStore as well.
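
A sketch of the cloning half of both options, assuming the Block type from the snippet above: deep-copy the nested map on Get so callers receive a private copy and cannot race against the store's own writers. The helper name cloneBlock is illustrative.

// cloneBlock deep-copies modelToPods so the caller can read and mutate
// the returned Block without holding the store's lock.
func cloneBlock(b Block) Block {
	cloned := Block{
		modelToPods: make(map[string]map[string]time.Time, len(b.modelToPods)),
	}
	for model, pods := range b.modelToPods {
		podsCopy := make(map[string]time.Time, len(pods))
		for pod, t := range pods {
			podsCopy[pod] = t
		}
		cloned.modelToPods[model] = podsCopy
	}
	return cloned
}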

Contributor Author:

Thanks for your explanation; I will submit a PR to fix it.

gangmuk pushed a commit to gangmuk/aibrix-gangmuk that referenced this pull request Jun 21, 2025
…lm-project#905)

* feat: add lru cache to support prefix cache indexer

Signed-off-by: vie-serendipity <[email protected]>
Yaegaki1Erika pushed a commit to Yaegaki1Erika/aibrix that referenced this pull request Jul 23, 2025
…lm-project#905)

* feat: add lru cache to support prefix cache indexer

Signed-off-by: vie-serendipity <[email protected]>