Hybrid recurrent cache #13979

gabe-l-hart · 2025-06-02T19:50:36Z

This is a re-opened version of #13904 after #13746 was merged

Description

This PR introduces the llama_kv_cache_hybrid_recurrent cache implementation. It follows the pattern of llama_kv_cache_unified_iswa by holding two child cache instances and implementing the interface logic such that it manages both correctly for the appropriate layers.

Changes

The main change in this PR is the addition of llama_kv_cache_hybrid_recurrent in llama-kv-cache-hybrid-recurrent.*. In addition to this, the PR does the following:

Add the llama_model_is_hybrid_recurrent public API (akin to llama_model_is_recurrent)
Add LLM_KV_ATTENTION_LAYER_INDICES as an hparam to hold the indices of the layers that should use attention (versus recurrent)
- This part is not well aligned with iswa, but that mechanism also isn't particularly extensible. It might be more appropriate to have a generic mechanism for indicating the type of caching to use for each layer, but that would start to approach the generic hybrid implementation that I originally attempted which ended up being too abstract (feat: Hybrid unified/recurrent cache #13276).
Abstracting utilities in llm_graph_context that need a specific type of cache to use getters (get_state_unified / get_state_recurrent) that will properly handle llama_kv_cache_hybrid_recurrent
Make n_embd_k_s / n_embd_v_s layer-dependent and use layer indices when calling them in the existing cache implementations
Add layer filtering to llama_kv_cache_recurrent
Updates the logic in llama_model::create_memory to use llm_arch_is_recurrent and llm_arch_is_hybrid_recurrent rather than relying on adding models to the switch statement which was redundant with the implementation of these functions

edits by @ggerganov below:

Notes so I don't forget

Naming update for next PRs (Hybrid recurrent cache #13979 (review))

ggerganov · 2025-06-03T06:14:19Z

src/llama-graph.cpp

+const llama_kv_cache_unified_state * llm_graph_context::get_state_unified() const {
+    const auto * umstate = dynamic_cast<const llama_kv_cache_unified_state *>(mstate);
+    if (!umstate) {
+        const auto hmstate = dynamic_cast<const llama_kv_cache_hybrid_recurrent_state *>(mstate);
+        if (hmstate) {
+            umstate = hmstate->get_state_attn();
+        }
+    }
+    GGML_ASSERT(umstate);
+    return umstate;
+}
+
+const llama_kv_cache_recurrent_state * llm_graph_context::get_state_recurrent() const {
+    const auto * rmstate = dynamic_cast<const llama_kv_cache_recurrent_state *>(mstate);
+    if (!rmstate) {
+        const auto hmstate = dynamic_cast<const llama_kv_cache_hybrid_recurrent_state *>(mstate);
+        if (hmstate) {
+            rmstate = hmstate->get_state_recurrent();
+        }
+    }
+    GGML_ASSERT(rmstate);
+    return rmstate;
+}
+


These dynamic casts should not be necessary. Instead you need a new llm_graph_context::build_attn_inp_kv_hybrid_recurrent() method, similar to llm_graph_context::build_attn_inp_kv_unified_iswa().

I'm working through this now and a couple of questions are coming up:

Would it be best to combine build_inp_s_copy with build_attn_inp_kv for hybrid so that models call just one "build inputs" function, or keep them separate for simplicity?

For the build_attn methods, each has a corresponding llm_graph_input_attn_* class. The build_inp_s_* methods don't have this pattern which would make this a bit harder to have code reuse. Are there plans to refactor that further @compilade?

In the mamba2 branch, s_mask seems to be totally removed. I'd prefer not to do all of the boilerplate for duplicating build_inp_s_mask for the hybrid recurrent case if that's definitely going to be going away. Is there any reason that might stick around past the merge of mamba2?

Answering out of order, but it should still make sense:

2. For the build_attn methods, each has a corresponding llm_graph_input_attn_* class. The build_inp_s_* methods don't have this pattern

They do follow this pattern, see

llama.cpp/src/llama-graph.h

Line 190 in 7e00e60

class llm_graph_input_s_copy : public llm_graph_input_i {

(this is on the current master)

I think you might mean the build_attn_* methods also return instances of llm_graph_input_attn_*?
That seems to be directly related to llm_graph_context::build_attn() which has multiple implementations which differ by the type of the first argument (e.g. for llm_graph_input_attn_kv_unified, llm_graph_input_attn_no_cache, etc.)

Are there plans to refactor that further @compilade?

Not really, outside of removing s_mask (and related functions and classes) as part of #13834.

Would it be best to combine build_inp_s_copy with build_attn_inp_kv for hybrid so that models call just one "build inputs" function, or keep them separate for simplicity?

Personally, I think it would be simpler to keep them separate, because they are fundamentally different (one is intended to be used by build_copy_mask_state (renamed to build_recurrent_state in #13834), while the other is used by build_attn), and they are pretty much independent, even in hybrid models (at least for Jamba, the recurrent and self-attention layers are mostly independent on that front).

I don't see how build_attn would ever need s_copy.

build_inp_s_copy and build_inp_attn_kv_* are called once at the beginning of the graph, while build_attn and build_recurrent_state are called once per layer (where applicable, and so usually different layers for both).

3. Is there any reason [s_mask] might stick around past the merge of mamba2?

No reason to keep it, s_mask will be removed. Its functionality is redundant with s_copy, and otherwise prevents minimizing unnecessary state copies. It was used to clear the states, but the same can be done through inp_s_copy and clearing by copying a zero-ed state (which is the rs_z'th state in the mamba2 branch (and #13834)).

Thanks, that's super helpful! I was missing the distinction between build_attn_inp and build_attn which makes perfect sense.

Personally, I think it would be simpler to keep them separate

I agree on my personal gut feeling, so I'll go with this and see how it feels once complete.

Ok, I think this feels a lot cleaner now. For build_inp_s_copy, I opted to add an optional parameter so that the caller can take ownership of casting the cache state rather than duplicating the function into build_inp_s_copy_hybrid. That felt a little cleaner w.r.t. code reuse, but I'm happy to do a separate method if that's preferred.

It looks like there's one more place that will need changing in build_copy_mask_state (renamed to build_recurrent_state on mamba2). Similar to build_inp_s_copy, I think the cleanest way to do this for code reuse is to add an optional parameter that, if unset, will use the current logic of casting mstate.

compilade · 2025-06-02T20:07:00Z

src/llama-kv-cache-hybrid-recurrent.cpp

+        // TODO: will the recurrent cache be in an undefined state at this point?
+        LLAMA_LOG_ERROR("%s: failed to prepare recurrent ubatches\n", __func__);


Yes, but that will be fixed in #13834

(Noting here in case this gets merged first so that I don't forget to update the comment)

Should I strip out this TODO at this point?

src/llama-kv-cache-hybrid-recurrent.cpp

gabe-l-hart · 2025-06-09T17:02:28Z

@ggerganov I've noticed that the Server tests are consistently failing on this branch, but I see them passing on other PRs (eg #14081). The failures seem to be around retried connections to /slots/0?action=restore (here). I've noticed a couple of things when trying to repro locally:

For all of the server tests, despite the health check ping loop, I need to manually inject a sleep after server.start() to get any of the tests to pass on my M3 Max 64GB, otherwise the first endpoint call will almost always return an unexpected result (invalid json, or 5xx error). To me, this speaks to the server reporting /health success before it's logically "ready." This is probably correct in that the server is healthy enough to take requests, but the tests are treating this as a readiness check which would indicate that the server is fully initialized which seems incorrect (similar to liveness vs readiness in kubernetes). I wasn't able to find any open issues related to this, so I'd be curious if there's a discussion somewhere of readiness vs liveness?
Once I do insert the manual sleep in test_slot_save_restore, the test passes locally for me. This speaks to some difference between my local system and the GH Action runners. I'm not familiar enough with the server code to know whether this would be caused by some kind of memory constraint, disk speed constraint, etc, but my guess would be something like slot1.bin not being fully written somehow. This wouldn't explain why it passes on other branches though. Any insight on the nature of these tests would be much appreciated as I try to debug!

This is an attempt to handle race conditions between /health returning OK and the other endpoints not returning successfully. ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

gabe-l-hart · 2025-06-09T17:38:16Z

I've tried adding retry logic for all requests in 39a93b3 to work around the race between /health, but I'm not sure if this is just going to mask an underling issue.

ggerganov · 2025-06-09T19:38:31Z

The changes to the server tests should not be needed. Let's revert the commit for now and I'll investigate.

gabe-l-hart · 2025-06-09T20:14:12Z

@ggerganov Thanks, it looks like those changes didn't fix the failures anyway, so definitely not the right fix. I've reset them out and can open an issue with details of what I see locally

gabe-l-hart · 2025-06-09T20:56:10Z

Issue for follow up on /health race condition with tests: #14092

gabe-l-hart · 2025-06-10T21:23:27Z

I've rebased on #13834. Drafting for now until it's merged

gabe-l-hart · 2025-06-10T22:22:35Z

That was quick! Undrafting now that #13834 is merged

src/llama-graph.h

ggml-ci

ggerganov · 2025-06-18T06:52:10Z

I pushed some recommendations in gabe-l-hart#2. After merging gabe-l-hart#2 into this branch, we can proceed to merge into master. This will simplify the adoption of the new ubatch logic from #14217.

However, note that before we proceed with the rest of the changes that build on top of this, such as Mamba 2, Granite and Falcon, we have to fix the naming of the classes and the variables as I have described in the TODOs in gabe-l-hart#2. We also have to make the recurrent cache to not maintain any compute-related state similar to how the state of the unified cache (i.e. head, n_kv, etc.) was moved to the llama_kv_cache_unified_state class.

The main thing about the naming is that we have to avoid associating the recurrent state/cache with the "KV cache". For example n_kv for the recurrent state does not make sense - there are no keys and values in it. It was called like this because we (ab)used the KV cache to store the recurrent state, but now these are separated and we have to update the code to reflect that. Fixing the naming is important because currently it's difficult to understand the patterns since the terms are conflated.

compilade · 2025-06-18T07:10:25Z

For example n_kv for the recurrent state does not make sense - there are no keys and values in it.

I suggest n_rs, which was used in #7531 exactly like n_kv is currently used for the recurrent cache on master.

But it mostly represents the number of modified recurrent states (which may be larger than ubatch.n_seqs when the states have to be moved around to make the slots contiguous).

gabe-l-hart · 2025-06-18T13:17:54Z

Thanks @ggerganov! I'm looking over those changes now and will merge them shortly once my head is fully wrapped around them. I'm definitely open to tackling the rename from *kv* to *rs* in this PR if you'd prefer. I see you started some of that in your changes with kv_cache_hybrid_recurrent -> mem_hybrid, so I'm fine with either approach.

recurrent : rework graph inputs + add TODOs

Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

gabe-l-hart · 2025-06-18T13:47:35Z

@ggerganov @compilade With Georgei's PR, I think the last open question is whether to tackle the rename now. Thoughts / preferences? I think it would touch a bunch more files, but might be nice to get it all done at once.

gabe-l-hart · 2025-06-18T13:58:55Z

If we want to tackle the rename now, I think it would look like:

Change file names from llama-kv-cache-[hybrid-]recurrent.* to llama-memory-[hybrid | recurrent].*
Rename llama_memory_i and llama_memory_state_i implementations accordingly
Rename all variables with kv in recurrent to rs
Rename all variables with kv in hybrid to mem

I'll take a whack at it locally and see how invasive it feels

gabe-l-hart · 2025-06-18T14:08:22Z

In the rename, one thing this would open up is renaming llama_kv_cache_unified -> llama_memory_kv_cache and llama_kv_cache_unified_iswa -> llama_memory_kv_cache_iswa for consistency with other llama_memory_* classes. Thoughts?

…e kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

gabe-l-hart · 2025-06-18T14:22:31Z

Another one as I'm working through it: in hparams we've got n_embd_[k|v]_s which I think would become n_embd_[conv|ssm]?

compilade · 2025-06-18T14:35:33Z

Another one as I'm working through it: in hparams we've got n_embd_[k|v]_s which I think would become n_embd_[conv|ssm]?

Or maybe n_embd_[r|s] (since RWKV doesn't really call its states conv or ssm, but does use these functions)

gabe-l-hart · 2025-06-18T14:42:00Z

Or maybe n_embd_[r|s] (since RWKV doesn't really call its states conv or ssm, but does use these functions)

Ah, interesting. I was basing off of how the mamba models use get_[k|v]_l which is probably not sufficiently robust.

ggerganov · 2025-06-18T14:54:57Z

In the rename, one thing this would open up is renaming llama_kv_cache_unified -> llama_memory_kv_cache and llama_kv_cache_unified_iswa -> llama_memory_kv_cache_iswa for consistency with other llama_memory_* classes. Thoughts?

Technically llama_memory_kv_cache_unified... would be correct, but we also want to keep things short, so we can skip the "memory" part because we know that the KV cache is a type of a memory module. In the case of llama_memory_hybrid, we cannot use simply llama_hybrid because we no longer have the context that this is a memory module.

The "unified" suffix in llama_kv_cache_unified is used instead of simply llama_kv_cache because I am envisioning to introduce a new KV cache that does not utilize the "unified KV cells" mechanism. Not sure about the name yet, but it would be something like llama_kv_cache_single. That is unless I figure out a way to implement this cleanly in the existing implementation - in that case we can rename it to simply llama_kv_cache.

gabe-l-hart · 2025-06-18T14:56:33Z

Ok, understood on both counts. I'm working through the kv -> rs rename and will hold off touching any of the naming for the actual kv classes

Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

gabe-l-hart · 2025-06-18T15:12:30Z

Ok, renaming complete! I think it's a lot cleaner now, but would definitely love the extra eyes on anywhere I might have missed.

ggerganov

This is good to merge. Thanks for the patience and catching up with all the changes on master 👍

src/llama-hparams.h

src/llama-arch.h

src/llama-memory-hybrid.cpp

gabe-l-hart · 2025-06-18T16:13:37Z

This is good to merge. Thanks for the patience and catching up with all the changes on master 👍

Thank you! I learned a ton along the way. I'll make the suggested change from recurrent -> recr and kick CI one more time.

It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

gabe-l-hart · 2025-06-18T16:24:37Z

Ok, renames/refactors done. 🤞 on Ci!

gabe-l-hart · 2025-06-18T17:39:31Z

Successfully rebased Granite 4 and verified that things are working as implemented here, including the refactor of the build_inp_mem_hybrid portions.

src/llama-memory-recurrent.h

Co-authored-by: Georgi Gerganov <[email protected]>

compilade · 2025-06-19T06:31:29Z

src/llama-graph.h

+    // TODO: move this implementation to llama_memory_recurrent.
+    //       this is analogous to llama_kv_cache_unified::cpy_k / cpy_v
+    //       when moving, avoid passing `ggml_cgraph` - only pass `ggml_context`. would likely need to split the
+    //         implementation in 2 separate methods. the goal is to avoid calling `ggml_build_forward_expand` in
+    //         `llama_memory_recurrent`


the goal is to avoid calling ggml_build_forward_expand

I'm not sure that would be convenient; build_rs when used with avoid_copies pretty much has to have a way to order the writes.

Splitting the function in two would require handling states_extra at call sites, while for now it's transparently handled within build_rs.

I see. It's ok to pass the ggml_cgraph as argument in that case.

gabe-l-hart mentioned this pull request Jun 2, 2025

Hybrid recurrent cache #13904

Closed

1 task

ggerganov reviewed Jun 3, 2025

View reviewed changes

compilade reviewed Jun 3, 2025

View reviewed changes

gabe-l-hart force-pushed the HybridRecurrentCache branch 7 times, most recently from ab918bb to 60ca3ba Compare June 9, 2025 16:06

gabe-l-hart requested a review from ngxson as a code owner June 9, 2025 17:36

github-actions bot added examples python python script changes server labels Jun 9, 2025

gabe-l-hart force-pushed the HybridRecurrentCache branch from 39a93b3 to 60ca3ba Compare June 9, 2025 20:13

gabe-l-hart mentioned this pull request Jun 9, 2025

Misc. bug: Server tests /health race conditions #14092

Open

gabe-l-hart force-pushed the HybridRecurrentCache branch 2 times, most recently from 7958d84 to 3669876 Compare June 10, 2025 21:22

gabe-l-hart marked this pull request as draft June 10, 2025 21:22

gabe-l-hart force-pushed the HybridRecurrentCache branch from 3669876 to 8c59841 Compare June 10, 2025 22:22

gabe-l-hart marked this pull request as ready for review June 10, 2025 22:22

gabe-l-hart commented Jun 10, 2025

View reviewed changes

src/llama-graph.h Outdated Show resolved Hide resolved

gabe-l-hart force-pushed the HybridRecurrentCache branch from 95b6698 to faf4119 Compare June 17, 2025 20:54

recurrent : rework graph inputs + add TODOs

59fee24

ggml-ci

ggerganov mentioned this pull request Jun 18, 2025

ubatch : new splitting logic #14217

Merged

1 task

gabe-l-hart and others added 2 commits June 18, 2025 08:27

Merge pull request #2 from ggml-org/gabe-l-hart/HybridRecurrentCache

c80e68c

recurrent : rework graph inputs + add TODOs

refactor: Make status and child states const in hybrid and iswa

8488f5e

Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

ggerganov approved these changes Jun 18, 2025

View reviewed changes

src/llama-hparams.h Outdated Show resolved Hide resolved

src/llama-arch.h Outdated Show resolved Hide resolved

src/llama-memory-hybrid.cpp Outdated Show resolved Hide resolved

src/llama-memory-hybrid.cpp Outdated Show resolved Hide resolved

gabe-l-hart added 3 commits June 18, 2025 10:19

refacor: _recurrent -> _recr for brevity

6403f19

It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

style: Fix spacing for ref

d0565e8

Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

refactor: recurrent_layer() -> is_recurrent()

35c0233

Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>

ggerganov reviewed Jun 18, 2025

View reviewed changes

src/llama-memory-recurrent.h Outdated Show resolved Hide resolved

style: Fix spacing for size_s_bytes declaration

304f86e

Co-authored-by: Georgi Gerganov <[email protected]>

ggerganov merged commit edc4a29 into ggml-org:master Jun 19, 2025
47 checks passed

compilade reviewed Jun 19, 2025

View reviewed changes

		// TODO: will the recurrent cache be in an undefined state at this point?
		LLAMA_LOG_ERROR("%s: failed to prepare recurrent ubatches\n", __func__);

Hybrid recurrent cache #13979

Hybrid recurrent cache #13979

Uh oh!

Conversation

gabe-l-hart commented Jun 2, 2025 • edited by ggerganov Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

compilade Jun 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gabe-l-hart commented Jun 9, 2025

Uh oh!

gabe-l-hart commented Jun 9, 2025

Uh oh!

ggerganov commented Jun 9, 2025

Uh oh!

gabe-l-hart commented Jun 9, 2025

Uh oh!

gabe-l-hart commented Jun 9, 2025

Uh oh!

gabe-l-hart commented Jun 10, 2025

Uh oh!

gabe-l-hart commented Jun 10, 2025

Uh oh!

Uh oh!

ggerganov commented Jun 18, 2025

Uh oh!

compilade commented Jun 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

compilade commented Jun 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

ggerganov commented Jun 18, 2025

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

ggerganov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

gabe-l-hart commented Jun 18, 2025

Uh oh!

gabe-l-hart commented Jun 2, 2025 •

edited by ggerganov

Loading

compilade Jun 4, 2025 •

edited

Loading

compilade commented Jun 18, 2025 •

edited

Loading

compilade commented Jun 18, 2025 •

edited

Loading

compilade Jun 19, 2025 •

edited

Loading