Add PT compileable support for flash_attn_with_kvcache #1592

Open · jataylo wants to merge 1 commit into main

Conversation

jataylo commented Apr 14, 2025

Continues #1139, adding a custom op for flash_attn_with_kvcache.
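
For context, the general shape of such a wrapper with the torch.library.custom_op API (available from PyTorch 2.4) looks roughly like this; the op name, reduced signature, and fake impl below are illustrative, not the PR's actual code:

```python
import torch
from flash_attn import flash_attn_with_kvcache

# Illustrative reduced signature; the real op takes many more arguments.
@torch.library.custom_op("flash_attn::_kvcache_sketch", mutates_args=("k_cache", "v_cache"))
def _flash_attn_with_kvcache_op(
    q: torch.Tensor,
    k_cache: torch.Tensor,
    v_cache: torch.Tensor,
    cache_seqlens: torch.Tensor,
) -> torch.Tensor:
    # Opaque to dynamo: torch.compile sees a single op instead of
    # graph-breaking on the underlying CUDA extension call.
    return flash_attn_with_kvcache(q, k_cache, v_cache, cache_seqlens=cache_seqlens)

@_flash_attn_with_kvcache_op.register_fake
def _(q, k_cache, v_cache, cache_seqlens):
    # Fake/meta impl so inductor can propagate shapes without running the
    # kernel; the output has the same shape and dtype as q.
    return torch.empty_like(q)
```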

On a transformers model this improves performance by more than 2x by avoiding graph breaks. There is one gotcha: with this implementation, PyTorch 2.6 throws an error in user code when reshaping the flash-attention output:

```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: <weakref at 0x7f10e00494e0; to 'torch.storage.UntypedStorage' at 0x7f10e0049400>
```

This is not an issue on PyTorch 2.7, so I had to introduce conditional logic to work around it: clones of the output tensors are returned only when compile is in use and the PyTorch version is earlier than 2.7.
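
A minimal sketch of that gate, assuming a hypothetical `_maybe_clone_output` helper applied where the wrapper returns its results (the helper name and the `packaging` version check are illustrative):

```python
import torch
from packaging import version

# The clone is only needed on PyTorch < 2.7, where compiling through the
# custom op hits the UntypedStorage weakref error shown above.
_NEEDS_CLONE_WORKAROUND = version.parse(torch.__version__).release < (2, 7)

def _maybe_clone_output(out: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: clone only when running under torch.compile on an
    # affected version; eager mode and PyTorch >= 2.7 skip the extra copy.
    if _NEEDS_CLONE_WORKAROUND and torch.compiler.is_compiling():
        return out.clone()
    return out
```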

jataylo (Author) commented Apr 16, 2025

@tridao alternatively, if preferred, instead of conditionalising the clone for PyTorch < 2.7 we could simply disable compileable support for this op below 2.7, since the additional clone could cause regressions and increase memory usage.
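
That alternative would reduce to a registration-time check along these lines (`_register_compileable_kvcache_op` is a hypothetical stand-in for the PR's registration code):

```python
import torch
from packaging import version

# Only expose the compileable custom op on PyTorch >= 2.7; older versions keep
# the existing behavior (a graph break) rather than paying for the extra clone.
if version.parse(torch.__version__).release >= (2, 7):
    _register_compileable_kvcache_op()  # hypothetical registration hook
```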

tridao (Member) commented Apr 22, 2025

We will drop support for PyTorch < 2.4, so you can simplify the code.
I'll need to think more about the clone. Does it slow things down when running in eager?

jataylo (Author) commented May 9, 2025

@tridao sorry for the slow response here, I've been away.

I imagine the additional clone could add eager-mode overhead as well as potentially introduce OOM issues, which have been observed in some cases.

Perhaps we just need to restrict the compileable support for this method to PyTorch >= 2.7. Unless @drisspg has any thoughts here on why we see this weakref error before 2.7.

zou3519 commented May 13, 2025

@jataylo do you have the full stack trace?

jataylo (Author) commented May 14, 2025

@zou3519 gist of the stack trace here:
https://gist.github.com/jataylo/ef8b729d53a4415bc00c00c03e934950

Let me see if I can get a reproducer, as this is currently from full model code. It looks like we hit this when applying reshape to the output of the flash_attn kvcache call.
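
A minimal reproducer would presumably look something like the following (a sketch only; the shapes and dtypes are assumptions, and whether the error actually fires depends on this PR's custom op being registered and on PyTorch 2.6):

```python
import torch
from flash_attn import flash_attn_with_kvcache

def attn_then_reshape(q, k_cache, v_cache, cache_seqlens):
    out = flash_attn_with_kvcache(q, k_cache, v_cache, cache_seqlens=cache_seqlens)
    # The reshape on the op's output is where the weakref error surfaced.
    return out.reshape(out.shape[0], -1)

compiled = torch.compile(attn_then_reshape)

# (batch, seqlen_q, nheads, headdim) query against a (batch, seqlen_cache, ...) cache.
q = torch.randn(2, 1, 8, 64, dtype=torch.float16, device="cuda")
k_cache = torch.randn(2, 128, 8, 64, dtype=torch.float16, device="cuda")
v_cache = torch.randn(2, 128, 8, 64, dtype=torch.float16, device="cuda")
cache_seqlens = torch.full((2,), 64, dtype=torch.int32, device="cuda")

compiled(q, k_cache, v_cache, cache_seqlens)
```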
