Description
Hello! Thank you for creating this awesome repository.
We're currently working on integrating SageAttention into Axolotl as an alternative to FlashAttention 2 for LLM fine-tuning. Our PR: axolotl-ai-cloud/axolotl#2823
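For reference, the general shape of the integration is a kernel-level swap of the attention call. Below is a minimal sketch based on the `sageattn` API from the SageAttention README; it is not the actual PR code, and the wrapper function name and the wiring into Axolotl/transformers are assumptions for illustration only:

```python
# Hypothetical sketch (not the PR's code): using SageAttention as a drop-in
# replacement for a FlashAttention-2-style kernel. Shapes, dtypes, and the
# sageattn signature follow the SageAttention README; the wrapper name is
# made up for this example.
import torch
from sageattention import sageattn


def sage_attention_forward(q, k, v, is_causal=True):
    # q, k, v: (batch, num_heads, seq_len, head_dim), fp16/bf16 on CUDA.
    # tensor_layout="HND" matches this (batch, heads, seq, dim) layout.
    return sageattn(q, k, v, tensor_layout="HND", is_causal=is_causal)


if __name__ == "__main__":
    q = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = sage_attention_forward(q, k, v)  # same shape as q
```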
We've had some success so far: both packing and non-packing work correctly with LoRA fine-tuning. However, we're running into an issue with full fine-tuning (loss drops to zero and gradient norm explodes within just a few steps).
We suspect we might be making a mistake in the implementation. We were hoping a maintainer could take a look at the approach in the PR and offer any initial thoughts or guidance.
We would be very open to collaborating on a write-up or blog post about this integration to showcase SageAttention. If it's easier, I'd also be happy to hop on a quick call to discuss the technical details.