Description
⚠️ Please check that this feature request hasn't been suggested before.
- I searched previous Ideas in Discussions and didn't find any similar feature requests.
- I searched previous Issues and didn't find any similar feature requests.
🔖 Feature description
Currently, Axolotl's LoRA kernel optimizations target only the attention and MLP projection modules; no optimized kernels exist for lm_head or embed_tokens. As a result, full-model fine-tuning with LoRA suffers from reduced throughput and increased memory usage.
Axolotl's current LoRA setup already provides optimized kernels for the following configuration:
```yaml
lora_mlp_kernel: true
lora_o_kernel: true
lora_qkv_kernel: true
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
```
However, there are no analogous optimized kernels for:
- lm_head
- embed_tokens
When fine-tuning the entire model, the absence of these kernels leads to a significant drop in throughput and increased GPU memory usage.
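For illustration, a minimal sketch of the configuration this request is about, assuming lm_head and embed_tokens are simply appended to lora_target_modules (which is how one would adapt them today):

```yaml
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
  # also adapt the output head and input embeddings
  - lm_head
  - embed_tokens
# note: none of the existing lora_*_kernel options apply to lm_head or
# embed_tokens, so these two layers take the unoptimized code path
```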
✔️ Solution
Extend the existing LoRA kernel optimizations (e.g., via Triton or custom CUDA kernels) to cover lm_head and embed_tokens. Benchmarks should demonstrate restored training throughput and a lower peak memory footprint. Once validated, integrate the new kernels into Axolotl's fine-tuning pipeline.
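To make the request concrete, one possible configuration surface, sketched with hypothetical option names (lora_lm_head_kernel and lora_embed_tokens_kernel do not exist in Axolotl today and are used here purely for illustration):

```yaml
# hypothetical flags, mirroring the existing lora_qkv/o/mlp kernel options
lora_lm_head_kernel: true       # fused LoRA kernel for the output projection (lm_head)
lora_embed_tokens_kernel: true  # fused LoRA kernel for the input embeddings (embed_tokens)
```

Keeping the naming parallel to the existing flags would let users opt in per layer, just as they already do for the attention and MLP kernels.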
❓ Alternatives
No response
📝 Additional Context
No response
Acknowledgements
- My issue title is concise, descriptive, and in title casing.
- I have searched the existing issues to make sure this feature has not been requested yet.
- I have provided enough information for the maintainers to understand and evaluate this request.