
Support for Loading a Subset of Tensors for LoRA Models  #399

Closed
@skeskinen

Description


Firstly, thank you for the awesome project. I'm new to LLMs so I hope this suggestion makes sense.

LoRA is a technique for reducing the number of trainable parameters during fine-tuning, and it has really taken off with the recent Alpaca work. In LoRA models, typically only the weight matrices Wq and Wv are fine-tuned.
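For reference, LoRA keeps the base weight frozen and learns a low-rank update, so the effective weight becomes W + (alpha/r) * B A. Here is a minimal numpy sketch of that update for a single attention projection; the shapes and variable names are purely illustrative, not anything from llama.cpp:

```python
import numpy as np

# Illustrative sizes: model dimension, LoRA rank, and LoRA scaling factor.
d_model, r, alpha = 4096, 8, 16

W_q = np.random.randn(d_model, d_model).astype(np.float32)  # frozen base weight
A = np.random.randn(r, d_model).astype(np.float32)          # LoRA "down" projection
B = np.zeros((d_model, r), dtype=np.float32)                 # LoRA "up" projection

# The fine-tuned projection is the base weight plus a scaled low-rank update.
W_q_merged = W_q + (alpha / r) * (B @ A)
```

Because only a few matrices per layer get such an update, everything else in the checkpoint is byte-for-byte identical to the base model.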

For projects shipping multiple LoRA fine-tuned models, most of the tensors remain unchanged by fine-tuning. Storing all weights multiple times would waste a significant amount of storage (e.g., ~3.5 GB per fine-tune for a 7B model, multiplied by the number of tasks or personalities you want to ship). Supporting the loading of a subset of tensors for LoRA models would enable efficient storage and loading of these models in llama.cpp, reducing storage requirements and possibly the memory footprint if you want to keep multiple models in memory at the same time.

I propose extending llama.cpp by adding support for loading a subset of tensors from separate .bin files. That way, all the business of merging the LoRA weights would still be done in Python, and the subset .bin files could be quantized as usual.
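As a rough illustration of the workflow I have in mind (the tensor naming scheme and the .npz container below are placeholders; a real implementation would write the ggml/llama.cpp .bin layout instead), the Python side would merge the LoRA deltas into the affected tensors and export only those tensors:

```python
import numpy as np

def export_lora_subset(base_weights, lora_weights, out_path, alpha, rank):
    """Merge LoRA deltas into the base weights and save only the tensors
    that were actually modified (e.g. the Wq/Wv projections of each layer).

    base_weights / lora_weights: dicts mapping tensor names to numpy arrays.
    """
    subset = {}
    scale = alpha / rank
    for name, W in base_weights.items():
        a_key, b_key = f"{name}.lora_A", f"{name}.lora_B"
        if a_key in lora_weights and b_key in lora_weights:
            A, B = lora_weights[a_key], lora_weights[b_key]
            subset[name] = W + scale * (B @ A)  # merged full-precision tensor
    # Only the merged tensors go into the subset file; everything else is
    # loaded from the shared base model .bin.
    np.savez(out_path, **subset)
    return sorted(subset.keys())
```

On the llama.cpp side, the loader would read the base .bin as usual and, for any tensor name that also appears in the subset file, use the subset version instead. Since the subset tensors are plain merged weights, they could also go through the normal quantization step before shipping.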

An alternative could be to natively support LoRA in llama.cpp. However, this approach would likely not be compatible with pre-quantization of the weights (afaict).
