Description
Firstly, thank you for the awesome project. I'm new to LLMs so I hope this suggestion makes sense.
LoRA is a technique for reducing the number of trainable parameters during fine-tuning, and it has been taking off with the recent Alpaca work. In LoRA models, typically only the attention weight matrices Wq and Wv are fine-tuned.
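To illustrate the idea, here is a minimal numpy sketch of a LoRA update: the frozen base weight W is augmented with a low-rank product B·A, and only A and B are trained. The dimensions and scaling below are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

d, r = 4096, 8                                   # hidden size, LoRA rank (illustrative)
W = np.random.randn(d, d).astype(np.float32)     # frozen base weight (e.g. Wq)
A = np.random.randn(r, d).astype(np.float32) * 0.01  # trainable low-rank factor
B = np.zeros((d, r), dtype=np.float32)           # trainable, initialized to zero
alpha = 16                                       # LoRA scaling hyperparameter

# Merged weight used at inference; at init B is zero, so W_merged == W
W_merged = W + (alpha / r) * (B @ A)
```

Only `A` and `B` (2·r·d parameters) are trained instead of the full d·d matrix, which is where the parameter savings come from.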
For projects shipping multiple LoRA fine-tuned models, most tensors are identical across fine-tunes, since only a small subset of weight matrices changes during training. Storing the full weights for each fine-tune wastes significant storage (e.g., ~3.5 GB per fine-tune for a 7B model, multiplied by the number of tasks or personalities you want to ship). Supporting the loading of a subset of tensors for LoRA models would enable efficient storage and loading of these models in llama.cpp, reducing storage requirements, and possibly memory footprint if you wanted to keep multiple models in memory at the same time.
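As a rough back-of-the-envelope check (assuming LLaMA-7B dimensions, 32 layers with hidden size 4096, and 4-bit quantization), the Wq/Wv subset is far smaller than a full copy of the model:

```python
layers, d = 32, 4096               # LLaMA-7B: 32 transformer layers, hidden size 4096
bytes_per_param = 0.5              # 4-bit quantization

subset_params = layers * 2 * d * d # only Wq and Wv per layer
full_params = 7e9                  # whole 7B model

subset_gb = subset_params * bytes_per_param / 1e9
full_gb = full_params * bytes_per_param / 1e9
print(f"subset: {subset_gb:.2f} GB vs full: {full_gb:.2f} GB")
# → subset: 0.54 GB vs full: 3.50 GB
```

So each additional fine-tune would cost roughly half a gigabyte instead of ~3.5 GB, under these assumptions.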
I propose extending llama.cpp to support loading a subset of tensors from separate .bin files. All the work of merging the LoRA weights would still be done in Python, and the subset .bin files could be quantized as usual.
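On the Python side, producing the subset could look roughly like this. Everything here is hypothetical for illustration: the `diff_tensors` helper, the tensor names, and the .npz output format (the real export would write llama.cpp's .bin format).

```python
import numpy as np

def diff_tensors(base, merged):
    """Hypothetical helper: keep only tensors that changed during fine-tuning."""
    return {name: t for name, t in merged.items()
            if name not in base or not np.array_equal(base[name], t)}

# Toy stand-ins for the base and LoRA-merged checkpoints
base = {"wq": np.ones((4, 4)), "wv": np.ones((4, 4)), "wo": np.ones((4, 4))}
merged = dict(base)
merged["wq"] = base["wq"] + 0.5    # pretend the LoRA fine-tune touched Wq only

subset = diff_tensors(base, merged)
np.savez("lora_subset.npz", **subset)  # ship only the changed tensors
```

llama.cpp would then load the base model as usual and overwrite just the tensors present in the subset file.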
An alternative would be to support LoRA natively in llama.cpp. However, that approach would likely not be compatible with pre-quantization of the weights, since the low-rank deltas would need to be applied to full-precision weights before quantizing (as far as I can tell).