Support for Loading a Subset of Tensors for LoRA Models  #399

@skeskinen

Description

Firstly, thank you for the awesome project. I'm new to LLMs so I hope this suggestion makes sense.

LoRA is a technique that reduces the number of trainable parameters during fine-tuning; it has really taken off with the recent Alpaca work. In LoRA models, typically only the weight matrices Wq and Wv are fine-tuned.
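For background, the LoRA idea can be sketched like this: the fine-tuned weight is the frozen base weight plus a low-rank delta, W' = W + (alpha/r)·B·A, and only the small A and B matrices are trained. A minimal NumPy illustration (the sizes here are illustrative, not taken from the issue):

```python
import numpy as np

# LoRA re-parameterization: instead of fine-tuning the full d x d matrix W,
# train two small matrices A (r x d) and B (d x r) with rank r << d.
# The effective weight is W' = W + (alpha / r) * B @ A.
d, r, alpha = 1024, 8, 16        # illustrative sizes; 7B models use larger d
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)  # frozen base weight
A = rng.standard_normal((r, d)).astype(np.float32)  # trained low-rank factor
B = np.zeros((d, r), dtype=np.float32)              # B is zero-initialized

W_merged = W + (alpha / r) * (B @ A)                # same shape as W

# Trainable parameters drop from d*d to 2*d*r per adapted matrix:
full_params = d * d              # 1_048_576
lora_params = d * r + r * d      # 16_384 (64x fewer)
```

This is also why the merge can be done offline: once B @ A is added to W, the result is an ordinary weight matrix that downstream tooling can treat like any other checkpoint tensor.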

For projects shipping multiple LoRA fine-tuned models, most of the tensors remain unchanged by the fine-tuning. Storing all weights multiple times wastes significant storage space (e.g., ~3.5 GB per fine-tune for a 7B model, multiplied by the number of tasks or personalities you want to ship). Supporting the loading of a subset of tensors for LoRA models would enable efficient storage and loading of these models in llama.cpp, reducing storage requirements, and potentially the memory footprint if you wanted to keep multiple models in memory at the same time.

I propose extending llama.cpp's functionality by adding support for loading a subset of tensors from separate .bin files. That way, all the business of merging the LoRA weights would still be done in Python, and the subset .bin files could be quantized as usual.
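As a rough sketch of what shipping only the changed tensors could look like (the tensor names and the `split_checkpoint` helper below are hypothetical, not llama.cpp's actual naming or API):

```python
import numpy as np

# Hypothetical sketch: split a merged LoRA checkpoint so the shared base
# weights are stored once and only the fine-tuned tensors ship per model.
# Suffixes are illustrative; real layouts depend on the converter used.
CHANGED_SUFFIXES = ("attention.wq.weight", "attention.wv.weight")

def split_checkpoint(tensors: dict) -> tuple[dict, dict]:
    """Split a merged state dict into (shared base, per-fine-tune subset)."""
    subset = {k: v for k, v in tensors.items()
              if k.endswith(CHANGED_SUFFIXES)}
    base = {k: v for k, v in tensors.items() if k not in subset}
    return base, subset

# Toy state dict standing in for a merged LoRA checkpoint.
model = {
    "layers.0.attention.wq.weight": np.ones((4, 4), dtype=np.float32),
    "layers.0.attention.wk.weight": np.ones((4, 4), dtype=np.float32),
    "layers.0.attention.wv.weight": np.ones((4, 4), dtype=np.float32),
    "tok_embeddings.weight": np.ones((8, 4), dtype=np.float32),
}
base, subset = split_checkpoint(model)
# A loader would read the shared base file once, then overlay `subset`
# from a small per-personality .bin file.
```

Since the subset tensors are ordinary merged weights, they could go through the existing quantization path independently of the base file.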

An alternative would be to support LoRA natively in llama.cpp. However, that approach would likely not be compatible with pre-quantization of the weights (afaict).

Activity

ggerganov (Member) commented on Mar 22, 2023

Thank you for the useful summary of LoRA - I wasn't familiar with it and was wondering what it actually means.
The proposed functionality sounds like something that can be achieved relatively easily in the existing framework.

Just curious - is this functionality currently available in other frameworks?
Loading multiple personalities of the model in-memory with reduced storage and dynamically switching between them.

BadisG commented on Mar 22, 2023

@ggerganov LoRAs are used a lot in Stable Diffusion, and in the webui version of LLaMA as well (oobabooga/text-generation-webui#332), although it doesn't work with 4-bit models for them at the moment.

bakkot (Contributor) commented on Mar 23, 2023

> Loading multiple personalities of the model in-memory with reduced storage and dynamically switching between them.

With Stable Diffusion, loading LoRAs separately from models is very popular - there's a whole ecosystem of LoRAs distributed on sites like Civitai. Many people end up with dozens or hundreds of LoRAs, which is much more practical than keeping dozens of 4 GB+ models. That will be even more so with LLaMA, given its larger size.

I expect this to be popular for LLaMA as well once the process for fine-tuning models becomes more accessible.

redthing1 commented on Mar 26, 2023

See related technique: #528

edwios commented on Mar 29, 2023

There are already related discussions and attempts here: #172

and an implementation (using the original LLaMA checkpoints) here: https://github.com/tloen/alpaca-lora#inference-generatepy

If LoRA can be made to work with q4, it'd be an awesome feature for both text generation and chat, very much like LoRA for images with Stable Diffusion.

skeskinen (Author) commented on Mar 29, 2023

That discussion is somewhat orthogonal to this feature request. alpaca-lora has a script for merging the LoRA weights and converting back to the PyTorch format; the result can be used with llama.cpp as usual. That already works today.

Linked a pull request that will close this issue: Add LoRA support #820 (Apr 14, 2023)


Participants: @edwios, @skeskinen, @bakkot, @ggerganov, @Green-Sky
