Description
Currently, the weights are uploaded to VRAM every time they are used. This usually represents ~30% of the time spent in most mat muls with cuBLAS.
This could be improved by keeping the weights in VRAM; however, there are two issues blocking this:
- We need to be able to identify with certainty what tensors are constant / weights
- We need to be able to identify when the weights change. With llama.cpp, this can happen after applying a LoRA
Issue 1 can be solved in a not very clean way with #1268 by looking at the matrices that have names ending in ".weight". We could also add some kind of `is_weight` flag to `ggml_tensor`, but that would be more intrusive.
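As an illustration, the name-based check could look roughly like this (a minimal sketch, assuming the per-tensor name added in #1268; `ggml_cuda_is_weight` is a hypothetical helper, not existing API):

```c
#include <string.h>
#include <stdbool.h>
#include "ggml.h"

// Heuristic sketch: treat tensors whose name ends in ".weight" as constant
// weights that could stay resident in VRAM. Relies on the per-tensor name
// from #1268; the helper name is made up.
static bool ggml_cuda_is_weight(const struct ggml_tensor * t) {
    const char * suffix = ".weight";
    const size_t name_len   = strlen(t->name);
    const size_t suffix_len = strlen(suffix);
    return name_len >= suffix_len &&
           strcmp(t->name + name_len - suffix_len, suffix) == 0;
}
```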
Issue 2 is more complicated. We would need either to add some kind of `is_dirty` flag to `ggml_tensor` that would be set automatically by the operations that modify a tensor (such as `ggml_cpy` and `_inplace` ops), or we could add a global flag to ggml-cuda that would trigger a full weight re-upload, which would need to be set by the application when the weights change.
The `is_dirty` flag would be more intrusive to ggml (essentially we would be adding GPU-specific details to ggml), but it would be automatic for downstream users. The global CUDA-only flag would be less intrusive to ggml, but would force users to deal with this manually.
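To make the second option concrete, a minimal sketch of what the global CUDA-only flag could look like (all names here are hypothetical):

```c
// Sketch of the global, CUDA-only variant (hypothetical API).
#include <stdbool.h>

// Starts out dirty so the first evaluation uploads everything.
static bool g_cuda_weights_dirty = true;

// The application would call this whenever the weights change on the host
// (e.g. after applying a LoRA); the next evaluation re-uploads all weights.
void ggml_cuda_invalidate_weights(void) {
    g_cuda_weights_dirty = true;
}
```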
Any thoughts about this? What would be the best way to add this to ggml, while not interfering with the general goal of not adding GPU-specific details to ggml?
Activity
ggerganov commented on May 1, 2023
Issue 1 should be solved by checking if the `op` is `GGML_OP_NONE`: https://github.com/ggerganov/llama.cpp/blob/c0335b51f959ddf8c7b58bf497d10e5dc4730267/ggml.h#L322-L325
This indicates that the tensor is not the result of some operator.
In theory, it can be an optimization parameter (this is not relevant for inference, but we should check anyway). A tensor becomes an optimization parameter by marking it with the `ggml_set_param()` function: https://github.com/ggerganov/llama.cpp/blob/c0335b51f959ddf8c7b58bf497d10e5dc4730267/ggml.h#L738-L742
In this case, the `grad` member becomes non-`NULL`. Therefore, the CUDA code should check `tensor->op == GGML_OP_NONE && tensor->grad == NULL` to determine if the tensor is a constant.
Will think more about issue 2.
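A minimal sketch of that check wrapped in a helper (the function name is hypothetical; the condition is the one above):

```c
#include <stdbool.h>
#include "ggml.h"

// Sketch: a tensor is treated as a constant if it is not the output of any
// op (op == GGML_OP_NONE) and is not an optimization parameter (grad is
// only non-NULL after ggml_set_param()). The helper name is made up.
static bool ggml_cuda_tensor_is_constant(const struct ggml_tensor * t) {
    return t->op == GGML_OP_NONE && t->grad == NULL;
}
```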
slaren commented on May 1, 2023
I think this would work for llama.cpp, but it may fail with tensors that are out-of-graph but not constant. For example, the k/v cache. In our case it should still work because the k/v tensors go through some view/reshape/permute operations, so technically it isn't interpreted as a constant by the time it reaches the mat mul. But I think it wouldn't be reliable in every case.
This sounds exactly like what we need, but from what I see, `ggml_set_param` duplicates the tensor in `tensor->grad`. Wouldn't this double memory usage?
ggerganov commented on May 2, 2023
Good point. We can add an enum value `GGML_NOP` or `GGML_OP_INP` or `GGML_OP_VAR` (not sure what is the best name) and you would set the `op` member of such tensors to that, just to mark that its value can change externally.
But probably we need some better interface for that, so the developer does not forget to do it.
The `ggml_set_param` is not suitable because, as you noted, it creates the `grad` tensor which doubles the memory. Maybe we can instead introduce `ggml_set_var()` or `ggml_set_inp()` that just sets the `op` member to that new enum value.
An alternative solution is adding a `bool is_var` flag to the tensor instead of the enum value. This has the advantage of not having to handle the new enum value in the `graph` functions. The GPU code can then check `tensor->op == GGML_OP_NONE && tensor->grad == NULL && !tensor->is_var` to determine if the tensor is constant. And we can also update `ggml_set_param()` to also set `tensor->is_var = true`, so the GPU check would become just `tensor->op == GGML_OP_NONE && !tensor->is_var`.
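A minimal sketch of the `is_var` variant (both the field and the setter are proposals from this comment, not existing ggml API):

```c
#include <stdbool.h>
#include "ggml.h"

// Proposed new field in struct ggml_tensor (sketch):
//     bool is_var;  // value may be modified externally; never treat as constant

// Proposed setter, analogous to ggml_set_param() but without allocating grad.
void ggml_set_var(struct ggml_tensor * tensor) {
    tensor->is_var = true;
}

// GPU-side check, assuming ggml_set_param() is updated to also set is_var.
static bool ggml_cuda_tensor_is_constant(const struct ggml_tensor * t) {
    return t->op == GGML_OP_NONE && !t->is_var;
}
```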
Regarding issue 2:
The `is_dirty` flag + update of all `ggml` functions does not seem like a good solution. It is indeed intrusive and error-prone.
I am thinking about adding the `is_dirty` flag to the tensor, but require the dev to set it manually for each tensor (e.g. `ggml_set_dirty()`). The flags will be cleared on every `ggml_graph_compute()` call by the main implementation. The GPU code will make use of this as you specified. This will allow more fine-grained control compared to the global flag where you would have to update all tensors even if only one has changed.
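A minimal sketch of that manual interface (the field, the setter, and the clearing loop are all proposed here, not existing API):

```c
#include <stdbool.h>
#include "ggml.h"

// Proposed new field in struct ggml_tensor (sketch):
//     bool is_dirty;  // host data changed since the backend last saw it

// Called by the application after modifying a tensor's data directly,
// e.g. after applying a LoRA to a weight tensor.
void ggml_set_dirty(struct ggml_tensor * tensor) {
    tensor->is_dirty = true;
}

// ggml_graph_compute() would clear the flags once the graph has run, e.g.:
//     for (int i = 0; i < cgraph->n_nodes; i++) { cgraph->nodes[i]->is_dirty = false; }
//     for (int i = 0; i < cgraph->n_leafs; i++) { cgraph->leafs[i]->is_dirty = false; }
```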
slaren commented on May 2, 2023
What if we set `is_dirty` automatically in all the `dst` tensors in `ggml_graph_compute` or `ggml_compute_forward`? This may fail with tensors that reuse the data of a different tensor, like views or reshapes, but that could be solved by adding a `ggml_tensor * owner` field that would be set by these operations.
ggerganov commented on May 2, 2023
Could work, but you will still need the `ggml_set_dirty()` because it is likely that we will want to sometimes modify a tensor by writing to the `data` directly, instead of going through `ggml` ops.
Also, it is not obvious when you will clear the flag. You cannot clear it at the end of `ggml_graph_compute()` because some tensor could have just become dirty. You cannot clear the `src` tensors' dirty flags in `ggml_compute_forward()` because they might be part of later ops.
So we will need to maintain a separate set of "new dirty" flags for each compute, I think.
slaren commented on May 2, 2023
The flag would be cleared in the GPU code after uploading the weights to VRAM. The logic would be something like this:
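Roughly, as a sketch with hypothetical names for the cached device buffer and the upload helper:

```c
#include "ggml.h"

// Illustrative sketch of the cuBLAS mat mul path with weight caching.
// ggml_cuda_get_cached_buffer / ggml_cuda_upload_and_cache are made-up helpers.
static void ggml_cuda_mul_mat(struct ggml_tensor * src0,   // weights
                              struct ggml_tensor * src1,   // activations
                              struct ggml_tensor * dst) {
    // Device copy of the weights, kept in VRAM across evaluations.
    void * d_weights = ggml_cuda_get_cached_buffer(src0);

    if (d_weights == NULL || src0->is_dirty) {
        // First use, or the host data changed: (re)upload and keep the buffer.
        d_weights = ggml_cuda_upload_and_cache(src0);
        src0->is_dirty = false;   // cleared here, after the upload
    }

    // ... upload src1, run the cuBLAS GEMM into dst, copy the result back ...
}
```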
ggerganov commented on May 3, 2023
The `owner` could probably be avoided by traversing the sources of a tensor, along with their sources, etc. The logic is that if a tensor is the result of an operation involving a dirty tensor, then it is also dirty.
Let's start with manually setting `is_dirty` via the `ggml` API and clearing it after `ggml_graph_compute()`.
The automatic `is_dirty` detection could be added after that, but at the moment I feel it is an unnecessary complication. Will think about it more though.
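A minimal sketch of that source-traversal idea (assuming the proposed `is_dirty` flag and the tensor's `src0`/`src1` source pointers; the function itself is hypothetical):

```c
#include <stdbool.h>
#include "ggml.h"

// A tensor is transitively dirty if it is dirty itself or if it was computed
// from a dirty tensor (directly or indirectly).
static bool ggml_tensor_is_dirty_rec(const struct ggml_tensor * t) {
    if (t == NULL) {
        return false;
    }
    if (t->is_dirty) {
        return true;
    }
    return ggml_tensor_is_dirty_rec(t->src0) ||
           ggml_tensor_is_dirty_rec(t->src1);
}
```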
SlyEcho commented on May 5, 2023
Could it be done on a higher level from llama.cpp? Just like it manages the scratch buffers or the KV cache. It also knows exactly in what order the weights are used, and it could start loading the next layer's weights ahead of time, or it could even do that at the end: load the first ones again for the next evaluation.
slaren commented on May 5, 2023
llama.cpp builds a graph which is then executed by ggml; it is not synchronous, so these instructions to "prefetch" the next layer's weights would have to be in the graph, which would require adding GPU-specific ops to ggml. Eventually, I think the fastest approach would be hybrid GPU/CPU inference, in which some layers are processed entirely on the GPU (as much as the available VRAM allows) and the rest on the CPU. But currently that would have to be in a fork.
SlyEcho commented on May 5, 2023
It could be useful to prefetch data in other environments as well: low-RAM devices could fetch from disk, low-disk-space setups could fetch from the network or from a database.
It could also be used to dequantize some formats where the whole tensor has to be processed instead of one row.
It's a thought I had.
slaren commented on May 5, 2023
I think it's a good idea, I am just not sure if it would fit in ggml as long as the goal is to keep GPU stuff separated.