Description
In the existing `llama.cpp` implementation, the quantization bits of consecutive model weights are packed one after the other. E.g., for 4-bit quantization, the 8 bits of two consecutive weights are stored in a single `uint8_t`. The disadvantage of this approach is that when the data is used in dot products, or is de-quantized for matrix multiplications done via BLAS, and the operations are performed with SIMD instructions, one needs to shuffle the de-quantized bytes to get them into the correct order. These shuffle operations can be avoided by arranging the bits differently. For instance, for 4-bit quantization in blocks of 32 weights (`Q4_0`), one can store the quants of the first 16 weights in the low 4 bits of the 16 `uint8_t`s, and the quants of the second 16 weights in the high 4 bits. The same or a similar strategy can also be applied to other block sizes, or when using 2 bits per weight.
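
To make the layout difference concrete, here is a minimal scalar sketch of the two packings for one 32-weight block. The function names and the demo data are mine for illustration, not the actual `llama.cpp` data structures:

```c
#include <stdint.h>
#include <stdio.h>

// Existing layout: byte i holds weights 2*i (low nibble) and 2*i+1 (high nibble).
static void unpack_interleaved(const uint8_t qs[16], uint8_t out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = qs[i] & 0x0F;  // even weight
        out[2*i + 1] = qs[i] >> 4;    // odd weight
    }
}

// Proposed layout: low nibbles hold weights 0..15, high nibbles hold weights
// 16..31, so masking/shifting yields both halves already in sequential order.
static void unpack_split(const uint8_t qs[16], uint8_t out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[i]      = qs[i] & 0x0F;   // weights 0..15
        out[i + 16] = qs[i] >> 4;     // weights 16..31
    }
}

int main(void) {
    uint8_t w[32], qs_old[16], qs_new[16], a[32], b[32];
    for (int i = 0; i < 32; ++i) w[i] = (uint8_t)(i & 0x0F);  // demo quants

    for (int i = 0; i < 16; ++i) {
        qs_old[i] = (uint8_t)(w[2*i] | (w[2*i + 1] << 4));    // interleaved
        qs_new[i] = (uint8_t)(w[i]   | (w[i + 16]  << 4));    // split halves
    }

    unpack_interleaved(qs_old, a);
    unpack_split(qs_new, b);
    for (int i = 0; i < 32; ++i) {
        if (a[i] != w[i] || b[i] != w[i]) { printf("mismatch at %d\n", i); return 1; }
    }
    printf("both layouts recover the same 32 weights\n");
    return 0;
}
```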
The performance gain is not earth-shattering: in a synthetic benchmark performing `Q4_0_Q8_0` dot products I measured about a 10% speedup from avoiding the shuffle. Still, it is a trivial change, so why leave this low-hanging fruit hanging?
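
For reference, a hedged sketch of where the shuffle disappears in such a dot product. This is not the actual `llama.cpp` kernel (the names are mine, and the `Q4_0` offset and block scales are omitted for brevity); it only shows that with the split layout both nibble halves come out of the mask/shift already aligned with the 32 sequential `Q8_0` activations, so they feed straight into the multiply-add with no byte shuffle:

```c
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>  // SSSE3 intrinsics; compile with e.g. -mssse3

// One-block dot product over the split layout (scales and the Q4_0 "-8"
// offset are left out to keep the sketch short).
static int32_t dot_block(const uint8_t qs[16], const int8_t y[32]) {
    const __m128i m4 = _mm_set1_epi8(0x0F);
    __m128i q  = _mm_loadu_si128((const __m128i *)qs);
    __m128i lo = _mm_and_si128(q, m4);                     // weights 0..15, already in order
    __m128i hi = _mm_and_si128(_mm_srli_epi16(q, 4), m4);  // weights 16..31, already in order
    __m128i y0 = _mm_loadu_si128((const __m128i *)(y +  0));
    __m128i y1 = _mm_loadu_si128((const __m128i *)(y + 16));
    // With the interleaved layout, a byte shuffle (e.g. _mm_shuffle_epi8)
    // would be needed at this point to match the activation order.
    __m128i p  = _mm_add_epi16(_mm_maddubs_epi16(lo, y0),  // unsigned x signed -> int16 pairs
                               _mm_maddubs_epi16(hi, y1));
    __m128i s  = _mm_madd_epi16(p, _mm_set1_epi16(1));     // widen and sum to 4 x int32
    s = _mm_add_epi32(s, _mm_srli_si128(s, 8));            // horizontal sum
    s = _mm_add_epi32(s, _mm_srli_si128(s, 4));
    return _mm_cvtsi128_si32(s);
}

int main(void) {
    uint8_t qs[16];
    int8_t  y[32];
    int32_t ref = 0;
    for (int i = 0; i < 16; ++i)  // split layout: w[i] in low nibble, w[i+16] in high
        qs[i] = (uint8_t)((i & 0x0F) | (((i + 3) & 0x0F) << 4));
    for (int i = 0; i < 32; ++i) y[i] = (int8_t)(i - 16);
    for (int i = 0; i < 16; ++i)  // scalar reference result
        ref += (qs[i] & 0x0F) * y[i] + (qs[i] >> 4) * y[i + 16];
    printf("simd = %d, ref = %d\n", dot_block(qs, y), ref);
    return 0;
}
```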