Description
In the existing `llama.cpp` implementation, the quantization bits of consecutive model weights are packed one after the other. E.g., for 4-bit quantization, the 8 bits of two consecutive weights are stored in a single `uint8_t`. The disadvantage of this approach is that when the data is used in dot products, or is de-quantized for matrix multiplications done via BLAS, and the operations are performed with SIMD instructions, one needs to shuffle the de-quantized bytes to get them into the correct order. These shuffle operations can be avoided by arranging the bits differently. For instance, for 4-bit quantization in blocks of 32 weights (`Q4_0`), one can store the quants of the first 16 weights in the low 4 bits of the 16 `uint8_t`s, and the quants of the second 16 weights in the high 4 bits. The same or a similar strategy can also be applied to other block sizes, or when using 2 bits per weight.
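
To make the layout difference concrete, here is a minimal scalar sketch of the two packings for one 32-weight block. The function names and the demo data are mine for illustration, not the actual `llama.cpp` data structures:

```c
#include <stdint.h>
#include <stdio.h>

// Existing layout: byte i holds weights 2*i (low nibble) and 2*i+1 (high nibble).
static void unpack_interleaved(const uint8_t qs[16], uint8_t out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = qs[i] & 0x0F;  // even weight
        out[2*i + 1] = qs[i] >> 4;    // odd weight
    }
}

// Proposed layout: low nibbles hold weights 0..15, high nibbles hold weights
// 16..31, so masking/shifting yields both halves already in sequential order.
static void unpack_split(const uint8_t qs[16], uint8_t out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[i]      = qs[i] & 0x0F;   // weights 0..15
        out[i + 16] = qs[i] >> 4;     // weights 16..31
    }
}

int main(void) {
    uint8_t w[32], qs_old[16], qs_new[16], a[32], b[32];
    for (int i = 0; i < 32; ++i) w[i] = (uint8_t)(i & 0x0F);  // demo quants

    for (int i = 0; i < 16; ++i) {
        qs_old[i] = (uint8_t)(w[2*i] | (w[2*i + 1] << 4));    // interleaved
        qs_new[i] = (uint8_t)(w[i]   | (w[i + 16]  << 4));    // split halves
    }

    unpack_interleaved(qs_old, a);
    unpack_split(qs_new, b);
    for (int i = 0; i < 32; ++i) {
        if (a[i] != w[i] || b[i] != w[i]) { printf("mismatch at %d\n", i); return 1; }
    }
    printf("both layouts recover the same 32 weights\n");
    return 0;
}
```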
The performance gain is not earth-shattering: in a synthetic benchmark performing `Q4_0_Q8_0` dot products I measured about a 10% speedup from avoiding the shuffle. Still, it is a trivial change, so why leave this low-hanging fruit hanging?
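
For reference, a hedged sketch of where the shuffle disappears in such a dot product. This is not the actual `llama.cpp` kernel (the names are mine, and the `Q4_0` offset and block scales are omitted for brevity); it only shows that with the split layout both nibble halves come out of the mask/shift already aligned with the 32 sequential `Q8_0` activations, so they feed straight into the multiply-add with no byte shuffle:

```c
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>  // SSSE3 intrinsics; compile with e.g. -mssse3

// One-block dot product over the split layout (scales and the Q4_0 "-8"
// offset are left out to keep the sketch short).
static int32_t dot_block(const uint8_t qs[16], const int8_t y[32]) {
    const __m128i m4 = _mm_set1_epi8(0x0F);
    __m128i q  = _mm_loadu_si128((const __m128i *)qs);
    __m128i lo = _mm_and_si128(q, m4);                     // weights 0..15, already in order
    __m128i hi = _mm_and_si128(_mm_srli_epi16(q, 4), m4);  // weights 16..31, already in order
    __m128i y0 = _mm_loadu_si128((const __m128i *)(y +  0));
    __m128i y1 = _mm_loadu_si128((const __m128i *)(y + 16));
    // With the interleaved layout, a byte shuffle (e.g. _mm_shuffle_epi8)
    // would be needed at this point to match the activation order.
    __m128i p  = _mm_add_epi16(_mm_maddubs_epi16(lo, y0),  // unsigned x signed -> int16 pairs
                               _mm_maddubs_epi16(hi, y1));
    __m128i s  = _mm_madd_epi16(p, _mm_set1_epi16(1));     // widen and sum to 4 x int32
    s = _mm_add_epi32(s, _mm_srli_si128(s, 8));            // horizontal sum
    s = _mm_add_epi32(s, _mm_srli_si128(s, 4));
    return _mm_cvtsi128_si32(s);
}

int main(void) {
    uint8_t qs[16];
    int8_t  y[32];
    int32_t ref = 0;
    for (int i = 0; i < 16; ++i)  // split layout: w[i] in low nibble, w[i+16] in high
        qs[i] = (uint8_t)((i & 0x0F) | (((i + 3) & 0x0F) << 4));
    for (int i = 0; i < 32; ++i) y[i] = (int8_t)(i - 16);
    for (int i = 0; i < 16; ++i)  // scalar reference result
        ref += (qs[i] & 0x0F) * y[i] + (qs[i] >> 4) * y[i + 16];
    printf("simd = %d, ref = %d\n", dot_block(qs, y), ref);
    return 0;
}
```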