
Use different bit arrangement for quants (nibbles) #1241

Closed
@ikawrakow

Description


In the existing llama.cpp implementation, the quantization bits of consecutive model weights are packed one after the other. For 4-bit quantization, for example, the two 4-bit quants of consecutive weights are stored in a single uint8_t. The disadvantage of this approach is that when the data is used in dot products, or is de-quantized for matrix multiplications done via BLAS, and the operations are performed with SIMD instructions, the de-quantized bytes must be shuffled into the correct order.

These shuffle operations can be avoided by arranging the bits differently. For instance, for 4-bit quantization in blocks of 32 weights (Q4_0), one can store the quants of the first 16 weights in the low 4 bits of the block's 16 uint8_t's, and the quants of the second 16 weights in the high 4 bits. The same or a similar strategy can be applied to other block sizes, or when using 2 bits per weight.

The performance gain is not earth-shattering: in a synthetic benchmark performing Q4_0_Q8_0 dot products I measured about a 10% speedup from avoiding the shuffle. Still, it is a trivial change, so why leave this low-hanging fruit hanging?
