While conversion with `convert-hf.py` seems to work, models in `q80` and `f16` formats cannot be loaded. Here are the combinations I tried with Llama-3.3-70B-Instruct:
| quant | buffer-float-type | error |
|---|---|---|
| `q80` | `q40` | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| `q80` | `q80` | Critical error: Unsupported CPU op code: MATMUL, quant: Q80_Q80_F32, op name: block_matmul_q |
| `q80` | `f16` | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| `q80` | `f32` | Critical error: Unsupported op quant: F_32/F_Q80/F_32 |
| `f16` | `q40` | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| `f16` | `q80` | Critical error: Unsupported op quant: F_Q80/F_16/F_32 |
| `f16` | `f16` | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| `f16` | `f32` | Critical error: Unsupported op quant: F_32/F_16/F_32 |
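
For reference, each row corresponds to an invocation roughly like the sketch below (assuming the repo's `convert-hf.py` converter and the `dllama inference` CLI; the model/tokenizer file names, positional arguments, and all flags other than `--buffer-float-type` are my assumptions from memory and may not match the current options exactly):

```sh
# Convert the HF checkpoint to q80 weights (output file name is an assumption).
python convert-hf.py path/to/Llama-3.3-70B-Instruct q80 llama3.3-70b-instruct

# Load the q80 model with an f16 synchronization buffer; this is the q80/f16 row,
# which fails with "Unsupported op quant: F_32/F_UNK/F_16".
./dllama inference \
  --model dllama_model_llama3.3-70b-instruct_q80.m \
  --tokenizer dllama_tokenizer_llama3.3-70b-instruct.t \
  --buffer-float-type f16 \
  --prompt "Hello" --steps 32 --nthreads 8
```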
I'm mostly interested in `q80` models with `f16` or higher precision for synchronization. With llama.cpp, 8-bit quantization usually yields very high performance (only slightly slower than 4-bit) without the model degradation that is sometimes obvious with 4-bit quantization.

Am I doing something wrong, or is support currently missing?