
q80 and f16 models fail with Critical error: Unsupported ... #183

Closed
@lemmi

Description


While conversion with convert-hf.py seems to work, models in the q80 and f16 formats cannot be loaded. Here are the combinations I tried with Llama-3.3-70B-Instruct (the commands I used are sketched below the table):

| quant | buffer-float-type | error |
|-------|-------------------|-------|
| q80 | q40 | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| q80 | q80 | Critical error: Unsupported CPU op code: MATMUL, quant: Q80_Q80_F32, op name: block_matmul_q |
| q80 | f16 | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| q80 | f32 | Critical error: Unsupported op quant: F_32/F_Q80/F_32 |
| f16 | q40 | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| f16 | q80 | Critical error: Unsupported op quant: F_Q80/F_16/F_32 |
| f16 | f16 | Critical error: Unsupported op quant: F_32/F_16/F_32 (f16 buffer) |
| f16 | f32 | Critical error: Unsupported op quant: F_32/F_16/F_32 |
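
For reference, this is roughly how I reproduced each row; the paths, output names, and thread count are illustrative and the exact flags are from memory, so treat this as an approximation of my invocation rather than exact documentation:

```sh
# Conversion (argument order as I recall it; model name is illustrative)
python convert-hf.py /path/to/Llama-3.3-70B-Instruct q80 llama-3.3-70b

# Loading / inference, varying --buffer-float-type per the table above
./dllama inference \
  --model dllama_model_llama-3.3-70b_q80.m \
  --tokenizer dllama_tokenizer_llama-3.3-70b.t \
  --buffer-float-type f16 \
  --nthreads 8 \
  --prompt "Hello"
```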

I'm mostly interested in q80 models with f16 or higher precision for synchronization. With llama.cpp, 8-bit quantization usually yields very high performance (only slightly slower than 4-bit) without the sometimes obvious model degradation of 4-bit quantization.

Am I doing something wrong, or is support currently missing?
