llama : support qwen3 rerank and embeddings #14029

Open · wants to merge 10 commits into master

Conversation

@ngxson (Collaborator) commented Jun 5, 2025

Close #13820

Supersede #14028

Model: https://huggingface.co/Qwen/Qwen3-Reranker-4B

  • For the embeddings model, we simply use POOLING_LAST
  • For the reranking model, it is in fact a classification model with 2 labels: "yes" and "no". The rerank score is the softmaxed score of the "yes" label (see the sketch below)
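
For context, a minimal sketch of how that score falls out of the two-label head (illustrative only, not actual llama.cpp code; the helper name is made up):

    #include <algorithm>
    #include <cmath>

    // given the classification head's logits for "yes" and "no" at the last
    // token, the rerank score is the softmax probability assigned to "yes"
    static float qwen3_rerank_score(float logit_yes, float logit_no) {
        const float m     = std::max(logit_yes, logit_no); // for numerical stability
        const float e_yes = std::exp(logit_yes - m);
        const float e_no  = std::exp(logit_no  - m);
        return e_yes / (e_yes + e_no); // in (0, 1); higher = more relevant
    }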

@github-actions bot added the python (python script changes) label on Jun 5, 2025
Comment on lines 1581 to 1582
cur = ggml_mul_mat(ctx0, cls_out, inp);
cur = ggml_soft_max(ctx0, cur); // qwen3 uses softmax on the output
Collaborator Author (ngxson):

@ggerganov I think there is a bug with build_inp_cls(). It is supposed to contain only the indexes of the output tokens (the last token), but in this case it actually contains all tokens. This makes the output score incorrect at the moment, as it returns the score for the first token. WDYT?

Member (ggerganov):

I think you can make a quick fix for now like this, similar to build_bert():

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index afef84870..8b11197df 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -7043,7 +7043,7 @@ struct llm_build_qwen3 : public llm_graph_context {
                         Qcur, Kcur, Vcur, nullptr, nullptr, 1.0f/sqrtf(float(n_embd_head)), il);
             }
 
-            if (il == n_layer - 1) {
+            if (il == n_layer - 1 && pooling_type == LLAMA_POOLING_TYPE_NONE) {
                 // skip computing output for unused tokens
                 ggml_tensor * inp_out_ids = build_inp_out_ids();
                 cur   = ggml_get_rows(ctx0,   cur, inp_out_ids);

Collaborator Author (ngxson):

No, that still doesn't work as I expected.

For example, if my sequence has only one output token, then I expect the inp tensor here to have shape [n_embd, 1], but in reality it has shape [n_embd, n_tokens].

Maybe I misunderstood something here?

Collaborator Author (ngxson):

Hmm, OK, I think I got it. The main problem is that Qwen's rerank model uses causal attention; it's simply a normal next-token generation model which outputs either a "yes" or a "no" token.

I think the assumption in llama.cpp is that CLS and RANK are non-causal, hence only the first token is marked as output.

Not sure what the best way to support this is, though.

Collaborator Author (@ngxson), Jun 6, 2025:

OK, found a hack around this: for Qwen3, I force the output position to last (only the position, not the pooling) in 030dc3b.

We should probably separate the notions of "pooling" and "output position" in the future.

Member (ggerganov):

> I think the assumption in llama.cpp is that CLS and RANK are non-causal, hence only the first token is marked as output

The idea is that the llm_build_ functions will compute the embeddings for all tokens in the batch. The notion of "output ids" is purely an optimization trick to avoid unnecessary computation in the last layer, and when doing any kind of pooling it should generally be disabled.

For Qwen3 rerank, what you seem to need is to pool using last and apply the classification head on the result - the latter is missing, so it has to be added. We just haven't encountered models with pooling last and a classification head at the same time.
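
To make the described order of operations concrete, here is a hedged sketch (illustrative function name, signature, and tensor names, not the actual llama.cpp graph code), using the same ggml ops that already appear in this PR:

    #include "ggml.h"

    // assumes `embd` holds the per-token embeddings and `inp_last_ids` holds
    // the index of the last token of each sequence (both names are made up)
    static ggml_tensor * build_rank_head_sketch(
            ggml_context * ctx0,
            ggml_tensor  * embd,          // [n_embd, n_tokens]
            ggml_tensor  * inp_last_ids,  // [n_seqs], I32 indices
            ggml_tensor  * cls_out) {     // [n_embd, n_labels] classification head
        // 1) pool "last": keep only the last token of each sequence
        ggml_tensor * pooled = ggml_get_rows(ctx0, embd, inp_last_ids); // [n_embd, n_seqs]
        // 2) apply the classification head on the pooled embedding
        ggml_tensor * logits = ggml_mul_mat(ctx0, cls_out, pooled);     // [n_labels, n_seqs]
        // 3) softmax over the labels ("yes"/"no" for qwen3)
        return ggml_soft_max(ctx0, logits);
    }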

And it seems we should remove LLAMA_POOLING_TYPE_RANK - it's a bit redundant. Instead CLS and LAST should do the same thing - i.e. apply a classification head if there is one.

Collaborator Author (ngxson):

Hmm, OK, I got it. The problem is that I don't have much time for the rest of the day. Do you think we can clean this up in a follow-up PR?

> And it seems we should remove LLAMA_POOLING_TYPE_RANK - it's a bit redundant. Instead CLS and LAST should do the same thing - i.e. apply a classification head if there is one.

I think having the notion of LLAMA_TASK_* would be useful. For example, pooling CLS can be used for task types CLS and RANK. This could also be used to block certain endpoints; for example, a rerank model should only support /rerank and not /embeddings or /completion. A rough sketch of the idea is below.
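
Purely illustrative sketch of the proposed notion (no such enum exists in llama.cpp today; the names are made up):

    // a task type, decoupled from the pooling type, could drive both graph
    // construction and which server endpoints are allowed for a given model
    enum llama_task_type {
        LLAMA_TASK_TYPE_COMPLETION, // /completion
        LLAMA_TASK_TYPE_EMBEDDING,  // /embeddings
        LLAMA_TASK_TYPE_CLS,        // classification
        LLAMA_TASK_TYPE_RANK,       // /rerank only
    };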

Member (ggerganov):

I think it's better to take the time and make it right; no need to merge it now.

@ngxson's comment was marked as outdated.

@ngxson (Collaborator Author) commented Jun 6, 2025

Hmm, no, the prompt is still not correct:

@@ -200,7 +200,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
     { LLM_KV_TOKENIZER_HF_JSON,         "tokenizer.huggingface.json" },
     { LLM_KV_TOKENIZER_RWKV,            "tokenizer.rwkv.world" },
     { LLM_KV_TOKENIZER_CHAT_TEMPLATE,   "tokenizer.chat_template" },
-    { LLM_KV_TOKENIZER_CHAT_TEMPLATE_N, "tokenizer.chat_template.%s" },
+    { LLM_KV_TOKENIZER_CHAT_TEMPLATE_N, "tokenizer.chat_template." }, // FIXME: cannot add %s because it will be replaced by arch name
Collaborator:

Actually, you can use LLM_KV_TOKENIZER_CHAT_TEMPLATE with a suffix:

llama.cpp/src/llama-arch.cpp

Lines 1722 to 1725 in c02f53d

if (suffix != nullptr) {
    name += ".";
    name += suffix;
}

Collaborator Author (ngxson):

I'm doing this, but it doesn't work; maybe it's buggy somewhere else:

    const auto key = name
        ? LLM_KV(model->arch, name)(LLM_KV_TOKENIZER_CHAT_TEMPLATE)
        : LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE);

Collaborator:

I was looking in the wrong place; this is where it's broken:

llama.cpp/src/llama-arch.cpp

Lines 1709 to 1712 in e83ba3e

std::string LLM_KV::operator()(llm_kv kv) const {
    return suffix ? ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch), suffix)
                  : ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch));
}
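
For reference, a hedged sketch of one way the suffix handling could be fixed (the actual fix, mentioned later in this thread, may well differ): format the arch-dependent key first, then append the suffix the same way the constructor snippet quoted above does:

    // sketch only, not the actual change
    std::string LLM_KV::operator()(llm_kv kv) const {
        std::string name = ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch));
        if (suffix != nullptr) {
            name += ".";
            name += suffix;
        }
        return name;
    }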

Collaborator:

Fixed in #14050

Comment on lines +13796 to +13797
    const auto key = name
        ? LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE_N) + std::string(name)
Collaborator:

Suggested change
-    const auto key = name
-        ? LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE_N) + std::string(name)
+    const auto key = name ? LLM_KV(model->arch, name)(LLM_KV_TOKENIZER_CHAT_TEMPLATE)

I wonder how long this has been broken?

Collaborator:

Ah, it seems this has never worked; it's been broken since it was introduced in #11016.

Collaborator Author (ngxson):

It's not used by any of the examples, so we never knew whether it worked in the first place (it's probably used in downstream projects, but I don't know).

Collaborator:

I'll make a PR.

@ngxson (Collaborator Author) commented Jun 6, 2025

Should be correct now:

{
    "query": "learn more about science",
    "texts": [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
}

Result:

[
    {
        "index": 0,
        "score": -11.903118133544922
    },
    {
        "index": 1,
        "score": -9.440960884094238
    }
]

@CISC (Collaborator) commented Jun 6, 2025

[
    {
        "index": 0,
        "score": -11.903118133544922
    },
    {
        "index": 1,
        "score": -9.440960884094238
    }
]

I don't know if there's a "standard" for this anywhere, but we should include the label in the response as well.

@ngxson (Collaborator Author) commented Jun 6, 2025

> I don't know if there's a "standard" for this anywhere, but we should include the label in the response as well.

Rerank only returns a single value per document (the score); there is no need to display the label. The higher the score, the "closer" the query is to the doc.

You see 2 indexes because the request has 2 docs; a request can contain multiple docs.

@ngxson (Collaborator Author) commented Jun 6, 2025

Also note that the 2 labels are only used for the softmax operation. We cannot use a single label, because it would fail in this case (see the example below):

  • First output: { yes: 0.5, no: 1.0 }
  • Second output: { yes: 0.5, no: 0.0 }
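
To see why (a hedged illustration, not llama.cpp code; the numbers above are treated as raw logits): the "yes" logit is identical in both outputs, yet the softmax scores differ once the "no" logit is taken into account.

    #include <cmath>
    #include <cstdio>

    // softmax probability of "yes" given the two logits
    static float p_yes(float yes, float no) {
        return std::exp(yes) / (std::exp(yes) + std::exp(no));
    }

    int main() {
        std::printf("first:  %.3f\n", p_yes(0.5f, 1.0f)); // ~0.378
        std::printf("second: %.3f\n", p_yes(0.5f, 0.0f)); // ~0.622
        return 0;
    }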

@CISC (Collaborator) commented Jun 6, 2025

I see, never mind for this model then, but it might be worth investigating multiple scores for other models?

@ngxson (Collaborator Author) commented Jun 6, 2025

One score per label has always been the way for CLS models, right?

You can optionally apply a softmax or mean-square normalization to make the scores more "normalized".

For the rerank task, the API is taken from the Jina API, so it doesn't make sense to add a second score; it risks breaking API compatibility.

@CISC (Collaborator) commented Jun 6, 2025

> One score per label has always been the way for CLS models, right?

Yes.

> For the rerank task, the API is taken from the Jina API, so it doesn't make sense to add a second score; it risks breaking API compatibility.

Ah, I was wondering about that, OK. It would be interesting to know of other APIs; another endpoint supporting this could be useful for sentiment ranking, etc.

@ngxson requested review from CISC and ggerganov on June 6, 2025 at 11:07
Labels: examples, python (python script changes), server

Successfully merging this pull request may close these issues.

Eval bug: Cannot load Qwen3 ranking models
3 participants