llama : support qwen3 rerank and embeddings #14029

Open · wants to merge 10 commits into master

Conversation

@ngxson (Collaborator) commented Jun 5, 2025

Close #13820

Supersede #14028

Model: https://huggingface.co/Qwen/Qwen3-Reranker-4B

  • For the embeddings model, we simply use POOLING_LAST
  • For the reranking model, it is in fact a classification model with 2 labels: "yes" and "no". The rerank score is the softmaxed score of the "yes" label (see the sketch below)
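
For context, a minimal sketch of how that score falls out of the two-label head (illustrative only, not actual llama.cpp code; the helper name is made up):

    #include <algorithm>
    #include <cmath>

    // given the classification head's logits for "yes" and "no" at the last
    // token, the rerank score is the softmax probability assigned to "yes"
    static float qwen3_rerank_score(float logit_yes, float logit_no) {
        const float m     = std::max(logit_yes, logit_no); // for numerical stability
        const float e_yes = std::exp(logit_yes - m);
        const float e_no  = std::exp(logit_no  - m);
        return e_yes / (e_yes + e_no); // in (0, 1); higher = more relevant
    }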

@github-actions bot added the python (python script changes) label on Jun 5, 2025
Comment on lines 1581 to 1582
cur = ggml_mul_mat(ctx0, cls_out, inp);
cur = ggml_soft_max(ctx0, cur); // qwen3 uses softmax on the output
Collaborator Author (ngxson):

@ggerganov I think there is a bug with build_inp_cls(). It is supposed to contain only the indexes of the output tokens (the last token), but in this case it actually contains all tokens. This makes the output score incorrect at the moment, as it returns the score for the first token. WDYT?

Member (ggerganov):

I think you can make a quick fix for now like this, similar to build_bert():

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index afef84870..8b11197df 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -7043,7 +7043,7 @@ struct llm_build_qwen3 : public llm_graph_context {
                         Qcur, Kcur, Vcur, nullptr, nullptr, 1.0f/sqrtf(float(n_embd_head)), il);
             }
 
-            if (il == n_layer - 1) {
+            if (il == n_layer - 1 && pooling_type == LLAMA_POOLING_TYPE_NONE) {
                 // skip computing output for unused tokens
                 ggml_tensor * inp_out_ids = build_inp_out_ids();
                 cur   = ggml_get_rows(ctx0,   cur, inp_out_ids);

Collaborator Author (ngxson):

No, that still doesn't work as I expected.

For example, if my sequence has only one output token, then I expect the inp tensor here to have shape [n_embd, 1], but in reality it has shape [n_embd, n_tokens].

Maybe I misunderstood something here?

Collaborator Author (ngxson):

Hmm, OK, I think I got it. The main problem is that Qwen's rerank model uses causal attention; it's simply a normal next-token generation model which outputs either a "yes" or a "no" token.

I think the assumption in llama.cpp is that CLS and RANK are non-causal, hence only the first token is marked as output.

Not sure what the best way to support this is, though.

Collaborator Author (@ngxson), Jun 6, 2025:

OK, found a hack around this: for Qwen3, I force the output position to last (only the position, not the pooling) in 030dc3b.

We should probably separate the notions of "pooling" and "output position" in the future.

Member (ggerganov):

> I think the assumption in llama.cpp is that CLS and RANK are non-causal, hence only the first token is marked as output

The idea is that the llm_build_ functions will compute the embeddings for all tokens in the batch. The notion of "output ids" is purely an optimization trick to avoid unnecessary computation in the last layer, and when doing any kind of pooling it should generally be disabled.

For Qwen3 rerank, what you seem to need is to pool using last and apply the classification head on the result - the latter is missing, so it has to be added. We just haven't encountered models with pooling last and a classification head at the same time.
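
To make the described order of operations concrete, here is a hedged sketch (illustrative function name, signature, and tensor names, not the actual llama.cpp graph code), using the same ggml ops that already appear in this PR:

    #include "ggml.h"

    // assumes `embd` holds the per-token embeddings and `inp_last_ids` holds
    // the index of the last token of each sequence (both names are made up)
    static ggml_tensor * build_rank_head_sketch(
            ggml_context * ctx0,
            ggml_tensor  * embd,          // [n_embd, n_tokens]
            ggml_tensor  * inp_last_ids,  // [n_seqs], I32 indices
            ggml_tensor  * cls_out) {     // [n_embd, n_labels] classification head
        // 1) pool "last": keep only the last token of each sequence
        ggml_tensor * pooled = ggml_get_rows(ctx0, embd, inp_last_ids); // [n_embd, n_seqs]
        // 2) apply the classification head on the pooled embedding
        ggml_tensor * logits = ggml_mul_mat(ctx0, cls_out, pooled);     // [n_labels, n_seqs]
        // 3) softmax over the labels ("yes"/"no" for qwen3)
        return ggml_soft_max(ctx0, logits);
    }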

And it seems we should remove LLAMA_POOLING_TYPE_RANK - it's a bit redundant. Instead CLS and LAST should do the same thing - i.e. apply a classification head if there is one.

Collaborator Author (ngxson):

Hmm, OK, I got it. The problem is that I don't have much time for the rest of the day. Do you think we can clean this up in a follow-up PR?

> And it seems we should remove LLAMA_POOLING_TYPE_RANK - it's a bit redundant. Instead CLS and LAST should do the same thing - i.e. apply a classification head if there is one.

I think having the notion of LLAMA_TASK_* would be useful. For example, pooling CLS can be used for task types CLS and RANK. This could also be used to block certain endpoints; for example, a rerank model should only support /rerank and not /embeddings or /completion. A rough sketch of the idea is below.
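
Purely illustrative sketch of the proposed notion (no such enum exists in llama.cpp today; the names are made up):

    // a task type, decoupled from the pooling type, could drive both graph
    // construction and which server endpoints are allowed for a given model
    enum llama_task_type {
        LLAMA_TASK_TYPE_COMPLETION, // /completion
        LLAMA_TASK_TYPE_EMBEDDING,  // /embeddings
        LLAMA_TASK_TYPE_CLS,        // classification
        LLAMA_TASK_TYPE_RANK,       // /rerank only
    };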

Member (ggerganov):

I think it's better to take the time and make it right; no need to merge it now.

@ngxson's comment was marked as outdated.

@ngxson (Collaborator Author) commented Jun 6, 2025

Hmm, no, the prompt is still not correct:

@@ -200,7 +200,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
     { LLM_KV_TOKENIZER_HF_JSON,         "tokenizer.huggingface.json" },
     { LLM_KV_TOKENIZER_RWKV,            "tokenizer.rwkv.world" },
     { LLM_KV_TOKENIZER_CHAT_TEMPLATE,   "tokenizer.chat_template" },
-    { LLM_KV_TOKENIZER_CHAT_TEMPLATE_N, "tokenizer.chat_template.%s" },
+    { LLM_KV_TOKENIZER_CHAT_TEMPLATE_N, "tokenizer.chat_template." }, // FIXME: cannot add %s because it will be replaced by arch name
Collaborator:

Actually, you can use LLM_KV_TOKENIZER_CHAT_TEMPLATE with a suffix:

llama.cpp/src/llama-arch.cpp

Lines 1722 to 1725 in c02f53d

if (suffix != nullptr) {
    name += ".";
    name += suffix;
}

Collaborator Author (ngxson):

I'm doing this, but it doesn't work; maybe it's buggy somewhere else:

    const auto key = name
        ? LLM_KV(model->arch, name)(LLM_KV_TOKENIZER_CHAT_TEMPLATE)
        : LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE);

Collaborator:

I was looking in the wrong place; this is where it's broken:

llama.cpp/src/llama-arch.cpp

Lines 1709 to 1712 in e83ba3e

std::string LLM_KV::operator()(llm_kv kv) const {
    return suffix ? ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch), suffix)
                  : ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch));
}
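
For reference, a hedged sketch of one way the suffix handling could be fixed (the actual fix, mentioned later in this thread, may well differ): format the arch-dependent key first, then append the suffix the same way the constructor snippet quoted above does:

    // sketch only, not the actual change
    std::string LLM_KV::operator()(llm_kv kv) const {
        std::string name = ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch));
        if (suffix != nullptr) {
            name += ".";
            name += suffix;
        }
        return name;
    }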

Collaborator:

Fixed in #14050

Comment on lines +13796 to +13797
    const auto key = name
        ? LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE_N) + std::string(name)
Collaborator:

Suggested change
-    const auto key = name
-        ? LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE_N) + std::string(name)
+    const auto key = name ? LLM_KV(model->arch, name)(LLM_KV_TOKENIZER_CHAT_TEMPLATE)

I wonder how long this has been broken?

Collaborator:

Ah, it seems this has never worked; it's been broken since it was introduced in #11016.

Collaborator Author (ngxson):

It's not used by any of the examples, so we never knew whether it worked in the first place (it's probably used in downstream projects, but I don't know).

Collaborator:

I'll make a PR.

@ngxson (Collaborator Author) commented Jun 6, 2025

Should be correct now:

{
    "query": "learn more about science",
    "texts": [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
}

Result:

[
    {
        "index": 0,
        "score": -11.903118133544922
    },
    {
        "index": 1,
        "score": -9.440960884094238
    }
]

@CISC (Collaborator) commented Jun 6, 2025

[
    {
        "index": 0,
        "score": -11.903118133544922
    },
    {
        "index": 1,
        "score": -9.440960884094238
    }
]

I don't know if there's a "standard" for this anywhere, but we should include the label in the response as well.

@ngxson (Collaborator Author) commented Jun 6, 2025

> I don't know if there's a "standard" for this anywhere, but we should include the label in the response as well.

Rerank only returns a single value per document (the score); there is no need to display the label. The higher the score, the "closer" the query is to the doc.

You see 2 indexes because the request has 2 docs; a request can contain multiple docs.

@ngxson (Collaborator Author) commented Jun 6, 2025

Also note that the 2 labels are only used for the softmax operation. We cannot use a single label, because it would fail in this case (see the example below):

  • First output: { yes: 0.5, no: 1.0 }
  • Second output: { yes: 0.5, no: 0.0 }
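
To see why (a hedged illustration, not llama.cpp code; the numbers above are treated as raw logits): the "yes" logit is identical in both outputs, yet the softmax scores differ once the "no" logit is taken into account.

    #include <cmath>
    #include <cstdio>

    // softmax probability of "yes" given the two logits
    static float p_yes(float yes, float no) {
        return std::exp(yes) / (std::exp(yes) + std::exp(no));
    }

    int main() {
        std::printf("first:  %.3f\n", p_yes(0.5f, 1.0f)); // ~0.378
        std::printf("second: %.3f\n", p_yes(0.5f, 0.0f)); // ~0.622
        return 0;
    }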

@CISC (Collaborator) commented Jun 6, 2025

I see, never mind for this model then, but it might be worth investigating multiple scores for other models?

@ngxson (Collaborator Author) commented Jun 6, 2025

One score per label has always been the way for CLS models, right?

You can optionally apply a softmax or mean-square normalization to make the scores more "normalized".

For the rerank task, the API is taken from the Jina API, so it doesn't make sense to add a second score; it risks breaking API compatibility.

@CISC (Collaborator) commented Jun 6, 2025

> One score per label has always been the way for CLS models, right?

Yes.

> For the rerank task, the API is taken from the Jina API, so it doesn't make sense to add a second score; it risks breaking API compatibility.

Ah, I was wondering about that, OK. It would be interesting to know of other APIs; another endpoint supporting this could be useful for sentiment ranking, etc.

@ngxson requested review from CISC and ggerganov on June 6, 2025 at 11:07
Labels: examples, python (python script changes), server

Successfully merging this pull request may close these issues.

Eval bug: Cannot load Qwen3 ranking models
3 participants