llama : support qwen3 rerank and embeddings #14029
Conversation
src/llama-graph.cpp (Outdated)
cur = ggml_mul_mat(ctx0, cls_out, inp);
cur = ggml_soft_max(ctx0, cur); // qwen3 uses softmax on the output
@ggerganov I think there is a bug with build_inp_cls(). It is supposed to contain only the indices of the output tokens (the last token), but in this case it actually contains all tokens. This makes the output score incorrect at the moment, since it returns the score for the first token. WDYT?
I think you can make a quick fix for now like this, similar to build_bert():
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index afef84870..8b11197df 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -7043,7 +7043,7 @@ struct llm_build_qwen3 : public llm_graph_context {
Qcur, Kcur, Vcur, nullptr, nullptr, 1.0f/sqrtf(float(n_embd_head)), il);
}
- if (il == n_layer - 1) {
+ if (il == n_layer - 1 && pooling_type == LLAMA_POOLING_TYPE_NONE) {
// skip computing output for unused tokens
ggml_tensor * inp_out_ids = build_inp_out_ids();
cur = ggml_get_rows(ctx0, cur, inp_out_ids);
No, that still doesn't work as I expected.
For example, if my sequence has only one output token, then I expect the inp tensor here to have shape [n_embd, 1], but in reality it has shape [n_embd, n_tokens].
Maybe I misunderstood something here?
Hmm, ok, I think I got it. The main problem is that Qwen's rerank model uses causal attention; it's simply a normal next-token generation model which outputs either a "yes" or a "no" token.
I think the assumption in llama.cpp is that CLS and RANK are non-causal, hence only the first token is marked as output.
Not sure what the best way to support this is, though.
OK, found a hack around this: for Qwen3, I force the position to last (only the position, not the pooling) in 030dc3b.
We should probably separate the notions of "pooling" and "output position" in the future.
I think the assumption in llama.cpp is that CLS and RANK are non-causal, hence only the first token is marked as output
The idea is that the llm_build_ functions will compute the embeddings for all tokens in the batch. The notion of "output ids" is purely an optimization trick to avoid unnecessary computation in the last layer, and when doing any kind of pooling it should generally be disabled.
For Qwen3 rerank, what you seem to need is to pool using last and apply the classification head on the result - the latter is missing, so it has to be added. We just haven't encountered models with pooling last and a classification head at the same time.
And it seems we should remove LLAMA_POOLING_TYPE_RANK - it's a bit redundant. Instead, CLS and LAST should do the same thing - i.e. apply a classification head if there is one.
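(For context, a minimal sketch of the "pool using last, then apply the classification head" idea in graph-building terms. The tensor names cls_out, inp and cur follow the snippets above; treating build_inp_cls() as the per-sequence last-token indices is an assumption of this sketch, not existing behavior.)

// Sketch only: pool the last token of each sequence, then apply the
// classification head and soft-max, as suggested above.
ggml_tensor * inp_cls = build_inp_cls();                   // assumed: one index per sequence (its last token)
ggml_tensor * pooled  = ggml_get_rows(ctx0, inp, inp_cls); // [n_embd, n_seqs]
cur = ggml_mul_mat(ctx0, cls_out, pooled);                 // classification head
cur = ggml_soft_max(ctx0, cur);                            // qwen3 uses softmax on the output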
Hmm, ok, I got it. The problem is that I don't have much time for the rest of the day. Do you think we can clean this up in a follow-up PR?
And it seems we should remove LLAMA_POOLING_TYPE_RANK - it's a bit redundant. Instead CLS and LAST should do the same thing - i.e. apply a classification head if there is one.
I think having the notion of LLAMA_TASK_* would be useful. For example, pooling CLS can be used for task types CLS and RANK. This could also be used to block certain endpoints - for example, a rerank model should only support /rerank and not /embeddings or /completion.
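(A rough illustration of that idea; LLAMA_TASK_* does not exist in llama.cpp today, so everything below is hypothetical.)

// Hypothetical sketch only: the task, rather than the pooling type,
// would decide which server endpoints a model is allowed to serve.
enum llama_task {
    LLAMA_TASK_COMPLETION, // /completion
    LLAMA_TASK_EMBEDDING,  // /embeddings
    LLAMA_TASK_RERANK,     // /rerank (e.g. pooling CLS or LAST + classification head)
};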
Think it's better to take the time and make it right, no need to merge it now.
Hmm, no, the prompt is still not correct.
@@ -200,7 +200,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
     { LLM_KV_TOKENIZER_HF_JSON,            "tokenizer.huggingface.json" },
     { LLM_KV_TOKENIZER_RWKV,               "tokenizer.rwkv.world" },
     { LLM_KV_TOKENIZER_CHAT_TEMPLATE,      "tokenizer.chat_template" },
-    { LLM_KV_TOKENIZER_CHAT_TEMPLATE_N,    "tokenizer.chat_template.%s" },
+    { LLM_KV_TOKENIZER_CHAT_TEMPLATE_N,    "tokenizer.chat_template." }, // FIXME: cannot add %s because it will be replaced by arch name
Actually, you can use LLM_KV_TOKENIZER_CHAT_TEMPLATE with suffix:
Lines 1722 to 1725 in c02f53d
if (suffix != nullptr) {
    name += ".";
    name += suffix;
}
I'm doing this, but it doesn't work - maybe it's buggy somewhere else:
const auto key = name
? LLM_KV(model->arch, name)(LLM_KV_TOKENIZER_CHAT_TEMPLATE)
: LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE);
I was looking in the wrong place, this is where it's broken:
Lines 1709 to 1712 in e83ba3e
std::string LLM_KV::operator()(llm_kv kv) const {
    return suffix ? ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch), suffix)
                  : ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch));
}
Fixed in #14050
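(For reference, a sketch of one way the suffix could be handled - appending it after formatting, instead of relying on a second %s - not necessarily the exact change made in #14050.)

// Sketch only: format the arch-specific part first, then append the
// optional suffix, so keys like "tokenizer.chat_template" need no %s.
std::string LLM_KV::operator()(llm_kv kv) const {
    std::string name = ::format(LLM_KV_NAMES.at(kv), LLM_ARCH_NAMES.at(arch));
    if (suffix != nullptr) {
        name += ".";
        name += suffix;
    }
    return name;
}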
const auto key = name
    ? LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE_N) + std::string(name)
Suggested change:
-    const auto key = name
-        ? LLM_KV(model->arch)(LLM_KV_TOKENIZER_CHAT_TEMPLATE_N) + std::string(name)
+    const auto key = name ? LLM_KV(model->arch, name)(LLM_KV_TOKENIZER_CHAT_TEMPLATE)

I wonder how long this has been broken?
Ah, it seems it has never worked - it's been broken since it was introduced in #11016.
It's not used by any of the examples, so we don't know if it works in the first place (probably used in a downstream project, but idk).
I'll make a PR.
Should be correct now:

{
  "query": "learn more about science",
  "texts": [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
  ]
}

Result:

[
  {
    "index": 0,
    "score": -11.903118133544922
  },
  {
    "index": 1,
    "score": -9.440960884094238
  }
]
I don't know if there's a "standard" for this anywhere, but we should include
Rerank only returns a single label (the score), so there is no need to display the label. The higher the score, the "closer" the query is to the doc. You see 2 indexes because the query has 2 docs; a query can have multiple docs.
Also note that the 2 labels are only used for the softmax operation. We cannot have one single label because it will fail in this case:
I see, nvm for this model then, but it might be worth investigating multiple scores for other models?
One score per label has always been the way for CLS models, right? You can optionally add a softmax or mean square norm to make the score more "normalized". For the rerank task, the API is taken from the Jina API, so it doesn't make sense to add a second score - there's a risk of breaking API compat.
Yes.
Ah, I was wondering about that, ok. It would be interesting to know of other APIs - another endpoint supporting this could be useful for sentiment ranking etc.
Close #13820
Supersede #14028
Model: https://huggingface.co/Qwen/Qwen3-Reranker-4B
Uses POOLING_LAST. The two output labels are yes and no; the rerank score is the soft-maxed score of the "yes" label.
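(A minimal numeric sketch of that scoring rule - not the code from this PR: soft-max the model's logits for the "yes" and "no" tokens and keep the "yes" probability.)

#include <cmath>

// Sketch only: 2-way softmax over the "yes"/"no" logits, returning the
// probability of "yes" as the rerank score.
static float qwen3_rerank_score(float logit_yes, float logit_no) {
    const float m     = logit_yes > logit_no ? logit_yes : logit_no; // for numerical stability
    const float e_yes = std::exp(logit_yes - m);
    const float e_no  = std::exp(logit_no  - m);
    return e_yes / (e_yes + e_no); // in [0, 1]
}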