llama : support qwen3 rerank and embeddings #14029
Open
ngxson wants to merge 10 commits into ggml-org:master from ngxson:xsn/qwen3_embd_rerank
+171
−57
Commits (10):
- 3f3b9a2 model : support qwen3 rerank and embeddings (ngxson)
- f8fd440 Merge branch 'master' into xsn/qwen3_embd_rerank (ngxson)
- e0eb4b8 correct labels (ngxson)
- 030dc3b correct output token position (ngxson)
- f8facb3 Merge branch 'master' into xsn/qwen3_embd_rerank (ngxson)
- 0777cd3 rm redundant settings (ngxson)
- 8edd2cf add embedding model tokenizer chkhsh (ngxson)
- c02f53d add rerank prompt (ngxson)
- c2f4dc7 fix prompt (ngxson)
- cbb6f20 fix style (ngxson)
Conversations
@ggerganov I think there is a bug with `build_inp_cls()`. It is supposed to contain only the indices of the output tokens (the last token), but in this case it actually contains all tokens. This makes the output score incorrect at the moment, as it returns the score for the first token. WDYT?
I think you can make a quick fix for now like this, similar to `build_bert()`:
No, that still doesn't work as I expected.

For example, if my sequence has only one output token, then I expect the `inp` tensor here to have shape `[n_embd, 1]`, but in reality it has shape `[n_embd, n_tokens]`.

Maybe I misunderstood something here?
Hmm, ok I think I got it. The main problem is that Qwen's rerank model uses causal attention; it's simply a normal next-token generation model which outputs either a "yes" or a "no" token.

I think the assumption in llama.cpp is that CLS and RANK are non-causal, hence only the first token is marked as output.

Not sure what's the best way to support this, though.
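As an illustration of the scoring scheme described above (a hypothetical sketch, not llama.cpp code): for a causal "yes"/"no" rerank model, the relevance score is typically the probability of the "yes" token under a two-way softmax over the "yes" and "no" logits at the last position.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical illustration (not llama.cpp code): score a document's
// relevance as P("yes") from the last-position logits of the "yes" and
// "no" tokens, via a numerically stable two-way softmax.
double rerank_score(double logit_yes, double logit_no) {
    const double m     = std::max(logit_yes, logit_no); // subtract max for stability
    const double e_yes = std::exp(logit_yes - m);
    const double e_no  = std::exp(logit_no  - m);
    return e_yes / (e_yes + e_no);
}
```

Equal logits give a score of 0.5, and the score increases monotonically with `logit_yes - logit_no`.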
OK, I found a hack around this: for Qwen3, I force the position to last (only the position, not the pooling) in 030dc3b.

Probably we should separate the notions of "pooling" and "output position" in the future.
The idea is that the `llm_build_` functions will compute the embeddings for all tokens in the batch. The notion of "output ids" is purely an optimization trick to avoid unnecessary computation in the last layer, and when doing any kind of pooling it should generally be disabled.

For Qwen3 rerank, what you seem to need is to pool using `last` and apply the classification head on the result; the latter is missing, so it has to be added. We just haven't encountered models with pooling `last` and a classification head at the same time.

And it seems we should remove `LLAMA_POOLING_TYPE_RANK`; it's a bit redundant. Instead, `CLS` and `LAST` should do the same thing, i.e. apply a classification head if there is one.
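The combination described above can be sketched as follows (a minimal standalone illustration, not llama.cpp's actual API): pool with `last` by taking the final token's embedding from a column-major `[n_embd, n_tokens]` buffer, then apply a linear classification head to produce the class scores.

```cpp
#include <cstddef>
#include <vector>

// Sketch only (not llama.cpp's actual API): LAST pooling followed by a
// linear classification head. `embd` holds n_embd * n_tokens floats,
// column-major; `head_w` holds n_cls * n_embd head weights, row-major.
std::vector<float> pool_last_and_classify(
        const std::vector<float> & embd,
        int n_embd, int n_tokens,
        const std::vector<float> & head_w,
        int n_cls) {
    // LAST pooling: point at the final token's column
    const float * last = embd.data() + (size_t)(n_tokens - 1) * n_embd;

    // classification head: scores = head_w * last
    std::vector<float> scores(n_cls, 0.0f);
    for (int c = 0; c < n_cls; ++c) {
        for (int i = 0; i < n_embd; ++i) {
            scores[c] += head_w[(size_t)c * n_embd + i] * last[i];
        }
    }
    return scores;
}
```

The point of the sketch is that the two steps are orthogonal: the same head could equally be applied after `CLS` pooling by pointing at column 0 instead.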
Hmm, ok I got it. The problem is that I don't have much time for the rest of the day. Do you think we can clean this up in a follow-up PR?

I think having the notion of `LLAMA_TASK_*` would be useful. For example, pooling CLS can be used for the task types CLS and RANK. This can also be useful to block certain endpoints; for example, a rerank model should only support `/rerank` and not `/embeddings` or `/completion`.
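The endpoint-gating idea could look something like the following. This is purely a hypothetical sketch: `LLAMA_TASK_*` does not exist in llama.cpp at the time of this discussion, and the enum values and endpoint names are illustrative.

```cpp
#include <string>

// Hypothetical task types (these names do not exist in llama.cpp;
// they only illustrate the proposal in the comment above).
enum llama_task {
    LLAMA_TASK_COMPLETION,
    LLAMA_TASK_EMBEDDING,
    LLAMA_TASK_RERANK,
};

// A server could reject endpoints the model is not meant to serve,
// e.g. a rerank model only answers /rerank.
bool endpoint_allowed(llama_task task, const std::string & endpoint) {
    switch (task) {
        case LLAMA_TASK_RERANK:     return endpoint == "/rerank";
        case LLAMA_TASK_EMBEDDING:  return endpoint == "/embeddings";
        case LLAMA_TASK_COMPLETION: return endpoint == "/completion";
    }
    return false;
}
```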
Think it's better to take the time and make it right, no need to merge it now.