Description
Is your feature request related to a problem?
OpenSearch currently only supports symmetric text embedding models, such as sentence-transformers/all-MiniLM-L12-v2. These models treat queries and passages identically at inference time. While they can be trained on datasets of queries and passages so that they learn similar representations for both (e.g. sentence-transformers/msmarco-MiniLM-L-12-v3), the best-performing embedding models on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard) offer different inference "APIs" for queries and passages.
To be able to support these models in OpenSearch we need to be able to define different inference mechanisms for the passage embedding and the query embedding.
What solution would you like?
Prominent asymmetric models use string prefixes to prime the model to embed queries and passages differently. Cf. https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder#using-transformers or https://huggingface.co/intfloat/e5-large-v2#usage. To be able to support this kind of asymmetric model, I propose to introduce model-specific "embedding templates". These should be part of the model metadata and can be used to "format" the input for the model before running inference.
E.g. for the e5 family of models, the templates could look like this:
| type | template |
|---|---|
| query | query: %s |
| passage | passage: %s |
I propose to add optional fields to the model configuration in the _register endpoint in ml-commons. E.g.:
POST /_plugins/_ml/models/_register
{
...,
"model_config": {
...,
"query_template" : "query: %s",
"passage_template" : "passage: %s",
...
},
...
}
ml-commons already distinguishes datasets by type (cf. SearchQueryInputDataset and TextDocsInputDataSet). At inference time, it should be possible to check whether a particular model has templates and, depending on the dataset type, apply the correct one using regular Java format strings.
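To make the idea concrete, here is a minimal, self-contained sketch of how the template could be chosen and applied. It is not ml-commons code: the names EmbeddingTemplates, InputKind, TemplateConfig and applyTemplate are hypothetical stand-ins for illustration, and InputKind only mimics the distinction between the two dataset types. A model without templates is passed through unchanged, so symmetric models would keep their current behavior.

```java
// Hypothetical sketch, not part of ml-commons: all names here are made up for illustration.
final class EmbeddingTemplates {

    // Stand-in for the query/passage distinction carried by the dataset type.
    public enum InputKind { QUERY, PASSAGE }

    // Stand-in for the two optional model_config fields proposed above.
    public static final class TemplateConfig {
        final String queryTemplate;   // e.g. "query: %s", or null if the model has none
        final String passageTemplate; // e.g. "passage: %s", or null if the model has none

        public TemplateConfig(String queryTemplate, String passageTemplate) {
            this.queryTemplate = queryTemplate;
            this.passageTemplate = passageTemplate;
        }
    }

    // Picks the template matching the input kind and formats the text with it.
    // Returns the text unchanged when no template is configured.
    public static String applyTemplate(TemplateConfig config, InputKind kind, String text) {
        String template = (kind == InputKind.QUERY)
                ? config.queryTemplate
                : config.passageTemplate;
        if (template == null) {
            return text; // symmetric model: nothing to apply
        }
        // The user text is passed as an argument, so '%' characters in it are not interpreted.
        return String.format(template, text);
    }

    public static void main(String[] args) {
        TemplateConfig e5 = new TemplateConfig("query: %s", "passage: %s");
        TemplateConfig symmetric = new TemplateConfig(null, null);

        System.out.println(applyTemplate(e5, InputKind.QUERY, "how do I register a model?"));
        System.out.println(applyTemplate(e5, InputKind.PASSAGE, "Models are registered via the _register endpoint."));
        System.out.println(applyTemplate(symmetric, InputKind.QUERY, "how do I register a model?"));
    }
}
```

One design question this sketch leaves open is validation: since the template itself is a format string, a registered template with anything other than a single %s placeholder would either throw or silently drop the input, so it may be worth checking templates at registration time.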
OpenSearch neural-search would then need to be extended to make sure it uses the correct dataset type for queries and passages. Currently it uses the same type regardless of the use case: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java#L248.
What alternatives have you considered?
The proposed change has a small surface and only extends the API. I couldn't think of any competitive alternative solution.
Do you have any additional context?
No.