[FEATURE] Support for asymmetric embedding models

**Is your feature request related to a problem?**
OpenSearch currently supports only symmetric text embedding models, such as `sentence-transformers/all-MiniLM-L12-v2`. These models treat queries and passages identically at inference time. Although they can be trained on datasets of queries and passages so that they learn similar representations for both (e.g. `sentence-transformers/msmarco-MiniLM-L-12-v3`), the best-performing embedding models on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard) are models that offer different inference "APIs" for queries and passages.

To support these models, OpenSearch needs a way to define different inference mechanisms for passage embedding and query embedding.

**What solution would you like?**
Prominent asymmetric models use string prefixes to prime the model to embed queries and passages differently (cf. https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder#using-transformers and https://huggingface.co/intfloat/e5-large-v2#usage). To support this kind of model, I propose introducing model-specific "embedding templates". These would be part of the model metadata and would be used to "format" the input for the model before running inference.

E.g. for the [e5-family](https://github.com/microsoft/unilm/tree/master/e5) of models, the query and passage templates could look like this:

| type | template |
|---|---|
| query |  `query: %s` |
| passage | `passage: %s` |

I propose to add optional fields to the model configuration in the `_register` endpoint in ml-commons. E.g.:

```
POST /_plugins/_ml/models/_register
{
  ...,
  "model_config": {
    ...,
    "query_template" : "query: %s",
    "passage_template" : "passage: %s",
    ...
  },
  ...
}
```

ml-commons already distinguishes datasets by type (cf. [SearchQueryInputDataset](https://github.com/opensearch-project/ml-commons/blob/main/common/src/main/java/org/opensearch/ml/common/dataset/SearchQueryInputDataset.java) and [TextDocsInputDataSet](https://github.com/opensearch-project/ml-commons/blob/main/common/src/main/java/org/opensearch/ml/common/dataset/TextDocsInputDataSet.java)). At inference time, it should be possible to check whether a particular model has templates and, depending on the dataset type, apply the matching one using regular Java format strings.
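To make the idea concrete, here is a minimal sketch of how the template could be applied at inference time. All names in it (`EmbeddingTemplates`, `ModelTemplates`, `InputKind`, `applyTemplate`) are hypothetical and are not part of ml-commons; `InputKind` merely stands in for the existing dataset-type distinction. It assumes a missing template means the text is passed through unchanged, which would keep symmetric models working as before.

```java
/** Hypothetical sketch; none of these types exist in ml-commons. */
final class EmbeddingTemplates {

    /** Stand-in for the query/passage dataset-type distinction. */
    enum InputKind { QUERY, PASSAGE }

    /** Holds the two optional templates from the model configuration. */
    static final class ModelTemplates {
        final String queryTemplate;   // e.g. "query: %s", or null if unset
        final String passageTemplate; // e.g. "passage: %s", or null if unset

        ModelTemplates(String queryTemplate, String passageTemplate) {
            this.queryTemplate = queryTemplate;
            this.passageTemplate = passageTemplate;
        }

        String templateFor(InputKind kind) {
            return kind == InputKind.QUERY ? queryTemplate : passageTemplate;
        }
    }

    /**
     * Formats the input text with the template matching its kind.
     * Returns the text unchanged when the model has no template for it.
     */
    static String applyTemplate(ModelTemplates templates, InputKind kind, String text) {
        String template = templates == null ? null : templates.templateFor(kind);
        if (template == null) {
            return text; // symmetric model: no formatting needed
        }
        // The text is passed as an argument, so '%' inside it is not interpreted.
        return String.format(template, text);
    }

    public static void main(String[] args) {
        ModelTemplates e5 = new ModelTemplates("query: %s", "passage: %s");
        System.out.println(applyTemplate(e5, InputKind.QUERY, "what is OpenSearch?"));
        System.out.println(applyTemplate(e5, InputKind.PASSAGE, "OpenSearch is a search engine."));
        System.out.println(applyTemplate(null, InputKind.QUERY, "unchanged input"));
    }
}
```

One consequence of using plain format strings is that a literal `%` in a template would have to be written as `%%`; validating templates at registration time might be worth considering.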

OpenSearch neural-search would then need to be extended to make sure it uses the correct dataset type for queries and passages. Currently it uses the same type regardless of the use case: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java#L248.

**What alternatives have you considered?**
The proposed change has a small surface and only extends the API. I couldn't think of any competitive alternative solution.

**Do you have any additional context?**
No.


[FEATURE] Support for asymmetric embedding models #1799
