
[Doc] Document Matryoshka Representation Learning support #16770


Merged 8 commits on Apr 17, 2025
docs/source/models/pooling_models.md (73 additions, 0 deletions)
@@ -141,3 +141,76 @@ Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that
- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
- [Score API](#score-api) is similar to `LLM.score` for cross-encoder models.

## Matryoshka Representation Learning (MRL)

[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows the user to trade off between performance and cost.
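
Conceptually, an MRL-trained model packs the most important information into the leading dimensions of the embedding, so a client can keep only a prefix of the full vector. Below is a minimal sketch of that idea using plain NumPy with a hypothetical random vector; it is not vLLM API, just an illustration of the trade-off.

```python
import numpy as np

# Hypothetical full-size embedding from an MRL-trained model (random values
# here purely for illustration).
full_embedding = np.random.rand(1024).astype(np.float32)

# Keep only the leading 32 dimensions and re-normalize. Because MRL packs the
# most important information into the front of the vector, the shorter vector
# remains a usable embedding at a fraction of the storage and compute cost.
truncated = full_embedding[:32]
truncated = truncated / np.linalg.norm(truncated)
print(truncated.shape)  # (32,)
```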

### Offline Inference

You can change the output dimensions of embedding models that support MRL by using the `dimensions` parameter in `PoolingParams`.

```python
from vllm import LLM, PoolingParams

model = LLM(model="jinaai/jina-embeddings-v3",
            task="embed",
            trust_remote_code=True)
outputs = model.embed(["Follow the white rabbit."],
                      pooling_params=PoolingParams(dimensions=32))
print(outputs[0].outputs)
```
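
If you want to confirm the truncated size, checking `len(outputs[0].outputs.embedding)` should give 32 for this request (the exact output attribute layout may vary slightly between vLLM versions).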

A code example can be found here: <gh-file:examples/offline_inference/embed_matryoshka_fy.py>

### Online Inference

Use the following command to start the vLLM server.

```text
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```

You can change the output dimensions of embedding models that support MRL by using the `dimensions` parameter in the request.

```text
curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 1
  }'
```

Expected output:

```json
{"id":"embd-0aab28c384d348c3b8f0eb783109dc5f","object":"list","created":1744195454,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-1.0]}],"usage":{"prompt_tokens":10,"total_tokens":10,"completion_tokens":0,"prompt_tokens_details":null}}
```

An OpenAI client example can be found here: <gh-file:examples/online_serving/openai_embedding_matryoshka_fy.py>

### Warning

**Not all embedding models support MRL. Changing the output dimensions of a model that does not support MRL will lead to poor results. vLLM returns an error for any request that attempts to change the output dimensions (i.e., `dimensions` is not `None`) of a model that does not support MRL.**

For example, trying to change the output dimensions of the `BAAI/bge-m3` model results in the following error.

```json
{"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
```
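
When calling such a model through the OpenAI Python client, the same rejection surfaces as an exception that you can catch. A minimal sketch, assuming a server started with `vllm serve BAAI/bge-m3`:

```python
from openai import BadRequestError, OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

try:
    client.embeddings.create(
        input=["Follow the white rabbit."],
        model="BAAI/bge-m3",
        # BAAI/bge-m3 does not support MRL, so requesting truncation is rejected.
        dimensions=32,
    )
except BadRequestError as e:
    print(e)  # 400 error: model does not support matryoshka representation
```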

We hope that the open-source community will adopt the terms `is_matryoshka` or `matryoshka_dimensions` to denote whether a model supports Matryoshka Representation Learning (MRL).

### Manually enable MRL support

For models that support MRL but are not recognized by vLLM, you can manually enable MRL support by passing `hf_overrides={"is_matryoshka": True}` (offline) or `--hf_overrides '{"is_matryoshka":true}'` (online). Use this with caution.

For example, the following command starts the vLLM server with MRL support manually enabled.

```text
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf_overrides '{"is_matryoshka":true}'
```
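
For offline inference, the equivalent override can be passed as `hf_overrides` when constructing the `LLM`. A minimal sketch (the `dimensions` value below is illustrative):

```python
from vllm import LLM, PoolingParams

# Only mark a model as Matryoshka-capable if it was actually trained with MRL.
model = LLM(model="Snowflake/snowflake-arctic-embed-m-v1.5",
            task="embed",
            hf_overrides={"is_matryoshka": True})
outputs = model.embed(["Follow the white rabbit."],
                      pooling_params=PoolingParams(dimensions=256))
print(outputs[0].outputs)
```
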
examples/online_serving/openai_embedding_matryoshka_fy.py (31 additions, 0 deletions)
@@ -0,0 +1,31 @@
# SPDX-License-Identifier: Apache-2.0

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"


def main():
    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    responses = client.embeddings.create(
        input=["Follow the white rabbit."],
        model=model,
        dimensions=1,
    )

    for data in responses.data:
        print(data.embedding)  # List of float of len 1


if __name__ == "__main__":
    main()