
Commit dbeb036

[Doc] Document Matryoshka Representation Learning support
1 parent 8cac35b

File tree

2 files changed: +104 −0 lines changed

docs/source/models/pooling_models.md

Lines changed: 73 additions & 0 deletions
@@ -141,3 +141,76 @@ Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints tha
- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
- [Score API](#score-api) is similar to `LLM.score` for cross-encoder models.

## Matryoshka Representation Learning (MRL)

[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows users to trade off between embedding quality and cost: smaller output dimensions are cheaper to store and compare, at some loss in accuracy.
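Conceptually, an MRL-trained model packs the most important information into the leading dimensions of the embedding, so a smaller embedding can be obtained by keeping only the first `k` components and L2-renormalizing. A minimal stdlib-only sketch of that post-processing step (the vector here is illustrative, not output from a real model, and whether a given server renormalizes after truncation depends on the implementation):

```python
import math


def truncate_embedding(full, dimensions):
    """Keep the first `dimensions` components and L2-renormalize.

    This mirrors the post-processing an MRL-aware server applies when a
    request asks for fewer dimensions than the model's native size.
    """
    truncated = full[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated] if norm > 0 else truncated


# Illustrative 8-dim "full" embedding, truncated to 4 dims.
full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
small = truncate_embedding(full, 4)
print(len(small))                  # 4
print(sum(x * x for x in small))   # 1.0 (unit norm restored)
```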
### Offline Inference

You can change the output dimensions of embedding models that support MRL by using the `dimensions` parameter in `PoolingParams`.

```python
from vllm import LLM, PoolingParams

model = LLM(model="jinaai/jina-embeddings-v3",
            task="embed",
            trust_remote_code=True)
outputs = model.embed(["Follow the white rabbit."],
                      pooling_params=PoolingParams(dimensions=32))
print(outputs[0].outputs)
```

A code example can be found here: <gh-file:examples/offline_inference/embed_matryoshka_fy.py>
### Online Inference

Use the following command to start the vLLM server:

```text
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```

You can change the output dimensions of embedding models that support MRL by using the `dimensions` parameter in the request.

```text
curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 1
  }'
```
Expected output:

```json
{"id":"embd-0aab28c384d348c3b8f0eb783109dc5f","object":"list","created":1744195454,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-1.0]}],"usage":{"prompt_tokens":10,"total_tokens":10,"completion_tokens":0,"prompt_tokens_details":null}}
```
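The embedding itself lives under `data[0].embedding` in the response body. A small stdlib-only sketch of extracting it (the JSON literal below is abbreviated from the expected output above):

```python
import json

# Abbreviated version of the response body shown above.
body = ('{"object":"list","model":"jinaai/jina-embeddings-v3",'
        '"data":[{"index":0,"object":"embedding","embedding":[-1.0]}],'
        '"usage":{"prompt_tokens":10,"total_tokens":10}}')

response = json.loads(body)
embedding = response["data"][0]["embedding"]
print(len(embedding))  # 1, matching the requested "dimensions": 1
```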
An OpenAI client example can be found here: <gh-file:examples/online_serving/openai_embedding_matryoshka_fy.py>

### Warning

**Not all embedding models support MRL. Changing the output dimensions of a model that does not support MRL will lead to poor results, so vLLM returns an error for any request that attempts to change the output dimensions (i.e. `dimensions` is not `None`) of such a model.**

For example, trying to change the output dimensions of the `BAAI/bge-m3` model results in the following error:

```json
{"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
```

We hope that the open-source community will adopt the terms `is_matryoshka` or `matryoshka_dimensions` to denote whether a model is compatible with Matryoshka Representation Learning (MRL).
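Under that convention, MRL support would be advertised in a model's Hugging Face `config.json`. A minimal sketch of such a check over an already-loaded config dict (the sample configs are illustrative, not taken from real models):

```python
def supports_mrl(hf_config: dict) -> bool:
    """Heuristic check for MRL support in a HF config dict.

    A model is treated as MRL-capable if it sets `is_matryoshka`
    or declares a list of `matryoshka_dimensions`.
    """
    return bool(hf_config.get("is_matryoshka")) or bool(
        hf_config.get("matryoshka_dimensions"))


print(supports_mrl({"is_matryoshka": True}))                   # True
print(supports_mrl({"matryoshka_dimensions": [32, 64, 128]}))  # True
print(supports_mrl({"hidden_size": 1024}))                     # False
```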
### Manually support MRL

For models that support MRL but are not recognized by vLLM, please manually enable MRL support with `hf_overrides={"is_matryoshka": True}` (offline) or `--hf_overrides '{"is_matryoshka":true}'` (online), and use this override with caution.

For example, the following command starts a vLLM server with MRL support manually enabled:

```text
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf_overrides '{"is_matryoshka":true}'
```
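Note that the CLI form of the override is simply the JSON encoding of the Python dict used offline; a quick stdlib-only sanity check of that correspondence:

```python
import json

# Offline form: the dict passed to LLM(..., hf_overrides=...).
hf_overrides = {"is_matryoshka": True}

# Online form: the string passed to --hf_overrides on the CLI.
cli_value = json.dumps(hf_overrides)
print(cli_value)  # {"is_matryoshka": true}
```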
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# SPDX-License-Identifier: Apache-2.0

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"


def main():
    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    responses = client.embeddings.create(
        input=["Follow the white rabbit."],
        model=model,
        dimensions=1,
    )

    for data in responses.data:
        print(data.embedding)  # List of floats of length 1


if __name__ == "__main__":
    main()
