Description
What is the bug?
Querying with the hybrid query returns the number of hits (i.e. hits.total.value) that the subqueries have before pagination_depth is applied. The returned value would be correct if pagination_depth were high enough. If pagination_depth is sufficiently low, the actual number of results is lower than hits.total.value suggests.
That is a problem for us, because we want to know beforehand how many pages of results are navigable (with respect to the configured pagination_depth).
How can one reproduce the bug?
PUT /my-test-index
{
"settings": {
"index.knn": true
},
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"embedding": {
"type": "knn_vector",
"dimension": 3,
"method": {
"engine": "faiss",
"space_type": "l2",
"name": "hnsw",
"parameters": {}
}
},
"text": {
"type": "text"
}
}
}
}
PUT /my-test-index/_doc/1
{
"id": "1",
"embedding": [0.1, 0.1, 0.1],
"text": "Lorem"
}
PUT /my-test-index/_doc/2
{
"id": "2",
"embedding": [0.2, 0.2, 0.2],
"text": "ipsum"
}
PUT /my-test-index/_doc/3
{
"id": "3",
"embedding": [0.3, 0.3, 0.3],
"text": "dolor"
}
PUT /my-test-index/_doc/4
{
"id": "4",
"embedding": [0.4, 0.4, 0.4],
"text": "sit"
}
PUT /my-test-index/_doc/5
{
"id": "5",
"embedding": [0.5, 0.5, 0.5],
"text": "amet"
}
PUT /_search/pipeline/nlp-search-pipeline
{
"description": "Post processor for hybrid search",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": { "weights": [
0.5,
0.5
]}
}
}
}
]
}
POST /my-test-index/_search?search_pipeline=nlp-search-pipeline
{
"size": 10,
"from": 0,
"query": {
"hybrid": {
"pagination_depth": 2,
"queries": [
{
"match": {
"text": {
"query": "dolor"
}
}
},
{
"knn": {
"embedding": {
"vector": [1.0, 1.0, 1.0],
"k": 4
}
}
}
]
}
}
}
The request above returns a response with a hits.total.value of 4, although it contains only 3 hits and size was specified as 10 (> 4):
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 0.5,
"hits": [
{
"_index": "my-test-index",
"_id": "3",
"_score": 0.5,
"_source": {
"id": "3",
"embedding": [
0.3,
0.3,
0.3
],
"text": "dolor"
}
},
{
"_index": "my-test-index",
"_id": "5",
"_score": 0.5,
"_source": {
"id": "5",
"embedding": [
0.5,
0.5,
0.5
],
"text": "amet"
}
},
{
"_index": "my-test-index",
"_id": "4",
"_score": 0.0005,
"_source": {
"id": "4",
"embedding": [
0.4,
0.4,
0.4
],
"text": "sit"
}
}
]
}
}
What is the expected behavior?
The search (from the reproduction steps above) is expected to have 3 results:
- the match query returns one result from one shard ("dolor" aka id "3", because it matches the query)
- the knn query would return four results, but returns only two from one shard since pagination_depth is 2 ("sit" aka id "4" and "amet" aka id "5", because their embeddings are closest to [1,1,1])
- the results of both subqueries do not overlap, so there are 3 results in total
- therefore hits.total.value is expected to be 3 (not 4)
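
The counting argument above can be checked with a small Python sketch. It is only a simplified model of the reasoning, not OpenSearch's implementation: the function names are hypothetical, the match query is approximated by case-insensitive equality on the text field, and the knn query by a brute-force L2 ranking over the five test documents.

```python
import math

# Test documents from the reproduction steps: id -> (embedding, text)
DOCS = {
    "1": ([0.1, 0.1, 0.1], "Lorem"),
    "2": ([0.2, 0.2, 0.2], "ipsum"),
    "3": ([0.3, 0.3, 0.3], "dolor"),
    "4": ([0.4, 0.4, 0.4], "sit"),
    "5": ([0.5, 0.5, 0.5], "amet"),
}


def match_hits(term):
    """Hypothetical stand-in for the match query: case-insensitive equality."""
    return [doc_id for doc_id, (_, text) in DOCS.items()
            if text.lower() == term.lower()]


def knn_hits(vector, k):
    """Hypothetical stand-in for the knn query: brute-force L2 ranking."""
    ranked = sorted(DOCS, key=lambda doc_id: math.dist(DOCS[doc_id][0], vector))
    return ranked[:k]


def expected_total(subquery_hits, pagination_depth):
    """Count distinct ids after truncating each subquery to pagination_depth."""
    distinct = set()
    for hits in subquery_hits:
        distinct.update(hits[:pagination_depth])
    return len(distinct)


subqueries = [match_hits("dolor"), knn_hits([1.0, 1.0, 1.0], k=4)]

# With pagination_depth = 2 only ids 3, 4 and 5 remain.
print(expected_total(subqueries, pagination_depth=2))

# With a depth large enough to keep all subquery hits, ids 2, 3, 4 and 5 remain.
print(expected_total(subqueries, pagination_depth=10))
```

Under these assumptions the first call yields 3 and the second yields 4, which matches the observation that the reported value would only be correct for a sufficiently high pagination_depth.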
Implications on pagination
If we set a size of 3, so that we get 3 documents per page, we would get the first page back with 3 results (as above) and a hits.total.value of 4. We would then assume that there is a second page containing the fourth result, but there isn't. The user would be presented with an empty second page when navigating there.
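
To make the arithmetic explicit, here is a minimal sketch of how a client might derive the page count from the total. The helper page_count is hypothetical and only illustrates the usual ceiling division; it is not part of any OpenSearch client.

```python
import math


def page_count(total_hits, page_size):
    """Number of pages needed to show total_hits with page_size hits per page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(total_hits / page_size)


page_size = 3
reported_total = 4  # hits.total.value returned by the search
actual_total = 3    # number of hits that can actually be retrieved

print(page_count(reported_total, page_size))  # 2: one page too many
print(page_count(actual_total, page_size))    # 1: the only page with results
```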
What is your host/environment?
OpenSearch docker image 2.19.2