[BUG] Wrong number of total hits when using hybrid query #1346

Open
@hkohlsaat

What is the bug?

A hybrid query returns the total hit count (i.e. hits.total.value) that the subqueries produce before pagination_depth is applied. The returned value would be correct if pagination_depth were high enough. If pagination_depth is low enough to cut off subquery results, the actual number of results is lower than hits.total.value suggests.

That is a problem for us because we want to know in advance how many pages of results are navigable (with respect to the configured pagination_depth).

How can one reproduce the bug?

PUT /my-test-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "embedding": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "engine": "faiss",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "text": {
        "type": "text"
      }
    }
  }
}

PUT /my-test-index/_doc/1
{
  "id": "1",
  "embedding": [0.1, 0.1, 0.1],
  "text": "Lorem"
}

PUT /my-test-index/_doc/2
{
  "id": "2",
  "embedding": [0.2, 0.2, 0.2],
  "text": "ipsum"
}

PUT /my-test-index/_doc/3
{
  "id": "3",
  "embedding": [0.3, 0.3, 0.3],
  "text": "dolor"
}

PUT /my-test-index/_doc/4
{
  "id": "4",
  "embedding": [0.4, 0.4, 0.4],
  "text": "sit"
}

PUT /my-test-index/_doc/5
{
  "id": "5",
  "embedding": [0.5, 0.5, 0.5],
  "text": "amet"
}

PUT /_search/pipeline/nlp-search-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.5,
              0.5
            ]
          }
        }
      }
    }
  ]
}

POST /my-test-index/_search?search_pipeline=nlp-search-pipeline
{
  "size": 10,
  "from": 0,
  "query": {
    "hybrid": {
      "pagination_depth": 2,
      "queries": [
        {
          "match": {
            "text": {
              "query": "dolor"
            }
          }
        },
        {
          "knn": {
            "embedding": {
              "vector": [1.0, 1.0, 1.0],
              "k": 4
            }
          }
        }
      ]
    }
  }
}

returns a response with hits.total.value of 4, although it contains only 3 results and size was specified as 10 (>4).

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 0.5,
    "hits": [
      {
        "_index": "my-test-index",
        "_id": "3",
        "_score": 0.5,
        "_source": {
          "id": "3",
          "embedding": [
            0.3,
            0.3,
            0.3
          ],
          "text": "dolor"
        }
      },
      {
        "_index": "my-test-index",
        "_id": "5",
        "_score": 0.5,
        "_source": {
          "id": "5",
          "embedding": [
            0.5,
            0.5,
            0.5
          ],
          "text": "amet"
        }
      },
      {
        "_index": "my-test-index",
        "_id": "4",
        "_score": 0.0005,
        "_source": {
          "id": "4",
          "embedding": [
            0.4,
            0.4,
            0.4
          ],
          "text": "sit"
        }
      }
    ]
  }
}
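The mismatch in this response can be detected on the client side. The following is a minimal sketch, not part of OpenSearch or any client library: `total_is_overstated` is a hypothetical helper that takes the parsed response body as a plain dict and assumes that a page shorter than `size` means the result set is exhausted.

```python
def total_is_overstated(response, size, from_=0):
    """Return True if hits.total.value claims more hits than can exist.

    Hypothetical helper: `response` is the parsed search response body.
    Assumption: a page shorter than `size` is the last page, so the real
    total is from_ + number of returned hits.
    """
    hits = response["hits"]
    returned = len(hits["hits"])
    reported = hits["total"]["value"]
    return returned < size and from_ + returned < reported


# Trimmed copy of the response above (only the fields the check reads).
response = {
    "hits": {
        "total": {"value": 4, "relation": "eq"},
        "hits": [{"_id": "3"}, {"_id": "5"}, {"_id": "4"}],
    }
}

overstated = total_is_overstated(response, size=10)
```

Note that this check only works when the page comes back shorter than `size`. With `size` set to 3, the same response fills the page completely and the overcount cannot be detected from the first page alone.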

What is the expected behavior?

The search from the reproduction section above is expected to return 3 results:

  • the match query returns one result from one shard ("dolor" aka id "3", because it matches the query)
  • the knn query would return four results, but returns only two from one shard since pagination_depth is 2 ("sit" aka id "4" and "amet" aka id "5", because their embeddings are closest to [1,1,1])
  • the results of both subqueries do not overlap, so there are 3 results total
  • therefore hits.total.value is expected to be 3 (not 4)
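The counting argument above can be written out as a small model. This is a sketch of the behavior the issue expects, not of how OpenSearch computes the total: `expected_total_hits` is a hypothetical function, and the knn ranking (ids 5, 4, 3, 2) is derived from the L2 distances of the sample embeddings to [1.0, 1.0, 1.0].

```python
def expected_total_hits(subquery_hits_per_shard, pagination_depth):
    """Count unique documents left after capping each subquery per shard.

    Hypothetical model: `subquery_hits_per_shard` has one entry per shard,
    and each entry holds one ranked list of document ids per subquery
    (best match first).
    """
    unique_ids = set()
    for shard in subquery_hits_per_shard:
        for ranked_ids in shard:
            # Each subquery contributes at most pagination_depth hits per shard.
            unique_ids.update(ranked_ids[:pagination_depth])
    return len(unique_ids)


# Single shard from the reproduction:
# the match query finds "3"; the knn query (k=4) ranks 5, 4, 3, 2.
shard = [["3"], ["5", "4", "3", "2"]]

capped_total = expected_total_hits([shard], pagination_depth=2)
uncapped_total = expected_total_hits([shard], pagination_depth=10)
```

Under these assumptions the capped total is 3 (the expected value), while the uncapped total is 4, which matches the hits.total.value reported in the response.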

Implications on pagination

If we set size to 3 so that we get 3 documents per page, the first page comes back with 3 results (as above) and a hits.total.value of 4. We would then assume that there is a second page with the fourth result, but there isn't. The user would be presented with an empty second page when navigating there.
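The page arithmetic behind this can be sketched as follows. `page_count` is a hypothetical helper that mirrors how a typical UI would derive the number of pages from the total hit count.

```python
import math


def page_count(total_hits, page_size):
    """Number of pages needed to show total_hits at page_size per page."""
    return math.ceil(total_hits / page_size)


pages_from_reported_total = page_count(4, 3)  # based on hits.total.value = 4
pages_actually_available = page_count(3, 3)   # based on the 3 real results
```

With the reported total the UI offers 2 pages, although only 1 page of results exists.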

What is your host/environment?

OpenSearch docker image 2.19.2
