Description
**Is your feature request related to a problem? Please describe.**

I recently came across a use case where someone had a Boolean query like:
"bool": {
"filter" : [
{
"term" : {
// Some term query
},
},
{
"range" : {
// Some time range query
}
},
{
"range" : {
"high_cardinality_keyword_field" : {
"lt": "sadhgjhasgjhdasgh" // Some guid term
}
}
}
]
}
The last range query over a keyword field takes forever just to get started, before even creating the Lucene `Scorer`. I dug into it and it's because it creates a `TermRangeQuery`, which is a subclass of `MultiTermQuery`. The `rewrite` method for the `Weight` created by that query ends up building a bitset union of all matching terms.

A much better approach would be to let the other filter terms "drive" the scorer forward, retrieve each hit's doc value for `high_cardinality_keyword_field`, and check that against the range predicate. A similar approach is used for range queries over various numeric fields, using Lucene's `IndexOrDocValuesQuery`. (Here is the code for the float field type.)
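For reference, the numeric pattern boils down to something like the sketch below (written against Lucene's public API; this is not OpenSearch's actual field-type code, and the helper name is made up):

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

class NumericRangeExample {
    // Pair a points-based range query (cheap when the range itself is
    // selective) with a doc-values range query (cheap when some other
    // clause drives iteration), and let IndexOrDocValuesQuery pick
    // between them per segment based on cost.
    static Query longRange(String field, long lower, long upper) {
        return new IndexOrDocValuesQuery(
                LongPoint.newRangeQuery(field, lower, upper),
                SortedNumericDocValuesField.newSlowRangeQuery(field, lower, upper));
    }
}
```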
**Describe the solution you'd like**

I would like OpenSearch to build a keyword range query that will decide (per segment) whether it's better to use the terms index (e.g. if there is a small number of terms and computing the union is likely to be inexpensive) or a doc values-based range query (probably `SortedSetDocValuesField.newSlowRangeQuery`). Obviously, if the given field doesn't have doc values, then we should just fall back to the existing behavior.
As a bonus, we could support range queries over unindexed keyword fields by using the doc values query instead.
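Naively, the keyword version of that pairing might look like this hypothetical helper (using Lucene's public API; as noted under "Additional context", this pairing alone doesn't solve the cost-estimation problem):

```java
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.util.BytesRef;

class KeywordRangeExample {
    // Hypothetical keyword analogue of the numeric pattern: a terms-index
    // range query paired with a SortedSet doc-values range query.
    static Query keywordRange(String field, BytesRef lower, BytesRef upper,
                              boolean includeLower, boolean includeUpper) {
        Query indexQuery = new TermRangeQuery(field, lower, upper, includeLower, includeUpper);
        Query dvQuery = SortedSetDocValuesField.newSlowRangeQuery(
                field, lower, upper, includeLower, includeUpper);
        return new IndexOrDocValuesQuery(indexQuery, dvQuery);
    }
}
```

(A real implementation would also need the fallback described above for fields without doc values.)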
**Describe alternatives you've considered**
In some cases, we might be able to move the query into a post-filter, but in the specific case where I saw this recently, the boolean query above was just one clause in a bigger query. We didn't want to apply the term range query as a filter on the whole result, just to that sub-query.
There are also cases where we could approximate the keyword field by mapping it to some number that partially preserves the order (i.e. map `a -> n_a`, where `n_a < n_b` implies that `a < b`, but you may have `a < b` where `n_a = n_b`). For example, you could take the first 8 bytes of each keyword and treat that as an unsigned long. Then you could take advantage of the existing optimization for numeric fields with doc values.
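To make the 8-byte idea concrete, here's a small sketch (`encodePrefix` is a made-up name): pack the first 8 UTF-8 bytes big-endian into a `long` and compare with `Long.compareUnsigned`.

```java
import java.nio.charset.StandardCharsets;

class KeywordPrefix {
    // Pack the first 8 UTF-8 bytes of a keyword into a long, big-endian,
    // zero-padded. Comparing results with Long.compareUnsigned preserves
    // the byte order, but keywords that share their first 8 bytes collapse
    // to the same value -- the lossy n_a = n_b case above.
    static long encodePrefix(String keyword) {
        byte[] bytes = keyword.getBytes(StandardCharsets.UTF_8);
        long packed = 0L;
        for (int i = 0; i < 8; i++) {
            int b = i < bytes.length ? (bytes[i] & 0xFF) : 0;
            packed = (packed << 8) | b;
        }
        return packed;
    }
}
```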
**Additional context**

Note that this probably isn't as simple as passing the existing `TermRangeQuery` and a doc values query to an `IndexOrDocValuesQuery`, because the `IndexOrDocValuesQuery` needs to ask the underlying `Scorer`s (or `Weight`s?) for their costs, to determine which one to use. Lucene currently spends all that time building the big bitset of docs that match terms before we have access to that cost estimate. I'm going to reach out on the lucene-users mailing list to ask for advice on how to address that piece.
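For anyone following along, my rough understanding of the per-segment decision (heavily simplified from Lucene's `IndexOrDocValuesQuery`, not a verbatim copy) is something like:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.ScorerSupplier;
import org.apache.lucene.search.Weight;

class CostDecisionSketch {
    // Simplified: if the clauses leading the conjunction match far fewer
    // docs than the index query's own cost estimate, it's cheaper to
    // verify each candidate against doc values than to evaluate the
    // index query up front.
    static Scorer pick(Weight indexWeight, Weight dvWeight,
                       LeafReaderContext context, long leadCost) throws IOException {
        ScorerSupplier indexSupplier = indexWeight.scorerSupplier(context);
        if (indexSupplier == null) {
            return null;
        }
        if (leadCost < indexSupplier.cost()) {
            return dvWeight.scorerSupplier(context).get(leadCost);
        }
        return indexSupplier.get(leadCost);
    }
}
```

The catch described above sits exactly here: for a `MultiTermQuery`-based index query, even producing that `cost()` estimate can involve walking all the matching terms first.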