Skip to content

Optimize MultiTermQueries over keyword field using doc values query #7057

Closed
@msfroh

Description

@msfroh

Is your feature request related to a problem? Please describe.
I recently came across a use-case where someone had a Boolean query like:

"bool": {
  "filter" : [
     {
        "term" : {
            // Some term query
        },
     },
     {
        "range" : {
            // Some time range query         
         }
     },
     {
        "range" : {
            "high_cardinality_keyword_field" : {
                 "lt": "sadhgjhasgjhdasgh" // Some guid term
            }
        }
     }
  ]
}

The last range query over a keyword field takes forever just to get started, before even creating the Lucene Scorer. I dug into it and it's because it creates a TermRangeQuery, which is a subclass of MultiTermQuery. The rewrite method for the Weight created by that query ends up building a bitset union of all matching terms.

A much better approach would be to let the other filter terms "drive" the scorer forward, retrieve each hit's doc value for high_cardinality_keyword_field, and check that against the range predicate. A similar approach is used for range queries over various numeric fields, using Lucene's IndexOrDocValuesQuery. (Here is the code for the float field type.)

Describe the solution you'd like
I would like OpenSearch to build a keyword range query that will decide (per-segment) whether it's better to use an index (e.g. if there is a small number of terms and computing the union is likely to be inexpensive) or a doc values-based range query -- probably SortedSetDocValuesField.newSlowRangeQuery. Obviously, if the give field doesn't have doc values, then we should just fall back to the existing behavior.

As a bonus, we could support range queries over unindexed keyword fields by using the doc values query instead.

Describe alternatives you've considered
In some cases, we might be able to move the query into a post-filter, but in the specific case where I saw this recently, the boolean query above was just one clause in a bigger query. We didn't want to apply the term range query as a filter on the whole result, just to that sub-query.

There are also cases where we could approximate the keyword field by mapping to some number that would partially preserve the order (i.e. map a -> n_a, where n_a < n_b implies that a < b, but you may have a < b where n_a = n_b. Like, you could take the first 8 bytes of each keyword and and treat that as an unsigned long.) Then you could take advantage of the existing optimization for numeric fields with doc values.

Additional context
Note that this probably isn't as simple as passing the existing TermRangeQuery and a doc values query to an IndexOrDocValuesQuery, because the IndexOrDocValuesQuery needs to ask the underlying Scorers (or Weights?) for their costs, to determine which one to use. Lucene currently spends all that time building the big bitset of docs that match terms before we have access that cost estimate. I'm going to reach out on the lucene-users mailing list to ask for advice on how to address that piece.

Metadata

Metadata

Assignees

Labels

PerformanceThis is for any performance related enhancements or bugsSearch:PerformanceenhancementEnhancement or improvement to existing feature or requestv2.12.0Issues and PRs related to version 2.12.0

Type

No type

Projects

Status

✅ Done

Status

Closed

Status

Done

Status

2.12.0 (Launched)

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions