Description
**Is your feature request related to a problem? Please describe.**

I recently came across a use case where someone had a Boolean query like:
"bool": {
"filter" : [
{
"term" : {
// Some term query
},
},
{
"range" : {
// Some time range query
}
},
{
"range" : {
"high_cardinality_keyword_field" : {
"lt": "sadhgjhasgjhdasgh" // Some guid term
}
}
}
]
}
The last range query over a keyword field takes forever just to get started, before even creating the Lucene `Scorer`. I dug into it and it's because it creates a `TermRangeQuery`, which is a subclass of `MultiTermQuery`. The `rewrite` method for the `Weight` created by that query ends up building a bitset union of all matching terms.

A much better approach would be to let the other filter terms "drive" the scorer forward, retrieve each hit's doc value for `high_cardinality_keyword_field`, and check that against the range predicate. A similar approach is used for range queries over various numeric fields, using Lucene's `IndexOrDocValuesQuery`. (Here is the code for the float field type.)
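For reference, the numeric pattern boils down to something like the sketch below (written against Lucene's public API; this is not OpenSearch's actual field-type code, and the helper name is made up):

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

class NumericRangeExample {
    // Pair a points-based range query (cheap when the range itself is
    // selective) with a doc-values range query (cheap when some other
    // clause drives iteration), and let IndexOrDocValuesQuery pick
    // between them per segment based on cost.
    static Query longRange(String field, long lower, long upper) {
        return new IndexOrDocValuesQuery(
                LongPoint.newRangeQuery(field, lower, upper),
                SortedNumericDocValuesField.newSlowRangeQuery(field, lower, upper));
    }
}
```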
**Describe the solution you'd like**

I would like OpenSearch to build a keyword range query that will decide (per segment) whether it's better to use the terms index (e.g. if there is a small number of terms and computing the union is likely to be inexpensive) or a doc values-based range query (probably `SortedSetDocValuesField.newSlowRangeQuery`). Obviously, if the given field doesn't have doc values, then we should just fall back to the existing behavior.
As a bonus, we could support range queries over unindexed keyword fields by using the doc values query instead.
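Naively, the keyword version of that pairing might look like this hypothetical helper (using Lucene's public API; as noted under "Additional context", this pairing alone doesn't solve the cost-estimation problem):

```java
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.util.BytesRef;

class KeywordRangeExample {
    // Hypothetical keyword analogue of the numeric pattern: a terms-index
    // range query paired with a SortedSet doc-values range query.
    static Query keywordRange(String field, BytesRef lower, BytesRef upper,
                              boolean includeLower, boolean includeUpper) {
        Query indexQuery = new TermRangeQuery(field, lower, upper, includeLower, includeUpper);
        Query dvQuery = SortedSetDocValuesField.newSlowRangeQuery(
                field, lower, upper, includeLower, includeUpper);
        return new IndexOrDocValuesQuery(indexQuery, dvQuery);
    }
}
```

(A real implementation would also need the fallback described above for fields without doc values.)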
**Describe alternatives you've considered**
In some cases, we might be able to move the query into a post-filter, but in the specific case where I saw this recently, the boolean query above was just one clause in a bigger query. We didn't want to apply the term range query as a filter on the whole result, just to that sub-query.
There are also cases where we could approximate the keyword field by mapping it to some number that partially preserves the order (i.e. map `a -> n_a`, where `n_a < n_b` implies that `a < b`, but you may have `a < b` where `n_a = n_b`). For example, you could take the first 8 bytes of each keyword and treat that as an unsigned long. Then you could take advantage of the existing optimization for numeric fields with doc values.
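To make the 8-byte idea concrete, here's a small sketch (`encodePrefix` is a made-up name): pack the first 8 UTF-8 bytes big-endian into a `long` and compare with `Long.compareUnsigned`.

```java
import java.nio.charset.StandardCharsets;

class KeywordPrefix {
    // Pack the first 8 UTF-8 bytes of a keyword into a long, big-endian,
    // zero-padded. Comparing results with Long.compareUnsigned preserves
    // the byte order, but keywords that share their first 8 bytes collapse
    // to the same value -- the lossy n_a = n_b case above.
    static long encodePrefix(String keyword) {
        byte[] bytes = keyword.getBytes(StandardCharsets.UTF_8);
        long packed = 0L;
        for (int i = 0; i < 8; i++) {
            int b = i < bytes.length ? (bytes[i] & 0xFF) : 0;
            packed = (packed << 8) | b;
        }
        return packed;
    }
}
```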
**Additional context**

Note that this probably isn't as simple as passing the existing `TermRangeQuery` and a doc values query to an `IndexOrDocValuesQuery`, because the `IndexOrDocValuesQuery` needs to ask the underlying `Scorer`s (or `Weight`s?) for their costs, to determine which one to use. Lucene currently spends all that time building the big bitset of docs that match terms before we have access to that cost estimate. I'm going to reach out on the lucene-users mailing list to ask for advice on how to address that piece.
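For anyone following along, my rough understanding of the per-segment decision (heavily simplified from Lucene's `IndexOrDocValuesQuery`, not a verbatim copy) is something like:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.ScorerSupplier;
import org.apache.lucene.search.Weight;

class CostDecisionSketch {
    // Simplified: if the clauses leading the conjunction match far fewer
    // docs than the index query's own cost estimate, it's cheaper to
    // verify each candidate against doc values than to evaluate the
    // index query up front.
    static Scorer pick(Weight indexWeight, Weight dvWeight,
                       LeafReaderContext context, long leadCost) throws IOException {
        ScorerSupplier indexSupplier = indexWeight.scorerSupplier(context);
        if (indexSupplier == null) {
            return null;
        }
        if (leadCost < indexSupplier.cost()) {
            return dvWeight.scorerSupplier(context).get(leadCost);
        }
        return indexSupplier.get(leadCost);
    }
}
```

The catch described above sits exactly here: for a `MultiTermQuery`-based index query, even producing that `cost()` estimate can involve walking all the matching terms first.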