
Added aggregation precomputation for rare terms #18106


Open · wants to merge 12 commits into main

Conversation

ajleong623
Contributor

@ajleong623 ajleong623 commented Apr 28, 2025

Description

This change expands on the techniques from @sandeshkr419's pull request #11643 to precompute aggregations for match-all or match-none queries. We can leverage reading from the TermsEnum to precompute the aggregation when the field is indexed and there are no deletions. We can verify that no documents are deleted by checking that the weight's count matches the reader's maxDoc.

Unfortunately, I was not able to use the same technique for numeric aggregators like LongRareTermsAggregator. This is because numeric points are not indexed by term frequency but through KD-trees, which are optimized for different kinds of operations: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/PointValues.java.
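The per-segment check described above can be sketched roughly as follows. This is a hedged, stdlib-only illustration, not the PR's actual code: `termDocFreqs` stands in for walking Lucene's TermsEnum, `weightCount` for `weight.count(ctx)`, and `maxDoc` for `ctx.reader().maxDoc()`; the method name `precomputeRareTerms` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class PrecomputeSketch {
    // Hypothetical sketch of the precomputation gate and terms-dictionary walk.
    static Map<String, Long> precomputeRareTerms(Map<String, Integer> termDocFreqs,
                                                 int weightCount, int maxDoc,
                                                 long maxDocCountThreshold) {
        // Safe only when the top-level query matched every live document and there
        // are no deletions: the weight's count must equal the reader's maxDoc.
        if (weightCount != maxDoc) {
            return null; // fall back to normal per-document collection
        }
        Map<String, Long> buckets = new HashMap<>();
        // Iterate the terms dictionary instead of the documents; docFreq() already
        // gives the number of documents containing each term.
        for (Map.Entry<String, Integer> term : termDocFreqs.entrySet()) {
            if (term.getValue() <= maxDocCountThreshold) { // rare-terms cutoff
                buckets.put(term.getKey(), term.getValue().longValue());
            }
        }
        return buckets;
    }
}
```

The key point is that the expensive per-document collect loop is replaced by a walk over the (usually much smaller) terms dictionary.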

Please let me know if there are any comments, concerns or suggestions.

Related Issues

Resolves #13123
#13122
#10954

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for f6371a2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ajleong623 ajleong623 marked this pull request as draft April 28, 2025 17:10
Contributor

❌ Gradle check result for 844164e: FAILURE


Contributor

❌ Gradle check result for 0f3bd75: FAILURE


@ajleong623
Contributor Author

ajleong623 commented Apr 28, 2025

It looks like I am failing the test org.opensearch.cache.common.tier.TieredSpilloverCacheStatsIT.testClosingShard; however, when I ran this test on my local machine, it passed. What could be happening?

Edit: Sorry, it actually looks like the test does not pass on my system either. I also ran it on the current codebase without any of my changes, and it still failed. Therefore, I do not think my code affects this test.

@ajleong623 ajleong623 marked this pull request as ready for review April 29, 2025 00:16
Contributor

github-actions bot commented Jun 1, 2025

✅ Gradle check result for 65e20b8: SUCCESS


codecov bot commented Jun 1, 2025

Codecov Report

Attention: Patch coverage is 76.36364% with 26 lines in your changes missing coverage. Please review.

Project coverage is 72.62%. Comparing base (87022b7) to head (86a23cb).
Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
...gations/bucket/terms/MapStringTermsAggregator.java 77.50% 5 Missing and 4 partials ⚠️
...aggregations/bucket/missing/MissingAggregator.java 76.92% 2 Missing and 4 partials ⚠️
...ations/bucket/terms/StringRareTermsAggregator.java 78.57% 3 Missing and 3 partials ⚠️
...rch/search/aggregations/support/MissingValues.java 40.00% 3 Missing ⚠️
...ket/terms/GlobalOrdinalsStringTermsAggregator.java 75.00% 0 Missing and 1 partial ⚠️
...arch/search/aggregations/support/ValuesSource.java 85.71% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18106      +/-   ##
============================================
- Coverage     72.66%   72.62%   -0.05%     
+ Complexity    68231    68221      -10     
============================================
  Files          5555     5556       +1     
  Lines        313857   313994     +137     
  Branches      45522    45551      +29     
============================================
- Hits         228073   228033      -40     
- Misses        67207    67378     +171     
- Partials      18577    18583       +6     

@ajleong623
Contributor Author

ajleong623 commented Jun 5, 2025

Summary of All Changes Made
MissingAggregator.java:

  • Covered the cases of sub-aggregators existing, the missing parameter being enabled, no field name or weight, deleted documents, the field not being indexed, and the presence of the _doc_count field (all handled in the function at line 120 of MissingAggregator.java).
  • Had to make sure the values-source configuration was accessible from the precomputation function so it could read the missing parameter.
  • Action Item: I have to make sure that scripted aggregations are checked. Temporarily, I did not return indexFieldName for scripted values sources, but I should put that back and check valuesSourceConfig.script() == null instead. (This might not actually be an issue.)
  • Action Item: fieldName and weight also needed to be available to the function.
  • The default size of the buckets array is 1, per https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/aggregations/bucket/BucketsAggregator.java. Since MissingAggregator is a SingleBucketAggregator, the only bucket that can be incremented is bucket 0, as shown at line 174 of MissingAggregator.java.
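Once the preconditions in the bullets above hold, the missing-bucket precomputation reduces to simple arithmetic. A hedged sketch (the name `precomputeMissingCount` and its parameters are illustrative stand-ins, not the PR's API: `weightCount` for `weight.count(ctx)`, `maxDoc` for `ctx.reader().maxDoc()`, and `fieldDocCount` for the number of documents that have the field):

```java
public class MissingPrecomputeSketch {
    // Hypothetical sketch: when every live doc matched and nothing was deleted,
    // the missing-bucket count needs no collection at all.
    static long precomputeMissingCount(int weightCount, int maxDoc, int fieldDocCount) {
        if (weightCount != maxDoc) {
            // Deletions or a non-match-all query: cannot precompute, must collect.
            throw new IllegalStateException("precomputation preconditions not met");
        }
        // Docs missing the field = all docs minus docs that have the field.
        // This count is incremented straight into bucket 0 (single-bucket aggregator).
        return (long) maxDoc - fieldDocCount;
    }
}
```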

GlobalOrdinalStringTermsAggregator.java:

  • When I made field names more accessible by exposing the indexFieldName property of valuesSource, the missing-parameter case (which would previously have been skipped) started being exercised, so I had to propagate the valuesSourceConfig variable to check that case. For a while I worked on an alternative way to add to the missing bucket, but I could not find a good way to determine the right ordinal for the missing term. Parts of that attempt are in commit 73c7499, but it could not work because of how I used the charset.
  • Action Item: I have to make sure that scripted aggregations are checked. Temporarily, I did not return indexFieldName for scripted values sources, but I should put that back and use valuesSourceConfig.script() == null instead. And write tests that hit this. (Not an issue)

MapStringTermsAggregator.java:

  • Similar to the MissingAggregator.java changes, I had to make sure the valuesSourceConfig was propagated so that the missing parameter would be available.
  • Action Item: An integration test which failed used a custom script, so I have to make sure that scripted aggregations are checked. Temporarily, I did not return indexFieldName for scripted values sources, but I should put that back and use valuesSourceConfig.script() == null instead.
  • Covered the cases of sub-aggregators existing, include/exclude filters, the missing parameter being enabled, no field name or weight, deleted documents, the field not being indexed, and the presence of the _doc_count field (all handled in the function at line 185 of MapStringTermsAggregator.java).
  • I incremented the bucket count as usual but also had to update the subset sizes for certain aggregators that use this.
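The subset-size bookkeeping mentioned in the last bullet can be sketched like this. All names here are illustrative stdlib-only stand-ins (there is no real `fillBuckets`, and the boolean flag stands in for the result-strategy instanceof check on SignificantTermsResults):

```java
import java.util.HashMap;
import java.util.Map;

public class SubsetSizeSketch {
    // Hypothetical sketch: add each term's docFreq to its bucket as usual, and for
    // significant-terms-style result strategies also accumulate the subset size.
    static long fillBuckets(Map<String, Integer> termDocFreqs,
                            Map<String, Long> buckets,
                            boolean significantTermsStrategy) {
        long subsetSize = 0;
        for (Map.Entry<String, Integer> e : termDocFreqs.entrySet()) {
            buckets.merge(e.getKey(), e.getValue().longValue(), Long::sum);
            if (significantTermsStrategy) {
                subsetSize += e.getValue();
            }
        }
        return subsetSize; // 0 when the strategy does not track subset sizes
    }
}
```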

Changes in SignificantTermsAggregatorFactory.java, SignificantTermsAggregatorSupplier.java, SignificantTextAggregatorFactory.java, TermsAggregatorFactory.java, TermsAggregatorSupplier.java were meant to propagate the valuesSourceConfiguration to the aggregators.

StringRareTermsAggregator.java:

  • Covered the cases of sub-aggregators existing, include/exclude filters, no weight, deleted documents, and the field not being indexed (all handled in the function at line 136 of StringRareTermsAggregator.java).
  • I incremented the bucket count as usual but also had to update the subset sizes for certain aggregators that use this.
  • Action Item: I have to make sure the missing-parameter test case is included. I also have to make sure that the case where fieldName does not exist is checked, and that the case with the _doc_count field is included.
  • Action Item: Missing parameter propagation still has to be covered
  • The method of incrementing the bucketDocCount is parallel to the method in MapStringTermsAggregator.java.
  • Action Item: I have to make sure that scripted aggregations are checked. Temporarily, I did not return indexFieldName for scripted values sources, but I should put that back and use valuesSourceConfig.script() == null instead. And write tests that hit this. (Not an issue)

MissingValues.java and ValuesSource.java can now have their indexFields accessed when they previously could not.

AggregatorTestCase.java:

  • Here, I had to make sure the collection count was being used, so I copied the searchAndReduce functions to include it. At AggregatorTestCase.java lines 847 and 868, the aggregator is wrapped by the counting aggregator; at line 848, I keep a running counter of all the collect counts.
  • At line 1581, I made a CompositeAggregationAndCount class so that both the aggregation and the count can be returned whenever a test wants to track the collect count too. I do not know if this is the best way; maybe I could just extend InternalAggregation so that the test changes would be more minimal.

MissingAggregatorTests.java:

  • This was the only updated test where I used the counter. The main changes were testing indexed fields and counting the aggregation collects, then verifying that count. Lines 103 to 113 and lines 116 to 123 show an example of the changes I described; this pattern repeats across all the tests.
  • In the function at line 577, I updated how the test cases are handled by using the CompositeAggregationAndCount class to verify the result and by calling the counting version of searchAndReduce.

TermsAggregatorTests.java and RareTermsAggregatorTests.java are mostly the same, but now use indexed fields. If the counting-aggregation approach is what you were thinking of, I will extend the changes to those tests as well.

@ajleong623
Contributor Author

ajleong623 commented Jun 5, 2025

A question I have is whether the way I used the counting aggregator is what you intended, or whether there is a better method. Another big question is how I make sure the relevant paths are included in the workloads; I have the requests that hit those paths in a previous comment above.

@@ -581,6 +588,21 @@ private void testSearchCase(

}

private void testSearchCaseIndexString(
Member

> I would look through all the tests relevant to the aggregators I changed and make sure the leaf bucket collection count is accurate?

Yes.

> It should be zero when the precomputation is used but the number of matching documents when it is not, right?

Yes.

Basically, we want to accurately capture the conditions under which the optimization is being used.

@@ -1416,6 +1578,24 @@ public void setWeight(Weight weight) {
}
}

protected static class CompositeAggregationAndCount {
Member

Is it not possible to make use of CountingAggregator defined above?

Likewise for the methods you introduced, I guess it should be possible to re-use createCountingAggregator directly?

Contributor Author

The issue I ran into is that searchAndReduce (https://github.com/opensearch-project/OpenSearch/blob/main/test/framework/src/main/java/org/opensearch/search/aggregations/AggregatorTestCase.java#L610) is where I put the CountingAggregator; however, the count is only accessible within the searchAndReduce method, while the verification takes place in the actual test case, as in https://github.com/opensearch-project/OpenSearch/blob/main/server/src/test/java/org/opensearch/search/aggregations/bucket/missing/MissingAggregatorTests.java#L341. I returned CompositeAggregationAndCount so that the collected count could be part of the verification process.

However, I understand that the technique I used is not the neatest, and I would like to maintain the integrity of the codebase. Since we know the expected count ahead of time, would it be better to throw the error within the searchAndReduce function and also provide the expected value as parameters?
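The alternative floated here could look roughly like this. Everything in the sketch is hypothetical (the name `searchAndReduceCounting`, its parameters, and the stand-in `Result` record are not the actual test-framework signature); it just shows the shape of asserting the expected collect count inside the helper instead of returning it:

```java
public class SearchReduceSketch {
    // Stand-in result type so the sketch is self-contained.
    record Result(String aggregation, long collectCount) {}

    // Hypothetical variant: pass the expected collect count in and fail inside the
    // helper, so individual tests never need the count returned to them.
    static Result searchAndReduceCounting(String aggregation, long actualCollectCount,
                                          long expectedCollectCount) {
        if (actualCollectCount != expectedCollectCount) {
            throw new AssertionError("expected " + expectedCollectCount
                    + " collect calls but saw " + actualCollectCount);
        }
        return new Result(aggregation, actualCollectCount);
    }
}
```

The trade-off is that the helper's callers no longer see the count at all, which keeps test bodies small but makes the failure message the only diagnostic.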

@@ -610,6 +632,42 @@ private <A extends InternalAggregation> A executeTestCase(Query query, List<Long
}
}

private <A extends InternalAggregation> A executeTestCaseIndexString(
Member

javadoc/comments to explain what this utility is doing?

boolean includeDeletedDocumentsInSegment = randomBoolean();
boolean includeDocCountField = randomBoolean();
boolean collectSegmentOrds = randomBoolean();
if (includeDeletedDocumentsInSegment == false && includeDocCountField == false) {
Member

Let's separate out these 2 cases to run independently instead of just relying on chance whether or not we have deleted documents/docCountField.

This will help reduce the test flakiness if present and also ensure that we have a good test coverage.

assertEquals(0, internalMissingAgg.getDocCount());
assertFalse(AggregationInspectionHelper.hasValue(internalMissingAgg));
if (isIndexed) {
assertEquals(0, internalMissing.getCount());
Member

let's add a comment above L120 - that pre-compute optimization is kicked in and no docs are traversed.

this.resultStrategy = resultStrategy.apply(this); // ResultStrategy needs a reference to the Aggregator to do its job.
this.includeExclude = includeExclude;
bucketOrds = BytesKeyedBucketOrds.build(context.bigArrays(), cardinality);
if (collectorSource instanceof ValuesSourceCollectorSource) {
Member

I don't like the idea of being uncertain about where the fieldName is going to come from, basically either from constructor above or fetching from value source. Let's be deterministic on where we are going to fetch the field name.


Also, you can probably use pattern matching for instanceof:

if (collectorSource instanceof ValuesSourceCollectorSource valuesCollectorSource) {
    this.fieldName = valuesCollectorSource.getValuesSource().getIndexFieldName();
}

Contributor Author

Good point. I will just stick with fetching from the value source. Since I made the modification to add the field name to the constructor, previous implementations should not be affected.

NumericDocValues docCountValues = DocValues.getNumeric(ctx.reader(), DocCountFieldMapper.NAME);
if (docCountValues.nextDoc() != NO_MORE_DOCS) {
// This segment has at least one document with the _doc_count field.
return false;
Member

I think if you separate out the test cases as I commented in test files - that can give you a good code coverage as well.

Signed-off-by: Anthony Leong <[email protected]>
Contributor

❌ Gradle check result for d51c2a0: FAILURE


Signed-off-by: Anthony Leong <[email protected]>
Contributor

✅ Gradle check result for ab13378: SUCCESS

Contributor

❌ Gradle check result for b5e08d8: FAILURE


Signed-off-by: Anthony Leong <[email protected]>
Contributor

❌ Gradle check result for ebca7e1: FAILURE


Signed-off-by: Anthony Leong <[email protected]>
Contributor

❌ Gradle check result for 9d73b57: FAILURE


Signed-off-by: Anthony Leong <[email protected]>
Contributor

✅ Gradle check result for b4a4128: SUCCESS

@ajleong623
Contributor Author

ajleong623 commented Jun 30, 2025

@sandeshkr419 I believe all the comments have been addressed. Rather than making a new class to also return the expected count of the missing aggregation, I simply put a check in the searchAndReduceCounting function. I also removed a lot of the non-deterministic tests and made them deterministic, and added extra tests for better coverage.

The other action item is adding the workloads to the opensearch-benchmark-workloads repository. Do I just add those query bodies in the big5/queries folder?

Contributor

❌ Gradle check result for b60c221: null


// TODO: A note is that in scripted aggregations, the way of collecting from buckets is determined from
// the script aggregator. For now, we will not be able to support the script aggregation.

if (subAggregators.length > 0 || includeExclude != null || fieldName == null) {
Member

You can pull the null checks for weight and config up here so that you don't have to assert them again.

Right now you are checking config != null twice, and checking weight.count(ctx) == ctx.reader().getDocCount(fieldName) before checking weight == null.

Contributor Author

@ajleong623 ajleong623 Jun 30, 2025

We might be able to proceed if config == null, but if there is a script, or if there is both a missing parameter and actual missing values, we will not be able to use the precomputation optimization. But I can move the weight check up.

// field missing, we might not be able to use the index unless there is some way we can
// calculate which ordinal value that missing field is (something I am not sure how to
// do yet).
if (config != null && config.missing() != null && ((weight.count(ctx) == ctx.reader().getDocCount(fieldName)) == false)) {
Member

nit: weight.count(ctx) != ctx.reader().getDocCount(fieldName) instead of asserting equality as false.

Contributor Author

Right. I looked at the formatting guidelines again, and I only have to write == false for unary negations.


// The optimization could only be used if there are no deleted documents and the top-level
// query matches all documents in the segment.
if (weight == null) {
Member

nit: Moving this null check towards the start of method can make this more readable.

if (bucketOrdinal < 0) { // already seen
bucketOrdinal = -1 - bucketOrdinal;
}
int amount = stringTermsEnum.docFreq();
Member

nit: rename amount to docCount or docFreq

bucketOrdinal = -1 - bucketOrdinal;
}
int amount = stringTermsEnum.docFreq();
if (resultStrategy instanceof SignificantTermsResults) {
Member

nit:

           if (resultStrategy instanceof SignificantTermsResults sigTermsResultStrategy) {
               sigTermsResultStrategy.updateSubsetSizes(0L, docCount);
           }

if (fieldName == null) {
// The optimization does not work when there are subaggregations or if there is a filter.
// The query has to be a match all, otherwise
//
Member

I think the comment is misplaced here.
Can you please check the comments on the entire PR once. Also, please remove empty comment lines.

Contributor

github-actions bot commented Jul 1, 2025

❌ Gradle check result for 0375104: FAILURE


Signed-off-by: Anthony Leong <[email protected]>
Contributor

github-actions bot commented Jul 2, 2025

✅ Gradle check result for 86a23cb: SUCCESS

Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Missing Terms Aggregation Performance Optimization
3 participants