Added aggregation precomputation for rare terms #18106
Conversation
❌ Gradle check result for f6371a2: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 844164e: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 0f3bd75: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
It looks like I am failing the test org.opensearch.cache.common.tier.TieredSpilloverCacheStatsIT.testClosingShard; however, when I first ran this test on my local computer, it appeared to pass. What could be happening? Edit: Sorry, it actually turns out the test does not pass on my system either. I also ran it on the current codebase without any of my changes, and it still failed. Therefore, I do not think my code affects this test.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main   #18106      +/-   ##
============================================
- Coverage     72.66%   72.62%     -0.05%
+ Complexity    68231    68221        -10
============================================
  Files          5555     5556         +1
  Lines        313857   313994       +137
  Branches      45522    45551        +29
============================================
- Hits         228073   228033        -40
- Misses        67207    67378       +171
- Partials      18577    18583         +6
```

☔ View full report in Codecov by Sentry.
Summary of All Changes Made
GlobalOrdinalStringTermsAggregator.java:
MapStringTermsAggregator.java:
Changes in SignificantTermsAggregatorFactory.java, SignificantTermsAggregatorSupplier.java, SignificantTextAggregatorFactory.java, TermsAggregatorFactory.java, and TermsAggregatorSupplier.java propagate the valuesSourceConfiguration to the aggregators.
StringRareTermsAggregator.java:
MissingValues.java and ValuesSource.java can now have their indexFields accessed when they previously could not.
AggregatorTestCase.java:
MissingAggregatorTests.java:
TermsAggregatorTests.java and RareTermsAggregatorTests.java are mostly the same, but now use indexed fields. If the way I used the counting aggregation is what you were thinking, I will extend the changes to those tests as well.

One question I have is whether the way I used the counting aggregator is what you intended, or if there is a better approach. Another big question: how do I make sure these code paths are included in the workloads? The requests that hit those paths are in a previous comment above.
```
@@ -581,6 +588,21 @@ private void testSearchCase(
    }

    private void testSearchCaseIndexString(
```
I would look through all the tests relevant to the aggregators I changed and make sure the leaf bucket collection count is accurate?

Yes.

It should be zero when the precomputation is used, but equal to the number of matching documents when it is not, right?

Yes.

Basically, we want to accurately capture the conditions under which the optimization is being used.
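For illustration, a minimal sketch of what such a check could look like: a delegating collector that counts `collect()` calls. `CountingLeafCollector` here is a hypothetical stand-in, not the test framework's actual `CountingAggregator`.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;

final class CountingLeafCollector implements LeafCollector {

    private final LeafCollector delegate;
    private final AtomicLong collectCount;

    CountingLeafCollector(LeafCollector delegate, AtomicLong collectCount) {
        this.delegate = delegate;
        this.collectCount = collectCount;
    }

    @Override
    public void setScorer(Scorable scorer) throws IOException {
        delegate.setScorer(scorer);
    }

    @Override
    public void collect(int doc) throws IOException {
        collectCount.incrementAndGet(); // one increment per document actually visited
        delegate.collect(doc);
    }
}
```

A test can then assert `collectCount.get() == 0` when the precomputation handled the segment, and `collectCount.get() == numMatchingDocs` when normal collection ran.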
```
@@ -1416,6 +1578,24 @@ public void setWeight(Weight weight) {
        }
    }

    protected static class CompositeAggregationAndCount {
```
Is it not possible to make use of `CountingAggregator`, defined above? Likewise for the methods you introduced, I guess it should be possible to re-use `createCountingAggregator` directly?
The issue I ran into is that `searchAndReduce` (https://github.com/opensearch-project/OpenSearch/blob/main/test/framework/src/main/java/org/opensearch/search/aggregations/AggregatorTestCase.java#L610) runs inside AggregatorTestCase.java, and that is where I put the `CountingAggregator`. However, the count is only accessible within the `searchAndReduce` method, while the verification takes place in the actual test case, as in https://github.com/opensearch-project/OpenSearch/blob/main/server/src/test/java/org/opensearch/search/aggregations/bucket/missing/MissingAggregatorTests.java#L341. I returned `CompositeAggregationAndCount` so that the collected count could be part of the verification process.

However, I understand that the technique I used is not the neatest, and I would like to maintain the integrity of the codebase. Since we know the expected count ahead of time, would it be better to throw the error within the `searchAndReduce` function, providing the expected value as a parameter?
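For what it's worth, a minimal sketch of that alternative, with all names illustrative rather than the actual `AggregatorTestCase` API: the helper takes the expected collect count and asserts it internally, so its return type stays unchanged.

```java
import static org.junit.Assert.assertEquals;

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

final class CountingSearchSketch {

    /**
     * Runs the search, then asserts how many documents the aggregator visited:
     * zero when the precompute path handled every segment, the number of
     * matching documents otherwise.
     */
    static <A> A searchAndReduceCounting(long expectedCollectCount, AtomicLong collectCount, Supplier<A> searchAndReduce) {
        A result = searchAndReduce.get();
        assertEquals(expectedCollectCount, collectCount.get());
        return result;
    }
}
```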
```
@@ -610,6 +632,42 @@ private <A extends InternalAggregation> A executeTestCase(Query query, List<Long
        }
    }

    private <A extends InternalAggregation> A executeTestCaseIndexString(
```
Javadoc/comments to explain what this utility is doing?
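One possible shape for the requested documentation; the wording is inferred from the surrounding discussion, not taken from the PR:

```java
/**
 * Variant of executeTestCase that indexes the string field so the
 * terms-dictionary precomputation path can be exercised, and that records
 * how many documents the aggregator actually visits.
 */
```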
```java
boolean includeDeletedDocumentsInSegment = randomBoolean();
boolean includeDocCountField = randomBoolean();
boolean collectSegmentOrds = randomBoolean();
if (includeDeletedDocumentsInSegment == false && includeDocCountField == false) {
```
Let's separate these two cases out to run independently instead of relying on chance for whether we have deleted documents or a docCountField. This will help reduce test flakiness, if present, and also ensure good test coverage.
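A sketch of that separation, assuming the existing randomized helper is refactored to take the two flags explicitly (all names here are illustrative):

```java
import java.io.IOException;

public class RareTermsPrecomputeCases {

    public void testPrecomputeWithCleanSegment() throws IOException {
        testSearchCase(false, false); // no deletions, no _doc_count: precompute path must kick in
    }

    public void testNoPrecomputeWithDeletedDocuments() throws IOException {
        testSearchCase(true, false); // deletions invalidate docFreq(), so normal collection must run
    }

    public void testNoPrecomputeWithDocCountField() throws IOException {
        testSearchCase(false, true); // _doc_count means one Lucene doc can represent many
    }

    private void testSearchCase(boolean includeDeletedDocumentsInSegment, boolean includeDocCountField) throws IOException {
        // Stubbed here; in the real tests this indexes documents accordingly,
        // runs the aggregation, and asserts on the collect count.
    }
}
```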
```java
assertEquals(0, internalMissingAgg.getDocCount());
assertFalse(AggregationInspectionHelper.hasValue(internalMissingAgg));
if (isIndexed) {
    assertEquals(0, internalMissing.getCount());
```
Let's add a comment above L120 noting that the pre-compute optimization has kicked in and no docs are traversed.
```java
this.resultStrategy = resultStrategy.apply(this); // ResultStrategy needs a reference to the Aggregator to do its job.
this.includeExclude = includeExclude;
bucketOrds = BytesKeyedBucketOrds.build(context.bigArrays(), cardinality);
if (collectorSource instanceof ValuesSourceCollectorSource) {
```
I don't like the idea of being uncertain about where the `fieldName` is going to come from, i.e. either from the constructor above or fetched from the values source. Let's be deterministic about where we fetch the field name from.

Also, you can probably use pattern matching for instanceof:

```java
if (collectorSource instanceof ValuesSourceCollectorSource valuesCollectorSource) {
    this.fieldName = valuesCollectorSource.getValuesSource().getIndexFieldName();
}
```
Good point. I will just stick with fetching it from the values source. Since the field name in the constructor was a modification I added, removing it should not affect previous implementations.
```java
NumericDocValues docCountValues = DocValues.getNumeric(ctx.reader(), DocCountFieldMapper.NAME);
if (docCountValues.nextDoc() != NO_MORE_DOCS) {
    // This segment has at least one document with the _doc_count field.
    return false;
```
I think if you separate out the test cases as I suggested in the test files, that will give you good code coverage as well.
Signed-off-by: Anthony Leong <[email protected]>

❌ Gradle check result for d51c2a0: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>

❌ Gradle check result for b5e08d8: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Anthony Leong <[email protected]>

❌ Gradle check result for ebca7e1: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Anthony Leong <[email protected]>

❌ Gradle check result for 9d73b57: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
… completed action items Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
@sandeshkr419 I believe all the comments have been addressed. Rather than making a new class to also return the expected count of the missing aggregation, I simply put a check in the searchAndReduceCounting function. I also removed a lot of the non-deterministic tests and made them deterministic, and I added extra tests for better coverage. The other action item is adding the workloads to the opensearch-benchmark-workloads repository. Do I just add those query bodies in the big5/queries folder?
❌ Gradle check result for b60c221: null. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
```java
// TODO: A note is that in scripted aggregations, the way of collecting from buckets
// is determined by the script aggregator. For now, we will not be able to support
// script aggregations.

if (subAggregators.length > 0 || includeExclude != null || fieldName == null) {
```
You can pull the null checks for `weight` and `config` up here so that you don't have to assert them again. Right now you are checking `config != null` twice, and checking `weight.count(ctx) == ctx.reader().getDocCount(fieldName)` before checking `weight == null`.
We might be able to proceed if `config == null`, but if there is a script, or if there is both a `missing` parameter and actual missing values, we will not be able to use the precomputation optimization. I can move the weight check up, though.
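For illustration, a sketch of the reordered guard clauses being discussed, based only on the visible diff and the description's maxDocs check; the method name and exact conditions are assumptions, not the merged code:

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Weight;
import org.opensearch.search.aggregations.support.ValuesSourceConfig;

final class PrecomputeGuardSketch {

    static boolean canPrecompute(Weight weight, LeafReaderContext ctx, ValuesSourceConfig config, String fieldName)
        throws IOException {
        // Hoisted null checks: everything below may safely dereference weight.
        if (weight == null || fieldName == null) {
            return false;
        }
        // Match-all over live docs: no deletions and the query matches everything.
        if (weight.count(ctx) != ctx.reader().maxDoc()) {
            return false;
        }
        // If `missing` is configured and some documents actually lack the field,
        // the terms dictionary alone cannot produce the missing bucket.
        if (config != null && config.missing() != null && weight.count(ctx) != ctx.reader().getDocCount(fieldName)) {
            return false;
        }
        return true;
    }
}
```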
```java
// field missing, we might not be able to use the index unless there is some way we can
// calculate which ordinal value that missing field is (something I am not sure how to
// do yet).
if (config != null && config.missing() != null && ((weight.count(ctx) == ctx.reader().getDocCount(fieldName)) == false)) {
```
nit: `weight.count(ctx) != ctx.reader().getDocCount(fieldName)` instead of asserting equality as false.
Right. I looked at the formatting guidelines again, and I only have to assert equality as false for unary negations.
```java
// The optimization could only be used if there are no deleted documents and the top-level
// query matches all documents in the segment.
if (weight == null) {
```
nit: moving this null check towards the start of the method would make it more readable.
```java
if (bucketOrdinal < 0) { // already seen
    bucketOrdinal = -1 - bucketOrdinal;
}
int amount = stringTermsEnum.docFreq();
```
nit: rename `amount` to `docCount` or `docFreq`.
```java
    bucketOrdinal = -1 - bucketOrdinal;
}
int amount = stringTermsEnum.docFreq();
if (resultStrategy instanceof SignificantTermsResults) {
```
nit:

```java
if (resultStrategy instanceof SignificantTermsResults sigTermsResultStrategy) {
    sigTermsResultStrategy.updateSubsetSizes(0L, docCount);
}
```
```java
if (fieldName == null) {
    // The optimization does not work when there are subaggregations or if there is a filter.
    // The query has to be a match all, otherwise
    //
```
I think the comment is misplaced here. Can you please check the comments across the entire PR once? Also, please remove empty comment lines.
Signed-off-by: Anthony Leong <[email protected]>
❌ Gradle check result for 0375104: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Anthony Leong <[email protected]>
Description
This change expands on the techniques from @sandeshkr419's pull request #11643 to precompute aggregations for match-all or match-none queries. We can leverage reading from the terms dictionary via termsEnum to precompute the aggregation when the field is indexed and there are no deletions. We can verify that no documents are deleted by taking the query's weight and checking that its count matches the reader's maxDocs.
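As a rough sketch of that idea (illustrative, not the PR's exact code): when the weight's count equals the segment's maxDoc, every live document matches the query, so per-term bucket counts can be read straight from the terms dictionary via `TermsEnum.docFreq()` instead of visiting each document.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.Weight;
import org.apache.lucene.util.BytesRef;

final class TermsEnumPrecomputeSketch {

    /** Attempts to precompute per-term doc counts; returns false if the segment is ineligible. */
    static boolean tryPrecompute(Weight weight, LeafReaderContext ctx, String fieldName, Map<String, Long> bucketCounts)
        throws IOException {
        // weight.count() returns -1 when the count is not cheaply available; equality
        // with maxDoc() means the query matches every document and none are deleted.
        if (weight == null || weight.count(ctx) != ctx.reader().maxDoc()) {
            return false;
        }
        Terms terms = ctx.reader().terms(fieldName);
        if (terms == null) {
            return false; // field is not indexed in this segment
        }
        TermsEnum termsEnum = terms.iterator();
        for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
            // With no deletions and a match-all query, docFreq() is exactly the bucket count.
            bucketCounts.merge(term.utf8ToString(), (long) termsEnum.docFreq(), Long::sum);
        }
        return true;
    }
}
```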
Unfortunately, I was not able to use the same technique for numeric aggregators like LongRareTermsAggregator, because numeric points are not indexed by term frequency but through KD-trees, which are optimized for different kinds of operations (see https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/PointValues.java).

Please let me know if there are any comments, concerns, or suggestions.
Related Issues
Resolves #13123
#13122
#10954
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.