implement batch document optimization for text embedding processor #1217
Conversation
Force-pushed from c04d826 to 516c43a
Codecov report: all modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##            optimized-processor   #1217      +/-  ##
=========================================================
+ Coverage                 81.94%  81.97%   +0.02%
+ Complexity                 2604    1315    -1289
=========================================================
  Files                       194      97      -97
  Lines                      8858    4487    -4371
  Branches                   1498     760     -738
=========================================================
- Hits                       7259    3678    -3581
+ Misses                     1016     513     -503
+ Partials                    583     296     -287
protected List<List<IngestDocumentWrapper>> cutBatches(List<IngestDocumentWrapper> ingestDocumentWrappers) {
    List<List<IngestDocumentWrapper>> batches = new ArrayList();
    for (int i = 0; i < ingestDocumentWrappers.size(); i += this.batchSize) {
This document mentions that the batch_size parameter is deprecated. Should we keep supporting it for 3.0?
I wanted to keep the changes minimal relative to the existing implementation of batchExecute in core: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/ingest/AbstractBatchingProcessor.java#L48-L74
Do you think we should rewrite the function entirely?
Since there are conflicting explanations for batch_size, I think it's safe to leave it for now.
protected List<List<IngestDocumentWrapper>> cutBatches(List<IngestDocumentWrapper> ingestDocumentWrappers) {
    List<List<IngestDocumentWrapper>> batches = new ArrayList();
    for (int i = 0; i < ingestDocumentWrappers.size(); i += this.batchSize) {
        batches.add(ingestDocumentWrappers.subList(i, Math.min(i + this.batchSize, ingestDocumentWrappers.size())));
Using subList may not be efficient for certain data structures. For instance, with a LinkedList, if batchSize is 1, making n calls to subList would result in O(n²) time complexity.
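To make the concern concrete: the quadratic cost comes from each subList call on a LinkedList walking from the head, whereas a single iterator pass keeps batching linear for any List implementation. A minimal sketch of the alternative (generic code, not the actual processor implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Splits the input into batches of at most batchSize elements in one O(n)
    // pass, regardless of the list implementation (ArrayList, LinkedList, ...).
    public static <T> List<List<T>> cutBatches(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>(batchSize);
        for (T item : items) {
            current.add(item);
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // trailing partial batch
        }
        return batches;
    }
}
```

The copy into fresh ArrayLists also avoids subList's view semantics, where mutating the backing list invalidates the sublists.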
This is the default behavior of batchExecute (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/ingest/AbstractBatchingProcessor.java#L48-L74), which we extend in our InferenceProcessor.
I think it's better to leave the existing logic as is and change only what is needed. Let me know your thoughts.
Got it. Can't we just override subBatchExecute instead of repeating the core code here?
If we just override subBatchExecute, we won't be able to use a multiGet operation to fetch all documents at once, since subBatchExecute only receives a subset (often just one) of the documents.
That is what we want. We should not fetch all documents at once; it might fail if the number of documents is too large. Customers can increase batch_size if they want.
Okay. With this change we would be making multiple multiGet calls, since we call it at the sub-batch level.
For example, if a user ingests 6 documents with a batch size of 2, there will be 3 multiGet calls, each covering a batch of 2.
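The call count in the example follows directly from the batching: one multiGet per sub-batch, i.e. ceil(n / batchSize) calls. A one-liner makes the arithmetic explicit:

```java
public class BatchMath {
    // Number of sub-batch multiGet calls for docCount documents at a given
    // batch size: one call per batch, i.e. ceil(docCount / batchSize).
    public static int multiGetCalls(int docCount, int batchSize) {
        return (docCount + batchSize - 1) / batchSize;
    }
}
```

So 6 documents at batch size 2 yields 3 calls, and 5 documents at batch size 2 yields 3 calls (the last batch holding a single document).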
Yes.
            assert counter.get() >= 0 : "counter is negative";
        });
    }
}, e -> { handler.accept(null); }));
If we call handler.accept(null), will the customer see the error?
No, I will update the PR to include the exception in each document.
With the latest change, when mget throws an exception, the handler still accepts the ingested documents, with the exception message populated in each document.
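The failure path described above can be sketched as follows. The DocWrapper type here is a minimal stand-in for OpenSearch's IngestDocumentWrapper (the real class lives in org.opensearch.ingest); the field and method shapes are illustrative assumptions, not the PR's actual code:

```java
import java.util.List;
import java.util.function.Consumer;

public class FailurePropagation {
    // Illustrative stand-in for IngestDocumentWrapper.
    static class DocWrapper {
        Exception exception;
    }

    // On an mget failure, mark every document in the batch with the exception
    // instead of dropping the batch, then hand the whole batch to the handler
    // so per-document failures can surface in the bulk response.
    static void onMultiGetFailure(List<DocWrapper> batch, Exception e, Consumer<List<DocWrapper>> handler) {
        for (DocWrapper w : batch) {
            w.exception = e;
        }
        handler.accept(batch);
    }
}
```

The key design point is that the handler always receives the full batch, never null, so downstream accounting of documents is preserved even on failure.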
Force-pushed from ddc33f8 to a257a0c
@@ -106,4 +112,60 @@ public void doBatchExecute(List<String> inferenceList, Consumer<List<?>> handler
        ActionListener.wrap(handler::accept, onException)
    );
}

@Override
public void subBatchExecute(List<IngestDocumentWrapper> ingestDocumentWrappers, Consumer<List<IngestDocumentWrapper>> handler) {
Could you divide this method into three parts?
- Validations
- MGet call
- Checking for embeddings
I refactored the method and added comments to make it more readable. I left the validations at the beginning of the method for a cleaner implementation.
@@ -47,7 +45,7 @@ public Object filterInferenceValue(
    Optional<Object> existingValueOptional = ProcessorUtils.getValueFromSource(existingSourceAndMetadataMap, textPath);
    Optional<Object> embeddingValueOptional = ProcessorUtils.getValueFromSource(existingSourceAndMetadataMap, embeddingKey);
    if (existingValueOptional.isPresent() && embeddingValueOptional.isPresent()) {
        return copyEmbeddingForSingleValue(
QQ: why do we change this method here?
This was a change from the previous PR. Since we compare single values and list values as plain objects, we can use a single method for comparing and copying embeddings.
Force-pushed from a257a0c to 3bb0a19
    handler.accept(Collections.emptyList());
    return;
}
List<DataForInference> dataForInferences = getDataForInference(ingestDocumentWrappers);
What happens if this line throws an exception? Shouldn't we catch it and set it in the handler, or is it handled already?
it is already handled in the method: here
That's the issue: you would need to check every method to ensure exceptions are handled properly, which isn't a future-proof approach. It would be better to handle the exception here as well, ensuring that any exception thrown by the underlying methods is caught.
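The defensive pattern being asked for can be sketched as a small guard: run the listener body inside try/catch so a synchronous exception from a helper (such as getDataForInference) is routed to the failure path rather than escaping the listener thread. This is an illustrative sketch, not the processor's actual code:

```java
import java.util.function.Consumer;

public class SafeCallback {
    // Runs the body of an async callback and routes any synchronous exception
    // to the failure consumer instead of letting it propagate out of the
    // listener. Keeps error handling local even if helpers forget to catch.
    static void runGuarded(Runnable body, Consumer<Exception> onFailure) {
        try {
            body.run();
        } catch (Exception e) {
            onFailure.accept(e);
        }
    }
}
```

With this shape, the caller does not need to audit every helper for internal exception handling; the guard catches whatever leaks.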
// create a map of documents with key: doc_id and value: doc
Map<String, Map<String, Object>> existingDocuments = createDocumentMap(multiGetItemResponses);
List<DataForInference> filteredDataForInference = filterDataForInference(dataForInferences, existingDocuments);
List<String> filteredInferenceList = constructInferenceTexts(filteredDataForInference);
What happens if an exception is thrown here? Will it be handled gracefully?
It checks whether an exception was thrown and only creates inference text for valid data inferences: here
Same as above comment.
    existingListValue.getFirst(),
    embeddingListOptional.get(),
    sourceAndMetadataMap,
    -1
This is a bit unclear: I'm not sure when to pass -1 versus another value. It would be helpful to have a separate method for clarity and include a detailed Javadoc explaining when to use each method.
// Use when XYZ
public Object copyEmbeddingXYZ(
    String embeddingKey,
    Object processValue,
    Object existingValue,
    Object embeddingValue,
    Map<String, Object> sourceAndMetadataMap
) {
    return copyEmbedding(embeddingKey, processValue, existingValue, embeddingValue, sourceAndMetadataMap, -1);
}
Okay, I will create a separate method.
@will-hwang could you add tests for the failure cases of this feature? For example, when will the functionality fail such that the user does not see a 200 status from the API?
Overall I have comments similar to @heemin32's, so I will wait until they are addressed before doing a final round of review.
Force-pushed from 3bb0a19 to d69a219
I've added a failure test case here.
Force-pushed from d69a219 to 4400154
inferenceList = sortedResult.v1();
Map<Integer, Integer> originalOrder = sortedResult.v2();
doBatchExecute(inferenceList, results -> {
    int startIndex = 0;
Missing try/catch here.
    MultiGetAction.INSTANCE,
    buildMultiGetRequest(ingestDocumentWrappers),
    ActionListener.wrap(response -> {
        MultiGetItemResponse[] multiGetItemResponses = response.getResponses();
Missing try/catch here.
Force-pushed from 4400154 to ac037de
LGTM
LGTM, thanks.
Build failures are coming from changes in k-NN and are resolved by #1233.
Merged bef96ab into opensearch-project:optimized-processor
Description
This PR includes changes for:
- subBatchExecute: include a filter for existing documents when the skip_existing flag is on

Proposed State [Batch Document Update]
Steps:
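The skip_existing filter step described above can be sketched as follows. The field names "_id" and "text" and the flat map layout are illustrative assumptions, not the processor's real document structure: a document is sent for inference only when no prior copy exists or its text changed since the last ingestion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class SkipExistingFilter {
    // Keeps a document for inference when there is no previously ingested copy
    // (fetched via multiGet) or when its text differs from that copy; unchanged
    // documents are skipped so their existing embeddings can be reused.
    static List<Map<String, Object>> filterForInference(List<Map<String, Object>> incoming,
                                                        Map<String, Map<String, Object>> existingById) {
        List<Map<String, Object>> toInfer = new ArrayList<>();
        for (Map<String, Object> doc : incoming) {
            Map<String, Object> existing = existingById.get((String) doc.get("_id"));
            if (existing == null || !Objects.equals(existing.get("text"), doc.get("text"))) {
                toInfer.add(doc);
            }
        }
        return toInfer;
    }
}
```

This is the filtering that motivates the batch-level multiGet discussed in the review thread: each sub-batch is fetched, compared, and only the changed documents go to the model.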
Related Issues
HLD: #1138
Check List
- Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.