
implement batch document optimization for text embedding processor #1217


Conversation


@will-hwang will-hwang commented Mar 7, 2025

Description

This PR includes changes for:

  1. A refactor of InferenceFilter to simplify the logic for copying embeddings.
  2. An implementation of the batch document update optimization, which overrides InferenceProcessor's subBatchExecute to apply the filter when the skip_existing flag is on.

Proposed State [Batch Document Update]

(Diagram: proposed batch document update flow)

Steps:

  1. Process maps are generated for each IngestDocument based on the defined field map.
  2. If the skip_existing feature is set to true, the process map for each IngestDocument is filtered.
    1. Existing documents are fetched via the OpenSearch client's Multi-Get action so that each existing inference text can be compared against its corresponding IngestDocument.
      1. If a document does not exist, or any exception is thrown, fall back to calling model inference.
    2. The embedding fields in each existing document are located.
      1. The process map is traversed recursively to find embedding fields. The traversal path must be tracked so that the fields can be looked up in the existing document.
      2. Once embedding fields are found, an attempt is made to copy their embeddings from the existing document to the corresponding ingest document.
    3. If eligible, the vector embeddings are copied from the existing document to the corresponding ingest document.
      1. A field is eligible for copying if the inference text in the ingest document matches the text in its corresponding existing document, and embeddings for that text exist in the existing document.
      2. Note that for list values, entries at the same index are compared to determine text equality.
    4. Once eligible fields have been copied, their entries are removed from the process map.
  3. The inference list is generated from the entries remaining in the filtered process map.
  4. The ML Commons InferenceSentence API is invoked with the filtered inference list.
  5. Embeddings for the filtered inference list are generated in ML Commons.
  6. The embeddings are mapped to target fields via the entries defined in the process map.
  7. The embeddings are populated into the target fields of each IngestDocument.
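The filtering in steps 2–3 can be sketched in plain Java. This is a deliberately simplified illustration: the flat `Map<String, String>` shapes and the `filterProcessMap` name are assumptions for this sketch, not the PR's actual `InferenceFilter` API, which works on nested process maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SkipExistingSketch {
    // Hypothetical, simplified shapes: field -> inference text, field -> embedding.
    // Returns the list of texts that still need model inference; copies embeddings
    // for fields whose text is unchanged in the existing document.
    public static List<String> filterProcessMap(
            Map<String, String> processMap,          // field -> new inference text
            Map<String, String> existingTexts,       // field -> previously ingested text
            Map<String, float[]> existingEmbeddings, // field -> previously computed embedding
            Map<String, float[]> target) {           // embeddings copied into the new doc
        List<String> inferenceList = new ArrayList<>();
        for (Map.Entry<String, String> e : processMap.entrySet()) {
            String field = e.getKey();
            String text = e.getValue();
            float[] embedding = existingEmbeddings.get(field);
            // Eligible for copy: same inference text and an existing embedding for it.
            if (text.equals(existingTexts.get(field)) && embedding != null) {
                target.put(field, embedding);        // step 2.3: copy the embedding
            } else {
                inferenceList.add(text);             // step 3: still needs inference
            }
        }
        return inferenceList;
    }
}
```

Only the texts returned by the filter are sent to ML Commons, which is the whole point of the optimization: unchanged fields skip inference entirely.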

Related Issues

HLD: #1138

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@will-hwang will-hwang force-pushed the optimized-text-embedding-processor-batch branch from c04d826 to 516c43a Compare March 8, 2025 04:41

codecov bot commented Mar 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.97%. Comparing base (1a6e58e) to head (3bb0a19).

Additional details and impacted files
@@                    Coverage Diff                    @@
##             optimized-processor    #1217      +/-   ##
=========================================================
+ Coverage                  81.94%   81.97%   +0.02%     
+ Complexity                  2604     1315    -1289     
=========================================================
  Files                        194       97      -97     
  Lines                       8858     4487    -4371     
  Branches                    1498      760     -738     
=========================================================
- Hits                        7259     3678    -3581     
+ Misses                      1016      513     -503     
+ Partials                     583      296     -287     


protected List<List<IngestDocumentWrapper>> cutBatches(List<IngestDocumentWrapper> ingestDocumentWrappers) {
    List<List<IngestDocumentWrapper>> batches = new ArrayList<>();
    for (int i = 0; i < ingestDocumentWrappers.size(); i += this.batchSize) {
Collaborator

This document mentions that the batch_size parameter is deprecated. Should we keep supporting it for 3.0?

Contributor Author

I wanted to keep the changes minimal relative to the existing implementation of batchExecute in core: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/ingest/AbstractBatchingProcessor.java#L48-L74

Do you think we should completely rewrite the function?

Contributor Author

Since there are conflicting explanations for batch_size, I think it's safe to leave it for now.

protected List<List<IngestDocumentWrapper>> cutBatches(List<IngestDocumentWrapper> ingestDocumentWrappers) {
    List<List<IngestDocumentWrapper>> batches = new ArrayList<>();
    for (int i = 0; i < ingestDocumentWrappers.size(); i += this.batchSize) {
        batches.add(ingestDocumentWrappers.subList(i, Math.min(i + this.batchSize, ingestDocumentWrappers.size())));
Collaborator

Using subList may not be efficient for certain data structures. For instance, with a LinkedList, if batchSize is 1, making n calls to subList would result in O(n²) time complexity.
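For reference, a batching variant that avoids repeated subList calls, and so stays linear even on a LinkedList, can be written with a single pass over the list. This is a generic sketch of the reviewer's point, not the code proposed in this PR:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
    // Cuts the input into consecutive batches of at most batchSize elements.
    // A single iteration pass keeps this O(n) for any List implementation,
    // whereas calling subList(i, j) repeatedly on a LinkedList walks from the
    // head each time and degrades to O(n^2) for small batch sizes.
    public static <T> List<List<T>> cutBatches(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>(batchSize);
        for (T item : items) {
            current.add(item);
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // trailing partial batch
        }
        return batches;
    }
}
```

Note that unlike subList, this copies elements into new lists, so the batches are not views of the original list; that trade-off is usually acceptable for ingest batching.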

Contributor Author

This is used as the default behavior for batchExecute (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/ingest/AbstractBatchingProcessor.java#L48-L74), which we extend in our InferenceProcessor. I think it's better to leave the existing logic as is and change only what is needed. Let me know your thoughts.

Collaborator

Got it. Can't we just override subBatchExecute instead of repeating the core code here?

Contributor Author
@will-hwang Mar 11, 2025

If we just override subBatchExecute, we won't be able to use the multiGet operation to fetch all documents at once, since subBatchExecute only receives a subset (often just one) of the documents.

Collaborator
@heemin32 Mar 11, 2025

That is what we want. We should not fetch all documents at once; it might fail if the number of documents is too big. Customers can increase batch_size if they want.

Contributor Author

Okay. With this change we would be issuing multiple multiGet calls, since it is invoked at the sub-batch level. For example, if a user ingests 6 documents with a batch size of 2, there will be 3 multiGet calls, each for a batch of 2.
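The arithmetic above generalizes: the number of sub-batches, and hence multiGet calls, is ceil(totalDocs / batchSize). A one-method sketch (the method name is illustrative, not part of the PR):

```java
public class SubBatchCount {
    // Number of sub-batches produced for n documents with the given batch size,
    // i.e. ceil(n / batchSize) computed with integer arithmetic.
    public static int subBatchCalls(int n, int batchSize) {
        return (n + batchSize - 1) / batchSize;
    }
}
```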

Collaborator

Yes.

assert counter.get() >= 0 : "counter is negative";
});
}
}, e -> { handler.accept(null); }));
Collaborator

If we call handler.accept(null), will the customer see the error?

Contributor Author

No; I will update the PR to include the exception in each document.

Contributor Author

With the latest change, when mget throws an exception, the handler still accepts the ingested documents, with the exception message populated in each document.
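The behavior described here, where a failure is recorded on each document rather than aborting the batch, can be sketched as follows. The `Wrapper` type is a stand-in for IngestDocumentWrapper, whose real API differs; only the pattern is the point:

```java
import java.util.List;
import java.util.function.Consumer;

public class MGetFailureSketch {
    // Stand-in for IngestDocumentWrapper: a document plus an optional exception.
    public static class Wrapper {
        public final String doc;
        public Exception exception;
        public Wrapper(String doc) { this.doc = doc; }
    }

    // On an mget failure, attach the exception to every wrapper and still hand
    // the full list to the handler, so ingestion does not fail outright.
    public static void onMGetFailure(List<Wrapper> wrappers, Exception e, Consumer<List<Wrapper>> handler) {
        for (Wrapper w : wrappers) {
            w.exception = e;
        }
        handler.accept(wrappers);
    }
}
```

This matches the existing processor convention: dependent-service failures surface per document in the bulk response instead of as a rejected request.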

@will-hwang will-hwang force-pushed the optimized-text-embedding-processor-batch branch 2 times, most recently from ddc33f8 to a257a0c Compare March 11, 2025 22:15
@@ -106,4 +112,60 @@ public void doBatchExecute(List<String> inferenceList, Consumer<List<?>> handler
ActionListener.wrap(handler::accept, onException)
);
}

@Override
public void subBatchExecute(List<IngestDocumentWrapper> ingestDocumentWrappers, Consumer<List<IngestDocumentWrapper>> handler) {
Member

Could you divide this method into three parts?

  1. Validations
  2. MGet call
  3. Checking for embeddings

Contributor Author

I refactored the method and added comments to make it more readable. I left the validations at the beginning of the method for a cleaner implementation.

@@ -47,7 +45,7 @@ public Object filterInferenceValue(
Optional<Object> existingValueOptional = ProcessorUtils.getValueFromSource(existingSourceAndMetadataMap, textPath);
Optional<Object> embeddingValueOptional = ProcessorUtils.getValueFromSource(existingSourceAndMetadataMap, embeddingKey);
if (existingValueOptional.isPresent() && embeddingValueOptional.isPresent()) {
return copyEmbeddingForSingleValue(
Member

QQ: why do we change this method here?

Contributor Author

This was a change from the previous PR. Since we're comparing single values and list values as object values, we can use just one method for comparing and copying embeddings.

@will-hwang will-hwang force-pushed the optimized-text-embedding-processor-batch branch from a257a0c to 3bb0a19 Compare March 12, 2025 21:56
handler.accept(Collections.emptyList());
return;
}
List<DataForInference> dataForInferences = getDataForInference(ingestDocumentWrappers);
Collaborator

What happens if this line throws an exception? Shouldn't we catch it and set it in the handler, or will it be handled already?

Contributor Author

It is already handled in that method: here

Collaborator

That's the issue: you need to check every method to ensure exceptions are handled properly, which isn't a future-proof approach. It would be better to handle the exception here as well, ensuring that any exceptions thrown by underlying methods are caught.
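A defensive pattern matching this suggestion wraps the whole step so that any unexpected exception is routed to an error callback rather than escaping the method. A generic sketch, not the PR's actual code:

```java
import java.util.function.Consumer;
import java.util.function.Function;

public class DefensiveSketch {
    // Runs a pipeline step and routes any exception to the error callback
    // instead of letting it propagate out of the calling method. Checked
    // exceptions from lambdas are out of scope for this sketch.
    public static <T, R> void runStep(T input, Function<T, R> step,
                                      Consumer<R> onSuccess, Consumer<Exception> onError) {
        try {
            onSuccess.accept(step.apply(input));
        } catch (Exception e) {
            onError.accept(e);
        }
    }
}
```

With this shape, callers such as subBatchExecute do not need to trust that every helper handles its own exceptions; the catch at the boundary guarantees the handler is always invoked.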

// create a map of documents with key: doc_id and value: doc
Map<String, Map<String, Object>> existingDocuments = createDocumentMap(multiGetItemResponses);
List<DataForInference> filteredDataForInference = filterDataForInference(dataForInferences, existingDocuments);
List<String> filteredInferenceList = constructInferenceTexts(filteredDataForInference);
Collaborator

What happens if an exception is thrown here? Will it be handled gracefully?

Contributor Author

It checks whether an exception was thrown and only creates inference text for valid data inferences: here

Collaborator

Same as above comment.

existingListValue.getFirst(),
embeddingListOptional.get(),
sourceAndMetadataMap,
-1
Collaborator

This is a bit unclear: I'm not sure when to pass -1 versus another value. It would be helpful to have a separate method for clarity, with a detailed Javadoc explaining when to use each method.

// Use when XYZ (placeholder description)
public Object copyEmbeddingXYZ(
    String embeddingKey,
    Object processValue,
    Object existingValue,
    Object embeddingValue,
    Map<String, Object> sourceAndMetadataMap
) {
    return copyEmbedding(embeddingKey, processValue, existingValue, embeddingValue, sourceAndMetadataMap, -1);
}

Contributor Author

Okay, I will create a separate method.

Member
@vibrantvarun left a comment

@will-hwang could you add tests for the failure cases of this feature? For example, when will the functionality fail so that the user does not see a 200 status from the API?

@vibrantvarun (Member)

Overall I have similar comments to @heemin32's, so I will wait until they are addressed before doing a final round of review.

@will-hwang will-hwang force-pushed the optimized-text-embedding-processor-batch branch from 3bb0a19 to d69a219 Compare March 14, 2025 22:30
@will-hwang (Contributor Author)

@will-hwang could you add tests for the failure cases of this feature? For example, when will the functionality fail so that the user does not see a 200 status from the API?

I've added a failure test case here where mget fails with a runtime exception. In summary, the error handling is the same as in our existing implementation: if a dependent service (the OpenSearch client or the ML Commons client) fails with an exception, the IngestDocument is updated with the exception instead of failing the ingestion.

@will-hwang will-hwang force-pushed the optimized-text-embedding-processor-batch branch from d69a219 to 4400154 Compare March 17, 2025 23:13
inferenceList = sortedResult.v1();
Map<Integer, Integer> originalOrder = sortedResult.v2();
doBatchExecute(inferenceList, results -> {
int startIndex = 0;
Collaborator

Missing try/catch here.

MultiGetAction.INSTANCE,
buildMultiGetRequest(ingestDocumentWrappers),
ActionListener.wrap(response -> {
MultiGetItemResponse[] multiGetItemResponses = response.getResponses();
Collaborator

Missing try/catch here.

@will-hwang will-hwang force-pushed the optimized-text-embedding-processor-batch branch from 4400154 to ac037de Compare March 17, 2025 23:58
Collaborator
@heemin32 left a comment

LGTM

Member
@junqiu-lei left a comment

LGTM, thanks.

@heemin32 (Collaborator)

heemin32 commented Mar 18, 2025

Build failures are coming from changes in knn and are resolved by #1233.

@heemin32 heemin32 merged commit bef96ab into opensearch-project:optimized-processor Mar 18, 2025
39 of 48 checks passed
will-hwang added a commit to will-hwang/neural-search that referenced this pull request Mar 20, 2025
will-hwang added a commit to will-hwang/neural-search that referenced this pull request Mar 24, 2025
heemin32 pushed a commit that referenced this pull request Mar 25, 2025
…1238)

* implement single document update scenario for text embedding processor (#1191)

Signed-off-by: Will Hwang <[email protected]>

* implement batch document update scenario for text embedding processor (#1217)

Signed-off-by: Will Hwang <[email protected]>

---------

Signed-off-by: Will Hwang <[email protected]>
ryanbogan pushed a commit to ryanbogan/neural-search that referenced this pull request Apr 10, 2025
YeonghyeonKO pushed a commit to YeonghyeonKO/neural-search that referenced this pull request May 30, 2025
yuye-aws pushed a commit that referenced this pull request Jun 10, 2025