Implement Optimized embedding generation in text embedding processor #1238

Merged: 2 commits merged into opensearch-project:main on Mar 25, 2025

Conversation

@will-hwang (Contributor) commented Mar 20, 2025

Description

This PR merges the previously approved PRs (#1191, #1217) from the feature branch onto main.

See below for benchmark results.

Setup

Version: OpenSearch 3.0 alpha
Cluster: 4 r6g.2xlarge nodes (1 coordinator node, 3 data nodes)

Benchmark Results

Dataset: Trec-covid (https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip)
Ingest Pipeline

{
	"description": "An NLP ingest pipeline",
	"processors": [
		{
			"text_embedding": {
				"model_id": "-l5Dr5UB17VvHPspuB6C",
				"field_map": {
					"passage_text": "passage_embedding"
				},
				"skip_existing": true/false
			}
		}
	]
}

Index Mapping

{
  "settings": {
    "index.knn": true,
    "number_of_shards": 3,
    "default_pipeline": "nlp-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "text": {
          "type": "text"
      }
    }
  }
}

Ingest Latency

The following table presents latency measurements (in milliseconds) of the initial ingest operation with the skip_existing feature enabled and disabled. The Percent Difference column shows the relative performance impact between the two.

| Operation | Doc Size | Batch Size | skip_existing_off latency (ms) | skip_existing_on latency (ms) | Percent Difference |
| --- | --- | --- | --- | --- | --- |
| Single Ingest | 1000 | 1 | 459896.33 | 463579.12 | 0.80% |
| Single Ingest | 2000 | 1 | 932563.3 | 938318.45 | 0.62% |
| Single Ingest | 3000 | 1 | 1400710.43 | 1401216.8 | 0.04% |
| Batch Ingest | 171332 | 200 | 2247191.38 | 2192883.54 | -2.42% |
| Batch Ingest | 171332 | 500 | 2065514.4 | 2011408.73 | -2.62% |
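(For reference, the Percent Difference column corresponds to (skip_existing_on - skip_existing_off) / skip_existing_off x 100; for example, (463579.12 - 459896.33) / 459896.33 ≈ 0.80%.)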

Update Latency

The following table presents latency measurements (in milliseconds) of the update operation, performed after an identical ingest operation, with the skip_existing feature enabled and disabled. The Percent Difference column shows the relative performance impact between the two.

| Operation | Doc Size | Batch Size | skip_existing_off latency (ms) | skip_existing_on latency (ms) | Percent Difference |
| --- | --- | --- | --- | --- | --- |
| Single Update | 1000 | 1 | 459896.33 | 100957.91 | -78.05% |
| Single Update | 2000 | 1 | 932563.3 | 201715.16 | -78.37% |
| Single Update | 3000 | 1 | 1400710.43 | 292020.02 | -79.15% |
| Batch Update | 171332 | 200 | 2247191.38 | 352767.2 | -84.30% |
| Batch Update | 171332 | 500 | 2065514.4 | 293164.93 | -85.81% |

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Mar 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (fdd4c8c) to head (3dc9d7c).
Report is 1 commit behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #1238       +/-   ##
============================================
- Coverage     81.90%       0   -81.91%     
============================================
  Files           208       0      -208     
  Lines          9594       0     -9594     
  Branches       1632       0     -1632     
============================================
- Hits           7858       0     -7858     
+ Misses         1105       0     -1105     
+ Partials        631       0      -631     


@junqiu-lei (Member) left a comment

Thanks for sharing the benchmark results, LGTM.

@heemin32 (Collaborator)

Could you share the index mapping for the knn field?

@heemin32 (Collaborator)

Can we test with shard number 3?

{
  "settings": {
    "index.knn": true,
    "number_of_shards": 3,
    "default_pipeline": "nlp-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "text": {
          "type": "text"
      }
    }
  }
}

}
try {
    if (CollectionUtils.isEmpty(ingestDocumentWrappers)) {
        handler.accept(Collections.emptyList());
Collaborator

nit: Should we simply return ingestDocumentWrappers rather than construct a new empty list?

Contributor Author

sure, either should work

try {
    for (IngestDocumentWrapper ingestDocumentWrapper : ingestDocumentWrappers) {
        // The IngestDocumentWrapper might already run into exception and not sent for inference. So here we only
        // set exception to IngestDocumentWrapper which doesn't have exception before.
Collaborator

I think we call updateWithExceptions(ingestDocumentWrappers, e) in multiple catch blocks, but we only account for the case where an IngestDocumentWrapper has already run into an exception and was not sent for inference in this one place, which does not make sense. It's possible for another catch block to override the exception of an IngestDocumentWrapper incorrectly, e.g. the IngestDocumentWrapper can have exception A and not be sent for inference, but we can then override it with an inference-related exception.

I think we should either propagate the exception to a single catch block and handle it there with the consideration that the IngestDocumentWrapper may have already run into an exception, or we should modify the updateWithExceptions function to account for that.
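A minimal sketch of the second option, assuming IngestDocumentWrapper exposes getException(), getIngestDocument(), and update(IngestDocument, Exception) as in OpenSearch core; the method name and two-argument form are taken from the comment above, not the actual plugin code:

```java
// Hypothetical sketch: attach the new exception only to wrappers that do not
// already carry one, so each document keeps the first failure it encountered.
private void updateWithExceptions(List<IngestDocumentWrapper> ingestDocumentWrappers, Exception e) {
    for (IngestDocumentWrapper wrapper : ingestDocumentWrappers) {
        if (wrapper.getException() == null) {
            wrapper.update(wrapper.getIngestDocument(), e);
        }
    }
}
```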

Contributor Author

This was added to ensure that if any of the lines handling the exceptions in the for loop throws, it will be handled properly. I understand that this may not be needed, but it was brought up as a possible risk in a previous PR.

Collaborator

When would this situation occur? The IngestDocumentWrapper might have already encountered an exception and was not sent for inference. It would be helpful if you could point out the code where the exception is set before inference.

I agree that there's an inconsistency in how we handle this—checking if an exception is already set only in this specific place but not elsewhere.

Contributor Author

see link

Contributor Author

maybe we can propagate the exception to top level and set the exceptions there?

Collaborator

I don't think we can propagate the exception. One option is to remove this check and override the exception for all documents. When an exception occurs here, retaining the original exception may not make a significant difference.

Contributor Author

Another option is to keep the check for all exception cases, which will store only the first exception encountered for each document.

Contributor Author

this way, we'll know which ingest document failed with which exact exception

Collaborator

that would work as well.

Collaborator

I wonder, do we actually check whether the doc has already run into an exception and skip embedding generation for it? If so, it makes sense to keep the actual exception and fail the ingest of that doc, which can be helpful for debugging.

);
}
handler.accept(ingestDocumentWrappers);
} catch (Exception e) {
@bzhangam (Collaborator) commented Mar 20, 2025

I wonder why we want to explicitly try/catch the exception here. If we don't catch it here, won't it be caught automatically by the exception handler?

Contributor Author

The exception handler only catches exceptions thrown in the doBatchExecute method here. We do not catch exceptions thrown by any of the lines between 221-236.

}
}
handler.accept(ingestDocumentWrappers);
} catch (Exception e) {
Collaborator

Why do we need this try/catch? If an exception can be thrown while we update the exceptions in ingestDocumentWrappers, then in the catch block we will throw again and never invoke handler.accept. And if an exception can be thrown by handler.accept itself, shouldn't the handler's own exception handler deal with it rather than handling it here?

Collaborator

Good catch. handler.accept() should go into a finally block.
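A minimal sketch of that suggestion (the names are taken from the snippets in this review, not the actual plugin code):

```java
// Hypothetical sketch: hand the batch back in a finally block so downstream
// processing continues even if updating the wrappers itself throws.
try {
    updateWithExceptions(ingestDocumentWrappers, e);
} finally {
    handler.accept(ingestDocumentWrappers);
}
```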

}

});
} catch (Exception e) {
Collaborator

Why do we need this catch block? If we are worried about an exception from sortByLengthAndReturnOriginalOrder, can we simply wrap just that call in a try/catch? It's not easy to understand each try/catch block when we nest them this much.

Contributor Author

I'm okay with simplifying the try/catch, or even removing it if we know for sure the underlying methods will not throw exceptions. But this was brought up as a possible concern in the previous PR, so I decided to add them.

Collaborator

I think @bzhangam is talking about

try {
    sortByLengthAndReturnOriginalOrder();
} catch (Exception e) {
    // handle it here
}

where we only wrap a try/catch around sortByLengthAndReturnOriginalOrder, not remove it.

I think having a try/catch wrapping the entire method might be simpler and safer.
If the code is hard to read, we can introduce a method that returns a lambda.

protected void doSubBatchExecute(
        List<IngestDocumentWrapper> ingestDocumentWrappers,
        List<String> inferenceList,
        List<DataForInference> dataForInferences,
        Consumer<List<IngestDocumentWrapper>> handler
    ) {
   ...
   doBatchExecute(inferenceList, batchExecuteHandler());
   ...
}


private Consumer<InferenceResult> batchExecuteHandler() {
 ...
}

Collaborator

Yeah. I'm thinking more about whether we can make the try/catch cleaner. When we add a try/catch, we usually want some special logic to handle the exception, but here we do the same thing across multiple nested try/catch blocks, which complicates the code.

Collaborator

Agree. I think continuing to pass the handler to downstream methods is not a good practice, since it forces us to keep adding all of these try/catch blocks in every method. In general, we should pass the handler only when we make an async call.

@bzhangam, if you have a more specific recommendation on how we should refactor the code here, please provide one.

Contributor Author

I've refactored the code to pull the try/catch out into a separate method as Heemin suggested. Do you have other suggestions on how to make the try/catch cleaner?

@bzhangam (Collaborator) commented Mar 21, 2025

By using wrap, if we hit an exception in onResponse we automatically invoke onFailure to handle it, which is helpful in our use case since what we want to do in the try/catch is exactly the same thing we want to do in onFailure.

    static <Response> ActionListener<Response> wrap(final CheckedConsumer<Response, ? extends Exception> onResponse, final Consumer<Exception> onFailure) {
        return new ActionListener<Response>() {
            public void onResponse(Response response) {
                try {
                    onResponse.accept(response);
                } catch (Exception e) {
                    this.onFailure(e);
                }

            }

            public void onFailure(Exception e) {
                onFailure.accept(e);
            }
        };
    }

With this, I think we can remove some of the try/catch in the onResponse part.
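For illustration, a hedged sketch of how this could look at the MultiGetAction call site discussed in this review (openSearchClient, buildMultiGetRequest, and updateWithExceptions come from the snippets here; handleMultiGetResponse is a placeholder for whatever the onResponse body does):

```java
// Hypothetical sketch: the response lambda may throw freely because
// ActionListener.wrap routes any exception thrown in onResponse into the
// failure consumer, so both paths share the same error handling.
openSearchClient.execute(
    MultiGetAction.INSTANCE,
    buildMultiGetRequest(ingestDocumentWrappers),
    ActionListener.wrap(
        response -> handleMultiGetResponse(response, ingestDocumentWrappers, handler),
        e -> updateWithExceptions(ingestDocumentWrappers, handler, e)
    )
);
```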

// are copied
openSearchClient.execute(
MultiGetAction.INSTANCE,
buildMultiGetRequest(ingestDocumentWrappers),
Collaborator

Do we need to get all docs? What if some docs don't have a value for inference?

Contributor Author

We need to get all docs first to check whether or not they have a value for inference.

Collaborator

If a doc in the request doesn't have a value for the neural field, we should not generate an embedding for it, since there is no value there. In that case, I think we can skip pulling the existing doc, because we know there is no neural field we need to run inference on.
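A minimal sketch of that idea (hasInferenceFields is a hypothetical helper for whatever check detects mapped source fields with non-empty values, and multiGetListener stands in for the listener built above; the other names come from the snippets in this review):

```java
// Hypothetical sketch: only look up existing versions of documents that actually
// have something to embed; documents with no inference input are skipped entirely.
List<IngestDocumentWrapper> candidates = ingestDocumentWrappers.stream()
    .filter(wrapper -> wrapper.getException() == null)                   // skip docs that already failed
    .filter(wrapper -> hasInferenceFields(wrapper.getIngestDocument()))  // skip docs with nothing to embed
    .collect(Collectors.toList());

if (candidates.isEmpty()) {
    handler.accept(ingestDocumentWrappers);  // nothing to look up or embed
} else {
    openSearchClient.execute(MultiGetAction.INSTANCE, buildMultiGetRequest(candidates), multiGetListener);
}
```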

} else {
doSubBatchExecute(ingestDocumentWrappers, filteredInferenceList, filteredDataForInference, handler);
}
} catch (Exception e) {
Collaborator

Same as above: do we really want to update the exception for all docs? And do we really need multiple try/catch blocks here? It seems like we are doing the same thing in each catch block. Can we propagate the exception up and simply handle it at the top layer?

Contributor Author

ActionListener.wrap(response -> {}) spawns a new thread, and needs to be handled separately

Consumer<List<IngestDocumentWrapper>> handler
) {
try {
Tuple<List<String>, Map<Integer, Integer>> sortedResult = sortByLengthAndReturnOriginalOrder(inferenceList);
Collaborator

I wonder why we need to sort the inference list here. It seems like we sort it and then restore the original order without doing anything special in between.

Contributor Author

The sort is done before the inference call is made, and the results are the embeddings retrieved from that call. I think it's safe to leave as is for now.

Collaborator

I understand the sort is done before the inference call, but I don't know why we sort at all. We simply send the sorted data to the inference call without any additional logic, e.g. grouping or batching the data. If that's the case, this seems like redundant logic.

Collaborator

I think we can create an issue to track this; no need to address it in this PR. It's probably better to check with the author to see why this logic was added here.
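For context on this thread, a hedged sketch of what a method named sortByLengthAndReturnOriginalOrder plausibly does, based only on its name and the Tuple<List<String>, Map<Integer, Integer>> return type shown in the snippet above (not the actual plugin code):

```java
// Hypothetical sketch: sort the texts by length for the inference call and keep a
// map from sorted position to original position so the returned embeddings can be
// written back in the original document order afterwards.
private Tuple<List<String>, Map<Integer, Integer>> sortByLengthAndReturnOriginalOrder(List<String> inferenceList) {
    List<Integer> originalIndices = new ArrayList<>();
    for (int i = 0; i < inferenceList.size(); i++) {
        originalIndices.add(i);
    }
    originalIndices.sort(Comparator.comparingInt(i -> inferenceList.get(i).length()));

    List<String> sorted = new ArrayList<>();
    Map<Integer, Integer> sortedToOriginal = new HashMap<>();
    for (int sortedPosition = 0; sortedPosition < originalIndices.size(); sortedPosition++) {
        sorted.add(inferenceList.get(originalIndices.get(sortedPosition)));
        sortedToOriginal.put(sortedPosition, originalIndices.get(sortedPosition));
    }
    return Tuple.tuple(sorted, sortedToOriginal);
}
```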

@will-hwang (Contributor, Author) commented Mar 20, 2025

> Can we test with shard number 3?

@heemin32 updated in description

@will-hwang force-pushed the optimized-processor branch 6 times, most recently from e77f881 to fca9a4f on March 24, 2025 07:07
),
e -> {
// When an exception is thrown for MultiGetAction, set the exception on all ingestDocumentWrappers
updateWithExceptions(ingestDocumentWrappers, handler, e);
Collaborator

I think we should not set it for all of them, since some docs may not have data for inference and we are not fetching the existing doc for those.

updateWithExceptions(ingestDocumentWrappers, handler, exception);
}
);
} catch (Exception e) {
Collaborator

If we don't plan to do something special here I think we can rely on the caller to handle the exception.

Contributor Author

removed to handle it at caller

}
handler.accept(ingestDocumentWrappers);
});
} catch (Exception e) {
Collaborator

I think this is not needed, since the exception handler will catch this exception; it is wrapped by ActionListener.wrap in doBatchExecute.

Contributor Author

removed

}
handler.accept(ingestDocumentWrappers);
} catch (Exception ex) {
handler.accept(null);
Collaborator

I think we should log the error here to help with debugging, since we cannot attach the exception to the doc in this case.

Contributor Author

added

@will-hwang force-pushed the optimized-processor branch from fca9a4f to 3dc9d7c on March 24, 2025 22:31
@heemin32 (Collaborator) left a comment

LGTM. Thanks.

@heemin32 merged commit f5edc0a into opensearch-project:main on Mar 25, 2025
47 of 50 checks passed
ryanbogan pushed a commit to ryanbogan/neural-search that referenced this pull request Apr 10, 2025
…pensearch-project#1238)

* implement single document update scenario for text embedding processor (opensearch-project#1191)

Signed-off-by: Will Hwang <[email protected]>

* implement batch document update scenario for text embedding processor (opensearch-project#1217)

Signed-off-by: Will Hwang <[email protected]>

---------

Signed-off-by: Will Hwang <[email protected]>
YeonghyeonKO pushed a commit to YeonghyeonKO/neural-search that referenced this pull request May 30, 2025
…pensearch-project#1238)

* implement single document update scenario for text embedding processor (opensearch-project#1191)

Signed-off-by: Will Hwang <[email protected]>

* implement batch document update scenario for text embedding processor (opensearch-project#1217)

Signed-off-by: Will Hwang <[email protected]>

---------

Signed-off-by: Will Hwang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>
yuye-aws pushed a commit that referenced this pull request Jun 10, 2025
#1342)

* Implement Optimized embedding generation in text embedding processor (#1238)

* implement single document update scenario for text embedding processor (#1191)

Signed-off-by: Will Hwang <[email protected]>

* implement batch document update scenario for text embedding processor (#1217)

Signed-off-by: Will Hwang <[email protected]>

---------

Signed-off-by: Will Hwang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Going from alpha1 to beta1 for 3.0 release (#1245)

Signed-off-by: yeonghyeonKo <[email protected]>

* Implement Optimized embedding generation in sparse encoding processor (#1246)

Signed-off-by: Will Hwang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Implement Optimized embedding generation in text and image embedding processor (#1249)

Signed-off-by: will-hwang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Inner hits support with hybrid query (#1253)

* Inner Hits in Hybrid query

Signed-off-by: Varun Jain <[email protected]>

* Inner hits support with hybrid query

Signed-off-by: Varun Jain <[email protected]>

* Add changelog

Signed-off-by: Varun Jain <[email protected]>

* fix integ tests

Signed-off-by: Varun Jain <[email protected]>

* Modify comment

Signed-off-by: Varun Jain <[email protected]>

* Explain test case

Signed-off-by: Varun Jain <[email protected]>

* Optimize inner hits count calculation method

Signed-off-by: Varun Jain <[email protected]>

---------

Signed-off-by: Varun Jain <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Support custom tags in semantic highlighter (#1254)

Signed-off-by: yeonghyeonKo <[email protected]>

* Add neural stats API (#1256)

* Add neural stats API

Signed-off-by: Andy Qin <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Added release notes for 3.0 beta1 (#1252)

* Added release notes for 3.0 beta1

Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Update semantic highlighter test model (#1259)

Signed-off-by: Junqiu Lei <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Fix the edge case when the value of a fieldMap key in ingestDocument is empty string (#1257)

Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Hybrid query should call rewrite before creating weight (#1268)

* Hybrid query should call rewrite before creating weight

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>

* Awaits fix

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>

* Rewrite with searcher

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>

* Feature flag issue

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>

---------

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Support phasing off SecurityManager usage in favor of Java Agent (#1265)

Signed-off-by: Gulshan <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272)

Signed-off-by: Junqiu Lei <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add semantic field mapper. (#1225)

Signed-off-by: Bo Zhang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Increment version to 3.0.0-SNAPSHOT (#1286)

Signed-off-by: opensearch-ci-bot <[email protected]>
Co-authored-by: opensearch-ci-bot <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Remove beta1 qualifier (#1292)

Signed-off-by: Peter Zhu <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Fix for merging scoreDocs when totalHits are greater than 1 and fieldDocs are 0 (#1295) (#1296)

(cherry picked from commit 6f3aabb)

Co-authored-by: Varun Jain <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* add release notes for 3.0 (#1287)

Signed-off-by: will-hwang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Allow maven to publish to all versions (#1300) (#1301)

Signed-off-by: Peter Zhu <[email protected]>
(cherry picked from commit c5625db)

Co-authored-by: Peter Zhu <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* [FEAT] introduce new FixedStringLengthChunker

Signed-off-by: yeonghyeonKo <[email protected]>

* [TEST] initial test cases for FixedStringLengthChunker

Signed-off-by: yeonghyeonKo <[email protected]>

* [FIX] gradlew spotlessApply

Signed-off-by: yeonghyeonKo <[email protected]>

* [REFACTOR] remove unnecessary comments

Signed-off-by: yeonghyeonKo <[email protected]>

* [Performance Improvement] Add custom bulk scorer for hybrid query (2-3x faster) (#1289)

Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add TextChunkingProcessor stats (#1308)

* Add TextChunkingProcessor stats

Signed-off-by: Andy Qin <[email protected]>

# Conflicts:
#	CHANGELOG.md

* Update unit and integ tests

Signed-off-by: Andy Qin <[email protected]>

---------

Signed-off-by: Andy Qin <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Update Lucene dependencies (#1336)

* Update Lucene dependencies

Signed-off-by: Ryan Bogan <[email protected]>

* Add changelog entry

Signed-off-by: Ryan Bogan <[email protected]>

* Update model request body for bwc and integ tests

Signed-off-by: Ryan Bogan <[email protected]>

---------

Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* [REFACTOR] modify algorithm name and related parts

Signed-off-by: yeonghyeonKo <[email protected]>

* [REFACTOR] update test codes along with the change in CharacterLengthChunker

Signed-off-by: yeonghyeonKo <[email protected]>

* [REFACTOR] remove defensive check to prevent adding redundant code lines

Signed-off-by: yeonghyeonKo <[email protected]>

* Update CharacterLengthChunker to FixedCharLengthChunker

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Update ChunkerFactory

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Update CharacterLengthChunkerTests to FixedCharLengthChunkerTests

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* [FIX] handle a corner case where the content is shorter than charLimit

Signed-off-by: yeonghyeonKo <[email protected]>

* [TEST] Add integration test codes for fixed_char_length chunking algorithm

Signed-off-by: yeonghyeonKo <[email protected]>

* [TEST] integration test code for cascaded pipeline

Signed-off-by: yeonghyeonKo <[email protected]>

* Support analyzer-based neural sparse query (#1088)

* merge main; add analyzer impl

Signed-off-by: zhichao-aws <[email protected]>

* two phase adaption

Signed-off-by: zhichao-aws <[email protected]>

* two phase adaption

Signed-off-by: zhichao-aws <[email protected]>

* remove analysis

Signed-off-by: zhichao-aws <[email protected]>

* lint

Signed-off-by: zhichao-aws <[email protected]>

* update

Signed-off-by: zhichao-aws <[email protected]>

* address comments

Signed-off-by: zhichao-aws <[email protected]>

* tests

Signed-off-by: zhichao-aws <[email protected]>

* modify plugin security policy

Signed-off-by: zhichao-aws <[email protected]>

* change log

Signed-off-by: zhichao-aws <[email protected]>

* address comments

Signed-off-by: zhichao-aws <[email protected]>

* modify to package-private

Signed-off-by: zhichao-aws <[email protected]>

---------

Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Fixed score value as null for single shard for sorting (#1277)

* Fixed score value as null for single shard for sorting

Signed-off-by: Owais <[email protected]>

* Addressed comment

Signed-off-by: Owais <[email protected]>

* Addressed more comments

Signed-off-by: Owais <[email protected]>

* Added UT

Signed-off-by: Owais <[email protected]>

---------

Signed-off-by: Owais <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add IT for neural sparse query + bert-uncased mbert-uncased analyzer (#1279)

* add it

Signed-off-by: zhichao-aws <[email protected]>

* change log

Signed-off-by: zhichao-aws <[email protected]>

---------

Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add WithFieldName implementation to QueryBuilders (#1285)

Signed-off-by: Owais <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* [AUTO] Increment version to 3.1.0-SNAPSHOT (#1288)

* Increment version to 3.1.0-SNAPSHOT

Signed-off-by: opensearch-ci-bot <[email protected]>

* Update build.gradle

Signed-off-by: Peter Zhu <[email protected]>

---------

Signed-off-by: opensearch-ci-bot <[email protected]>
Signed-off-by: Peter Zhu <[email protected]>
Co-authored-by: opensearch-ci-bot <[email protected]>
Co-authored-by: Peter Zhu <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* add release notes for 3.0 (#1298)

Signed-off-by: will-hwang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Return bad request for invalid stat parameters in stats API (#1291)

Signed-off-by: Andy Qin <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add semantic mapping transformer. (#1276)

Signed-off-by: Bo Zhang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add semantic ingest processor. (#1309)

Signed-off-by: Bo Zhang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* [Performance Improvement] Add custom bulk scorer for hybrid query (2-3x faster) (#1289)

Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Implement the query logic for the semantic field. (#1315)

Signed-off-by: Bo Zhang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Support custom weights params in RRF (#1322)

* Support Weights params in RRF

Signed-off-by: Varun Jain <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* add validation for invalid nested hybrid query (#1305)

* add validation for nested hybrid query

Signed-off-by: will-hwang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add stats tracking for semantic highlighting (#1327)

* Add stats tracking for semantic highlighting

Signed-off-by: Junqiu Lei <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Update Lucene dependencies (#1336)

* Update Lucene dependencies

Signed-off-by: Ryan Bogan <[email protected]>

* Add changelog entry

Signed-off-by: Ryan Bogan <[email protected]>

* Update model request body for bwc and integ tests

Signed-off-by: Ryan Bogan <[email protected]>

---------

Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Enhance semantic field to allow to enable/disable chunking. (#1337)

* Implement the query logic for the semantic field.

Signed-off-by: Bo Zhang <[email protected]>

* Enhance semantic field to allow to enable/disable chunking.

Signed-off-by: Bo Zhang <[email protected]>

---------

Signed-off-by: Bo Zhang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* [REFACTOR] modify algorithm name and related parts

Signed-off-by: yeonghyeonKo <[email protected]>

* Update CHANGELOG.md

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* [FEAT] Add fixed_char_length chunking algorithm to STAT manager

Signed-off-by: yeonghyeonKo <[email protected]>

* [TEST] Add integration test codes for fixed_char_length chunking algorithm

Signed-off-by: yeonghyeonKo <[email protected]>

* [TEST] integration test code for cascaded pipeline

Signed-off-by: yeonghyeonKo <[email protected]>

* Going from alpha1 to beta1 for 3.0 release (#1245)

Signed-off-by: yeonghyeonKo <[email protected]>

* Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272)

Signed-off-by: Junqiu Lei <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add semantic field mapper. (#1225)

Signed-off-by: Bo Zhang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Add semantic mapping transformer. (#1276)

Signed-off-by: Bo Zhang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>

* Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272)

Signed-off-by: Junqiu Lei <[email protected]>

* Add semantic field mapper. (#1225)

Signed-off-by: Bo Zhang <[email protected]>

* Add semantic mapping transformer. (#1276)

Signed-off-by: Bo Zhang <[email protected]>

* [FIX] minor typo

Signed-off-by: yeonghyeonKo <[email protected]>

* [REFACTOR] adopt FixedTokenLengthChunker's loop strategy for robust final chunking

Signed-off-by: yeonghyeonKo <[email protected]>

* [TEST] sum the number of processors and their executions correctly in TextChunkingProcessorIT

Signed-off-by: yeonghyeonKo <[email protected]>

* [REFACTOR] gradlew spotlessApply

Signed-off-by: yeonghyeonKo <[email protected]>

---------

Signed-off-by: Will Hwang <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>
Signed-off-by: will-hwang <[email protected]>
Signed-off-by: Varun Jain <[email protected]>
Signed-off-by: Andy Qin <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: Junqiu Lei <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Harsha Vamsi Kalluri <[email protected]>
Signed-off-by: Gulshan <[email protected]>
Signed-off-by: Bo Zhang <[email protected]>
Signed-off-by: opensearch-ci-bot <[email protected]>
Signed-off-by: Peter Zhu <[email protected]>
Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: Owais <[email protected]>
Co-authored-by: Will Hwang <[email protected]>
Co-authored-by: Martin Gaievski <[email protected]>
Co-authored-by: Varun Jain <[email protected]>
Co-authored-by: Junqiu Lei <[email protected]>
Co-authored-by: Andy <[email protected]>
Co-authored-by: Chloe Gao <[email protected]>
Co-authored-by: Harsha Vamsi Kalluri <[email protected]>
Co-authored-by: Gulshan <[email protected]>
Co-authored-by: Bo Zhang <[email protected]>
Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com>
Co-authored-by: opensearch-ci-bot <[email protected]>
Co-authored-by: Peter Zhu <[email protected]>
Co-authored-by: Ryan Bogan <[email protected]>
Co-authored-by: zhichao-aws <[email protected]>
Co-authored-by: Owais Kazi <[email protected]>