Add Hybrid, Semantic and Multimodal search test procedures in neural search workload #624


Merged
8 commits merged into opensearch-project:main on May 16, 2025

Conversation

@weijia-aws (Contributor) commented Apr 22, 2025

Description

  • This PR adds test procedures for hybrid search and semantic search with the neural query clause in the neural search workload
  • Fixes a bug when creating the sparse search ingestion pipeline: the attribute name in field_map should be text instead of passage_text, since the dataset in use only has a text field

In commit 5, I added another test procedure to support multimodal search

Issues Resolved

This will partially resolve #597; we will add more support (multimodal search) in future PRs

Testing

  • New functionality includes testing


Backport to Branches:

  • 6
  • 7
  • 1
  • 2
  • 3

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Weijia Zhao <[email protected]>
@martin-gaievski (Member) left a comment


What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that's because the number of replicas is > 0, and with one node that leads to an unstable yellow state, meaning the index health check will always fail.

@weijia-aws (Contributor Author)

> What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that's because the number of replicas is > 0, and with one node that leads to an unstable yellow state, meaning the index health check will always fail.

Here's the index setting: since we don't define the number_of_replicas parameter, the default value will be 1

@martin-gaievski (Member)

> What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that's because the number of replicas is > 0, and with one node that leads to an unstable yellow state, meaning the index health check will always fail.
>
> Here's the index setting: since we don't define the number_of_replicas parameter, the default value will be 1

You can change the default to 0 as part of your workload. If you want to keep the default of 1, then you need to do something about the check_cluster_health operation; it will go into an infinite cycle when there is only one data node.

@weijia-aws (Contributor Author)

> What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that's because the number of replicas is > 0, and with one node that leads to an unstable yellow state, meaning the index health check will always fail.
>
> Here's the index setting: since we don't define the number_of_replicas parameter, the default value will be 1
>
> You can change the default to 0 as part of your workload. If you want to keep the default of 1, then you need to do something about the check_cluster_health operation; it will go into an infinite cycle when there is only one data node.

I see. I haven't run into this issue since my cluster has two data nodes, but this setting should depend entirely on the number of data nodes in the cluster used to run the benchmark. I will mention it in the README file instead.
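The single-node failure mode described above can be sketched as a simple rule of thumb (illustrative only, not the actual OpenSearch allocation logic): an index stays yellow whenever its replica shards cannot all be allocated, which happens when `number_of_replicas` exceeds the number of data nodes minus one, because a replica never shares a node with its primary.

```python
# Simplified model of index health vs. replica count; ignores shard
# allocation limits, disk watermarks, and other real-world factors.
def expected_index_health(data_nodes: int, number_of_replicas: int) -> str:
    """Return 'green' if all replicas can be allocated, else 'yellow'."""
    return "green" if number_of_replicas <= data_nodes - 1 else "yellow"

# With the default number_of_replicas = 1, a single-node cluster stays
# yellow and the check-cluster-health operation never succeeds:
print(expected_index_health(1, 1))  # yellow
print(expected_index_health(2, 1))  # green
print(expected_index_health(1, 0))  # green
```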

Signed-off-by: Weijia Zhao <[email protected]>
@martin-gaievski (Member)

Please add the following to quora.json where we have the mapping:

  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fielddata": true
      },

The addition is the "fielddata": true flag, which allows using text field content in aggregations. This is a workaround for this dataset, as it effectively has just two fields: text and the corresponding embeddings. Adding this flag should not affect the performance of non-aggregation operations.
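For illustration, the kind of request body this flag unlocks might look like the sketch below (a hypothetical terms aggregation on the text-typed `id` field; without `"fielddata": true` in the mapping, OpenSearch rejects terms aggregations on text fields):

```python
# Hypothetical aggregation request body for the workload's quora index;
# the field name and bucket size are illustrative.
agg_request = {
    "size": 0,  # skip hits, return only aggregation buckets
    "aggs": {
        "ids": {
            "terms": {
                "field": "id",  # text field made aggregatable via fielddata
                "size": 10,
            }
        }
    },
}
```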

@martin-gaievski (Member) left a comment


A few minor asks for each parameter file:

  • use multiple search clients; I think it's 8 for other workloads. The param is "search_clients": 8
  • we need to ingest all documents: "ingest_percentage": 100
  • increase k; a typical production query has k = 100
  • don't fix the throughput; set it to 0 so clients squeeze out the maximum possible throughput: "target_throughput": 0
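Taken together, a parameter file honoring these asks might contain something like the following fragment (values illustrative, merged with whatever other parameters the workload already defines):

```json
{
  "search_clients": 8,
  "ingest_percentage": 100,
  "k": 100,
  "target_throughput": 0
}
```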

@weijia-aws weijia-aws changed the title Add Hybrid and Semantic search test procedures in neural search workload Add Hybrid, Semantic and Multimodal search test procedures in neural search workload May 5, 2025
},
{
"name": "abo",
"base-url": "https://github.com/weijia-aws/neural-search/releases/download/abo",
Contributor Author


Need some help uploading this dataset to a public repo

Collaborator


Will do this after the PR is merged. Can be done as a separate PR.

@martin-gaievski (Member)

@weijia-aws I think this workload does not have a way to enable/disable concurrent segment search dynamically. Check out noaa_semantic_search; we have done it via a parameter (ref). Having it on or off can affect response times significantly.

@weijia-aws (Contributor Author)

> @weijia-aws I think this workload does not have a way to enable/disable concurrent segment search dynamically. Check out noaa_semantic_search; we have done it via a parameter (ref). Having it on or off can affect response times significantly.

Sure, will add it in the new revision

@martin-gaievski (Member) left a comment


Looks good to me, thanks @weijia-aws

@gkamat (Collaborator) left a comment


Some comments are suggestions for you to consider.

Comment on lines +22 to +23
"name": "{{method | default('hnsw')}}",
"engine": "{{engine | default('lucene')}}",
Collaborator


Suggest using more clearly named parameters, such as vector_engine or graph_algorithm.

Contributor Author


Good suggestion. Currently these parameter names are used when creating the model across the whole package, so I want to keep them the same; this could be an improvement for the package as a whole. Also, we provide the parameter descriptions in the README file, so it should not cause confusion.

},
{
"name": "abo",
"base-url": "https://github.com/weijia-aws/neural-search/releases/download/abo",
Collaborator


Please move to the standard repository after this PR is merged.

Comment on lines +6 to +28
{{ benchmark.collect(parts="../../common_operations/delete_index.json") }},
{
"operation": "delete-ingest-pipeline"
},
{
"operation": "delete-ml-model-sentence-transformer"
},
{
"operation": "put-cluster-settings"
},
{%- if concurrent_segment_search_enabled is defined %}
{
"operation": "put-concurrent-segment-search-setting"
},
{%- endif %}
{
"operation": "register-ml-model-sentence-transformer"
},
{
"operation": "deploy-ml-model"
},
{
"operation": "create-text-embedding-processor-ingest-pipeline"
Collaborator


Many of these sections that deal with deleting and creating the ingest pipelines, adding the concurrent search setting, and deploying the model seem repetitive, with perhaps the model type as a variable. Consider consolidating them into a common fragment that you can include. Again, it may not be possible, but just a suggestion.

Contributor Author


I tried that before, but there are a lot of different steps, mainly because the model and pipeline are different.

params = self._params
with open('model_id.json', 'r') as f:
    d = json.loads(f.read())
params['body']['query']['hybrid']['queries'][1]['neural']['passage_embedding']['model_id'] = d['model_id']
Collaborator


Would be more readable if you used an intermediate variable. Something like:

q = params['body']['query']['hybrid']['queries'][1]

Contributor Author


Updated

    d = json.loads(f.read())
params['body']['query']['hybrid']['queries'][2]['neural']['passage_embedding']['model_id'] = d['model_id']

def tokenize_query(query_text: str) -> List[str]:
Collaborator


Some details about how this function works (creating bigrams and using them for queries) would be helpful for the user if added as a description comment at the top.

Contributor Author


Updated
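For readers following along, here is a minimal sketch of a bigram-producing tokenizer of the kind discussed (the actual tokenize_query in workload.py may differ in details):

```python
from typing import List

def tokenize_query(query_text: str) -> List[str]:
    """Split a query into lowercase unigrams plus adjacent-word bigrams.

    Bigrams let match queries hit short phrases as well as single terms.
    """
    terms = query_text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(terms, terms[1:])]
    return terms + bigrams

print(tokenize_query("Neural search on OpenSearch"))
# ['neural', 'search', 'on', 'opensearch',
#  'neural search', 'search on', 'on opensearch']
```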

* `index_settings`: A list of index settings. Index settings defined elsewhere (e.g. `number_of_replicas`) need to be overridden explicitly.
* `ingest_percentage` (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested.
* `iterations` Number of test iterations of each search client executes.
* `k` (default: 10) Number of nearest neighbors are returned.
* `iterations`: Number of test iterations of each search client executes.
Collaborator


that each search client executes

* `refresh_interval` (default: "5s") Interval to refresh the index in seconds
* `rank_constant` (default: 60): A constant added to each document’s rank before calculating the reciprocal score. Only applicable to Hybrid search with score-ranker-processor enabled
* `refresh_interval` (default: "5s"): Interval to refresh the index in seconds
* `region`: AWS region
Collaborator


Consider leaving 'AWS' out if generic.

* `requests_cache_enabled` (default: false): Enables or disables the index request cache
* `search_clients`: Number of clients that issue search requests.
* `search_pipeline_processor`: Types of processors for hybrid search, available processors are normalization-processor and score-ranker-processor, if not defined, normalization-processor will be chosen
Collaborator


Type of processor for hybrid search. Available...

Comment on lines +187 to +188
* `secret_key`: AWS credential secret key
* `session_token`: AWS credential session token
Collaborator


Consider leaving "AWS" out.


### Gotchas
1. The above benchmark is running against a cluster that has two data nodes (See Docker compose [detail](https://docs.opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/#sample-docker-composeyml)).
If your cluster only have one data node, test procedures may stuck in `check-cluster-health` step, in that case, you should add a `number_of_replicas` parameter with value `0`
Collaborator


only has a single data node, test procedures may get stuck in the check-cluster-health step. In that case

@@ -62,6 +83,9 @@ def partition(self, partition_index, total_partitions):
return self

class NeuralSparseQueryParamSource(QueryParamSource):
def get_dataset_name(self):
return 'quora'
Member


The params file already has a corpora_name config: https://github.com/opensearch-project/opensearch-benchmark-workloads/pull/624/files#diff-95a2eb01d714f95fd72432cbd3d615a09a7fe98582fc11aa91a54f2c38640c87R4

Can we read the value from there instead of hardcoding it?

return self._params.get('corpora_name', 'quora') # Default to quora if not found

Contributor Author


It doesn't work that way: although we have parameters defined in that file, it doesn't mean all of them are loaded in this search operation. If we wanted to do that, we would need to pass the parameter to all the search operations manually. I will keep it as is.

"default": false,
"schedule": [
{
"operation": "put-concurrent-segment-search-setting"
Member


I just ran the query-only workflow and got an error:

Running put-concurrent-segment-search-setting                                  [  0% done][ERROR] illegal_argument_exception ({'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'Failed to parse value [] as only [true] or [false] are allowed.'}], 'type': 'illegal_argument_exception', 'reason': 'Failed to parse value [] as only [true] or [false] are allowed.'}, 'status': 400})

I don't see the current params providing the concurrent_segment_search_enabled property; we might need to provide a default value at

"search.concurrent_segment_search.enabled": "{{concurrent_segment_search_enabled}}"

For example:

"search.concurrent_segment_search.enabled": "{{concurrent_segment_search_enabled | default('false')}}"

Contributor Author


I removed the default on purpose. If users want to set the concurrent_segment_search_enabled parameter, they need to specify explicitly whether they want the feature on or off.

Signed-off-by: Weijia Zhao <[email protected]>

count = self._params.get("variable-queries", 0)
if count > 0:
    script_dir = os.path.dirname(os.path.realpath(__file__))
Contributor


Can we extract this logic into a separate function to avoid duplicating the code that reads the query lines?

Contributor Author


We can probably do it in another PR

params = self._params
hybrid_queries = params['body']['query']['hybrid']['queries']

with open('model_id.json', 'r') as f:
Contributor


Can we extract this logic into a separate function to avoid duplicating the code that reads the model id?

Contributor Author


This is not duplicate code; the path to the model id within the query body differs between search methods.

"index_body": "indices/quora.json",
"corpora_name": "quora",
"ingest_percentage": 100,
"variable_queries": 0,
"variable_queries": 10000,
Contributor


Is this the number of queries we will run? In workload.py it seems we only check whether it's larger than 0, so is there a difference between using 1 and 10000?

Contributor Author


The number of queries we run is defined in the iterations parameter.

@@ -42,15 +54,24 @@ def __init__(self, workload, params, **kwargs):
self._params['variable-queries'] = params.get("variable-queries", 0)
self.infinite = True

self.dataset_name = self.get_dataset_name()
Contributor

@bzhangam bzhangam May 16, 2025


Can't we get the dataset_name from the params?

Contributor Author


Junqiu also asked the same question; check my reply.

@gkamat (Collaborator) left a comment


Please open a PR to move the data files to the standard corpus location.

@gkamat gkamat merged commit fd0e88d into opensearch-project:main May 16, 2025
2 checks passed
@weijia-aws (Contributor Author)

> Please open a PR to move the data files to the standard corpus location.

Thank you for merging the PR. Can you help upload the dataset to the standard location and provide me the link? I will open a PR to update it.

@gkamat gkamat added backport 2 Backport to the "2" branch backport 3 Backport to the "3" branch labels May 16, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 16, 2025
…search workload (#624)

* Add Hybrid and Semantic search test procedures

Signed-off-by: Weijia Zhao <[email protected]>

* Address comments

Signed-off-by: Weijia Zhao <[email protected]>

* Update readme

Signed-off-by: Weijia Zhao <[email protected]>

* Add more complex hybrid search queries

Signed-off-by: Weijia Zhao <[email protected]>

* Add benchmark workload for Multimodal search

Signed-off-by: Weijia Zhao <[email protected]>

* Add concurrent_segment_search support

Signed-off-by: Weijia Zhao <[email protected]>

* Make concurrent_segment_search_enabled configurable

Signed-off-by: Weijia Zhao <[email protected]>

* Address comments

Signed-off-by: Weijia Zhao <[email protected]>

---------

Signed-off-by: Weijia Zhao <[email protected]>
(cherry picked from commit fd0e88d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 16, 2025
…search workload (#624)

gkamat pushed a commit that referenced this pull request May 16, 2025
…search workload (#624) (#638)

gkamat pushed a commit that referenced this pull request May 16, 2025
…search workload (#624) (#639)

Labels: backport 2 (Backport to the "2" branch), backport 3 (Backport to the "3" branch)
Development

Successfully merging this pull request may close these issues.

[RFC] Neural Search Plugin Benchmark
6 participants