Add Hybrid, Semantic and Multimodal search test procedures in neural search workload #624


Merged
8 commits merged into opensearch-project:main on May 16, 2025

Conversation

@weijia-aws (Contributor) commented Apr 22, 2025

Description

  • This PR adds test procedures for hybrid search and semantic search with the neural query clause in the neural search workload
  • Fixes a bug when creating the sparse search ingestion pipeline: the attribute name in field_map should be text instead of passage_text, since the dataset in use only has a text field

In commit 5, I added another test procedure to support multimodal search

Issues Resolved

This will partially resolve #597; we will add more support (multimodal search) in future PRs

Testing

  • New functionality includes testing


Backport to Branches:

  • 6
  • 7
  • 1
  • 2
  • 3

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Weijia Zhao <[email protected]>
@martin-gaievski (Member) left a comment


What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that's because the number of replicas is > 0, and with one node that leads to an unstable yellow state, meaning the index health check will always fail.

@weijia-aws (Contributor Author)

> What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that's because the number of replicas is > 0, and with one node that leads to an unstable yellow state, meaning the index health check will always fail.

Here's the index setting: since we don't define the number_of_replicas parameter, the default value will be 1

@martin-gaievski (Member)

> What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that's because the number of replicas is > 0, and with one node that leads to an unstable yellow state, meaning the index health check will always fail.
>
> Here's the index setting: since we don't define the number_of_replicas parameter, the default value will be 1

You can change the default to 0 as part of your workload. If you want to keep the default of 1, then you need to do something about the check_cluster_health operation; it will go into an infinite cycle when there is only one data node.

@weijia-aws (Contributor Author)

> What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that's because the number of replicas is > 0, and with one node that leads to an unstable yellow state, meaning the index health check will always fail.
>
> Here's the index setting: since we don't define the number_of_replicas parameter, the default value will be 1
>
> You can change the default to 0 as part of your workload. If you want to keep the default of 1, then you need to do something about the check_cluster_health operation; it will go into an infinite cycle when there is only one data node.

I see. I haven't run into this issue since my cluster has two data nodes, but this setting should depend entirely on the number of data nodes in the cluster used to run the benchmark. I will mention it in the README file instead.
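The single-node failure mode described above can be sketched as a simple rule of thumb (illustrative only, not the actual OpenSearch allocation logic): an index stays yellow whenever its replica shards cannot all be allocated, which happens when `number_of_replicas` exceeds the number of data nodes minus one, because a replica never shares a node with its primary.

```python
# Simplified model of index health vs. replica count; ignores shard
# allocation limits, disk watermarks, and other real-world factors.
def expected_index_health(data_nodes: int, number_of_replicas: int) -> str:
    """Return 'green' if all replicas can be allocated, else 'yellow'."""
    return "green" if number_of_replicas <= data_nodes - 1 else "yellow"

# With the default number_of_replicas = 1, a single-node cluster stays
# yellow and the check-cluster-health operation never succeeds:
print(expected_index_health(1, 1))  # yellow
print(expected_index_health(2, 1))  # green
print(expected_index_health(1, 0))  # green
```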

Signed-off-by: Weijia Zhao <[email protected]>
@martin-gaievski (Member)

Please add the following to quora.json where we have the mapping:

  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fielddata": true
      },

The addition is the "fielddata": true flag, which allows using text field content in aggregations. This is a workaround for this dataset, as it effectively has just two fields: text and the corresponding embeddings. Adding this flag should not affect the performance of non-aggregation operations.
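For illustration, the kind of request body this flag unlocks might look like the sketch below (a hypothetical terms aggregation on the text-typed `id` field; without `"fielddata": true` in the mapping, OpenSearch rejects terms aggregations on text fields):

```python
# Hypothetical aggregation request body for the workload's quora index;
# the field name and bucket size are illustrative.
agg_request = {
    "size": 0,  # skip hits, return only aggregation buckets
    "aggs": {
        "ids": {
            "terms": {
                "field": "id",  # text field made aggregatable via fielddata
                "size": 10,
            }
        }
    },
}
```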

@martin-gaievski (Member) left a comment


A few minor asks for each parameter file:

  • use multiple search clients; I think it's 8 for other workloads. The param is "search_clients": 8
  • we need to ingest all documents: "ingest_percentage": 100
  • increase k; a typical production query has k = 100
  • don't fix the throughput; set it to 0 so clients squeeze out the maximum possible throughput: "target_throughput": 0
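Taken together, a parameter file honoring these asks might contain something like the following fragment (values illustrative, merged with whatever other parameters the workload already defines):

```json
{
  "search_clients": 8,
  "ingest_percentage": 100,
  "k": 100,
  "target_throughput": 0
}
```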

@weijia-aws weijia-aws changed the title Add Hybrid and Semantic search test procedures in neural search workload Add Hybrid, Semantic and Multimodal search test procedures in neural search workload May 5, 2025
},
{
"name": "abo",
"base-url": "https://github.com/weijia-aws/neural-search/releases/download/abo",
Contributor Author


Need some help uploading this dataset to a public repo

Collaborator


Will do this after the PR is merged. Can be done as a separate PR.

@martin-gaievski (Member)

@weijia-aws I think this workload does not have a way to enable/disable concurrent segment search dynamically. Check out noaa_semantic_search; we have done it via a parameter (ref). Having it on or off can affect response times significantly.

@weijia-aws (Contributor Author)

> @weijia-aws I think this workload does not have a way to enable/disable concurrent segment search dynamically. Check out noaa_semantic_search; we have done it via a parameter (ref). Having it on or off can affect response times significantly.

Sure, will add it in the new revision

@martin-gaievski (Member) left a comment


Looks good to me, thanks @weijia-aws

@gkamat (Collaborator) left a comment


Some comments are suggestions for you to consider.

Comment on lines +22 to +23
"name": "{{method | default('hnsw')}}",
"engine": "{{engine | default('lucene')}}",
Collaborator


Suggest using more clearly named parameters, such as vector_engine or graph_algorithm.

Contributor Author


Good suggestion. Currently these parameter names are used when creating the model across the whole package, so I want to keep them the same; this could be an improvement for the package as a whole. Also, we provide the parameter descriptions in the README file, so it should not cause confusion.

},
{
"name": "abo",
"base-url": "https://github.com/weijia-aws/neural-search/releases/download/abo",
Collaborator


Please move to the standard repository after this PR is merged.

Comment on lines +6 to +28
{{ benchmark.collect(parts="../../common_operations/delete_index.json") }},
{
"operation": "delete-ingest-pipeline"
},
{
"operation": "delete-ml-model-sentence-transformer"
},
{
"operation": "put-cluster-settings"
},
{%- if concurrent_segment_search_enabled is defined %}
{
"operation": "put-concurrent-segment-search-setting"
},
{%- endif %}
{
"operation": "register-ml-model-sentence-transformer"
},
{
"operation": "deploy-ml-model"
},
{
"operation": "create-text-embedding-processor-ingest-pipeline"
Collaborator


Many of these sections that deal with deleting and creating the ingest pipelines, adding the concurrent search setting, and deploying the model seem repetitive, with perhaps the model type as a variable. Consider consolidating them into a common fragment that you can include. Again, it may not be possible, but just a suggestion.

Contributor Author


I tried that before, but there are a lot of different steps, mainly because the model and pipeline are different.

params = self._params
with open('model_id.json', 'r') as f:
    d = json.loads(f.read())
params['body']['query']['hybrid']['queries'][1]['neural']['passage_embedding']['model_id'] = d['model_id']
Collaborator


Would be more readable if you used an intermediate variable. Something like:

q = params['body']['query']['hybrid']['queries'][1]

Contributor Author


Updated

    d = json.loads(f.read())
params['body']['query']['hybrid']['queries'][2]['neural']['passage_embedding']['model_id'] = d['model_id']

def tokenize_query(query_text: str) -> List[str]:
Collaborator


Some details about how this function works (creating bigrams and using them for queries) would be helpful for the user if added as a description comment at the top.

Contributor Author


Updated
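For readers following along, here is a minimal sketch of a bigram-producing tokenizer of the kind discussed (the actual tokenize_query in workload.py may differ in details):

```python
from typing import List

def tokenize_query(query_text: str) -> List[str]:
    """Split a query into lowercase unigrams plus adjacent-word bigrams.

    Bigrams let match queries hit short phrases as well as single terms.
    """
    terms = query_text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(terms, terms[1:])]
    return terms + bigrams

print(tokenize_query("Neural search on OpenSearch"))
# ['neural', 'search', 'on', 'opensearch',
#  'neural search', 'search on', 'on opensearch']
```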

* `index_settings`: A list of index settings. Index settings defined elsewhere (e.g. `number_of_replicas`) need to be overridden explicitly.
* `ingest_percentage` (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested.
* `iterations` Number of test iterations of each search client executes.
* `k` (default: 10) Number of nearest neighbors are returned.
* `iterations`: Number of test iterations of each search client executes.
Collaborator


that each search client executes

* `refresh_interval` (default: "5s") Interval to refresh the index in seconds
* `rank_constant` (default: 60): A constant added to each document’s rank before calculating the reciprocal score. Only applicable to Hybrid search with score-ranker-processor enabled
* `refresh_interval` (default: "5s"): Interval to refresh the index in seconds
* `region`: AWS region
Collaborator


Consider leaving 'AWS' out if generic.

* `requests_cache_enabled` (default: false): Enables or disables the index request cache
* `search_clients`: Number of clients that issue search requests.
* `search_pipeline_processor`: Types of processors for hybrid search, available processors are normalization-processor and score-ranker-processor, if not defined, normalization-processor will be chosen
Collaborator


Type of processor for hybrid search. Available...

Comment on lines +187 to +188
* `secret_key`: AWS credential secret key
* `session_token`: AWS credential session token
Collaborator


Consider leaving "AWS" out.


### Gotchas
1. The above benchmark is running against a cluster that has two data nodes (See Docker compose [detail](https://docs.opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/#sample-docker-composeyml)).
If your cluster only have one data node, test procedures may stuck in `check-cluster-health` step, in that case, you should add a `number_of_replicas` parameter with value `0`
Collaborator


only has a single data node, test procedures may get stuck in the check-cluster-health step. In that case

@@ -62,6 +83,9 @@ def partition(self, partition_index, total_partitions):
return self

class NeuralSparseQueryParamSource(QueryParamSource):
def get_dataset_name(self):
return 'quora'
Member


The params file already has a corpora_name config: https://github.com/opensearch-project/opensearch-benchmark-workloads/pull/624/files#diff-95a2eb01d714f95fd72432cbd3d615a09a7fe98582fc11aa91a54f2c38640c87R4

Can we read the value from there instead of hardcoding it?

return self._params.get('corpora_name', 'quora') # Default to quora if not found

Contributor Author


It doesn't work that way: although we have parameters defined in that file, it doesn't mean all of them are loaded in this search operation. If we wanted to do that, we would need to pass the parameter to all the search operations manually. I will keep it as is.

"default": false,
"schedule": [
{
"operation": "put-concurrent-segment-search-setting"
Member


I just ran the query-only workflow and got an error:

Running put-concurrent-segment-search-setting                                  [  0% done][ERROR] illegal_argument_exception ({'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'Failed to parse value [] as only [true] or [false] are allowed.'}], 'type': 'illegal_argument_exception', 'reason': 'Failed to parse value [] as only [true] or [false] are allowed.'}, 'status': 400})

I don't see the current params providing the concurrent_segment_search_enabled property; we might need to provide a default value at

"search.concurrent_segment_search.enabled": "{{concurrent_segment_search_enabled}}"

For example:

"search.concurrent_segment_search.enabled": "{{concurrent_segment_search_enabled | default('false')}}"

Contributor Author


I removed the default on purpose. If users want to set the concurrent_segment_search_enabled parameter, they need to specify explicitly whether they want the feature on or off.

Signed-off-by: Weijia Zhao <[email protected]>

count = self._params.get("variable-queries", 0)
if count > 0:
    script_dir = os.path.dirname(os.path.realpath(__file__))
Contributor


Can we extract this logic into a separate function to avoid duplicating the code that reads the query lines?

Contributor Author


We can probably do it in another PR

params = self._params
hybrid_queries = params['body']['query']['hybrid']['queries']

with open('model_id.json', 'r') as f:
Contributor


Can we extract this logic into a separate function to avoid duplicating the code that reads the model id?

Contributor Author


This is not duplicate code; the path to the model id within the query body differs between search methods.

"index_body": "indices/quora.json",
"corpora_name": "quora",
"ingest_percentage": 100,
"variable_queries": 0,
"variable_queries": 10000,
Contributor


Is this the number of queries we will run? In workload.py it seems we only check whether it's larger than 0, so is there a difference between using 1 and 10000?

Contributor Author


The number of queries we run is defined in the iterations parameter.

@@ -42,15 +54,24 @@ def __init__(self, workload, params, **kwargs):
self._params['variable-queries'] = params.get("variable-queries", 0)
self.infinite = True

self.dataset_name = self.get_dataset_name()
Contributor

@bzhangam bzhangam May 16, 2025


Can't we get the dataset_name from the params?

Contributor Author


Junqiu also asked the same question; check my reply.

@gkamat (Collaborator) left a comment


Please open a PR to move the data files to the standard corpus location.

@gkamat gkamat merged commit fd0e88d into opensearch-project:main May 16, 2025
2 checks passed
@weijia-aws (Contributor Author)

> Please open a PR to move the data files to the standard corpus location.

Thank you for merging the PR. Can you help upload the dataset to the standard location and provide me the link? I will open a PR to update it.

@gkamat gkamat added backport 2 Backport to the "2" branch backport 3 Backport to the "3" branch labels May 16, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 16, 2025
…search workload (#624)

* Add Hybrid and Semantic search test procedures

Signed-off-by: Weijia Zhao <[email protected]>

* Address comments

Signed-off-by: Weijia Zhao <[email protected]>

* Update readme

Signed-off-by: Weijia Zhao <[email protected]>

* Add more complex hybrid search queries

Signed-off-by: Weijia Zhao <[email protected]>

* Add benchmark workload for Multimodal search

Signed-off-by: Weijia Zhao <[email protected]>

* Add concurrent_segment_search support

Signed-off-by: Weijia Zhao <[email protected]>

* Make concurrent_segment_search_enabled configurable

Signed-off-by: Weijia Zhao <[email protected]>

* Address comments

Signed-off-by: Weijia Zhao <[email protected]>

---------

Signed-off-by: Weijia Zhao <[email protected]>
(cherry picked from commit fd0e88d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 16, 2025
…search workload (#624)

gkamat pushed a commit that referenced this pull request May 16, 2025
…search workload (#624) (#638)

gkamat pushed a commit that referenced this pull request May 16, 2025
…search workload (#624) (#639)

Labels: backport 2 (Backport to the "2" branch), backport 3 (Backport to the "3" branch)
Development

Successfully merging this pull request may close these issues.

[RFC] Neural Search Plugin Benchmark
6 participants