Add Hybrid, Semantic and Multimodal search test procedures in neural search workload #624
Conversation
Signed-off-by: Weijia Zhao <[email protected]>
What's the default number of replicas? I tried it on a 1-node cluster and the workload hung; the index was yellow. I assume that is because the number of replicas is > 0, and with one node that leads to a persistent yellow state, meaning the index health check will always fail.

Here's the index setting, since we don't define `number_of_replicas`:

You can change the default to 0 as part of your workload. If you want to keep the default as 1, then you need to do something with the check_cluster_health operation. It will go into an infinite cycle in case there is only one data node.

I see, I haven't run into this issue as my cluster has two data nodes. But this setting should depend entirely on the number of data nodes in the cluster used to run the benchmark. I will instead mention it in the README file.
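As the README gotcha later in this thread notes, on a single-node cluster this can be handled by passing a `number_of_replicas` workload parameter. A minimal sketch of a params-file entry (other keys elided):

```json
{
  "number_of_replicas": 0
}
```

The same value could presumably also be supplied on the command line via the benchmark tool's workload-params option rather than a params file.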
Signed-off-by: Weijia Zhao <[email protected]>
Please add the following to quora.json where we do have the mapping:
The addition is this
Signed-off-by: Weijia Zhao <[email protected]>
A few minor asks for each parameter file (all four are combined in the sketch below):
- use multiple search clients; I think it's 8 for other workloads. The param is `"search_clients": 8`
- we need to ingest all documents: `"ingest_percentage": 100`
- increase k; a typical production query has k = 100
- don't cap throughput; set it to 0 so clients squeeze out the maximum possible throughput: `"target_throughput": 0`
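Taken together, an illustrative params fragment with these settings might look like this (a sketch; the remaining keys come from the workload's existing parameter files):

```json
{
  "search_clients": 8,
  "ingest_percentage": 100,
  "k": 100,
  "target_throughput": 0
}
```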
Signed-off-by: Weijia Zhao <[email protected]>
```
},
{
  "name": "abo",
  "base-url": "https://github.com/weijia-aws/neural-search/releases/download/abo",
```
Need some help uploading this dataset to a public repo
Will do this after the PR is merged. Can be done as a separate PR.
@weijia-aws I think this workload does not have a way to enable/disable concurrent segment search dynamically. Check out noaa_semantic_search; we have done it via a parameter (ref). Having it on or off can affect the response times significantly.

Sure, will add it in the new revision
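For reference, in noaa_semantic_search this amounts to an operation that pushes a cluster setting driven by a workload parameter. A sketch of such an operation (the `operation-type` and surrounding keys are assumptions based on that workload's pattern; the setting line matches the one added later in this PR):

```json
{
  "name": "put-concurrent-segment-search-setting",
  "operation-type": "put-settings",
  "body": {
    "persistent": {
      "search.concurrent_segment_search.enabled": "{{ concurrent_segment_search_enabled }}"
    }
  }
}
```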
Signed-off-by: Weijia Zhao <[email protected]>
Looks good to me, thanks @weijia-aws
Some comments are suggestions for you to consider.
"name": "{{method | default('hnsw')}}", | ||
"engine": "{{engine | default('lucene')}}", |
Suggest using more clearly named parameters, such as `vector_engine` or `graph_algorithm`.
Good suggestion. Currently these parameter names are used when creating the model across the whole package, so I want to keep them the same; renaming them can be an improvement for the package later. Also, we provide the parameter descriptions in the README file, so it should not cause confusion.
```
},
{
  "name": "abo",
  "base-url": "https://github.com/weijia-aws/neural-search/releases/download/abo",
```
Please move to the standard repository after this PR is merged.
```
{{ benchmark.collect(parts="../../common_operations/delete_index.json") }},
{
  "operation": "delete-ingest-pipeline"
},
{
  "operation": "delete-ml-model-sentence-transformer"
},
{
  "operation": "put-cluster-settings"
},
{%- if concurrent_segment_search_enabled is defined %}
{
  "operation": "put-concurrent-segment-search-setting"
},
{%- endif %}
{
  "operation": "register-ml-model-sentence-transformer"
},
{
  "operation": "deploy-ml-model"
},
{
  "operation": "create-text-embedding-processor-ingest-pipeline"
```
Many of these sections that deal with deleting and creating the ingest pipelines, adding the concurrent search setting, and deploying the model seem repetitive, with perhaps the model type as a variable. Consider consolidating them into a common fragment that you can include. Again, it may not be possible, but just a suggestion.
I tried that before, but there are a lot of different steps, mainly because the model and pipeline are different.
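For illustration, the consolidation the reviewer has in mind might look like a shared fragment with the model and pipeline names as Jinja variables (fragment and variable names here are hypothetical):

```
{
  "operation": "delete-ingest-pipeline"
},
{
  "operation": "delete-ml-model-{{ model_name }}"
},
{
  "operation": "register-ml-model-{{ model_name }}"
},
{
  "operation": "deploy-ml-model"
},
{
  "operation": "create-{{ pipeline_type }}-ingest-pipeline"
}
```

Such a fragment could be pulled in the same way delete_index.json already is, via `{{ benchmark.collect(parts=...) }}`.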
neural_search/workload.py
```python
params = self._params
with open('model_id.json', 'r') as f:
    d = json.loads(f.read())
params['body']['query']['hybrid']['queries'][1]['neural']['passage_embedding']['model_id'] = d['model_id']
```
Would be more readable if you used an intermediate variable. Something like:

```python
q = params['body']['query']['hybrid']['queries'][1]
```
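Spelled out, the suggested refactor might read like this (a sketch; the surrounding code comes from the diff above, and the variable name is illustrative):

```python
import json

params = self._params
with open('model_id.json', 'r') as f:
    d = json.loads(f.read())

# Pull the nested neural clause into a named variable before mutating it.
neural_query = params['body']['query']['hybrid']['queries'][1]['neural']
neural_query['passage_embedding']['model_id'] = d['model_id']
```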
Updated
```python
    d = json.loads(f.read())
params['body']['query']['hybrid']['queries'][2]['neural']['passage_embedding']['model_id'] = d['model_id']

def tokenize_query(query_text: str) -> List[str]:
```
Some details on how this function works (creating bigrams and using them for queries) would be helpful for the user if added as a description comment at the top.
Updated
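For example, such a description comment might look like the following sketch. The bigram construction shown here is an assumption based on the reviewer's remark, not the workload's exact implementation:

```python
from typing import List

def tokenize_query(query_text: str) -> List[str]:
    """Tokenize a query string for use in generated search queries.

    Splits the text into lowercase whitespace-delimited tokens and then
    builds bigrams (pairs of adjacent tokens), so that queries can match
    on two-word phrases as well as single terms, e.g.
    "red running shoes" -> ["red running", "running shoes"].
    """
    tokens = query_text.lower().split()
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]
```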
neural_search/README.md
```
* `index_settings`: A list of index settings. Index settings defined elsewhere (e.g. `number_of_replicas`) need to be overridden explicitly.
* `ingest_percentage` (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested.
* `iterations` Number of test iterations of each search client executes.
* `k` (default: 10) Number of nearest neighbors are returned.
* `iterations`: Number of test iterations of each search client executes.
```
that each search client executes
```
* `refresh_interval` (default: "5s") Interval to refresh the index in seconds
* `rank_constant` (default: 60): A constant added to each document’s rank before calculating the reciprocal score. Only applicable to Hybrid search with score-ranker-processor enabled
* `refresh_interval` (default: "5s"): Interval to refresh the index in seconds
* `region`: AWS region
```
Consider leaving 'AWS' out if generic.
neural_search/README.md
```
* `requests_cache_enabled` (default: false): Enables or disables the index request cache
* `search_clients`: Number of clients that issue search requests.
* `search_pipeline_processor`: Types of processors for hybrid search, available processors are normalization-processor and score-ranker-processor, if not defined, normalization-processor will be chosen
```
Type of processor for hybrid search. Available...
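For context, the `normalization-processor` this parameter defaults to corresponds to a search pipeline body along these lines (a sketch following the OpenSearch hybrid search documentation; the technique values used by the workload may differ):

```json
{
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": { "technique": "arithmetic_mean" }
      }
    }
  ]
}
```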
```
* `secret_key`: AWS credential secret key
* `session_token`: AWS credential session token
```
Consider leaving "AWS" out.
neural_search/README.md
```
### Gotchas
1. The above benchmark is running against a cluster that has two data nodes (See Docker compose [detail](https://docs.opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/#sample-docker-composeyml)).
If your cluster only have one data node, test procedures may stuck in `check-cluster-health` step, in that case, you should add a `number_of_replicas` parameter with value `0`
```
only has a single data node, test procedures may get stuck in the `check-cluster-health` step. In that case
```python
@@ -62,6 +83,9 @@ def partition(self, partition_index, total_partitions):
        return self

class NeuralSparseQueryParamSource(QueryParamSource):
    def get_dataset_name(self):
        return 'quora'
```
The params file already has a `corpora_name` config (https://github.com/opensearch-project/opensearch-benchmark-workloads/pull/624/files#diff-95a2eb01d714f95fd72432cbd3d615a09a7fe98582fc11aa91a54f2c38640c87R4).
Can we read the value from there instead of hardcoding it?

```python
return self._params.get('corpora_name', 'quora')  # Default to quora if not found
```
It doesn't work that way. Although we have parameters defined in that file, it doesn't mean all of them will be loaded in this search operation. If we wanted to do that, we would need to pass the parameter into all the search operations manually. I will keep it as is.
"default": false, | ||
"schedule": [ | ||
{ | ||
"operation": "put-concurrent-segment-search-setting" |
I just ran the query-only workflow and got an error:

```
Running put-concurrent-segment-search-setting [ 0% done][ERROR] illegal_argument_exception ({'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'Failed to parse value [] as only [true] or [false] are allowed.'}], 'type': 'illegal_argument_exception', 'reason': 'Failed to parse value [] as only [true] or [false] are allowed.'}, 'status': 400})
```

I don't see the current params provide a `concurrent_segment_search_enabled` property; we might need to provide a default value at

```
"search.concurrent_segment_search.enabled": "{{concurrent_segment_search_enabled}}"
```

example:

```
"search.concurrent_segment_search.enabled": "{{concurrent_segment_search_enabled | default('false')}}"
```
I removed the default on purpose. If users want to set the `concurrent_segment_search_enabled` parameter, they need to specify explicitly whether they want to turn the feature on or off.
Signed-off-by: Weijia Zhao <[email protected]>
```python
        count = self._params.get("variable-queries", 0)
        if count > 0:
            script_dir = os.path.dirname(os.path.realpath(__file__))
```
Can we extract this logic into a separate function to avoid duplicating the code that reads the query lines?
We can probably do it in another PR
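If it is extracted later, the shared helper might look something like this sketch (the name and exact shape are hypothetical):

```python
import os

def read_queries(file_name, count):
    """Read up to `count` query lines from a file that sits next to
    workload.py, one query per line, stripping trailing whitespace."""
    script_dir = os.path.dirname(os.path.realpath(__file__))
    queries = []
    with open(os.path.join(script_dir, file_name), 'r') as f:
        for i, line in enumerate(f):
            if i >= count:
                break
            queries.append(line.strip())
    return queries
```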
```python
        params = self._params
        hybrid_queries = params['body']['query']['hybrid']['queries']

        with open('model_id.json', 'r') as f:
```
Can we extract this logic into a separate function to avoid duplicating the code that reads the model id?
This is not duplicated code; the way to get the model id path differs between the search methods.
"index_body": "indices/quora.json", | ||
"corpora_name": "quora", | ||
"ingest_percentage": 100, | ||
"variable_queries": 0, | ||
"variable_queries": 10000, |
Is this the number of queries we will run? In workload.py it seems like we only check whether it's larger than 0, so is there a difference between using 1 vs 10000?
The number of queries we will run is defined in the `iterations` parameter.
```python
@@ -42,15 +54,24 @@ def __init__(self, workload, params, **kwargs):
        self._params['variable-queries'] = params.get("variable-queries", 0)
        self.infinite = True

        self.dataset_name = self.get_dataset_name()
```
Can't we get the dataset_name from the params?
Junqiu also asked the same question, check my reply
Please open a PR to move the data files to the standard corpus location.
Thank you for merging the PR. Can you help upload the dataset to the standard location and provide me the link? I will open a PR to update it.
…search workload (#624)

* Add Hybrid and Semantic search test procedures
* Address comments
* Update readme
* Add more complex hybrid search queries
* Add benchmark workload for Multimodal search
* Add concurrent_segment_search support
* Make concurrent_segment_search_enabled configurable
* Address comments

Signed-off-by: Weijia Zhao <[email protected]>
(cherry picked from commit fd0e88d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…search workload (#624) (#638)

* Add Hybrid and Semantic search test procedures
* Address comments
* Update readme
* Add more complex hybrid search queries
* Add benchmark workload for Multimodal search
* Add concurrent_segment_search support
* Make concurrent_segment_search_enabled configurable
* Address comments

(cherry picked from commit fd0e88d)
Signed-off-by: Weijia Zhao <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…search workload (#624) (#639)

* Add Hybrid and Semantic search test procedures
* Address comments
* Update readme
* Add more complex hybrid search queries
* Add benchmark workload for Multimodal search
* Add concurrent_segment_search support
* Make concurrent_segment_search_enabled configurable
* Address comments

(cherry picked from commit fd0e88d)
Signed-off-by: Weijia Zhao <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
Used `text` instead of `passage_text` in `field_map`, as the dataset being used only has a `text` field.

In Commit 5, I added another test procedure for supporting multimodal search.
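For reference, the `field_map` change corresponds to an ingest pipeline along these lines (a sketch of the standard OpenSearch `text_embedding` processor; the model id placeholder is filled in at runtime):

```json
{
  "description": "Pipeline for generating passage embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model-id>",
        "field_map": {
          "text": "passage_embedding"
        }
      }
    }
  ]
}
```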
Issues Resolved
This will partially resolve #597; we will add more support (multimodal search) in future PRs.
Testing
[Describe how this change was tested]
Backport to Branches:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.