add stats for text embedding processors with flags #1332
Conversation
Force-pushed from f827460 to 7217183.
Could you update the PR description with the updated response of the stats API?
lgtm, minor nits
increment(stats, InfoStatName.TEXT_EMBEDDING_PROCESSORS);
Object skipExisting = processorConfig.get(TextEmbeddingProcessor.SKIP_EXISTING);
if (Objects.nonNull(skipExisting) && skipExisting.equals(Boolean.TRUE)) {
    increment(stats, InfoStatName.TEXT_EMBEDDING_SKIP_EXISTING_PROCESSORS);
Can you use this caster helper here? Or does this not work because it's a boolean primitive? https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/stats/info/InfoStatsManager.java#L180-L193
It directly reads the value as a boolean, not as a map, so I think it's cleaner this way.
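For readers following along, here is a minimal, self-contained sketch of the flag check under discussion. The stat names mirror the diff above, but the `increment` helper, the literal `"skip_existing"` key, and the class itself are illustrative stand-ins, not the plugin's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class SkipExistingFlagCheckSketch {
    // Stand-ins for the plugin's info stat names; only the two relevant ones are shown.
    enum InfoStatName { TEXT_EMBEDDING_PROCESSORS, TEXT_EMBEDDING_SKIP_EXISTING_PROCESSORS }

    // Stand-in for the real increment(stats, statName) helper in InfoStatsManager.
    static void increment(Map<InfoStatName, Long> stats, InfoStatName statName) {
        stats.merge(statName, 1L, Long::sum);
    }

    // Mirrors the flag check from the diff: skip_existing is a top-level boolean in the
    // processor config, so it is read directly rather than through the map-caster helper.
    static void countTextEmbeddingProcessor(Map<InfoStatName, Long> stats, Map<String, Object> processorConfig) {
        increment(stats, InfoStatName.TEXT_EMBEDDING_PROCESSORS);
        Object skipExisting = processorConfig.get("skip_existing"); // TextEmbeddingProcessor.SKIP_EXISTING in the real code
        if (Objects.nonNull(skipExisting) && skipExisting.equals(Boolean.TRUE)) {
            increment(stats, InfoStatName.TEXT_EMBEDDING_SKIP_EXISTING_PROCESSORS);
        }
    }

    public static void main(String[] args) {
        Map<InfoStatName, Long> stats = new HashMap<>();
        Map<String, Object> config = new HashMap<>();
        config.put("skip_existing", true);
        countTextEmbeddingProcessor(stats, config);
        System.out.println(stats); // both counters incremented once
    }
}
```

The point of the thread above is that because the value is a plain top-level boolean in the config, the map-caster helper in InfoStatsManager isn't needed here.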
@@ -72,6 +72,7 @@ public void doExecute(
    generateAndSetInference(ingestDocument, processMap, inferenceList, handler);
    return;
}
EventStatsManager.increment(EventStatName.TEXT_EMBEDDING_PROCESSOR_SKIP_EXISTING_EXECUTIONS);
Not related to this change, but I think I noticed a bug in my initial PR: we should be incrementing the stat even when we run batch execute, but currently we only increment during single execute, so it might not be counted when we run pipelines in batch. Do you think you could include that change in this PR as well? If not, I can try to catch it in the next one.
Sure, I can add it to this PR.
Force-pushed from 7217183 to 9188606.
Currently we increment the stat whenever we have an execution that has the option enabled. Should we also have a stat to record when we have a "cache hit" and successfully skip an inference, or do you think that would be redundant? cc: @heemin32
@q-andy It really comes down to what insight we want to gain from these stats: (1) how many executions have the option enabled, or (2) how often an inference is actually skipped. I think we should start with (1), which this PR addresses, and scope out how, and to what extent, we should support (2).
For cache_hit stats, we can wait until there is an ask from users of the feature.
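Purely to illustrate what option (2), a cache-hit stat, would mean if it were ever requested, here is a minimal, self-contained sketch; the counter, method, and field names below are all hypothetical and not part of this PR or the plugin:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class CacheHitStatSketch {
    // Hypothetical counter: no such stat exists in the plugin today.
    static final AtomicLong SKIP_EXISTING_CACHE_HITS = new AtomicLong();

    // Returns true when the stored text is unchanged, i.e. the existing embedding can be
    // reused and the inference call is skipped (the "cache hit" discussed above).
    static boolean trySkipInference(Map<String, Object> existingDoc, Map<String, Object> newDoc, String textField) {
        Object oldText = existingDoc.get(textField);
        Object newText = newDoc.get(textField);
        if (oldText != null && oldText.equals(newText)) {
            SKIP_EXISTING_CACHE_HITS.incrementAndGet();
            return true; // caller would copy the existing embedding instead of calling the model
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Object> existing = Map.of("text", "hello world");
        Map<String, Object> updated = Map.of("text", "hello world");
        System.out.println(trySkipInference(existing, updated, "text")); // true
        System.out.println(SKIP_EXISTING_CACHE_HITS.get());              // 1
    }
}
```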
I'm not sure about the current stat name. Also, do we really need a separate execution stat?
@heemin32 The naming convention follows the one for text chunking, which has different available algorithm options (link). As for the execution stat, it would provide stats for how many times the processor with the skip_existing option enabled is actually executed.
Signed-off-by: will-hwang <[email protected]>
@will-hwang You might need to rebase with the main branch to pass the CI.
Force-pushed from e9ac623 to 2e3afdd.
Regarding the stats APIs in the neural plugin: do we really need to track processor execution metrics? Wouldn't the number of processors alone be sufficient to measure adoption? My main concern is scalability. The stats APIs have limitations in that area, and once we add a metric, it's difficult to remove it later. That's why I'd prefer to keep the metrics as minimal as possible.
@heemin32 Detailed metrics would be useful too: we could see which search configuration is the most used and invest our efforts there to improve relevance or other aspects of search. If the number of metrics is critical for the infra team, then we can have only the number of processors; that is P0.
For event execution, we should add metrics whose data cannot be retrieved elsewhere. For example, the number of neural query executions, which is not available from cluster info.
Sounds reasonable to me. What do others think?
I agree that more data can be better, but we also have to consider that adding a metric isn't without cost. Also, there's a chance that a single user could generate a large number of calls to a specific processor, which might skew the data and not accurately reflect its true popularity.
@@ -109,6 +110,7 @@ public void doBatchExecute(List<String> inferenceList, Consumer<List<?>> handler
@Override
public void subBatchExecute(List<IngestDocumentWrapper> ingestDocumentWrappers, Consumer<List<IngestDocumentWrapper>> handler) {
    try {
        EventStatsManager.increment(EventStatName.TEXT_EMBEDDING_PROCESSOR_EXECUTIONS);
Should we count this by the number of docs that we are processing? I feel we may want to know how many docs are processed by the processor.
Or we may want to use another event name for the batch processing use case; otherwise it can be confusing if we rely on this event to tell how many docs we have processed.
I think we're more concerned with the number of executions than with how many docs are processed per execution, and that holds in general for other processors as well.
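To make the trade-off being discussed concrete, here is a minimal, self-contained sketch contrasting the two counting strategies; the counters are illustrative stand-ins that only mirror the shape of the real EventStatsManager increments:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ExecutionVsDocumentCountingSketch {
    // Stand-in counters; the real plugin routes increments through EventStatsManager.
    static final AtomicLong EXECUTIONS = new AtomicLong();
    static final AtomicLong DOCUMENTS = new AtomicLong();

    static void subBatchExecute(List<String> batch) {
        EXECUTIONS.incrementAndGet();      // option kept in the PR: one count per batch call
        DOCUMENTS.addAndGet(batch.size()); // alternative raised in review: one count per document
        // ... embedding generation for the batch would happen here ...
    }

    public static void main(String[] args) {
        subBatchExecute(List.of("doc1", "doc2", "doc3"));
        subBatchExecute(List.of("doc4"));
        // executions=2, documents=4; the two measures diverge as soon as a batch holds more than one doc
        System.out.println("executions=" + EXECUTIONS + ", documents=" + DOCUMENTS);
    }
}
```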
Had a chat with the infra team; the primary concern is the high memory consumption seen on large clusters when calling certain stats APIs. Based on this, for 3.1 it should be okay to add granular stats, and the caller side can filter them as needed if we run into scalability concerns.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #1332        +/-  ##
============================================
- Coverage     82.62%       0    -82.63%
============================================
  Files           149       0       -149
  Lines          7257       0      -7257
  Branches       1164       0      -1164
============================================
- Hits           5996       0      -5996
+ Misses          811       0       -811
+ Partials        450       0       -450

View full report in Codecov by Sentry.
Description
Enhance stats for the text embedding processor, including stats for the skip_existing option when enabled.
Updated Response
Check List
Commits are signed per the DCO using --signoff.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.