[BUG] [Hybrid Search] Non-deterministic NullPointerException when using hybrid search with a single shard #1415

Open
@ekeric13

Currently on version 2.19.X

What is the bug?

Null Pointer Exception here:
https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L324

This code path seems to be skipped when there are multiple shards: the code implies that with multiple shards fetchSearchResultOptional is empty and we return early:
https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L282-L283
https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L111-L114

I'm not exactly sure how the issue occurs, but you can tell that the docMap created from the hit documents:
https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L300-L308

doesn't contain the document that was scored:
https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L320-L322

I am new to hybrid search and have not run the code, but after reading it carefully, this is my guess at what is going on:

  1. We create unProcessedDocs from the topDocs:
    https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L359-L366
  2. We construct a map of docs, the key being from unProcessedDocs and the value being from the hit docs.
    https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L298C38-L308
  3. Similar to how we constructed unProcessedDocs, we go over the topDocs for each hit recorded. Importantly, we don't get the docs from docIds but from querySearchResult:
    https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L310-L324
  4. The unProcessedDocs list was created early on:
    https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L62-L65
  5. Before we call updateOriginalFetchResults, we call updateOriginalQueryResults, and in updateOriginalQueryResults we mutate the queryResults object with a different topDocs value:
    https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L97C9-L103
    https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L212
  6. So I think that in updateOriginalFetchResults, the topDocs used in the trimmedLengthOfSearchHits for loop are different from the unProcessedDocs ones we used to build the map:
    https://github.com/opensearch-project/neural-search/blob/2.19.2.0/src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java#L310C47-L311
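
To make the suspected failure mode in steps 1 to 6 concrete, here is a minimal standalone sketch. It is not the neural-search code: `StaleDocIdSketch`, `Hit`, and `rebuildHits` are hypothetical names, and the doc IDs are made up. It only shows how a map keyed by doc IDs captured before re-ranking can return null when it is queried with the re-ranked doc IDs. It does not by itself explain why the failure is intermittent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: every name here is hypothetical, not neural-search code.
final class StaleDocIdSketch {

    // Minimal stand-in for a fetched search hit.
    static final class Hit {
        final String source;
        float score;

        Hit(String source) {
            this.source = source;
        }
    }

    // The map keys come from doc IDs captured before re-ranking, paired
    // positionally with the fetched hits. The lookup then uses the doc IDs
    // of the re-ranked top docs.
    static List<Hit> rebuildHits(List<Integer> staleDocIds, List<Hit> fetchedHits, List<Integer> rerankedDocIds) {
        Map<Integer, Hit> docIdToHit = new HashMap<>();
        for (int i = 0; i < fetchedHits.size(); i++) {
            docIdToHit.put(staleDocIds.get(i), fetchedHits.get(i));
        }
        List<Hit> rebuilt = new ArrayList<>();
        for (int docId : rerankedDocIds) {
            Hit hit = docIdToHit.get(docId); // null if docId was never a key
            hit.score = 1.0f;                // throws NullPointerException when hit is null
            rebuilt.add(hit);
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        List<Hit> fetched = List.of(new Hit("doc A"), new Hit("doc B"));
        // Doc ID 2 appears after re-ranking but was never a key in the map.
        rebuildHits(List.of(0, 1), fetched, List.of(2, 0));
    }
}
```

Running `main` ends with a NullPointerException on the `hit.score` line, which is the same shape of failure as the one reported here, assuming the diagnosis is right.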

So maybe the fix is to not pass through unprocessedDocIds and instead update updateOriginalFetchResults to use the new re-ranked docs throughout the whole function? That assumes my diagnosis is correct. I am new to OpenSearch, so I am not exactly sure how it is supposed to work.
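
One possible reading of that suggestion, sketched with hypothetical names (`RerankedLookupSketch`, `Hit`, `ScoredDoc`; none of these are from neural-search): key the map by each fetched hit's own doc ID instead of a separately captured ID list, then walk the re-ranked docs and fail with a descriptive error if a hit is missing. Whether the real fetch hits expose a usable doc ID at that point, and whether failing is preferable to skipping, are assumptions here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical names, not the neural-search implementation.
final class RerankedLookupSketch {

    // Stand-in for a fetched hit that knows its own doc ID.
    static final class Hit {
        final int docId;
        float score;

        Hit(int docId) {
            this.docId = docId;
        }
    }

    // Stand-in for one entry of the re-ranked top docs.
    static final class ScoredDoc {
        final int docId;
        final float score;

        ScoredDoc(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    static List<Hit> rebuildHits(List<Hit> fetchedHits, List<ScoredDoc> reranked) {
        // Key by the hit's own ID, so no earlier-captured ID list is needed.
        Map<Integer, Hit> docIdToHit = new HashMap<>();
        for (Hit hit : fetchedHits) {
            docIdToHit.put(hit.docId, hit);
        }
        List<Hit> rebuilt = new ArrayList<>();
        for (ScoredDoc doc : reranked) {
            Hit hit = docIdToHit.get(doc.docId);
            if (hit == null) {
                // Surface the mismatch explicitly instead of a bare NPE.
                throw new IllegalStateException("no fetched hit for re-ranked doc " + doc.docId);
            }
            hit.score = doc.score;
            rebuilt.add(hit);
        }
        return rebuilt;
    }
}
```

With fetched hits for doc IDs 0 and 1 and re-ranked docs `(1, 0.9)` then `(0, 0.4)`, the rebuilt list comes back in re-ranked order with the new scores applied. A re-ranked doc ID with no fetched hit produces an IllegalStateException that names the missing ID.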

How can one reproduce the bug?

Create an instance with a single shard, add 3 documents, and run a hybrid search query that returns 2 documents. Sometimes it passes, but most of the time I get the NPE, which is why I describe it as non-deterministic.

What is the expected behavior?

Return the hit documents and not throw an NPE. We see this expected behavior when we use 3 shards.

What is your host/environment?

OpenSearch 2.19. I believe it is AWS OpenSearch. A teammate of mine owns the OpenSearch service, so I do not have all the details here.

Do you have any screenshots?

Image
