[RFC] Inner hits in hybrid query

# Introduction

This is a design document for supporting the inner_hits feature with hybrid queries. It discusses the challenges associated with the feature and then walks through the proposed solution.

# Problem Statement

In OpenSearch, when a user searches using [nested objects](https://opensearch.org/docs/latest/field-types/supported-field-types/nested/) or [parent-join](https://opensearch.org/docs/latest/field-types/supported-field-types/join/), the underlying hits (nested inner objects or child documents) are hidden by default. To retrieve those hidden inner hits, the search query accepts the inner_hits parameter. Many customers have requested support for this feature with hybrid query in this [GitHub issue](https://github.com/opensearch-project/neural-search/issues/718). Fulfilling this request improves the user experience by exposing the inner hits that caused a document to be part of the final search result.


# Requirements

## Functional requirements

* The system must allow inner_hits to be used inside nested, has_child or has_parent queries.
* The system must only return relevant inner documents that match the query condition.
* The system must support a custom name parameter for the inner hits.
* The system should allow sorting and pagination to be applied under the inner_hits clause.
* The system should support the explain feature with inner_hits.
* The system should show hybrid scores at the inner hits level.

## Out of Scope

* Support for the highlight feature within inner_hits.

# Current Architecture

This section describes the inner_hits workflow for traditional queries such as match, term, and bool. It also describes the current hybrid query workflow.

## What does inner_hits mean?

In OpenSearch, when you perform a search using [nested objects](https://opensearch.org/docs/latest/field-types/supported-field-types/nested/) or [parent-join](https://opensearch.org/docs/latest/field-types/supported-field-types/join/), the underlying hits (nested inner objects or child documents) are hidden by default. You can retrieve those inner hits by using the inner_hits parameter in the search query. For details, see the inner hits [documentation](https://opensearch.org/docs/latest/search-plugins/searching-data/inner-hits/).

### Component workflow

![Image](https://github.com/user-attachments/assets/c631e2b0-7088-4fb8-9f82-ebfce83328ab)

1. During the query phase, the NestedInnerHitContext is created. Based on the NestedInnerHitContext, the parent documents (matching results) are returned to the coordinator node.
2. The NestedInnerHitContext is created in NestedQueryBuilder or HasChildQueryBuilder, depending on the query type. Query builders that support the inner_hits feature override the following method from AbstractQueryBuilder:
```
@Override
public void extractInnerHitBuilders(Map<String, InnerHitContextBuilder> innerHits) 
```

3. During the fetch phase, the fetch sub-phase (InnerHitsPhase) retrieves all the inner_hits for each parent document, including their source, and calculates their relevancy scores.
4. The final search response will contain parent documents with their corresponding inner_hits. The relevancy score of inner_hits will be reflected in the parent document scores. Please find the sample inner_hits response in the appendix section below.

## Hybrid query current workflow

In the query phase of hybrid search, each shard retrieves results for the multiple subqueries, and the top documents from those individual subquery results form the shard result. After the query phase, the normalization processor normalizes and combines the scores of the subquery results from each shard and removes duplicate documents from the shard results. The final results from all shards are then sent to the fetch phase, which retrieves the source and forms the final search response.
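The normalize, combine, and de-duplicate steps described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `HybridScoreSketch`, `minMax`, and `combine` are hypothetical names, the combination is an unweighted arithmetic mean, a document missing from a subquery contributes a score of 0, and the handling of identical scores is a simplification. The real normalization processor may differ in all of these details.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in only; this is not code from the neural-search plugin.
final class HybridScoreSketch {

    // Min-max normalizes one subquery's raw scores (doc id -> score) into [0, 1].
    static Map<String, Double> minMax(Map<String, Double> rawScores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double score : rawScores.values()) {
            min = Math.min(min, score);
            max = Math.max(max, score);
        }
        Map<String, Double> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : rawScores.entrySet()) {
            // Assumption: if every score is identical, treat each one as 1.0.
            double value = (max == min) ? 1.0 : (entry.getValue() - min) / (max - min);
            normalized.put(entry.getKey(), value);
        }
        return normalized;
    }

    // Combines the normalized subquery results with an unweighted arithmetic mean.
    // Keying by doc id leaves exactly one entry per document, which removes duplicates.
    static Map<String, Double> combine(List<Map<String, Double>> subqueryResults) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (Map<String, Double> result : subqueryResults) {
            minMax(result).forEach((docId, score) -> sums.merge(docId, score, Double::sum));
        }
        Map<String, Double> combined = new LinkedHashMap<>();
        // A document absent from a subquery adds nothing to its sum, i.e. counts as 0.
        sums.forEach((docId, sum) -> combined.put(docId, sum / subqueryResults.size()));
        return combined;
    }
}
```

Calling `HybridScoreSketch.combine` with one map of raw scores per subquery returns a single score per document, even when the same document appears in several subquery results.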


# Challenges

The earlier section described the current inner_hits workflow for traditional queries and the hybrid query workflow. Because of how these workflows function, they raise a major challenge in enabling inner_hits with hybrid query.

## Traditional Queries with Inner Hits:

* When a user uses inner_hits with traditional queries, the parent documents are retrieved based on the inner hits context.
* The position of the parent document in the final search result order will be determined by its corresponding inner_hits relevancy.

## Hybrid Queries:

* When a user runs a hybrid query, the behavior differs from traditional queries. The final result ordering in the hybrid query result is determined by the normalization process.

## Key Difference:

In traditional queries, inner_hits relevancy controls the result ordering. In hybrid queries, normalization scores control the result ordering. Therefore, in the hybrid query context, we can conclude that inner_hits must logically relate to the normalized parent documents. 

After reviewing the key difference, the following questions arise. These questions will be answered in the later sections of this document.

1. Is it possible to reflect inner_hits relevancy with parent documents of the hybrid search response? 
2. How to show hybrid scores in inner_hits? 
3. Can hybrid query be sent under the nested clause/hasChild clause?
4. Does the proposed solution fulfill the customer requirements of getting inner_hits and hybrid scores both?

# Possible Solutions

## Solution 1 [Recommended]

This solution proposes to reuse the existing OpenSearch inner hits workflow that other queries already use. It creates an inner_hits context for each nested subquery within the hybrid clause that requires inner_hits extraction. The system then retrieves results from individual shards for each subquery. These distributed results are consolidated at the coordinator node, where the normalization process takes place. During the subsequent fetch phase, the system retrieves inner_hits for each parent document using the previously established inner_hits contexts, which ensures that only the relevant inner_hits are extracted.

### Pros

1. The customer requirement of retrieving the inner hits that caused a parent document to be part of the hybrid search result is fulfilled.
2. No new component needs to be designed to enable the inner hits functionality. Hybrid query can reuse the inner hits workflow present in OpenSearch core.
3. No extra work is needed to make sorting and pagination work with inner hits, because the current implementation in OpenSearch core takes care of it.
4. Relevant inner hits are retrieved only for those parent documents that are part of the final hybridized search result.

### Cons

1. Hybrid scores cannot be shown at the inner_hits level because inner_hits are retrieved during the fetch phase.
2. Customers will need to use the explain API to understand the relationship between the raw scores of inner_hits and the hybrid scores of parent documents.

## Solution 2

This solution proposes to create a new inner hits workflow for the hybrid query where inner_hits document IDs will be retrieved during the query phase for each subquery with their corresponding parent documents. Later, during the fetch phase, the source of all retrieved parent documents and inner hits will be added to the final search response.

### Pros

1. Customer requirement of displaying hybrid scores at the inner hits level would be satisfied.
2. The customer requirement of retrieving the internal hits that caused parent documents to be part of the hybrid search result is fulfilled.

### Cons

1. It will require building new components like inner hit docIdIterator to retrieve inner hits during the query phase.
2. The fetch phase will need to be modified to stop performing a light version of search on each parent document to retrieve its corresponding inner hits. Instead, it will only retrieve the source data from the parent documents and their inner hits.
3. Currently, inner hits are retrieved only for parent documents that are part of the final results. If inner hits are instead retrieved during the query phase, they will be retrieved for all search results of a query, even those that are not part of the final search result. This will degrade search performance.
4. It will require building new components to support sort, pagination, collapse, and highlight features to work with inner hits.
5. Major changes in existing search flow of OpenSearch core will be needed to support this solution.

## Solution Comparison

Solution 1 fulfills the customer requirement partially: it retrieves the relevant inner hits but does not show hybrid scores at the inner hit level. However, customers can close that gap by using the explain API with hybrid query to understand the relationship between raw scores and hybrid scores.

Solution 2 fulfills the customer requirement completely, but it requires behavioral and technical changes to the current inner hits workflow. It will degrade search performance and might also impact the already stable inner hits functionality for other queries.

# HLD

![Image](https://github.com/user-attachments/assets/61255483-3fd9-421e-99a5-808ebd6af5b1)

1. The search request lands on the coordinator node.
2. While parsing the source from the search request, the inner hits context will be created for corresponding subqueries present under the hybrid query clause.
3. The hybrid search workflow is executed where results will be retrieved from shards during the query phase based on the inner hits context. 
4. The normalization phase combines the results of the multiple subqueries by using normalization and score combination techniques.
5. The fetch phase retrieves the source of the parent documents that are part of the final search response.
6. For each parent document, inner hits are retrieved during the fetch phase by leveraging the InnerHits fetch sub-phase processor.

# LLD 

Core technical concepts to be used

1. Abstract class
2. Inheritance

![Image](https://github.com/user-attachments/assets/9423bb7f-71b6-4915-a57c-8d700d156796)

The design workflow to enable inner_hits is divided into the following four parts:

## Initializing InnerHitsContext

HybridQueryBuilder, the builder class for hybrid queries, extends AbstractQueryBuilder, which declares a method called extractInnerHitBuilders. We override this method in HybridQueryBuilder to set up the inner_hits context for each nested/child subquery. To set up inner_hits contexts, we leverage the abstract class InnerHitContextBuilder, which is the main class for creating an inner_hits context. This class is extended by two core classes: NestedInnerHitContextBuilder and ParentChildInnerHitContextBuilder. Depending on the subquery type in the search request (nested or parent-child), OpenSearch initializes one of these two classes. The InnerHitsContext object is stored in SearchContext.
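The delegation described above can be sketched with stand-in classes. This is a hedged illustration, not the plugin's implementation: every `Sketch*` class is hypothetical, a plain `String` stands in for the InnerHitContextBuilder value, and the naming rule (custom name, otherwise the nested path) and the duplicate-key rejection are modeled on how the existing nested query behaves as far as we understand it.

```java
import java.util.List;
import java.util.Map;

// Stand-in for a query builder; only the inner_hits hook is modeled.
abstract class SketchQueryBuilder {
    // Default behavior: a query without inner_hits registers nothing.
    void extractInnerHitBuilders(Map<String, String> innerHits) {}
}

// Stand-in for a subquery that does not support inner_hits, such as match.
final class SketchMatchQuery extends SketchQueryBuilder {}

// Stand-in for a nested subquery with an optional inner_hits definition.
final class SketchNestedQuery extends SketchQueryBuilder {
    private final String path;
    private final String innerHitsName; // null means "fall back to the path"
    private final boolean hasInnerHits;

    SketchNestedQuery(String path, String innerHitsName, boolean hasInnerHits) {
        this.path = path;
        this.innerHitsName = innerHitsName;
        this.hasInnerHits = hasInnerHits;
    }

    @Override
    void extractInnerHitBuilders(Map<String, String> innerHits) {
        if (!hasInnerHits) {
            return;
        }
        String name = innerHitsName != null ? innerHitsName : path;
        // Two inner_hits definitions may not share the same key.
        if (innerHits.containsKey(name)) {
            throw new IllegalArgumentException("[inner_hits] already contains an entry for key [" + name + "]");
        }
        innerHits.put(name, path);
    }
}

// Stand-in for the hybrid query: it owns no inner_hits itself and delegates to its subqueries.
final class SketchHybridQuery extends SketchQueryBuilder {
    private final List<SketchQueryBuilder> subqueries;

    SketchHybridQuery(List<SketchQueryBuilder> subqueries) {
        this.subqueries = List.copyOf(subqueries);
    }

    @Override
    void extractInnerHitBuilders(Map<String, String> innerHits) {
        for (SketchQueryBuilder subquery : subqueries) {
            subquery.extractInnerHitBuilders(innerHits);
        }
    }
}
```

In this sketch, two nested subqueries on the same path can both request inner_hits only if at least one of them supplies a distinct custom name. This corresponds to the failure scenario listed under Testing.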

## Creating InnerHitsSubContext from the InnerHitsContext

After the InnerHitsContext is initialized, OpenSearch creates the corresponding InnerHitsSubContext by leveraging the searchContext. InnerHitsSubContext is the object that holds the settings to apply while extracting inner_hits and determining parent document relevancy, such as the sort condition, explain criteria, fields to be highlighted, track_scores, fetch fields, and script fields. After the InnerHitsSubContext is created, the searchContext has all the information required to execute the search.

## Execute HybridSearch

OpenSearch will execute the hybrid search where each subquery will return the parent documents based on the InnerHitsContext. The search results of each subquery will be normalized and combined during the normalization process. After that, the combined results are sent to the fetch phase.

## InnerHitsFetchSubPhase

FetchPhase instantiates the InnerHitsFetchSubPhase processor to retrieve the inner_hits of each parent document present in the result received from the normalization process. The inner_hits are added to the search response and returned to the user.
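The following sketch illustrates the idea that inner_hits are gathered only for parents that survived normalization, and that they keep their raw scores. It is a simplified model under assumptions: `InnerHitsFetchSketch`, `InnerHit`, and `attach` are hypothetical names, inner hits are ordered by raw score, and `from`/`size` stand for the paging options of the inner_hits clause. The actual sub-phase runs a per-parent search rather than filtering an in-memory list.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative stand-in only; not the OpenSearch fetch sub-phase.
final class InnerHitsFetchSketch {

    // A nested object that matched a subquery, with its raw (non-normalized) score.
    static final class InnerHit {
        final String parentId;
        final int offset;
        final double rawScore;

        InnerHit(String parentId, int offset, double rawScore) {
            this.parentId = parentId;
            this.offset = offset;
            this.rawScore = rawScore;
        }
    }

    // For each parent in the final result, keeps that parent's inner hits,
    // orders them best first by raw score, and applies from/size paging.
    static Map<String, List<InnerHit>> attach(
            List<String> finalParentIds, List<InnerHit> matches, int from, int size) {
        Comparator<InnerHit> bestFirst =
            Comparator.comparingDouble((InnerHit hit) -> hit.rawScore).reversed();
        Map<String, List<InnerHit>> byParent = new LinkedHashMap<>();
        for (String parentId : finalParentIds) {
            List<InnerHit> page = matches.stream()
                .filter(hit -> hit.parentId.equals(parentId))
                .sorted(bestFirst)
                .skip(from)
                .limit(size)
                .collect(Collectors.toList());
            byParent.put(parentId, page);
        }
        return byParent;
    }
}
```

Parents that are not in `finalParentIds` never appear in the returned map, which mirrors the benefit of Solution 1: no inner hit work is done for documents outside the final result.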

Sample response of a hybrid query with inner_hits:
```
"hits": {
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index-test",
                "_id": "rzp6p48B52hrBAvjI679",
                "_score": 1.0, --> Hybrid score (normalized score)
                "_source": {...},
                "inner_hits": { --> inner_hits start
                    "user": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 1.540445, --> BM25 score (raw score)
                            "hits": [
                                {
                                    "_index": "index-test",
                                    "_id": "rzp6p48B52hrBAvjI679",
                                    "_nested": {
                                        "field": "user",
                                        "offset": 0
                                    },
                                    "_score": 1.540445,
                                    "_source": {
                                        "firstname": "john",
                                        "age": 1,
                                        "lastname": "black"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
}
```
1. The parent document has a hybrid score. This indicates that the document's relevancy is determined by combining the results of multiple subqueries.
2. The inner_hits reflect the raw scores of parent documents before normalization occurs. As we learned in the earlier section, parent documents are retrieved from the shard based on the innerHitContext. Therefore, the inner_hits present under each parent document are relevant because they are the primary reason for the corresponding parent document's retrieval as a subquery result.

# Answers to the questions raised earlier

## Is it possible to reflect inner_hits relevancy with parent documents of the hybrid search response? 

Yes, we can demonstrate the relevancy of the inner_hits to their parent documents. Inner_hits scores represent the raw scores of the parent documents before normalization occurs. They are the primary reason why a parent document becomes relevant for a subquery. This behavior can be validated by using the explain API.

## How to show hybrid scores in inner_hits? 

Hybrid scores cannot be displayed at the inner_hits level for several reasons. From a logical perspective, the hybrid query combines parent documents rather than their inner_hits. From a technical standpoint, inner_hits are retrieved during the fetch phase, which occurs after the normalization process has been completed. Since the normalization process cannot be executed again, it is not possible to calculate hybrid scores for inner_hits. Therefore, displaying hybrid scores at the inner_hits level is not feasible.

## Can hybrid query be sent under the nested clause?

No. A hybrid query cannot be nested under any other type of query clause. It must always be the top-level query clause.

## Does the proposed solution fulfill the customer requirements of getting inner_hits and hybrid scores both?

Most customers need information about the inner_hits that led to parent document retrieval. The proposed solution fulfills this requirement. However, some customers also want hybrid scores to be present at the inner_hits level. For these customers, we have logical reasoning to explain why OpenSearch cannot provide this functionality. If they need to prove the relevancy between inner_hits and parent documents, they can do it by using the explain API.

# Testing

* Unit testing
    * Test case for the extractInnerHitBuilders method in HybridQueryBuilder
* Integration testing
    * Success scenarios
        * inner_hits with nested query
        * inner_hits with parent-child query
        * inner_hits with sort condition
        * inner_hits with custom name
        * inner_hits with custom from and size values
    * Failure scenarios
        * Two nested subqueries on the same nested field cannot have two inner_hits definitions
        * TBD for exceptions

# Appendix

## Query Phase

In this phase, the query provided to OpenSearch is broadcast to a copy of every shard across the entire index. Once received, the query is executed locally. The result is a priority queue of matching, sorted documents for each shard. This priority queue is simply a sorted list of the top n matching documents, with "top" determined by relevance and n determined by the pagination parameters set by the user (or the defaults if the user sets none). Relevance in this case is a score of how well each document matches the query. The individual shards are responsible for the actual matching process as well as the scoring.
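A minimal sketch of the per-shard "top n by relevance" selection, assuming a bounded min-heap. `ShardTopDocsSketch`, `ScoredDoc`, and `topN` are hypothetical names used only for illustration; the real collectors in Lucene and OpenSearch are considerably more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative stand-in only; not the actual shard-level collector.
final class ShardTopDocsSketch {

    static final class ScoredDoc {
        final String id;
        final double score;

        ScoredDoc(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Returns the n highest-scoring documents, best first.
    static List<ScoredDoc> topN(List<ScoredDoc> matches, int n) {
        Comparator<ScoredDoc> byScore = Comparator.comparingDouble((ScoredDoc doc) -> doc.score);
        // Min-heap: the head is always the weakest document currently kept.
        PriorityQueue<ScoredDoc> heap = new PriorityQueue<>(byScore);
        for (ScoredDoc doc : matches) {
            heap.offer(doc);
            if (heap.size() > n) {
                heap.poll(); // evict the lowest score so at most n documents remain
            }
        }
        List<ScoredDoc> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        return best;
    }
}
```

Because the heap never holds more than n entries, the shard can scan all of its matches without keeping every one of them in memory.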

## Fetch Phase

Now that the query phase has identified the documents that satisfy the request, OpenSearch needs to retrieve the source of the matched documents. For the fetch phase, the coordinating node uses the globally sorted priority list generated in the query phase to build the GET requests needed for the query.

## Normalization Processor

The normalization processor is a phase results processor that runs between the query phase and the fetch phase. It calculates the normalized score of each subquery result by using normalization techniques such as min-max and RRF, and then combines the normalized scores by using score combination techniques such as arithmetic mean and harmonic mean.
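As a reference point, the textbook forms of min-max normalization and weighted arithmetic mean combination can be written as follows. The notation is an assumption introduced here for illustration: $s_q(d)$ is the raw score of document $d$ in subquery $q$, $\hat{s}_q(d)$ is its normalized score, $w_q$ is the weight of subquery $q$, and $S(d)$ is the combined hybrid score. The processor's implementation may treat edge cases, such as identical scores or documents missing from a subquery, differently.

```math
\hat{s}_q(d) = \frac{s_q(d) - \min_{d'} s_q(d')}{\max_{d'} s_q(d') - \min_{d'} s_q(d')},
\qquad
S(d) = \frac{\sum_{q} w_q \, \hat{s}_q(d)}{\sum_{q} w_q}
```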

## Top Docs

The TopDocs object contains the query results found in the shard as a list of ScoreDocs. It also contains totalHits, the count of hits found in that shard.

## Duplicates

When a document matches more than one subquery, it is present in the top docs of every subquery it matches. Such a document is referred to as a duplicate.

## Duplicate-free

After calculating the normalized scores, we remove the duplicates from the top docs of individual shards. Consequently, those shard results are duplicate-free.

## Sample inner_hits response
```
{
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6931471,
    "hits": [
      {
        "_index": "my-nlp-index",
        "_id": "1",    --> Parent document
        "_score": 0.6931471, --> BM25 Score (Max score found in inner hits)
        "_source": {...},
        "inner_hits": {
          "user": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 0.6931471,    --> BM25 Score
              "hits": [ --> inner_hits
                {
                  "_index": "my-nlp-index",
                  "_id": "1",
                  "_nested": {
                    "field": "user",
                    "offset": 0
                  },
                  "_score": 0.6931471,
                  "_source": {
                    "name": "John Doe",
                    "age": 28
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
```
