
[RFC] Adding Clustered Parent Child Index Support to OS #18608

@Karthiikr

Description


Is your feature request related to a problem? Please describe

In Lucene, one common approach to handle relational data is to denormalize data at the application level and index the flattened documents into Lucene. This approach works well for many cases, but for use cases with many child documents per parent, it becomes inefficient. Denormalization leads to heavy duplication since the fields from parent documents are indexed multiple times, once for each child document.

Over time, Lucene and OpenSearch have introduced two methods to handle joins between documents with hierarchical (parent/child) relationships:

Index Time Join (BlockJoin)

Lucene's block join requires documents belonging to the same parent to be indexed together, starting with the child documents and ending with the parent document. The IndexWriter API adds these documents atomically, ensuring they are placed into a single segment as adjacent documents. The block join query depends on this specific ordering to map hits from child documents to their respective parent documents.
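For illustration, a minimal sketch of indexing such a block through Lucene's IndexWriter (field names such as type and store_uuid are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// Index two child "item" documents followed by their parent "store" document.
// addDocuments() writes the block atomically, so the documents occupy
// adjacent doc IDs within a single segment, as block join requires.
static void indexBlock(IndexWriter writer) throws IOException {
  List<Document> block = new ArrayList<>();
  for (String itemId : new String[] {"item-1", "item-2"}) {
    Document child = new Document();
    child.add(new StringField("type", "item", Field.Store.NO));
    child.add(new StringField("item_id", itemId, Field.Store.YES));
    block.add(child);
  }
  Document parent = new Document(); // the parent must come last in the block
  parent.add(new StringField("type", "store", Field.Store.NO));
  parent.add(new StringField("store_uuid", "store-42", Field.Store.YES));
  block.add(parent);
  writer.addDocuments(block);
}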

However, a key limitation is that block join requires reindexing the entire block (both the parent and all of its children) if any change is made to any document in it. Every modification therefore forces the whole block to be reindexed together, which slows both reads and writes when the block update rate is high.

Query Time Join:

With query-time join, hierarchical entities (parent/child documents) are stored as separate Lucene documents, with each child document containing a pointer to its parent. Updates to the parent document do not require reindexing the child document(s), and vice versa. The join is performed at query time in a two-pass search: the first pass retrieves the parent documents, and the second pass finds all child documents that match the parent terms collected in the first pass. This approach is inefficient in most cases because it over-retrieves documents that are not needed, causing high retrieval latency.
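For reference, Lucene's JoinUtil implements this two-pass join. A minimal sketch, assuming children carry the parent key in a store_uuid doc-values field (field names and the filter are illustrative):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

// First pass: collect store_uuid values from parents matching parentFilter.
// Second pass (the returned query): match children whose store_uuid pointer
// equals one of the collected values.
static Query itemsOfMatchingStores(IndexSearcher searcher) throws IOException {
  Query parentFilter = new TermQuery(new Term("city", "seattle"));
  return JoinUtil.createJoinQuery(
      "store_uuid", // fromField: the parent's own UUID
      false,        // each parent document has a single join value
      "store_uuid", // toField: the child's pointer to its parent
      parentFilter,
      searcher,
      ScoreMode.None);
}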

Describe the solution you'd like

We propose a hybrid approach that combines the advantages of both query-time join and block join techniques to support efficient relational joins in OpenSearch. Like the query-time join model, this approach allows parent and child documents to be independently indexed or updated. At the same time, it leverages block join principles by physically clustering parent and child documents together through a sort-based layout during commit.

To ensure that parent and child documents reside within the same segment, the index is force-merged into a single segment at commit time. We refer to this method as Clustered Parent-Child Indexing. This approach offers the best of both worlds: the indexing flexibility and efficiency of the traditional parent-child model, and the query-time performance benefits of block-join-based nested queries.

Enabling native support for Clustered Parent-Child Indexing in OpenSearch requires changes across three core areas: indexing, query execution, and the deployment model. This RFC outlines the required modifications and implementation plan in detail.

Indexing

Clustered Parent Child Index Mode

To support Clustered Parent-Child Indexing, the design requires several enhancements to the ingestion flow. To encapsulate these changes and provide a clean abstraction, we propose introducing a new index setting field:

{
  "settings": {
    "clustered_parent_child": {}
  }
}

This field serves two key purposes:

  • Declares the index as a clustered parent-child index, enabling the system to apply special handling during indexing and query execution.
  • Defines the relationship between parent and child document types, including the join field name and role mapping.

An example configuration might look like:

{
  "settings": {
    "clustered_parent_child": {
      "relations": {
        "parent": "store",
        "child": "item",
        "relation_key": "store_uuid"
      }
    }
  }
}

Alternatively, the existing join field type can be used to define the relations.

When the clustered_parent_child setting is enabled, OpenSearch will validate and enforce clustering constraints, sort documents appropriately (e.g., child-first ordering within parent groups), and ensure that related documents are committed into the same segment using a force-merge.

Index Layout

A central aspect of the clustered parent-child index design is the co-location of parent documents with their corresponding child documents. Specifically, both parent and child documents reside in the same Lucene segment, with all child documents grouped together and placed immediately before their parent document, similar to the block join indexing layout.

Routing

To ensure that a parent and all its children are routed to the same shard, the parent document's UUID (i.e., its primary key) is used as the sharding/routing key in the indexing requests for both the parent and its child documents.
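For example, a sketch using the Java client's IndexRequest (index name, IDs, and fields are hypothetical):

import java.util.Map;
import org.opensearch.action.index.IndexRequest;

// Both the parent and its children use the parent's UUID as the routing key,
// so all related documents land on the same shard.
static IndexRequest[] buildRequests() {
  IndexRequest parent = new IndexRequest("stores")
      .id("store-42")
      .routing("store-42") // the parent routes by its own UUID
      .source(Map.of("store_uuid", "store-42", "name", "Downtown"));
  IndexRequest child = new IndexRequest("stores")
      .id("item-7")
      .routing("store-42") // the child routes by its parent's UUID
      .source(Map.of("store_uuid", "store-42", "sku", "A-100"));
  return new IndexRequest[] {parent, child};
}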

Single Segment Merge Policy

To enforce this layout constraint, the index is force-merged into a single segment after each commit. This is achieved through a custom merge policy, inspired by Lucene's merge-on-commit strategy, but extended to always consolidate all segments into one. This guarantees that the parent-child document ordering remains intact across index updates.
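A sketch of what such a policy could look like, built on Lucene's findFullFlushMerges merge-on-commit hook (the class name is hypothetical, and the writer would additionally need IndexWriterConfig's maxFullFlushMergeWaitMillis set so that commits wait for the merge to complete):

import java.io.IOException;
import java.util.ArrayList;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;

// On every commit, schedule one merge that consolidates all live segments
// into a single segment, preserving the sorted parent-child layout.
public class SingleSegmentOnCommitMergePolicy extends FilterMergePolicy {

  public SingleSegmentOnCommitMergePolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findFullFlushMerges(
      MergeTrigger trigger, SegmentInfos infos, MergeContext ctx) throws IOException {
    if (infos.size() <= 1) {
      return null; // already a single segment, nothing to consolidate
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(new ArrayList<>(infos.asList())));
    return spec;
  }
}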

Index Sorting

To enforce this layout constraint, documents in the index must be sorted using a specific sort key. The sort key is composed of the parent UUID, relation type, and child UUID. The relation type is encoded such that child documents are ordered before their corresponding parent document.

Specifically, the sort key follows the format: <<Parent_UUID>><<Relation_Type>><<Child_UUID>>, where Relation_Type is set to '0' for child documents and '1' for the parent. This encoding ensures that, during sorting, all child documents for a given parent appear before the parent document itself.
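A sketch of the key construction, assuming fixed-length parent UUIDs so that plain lexicographic comparison keeps each parent group contiguous:

// '0' sorts before '1', so all children of a parent precede the parent.
// Assumes parent UUIDs are fixed-length; variable-length keys would need a
// separator or length-prefixed encoding to keep groups intact.
static String sortKey(String parentUuid, boolean isParent, String childUuid) {
  return parentUuid + (isParent ? '1' : '0') + (isParent ? "" : childUuid);
}

For example, sortKey("store-42", false, "item-7") yields "store-420item-7", which sorts before the parent's own key "store-421".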

Out-of-Order Handling

To maintain index integrity and ensure documents are sorted as expected for queries, we need to handle edge cases like out-of-order document arrival—where child documents may arrive before their parent. At commit time, this can lead to orphan child documents.

To solve this, we use an in-memory map to track parent primary keys and child foreign keys as they arrive. If a child appears without its parent, we temporarily create a dummy parent as a placeholder. When the actual parent arrives later, it replaces the dummy document.

This in-memory map is essential for handling newly ingested documents that haven't been committed yet and are not visible in queries. On node restart, the map is rebuilt by reading all parent UUID terms from the index.
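A sketch of this bookkeeping; indexDummyParent and replaceDummyParent are hypothetical stand-ins for the actual ingestion hooks:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks parent arrival: TRUE = real parent indexed, FALSE = only a dummy
// placeholder exists because a child arrived first.
public class ParentArrivalTracker {
  private final Map<String, Boolean> parentSeen = new ConcurrentHashMap<>();

  void onParent(String parentUuid) {
    Boolean prev = parentSeen.put(parentUuid, Boolean.TRUE);
    if (Boolean.FALSE.equals(prev)) {
      replaceDummyParent(parentUuid); // real parent replaces the placeholder
    }
  }

  void onChild(String parentUuid) {
    if (parentSeen.putIfAbsent(parentUuid, Boolean.FALSE) == null) {
      indexDummyParent(parentUuid); // no parent yet: index a placeholder
    }
  }

  private void indexDummyParent(String parentUuid) { /* hypothetical hook */ }
  private void replaceDummyParent(String parentUuid) { /* hypothetical hook */ }
}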

Query

On the query side, we require an operator similar to ToChildBlockJoinQuery. While ToChildBlockJoinQuery can technically be used, it currently accepts only a parent filter and retrieves all child documents for the matching parents. However, there are use cases that also require applying a filter on child documents and limiting the number of children returned per parent.
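For context, today's usage looks roughly like this (field names are illustrative), which shows why a child filter and per-parent limit cannot be expressed:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ToChildBlockJoinQuery;

// Returns ALL children of parents matching parentQuery: there is no way to
// filter the children themselves or cap the number returned per parent.
static Query allChildrenOfMatchingParents() {
  BitSetProducer parentsFilter =
      new QueryBitSetProducer(new TermQuery(new Term("relation_type", "parent")));
  Query parentQuery = new TermQuery(new Term("city", "seattle"));
  return new ToChildBlockJoinQuery(parentQuery, parentsFilter);
}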

Supporting such per-parent child constraints directly at the retrieval layer can significantly improve query efficiency by allowing the iterator to skip to the next parent once the required number of child documents have been retrieved.

To address this, a new Lucene issue has been filed, and a pull request is underway to introduce a new ParentChildQuery that supports both parent and child filtering, along with per-parent limits. We plan to leverage this new query operator in OpenSearch when the index is a clustered parent_child index.

Deployment Model

Segment Replication

Using the clustered parent-child index mode also requires segment-based replication. This ensures updates are made visible to searchers only after documents are re-ordered, so that the parent-child index layout is maintained.

Reader Writer Separation

Since merging updates into a single sorted segment can be computationally expensive, we propose a read/write separation model, as illustrated in the diagram below. In this approach, after each commit, the ingester force-merges the segments into a single sorted segment and uploads it to remote storage (e.g., HDFS or S3). The searcher then downloads the index from remote storage and serves queries in a read-only mode. The freshness of segments on the searcher node depends on the commit interval, the duration of merge operations, and the upload/download latency.

Reader/Writer As Side Car Containers

One of the trade-offs of using a clustered parent-child index is the significant data transfer between the ingester and searcher. In real-time update scenarios, it is common for at least a few documents to be updated per shard. Since updates are merged into a single sorted segment, each commit modifies the entire index, leading to large data transfers from the ingester to the searcher and requiring high bandwidth for remote storage. For large indexes, this can substantially increase remote storage costs.

To address this, we propose colocating the reader and writer as separate containers on the same node. This allows segments to be synchronized via the local disk, making replication both cost-efficient and reliable by eliminating the dependency on remote storage for segment transfer. Remote storage is then used only for segment backups.

Use Micro-Indexes to Scale Force-Merge

Index size significantly impacts both merge duration and upload/download latency. To maintain reasonable data freshness, it's important to limit the size of each index. However, having too many small segments can negatively affect query latency and throughput. Therefore, selecting an appropriate index size is a trade-off between freshness and query efficiency.

Experiments show that it takes approximately 10 minutes to commit and force-merge updates into a 6–7 GB segment. For large-scale use cases, this would require many shards to maintain reasonable performance and freshness.

There are two ways to address this:

Shard Sizing: Configure the number of shards such that each shard stays around 5-6 GB. This helps keep merge and transfer times low.

Two-Level Partitioning (Micro Indexing): Introduce a two-level partitioning where each index is partitioned into shards, and each shard is further divided into smaller partitions called slices (or sub-shards). Each slice functions as an independent Lucene index. This topology keeps each Lucene index smaller, enabling parallel force-merge operations. On the searcher side, slices can also be queried concurrently, significantly improving query latency.

Scaling force merge is one of the key challenges in implementing parent-child relationships in OpenSearch. To address this, having a large number of shards—potentially in the hundreds—can help scale parent-child indexing. As part of our cloud-native roadmap, we aim to enhance OpenSearch to better support large-scale indexing by enabling support for high shard counts.

However, introducing native support for sub-shards or slices would represent a significant architectural change. While this could unlock additional use cases through fine-grained data partitioning, it would also involve substantial development effort and warrants deeper design discussions.

Related component

Search:Query Capabilities

Describe alternatives you've considered

No response

Additional context

No response
