[RFC] Derived Source for Vectors

## Introduction

This RFC proposes removing knn_vector fields from the "_source" field without losing the OpenSearch functionality that "_source" enables. Here, "_source" refers to the per-document field in OpenSearch that stores the original source provided by the user as a StoredField in Lucene. See [SourceFieldMapper](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/mapper/SourceFieldMapper.java#L287) for more details.

This is a follow-up to https://github.com/opensearch-project/k-NN/pull/1571 and https://github.com/opensearch-project/k-NN/issues/1572.

## Problem

Currently, vectors for native indices are stored in 3 places by default:

1. _source stored field. Vectors are stored along with the rest of the JSON body of the document (i.e. .fdt)
2. Native library files — ANN structure and vectors are stored (i.e. .hnsw)
3. FlatVectorsFormat format — Basically doc values for vectors (i.e. .vec)

In an experiment with 10k 128-dimensional vectors, the size breakdown of these files was:

Component | Size
-- | --
Total index size | 24.3 MB
HNSW files | 5.91 MB
Doc values | 3.8 MB
Source | 14.6 MB

With BEST_COMPRESSION codec:

Component | Size
-- | --
Total index size | 18.3 MB
HNSW files | 5.91 MB
Doc values | 3.75 MB
Source | 8.64 MB

From this, we can see that 47%-60% of the storage goes to _source. For more details, see: https://github.com/opensearch-project/OpenSearch/issues/6356. Worse yet, for our disk-based feature with quantized vectors (vectors that are compressed in a lossy fashion), the native library files will be smaller than the FlatVectorsFormat file, so _source will take up an even larger percentage of the storage.
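As a quick check of the 47%-60% figure, the share of _source can be recomputed from the component sizes in the two tables above. This is only arithmetic on the reported numbers; the variable and function names are illustrative.

```python
# Sizes in MB, copied from the two tables above.
default_codec = {"hnsw": 5.91, "doc_values": 3.8, "source": 14.6}
best_compression = {"hnsw": 5.91, "doc_values": 3.75, "source": 8.64}


def source_share(sizes):
    """Fraction of the summed component sizes that is taken by _source."""
    return sizes["source"] / sum(sizes.values())


for name, sizes in [("default", default_codec), ("BEST_COMPRESSION", best_compression)]:
    # Prints roughly 60% for the default codec and 47% for BEST_COMPRESSION.
    print(f"{name}: _source is {source_share(sizes):.0%} of the index")
```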

For a typical user, **they should not need to get the source vector from OpenSearch**. Thus, storing the vectors in _source poses significant problems for users with minimal benefits:

1. Users have to pay to store data they do not really need or use. This issue gets even more pronounced for disk-based vector search, where memory is no longer the bottleneck. Users end up having to provision their clusters based on storage capacity.
2. Vectors in _source eat up serialization/deserialization bandwidth. Whenever the _source field needs to be serialized or deserialized (e.g. written to disk, shard migration, snapshot, etc.), a major portion of the bandwidth of this channel is consumed by the vectors in the _source themselves. This can affect many areas of a user's vector search workload, such as indexing throughput, search speed, page cache utilization, shard migration, etc. Again, this gets worse with disk-based vector search, where all resources are much more scarce.

Because of this, we generally recommend that users disable storing the vectors in the source. However, this has serious limitations:

1. They will not be able to reindex the data
2. The update and update by query APIs do not work
3. It requires understanding a lot of concepts, which leads to a poor out-of-the-box (OOB) experience

So, enter “derived_source”. We take inspiration from the “[derived fields](https://opensearch.org/docs/latest/field-types/supported-field-types/derived/)” feature of OpenSearch, which uses one format of data for another purpose on the fly. The idea is that the vectors are already available in the FlatVectorsFormat files (.vec). When we need to read the _source, we inject the vector fields into it from the FlatVectorsFormat file. The effect is that all OpenSearch functionality keeps working and we get a potential storage reduction of more than 50% for vectors.
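The idea can be illustrated with a small Python sketch. It is not the plugin's implementation: `strip_vectors`, `derive_source`, and the `flat_vectors` dictionary (a stand-in for the .vec file) are hypothetical names chosen for this example, and the document is modeled as a plain dictionary.

```python
import json


def strip_vectors(source, vector_fields):
    """Write side: drop the vector fields before _source is stored."""
    return {k: v for k, v in source.items() if k not in vector_fields}


def derive_source(stored_source, doc_vectors):
    """Read side: rebuild the full _source from the stored part plus the vectors."""
    derived = dict(stored_source)
    derived.update(doc_vectors)
    return derived


original = {"title": "doc", "my_vector1": [0.25] * 128}
stored = strip_vectors(original, {"my_vector1"})

# Stand-in for the vectors that already live in the FlatVectorsFormat file.
flat_vectors = {"my_vector1": original["my_vector1"]}
derived = derive_source(stored, flat_vectors)

stored_bytes = len(json.dumps(stored))
original_bytes = len(json.dumps(original))
print(f"stored _source: {stored_bytes} bytes instead of {original_bytes}")
```

The derived document equals the original one, while the stored _source no longer carries the vector.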

## Proposed Solutions

### [Option # 1]  (Preferred) Implement Custom StoredFieldsFormat in existing KNNCodec

Because the k-NN plugin already implements its own Codec, we can override the [StoredFieldsFormat](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/StoredFieldsFormat.java#L26) to intercept and inject the vector fields when needed. This format would use the delegate pattern (as the k-NN plugin already does with core codecs) and only intervene on reads and writes of the _source stored field (see [PoC](https://github.com/jmazanec15/k-NN-1/tree/derived-source-vectors)).
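A rough sketch of the delegate pattern described above, written in Python rather than against the Lucene API. `InMemoryStoredFields` and `DerivedSourceStoredFields` are hypothetical stand-ins, not classes from Lucene, OpenSearch, or the PoC; the point is only that every field except `_source` passes through to the delegate untouched.

```python
class InMemoryStoredFields:
    """Stand-in for the delegate format: stores field values per document."""

    def __init__(self):
        self._docs = {}

    def write(self, doc_id, field, value):
        self._docs.setdefault(doc_id, {})[field] = value

    def read(self, doc_id, field):
        return self._docs[doc_id][field]


class DerivedSourceStoredFields:
    """Wraps a delegate and only intervenes for the _source field."""

    def __init__(self, delegate, vector_fields, vector_reader):
        self._delegate = delegate
        self._vector_fields = set(vector_fields)
        self._vector_reader = vector_reader  # callable: (doc_id, field) -> vector or None

    def write(self, doc_id, field, value):
        if field == "_source":
            # Strip the vectors so they are not stored a second time.
            value = {k: v for k, v in value.items() if k not in self._vector_fields}
        self._delegate.write(doc_id, field, value)

    def read(self, doc_id, field):
        value = self._delegate.read(doc_id, field)
        if field != "_source":
            return value
        derived = dict(value)
        for name in self._vector_fields:
            vector = self._vector_reader(doc_id, name)
            if vector is not None:
                derived[name] = vector
        return derived


vectors = {(0, "my_vector1"): [0.5, 1.5]}  # stand-in for the vector reader's data
delegate = InMemoryStoredFields()
fields = DerivedSourceStoredFields(delegate, {"my_vector1"}, lambda d, f: vectors.get((d, f)))
fields.write(0, "_source", {"title": "doc", "my_vector1": [0.5, 1.5]})
fields.write(0, "other", "untouched")
```

Reading `_source` through the wrapper returns the vector again, while the delegate itself only ever sees the stripped document.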

**Pros**
1. Great out of box experience! User would not need to provide any special configuration in order to get this benefit. On search, they would still need to manually exclude the vector fields, but this is consistent with the existing OpenSearch behavior.
2. Robust feature support. Because we are modifying the _source at a very low level, we can be confident that features that require _source built on top of this would work without any issues. The _source injection would be totally transparent

**Cons**
1. Unable to access OpenSearch resources — To implement this option, we would extend our existing codec. The codec abstraction is at the Lucene level. With this, it is difficult to get some of the required OpenSearch dependencies we would need. For instance, for nested fields, in order to get the parent/child filters, we would need to either directly use the FieldsFormat/PostingFormat (as was done in the PoC) or somehow create a searcher. It is unclear exactly what limitations we will hit here
2. Coupling of different Format readers feels like an anti-pattern. Having the StoredFieldsReader rely on the KNNVectorsReader creates a dependency chain between the two. This opens the door to a circular dependency in the future (although no concrete situations come to mind).

For this option, we created a PoC to showcase feasibility. The [PoC](https://github.com/jmazanec15/k-NN-1/tree/derived-source-vectors) supports the following features:
1. [Flat vector mappings] Injecting vectors into source
2. [Flat vector mappings] Reindexing
3. [Flat vector mappings] Update by query
4. [Nested] Injecting vectors into source for single nested mapping without deletes
5. [Nested] Reindexing
6. [Nested] Update by query
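For the nested case, the parent/child relationship has to be resolved before vectors can be injected. The sketch below assumes the usual Lucene block layout, where the child documents of a nested field are stored contiguously right before their parent, and it assumes a single level of nesting with no deleted documents (matching the PoC scope above). It also assumes that the i-th nested object in the stored source corresponds to the i-th child document. All function and field names are hypothetical.

```python
import copy


def child_doc_ids(parent_doc_id, parent_doc_ids):
    """Return the doc IDs of the nested children stored just before a parent."""
    previous_parent = max((p for p in parent_doc_ids if p < parent_doc_id), default=-1)
    return list(range(previous_parent + 1, parent_doc_id))


def inject_nested_vectors(parent_source, parent_doc_id, parent_doc_ids, child_vectors, path, field):
    """Merge each child's vector into the matching nested object under `path`."""
    derived = copy.deepcopy(parent_source)
    children = child_doc_ids(parent_doc_id, parent_doc_ids)
    for nested_object, child_id in zip(derived.get(path, []), children):
        if child_id in child_vectors:
            nested_object[field] = child_vectors[child_id]
    return derived


# Docs 0-1 are children of parent 2; docs 3-4 are children of parent 5.
parent_doc_ids = [2, 5]
child_vectors = {0: [1.0], 1: [2.0], 4: [3.0]}  # child 3 has no vector
stored = {"title": "doc", "chunks": [{"text": "c"}, {"text": "d"}]}
derived = inject_nested_vectors(stored, 5, parent_doc_ids, child_vectors, "chunks", "vec")
print(derived)
```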

### [Option # 2] Introduce a dedicated FetchSubPhase to inject vector into source

As an alternative, as was done in https://github.com/opensearch-project/k-NN/issues/1572 by @luyuncheng, we can create a custom FetchSubPhase that prepares the payload with the injected source to return to the caller. Generally, this is where _source gets read (although that is not guaranteed).

The general workflow for users would be:

1. Create an index with the vector fields explicitly excluded from source
2. On search/get, the DerivedVectorSearchFetchSubPhase would intercept the SearchResponse (without the vectors) and add the excluded vector fields back into SearchResponse
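The workflow above can be modeled with a few lines of Python. This is a simplified stand-in, not the OpenSearch FetchSubPhase API: phases are plain functions applied to a hit in order, and `make_derived_vector_phase` and `core_phase` are hypothetical names. The recording phase shows what a phase that runs earlier would see.

```python
def run_fetch_phases(hit, phases):
    """Apply each phase to the hit sequentially, in list order."""
    for phase in phases:
        phase(hit)
    return hit


def make_derived_vector_phase(vector_lookup, vector_fields):
    """Build a phase that adds the excluded vector fields back into the hit's _source."""

    def phase(hit):
        for field in vector_fields:
            vector = vector_lookup(hit["_id"], field)
            if vector is not None:
                hit["_source"][field] = vector

    return phase


seen_by_core = []


def core_phase(hit):
    # Records which _source fields are visible when an earlier phase runs.
    seen_by_core.append(sorted(hit["_source"]))


vectors = {("1", "my_vector1"): [1.0, 2.0]}  # stand-in for the stored vectors
hit = {"_id": "1", "_source": {"title": "doc"}}
derived_phase = make_derived_vector_phase(lambda i, f: vectors.get((i, f)), ["my_vector1"])
result = run_fetch_phases(hit, [core_phase, derived_phase])
```

In this model the final hit contains the vector, but the phase that ran first only saw the `title` field.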

This approach has the following pros/cons:

**Pros**
1. Easy access to required OpenSearch resources — _source is an OpenSearch concept - Lucene just sees it as a stored field. Thus, most of the configuration details around it are stored in the OpenSearch layer (as opposed to Lucene) — e.g. MappedFieldTypes. Implementing at the FetchSubphase gives us access to these required resources. This also makes it easier to handle other OpenSearch specific cases (such as nested fields)

**Cons**
1. A FetchSubPhase from a plugin would execute after all core FetchSubPhases. Thus, the core FetchSubPhases would not have access to the vector source. I cannot think of any explicit use cases where they need it, but if a user comes up with one, this would be a hard limitation.
2. Non-deterministic ordering of plugin-based FetchSubPhases: OpenSearch executes FetchSubPhases sequentially and controls the ordering of the FetchSubPhases that plugins add. Thus, if another plugin adds a FetchSubPhase, it is not clear whether the vector source will be present for it to use.
3. The overall experience is inconsistent with the existing OpenSearch experience. A user would need to exclude the vector fields from source, but would still get them in the search response.

### [Option # 3] Implement Custom StoredFieldVisitor

The security plugin has a feature called “Field-level security” where admins can limit access for different users at the field level. This feature requires that they automatically filter or mask privileged fields from _source. This is similar to what we want to do for vectors! They do this by implementing a custom [StoredFieldVisitor](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/StoredFieldVisitor.java#L38), [FlsStoredFieldsVisitor](https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/configuration/DlsFlsFilterLeafReader.java#L640). The StoredFieldVisitor will be called in the [StoredFieldsReader](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/StoredFields.java#L77) for a given document and a given field. Thus, their visitor has the option to intercept the “_source” field and filter/mask the fields they want. They use the “[onIndexModule](https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/OpenSearchSecurityPlugin.java#L698)” extension point to inject this via a custom readerWrapper.

We could do something similar for vector derived source, where instead of filtering and masking, we inject the vector fields.
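As a rough illustration of that idea, the sketch below wraps a visitor so that the `_source` bytes are rewritten before the inner visitor sees them. It is a Python model, not Lucene code: `CollectingVisitor`, `VectorInjectingVisitor`, and the `binary_field` method are hypothetical stand-ins for a visitor that receives stored fields one at a time.

```python
import json


class CollectingVisitor:
    """Stand-in for the visitor a caller passes in; records each stored field it is shown."""

    def __init__(self):
        self.fields = {}

    def binary_field(self, name, value):
        self.fields[name] = value


class VectorInjectingVisitor:
    """Wraps another visitor and injects vectors into _source instead of masking fields."""

    def __init__(self, delegate, doc_vectors):
        self._delegate = delegate
        self._doc_vectors = doc_vectors  # field name -> vector for the current document

    def binary_field(self, name, value):
        if name == "_source":
            source = json.loads(value)
            source.update(self._doc_vectors)
            value = json.dumps(source).encode("utf-8")
        self._delegate.binary_field(name, value)


inner = CollectingVisitor()
visitor = VectorInjectingVisitor(inner, {"my_vector1": [1.0, 2.0]})
visitor.binary_field("_source", b'{"title": "doc"}')
visitor.binary_field("other", b"untouched")
injected = json.loads(inner.fields["_source"])
```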

**Pros**
1. Somewhat easy access to required OpenSearch resources — we have everything on OpenSearch side because extension point is onIndexModule
2. Closer than Option #2 to actual _source field retrieval, which means that more features will be supported out of the box

**Cons**
1. Incompatible with security plugin — [indexModule.setReaderWrapper](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/IndexModule.java#L443-L459) can only be called once. Thus, as it stands now, security and knn derived source would not work together.
2. Inconsistent user experience — A user will still need to exclude the vector fields from source, but still get them in the search response.

### Summary

We are proposing Option 1 because it provides a UX consistent with the existing OpenSearch UX and hooks in at a low enough level to be generally robust.

## Proposed User Experience

The user interface will have either a cluster setting or an index setting (shown in the example below) that indicates whether or not to use the derived_source feature.

```
PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.derived_source.enabled": true // or false; defaults to true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}

// On search, my_vector1 is excluded
POST my-knn-index-1/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}
```

## Open Questions

### Avoid reconstruction of vectors on searches that later filter it out

In the current PoC, if a search excludes a vector field from _source (as in the request below), the StoredFieldsReader still injects the vector into the document, and OpenSearch logic filters it out later. Instead, we need a way to skip reconstruction in the first place when the field is going to be excluded anyway. This is a bit tricky to do and may involve a change in core. One idea is to pass this information in the [FieldsVisitor](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/fieldvisitor/FieldsVisitor.java) and do some kind of type casting to get the information in the StoredFieldsReader component.

```
// On search, my_vector1 is excluded
POST some_index/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}
```
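One way to picture the FieldsVisitor idea is the Python sketch below. It is only a model under stated assumptions: `SourceAwareVisitor`, its `source_excludes` attribute, and `DerivedSourceReader` are hypothetical, the `getattr` check stands in for the type cast mentioned above, and excludes are matched by exact field name (no wildcard patterns).

```python
class SourceAwareVisitor:
    """Hypothetical visitor that carries the request's _source excludes."""

    def __init__(self, excludes=()):
        self.source_excludes = set(excludes)
        self.source = None


class DerivedSourceReader:
    """Rebuilds _source, skipping vector fields the visitor says are excluded."""

    def __init__(self, stored_sources, vector_fields, read_vector):
        self._stored_sources = stored_sources
        self._vector_fields = list(vector_fields)
        self._read_vector = read_vector  # callable: (doc_id, field) -> vector
        self.reconstructions = 0  # counts how many vectors were actually rebuilt

    def visit_document(self, doc_id, visitor):
        # Stand-in for the type cast: visitors without excludes get every vector.
        excludes = getattr(visitor, "source_excludes", set())
        source = dict(self._stored_sources[doc_id])
        for field in self._vector_fields:
            if field in excludes:
                continue  # skip the expensive reconstruction entirely
            source[field] = self._read_vector(doc_id, field)
            self.reconstructions += 1
        visitor.source = source


reader = DerivedSourceReader({0: {"title": "doc"}}, ["my_vector1"], lambda d, f: [1.0, 2.0])

excluding = SourceAwareVisitor(excludes=["my_vector1"])
reader.visit_document(0, excluding)
count_after_excluded_read = reader.reconstructions

plain = SourceAwareVisitor()
reader.visit_document(0, plain)
```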

## Next Steps

1. Publish high level design
2. Create a PoC/proposal in core for solving the redundant vector reconstruction issue
3. Publish low level design

