
[Feature] Enhancing Efficient Filters(in Faiss Engine) for Restrictive Filtering use case #1049

Closed
@navneet1v

Description


Introduction

This document proposes enhancements to Efficient Filters for the Faiss engine for the restrictive filtering use case. Restrictive filtering is the case where you have, say, 100K documents in the graph but the filter reduces the search space to around 1,000 documents or fewer.

META Issue for filtering Feature: #903

Background

While doing Efficient Filtering for a k-NN query, we follow the steps below for each segment of a shard in OpenSearch.

  1. Run a constant score query via Lucene and get the filtered DocIds.
  2. If the number of filtered DocIds <= K, we do exact search; otherwise we do ANN search. The reason is that to get the K nearest neighbors you need to do at least K distance calculations, so switching to exact search makes sense in this case.
  3. During exact search, we read the doc values (the byte representation of the vectors), convert them to float arrays, and use Lucene's distance calculation functions to compute the distance between the query vector and each indexed vector. We then return the top K docIds for that segment.
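The exact search in step 3 can be sketched in plain Java. This is a minimal illustration, not the plugin's actual code: the class and method names are hypothetical, and a hand-rolled squared Euclidean distance stands in for Lucene's distance functions.

```java
import java.util.*;

// Sketch of brute-force exact search over the filtered doc ids (illustrative
// names; not the actual k-NN plugin classes).
public class ExactSearchSketch {

    // Squared Euclidean distance, standing in for Lucene's distance functions.
    static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Score every filtered vector against the query and return the K closest doc ids.
    static int[] exactSearch(float[] query, Map<Integer, float[]> docValues,
                             int[] filteredDocIds, int k) {
        return Arrays.stream(filteredDocIds).boxed()
                .sorted(Comparator.comparingDouble(id -> squareDistance(query, docValues.get(id))))
                .limit(k)
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        Map<Integer, float[]> docValues = Map.of(
                1, new float[] {0f, 0f},
                2, new float[] {1f, 1f},
                3, new float[] {5f, 5f});
        int[] top2 = exactSearch(new float[] {0f, 0f}, docValues, new int[] {1, 2, 3}, 2);
        System.out.println(Arrays.toString(top2)); // the two closest doc ids
    }
}
```

Note that this is O(|filteredIds| * dims) per segment, which is why the trade-off against ANN search below matters.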

Problem

The problem with step #2 of the logic above is that if K is small, the filtered id set is large, and the total number of docs present in the segment is even larger, then recall drops. Example: say we ingest 1M documents into an index with 24 shards and then query with K=100; if the filter matches 2K ids but the segment contains ~41K documents, recall drops. We saw this in our experiments for the OpenSearch 2.9 release.

{
    "metadata": {
      "test_name": "Faiss HNSW Restrictive Filter Test",
      "test_id": "Faiss HNSW Restrictive Filter Test",
      "date": "07/20/2023 20:52:31",
      "python_version": "3.8.17 (default, Jul  5 2023, 21:04:15) \n[GCC 11.2.0]",
      "os_version": "Linux-6.1.34-59.116.amzn2023.x86_64-x86_64-with-glibc2.17",
      "processor": "x86_64, 36 cores",
      "memory": "920506368 (used) / 72086974464 (available) / 73716973568 (total)"
    },
    "results": {
      "test_took": 400951.101299801,
      "delete_index_took_total": 313.93581539959996,
      "create_index_took_total": 207.38219340200885,
      "ingest_multi_field_took_total": 60584.180823600036,
      "refresh_index_store_kb_total": 2993302.6438476564,
      "refresh_index_took_total": 37546.422699699906,
      "force_merge_took_total": 243785.17976769945,
      "query_with_filter_took_total": 58514.0,
      "query_with_filter_took_p50": 5.0,
      "query_with_filter_took_p90": 6.0,
      "query_with_filter_took_p99": 6.7,
      "query_with_filter_took_p99.9": 9.4,
      "query_with_filter_took_p100": 1128.0,
      "query_with_filter_client_time_total": 94834.7,
      "query_with_filter_client_time_p50": 8.7,
      "query_with_filter_client_time_p90": 9.5,
      "query_with_filter_client_time_p99": 10.5,
      "query_with_filter_client_time_p99.9": 199.4,
      "query_with_filter_client_time_p100": 1133.0,
      "query_with_filter_memory_kb_total": 1297436.0,
      "query_with_filter_recall@K_total": 0.18980632427320202,
      "query_with_filter_recall@1_total": 0.4012699802934092
    }
}

In the vector search space, a recall of 18% at K is really bad. We want to improve recall to > 0.95.

The same case happened for one of our customers: #1043

Proposed Solutions

Option 1

The proposal to solve the problem is to give customers a capability, via index settings, to define a threshold (applied at the segment level): if the filtered id count is below the threshold, we always do exact search instead of ANN, in addition to the existing filteredIds <= K check. I ran experiments (refer to the Experiment section) and found that 2K is a good default up to 4096 dimensions.

PUT /hotels-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100,
      "knn.advanced.filtered_exact_search_threshold_pct": <Threshold_Value_in_pct>,
      "knn.advanced.filtered_exact_search_threshold": <Threshold_Value>
    }
  }
  ......
}

This threshold will be defined in two ways:

  1. As a percentage, with a default value of 10%. So if the filtered ids are <= 10% of the total live documents present in the segment, we do exact search. This 10% number comes from the experiments we did to arrive at the 5K number: using the example provided above, 5K is ~12% of 41K documents.
  2. As an absolute value, with a default of 2000. So if the filtered ids are <= 2000, we do exact search.

Together, these make sure that if a customer has a high number of documents in a segment, we cap the exact search at an absolute value. With #1, we get the advantage that if a segment has only 2.5K documents, we do not fall into the exact search case.

Overall the condition to do exact search will be :

FilteredIds <=K OR (
       (filterIds/totalLiveDocuments) * 100 <= filtered_exact_search_threshold_pct
       AND 
       filterIds.length <= filtered_exact_search_threshold
    )
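The combined condition above can be sketched as a small Java predicate. This is a sketch with illustrative names, not the plugin's actual implementation:

```java
// Sketch of the proposed exact-search decision (illustrative names).
public class ExactSearchDecision {

    static boolean shouldExactSearch(int filteredIdsCount, int totalLiveDocs, int k,
                                     double thresholdPct, int thresholdAbsolute) {
        if (filteredIdsCount <= k) {
            return true; // existing rule: fewer filtered ids than K
        }
        double pct = (filteredIdsCount / (double) totalLiveDocs) * 100.0;
        // both the percentage threshold and the absolute cap must hold
        return pct <= thresholdPct && filteredIdsCount <= thresholdAbsolute;
    }

    public static void main(String[] args) {
        // 2K filtered ids in a ~41K-doc segment with K=100, defaults 10% / 2000:
        System.out.println(shouldExactSearch(2000, 41000, 100, 10.0, 2000)); // true
        // 5K filtered ids exceeds both thresholds:
        System.out.println(shouldExactSearch(5000, 41000, 100, 10.0, 2000)); // false
    }
}
```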

Pros:

  1. Customers will have full control over this switch and can tune it based on their needs.

Cons:

  1. This adds a learning curve for customers, as they now have to understand this new setting on top of M, EF_SEARCH, and EF_CONSTRUCTION.
  2. There are cases, e.g., a segment with 25K nodes and ~3K filtered ids, where we would still end up doing ANN search, which could have bad recall if the higher levels of the HNSW graph don't contain those nodes.

Option 2

If we look closely at what happens during restrictive filtered search, we are not able to reach the nodes present in the filtered set, so the graph search at the segment level does not return K documents (even when the filtered id count > K). We can take advantage of this behavior: when a filtered ANN search does not return K documents (even when the filtered id count > K), discard the result of the ANN search and perform exact search.
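Option 2's flow can be sketched as follows. The names and the `Supplier`-based wiring are illustrative stand-ins for the actual ANN and exact search calls, not the plugin's code:

```java
import java.util.function.Supplier;

// Sketch of Option 2: run the filtered ANN search first and fall back to
// exact search when the graph traversal could not reach K filtered nodes.
public class AnnWithExactFallback {

    static int[] search(int k, int filteredCount,
                        Supplier<int[]> annSearch, Supplier<int[]> exactSearch) {
        int[] annResults = annSearch.get();
        // ANN under-delivered even though enough filtered ids exist:
        if (annResults.length < k && filteredCount > k) {
            return exactSearch.get(); // discard the ANN results, do exact search
        }
        return annResults;
    }

    public static void main(String[] args) {
        // ANN returned only 3 results for k=5 while 100 filtered ids exist,
        // so the exact-search path is taken instead.
        int[] result = search(5, 100,
                () -> new int[] {1, 2, 3},
                () -> new int[] {1, 2, 3, 4, 5});
        System.out.println(result.length); // 5
    }
}
```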

Pros:

  1. The benefit of this approach is that customers don't have to configure anything in the index settings. Ease of use.

Cons:

  1. Since we do an ANN search and then an exact search, we do double work. But given that ANN search is quite fast at the segment level, the added latency should not be large. (Latency is a factor of many things, so not adding numbers here.)

Option 3

This approach is a combination of 1 & 2: we have the index settings and also the logic to do exact search if K ids are not returned by the ANN search.
Note: Even combining Options 1 & 2, there can still be cases where the HNSW graph is built in a way that leads to bad recall. For example, with 20K documents in a segment and 5K filtered ids, those filtered ids could be far apart such that we get K results, but not the best K results.

Pros:

  1. It has the pros of both the approaches above. :P

Cons:

  1. Since the two approaches above are largely complementary to one another, overall we should have good coverage of the different filtered search use cases.

Option 4(Recommended)

Please check the details of this approach here: #1049 (comment)

Experiment Setup

Philosophy

The basic philosophy behind the experiment setup is:

  1. We should be able to test different dimensions and different numbers of filtered ids quickly and efficiently.
  2. The distance calculation setup should be as close as possible to actual production. For this we use the same classes provided by Lucene that we use in the k-NN plugin for distance calculation. The reason to use Lucene classes is that we use them in production, and Lucene with JDK 20 has added SIMD support; once that is enabled in OpenSearch we will automatically get the benefits.

Setup

I created a small Gradle Java (Java 17) project with Lucene (version 9.7) as a dependency (Code: https://gist.github.com/navneet1v/87c693f015ae1267984173df7036ff32). I generated a randomized dataset with different dimensions and ran it on a Mac (16GB RAM, 6 cores). For every dimension there were 10 iterations, and the best latency was taken.

Experiment table template. 1000_latency(ms) represents the latency observed when the filtered ids count was 1000.

| dimensions | 1000_latency(ms) | 2000_latency(ms) | 5000_latency(ms) | 8000_latency(ms) | 10000_latency(ms) | 15000_latency(ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 64 |  |  |  |  |  |  |
| 128 |  |  |  |  |  |  |
| 256 |  |  |  |  |  |  |
| 512 |  |  |  |  |  |  |
| 768 |  |  |  |  |  |  |
| 968 |  |  |  |  |  |  |
| 1024 |  |  |  |  |  |  |
| 2048 |  |  |  |  |  |  |
| 4096 |  |  |  |  |  |  |

Experiment

Below are the steps that are performed to run this experiment:

  1. For every dimension and every run, generate a dataset with the required number of vectors. We chose counts of 1k, 2k, 5k, 8k, 10k, and 15k vectors. The vectors in this dataset are serialized using the same code we use in k-NN (Code ref).
  2. A query vector is then generated. We don't need to serialize the query, as this is not what happens in exact search.
  3. To do the distance calculation, we first deserialize the vectors in the dataset using the same code used by k-NN (Code ref) and then perform the distance calculation using Lucene-provided functions. This is done for each dimension and dataset. The time taken for this full step (deserialization plus distance calculation) is what is reported in the table below. (code ref)
  4. The experiment then collects all the latencies and formats them into a table.
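The timed portion of each iteration can be sketched roughly as below. This is plain Java with illustrative names: a hand-rolled ByteBuffer round-trip stands in for the k-NN serialization code, and a hand-rolled squared distance stands in for Lucene's distance functions referenced above.

```java
import java.nio.ByteBuffer;
import java.util.Random;

// Rough sketch of one timed iteration: deserialize each stored vector from
// bytes to float[] and compute its distance to the query.
public class DistanceBenchSketch {

    static byte[] serialize(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float v : vector) buf.putFloat(v);
        return buf.array();
    }

    static float[] deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) vector[i] = buf.getFloat();
        return vector;
    }

    static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; sum += d * d; }
        return sum;
    }

    public static void main(String[] args) {
        int dims = 128, count = 1000; // e.g. the 1000-filtered-ids column
        byte[][] dataset = new byte[count][];
        Random rnd = new Random(42);
        for (int i = 0; i < count; i++) {
            float[] v = new float[dims];
            for (int d = 0; d < dims; d++) v[d] = rnd.nextFloat();
            dataset[i] = serialize(v);
        }
        float[] query = new float[dims]; // zero query vector for illustration

        long start = System.nanoTime();
        float best = Float.MAX_VALUE;
        for (byte[] bytes : dataset) {
            best = Math.min(best, squareDistance(query, deserialize(bytes)));
        }
        long tookMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("best distance " + best + " in " + tookMs + " ms");
    }
}
```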

So, the query latency includes:

  1. Time taken to deserialize the vectors to floats.
  2. Distance calculation between the query vector and the indexed vectors.

| dimensions | 1000_latency_serialized(ms) | 2000_latency_serialized(ms) | 5000_latency_serialized(ms) | 8000_latency_serialized(ms) | 10000_latency_serialized(ms) | 15000_latency_serialized(ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | 0.5361 | 0.30428 | 0.77805 | 1.2804 | 1.8098 | 2.36014 |
| 128 | 0.43123 | 0.54735 | 1.24873 | 1.9306 | 2.48106 | 3.5845 |
| 256 | 0.58271 | 1.06349 | 2.07025 | 3.69847 | 4.90092 | 8.71385 |
| 512 | 0.79718 | 1.65931 | 4.18262 | 6.67985 | 8.53127 | 15.2144 |
| 768 | 1.15486 | 2.20225 | 5.68341 | 9.35401 | 12.03496 | 23.44284 |
| 968 | 1.4213 | 2.85394 | 7.74272 | 14.27054 | 14.97346 | 33.13775 |
| 1024 | 1.46418 | 3.15568 | 7.50078 | 14.56883 | 17.00156 | 29.74787 |
| 2048 | 2.97711 | 6.85896 | 16.52853 | 24.79757 | 30.17373 | 78.9689 |
| 4096 | 5.90317 | 11.74842 | 31.05969 | 64.13292 | 83.25422 | 122.25665 |

Results_with_Serialization

Experiment Conclusion

  1. Flipping to exact search at 2K filtered documents seems to be a good default. The reason for choosing 2K is that the latency difference between 2K and 5K is quite high for all dimensions, almost double. So with more than one segment, latencies would be high. Example: for 1024 dimensions with 5K documents, latency was 7.5ms, so for 5 segments it would be > 35ms, which is substantially high given that ANN search is quite fast.
  2. With Lucene adding SIMD instructions, we will be able to reduce the distance calculation time.

Metadata

Labels

Enhancements (Increases software capabilities beyond original client specifications), v2.10.0 (Issues targeting release v2.10.0)

Projects

Status

✅ Done
