Investigate migrating custom codec from BinaryDocValuesFormat to KnnVectorsFormat #1087

Closed
@jmazanec15

Description

Currently, we integrate our native libraries with OpenSearch through Lucene's DocValuesFormat. When this integration was built, Lucene did not yet have KnnVectorsFormat (which was released in 9.0).

Now that it exists, I am wondering if we should move to use KnnVectorsFormat. KnnVectorsFormat has a KnnVectorsWriter and KnnVectorsReader. Migrating to KnnVectorsFormat would allow us to:

  1. reduce branching logic between the different engines we support
  2. avoid duplicate effort. For instance, we might be able to more easily support features such as incremental index building: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/KnnFieldVectorsWriter.java#L37.
  3. support future features more easily

In general, it would bring the native library integrations more in line with the Lucene architecture, which would have long-term benefits for maintainability and extensibility.
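To make the writer side of this concrete, here is a minimal sketch of the shape being discussed: a segment-level writer that hands out one per-field writer, which then receives vectors one document at a time. The classes `KnnWriterSketch`, `FieldWriter`, and `SegmentWriter` are hypothetical stand-ins written in plain Java. They are not the Lucene classes, and their method names only loosely mirror the Lucene ones, so treat this as an illustration of the idea rather than the actual API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; these are NOT the Lucene classes.
class KnnWriterSketch {

    /** Hypothetical per-field writer: receives one vector per document as it is indexed. */
    static final class FieldWriter {
        private final String fieldName;
        private final int dimension;
        private final List<Integer> docIds = new ArrayList<>();
        private final List<float[]> vectors = new ArrayList<>();

        FieldWriter(String fieldName, int dimension) {
            this.fieldName = fieldName;
            this.dimension = dimension;
        }

        /** Buffers a single vector; an engine could instead update its index here. */
        void addValue(int docId, float[] vector) {
            if (vector.length != dimension) {
                throw new IllegalArgumentException(
                    fieldName + ": expected dimension " + dimension + ", got " + vector.length);
            }
            docIds.add(docId);
            vectors.add(vector.clone()); // copy, since callers may reuse the array
        }

        int size() {
            return vectors.size();
        }
    }

    /** Hypothetical segment-level writer: owns one FieldWriter per vector field. */
    static final class SegmentWriter {
        private final Map<String, FieldWriter> fields = new LinkedHashMap<>();

        FieldWriter addField(String fieldName, int dimension) {
            FieldWriter writer = new FieldWriter(fieldName, dimension);
            if (fields.putIfAbsent(fieldName, writer) != null) {
                throw new IllegalStateException("field already added: " + fieldName);
            }
            return writer;
        }

        /** Stands in for handing buffered vectors to an engine; returns vector counts per field. */
        Map<String, Integer> flush() {
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (Map.Entry<String, FieldWriter> entry : fields.entrySet()) {
                counts.put(entry.getKey(), entry.getValue().size());
            }
            return counts;
        }
    }

    public static void main(String[] args) {
        SegmentWriter segment = new SegmentWriter();
        FieldWriter embedding = segment.addField("embedding", 3);
        embedding.addValue(0, new float[] {0.1f, 0.2f, 0.3f});
        embedding.addValue(1, new float[] {0.4f, 0.5f, 0.6f});
        System.out.println(segment.flush());
    }
}
```

Because vectors arrive per document rather than as one opaque binary blob at flush time, an engine-specific writer would have a natural place to build its structure incrementally. Whether our native engines can take advantage of that is part of what needs to be scoped.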

All that being said, we need to do a deep dive into what switching means in terms of backwards compatibility and also scope out how much work needs to be done.

Tracking list of benefits of moving to KnnVectorsFormat

  1. [FEATURE] Move Lucene Vector field and HNSW KNN Search as a first class feature in core #1467
  2. [FEATURE] [Build-TimeReduction V1] Merge Optimization: Streaming vectors from Java Layer to JNI layer to avoid OOM/Circuit Breaker in native engines #1506
  3. [Enhancement] Optimize the de-serialization of vector when reading from Doc Values #1050
  4. [Feature] k-NN Array support for Vector Field type #675
  5. Incremental index building for vector fields
  6. Smaller blast radius for the codec. Right now, for any k-NN index, our consumer handles all binary doc values, so an issue in it could impact non-k-NN fields. With KnnVectorsFormat, we would only be responsible for the field's k-NN vectors
  7. Open up the possibility of using non-FSDirectory-backed indices (i.e. remote storage)
  8. Use byte values with native libraries without reimplementing int8 doc values
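The blast-radius point (item 6) can be illustrated with a small model. The sketch below assumes a segment whose fields are all stored as binary doc values today, some of them k-NN vectors and some not. `BlastRadiusSketch`, `Field`, and both `handledBy...` methods are hypothetical names invented for this example; they are not part of Lucene or the k-NN plugin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model for illustration only; these are NOT Lucene or k-NN plugin classes.
class BlastRadiusSketch {

    /** Hypothetical description of a field in a k-NN index. */
    static final class Field {
        final String name;
        final boolean isKnnVector;

        Field(String name, boolean isKnnVector) {
            this.name = name;
            this.isKnnVector = isKnnVector;
        }
    }

    /** Doc-values design: the custom consumer sits in the path of every binary doc values field. */
    static List<String> handledByDocValuesConsumer(List<Field> binaryFields) {
        List<String> handled = new ArrayList<>();
        for (Field field : binaryFields) {
            handled.add(field.name); // k-NN or not, the consumer sees it
        }
        return handled;
    }

    /** Vectors-format design: only vector fields are routed to the custom format. */
    static List<String> handledByVectorsFormat(List<Field> fields) {
        List<String> handled = new ArrayList<>();
        for (Field field : fields) {
            if (field.isKnnVector) {
                handled.add(field.name);
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        List<Field> fields = Arrays.asList(
            new Field("embedding", true),
            new Field("payload", false));
        System.out.println("doc values consumer: " + handledByDocValuesConsumer(fields));
        System.out.println("vectors format:      " + handledByVectorsFormat(fields));
    }
}
```

In this model, a bug in the custom code path can only touch the fields in the returned list, so narrowing that list to vector fields is what shrinks the blast radius.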

Metadata

Labels

Enhancements (Increases software capabilities beyond original client specifications), v2.17.0

Projects

Status

✅ Done
