[8.19] Add L2 norm normalization support to linear retriever #128972

mridula-s109 · 2025-06-05T10:58:45Z

This PR backports the following changes to the 8.19 branch:

Adds L2 norm normalization support to the linear retriever
Includes Javadoc clarification for L2ScoreNormalizer

Related Issues/PRs

Original PR: Add l2_norm normalization support to linear retriever #128504 (Add l2_norm normalization support)
Related PR: Clarify Javadoc for L2ScoreNormalizer (l2_norm) #128808 (Javadoc clarification)

* New l2 normalizer added
* L2 score normaliser is registered
* test case added to the yaml
* Documentation added
* Resolved checkstyle issues
* Update docs/changelog/128504.yaml
* Update docs/reference/elasticsearch/rest-apis/retrievers.md

  Co-authored-by: Copilot <[email protected]>
* Score 0 test case added to check for corner cases
* Edited the markdown doc description
* Pruned the comment
* Renamed the variable
* Added comment to the class
* Unit tests added
* Spotless and checkstyle fixed
* Fixed build failure
* Fixed the forbidden test

---------

Co-authored-by: Copilot <[email protected]>

* propgating retrievers to inner retrievers
* Java doc fixed
* Cleaned up
* Update docs/changelog/128808.yaml
* Enhanced comment as stated by the copilot
* Delete docs/changelog/128808.yaml

Copilot

Pull Request Overview

This PR backports changes to add L2 norm normalization support to the linear retriever along with corresponding documentation updates. Key changes include:

Adding YAML-based REST tests for L2 normalization, including cases for typical scores and all-zero scores.
Implementing L2ScoreNormalizer in the main code and integrating it into the ScoreNormalizer value resolution.
Introducing unit tests in L2ScoreNormalizerTests to verify normalization behavior.
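
For readers unfamiliar with the technique, L2 normalization divides each score by the Euclidean norm of the whole score list, so the normalized scores have unit length. The following is a minimal standalone sketch of that idea, not the `L2ScoreNormalizer` code from this PR. The class name `L2NormSketch`, the method `l2Normalize`, the `EPSILON` threshold, and the choice to return all-zero input unchanged are all assumptions made for illustration; the actual implementation operates inside the retriever framework and may treat edge cases differently.

```java
import java.util.Arrays;

public class L2NormSketch {
    // Hypothetical threshold: norms below this are treated as zero to avoid dividing by ~0.
    private static final double EPSILON = 1e-6;

    /**
     * Returns a new array in which every score is divided by the L2 (Euclidean)
     * norm of the input. Assumption: if the norm is effectively zero, a copy of
     * the input is returned unchanged.
     */
    public static float[] l2Normalize(float[] scores) {
        double sumOfSquares = 0.0;
        for (float s : scores) {
            sumOfSquares += (double) s * s;
        }
        double norm = Math.sqrt(sumOfSquares);

        float[] normalized = Arrays.copyOf(scores, scores.length);
        if (norm < EPSILON) {
            return normalized; // all-zero (or empty) input: nothing to scale
        }
        for (int i = 0; i < normalized.length; i++) {
            normalized[i] = (float) (scores[i] / norm);
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Typical scores: the norm of (3, 4) is 5.
        System.out.println(Arrays.toString(l2Normalize(new float[] { 3f, 4f })));
        // All-zero scores: left as-is in this sketch.
        System.out.println(Arrays.toString(l2Normalize(new float[] { 0f, 0f })));
    }
}
```

The two calls in `main` mirror the two scenarios the YAML tests are described as covering (typical scores and all-zero scores); the zero-norm guard is what keeps the second case from producing `NaN` values.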

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| x-pack/plugin/rank-rrf/src/yamlRestTest/resources/rest-api-spec/test/linear/10_linear_retriever.yml | Added tests to validate normalization of scores with l2_norm, handling both typical and zero score scenarios |
| x-pack/plugin/rank-rrf/src/test/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizerTests.java | Introduced unit tests for various normalization edge cases |
| x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/ScoreNormalizer.java | Updated to recognize and return L2ScoreNormalizer when requested |
| x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizer.java | New implementation of L2 normalization logic for scores |
| docs/changelog/128504.yaml | Changelog entry for tracking the enhancement |

x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizer.java

kderusso

Thanks for creating the backport. A couple of things need to be addressed before we backport.

docs/reference/elasticsearch/rest-apis/retrievers.md

...plugin/rank-rrf/src/yamlRestTest/resources/rest-api-spec/test/linear/10_linear_retriever.yml

x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizer.java

x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/ScoreNormalizer.java

* propgating retrievers to inner retrievers
* test feature taken care of
* Small changes in concurrent multipart upload interfaces (elastic#128977)

  Small changes in BlobContainer interface and wrapper. Relates ES-11815
* Unmute FollowingEngineTests#testProcessOnceOnPrimary() test (elastic#129054)

  The reason the test fails is that operations contained _seq_no field with different doc value types (with no skippers and with skippers) and this isn't allowed, since field types need to be consistent in a Lucene index. The initial operations were generated not knowing about the fact the index mode was set to logsdb or time_series. Causing the operations to not have doc value skippers. However when replaying the operations via following engine, the operations did have doc value skippers. The fix is to set `index.seq_no.index_options` to `points_and_doc_values`, so that the initial operations are indexed without doc value skippers. This test doesn't gain anything from storing seqno with doc value skippers, so there is no loss of testing coverage. Closes elastic#128541
* [Build] Add support for publishing to maven central (elastic#128659)

  This ensures we package an aggregation zip with all artifacts we want to publish to maven central as part of a release. Running zipAggregation will produce a zip file in the build/nmcp/zip folder. The content of this zip is meant to match the maven artifacts we have currently declared as dra maven artifacts.
* ESQL: Check for errors while loading blocks (elastic#129016)

  Runs a sanity check after loading a block of values. Previously we were doing a quick check if assertions were enabled. Now we do two quick checks all the time. Better - we attach information about how a block was loaded when there's a problem. Relates to elastic#128959
* Make `PhaseCacheManagementTests` project-aware (elastic#129047)

  The functionality in `PhaseCacheManagement` was already project-aware, but these tests were still using deprecated methods.
* Vector test tools (elastic#128934)

  This adds some testing tools for verifying vector recall and latency directly without having to spin up an entire ES node and running a rally track. Its pretty barebones and takes inspiration from lucene-util, but I wanted access to our own formats and tooling to make our lives easier. Here is an example config file. This will build the initial index, run queries at num_candidates: 50, then again at num_candidates 100 (without reindexing, and re-using the cached nearest neighbors).

  ```
  [
    {
      "doc_vectors" : "path",
      "query_vectors" : "path",
      "num_docs" : 10000,
      "num_queries" : 10,
      "index_type" : "hnsw",
      "num_candidates" : 50,
      "k" : 10,
      "hnsw_m" : 16,
      "hnsw_ef_construction" : 200,
      "index_threads" : 4,
      "reindex" : true,
      "force_merge" : false,
      "vector_space" : "maximum_inner_product",
      "dimensions" : 768
    },
    {
      "doc_vectors" : "path",
      "query_vectors" : "path",
      "num_docs" : 10000,
      "num_queries" : 10,
      "index_type" : "hnsw",
      "num_candidates" : 100,
      "k" : 10,
      "hnsw_m" : 16,
      "hnsw_ef_construction" : 200,
      "vector_space" : "maximum_inner_product",
      "dimensions" : 768
    }
  ]
  ```

  To execute:

  ```
  ./gradlew :qa:vector:checkVec --args="/Path/to/knn_tester_config.json"
  ```

  Calling `./gradlew :qa:vector:checkVecHelp` gives some guidance on how to use it, additionally providing a way to run it via java directly (useful to bypass gradlew guff).
* ES|QL: refactor generative tests (elastic#129028)
* Add a test of LOOKUP JOIN against a time series index (elastic#129007)

  Add a spec test of `LOOKUP JOIN` against a time series index.
* Make ILM `ClusterStateWaitStep` project-aware (elastic#129042)

  This is part of an iterative process to make ILM project-aware.
* Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {lookup-join.LookupJoinOnTimeSeriesIndex ASYNC} elastic#129078
* Remove `ClusterState` param from ILM `AsyncBranchingStep` (elastic#129076)

  The `ClusterState` parameter of the `asyncPredicate` is not used anywhere.
* Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {lookup-join.LookupJoinOnTimeSeriesIndex SYNC} elastic#129082
* Mute org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT test {p0=upgraded_cluster/70_ilm/Test Lifecycle Still There And Indices Are Still Managed} elastic#129097
* Mute org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT test {p0=upgraded_cluster/90_ml_data_frame_analytics_crud/Get mixed cluster outlier_detection job} elastic#129098
* Mute org.elasticsearch.packaging.test.DockerTests test081SymlinksAreFollowedWithEnvironmentVariableFiles elastic#128867
* Threadpool merge executor is aware of available disk space (elastic#127613)

  This PR introduces 3 new settings: indices.merge.disk.check_interval, indices.merge.disk.watermark.high, and indices.merge.disk.watermark.high.max_headroom that control if the threadpool merge executor starts executing new merges when the disk space is getting low. The intent of this change is to avoid the situation where in-progress merges exhaust the available disk space on the node's local filesystem. To this end, the thread pool merge executor periodically monitors the available disk space, as well as the current disk space estimates required by all in-progress (currently running) merges on the node, and will NOT schedule any new merges if the disk space is getting low (by default below the 5% limit of the total disk space, or 100 GB, whichever is smaller (same as the disk allocation flood stage level)).
* Add option to include or exclude vectors from _source retrieval (elastic#128735)

  This PR introduces a new include_vectors option to the _source retrieval context. When set to false, vectors are excluded from the returned _source. This is especially efficient when used with synthetic source, as it avoids loading vector fields entirely. By default, vectors remain included unless explicitly excluded.
* Remove direct minScore propagation to inner retrievers
* cleaned up skip
* Mute org.elasticsearch.index.engine.ThreadPoolMergeExecutorServiceDiskSpaceTests testAvailableDiskSpaceMonitorWhenFileSystemStatErrors elastic#129149
* Add transport version for ML inference Mistral chat completion (elastic#129033)
  * Add transport version for ML inference Mistral chat completion
  * Add changelog for Mistral Chat Completion version fix
  * Revert "Add changelog for Mistral Chat Completion version fix"

    This reverts commit 7a57416.
* Correct index path validation (elastic#129144)

  All we care about is if reindex is true or false. We shouldn't worry about force merge. Because if reindex is true, we will create the directory, if its false, we won't.
* Mute org.elasticsearch.index.engine.ThreadPoolMergeExecutorServiceDiskSpaceTests testUnavailableBudgetBlocksNewMergeTasksFromStartingExecution elastic#129148
* Implemented completion task for Google VertexAI (elastic#128694)
  * Google Vertex AI completion model, response entity and tests
  * Fixed GoogleVertexAiServiceTest for Service configuration
  * Changelog
  * Removed downcasting and using `moveToFirstToken`
  * Create GoogleVertexAiChatCompletionResponseHandler for streaming and non streaming responses
  * Added unit tests
  * PR feedback
  * Removed googlevertexaicompletion model. Using just GoogleVertexAiChatCompletionModel for completion and chat completion
  * Renamed uri -> nonStreamingUri. Added streamingUri and getters in GoogleVertexAiChatCompletionModel
  * Moved rateLimitGroupHashing to subclasses of GoogleVertexAiModel
  * Fixed rate limit has of GoogleVertexAiRerankModel and refactored uri for GoogleVertexAiUnifiedChatCompletionRequest

  ---------

  Co-authored-by: lhoet-google <[email protected]>
  Co-authored-by: Jonathan Buttner <[email protected]>
* Added cluster feature to yaml
* Node feature added
* Duplicate line - result of merge removed
* Update docs/changelog/129181.yaml
* Update 129181.yaml

---------

Co-authored-by: Tanguy Leroux <[email protected]>
Co-authored-by: Martijn van Groningen <[email protected]>
Co-authored-by: Rene Groeschke <[email protected]>
Co-authored-by: Nik Everett <[email protected]>
Co-authored-by: Niels Bauman <[email protected]>
Co-authored-by: Benjamin Trent <[email protected]>
Co-authored-by: Luigi Dell'Aquila <[email protected]>
Co-authored-by: Bogdan Pintea <[email protected]>
Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Albert Zaharovits <[email protected]>
Co-authored-by: Jim Ferenczi <[email protected]>
Co-authored-by: Jan-Kazlouski-elastic <[email protected]>
Co-authored-by: Leonardo Hoet <[email protected]>
Co-authored-by: lhoet-google <[email protected]>
Co-authored-by: Jonathan Buttner <[email protected]>

kderusso

Looks good, let's fix the failing test and docs build error. Thanks for creating this!

docs/reference/search/search-your-data/retrievers-overview.asciidoc

mridula-s109 and others added 2 commits June 5, 2025 11:52

Clarify Javadoc for L2ScoreNormalizer (l2_norm) (elastic#128808)

c10f782

mridula-s109 requested a review from Copilot June 5, 2025 10:58

elasticsearchmachine added the v8.19.0 and needs:triage labels Jun 5, 2025

mridula-s109 added the backport label Jun 5, 2025

Copilot AI reviewed Jun 5, 2025

x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizer.java (resolved)

x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizer.java (resolved)

mridula-s109 added the Team:Search label Jun 5, 2025

mridula-s109 requested review from ioanatia and a team June 5, 2025 10:59

Merge branch '8.19' into backport/8.19/pr-128504-pr-128808

fe0d87e

kderusso reviewed Jun 5, 2025

docs/reference/elasticsearch/rest-apis/retrievers.md (outdated, resolved)

...plugin/rank-rrf/src/yamlRestTest/resources/rest-api-spec/test/linear/10_linear_retriever.yml (resolved)

afoucret approved these changes Jun 5, 2025

mridula-s109 and others added 4 commits June 10, 2025 19:12

Remove changelog for 129181, keep only 128504.yaml as the changelog entry

59dc593

Remove redundant retrievers.md, documentation is now in retrievers-overview.asciidoc

5a48427

updated retriever-overview.asciidoc

eea0d29

mridula-s109 requested a review from kderusso June 10, 2025 18:28

kderusso reviewed Jun 10, 2025

docs/reference/search/search-your-data/retrievers-overview.asciidoc (outdated, resolved)

mridula-s109 added 2 commits June 11, 2025 14:25

Resolved duplicate tag issue

65b9c5b

Merge branch '8.19' into backport/8.19/pr-128504-pr-128808

67a3290

mridula-s109 enabled auto-merge (squash) June 11, 2025 13:27

mridula-s109 added 4 commits June 11, 2025 14:33

Merge branch '8.19' into backport/8.19/pr-128504-pr-128808

e8192d9

Resolved the test case which caused because of merge

1aced59

Merge branch '8.19' into backport/8.19/pr-128504-pr-128808

cbd1cf0

Merge branch '8.19' into backport/8.19/pr-128504-pr-128808

d644583

mridula-s109 merged commit 90d0f63 into elastic:8.19 Jun 11, 2025
15 checks passed
