Feat: Add `FixedCharLengthChunker` for character length-based chunking #1342

YeonghyeonKO · 2025-05-22T16:37:28Z

Description

This pull request introduces a new chunking algorithm, FixedCharLengthChunker, which splits text content by a fixed number of characters rather than by tokens or delimiters. Chunking by character count is especially useful for languages that do not separate words with spaces, such as Chinese, Japanese, Thai, and Khmer. The new chunker also combines well with the Delimiter chunker, because multiple processors can be chained within an ingest pipeline.

As an example, one could first use a delimiter like "\n\n" to create initial chunks, which can then be further split by a specific length.
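The two-stage idea can be sketched in plain Java. This is only an illustration under stated assumptions: the class `CascadedChunkingSketch` and its methods are hypothetical names, the code does not call the plugin's chunkers, and, for simplicity, the delimiter is dropped from the first-stage pieces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; it does not use the plugin's DelimiterChunker or FixedCharLengthChunker.
class CascadedChunkingSketch {

    // Stage 1: split on a literal delimiter (assumption: the delimiter is dropped, empty pieces are skipped).
    static List<String> splitByDelimiter(String content, String delimiter) {
        List<String> pieces = new ArrayList<>();
        for (String piece : content.split(Pattern.quote(delimiter))) {
            if (!piece.isEmpty()) {
                pieces.add(piece);
            }
        }
        return pieces;
    }

    // Stage 2: cut one piece into consecutive passages of at most maxChars characters.
    static List<String> splitByLength(String piece, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> passages = new ArrayList<>();
        for (int start = 0; start < piece.length(); start += maxChars) {
            passages.add(piece.substring(start, Math.min(start + maxChars, piece.length())));
        }
        return passages;
    }

    // Chains both stages, the way two processors would be chained in a pipeline.
    static List<String> cascade(String content, String delimiter, int maxChars) {
        List<String> result = new ArrayList<>();
        for (String piece : splitByDelimiter(content, delimiter)) {
            result.addAll(splitByLength(piece, maxChars));
        }
        return result;
    }

    public static void main(String[] args) {
        // The first paragraph fits within 10 characters; the second is split further.
        System.out.println(cascade("para one\n\npara two is longer", "\n\n", 10));
    }
}
```

Short paragraphs pass through the second stage unchanged, so only oversized paragraphs are split again.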

The key features and changes include:

Added FixedCharLengthChunker.java, which extends the Chunker abstract class, along with unit tests that verify its behavior.
The chunker splits content into passages of a specified maximum character length (char_limit).
It supports an overlap_rate parameter that defines the percentage of characters from the previous chunk to include in the current chunk.
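
A minimal sketch of how char_limit and overlap_rate could interact is shown below. It is not the FixedCharLengthChunker source from this PR: the class name `FixedCharLengthSketch`, the rounding of the overlap with `Math.floor`, and the accepted range of 0 to 0.5 for the overlap rate are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the implementation added in this PR.
class FixedCharLengthSketch {

    /**
     * Splits content into passages of at most charLimit characters.
     * overlapRate is the fraction of charLimit repeated from the previous passage.
     */
    static List<String> chunk(String content, int charLimit, double overlapRate) {
        if (charLimit <= 0) {
            throw new IllegalArgumentException("charLimit must be positive");
        }
        // Assumed bound: at most half of each passage may overlap with the previous one.
        if (overlapRate < 0.0 || overlapRate > 0.5) {
            throw new IllegalArgumentException("overlapRate must be between 0 and 0.5");
        }
        List<String> passages = new ArrayList<>();
        int overlapChars = (int) Math.floor(charLimit * overlapRate);
        // Each new passage starts this many characters after the previous start.
        int step = Math.max(1, charLimit - overlapChars);
        int start = 0;
        while (start < content.length()) {
            int end = Math.min(start + charLimit, content.length());
            passages.add(content.substring(start, end));
            if (end == content.length()) {
                break; // the final passage has been emitted
            }
            start += step;
        }
        return passages;
    }

    public static void main(String[] args) {
        System.out.println(chunk("abcdefghij", 4, 0.5)); // [abcd, cdef, efgh, ghij]
        System.out.println(chunk("abcdefghij", 4, 0.0)); // [abcd, efgh, ij]
    }
}
```

Note that Java `String` indices count UTF-16 code units, so a sketch like this can split a surrogate pair in text that contains supplementary characters.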

Related Issues

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

YeonghyeonKO · 2025-05-22T16:39:10Z

As discussed with @yuye-aws, this pull request implements just a simple length-based chunking algorithm. Currently, only unit tests have been added. Once the code review is mostly complete, I’ll add integration test examples that demonstrate the usage of the new chunking algorithm and run the BWC tests.

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java

yuye-aws · 2025-05-23T02:49:32Z

Once the code review is mostly complete, I’ll proceed with adding the integration test examples that demonstrate the usage of the chunk_size parameter and also with running BWC tests.

Makes sense. Looking into your PR now

yuye-aws · 2025-05-23T02:59:29Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java

+    public static final String OVERLAP_RATE_FIELD = "overlap_rate";
+
+    // Default values for each non-runtime parameter
+    private static final int DEFAULT_LENGTH_LIMIT = 500; // Default character limit per chunk


I would suggest setting DEFAULT_LENGTH_LIMIT to 2048, because 1 token is approximately 4 characters and 512 tokens is a common limit for text embedding models.

Makes sense. Newer or larger embedding models might support longer sequences (e.g., 1024, 2048, 4096, or even more tokens), but 512 remains a well-known baseline, as you said. Estimating 1 token as 4 characters is practical as well. The value has been changed from 500 to 2048.
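
The arithmetic behind the new default can be written out as a tiny sketch. The class and constant names here are hypothetical, and the 4-characters-per-token ratio is the rough estimate from the comment above, not a property of any particular tokenizer.

```java
// Illustrative sketch only; names are hypothetical and the ratio is a rule of thumb.
class DefaultCharLimitSketch {

    static final int COMMON_TOKEN_LIMIT = 512;   // common input limit for text embedding models
    static final int APPROX_CHARS_PER_TOKEN = 4; // rough estimate from the review comment

    // Converts a token budget into an approximate character budget.
    static int estimateCharLimit(int tokenLimit, int charsPerToken) {
        return Math.multiplyExact(tokenLimit, charsPerToken);
    }

    public static void main(String[] args) {
        // 512 tokens * 4 characters per token = 2048 characters
        System.out.println(estimateCharLimit(COMMON_TOKEN_LIMIT, APPROX_CHARS_PER_TOKEN));
    }
}
```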

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java

CHANGELOG.md

src/main/java/org/opensearch/neuralsearch/processor/chunker/CharacterLengthChunker.java

yuye-aws · 2025-05-25T02:08:30Z

@vibrantvarun Do you think this PR needs BWC test?

YeonghyeonKO · 2025-05-27T07:52:36Z

src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java

@@ -304,7 +305,12 @@ private void recordChunkingExecutionStats(String algorithmName) {
        EventStatsManager.increment(EventStatName.TEXT_CHUNKING_PROCESSOR_EXECUTIONS);
        switch (algorithmName) {
            case DelimiterChunker.ALGORITHM_NAME -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_DELIMITER_EXECUTIONS);
-            case FixedTokenLengthChunker.ALGORITHM_NAME -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_FIXED_LENGTH_EXECUTIONS);


Hi @q-andy,
since I added the new chunking algorithm fixed_char_length, EventStatName.TEXT_CHUNKING_FIXED_LENGTH_EXECUTIONS will be split into two enum values.

@q-andy Can you also review this PR to see what is needed for stats?

Sure. Since we added the text chunking algorithm stats in 3.1, it's okay to split them. In the future, post-3.1, we should avoid changing stat names, since that is technically a breaking change for the API response.
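
The split discussed in this thread might look roughly like the following self-contained sketch. It does not use the plugin's EventStatsManager or EventStatName; the `StatName` enum values, the counter map, and the algorithm name strings "delimiter" and "fixed_token_length" are assumptions for illustration (only fixed_char_length is named in this thread).

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative sketch only; a stand-in for per-algorithm execution counters.
class ChunkingStatsSketch {

    // Hypothetical stat names: one counter per algorithm instead of a shared fixed-length counter.
    enum StatName {
        PROCESSOR_EXECUTIONS,
        DELIMITER_EXECUTIONS,
        FIXED_TOKEN_LENGTH_EXECUTIONS,
        FIXED_CHAR_LENGTH_EXECUTIONS
    }

    private final Map<StatName, Long> counters = new EnumMap<>(StatName.class);

    void increment(StatName name) {
        counters.merge(name, 1L, Long::sum);
    }

    long get(StatName name) {
        return counters.getOrDefault(name, 0L);
    }

    // Records one processor execution and attributes it to the algorithm that ran.
    void record(String algorithmName) {
        increment(StatName.PROCESSOR_EXECUTIONS);
        switch (algorithmName) {
            case "delimiter" -> increment(StatName.DELIMITER_EXECUTIONS);
            case "fixed_token_length" -> increment(StatName.FIXED_TOKEN_LENGTH_EXECUTIONS);
            case "fixed_char_length" -> increment(StatName.FIXED_CHAR_LENGTH_EXECUTIONS);
            default -> { } // unknown algorithms only count toward the processor total
        }
    }

    public static void main(String[] args) {
        ChunkingStatsSketch stats = new ChunkingStatsSketch();
        stats.record("fixed_char_length");
        stats.record("fixed_token_length");
        System.out.println(stats.get(StatName.FIXED_CHAR_LENGTH_EXECUTIONS));
    }
}
```

Keeping one counter per algorithm means character-based and token-based executions can be told apart in the stats response, which is why renaming them later would change the response shape.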

src/test/resources/processor/chunker/PipelineForFixedCharLengthChunker.json

yuye-aws · 2025-05-27T10:23:40Z

@YeonghyeonKO Do you think you can get this PR finished before the code freeze date of 3.1.0 (June 10th)? If so, I'll add the 3.1.0 label to this PR.

yuye-aws · 2025-06-03T02:22:29Z

@YeonghyeonKO I can review now. Can you fix the conflicts?

YeonghyeonKO · 2025-06-03T05:34:27Z

@yuye-aws I fixed the conflict.

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedCharLengthChunker.java

src/test/java/org/opensearch/neuralsearch/processor/TextChunkingProcessorIT.java

yuye-aws · 2025-06-03T06:21:54Z

@heemin32 @martin-gaievski Hi, can you help review this PR? It is targeted towards 3.1.0 (June 10th code freeze date)

vibrantvarun · 2025-06-04T20:11:56Z

Tests and the Gradle check are failing. Could you fix them?
cc: @YeonghyeonKO @yuye-aws

yuye-aws · 2025-06-05T02:18:18Z

Tests, gradle check are failing. Could you guys fix it? cc: @YeonghyeonKO @yuye-aws

Running the Gradle checks now. Can you verify the flaky tests later?

yuye-aws · 2025-06-05T03:48:20Z

@vibrantvarun Can you help verify this: https://github.com/opensearch-project/neural-search/actions/runs/15457223483/job/43511630203?pr=1342?

heemin32 · 2025-06-05T17:15:12Z

@YeonghyeonKO Thanks for the contribution! Before we can merge this PR, it needs to go through a security review and have an accompanying documentation PR prepared. Given the tight timeline, it may not be feasible to include this in the 3.1 release. @yuye-aws Could you help with security review and documentation here?

YeonghyeonKO · 2025-06-06T02:12:14Z

@heemin32 Thank you for reviewing the PR! I think the related documentation is _ingest-pipelines/processors/text-chunking.md. Is it okay to create a new PR in the documentation-website repository?

yuye-aws · 2025-06-06T03:07:40Z

@YeonghyeonKO Thanks for the contribution! Before we can merge this PR, it needs to go through a security review and have an accompanying documentation PR prepared. Given the tight timeline, it may not be feasible to include this in the 3.1 release. @yuye-aws Could you help with security review and documentation here?

This PR is not introducing an API, and instead, it is providing an algorithm. Do you think there is any security risk?

yuye-aws · 2025-06-06T03:08:08Z

@heemin32 Thank you for reviewing the PR! I think the related documentation is _ingest-pipelines/processors/text-chunking.md. Is it okay to create new PR in documentation-website repository?

Yes. You can create a documentation PR and ping me to review it.

heemin32 · 2025-06-07T00:24:59Z

@YeonghyeonKO Thanks for the contribution! Before we can merge this PR, it needs to go through a security review and have an accompanying documentation PR prepared. Given the tight timeline, it may not be feasible to include this in the 3.1 release. @yuye-aws Could you help with security review and documentation here?

This PR is not introducing an API, and instead, it is providing an algorithm. Do you think there is any security risk?

I don't see any security risk. However, because it includes a new parameter, it might be better to consult with a security engineer.

yuye-aws · 2025-06-08T11:42:14Z

I don't see any security risk. However, because it includes a new parameter, it might be better to consult with a security engineer.

Thanks for reviewing the code. I'll go through the code once again before approving.

yuye-aws · 2025-06-09T02:31:29Z

@YeonghyeonKO This PR also looks good to me. We can merge it before the code freeze date (June 10th). If there is a security concern, we can revert it in the AOS 3.1.0 branch. cc: @heemin32

codecov · 2025-06-10T04:54:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (097820c) to head (481a28f).
Report is 1 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #1342       +/-   ##
============================================
- Coverage     82.83%       0   -82.84%     
============================================
  Files           149       0      -149     
  Lines          7573       0     -7573     
  Branches       1218       0     -1218     
============================================
- Hits           6273       0     -6273     
+ Misses          835       0      -835     
+ Partials        465       0      -465


YeonghyeonKO requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, vibrantvarun, zhichao-aws, yuye-aws and minalsha as code owners May 22, 2025 16:37

YeonghyeonKO commented May 23, 2025

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java (outdated, resolved)

yuye-aws reviewed May 23, 2025

YeonghyeonKO changed the base branch from 3.0 to main May 24, 2025 00:46

YeonghyeonKO changed the title ~~Feat: Add FixedStringLengthChunker for length-based chunking~~ Feat: Add CharacterLengthChunker for length-based chunking May 24, 2025

yuye-aws reviewed May 25, 2025

CHANGELOG.md (resolved)

src/main/java/org/opensearch/neuralsearch/processor/chunker/CharacterLengthChunker.java (outdated, resolved)

YeonghyeonKO changed the title ~~Feat: Add CharacterLengthChunker for length-based chunking~~ Feat: Add FixedCharLengthChunker for character length-based chunking May 26, 2025

YeonghyeonKO force-pushed the feat/fixed-string-length-chunker branch 2 times, most recently from 8d76d84 to 4bf94d7 Compare May 27, 2025 07:50

YeonghyeonKO commented May 27, 2025

yuye-aws mentioned this pull request May 27, 2025

Feat: Add chunk_size parameter to DelimiterChunker #1330

Closed

5 tasks

yuye-aws added the v3.1.0 label May 27, 2025

Merge branch 'main' into feat/fixed-string-length-chunker

074fcde

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>

yuye-aws reviewed Jun 3, 2025

[REFACTOR] adopt FixedTokenLengthChunker's loop strategy for robust f…

195d0c9

…inal chunking Signed-off-by: yeonghyeonKo <[email protected]>

YeonghyeonKO requested a review from yuye-aws June 3, 2025 12:21

Merge branch 'main' into feat/fixed-string-length-chunker

c1eb011

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>

YeonghyeonKO added 3 commits June 5, 2025 10:34

[TEST] sum the number of processors and their executions correctly in…

c527296

… TextChunkingProcessorIT Signed-off-by: yeonghyeonKo <[email protected]>

Merge branch 'main' into feat/fixed-string-length-chunker

7e7d903

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>

[REFACTOR] gradlew spotlessApply

aba5d4a

Signed-off-by: yeonghyeonKo <[email protected]>

Merge branch 'main' into feat/fixed-string-length-chunker

aa52128

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>

YeonghyeonKO mentioned this pull request Jun 6, 2025

Introduce fixed_char_length algorithm in Text Chunking opensearch-project/documentation-website#10043

Merged

1 task

heemin32 approved these changes Jun 6, 2025

yuye-aws approved these changes Jun 9, 2025

Merge branch 'main' into feat/fixed-string-length-chunker

481a28f

Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>

yuye-aws merged commit ce023f5 into opensearch-project:main Jun 10, 2025
241 of 267 checks passed

This was referenced Jun 10, 2025

[FEATURE] Update neural-search stats API spec with new stats added in 3.1 opensearch-project/opensearch-api-specification#890

Open

[DOC] Update neural-search stats API docs with new stats added in 3.1 opensearch-project/documentation-website#9943

Closed
