Feat: Add FixedCharLengthChunker for character length-based chunking #1342

Merged

Conversation

YeonghyeonKO
Contributor

@YeonghyeonKO YeonghyeonKO commented May 22, 2025

Description

This pull request introduces a new chunking algorithm, FixedCharLengthChunker, which splits text content by a fixed number of characters rather than by tokens or delimiters. Chunking by character count is especially useful for languages that do not use spaces to separate words, such as Chinese, Japanese, Thai, and Khmer. The new chunker also works well when cascaded with the Delimiter chunker, because multiple processors can be chained within an ingest pipeline.

As an example, one could first use a delimiter like "\n\n" to create initial chunks, which can then be further split by a specific length.
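
A minimal sketch of that cascade in plain Java. It does not use the plugin's classes: CascadedChunkingSketch and its methods are hypothetical names, the delimiter is dropped rather than kept (a simplification), and in a real deployment the two steps would be two chunking processors chained in an ingest pipeline rather than one method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the plugin's DelimiterChunker or FixedCharLengthChunker.
final class CascadedChunkingSketch {

    // Step 1: split on a literal delimiter, skipping empty pieces.
    static List<String> splitByDelimiter(String content, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = content.indexOf(delimiter, start)) >= 0) {
            if (idx > start) {
                pieces.add(content.substring(start, idx));
            }
            start = idx + delimiter.length();
        }
        if (start < content.length()) {
            pieces.add(content.substring(start));
        }
        return pieces;
    }

    // Step 2: cut one piece into consecutive windows of at most charLimit chars.
    static List<String> splitByLength(String piece, int charLimit) {
        if (charLimit <= 0) {
            throw new IllegalArgumentException("charLimit must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < piece.length(); start += charLimit) {
            int end = Math.min(start + charLimit, piece.length());
            chunks.add(piece.substring(start, end));
        }
        return chunks;
    }

    // Cascade: delimiter split first, then length split on each piece.
    static List<String> cascade(String content, String delimiter, int charLimit) {
        List<String> result = new ArrayList<>();
        for (String piece : splitByDelimiter(content, delimiter)) {
            result.addAll(splitByLength(piece, charLimit));
        }
        return result;
    }
}
```

For example, cascade("abcdef\n\nxy", "\n\n", 4) first yields the pieces "abcdef" and "xy", and the length step then cuts the first piece into "abcd" and "ef".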

The key features and changes include:

  • Added FixedCharLengthChunker.java, which extends the Chunker abstract class, along with unit tests that verify its behavior.
  • This chunker splits content into passages of a specified maximum character length (char_limit).
  • It supports an overlap_rate parameter that defines the percentage of characters from the previous chunk to include in the current chunk.
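
The behavior described in the list above can be sketched roughly as follows. This is not the code from FixedCharLengthChunker.java: the class name is hypothetical, overlap_rate is assumed to be a fraction between 0 and 0.5, and lengths are counted in Java chars (UTF-16 code units), which may differ from what the merged implementation does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of character-length chunking with overlap.
final class FixedCharLengthSketch {

    static List<String> chunk(String content, int charLimit, double overlapRate) {
        if (charLimit <= 0) {
            throw new IllegalArgumentException("charLimit must be positive");
        }
        // Assumed bound: keeping the overlap at most half a chunk guarantees progress.
        if (overlapRate < 0.0 || overlapRate > 0.5) {
            throw new IllegalArgumentException("overlapRate must be in [0, 0.5]");
        }
        List<String> chunks = new ArrayList<>();
        if (content == null || content.isEmpty()) {
            return chunks;
        }
        // Number of chars repeated from the end of the previous chunk.
        int overlap = (int) Math.floor(charLimit * overlapRate);
        // Distance between the starts of consecutive chunks; always >= 1 here.
        int step = charLimit - overlap;
        int start = 0;
        while (start < content.length()) {
            int end = Math.min(start + charLimit, content.length());
            chunks.add(content.substring(start, end));
            if (end == content.length()) {
                break; // the last chunk reached the end of the content
            }
            start += step;
        }
        return chunks;
    }
}
```

With charLimit = 4 and overlapRate = 0.5, the input "abcdefghij" produces "abcd", "cdef", "efgh", "ghij": each chunk repeats the last two characters of the previous one.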

Related Issues

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@YeonghyeonKO
Contributor Author

YeonghyeonKO commented May 22, 2025

As discussed with @yuye-aws, this pull request implements just a simple length-based chunking algorithm. Currently, only unit tests have been added. Once the code review is mostly complete, I’ll add integration test examples that demonstrate the usage of the new chunking algorithm and run the BWC tests.

@yuye-aws
Member

Once the code review is mostly complete, I’ll proceed with adding the integration test examples that demonstrate the usage of the chunk_size parameter and also with running BWC tests.

Makes sense. Looking into your PR now

public static final String OVERLAP_RATE_FIELD = "overlap_rate";

// Default values for each non-runtime parameter
private static final int DEFAULT_LENGTH_LIMIT = 500; // Default character limit per chunk
Member

I would suggest setting DEFAULT_LENGTH_LIMIT to 2048, because 1 token is approximately 4 characters and 512 tokens is a common limit for text embedding models.

Contributor Author

Makes sense. Newer or larger embedding models might support longer sequences (e.g., 1024, 2048, 4096, or even more tokens), but 512 remains a well-known baseline, as you said. Estimating 1 token as 4 characters is practical as well. I have changed the value from 500 to 2048.
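
The arithmetic behind the new default can be written out as a small sketch. The class and constant names are hypothetical, and the 4-characters-per-token ratio is the rule of thumb from this thread; it varies by tokenizer and language.

```java
// Hypothetical helper; only the 4 chars/token heuristic and the 512-token
// baseline come from the review discussion.
final class CharLimitEstimate {

    // Rule-of-thumb ratio, not a property of any specific tokenizer.
    static final int APPROX_CHARS_PER_TOKEN = 4;

    // Estimates a character limit that roughly matches a model's token limit.
    static int charLimitForTokenLimit(int tokenLimit) {
        return Math.multiplyExact(tokenLimit, APPROX_CHARS_PER_TOKEN);
    }
}
```

For a 512-token model this gives 512 × 4 = 2048 characters, the value suggested for DEFAULT_LENGTH_LIMIT.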

@YeonghyeonKO YeonghyeonKO changed the base branch from 3.0 to main May 24, 2025 00:46
@YeonghyeonKO YeonghyeonKO changed the title from "Feat: Add FixedStringLengthChunker for length-based chunking" to "Feat: Add CharacterLengthChunker for length-based chunking" May 24, 2025
@yuye-aws
Member

@vibrantvarun Do you think this PR needs a BWC test?

@YeonghyeonKO YeonghyeonKO changed the title from "Feat: Add CharacterLengthChunker for length-based chunking" to "Feat: Add FixedCharLengthChunker for character length-based chunking" May 26, 2025
@YeonghyeonKO YeonghyeonKO force-pushed the feat/fixed-string-length-chunker branch 2 times, most recently from 8d76d84 to 4bf94d7 Compare May 27, 2025 07:50
@@ -304,7 +305,12 @@ private void recordChunkingExecutionStats(String algorithmName) {
EventStatsManager.increment(EventStatName.TEXT_CHUNKING_PROCESSOR_EXECUTIONS);
switch (algorithmName) {
case DelimiterChunker.ALGORITHM_NAME -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_DELIMITER_EXECUTIONS);
case FixedTokenLengthChunker.ALGORITHM_NAME -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_FIXED_LENGTH_EXECUTIONS);
Contributor Author

Hi @q-andy, since I added a new chunking algorithm called fixed_char_length, EventStatName.TEXT_CHUNKING_FIXED_LENGTH_EXECUTIONS will be split into two enum values.

Member

@q-andy Can you also review this PR and take a look at what is needed for stats?

Contributor

Sure. Since we added the text chunking algorithm stats in 3.1, it's okay to split them. After 3.1, we should avoid changing stat names, since technically that is a breaking change for the API response.
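
A self-contained sketch of what the split could look like. The enum constants and the counter class are hypothetical stand-ins for EventStatName and EventStatsManager, and the names in the merged code may differ; only the algorithm names delimiter, fixed_token_length, and fixed_char_length are taken from the plugin and this thread.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for the plugin's stats classes.
final class ChunkingStatsSketch {

    // Hypothetical stat names: one counter per fixed-length algorithm
    // instead of a single shared FIXED_LENGTH counter.
    enum StatName {
        PROCESSOR_EXECUTIONS,
        DELIMITER_EXECUTIONS,
        FIXED_TOKEN_LENGTH_EXECUTIONS,
        FIXED_CHAR_LENGTH_EXECUTIONS
    }

    private final Map<StatName, Long> counters = new EnumMap<>(StatName.class);

    void increment(StatName name) {
        counters.merge(name, 1L, Long::sum);
    }

    long get(StatName name) {
        return counters.getOrDefault(name, 0L);
    }

    // Mirrors the shape of recordChunkingExecutionStats in the diff above.
    void recordChunkingExecution(String algorithmName) {
        increment(StatName.PROCESSOR_EXECUTIONS);
        switch (algorithmName) {
            case "delimiter" -> increment(StatName.DELIMITER_EXECUTIONS);
            case "fixed_token_length" -> increment(StatName.FIXED_TOKEN_LENGTH_EXECUTIONS);
            case "fixed_char_length" -> increment(StatName.FIXED_CHAR_LENGTH_EXECUTIONS);
            default -> { } // unknown algorithms only count toward the processor total
        }
    }
}
```

Keeping a separate counter per algorithm means the stats response can distinguish token-based from character-based executions, which is why the rename counts as an API-visible change.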

@yuye-aws
Member

@YeonghyeonKO Do you think you can get this PR finished before the 3.1.0 code freeze date (June 10th)? If so, I'll add the 3.1.0 label to this PR.

@yuye-aws
Member

yuye-aws commented Jun 3, 2025

@YeonghyeonKO I can review now. Can you fix the conflicts?

@YeonghyeonKO
Contributor Author

@yuye-aws I fixed the conflict.

@yuye-aws
Member

yuye-aws commented Jun 3, 2025

@heemin32 @martin-gaievski Hi, can you help review this PR? It is targeted towards 3.1.0 (June 10th code freeze date)

@YeonghyeonKO YeonghyeonKO requested a review from yuye-aws June 3, 2025 12:21
@vibrantvarun
Member

Tests and the gradle check are failing. Could you fix them?
cc: @YeonghyeonKO @yuye-aws

@yuye-aws
Member

yuye-aws commented Jun 5, 2025

Tests and the gradle check are failing. Could you fix them? cc: @YeonghyeonKO @yuye-aws

Running gradle checks now. Can you verify the flaky tests later?

@yuye-aws
Member

yuye-aws commented Jun 5, 2025

@vibrantvarun Can you help verify this: https://github.com/opensearch-project/neural-search/actions/runs/15457223483/job/43511630203?pr=1342?

@heemin32
Collaborator

heemin32 commented Jun 5, 2025

@YeonghyeonKO Thanks for the contribution! Before we can merge this PR, it needs to go through a security review and have an accompanying documentation PR prepared. Given the tight timeline, it may not be feasible to include this in the 3.1 release. @yuye-aws Could you help with security review and documentation here?

@YeonghyeonKO
Contributor Author

YeonghyeonKO commented Jun 6, 2025

@heemin32 Thank you for reviewing the PR! I think the related documentation is _ingest-pipelines/processors/text-chunking.md. Is it okay to create a new PR in the documentation-website repository?

@yuye-aws
Member

yuye-aws commented Jun 6, 2025

@YeonghyeonKO Thanks for the contribution! Before we can merge this PR, it needs to go through a security review and have an accompanying documentation PR prepared. Given the tight timeline, it may not be feasible to include this in the 3.1 release. @yuye-aws Could you help with security review and documentation here?

This PR is not introducing an API, and instead, it is providing an algorithm. Do you think there is any security risk?

@yuye-aws
Member

yuye-aws commented Jun 6, 2025

@heemin32 Thank you for reviewing the PR! I think the related documentation is _ingest-pipelines/processors/text-chunking.md. Is it okay to create a new PR in the documentation-website repository?

Yes. You can create a documentation PR and ping me to review it.

@heemin32
Collaborator

heemin32 commented Jun 7, 2025

@YeonghyeonKO Thanks for the contribution! Before we can merge this PR, it needs to go through a security review and have an accompanying documentation PR prepared. Given the tight timeline, it may not be feasible to include this in the 3.1 release. @yuye-aws Could you help with security review and documentation here?

This PR is not introducing an API, and instead, it is providing an algorithm. Do you think there is any security risk?

I don't see any security risk. However, because it includes a new parameter, it might be better to consult with a security engineer.

@yuye-aws
Member

yuye-aws commented Jun 8, 2025

I don't see any security risk. However, because it includes a new parameter, it might be better to consult with a security engineer.

Thanks for reviewing the code. I'll go through the code once again before approving.

@yuye-aws
Member

yuye-aws commented Jun 9, 2025

@YeonghyeonKO This PR also looks good to me. We can merge it before the code freeze date (June 10th). If a security concern comes up, we can revert it in the AOS 3.1.0 branch. cc: @heemin32

@yuye-aws yuye-aws merged commit ce023f5 into opensearch-project:main Jun 10, 2025
241 of 267 checks passed

codecov bot commented Jun 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (097820c) to head (481a28f).
Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #1342       +/-   ##
============================================
- Coverage     82.83%       0   -82.84%     
============================================
  Files           149       0      -149     
  Lines          7573       0     -7573     
  Branches       1218       0     -1218     
============================================
- Hits           6273       0     -6273     
+ Misses          835       0      -835     
+ Partials        465       0      -465     
