-
Notifications
You must be signed in to change notification settings - Fork 94
Feat: Add FixedCharLengthChunker
for character length-based chunking
#1342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Add FixedCharLengthChunker
for character length-based chunking
#1342
Conversation
As discussed with @yuye-aws, this pull request implements just simple length-based chunking algorithm. Currently, only unit tests have been added. Once the code review is mostly complete, I’ll proceed with adding the integration test examples that demonstrate the usage of the new chunking algorithm and also with running BWC tests. |
src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java
Outdated
Show resolved
Hide resolved
Makes sense. Looking into your PR now |
public static final String OVERLAP_RATE_FIELD = "overlap_rate"; | ||
|
||
// Default values for each non-runtime parameter | ||
private static final int DEFAULT_LENGTH_LIMIT = 500; // Default character limit per chunk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest DEFAULT_LENGTH_LIMIT set to be 2048, because 1 token is approximately 4 chars and 512 token is a common limit for text embedding models
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. Newer or larger embedding models might support longer sequences (e.g., 1024, 2048, 4096, or even more tokens), but 512 remains a well-known baseline as you said. The estimation for the length of 1 token as 4 chars is practical as well. The value 500
has changed to 2048
.
src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedStringLengthChunker.java
Outdated
Show resolved
Hide resolved
FixedStringLengthChunker
for length-based chunkingCharacterLengthChunker
for length-based chunking
src/main/java/org/opensearch/neuralsearch/processor/chunker/CharacterLengthChunker.java
Outdated
Show resolved
Hide resolved
@vibrantvarun Do you think this PR needs BWC test? |
CharacterLengthChunker
for length-based chunkingFixedCharLengthChunker
for character length-based chunking
8d76d84
to
4bf94d7
Compare
@@ -304,7 +305,12 @@ private void recordChunkingExecutionStats(String algorithmName) { | |||
EventStatsManager.increment(EventStatName.TEXT_CHUNKING_PROCESSOR_EXECUTIONS); | |||
switch (algorithmName) { | |||
case DelimiterChunker.ALGORITHM_NAME -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_DELIMITER_EXECUTIONS); | |||
case FixedTokenLengthChunker.ALGORITHM_NAME -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_FIXED_LENGTH_EXECUTIONS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @q-andy ,
as I added the new chunking algorithm called fixed_char_length
, EventStatName.TEXT_CHUNKING_FIXED_LENGTH_EXECUTIONS
will be divided into two enums
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@q-andy Can you also review this PR to take a look what is needed for stats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, since we added the text chunking algorithm stats in 3.1 it's okay to split. In the future, post-3.1 we should avoid changing statnames since technically it's a breaking change for the API response.
src/test/resources/processor/chunker/PipelineForFixedCharLengthChunker.json
Show resolved
Hide resolved
@YeonghyeonKO Do you think you can get this PR finished before the code freeze date of 3.1.0 (June 10th)? If so, I'll label 3.1.0 to this PR. |
@YeonghyeonKO I can review now. Can you fix the conflicts? |
Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
@yuye-aws I fixed the conflict. |
src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedCharLengthChunker.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedCharLengthChunker.java
Outdated
Show resolved
Hide resolved
src/test/java/org/opensearch/neuralsearch/processor/TextChunkingProcessorIT.java
Show resolved
Hide resolved
@heemin32 @martin-gaievski Hi, can you help review this PR? It is targeted towards 3.1.0 (June 10th code freeze date) |
…inal chunking Signed-off-by: yeonghyeonKo <[email protected]>
Tests, gradle check are failing. Could you guys fix it? |
Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
Running gradle checks now. Can you later verify flakey tests? |
… TextChunkingProcessorIT Signed-off-by: yeonghyeonKo <[email protected]>
Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
Signed-off-by: yeonghyeonKo <[email protected]>
@YeonghyeonKO Thanks for the contribution! Before we can merge this PR, it needs to go through a security review and have an accompanying documentation PR prepared. Given the tight timeline, it may not be feasible to include this in the 3.1 release. @yuye-aws Could you help with security review and documentation here? |
@heemin32 Thank you for reviewing the PR! I think the related documentation is _ingest-pipelines/processors/text-chunking.md. Is it okay to create new PR in documentation-website repository? |
Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
This PR is not introducing an API, and instead, it is providing an algorithm. Do you think there is any security risk? |
Yes. You can create a documentation PR and ping me to review it |
I don't see any security risk. However because it include new parameter, it might be bette to consult with security engineer. |
Thanks for reviewing the code. I'll go through the code once again before approving. |
@YeonghyeonKO This PR also looks good to me. We can merge it before the Code freeze date (June 10th). If there is some security concern, we can revert it in the AOS 3.1.0 branch. cc: @heemin32 |
Signed-off-by: Marcel Yeonghyeon Ko <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1342 +/- ##
============================================
- Coverage 82.83% 0 -82.84%
============================================
Files 149 0 -149
Lines 7573 0 -7573
Branches 1218 0 -1218
============================================
- Hits 6273 0 -6273
+ Misses 835 0 -835
+ Partials 465 0 -465 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Description
This pull request introduces a new chunking algorithm,
FixedCharLengthChunker
. It provides functionality to split text content based on a fixed number of characters, rather than tokens or delimiters. Using the number of characters to chunk your content is especially beneficial for languages that do not use spaces to separate words, such as Chinese, Japanese, Thai, Khmer. The new chunker offers synergy when cascaded with theDelimiter
chunker, as multiple processors can be chained within an ingest pipeline.The key features and changes include:
FixedCharLengthChunker.java
which extends theChunker
abstract class and its test codes to verify the proper operation.char_limit
)overlap_rate
parameter to define the percentage of characters from the previous chunk to include in the current chunk.Related Issues
recursive_character
algorithm for Text chunking processor #1312 (comment)chunk_size
parameter to DelimiterChunker #1330 (comment)Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.