Commit ce023f5

YeonghyeonKO, will-hwang, martin-gaievski, vibrantvarun, and junqiu-lei authored
Feat: Add FixedCharLengthChunker for character length-based chunking (#1342)
* Implement Optimized embedding generation in text embedding processor (#1238) * implement single document update scenario for text embedding processor (#1191) Signed-off-by: Will Hwang <[email protected]> * implement batch document update scenario for text embedding processor (#1217) Signed-off-by: Will Hwang <[email protected]> --------- Signed-off-by: Will Hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Going from alpha1 to beta1 for 3.0 release (#1245) Signed-off-by: yeonghyeonKo <[email protected]> * Implement Optimized embedding generation in sparse encoding processor (#1246) Signed-off-by: Will Hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Implement Optimized embedding generation in text and image embedding processor (#1249) Signed-off-by: will-hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Inner hits support with hybrid query (#1253) * Inner Hits in Hybrid query Signed-off-by: Varun Jain <[email protected]> * Inner hits support with hybrid query Signed-off-by: Varun Jain <[email protected]> * Add changelog Signed-off-by: Varun Jain <[email protected]> * fix integ tests Signed-off-by: Varun Jain <[email protected]> * Modify comment Signed-off-by: Varun Jain <[email protected]> * Explain test case Signed-off-by: Varun Jain <[email protected]> * Optimize inner hits count calculation method Signed-off-by: Varun Jain <[email protected]> --------- Signed-off-by: Varun Jain <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Support custom tags in semantic highlighter (#1254) Signed-off-by: yeonghyeonKo <[email protected]> * Add neural stats API (#1256) * Add neural stats API Signed-off-by: Andy Qin <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Added release notes for 3.0 beta1 (#1252) * Added release notes for 3.0 beta1 Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update semantic 
highlighter test model (#1259) Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fix the edge case when the value of a fieldMap key in ingestDocument is empty string (#1257) Signed-off-by: Chloe Gao <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Hybrid query should call rewrite before creating weight (#1268) * Hybrid query should call rewrite before creating weight Signed-off-by: Harsha Vamsi Kalluri <[email protected]> * Awaits fix Signed-off-by: Harsha Vamsi Kalluri <[email protected]> * Rewrite with searcher Signed-off-by: Harsha Vamsi Kalluri <[email protected]> * Feature flag issue Signed-off-by: Harsha Vamsi Kalluri <[email protected]> --------- Signed-off-by: Harsha Vamsi Kalluri <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Support phasing off SecurityManager usage in favor of Java Agent (#1265) Signed-off-by: Gulshan <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272) Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic field mapper. 
(#1225) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Increment version to 3.0.0-SNAPSHOT (#1286) Signed-off-by: opensearch-ci-bot <[email protected]> Co-authored-by: opensearch-ci-bot <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Remove beta1 qualifier (#1292) Signed-off-by: Peter Zhu <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fix for merging scoreDocs when totalHits are greater than 1 and fieldDocs are 0 (#1295) (#1296) (cherry picked from commit 6f3aabb) Co-authored-by: Varun Jain <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * add release notes for 3.0 (#1287) Signed-off-by: will-hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Allow maven to publish to all versions (#1300) (#1301) Signed-off-by: Peter Zhu <[email protected]> (cherry picked from commit c5625db) Co-authored-by: Peter Zhu <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [FEAT] introduce new FixedStringLengthChunker Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] initial test cases for FixedStringLengthChunker Signed-off-by: yeonghyeonKo <[email protected]> * [FIX] gradlew spotlessApply Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] remove unnecessary comments Signed-off-by: yeonghyeonKo <[email protected]> * [Performance Improvement] Add custom bulk scorer for hybrid query (2-3x faster) (#1289) Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add TextChunkingProcessor stats (#1308) * Add TextChunkingProcessor stats Signed-off-by: Andy Qin <[email protected]> # Conflicts: # CHANGELOG.md * Update unit and integ tests Signed-off-by: Andy Qin <[email protected]> --------- Signed-off-by: Andy Qin <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update Lucene dependencies (#1336) * Update Lucene dependencies Signed-off-by: Ryan Bogan <[email 
protected]> * Add changelog entry Signed-off-by: Ryan Bogan <[email protected]> * Update model request body for bwc and integ tests Signed-off-by: Ryan Bogan <[email protected]> --------- Signed-off-by: Ryan Bogan <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] modify algorithm name and related parts Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] update test codes along with the change in CharacterLengthChunker Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] remove defensive check to prevent adding redundant code lines Signed-off-by: yeonghyeonKo <[email protected]> * Update CharacterLengthChunker to FixedCharLengthChunker Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update ChunkerFactory Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update CharacterLengthChunkerTests to FixedCharLengthChunkerTests Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [FIX] handle a corner case where the content is shorter than charLimit Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] Add integration test codes for fixed_char_length chunking algorithm Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] integration test code for cascaded pipeline Signed-off-by: yeonghyeonKo <[email protected]> * Support analyzer-based neural sparse query (#1088) * merge main; add analyzer impl Signed-off-by: zhichao-aws <[email protected]> * two phase adaption Signed-off-by: zhichao-aws <[email protected]> * two phase adaption Signed-off-by: zhichao-aws <[email protected]> * remove analysis Signed-off-by: zhichao-aws <[email protected]> * lint Signed-off-by: zhichao-aws <[email protected]> * update Signed-off-by: zhichao-aws <[email protected]> * address comments Signed-off-by: zhichao-aws <[email protected]> * tests Signed-off-by: zhichao-aws <[email 
protected]> * modify plugin security policy Signed-off-by: zhichao-aws <[email protected]> * change log Signed-off-by: zhichao-aws <[email protected]> * address comments Signed-off-by: zhichao-aws <[email protected]> * modify to package-private Signed-off-by: zhichao-aws <[email protected]> --------- Signed-off-by: zhichao-aws <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fixed score value as null for single shard for sorting (#1277) * Fixed score value as null for single shard for sorting Signed-off-by: Owais <[email protected]> * Addressed comment Signed-off-by: Owais <[email protected]> * Addressed more comments Signed-off-by: Owais <[email protected]> * Added UT Signed-off-by: Owais <[email protected]> --------- Signed-off-by: Owais <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add IT for neural sparse query + bert-uncased mbert-uncased analyzer (#1279) * add it Signed-off-by: zhichao-aws <[email protected]> * change log Signed-off-by: zhichao-aws <[email protected]> --------- Signed-off-by: zhichao-aws <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add WithFieldName implementation to QueryBuilders (#1285) Signed-off-by: Owais <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [AUTO] Increment version to 3.1.0-SNAPSHOT (#1288) * Increment version to 3.1.0-SNAPSHOT Signed-off-by: opensearch-ci-bot <[email protected]> * Update build.gradle Signed-off-by: Peter Zhu <[email protected]> --------- Signed-off-by: opensearch-ci-bot <[email protected]> Signed-off-by: Peter Zhu <[email protected]> Co-authored-by: opensearch-ci-bot <[email protected]> Co-authored-by: Peter Zhu <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * add release notes for 3.0 (#1298) Signed-off-by: will-hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Return bad request for invalid stat parameters in stats API (#1291) Signed-off-by: Andy Qin <[email protected]> 
Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic mapping transformer. (#1276) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic ingest processor. (#1309) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [Performance Improvement] Add custom bulk scorer for hybrid query (2-3x faster) (#1289) Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Implement the query logic for the semantic field. (#1315) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Support custom weights params in RRF (#1322) * Support Weights params in RRF Signed-off-by: Varun Jain <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * add validation for invalid nested hybrid query (#1305) * add validation for nested hybrid query Signed-off-by: will-hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add stats tracking for semantic highlighting (#1327) * Add stats tracking for semantic highlighting Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update Lucene dependencies (#1336) * Update Lucene dependencies Signed-off-by: Ryan Bogan <[email protected]> * Add changelog entry Signed-off-by: Ryan Bogan <[email protected]> * Update model request body for bwc and integ tests Signed-off-by: Ryan Bogan <[email protected]> --------- Signed-off-by: Ryan Bogan <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Enhance semantic field to allow to enable/disable chunking. (#1337) * Implement the query logic for the semantic field. Signed-off-by: Bo Zhang <[email protected]> * Enhance semantic field to allow to enable/disable chunking. 
Signed-off-by: Bo Zhang <[email protected]> --------- Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] modify algorithm name and related parts Signed-off-by: yeonghyeonKo <[email protected]> * Update CHANGELOG.md Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [FEAT] Add fixed_char_length chunking algorithm to STAT manager Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] Add integration test codes for fixed_char_length chunking algorithm Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] integration test code for cascaded pipeline Signed-off-by: yeonghyeonKo <[email protected]> * Going from alpha1 to beta1 for 3.0 release (#1245) Signed-off-by: yeonghyeonKo <[email protected]> * Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272) Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic field mapper. (#1225) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic mapping transformer. (#1276) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272) Signed-off-by: Junqiu Lei <[email protected]> * Add semantic field mapper. (#1225) Signed-off-by: Bo Zhang <[email protected]> * Add semantic mapping transformer. 
(#1276) Signed-off-by: Bo Zhang <[email protected]> * [FIX] minor typo Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] adopt FixedTokenLengthChunker's loop strategy for robust final chunking Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] sum the number of processors and their executions correctly in TextChunkingProcessorIT Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] gradlew spotlessApply Signed-off-by: yeonghyeonKo <[email protected]> --------- Signed-off-by: Will Hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> Signed-off-by: will-hwang <[email protected]> Signed-off-by: Varun Jain <[email protected]> Signed-off-by: Andy Qin <[email protected]> Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: Chloe Gao <[email protected]> Signed-off-by: Harsha Vamsi Kalluri <[email protected]> Signed-off-by: Gulshan <[email protected]> Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: opensearch-ci-bot <[email protected]> Signed-off-by: Peter Zhu <[email protected]> Signed-off-by: Ryan Bogan <[email protected]> Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: zhichao-aws <[email protected]> Signed-off-by: Owais <[email protected]> Co-authored-by: Will Hwang <[email protected]> Co-authored-by: Martin Gaievski <[email protected]> Co-authored-by: Varun Jain <[email protected]> Co-authored-by: Junqiu Lei <[email protected]> Co-authored-by: Andy <[email protected]> Co-authored-by: Chloe Gao <[email protected]> Co-authored-by: Harsha Vamsi Kalluri <[email protected]> Co-authored-by: Gulshan <[email protected]> Co-authored-by: Bo Zhang <[email protected]> Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com> Co-authored-by: opensearch-ci-bot <[email protected]> Co-authored-by: Peter Zhu <[email protected]> Co-authored-by: Ryan Bogan <[email protected]> Co-authored-by: 
zhichao-aws <[email protected]> Co-authored-by: Owais Kazi <[email protected]>
1 parent 097820c commit ce023f5

15 files changed, +544 -23 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - [Semantic Field] Implement the query logic for the semantic field. ([#1315](https://github.com/opensearch-project/neural-search/pull/1315))
 - [Semantic Field] Enhance semantic field to allow to enable/disable chunking. ([#1337](https://github.com/opensearch-project/neural-search/pull/1337))
 - [Semantic Field] Implement the search analyzer support for semantic field at query time. ([#1341](https://github.com/opensearch-project/neural-search/pull/1341))
+- Add `FixedCharLengthChunker` for character length-based chunking ([#1342](https://github.com/opensearch-project/neural-search/pull/1342))
 - [Semantic Field] Implement the search analyzer support for semantic field at semantic field index creation time. ([#1367](https://github.com/opensearch-project/neural-search/pull/1367))
 
 ### Enhancements
@@ -32,6 +33,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - Filter requested stats based on minimum cluster version to fix BWC tests for stats API ([#1373](https://github.com/opensearch-project/neural-search/pull/1373))
 
 ### Infrastructure
+- [3.0] Update neural-search for OpenSearch 3.0 beta compatibility ([#1245](https://github.com/opensearch-project/neural-search/pull/1245))
 
 ### Documentation

qa/rolling-upgrade/src/test/java/org/opensearch/neuralsearch/bwc/rolling/RestNeuralStatsActionIT.java

Lines changed: 1 addition & 0 deletions
@@ -120,3 +120,4 @@ public void testStats_E2EFlow() throws Exception {
         }
     }
 }
+

src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java

Lines changed: 4 additions & 1 deletion
@@ -25,6 +25,7 @@
 import org.opensearch.index.mapper.IndexFieldMapper;
 import org.opensearch.neuralsearch.processor.chunker.ChunkerFactory;
 import org.opensearch.neuralsearch.processor.chunker.DelimiterChunker;
+import org.opensearch.neuralsearch.processor.chunker.FixedCharLengthChunker;
 import org.opensearch.neuralsearch.processor.chunker.FixedTokenLengthChunker;
 import org.opensearch.neuralsearch.stats.events.EventStatName;
 import org.opensearch.neuralsearch.stats.events.EventStatsManager;
@@ -58,7 +59,9 @@ public final class TextChunkingProcessor extends AbstractProcessor {
         DelimiterChunker.ALGORITHM_NAME,
         () -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_DELIMITER_EXECUTIONS),
         FixedTokenLengthChunker.ALGORITHM_NAME,
-        () -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_FIXED_LENGTH_EXECUTIONS)
+        () -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_FIXED_TOKEN_LENGTH_EXECUTIONS),
+        FixedCharLengthChunker.ALGORITHM_NAME,
+        () -> EventStatsManager.increment(EventStatName.TEXT_CHUNKING_FIXED_CHAR_LENGTH_EXECUTIONS)
     );
 
     private int maxChunkLimit;

src/main/java/org/opensearch/neuralsearch/processor/chunker/ChunkerFactory.java

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,9 @@ private ChunkerFactory() {} // no instance of this factory class
         FixedTokenLengthChunker.ALGORITHM_NAME,
         FixedTokenLengthChunker::new,
         DelimiterChunker.ALGORITHM_NAME,
-        DelimiterChunker::new
+        DelimiterChunker::new,
+        FixedCharLengthChunker.ALGORITHM_NAME,
+        FixedCharLengthChunker::new
     );
 
     /** Set of supported chunker algorithm types */
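
The hunk above registers the new chunker by adding one name-to-constructor pair to the factory map. A minimal, self-contained sketch of that registry pattern follows. `ChunkerRegistrySketch`, `SimpleChunker`, `TokenChunker`, `CharChunker`, and `create` are hypothetical names invented for illustration, and the `"fixed_token_length"` key is an assumption; only `"fixed_char_length"` is taken from this commit. It is not the plugin's actual `ChunkerFactory`.

```java
import java.util.Map;
import java.util.function.Function;

public class ChunkerRegistrySketch {

    /** Hypothetical stand-in for the plugin's chunker abstraction. */
    public interface SimpleChunker {
        String algorithmName();
    }

    /** Hypothetical token-based chunker; the parameters are ignored in this sketch. */
    public static final class TokenChunker implements SimpleChunker {
        public TokenChunker(Map<String, Object> parameters) {}

        @Override
        public String algorithmName() {
            return "fixed_token_length"; // assumed name, not shown in this commit
        }
    }

    /** Hypothetical character-based chunker; the parameters are ignored in this sketch. */
    public static final class CharChunker implements SimpleChunker {
        public CharChunker(Map<String, Object> parameters) {}

        @Override
        public String algorithmName() {
            return "fixed_char_length"; // matches ALGORITHM_NAME in this commit
        }
    }

    // Each algorithm name maps to a constructor reference taking the parameter map.
    private static final Map<String, Function<Map<String, Object>, SimpleChunker>> REGISTRY = Map.of(
        "fixed_token_length", TokenChunker::new,
        "fixed_char_length", CharChunker::new
    );

    /** Looks up the constructor for the given type and instantiates the chunker. */
    public static SimpleChunker create(String type, Map<String, Object> parameters) {
        // Immutable maps reject null keys, so guard before the lookup.
        Function<Map<String, Object>, SimpleChunker> constructor = type == null ? null : REGISTRY.get(type);
        if (constructor == null) {
            throw new IllegalArgumentException("Unsupported chunker type: " + type);
        }
        return constructor.apply(parameters);
    }
}
```

With this shape, supporting another algorithm only requires one more map entry, which mirrors the three added lines in the hunk.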

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedCharLengthChunker.java

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
+/*
+ * Copyright OpenSearch Contributors
+ * SPDX-License-Identifier: Apache-2.0
+ */
+package org.opensearch.neuralsearch.processor.chunker;
+
+import java.util.Locale;
+import java.util.Map;
+import java.util.List;
+import java.util.ArrayList;
+
+import static org.opensearch.neuralsearch.processor.chunker.ChunkerParameterParser.parseInteger;
+import static org.opensearch.neuralsearch.processor.chunker.ChunkerParameterParser.parseDoubleWithDefault;
+import static org.opensearch.neuralsearch.processor.chunker.ChunkerParameterParser.parsePositiveIntegerWithDefault;
+
+/**
+ * The implementation {@link Chunker} for fixed character length algorithm.
+ */
+public final class FixedCharLengthChunker implements Chunker {
+
+    /** The identifier for the fixed character length chunking algorithm. */
+    public static final String ALGORITHM_NAME = "fixed_char_length";
+
+    /** Field name for specifying the maximum number of characters per chunk. */
+    public static final String CHAR_LIMIT_FIELD = "char_limit";
+
+    /** Field name for specifying the overlap rate between consecutive chunks based on fixed character length. */
+    public static final String OVERLAP_RATE_FIELD = "overlap_rate";
+
+    // Default values for each non-runtime parameter
+    private static final int DEFAULT_CHAR_LIMIT = 2048; // Default character limit per chunk (512 tokens * 4 chars)
+    private static final double DEFAULT_OVERLAP_RATE = 0.0;
+
+    // Parameter restrictions
+    private static final double OVERLAP_RATE_LOWER_BOUND = 0.0;
+    private static final double OVERLAP_RATE_UPPER_BOUND = 0.5; // Max 50% overlap
+
+    // Parameter values
+    private int charLimit;
+    private double overlapRate;
+
+    /**
+     * Constructor that initializes the fixed character length chunker with the specified parameters.
+     * @param parameters a map with non-runtime parameters to be parsed
+     */
+    public FixedCharLengthChunker(final Map<String, Object> parameters) {
+        parseParameters(parameters);
+    }
+
+    /**
+     * Parse the parameters for fixed character length algorithm.
+     * Throw IllegalArgumentException when parameters are invalid.
+     *
+     * @param parameters a map with non-runtime parameters as the following:
+     * 1. char_limit: the character limit for each chunked passage
+     * 2. overlap_rate: the overlapping degree for each chunked passage, indicating how many characters come from the previous passage
+     * Here are requirements for non-runtime parameters:
+     * 1. char_limit must be a positive integer
+     * 2. overlap_rate must be within range [0, 0.5]
+     */
+    @Override
+    public void parseParameters(Map<String, Object> parameters) {
+        this.charLimit = parsePositiveIntegerWithDefault(parameters, CHAR_LIMIT_FIELD, DEFAULT_CHAR_LIMIT);
+        this.overlapRate = parseDoubleWithDefault(parameters, OVERLAP_RATE_FIELD, DEFAULT_OVERLAP_RATE);
+
+        if (overlapRate < OVERLAP_RATE_LOWER_BOUND || overlapRate > OVERLAP_RATE_UPPER_BOUND) {
+            throw new IllegalArgumentException(
+                String.format(
+                    Locale.ROOT,
+                    "Parameter [%s] must be between %s and %s, but was %s",
+                    OVERLAP_RATE_FIELD,
+                    OVERLAP_RATE_LOWER_BOUND,
+                    OVERLAP_RATE_UPPER_BOUND,
+                    overlapRate
+                )
+            );
+        }
+    }
+
+    /**
+     * Return the chunked passages for fixed character length algorithm.
+     * Throw IllegalArgumentException when runtime parameters are invalid.
+     *
+     * @param content input string
+     * @param runtimeParameters a map for runtime parameters, containing the following runtime parameters:
+     * 1. max_chunk_limit: field level max chunk limit
+     * 2. chunk_string_count: number of non-empty strings (including itself) which need to be chunked later
+     */
+    @Override
+    public List<String> chunk(final String content, final Map<String, Object> runtimeParameters) {
+        int runtimeMaxChunkLimit = parseInteger(runtimeParameters, MAX_CHUNK_LIMIT_FIELD);
+        int chunkStringCount = parseInteger(runtimeParameters, CHUNK_STRING_COUNT_FIELD);
+
+        List<String> chunkResult = new ArrayList<>();
+
+        int startCharIndex = 0;
+        int overlapCharNumber = (int) Math.floor(this.charLimit * this.overlapRate);
+        // Ensure `chunkInterval` is positive. charLimit is positive. overlapRate is [0, 0.5].
+        // So, (charLimit - overlapCharNumber) >= 0.5 * charLimit, which is always > 0 if charLimit >= 1.
+        int chunkInterval = this.charLimit - overlapCharNumber;
+
+        while (startCharIndex < content.length()) {
+            if (Chunker.checkRunTimeMaxChunkLimit(chunkResult.size(), runtimeMaxChunkLimit, chunkStringCount)) {
+                chunkResult.add(content.substring(startCharIndex));
+                break;
+            }
+
+            int endPosition;
+            // Check if the current chunk will extend to or past the end of the content
+            if (startCharIndex + this.charLimit >= content.length()) {
+                endPosition = content.length(); // Ensure chunk goes to the very end
+                chunkResult.add(content.substring(startCharIndex, endPosition));
+                break;
+            } else {
+                endPosition = startCharIndex + this.charLimit;
+                chunkResult.add(content.substring(startCharIndex, endPosition));
+            }
+
+            startCharIndex += chunkInterval;
+        }
+
+        return chunkResult;
+    }
+
+    @Override
+    public String getAlgorithmName() {
+        return ALGORITHM_NAME;
+    }
+}
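
To make the window arithmetic in `chunk` concrete, here is a standalone sketch of the same sliding-window loop. It is a simplification under stated assumptions: `FixedCharChunkSketch` and its static `chunk` signature are hypothetical, the parameters are passed directly instead of being parsed from maps, and the runtime `max_chunk_limit` check is omitted. It is not the plugin class itself.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedCharChunkSketch {

    /**
     * Splits content into windows of at most charLimit characters.
     * Consecutive windows share floor(charLimit * overlapRate) characters.
     */
    public static List<String> chunk(String content, int charLimit, double overlapRate) {
        if (charLimit <= 0) {
            throw new IllegalArgumentException("charLimit must be positive");
        }
        if (overlapRate < 0.0 || overlapRate > 0.5) {
            throw new IllegalArgumentException("overlapRate must be between 0.0 and 0.5");
        }

        int overlapChars = (int) Math.floor(charLimit * overlapRate);
        // With overlapRate <= 0.5 the step is at least half of charLimit, so it is always >= 1.
        int step = charLimit - overlapChars;

        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < content.length()) {
            if (start + charLimit >= content.length()) {
                // The final window takes whatever remains and ends the loop.
                chunks.add(content.substring(start));
                break;
            }
            chunks.add(content.substring(start, start + charLimit));
            start += step;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Overlap of 2 characters, step of 2.
        System.out.println(chunk("abcdefghij", 4, 0.5)); // [abcd, cdef, efgh, ghij]
        // No overlap, step of 4; the last chunk is shorter than the limit.
        System.out.println(chunk("abcdefghij", 4, 0.0)); // [abcd, efgh, ij]
    }
}
```

The early `break` when a window reaches the end of the content is what prevents a trailing chunk made up only of already-covered overlap characters.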

src/main/java/org/opensearch/neuralsearch/stats/events/EventStatName.java

Lines changed: 8 additions & 2 deletions
@@ -39,8 +39,8 @@ public enum EventStatName implements StatName {
         EventStatType.TIMESTAMPED_EVENT_COUNTER,
         Version.V_3_1_0
     ),
-    TEXT_CHUNKING_FIXED_LENGTH_EXECUTIONS(
-        "text_chunking_fixed_length_executions",
+    TEXT_CHUNKING_FIXED_TOKEN_LENGTH_EXECUTIONS(
+        "text_chunking_fixed_token_length_executions",
         "processors.ingest",
         EventStatType.TIMESTAMPED_EVENT_COUNTER,
         Version.V_3_1_0
@@ -51,6 +51,12 @@ public enum EventStatName implements StatName {
         EventStatType.TIMESTAMPED_EVENT_COUNTER,
         Version.V_3_1_0
     ),
+    TEXT_CHUNKING_FIXED_CHAR_LENGTH_EXECUTIONS(
+        "text_chunking_fixed_char_length_executions",
+        "processors.ingest",
+        EventStatType.TIMESTAMPED_EVENT_COUNTER,
+        Version.V_3_1_0
+    ),
     SEMANTIC_FIELD_PROCESSOR_EXECUTIONS(
         "semantic_field_executions",
         "processors.ingest",

src/main/java/org/opensearch/neuralsearch/stats/info/InfoStatName.java

Lines changed: 8 additions & 2 deletions
@@ -36,8 +36,14 @@ public enum InfoStatName implements StatName {
         InfoStatType.INFO_COUNTER,
         Version.V_3_1_0
     ),
-    TEXT_CHUNKING_FIXED_LENGTH_PROCESSORS(
-        "text_chunking_fixed_length_processors",
+    TEXT_CHUNKING_FIXED_TOKEN_LENGTH_PROCESSORS(
+        "text_chunking_fixed_token_length_processors",
+        "processors.ingest",
+        InfoStatType.INFO_COUNTER,
+        Version.V_3_1_0
+    ),
+    TEXT_CHUNKING_FIXED_CHAR_LENGTH_PROCESSORS(
+        "text_chunking_fixed_char_length_processors",
         "processors.ingest",
         InfoStatType.INFO_COUNTER,
         Version.V_3_1_0

src/main/java/org/opensearch/neuralsearch/stats/info/InfoStatsManager.java

Lines changed: 5 additions & 2 deletions
@@ -18,6 +18,7 @@
 import org.opensearch.neuralsearch.processor.normalization.MinMaxScoreNormalizationTechnique;
 import org.opensearch.neuralsearch.processor.normalization.ZScoreNormalizationTechnique;
 import org.opensearch.neuralsearch.processor.chunker.DelimiterChunker;
+import org.opensearch.neuralsearch.processor.chunker.FixedCharLengthChunker;
 import org.opensearch.neuralsearch.processor.chunker.FixedTokenLengthChunker;
 import org.opensearch.neuralsearch.settings.NeuralSearchSettingsAccessor;
 import org.opensearch.neuralsearch.stats.common.StatSnapshot;
@@ -48,7 +49,9 @@ public class InfoStatsManager {
         DelimiterChunker.ALGORITHM_NAME,
         stats -> increment(stats, InfoStatName.TEXT_CHUNKING_DELIMITER_PROCESSORS),
         FixedTokenLengthChunker.ALGORITHM_NAME,
-        stats -> increment(stats, InfoStatName.TEXT_CHUNKING_FIXED_LENGTH_PROCESSORS)
+        stats -> increment(stats, InfoStatName.TEXT_CHUNKING_FIXED_TOKEN_LENGTH_PROCESSORS),
+        FixedCharLengthChunker.ALGORITHM_NAME,
+        stats -> increment(stats, InfoStatName.TEXT_CHUNKING_FIXED_CHAR_LENGTH_PROCESSORS)
     );
 
     private static final Map<String, Consumer<Map<InfoStatName, CountableInfoStatSnapshot>>> normTechniqueIncrementers = Map.of(
@@ -216,7 +219,7 @@ private void countTextChunkingProcessorStats(Map<InfoStatName, CountableInfoStat
 
             // If no algorithm is specified, default case is fixed length
             if (chunkingAlgorithmIncrementers.containsKey(algorithmKey) == false) {
-                increment(stats, InfoStatName.TEXT_CHUNKING_FIXED_LENGTH_PROCESSORS);
+                increment(stats, InfoStatName.TEXT_CHUNKING_FIXED_TOKEN_LENGTH_PROCESSORS);
             } else {
                 // Map is guaranteed to contain key in this block, so we can do direct map get
                 chunkingAlgorithmIncrementers.get(algorithmKey).accept(stats);
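
The stats hunks dispatch on the algorithm name through a map of incrementers and fall back to the fixed token length counter when the name is not registered. The sketch below illustrates that dispatch with plain string counters. `ChunkingStatsSketch`, `count`, and the `"delimiter"` and `"fixed_token_length"` keys are assumptions made for illustration, and the `text_chunking_delimiter_processors` string is inferred from the enum constant name rather than shown in this commit. The real code uses `InfoStatName` and `CountableInfoStatSnapshot`.

```java
import java.util.Map;
import java.util.function.Consumer;

public class ChunkingStatsSketch {

    // Counter names follow the stat strings in this commit; the delimiter one is assumed.
    private static final String DELIMITER_STAT = "text_chunking_delimiter_processors";
    private static final String FIXED_TOKEN_STAT = "text_chunking_fixed_token_length_processors";
    private static final String FIXED_CHAR_STAT = "text_chunking_fixed_char_length_processors";

    // One incrementer per known algorithm name.
    private static final Map<String, Consumer<Map<String, Integer>>> INCREMENTERS = Map.of(
        "delimiter", stats -> stats.merge(DELIMITER_STAT, 1, Integer::sum),
        "fixed_token_length", stats -> stats.merge(FIXED_TOKEN_STAT, 1, Integer::sum),
        "fixed_char_length", stats -> stats.merge(FIXED_CHAR_STAT, 1, Integer::sum)
    );

    /** Increments the counter for the given algorithm, defaulting to fixed token length. */
    public static void count(Map<String, Integer> stats, String algorithmKey) {
        // Immutable maps reject null keys, so treat null like any other unregistered name.
        Consumer<Map<String, Integer>> incrementer = algorithmKey == null ? null : INCREMENTERS.get(algorithmKey);
        if (incrementer == null) {
            stats.merge(FIXED_TOKEN_STAT, 1, Integer::sum);
        } else {
            incrementer.accept(stats);
        }
    }
}
```

Because the fallback targets the token-length counter, a processor configured without an explicit algorithm is counted the same way as one that names the token-based algorithm.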
