What is the bug?
When ingesting documents through an index alias, the text_chunking ingest processor with the "fixed_token_length" algorithm and "standard" analyzer fails to tokenize text longer than 10,000 tokens, even when index.analyze.max_token_count is set higher on the underlying index.
How can one reproduce the bug?
- Create an ingest pipeline with the text_chunking processor
- Create an index with index.analyze.max_token_count set to a value above 10,000
- Create an alias for the index
- Post a document with over 10,000 tokens to the alias.
$ curl -XPUT localhost:8080/_ingest/pipeline/my_pipeline --json '{
"processors": [{
"text_chunking": {
"algorithm": {
"fixed_token_length": {
"token_limit": 32,
"max_chunk_limit": -1,
"max_token_limit": 15000
}
},
"field_map": { "text": "text_chunks" }
}
}]
}'
{"acknowledged":true}
$ curl -XPUT localhost:8080/my_index --json '{
"settings": {
"default_pipeline": "my_pipeline",
"index.analyze.max_token_count" : 15000
}
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"my_index"}
$ curl localhost:8080/_aliases --json '{"actions": [{"add": {"index": "my_index", "alias": "my_alias"}}]}'
{"acknowledged":true}
$ curl localhost:8080/my_alias/_doc/1 --json '{"text": "'"$(printf 'token %.0s' {1..10001})"'"}'
{"error":{"root_cause":[{"type":"illegal_state_exception","reason":"analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."}],"type":"illegal_state_exception","reason":"analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.","caused_by":{"type":"illegal_state_exception","reason":"The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."}},"status":500}
Response:
{
"error": {
"root_cause": [
{
"type": "illegal_state_exception",
"reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
}
],
"type": "illegal_state_exception",
"reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.",
"caused_by": {
"type": "illegal_state_exception",
"reason": "The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
}
},
"status": 500
}
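For what it's worth, fetching the settings through either name should show the raised limit on the underlying index (illustrative output, filtered with filter_path and trimmed):
$ curl 'localhost:8080/my_index/_settings?filter_path=*.settings.index.analyze'
{"my_index":{"settings":{"index":{"analyze":{"max_token_count":"15000"}}}}}
$ curl 'localhost:8080/my_alias/_settings?filter_path=*.settings.index.analyze'
{"my_index":{"settings":{"index":{"analyze":{"max_token_count":"15000"}}}}}
Since the error reports the default maximum of [10000], it looks like the processor falls back to the default rather than resolving the alias to its backing index when it reads the setting.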
What is the expected behavior?
Text is split into chunks, respecting the configured index.analyze.max_token_count and the processor's max_token_limit.
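Concretely, with the pipeline above (token_limit of 32, field_map mapping text to text_chunks), I would expect the stored document to look roughly like this (chunks abbreviated; around 300 chunks for ~10,000 tokens):
{
  "text": "token token token ...",
  "text_chunks": [
    "token token token ... (first 32 tokens)",
    "token token token ... (next 32 tokens)",
    "..."
  ]
}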
What is your host/environment?
OpenSearch 2.19 (AWS)
Do you have any additional context?
Originally raised here, before realising this was specific to index aliases: #1321
The workaround is to use the index name explicitly, in which case the token limit configuration works as expected.
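For example, the same request sent to the index name instead of the alias goes through (response trimmed):
$ curl localhost:8080/my_index/_doc/1 --json '{"text": "'"$(printf 'token %.0s' {1..10001})"'"}'
{"_index":"my_index","_id":"1","result":"created", ...}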