[BUG] Text chunking ingest processor fails over 10,000 tokens when using index alias for ingestion #1325

@marcus-bcl

Description

What is the bug?

When ingesting documents through an index alias, the text_chunking ingest processor with the "fixed_token_length" algorithm and "standard" analyzer fails to tokenize text longer than 10,000 tokens, even when index.analyze.max_token_count is set higher on the underlying index.

How can one reproduce the bug?

  1. Create an ingest pipeline with the text_chunking processor.
  2. Create an index with default_pipeline set to that pipeline and index.analyze.max_token_count set to a value above 10,000.
  3. Create an alias for the index.
  4. Post a document with over 10,000 tokens to the alias.
$ curl -XPUT localhost:8080/_ingest/pipeline/my_pipeline --json '{
  "processors": [{
    "text_chunking": {
      "algorithm": {
        "fixed_token_length": {
          "token_limit": 32,
          "max_chunk_limit": -1,
          "max_token_limit": 15000
        }
      },
      "field_map": { "text": "text_chunks" }
    }
  }]
}'
{"acknowledged":true}

$ curl -XPUT localhost:8080/my_index --json '{
  "settings": {
    "default_pipeline": "my_pipeline",
    "index.analyze.max_token_count" : 15000
  }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"my_index"}

$ curl localhost:8080/_aliases --json '{"actions": [{"add": {"index": "my_index", "alias": "my_alias"}}]}'
{"acknowledged":true}

$ curl localhost:8080/my_alias/_doc/1 --json '{"text": "'"$(printf 'token %.0s' {1..10001})"'"}'         
{"error":{"root_cause":[{"type":"illegal_state_exception","reason":"analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."}],"type":"illegal_state_exception","reason":"analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.","caused_by":{"type":"illegal_state_exception","reason":"The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."}},"status":500}

Response:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_state_exception",
        "reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
      }
    ],
    "type": "illegal_state_exception",
    "reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.",
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
    }
  },
  "status": 500
}

What is the expected behavior?

Text should be split into chunks, respecting the configured index.analyze.max_token_count and max_token_limit values rather than the default 10,000-token limit.
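For illustration, a rough sketch of what the indexed document would be expected to look like with the pipeline above (token_limit of 32, standard analyzer, no overlap); the chunk boundaries and response fields here are illustrative only, and the output is truncated:

$ curl localhost:8080/my_alias/_doc/1
{
  "_index": "my_index",
  "_id": "1",
  "found": true,
  "_source": {
    "text": "token token token ... token",
    "text_chunks": [
      "token token token ... token",
      "token token token ... token",
      "..."
    ]
  }
}

Each entry in text_chunks should contain at most 32 tokens, and analyzing the 10,001-token input should succeed because index.analyze.max_token_count is set to 15000 on the index.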

What is your host/environment?

OpenSearch 2.19 (AWS)

Do you have any additional context?

Originally raised here, before realising this was specific to index aliases: #1321

The workaround is to ingest using the index name explicitly, in which case the token limit configuration works as expected.
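For reference, a minimal sketch of the workaround, re-using the same document as in the reproduction steps but posting it directly to the index name instead of the alias:

$ curl localhost:8080/my_index/_doc/1 --json '{"text": "'"$(printf 'token %.0s' {1..10001})"'"}'

With the explicit index name the processor honours the index.analyze.max_token_count of 15000, and the document is chunked and indexed without the _analyze token-count error.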
