[BUG] Text chunking ingest processor fails over 10,000 tokens when using index alias for ingestion #1325

@marcus-bcl

Description

What is the bug?

When ingesting documents through an index alias, the text_chunking ingest processor with the "fixed_token_length" algorithm and "standard" analyzer fails to tokenize text longer than 10,000 tokens, even when index.analyze.max_token_count is set higher on the underlying index.

How can one reproduce the bug?

  1. Create an ingest pipeline with the text_chunking processor.
  2. Create an index with default_pipeline set to that pipeline and index.analyze.max_token_count set to a value above 10,000.
  3. Create an alias for the index.
  4. Post a document with over 10,000 tokens to the alias.
$ curl -XPUT localhost:8080/_ingest/pipeline/my_pipeline --json '{
  "processors": [{
    "text_chunking": {
      "algorithm": {
        "fixed_token_length": {
          "token_limit": 32,
          "max_chunk_limit": -1,
          "max_token_limit": 15000
        }
      },
      "field_map": { "text": "text_chunks" }
    }
  }]
}'
{"acknowledged":true}

$ curl -XPUT localhost:8080/my_index --json '{
  "settings": {
    "default_pipeline": "my_pipeline",
    "index.analyze.max_token_count" : 15000
  }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"my_index"}

$ curl localhost:8080/_aliases --json '{"actions": [{"add": {"index": "my_index", "alias": "my_alias"}}]}'
{"acknowledged":true}

$ curl localhost:8080/my_alias/_doc/1 --json '{"text": "'"$(printf 'token %.0s' {1..10001})"'"}'         
{"error":{"root_cause":[{"type":"illegal_state_exception","reason":"analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."}],"type":"illegal_state_exception","reason":"analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.","caused_by":{"type":"illegal_state_exception","reason":"The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."}},"status":500}

Response:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_state_exception",
        "reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
      }
    ],
    "type": "illegal_state_exception",
    "reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.",
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
    }
  },
  "status": 500
}

What is the expected behavior?

Text should be split into chunks, respecting the configured index.analyze.max_token_count and max_token_limit values rather than the default 10,000-token limit.
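For illustration, a rough sketch of what the indexed document would be expected to look like with the pipeline above (token_limit of 32, standard analyzer, no overlap); the chunk boundaries and response fields here are illustrative only, and the output is truncated:

$ curl localhost:8080/my_alias/_doc/1
{
  "_index": "my_index",
  "_id": "1",
  "found": true,
  "_source": {
    "text": "token token token ... token",
    "text_chunks": [
      "token token token ... token",
      "token token token ... token",
      "..."
    ]
  }
}

Each entry in text_chunks should contain at most 32 tokens, and analyzing the 10,001-token input should succeed because index.analyze.max_token_count is set to 15000 on the index.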

What is your host/environment?

OpenSearch 2.19 (AWS)

Do you have any additional context?

Originally raised here, before realising this was specific to index aliases: #1321

The workaround is to ingest using the index name explicitly, in which case the token limit configuration works as expected.
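For reference, a minimal sketch of the workaround, re-using the same document as in the reproduction steps but posting it directly to the index name instead of the alias:

$ curl localhost:8080/my_index/_doc/1 --json '{"text": "'"$(printf 'token %.0s' {1..10001})"'"}'

With the explicit index name the processor honours the index.analyze.max_token_count of 15000, and the document is chunked and indexed without the _analyze token-count error.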
