Description
I have implemented hybrid search with the corresponding ingest and search pipelines, using text embeddings on document chunks, since embedding models of course have input token size limits.
The ingest pipeline follows https://opensearch.org/docs/latest/search-plugins/text-chunking/
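For context, the ingest pipeline is essentially the chained text_chunking + text_embedding setup from those docs. A simplified sketch (the chunk field name tns_body_chunks and the chunking parameters are illustrative, not my exact configuration):
{
  "description": "Chunk tns_body and embed each chunk",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 512,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "tns_body": "tns_body_chunks"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "{{ _.embeddingsModelId }}",
        "field_map": {
          "tns_body_chunks": "embedding_chunked_512_paragraphs_chunks_tns_body"
        }
      }
    }
  ]
}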
The top results should now be used as input for RAG. For this I configured a search pipeline, following https://opensearch.org/docs/latest/search-plugins/conversational-search/ :
{
"description": "Post and response processor for hybrid search and RAG",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [
0.3,
0.7
]
}
}
}
}
],
"response_processors": [
{
"retrieval_augmented_generation": {
"tag": "rag_pipeline",
"description": "Pipeline using the configured LLM",
"model_id": "{{ _.LlmId }}",
"context_field_list": [
"tns_body"
],
"system_prompt": "You are a helpful assistant",
"user_instructions": "Generate a concise and informative answer in less than 100 words for the given question"
}
}
]
}
Now I am able to send a search request using this search pipeline to {{ _.openSearchUrl }}/{{ _.openSearchIndex }}/_search?search_pipeline=hybrid-rag-pipeline, which works:
{
"_source": {
"include": [
"tns_body"
]
},
"query": {
"hybrid": {
"queries": [
{
"bool": {
"should": [
{
"match": {
"tns_body": {
"query": "${{query}}"
}
}
},
{
"match": {
"tns_body.ngram": {
"query": "${{query}}"
}
}
}
]
}
},
{
"nested": {
"score_mode": "max",
"path": "embedding_chunked_512_paragraphs_chunks_tns_body",
"query": {
"neural": {
"embedding_chunked_512_paragraphs_chunks_tns_body.knn": {
"query_text": "{{ _.query }}",
"model_id": "{{ _.embeddingsModelId }}",
"k": 10
}
}
}
}
}
]
}
},
"ext": {
"generative_qa_parameters": {
"llm_question": "${{query}}",
"llm_model": "llama3",
"context_size": 2,
"message_size": 5,
"timeout": 150
}
}
}
Now I am running into the issue that the documents in my index are too long for my LLM's input. In OpenSearch, context_size and message_size are currently configurable, but as soon as the first document exceeds the input token limit, OpenSearch sends a request to the LLM provider that cannot be processed.
Two things come to mind:
- Add a 'max tokens' / 'max terms' config parameter alongside context_size and message_size that does not depend on the document sizes (a sketch of what this could look like follows this list).
- Since the documents are so big, and chunking already works fine for the embeddings, why not allow using the matching chunks as RAG input instead of the whole documents (see the second sketch after this list)? If the chunks are too small (because the embedding input size is small), I could also imagine a scripted field that returns the matching chunk together with its neighboring chunks. Or maybe it would be possible to specify, at ingest time, a corresponding text field for a generated embedding vector that also comprises the neighboring chunks; this could then be used for RAG afterwards.
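To make the first idea concrete, the ext block of the search request could accept an additional cap that does not depend on document sizes. max_context_token_count is a made-up name here; it does not exist today:
"ext": {
  "generative_qa_parameters": {
    "llm_question": "${{query}}",
    "llm_model": "llama3",
    "context_size": 2,
    "message_size": 5,
    "max_context_token_count": 4096,  // proposed parameter, not in OpenSearch today
    "timeout": 150
  }
}
The processor would then trim or skip context documents so that the assembled prompt stays below this budget, instead of sending an oversized prompt to the LLM provider.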
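For the second idea, a chunk-aware mode on the response processor could look roughly like this. use_matching_chunks and neighbor_chunks are made-up names for the proposed behavior, and tns_body_chunks is the (assumed) field holding the chunk texts produced at ingest time:
"response_processors": [
  {
    "retrieval_augmented_generation": {
      "tag": "rag_pipeline",
      "description": "Pipeline using the configured LLM",
      "model_id": "{{ _.LlmId }}",
      "context_field_list": [
        "tns_body_chunks"
      ],
      "use_matching_chunks": true,  // proposed: only pass chunks that matched the nested neural query
      "neighbor_chunks": 1,         // proposed: also include n chunks before/after each matching chunk
      "system_prompt": "You are a helpful assistant",
      "user_instructions": "Generate a concise and informative answer in less than 100 words for the given question"
    }
  }
]
Since the nested embedding objects are produced from the chunk array in order, the offset of a matching nested hit identifies the corresponding chunk text, so resolving the matched chunks (plus neighbors) should be possible without re-running the chunking.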
Currently, big documents are not just silently lost in RAG. Because the whole prompt exceeds the input token limit of the LLM, it is (in my setting at least) accidentally truncated, meaning that the question - which is the last part of the generated prompt - is lost. So the user's question is not answered at all.