Made it possible to skip chunking #526
Conversation
@@ -412,23 +412,20 @@ async def aadd_texts(
        Returns:
            True if the doc was added, otherwise False if already in the collection.
        """
        all_settings = get_settings(settings)

        if embedding_model is None:
Why cut the option to pull the model from settings here? I think we can still support it and skip when chunk_size = 0.
I switched to embedding later - if embedding_model isn't present, we no longer fetch it from settings. That means the settings object isn't used anywhere in the function.
Right, I'm saying that instead of the flag being the presence of embedding_model in the entrypoint, what do you think of using chunk_size = 0 from settings? That way a user could still use this method just by passing in a settings object, and they don't need to instantiate their embedding_model before passing it.
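For illustration, here is a minimal sketch of that proposed flag - not the actual code; it assumes the chunk size lives under the parsing settings and that a settings helper like get_embedding_model() exists:

def resolve_embedding_model(settings, embedding_model=None):
    # Caller supplied a model explicitly: use it.
    if embedding_model is not None:
        return embedding_model
    # chunk_size = 0 means "skip chunking", so also skip early embedding.
    if settings.parsing.chunk_size == 0:
        return None
    # Otherwise keep the old behavior and pull the model from settings.
    return settings.get_embedding_model()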
chunk_size / skipping chunking does come from settings.
The change I made here is unrelated to that: I made it so that embedding is not done when a document is added, since it's kind of wasteful for tests and rate limits. Now we embed texts only when calling gather evidence. A side effect of that change is that the only time we embed at add time is when an explicit embedding_model is passed.
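Roughly, the add-time behavior described above looks like this (a sketch, not the diff itself; it assumes Text objects carry text and embedding fields):

async def maybe_embed_at_add(texts, embedding_model=None):
    # No explicit model: defer embedding until gather evidence.
    if embedding_model is None:
        return texts
    # Explicit model passed: embed the texts right away.
    embeddings = await embedding_model.embed_documents([t.text for t in texts])
    for t, emb in zip(texts, embeddings):
        t.embedding = emb
    return texts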
I see -- I was imagining that a user would only want to avoid the early embedding if they wanted to do full-doc embeddings.
One other edge case: when a user is building an index, we use aadd/aadd_texts for each doc as we index it. There we probably want to pre-embed, because we save individual Docs with one paper each and combine the candidate matching Docs after a paper search. If we only embedded at gather-evidence time, then we'd need to either embed each time, or split back out into the constituent single-paper Docs objects and re-save them at that step.
To your point, that can definitely push the embedding rate limit, but I think it makes sense for the local index. This PR totally supports this use case if you just pass in an embedding_model, but in practice we're only passing a Settings object into our aadd method call in search.process_file, so it was clean to extract the model out of the settings there.
I think all we need to do instead is to build an embedding_model in search.process_file and pass that into aadd, and then this will still work.
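A hypothetical sketch of that suggestion; process_file here is a stand-in and its real signature will differ:

async def process_file(file_path, docs, settings):
    # Build the embedding model from settings once, up front...
    embedding_model = settings.get_embedding_model()  # assumed settings helper
    # ...and pass it explicitly so indexing still pre-embeds each single-paper Docs.
    return await docs.aadd(file_path, settings=settings, embedding_model=embedding_model)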
OK - added a new setting, defer_embeddings (off by default). It defaults to on for the debug setting (to avoid so many embedding calls), and if we get to a point where we have extremely large indices, it may be more efficient to defer embedding.
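As a tiny illustration of the gate this introduces (names are stand-ins, not the real implementation):

def should_embed_at_add_time(embedding_model, defer_embeddings: bool) -> bool:
    # Embed during add only when a model is available and deferral is not requested;
    # otherwise embedding waits until gather evidence.
    return embedding_model is not None and not defer_embeddings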
    async def embed_documents(
        self, texts: list[str], batch_size: int = 16
    ) -> list[list[float]]:
        texts = self._truncate_if_large(texts)
I doubt this will practically affect retrieval much (since 8k is already so large), but I wonder if we should support splitting + embedding, then averaging back into a single embedding for a large document.
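A minimal sketch of that split-then-average idea, assuming an embed(texts) callable that returns one vector per input string (all names here are illustrative):

import numpy as np

def embed_long_document(text, embed, max_chars=8000):
    # Split the document into windows that fit the embedding model's context...
    pieces = [text[i : i + max_chars] for i in range(0, len(text), max_chars)] or [""]
    vectors = np.asarray(embed(pieces), dtype=float)
    # ...then average (and re-normalize) the piece vectors back into one embedding.
    mean = vectors.mean(axis=0)
    norm = np.linalg.norm(mean)
    return (mean / norm if norm else mean).tolist()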
Yes, that is another path to success: keep the chunking strategy so we can retrieve from any place in the document, but when it's time for summarization, substitute the full source doc. I think that's a different feature, though.
Oh, I understand what you're saying - I'm not sure that works, though. We would have to experiment.
Made a few changes to get closer to using Gemini, where we do not chunk and instead work with whole documents. Setting chunk_size = 0 will trigger skipping chunking.
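For reference, a hedged usage sketch of that knob - assuming the chunk size is set on the parsing settings, as discussed above:

from paperqa import Docs, Settings

settings = Settings()
settings.parsing.chunk_size = 0  # 0 disables chunking: whole documents are kept as single texts
docs = Docs()
# await docs.aadd("paper.pdf", settings=settings)  # add a paper without chunking it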