
Made it possible to skip chunking #526


Merged: 4 commits merged into main from skip-chunk on Oct 4, 2024
Conversation

@whitead (Collaborator) commented Oct 4, 2024

Made a few changes to get closer to using Gemini, where we do not chunk and instead work with whole documents.

  • a chunk_size of 0 will trigger skipping chunking (see the sketch after this list)
  • deferred embedding of documents until retrieval, to avoid embedding docs all the time in tests (or in cases where you're not doing vector retrieval)
  • added a small helper function to truncate texts before embedding, in case a text is too long
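
For illustration, here is a minimal sketch of the chunk_size == 0 convention; the chunk_text function and Chunk class below are hypothetical stand-ins, not the actual paper-qa code.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    embedding: list[float] | None = None


def chunk_text(text: str, chunk_size: int, overlap: int = 100) -> list[Chunk]:
    """Split text into chunks; a chunk_size of 0 means skip chunking entirely."""
    if chunk_size == 0:
        # The whole document becomes a single "chunk".
        return [Chunk(text=text)]
    step = max(chunk_size - overlap, 1)
    return [Chunk(text=text[i : i + chunk_size]) for i in range(0, len(text), step)]


# The entire document is kept as one piece when chunk_size is 0.
assert len(chunk_text("a very long document ...", chunk_size=0)) == 1
```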

@whitead whitead requested a review from jamesbraza October 4, 2024 06:16
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 4, 2024
@whitead whitead requested a review from mskarlin October 4, 2024 06:16
@dosubot dosubot bot added the enhancement New feature or request label Oct 4, 2024
@@ -412,23 +412,20 @@ async def aadd_texts(
    Returns:
        True if the doc was added, otherwise False if already in the collection.
    """
    all_settings = get_settings(settings)

    if embedding_model is None:
Collaborator

Why cut the option to pull the model from settings here? I think we can still support it and skip if chunk_size = 0.

Collaborator Author (@whitead) Oct 4, 2024

I switched to embedding later - if embedding_model isn't present, we no longer fetch it from settings. That means settings isn't used anywhere in the function.

Collaborator

Right, I'm saying that instead of the flag being the presence of embedding_model in the entrypoint, what do you think of using chunk_size = 0 from settings? That way a user could still use this method just by passing in a settings object, without having to instantiate their embedding_model first.

Collaborator Author (@whitead)

chunk_size / skipping chunking does come from settings.

The change I made here is unrelated to that. I just made it so that embedding is not done when a document is added, since it's kind of wasteful for tests and rate limits. Now we embed texts only when calling gather evidence. A side effect of that change is that the only time we embed is when an explicit embedding_model is passed.

Collaborator

I see -- I was imagining that a user would only want to avoid the early embedding if they wanted to do full-doc embeddings.

One other edge case: when a user is building an index, we use aadd/aadd_texts for each doc as we index it. There we probably want to pre-embed, because we save individual Docs with one paper each and combine the candidate matching Docs after a paper search. If we only embedded at gather-evidence time, then we'd need to either embed each time, or split back out into the constituent single-paper Docs objects and re-save them at that step.

To your point, that definitely can push the embedding rate limit, but I think it makes sense for the local index. This PR totally supports this use case if you just pass in an embedding_model, but in practice we only pass a Settings object into our aadd method call in search.process_file. So it was clean to extract the model out of the settings there.

I think all we need to do instead is build an embedding_model in search.process_file and pass it into aadd, and then this will still work.
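
A rough sketch of that suggestion follows; process_file and aadd are the names used in this thread, but the signatures and the get_embedding_model() helper are assumptions rather than the real code.

```python
# Hypothetical: pre-embed while building the local index by constructing the
# embedding model from settings once and passing it explicitly into aadd.
async def process_file(file_path, docs, settings) -> None:
    embedding_model = settings.get_embedding_model()  # assumed helper on Settings
    await docs.aadd(
        file_path,
        settings=settings,
        embedding_model=embedding_model,  # triggers embedding at add time
    )
```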

Collaborator Author (@whitead)

OK - added a new setting, defer_embeddings (off by default). It defaults to on for the debug setting (to avoid so many embedding calls), and if we get to a point where we have extremely large indices, it may be more efficient to defer embedding.
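
As a rough illustration, a flag like this could gate early embedding; the maybe_embed function and the text objects' interface below are assumptions, not the PR's actual code.

```python
async def maybe_embed(texts, embedding_model, settings) -> None:
    """Embed at add time only when a model is given and embedding is not deferred."""
    if embedding_model is None or getattr(settings, "defer_embeddings", False):
        # Deferred: embeddings get computed later, at gather-evidence time.
        return
    embeddings = await embedding_model.embed_documents([t.text for t in texts])
    for text, emb in zip(texts, embeddings):
        text.embedding = emb
```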

async def embed_documents(
    self, texts: list[str], batch_size: int = 16
) -> list[list[float]]:
    # Clip overly long inputs before calling the embedding API.
    texts = self._truncate_if_large(texts)
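
For reference, one possible shape of that truncation helper; the character-based limit is an illustrative assumption (the real helper may count tokens with the model's tokenizer instead).

```python
def _truncate_if_large(self, texts: list[str], max_chars: int = 4 * 8191) -> list[str]:
    """Clip any text that would exceed the embedding model's input limit.

    Assumes roughly 4 characters per token against an ~8k-token limit.
    """
    return [t[:max_chars] for t in texts]
```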
Collaborator

I doubt this will practically affect retrieval much (since 8k is already so large), but I wonder if we should support splitting + embedding, then averaging back into a single embedding for a large document.
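
For concreteness, the split-and-average idea could look roughly like this (not something this PR implements; the windowing and the embed_documents call are illustrative).

```python
import numpy as np


async def embed_by_averaging(
    text: str, embedding_model, window_chars: int = 8_000
) -> list[float]:
    """Embed fixed-size windows of a long text, then mean-pool into one vector."""
    windows = [text[i : i + window_chars] for i in range(0, len(text), window_chars)]
    vectors = await embedding_model.embed_documents(windows)
    return np.mean(np.asarray(vectors), axis=0).tolist()
```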

Collaborator Author (@whitead)

Yes, that is another path to success: keep the chunking strategy so we can retrieve from any place in the document, but substitute the full source doc when it's time for summarization. I think that's a different feature, though.

Collaborator Author (@whitead)

Oh, I understand what you're saying - I'm not sure that works, though. We would have to experiment.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Oct 4, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 4, 2024
@whitead whitead enabled auto-merge (squash) October 4, 2024 16:56
@whitead whitead merged commit caca42a into main Oct 4, 2024
5 checks passed
@whitead whitead deleted the skip-chunk branch October 4, 2024 17:18