
Made it possible to skip chunking #526


Merged: 4 commits merged into main from skip-chunk on Oct 4, 2024
Conversation

@whitead (Collaborator) commented Oct 4, 2024

Made a few changes to get closer to using Gemini, where we do not chunk and instead work with whole documents.

  • a chunk_size of 0 will trigger skipping chunking (see the sketch after this list)
  • deferred embedding of documents until retrieval, to avoid embedding docs all the time in tests (or in cases where you're not doing vector retrieval)
  • added a small helper function to truncate texts before embedding, in case a text is too long
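
For illustration, here is a minimal sketch of the chunk_size == 0 convention; the chunk_text function and Chunk class below are hypothetical stand-ins, not the actual paper-qa code.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    embedding: list[float] | None = None


def chunk_text(text: str, chunk_size: int, overlap: int = 100) -> list[Chunk]:
    """Split text into chunks; a chunk_size of 0 means skip chunking entirely."""
    if chunk_size == 0:
        # The whole document becomes a single "chunk".
        return [Chunk(text=text)]
    step = max(chunk_size - overlap, 1)
    return [Chunk(text=text[i : i + chunk_size]) for i in range(0, len(text), step)]


# The entire document is kept as one piece when chunk_size is 0.
assert len(chunk_text("a very long document ...", chunk_size=0)) == 1
```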

@whitead whitead requested a review from jamesbraza October 4, 2024 06:16
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 4, 2024
@whitead whitead requested a review from mskarlin October 4, 2024 06:16
@dosubot dosubot bot added the enhancement New feature or request label Oct 4, 2024
@@ -412,23 +412,20 @@ async def aadd_texts(
    Returns:
        True if the doc was added, otherwise False if already in the collection.
    """
    all_settings = get_settings(settings)

    if embedding_model is None:
Collaborator

Why cut the option to pull the model from settings here? I think we can still support it and skip if chunk_size = 0.

Collaborator Author (@whitead) Oct 4, 2024

I switched to embedding later - if embedding_model isn't present, we no longer fetch it from settings. That means settings isn't used anywhere in the function.

Collaborator

Right, I'm saying that instead of the flag being the presence of embedding_model in the entrypoint, what do you think of using chunk_size = 0 from settings? That way a user could still use this method just by passing in a settings object, without having to instantiate their embedding_model first.

Collaborator Author (@whitead)

chunk_size / skipping chunking does come from settings.

The change I made here is unrelated to that. I just made it so that embedding is not done when a document is added, since it's kind of wasteful for tests and rate limits. Now we embed texts only when calling gather evidence. A side effect of that change is that the only time we embed is when an explicit embedding_model is passed.

Collaborator

I see -- I was imagining that a user would only want to avoid the early embedding if they wanted to do full-doc embeddings.

One other edge case: when a user is building an index, we use aadd/aadd_texts for each doc as we index it. There we probably want to pre-embed, because we save individual Docs with one paper each and combine the candidate matching Docs after a paper search. If we only embedded at gather-evidence time, then we'd need to either embed each time, or split back out into the constituent single-paper Docs objects and re-save them at that step.

To your point, that definitely can push the embedding rate limit, but I think it makes sense for the local index. This PR totally supports this use case if you just pass in an embedding_model, but in practice we only pass a Settings object into our aadd method call in search.process_file. So it was clean to extract the model out of the settings there.

I think all we need to do instead is build an embedding_model in search.process_file and pass it into aadd, and then this will still work.
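
A rough sketch of that suggestion follows; process_file and aadd are the names used in this thread, but the signatures and the get_embedding_model() helper are assumptions rather than the real code.

```python
# Hypothetical: pre-embed while building the local index by constructing the
# embedding model from settings once and passing it explicitly into aadd.
async def process_file(file_path, docs, settings) -> None:
    embedding_model = settings.get_embedding_model()  # assumed helper on Settings
    await docs.aadd(
        file_path,
        settings=settings,
        embedding_model=embedding_model,  # triggers embedding at add time
    )
```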

Collaborator Author (@whitead)

OK - added a new setting, defer_embeddings (off by default). It defaults to on for the debug setting (to avoid so many embedding calls), and if we get to a point where we have extremely large indices, it may be more efficient to defer embedding.
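
As a rough illustration, a flag like this could gate early embedding; the maybe_embed function and the text objects' interface below are assumptions, not the PR's actual code.

```python
async def maybe_embed(texts, embedding_model, settings) -> None:
    """Embed at add time only when a model is given and embedding is not deferred."""
    if embedding_model is None or getattr(settings, "defer_embeddings", False):
        # Deferred: embeddings get computed later, at gather-evidence time.
        return
    embeddings = await embedding_model.embed_documents([t.text for t in texts])
    for text, emb in zip(texts, embeddings):
        text.embedding = emb
```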

async def embed_documents(
    self, texts: list[str], batch_size: int = 16
) -> list[list[float]]:
    # Clip overly long inputs before calling the embedding API.
    texts = self._truncate_if_large(texts)
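
For reference, one possible shape of that truncation helper; the character-based limit is an illustrative assumption (the real helper may count tokens with the model's tokenizer instead).

```python
def _truncate_if_large(self, texts: list[str], max_chars: int = 4 * 8191) -> list[str]:
    """Clip any text that would exceed the embedding model's input limit.

    Assumes roughly 4 characters per token against an ~8k-token limit.
    """
    return [t[:max_chars] for t in texts]
```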
Collaborator

I doubt this will practically affect retrieval much (since 8k is already so large), but I wonder if we should support splitting + embedding, then averaging back into a single embedding for a large document.
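
For concreteness, the split-and-average idea could look roughly like this (not something this PR implements; the windowing and the embed_documents call are illustrative).

```python
import numpy as np


async def embed_by_averaging(
    text: str, embedding_model, window_chars: int = 8_000
) -> list[float]:
    """Embed fixed-size windows of a long text, then mean-pool into one vector."""
    windows = [text[i : i + window_chars] for i in range(0, len(text), window_chars)]
    vectors = await embedding_model.embed_documents(windows)
    return np.mean(np.asarray(vectors), axis=0).tolist()
```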

Collaborator Author (@whitead)

Yes, that is another path to success: keep the chunking strategy so we can retrieve from any place in the document, but substitute the full source doc when it's time for summarization. I think that's a different feature, though.

Collaborator Author (@whitead)

Oh, I understand what you're saying - I'm not sure that works, though. We would have to experiment.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Oct 4, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 4, 2024
@whitead whitead enabled auto-merge (squash) October 4, 2024 16:56
@whitead whitead merged commit caca42a into main Oct 4, 2024
5 checks passed
@whitead whitead deleted the skip-chunk branch October 4, 2024 17:18