Feat: Add Cookbook for Content-Based Recommendations with Gemini & Qdrant #700

Merged
merged 15 commits into from
Jun 16, 2025

Conversation

andycandy
Collaborator

Status: Work in Progress / Seeking Feedback

This PR presents the initial implementation of the Gemini+Qdrant recommendation cookbook outlined in #690. While the core functionality is present, I am actively seeking feedback and collaborating with reviewers to refine the notebook's structure, clarity, and explanations before final merging. Please feel free to add comments and suggestions!

Description

This PR introduces a new cookbook example demonstrating how to build a scalable, content-based recommendation system using Google's Gemini API for semantic embeddings and Qdrant as an efficient vector database.

Motivation

As outlined in #690, existing examples often use toy datasets or focus on narrow use cases. This cookbook aims to provide a more practical, near-production-level demonstration of how Large Language Model (LLM) embeddings, specifically from Gemini, can power recommendations across potentially diverse media types without relying on user interaction history (addressing cold-start scenarios).

Solution

A new Jupyter notebook, Movie_Recommendation, has been added. This notebook guides the user through the following steps:

  1. Setup: Installs necessary libraries (google-genai, qdrant-client, pandas), configures the Gemini API client, and initializes the Qdrant client.
  2. Data Loading & Preparation:
    • Loads a sample of the TMDB Movies Dataset.
      Note: The notebook uses a sample (e.g., 5000 items) for demonstration purposes due to the scale of the full dataset and API costs associated with embedding millions of items. Clear comments explain this and how to adapt for the full dataset.
    • Selects relevant features (title, overview, genres, keywords, tagline, release_date).
    • Performs data cleaning: handles missing titles, fills missing optional text fields, and extracts the release year.
    • Constructs a combined text_for_embedding string for each movie, leveraging available metadata.
  3. Embedding & Indexing with Qdrant:
    • Defines a function (get_embeddings_batch) to efficiently generate embeddings for batches of text using the Gemini embedding-001 model, including retry logic.
    • Creates a Qdrant collection with appropriate vector parameters (size 768, Cosine distance).
    • Implements a batched upsert process:
      • Iterates through the data sample in batches.
      • Generates embeddings for each batch via the Gemini API.
      • Creates Qdrant PointStruct objects containing the vector, a unique ID (movie_id), and a structured payload (movie metadata like title, genre, year, overview).
      • Upserts these points into the Qdrant collection in batches for efficiency.
    • Verifies the number of points indexed in Qdrant.
  4. Querying & Recommendation:
    • Defines a recommend_movies function that:
      • Takes a natural language query (e.g., movie title, theme description).
      • Generates an embedding for the query using Gemini (task_type="RETRIEVAL_QUERY").
      • Performs a similarity search against the indexed movie vectors in Qdrant using client.search.
      • Returns the top K most similar movies, including their metadata (payload) and similarity scores.
    • Includes example queries to demonstrate the recommendation functionality.
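
The data flow in steps 2 and 3 can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the notebook's actual code: the helper names (`build_text_for_embedding`, `extract_year`, `batched`, `embed_with_retry`, `build_points`), the field labels in the combined text, and the backoff settings are all assumptions made for this example. The Gemini call is represented by an injected `embed_fn` callable, and the points are plain dicts standing in for Qdrant `PointStruct` objects.

```python
import time
from typing import Callable, Iterator, List, Optional, Sequence

# Hypothetical (label, key) pairs for the optional text fields.
OPTIONAL_FIELDS = [("Overview", "overview"), ("Genres", "genres"),
                   ("Keywords", "keywords"), ("Tagline", "tagline")]


def build_text_for_embedding(movie: dict) -> str:
    """Join the title and any non-empty optional fields into one string."""
    parts = [f"Title: {movie['title']}"]
    for label, key in OPTIONAL_FIELDS:
        value = (movie.get(key) or "").strip()
        if value:
            parts.append(f"{label}: {value}")
    return "\n".join(parts)


def extract_year(release_date: Optional[str]) -> Optional[int]:
    """Return the year of a 'YYYY-MM-DD' string, or None if missing or malformed."""
    head = (release_date or "")[:4]
    return int(head) if len(head) == 4 and head.isdecimal() else None


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_with_retry(embed_fn: Callable[[List[str]], List[List[float]]],
                     texts: List[str],
                     max_attempts: int = 3,
                     base_delay: float = 1.0) -> List[List[float]]:
    """Call embed_fn, retrying with exponential backoff; re-raise on the last failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return embed_fn(texts)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def build_points(movies: Sequence[dict],
                 embed_fn: Callable[[List[str]], List[List[float]]],
                 batch_size: int = 100) -> List[dict]:
    """Embed movies batch by batch and pair each vector with an ID and payload."""
    points = []
    for batch in batched(movies, batch_size):
        vectors = embed_fn([build_text_for_embedding(m) for m in batch])
        for movie, vector in zip(batch, vectors):
            points.append({
                "id": movie["movie_id"],
                "vector": vector,
                "payload": {
                    "title": movie["title"],
                    "genres": movie.get("genres") or "",
                    "year": extract_year(movie.get("release_date")),
                    "overview": movie.get("overview") or "",
                },
            })
    return points
```

In a real run, `embed_fn` would presumably wrap the Gemini embedding request (for example, `lambda texts: embed_with_retry(call_gemini, texts)`, where `call_gemini` is a hypothetical wrapper), and each dict would be converted to a `PointStruct` before being upserted into the collection. Injecting the embedding function keeps the batching and payload logic testable without API access.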

Disclaimer

This PR uses the TMDb Movie Dataset for non-commercial purposes only, as per its licensing terms.

  • The dataset is licensed under CC BY-NC 4.0, which restricts usage to non-commercial projects.
  • Proper attribution has been provided to TMDb as required by the license.

By submitting this PR, I confirm that:

  1. This dataset is used solely for educational and demonstration purposes.
  2. Any further use of this dataset must comply with its licensing terms, and users are responsible for ensuring compliance.

@andycandy andycandy marked this pull request as draft April 11, 2025 17:27
@github-actions github-actions bot added the status:awaiting review (PR awaiting review from a maintainer) and component:examples (Issues/PR referencing examples folder) labels Apr 11, 2025
@andycandy
Collaborator Author

andycandy commented Apr 11, 2025

@markmcd @Giom-V Here is my first rough draft. Could you go through it once?

@Giom-V Giom-V marked this pull request as ready for review June 4, 2025 09:13
@Giom-V
Collaborator

Giom-V commented Jun 4, 2025

@andycandy Overall I like the example, and I think it would be a great addition to the cookbook. I just think it needs many more explanations (but this was a first draft, so that was expected).
Also, don't forget to create a README in the folder, update the one in examples/, and maybe update the "what's next" section of the embedding notebooks (and maybe other related examples) to mention this one.

@Giom-V Giom-V assigned Giom-V and andycandy and unassigned Giom-V Jun 4, 2025
@andycandy
Collaborator Author

Thanks! I've completed all the required updates: I added the README in the qdrant/ folder, updated the main examples/ README, and included this notebook in the "What's next" and related examples sections of the relevant notebooks. I also expanded the explanations throughout the notebook for clarity.

@Giom-V Giom-V merged commit 8d7b26b into google-gemini:main Jun 16, 2025
5 checks passed
@andycandy andycandy deleted the qdrant-movie-recommendation branch June 24, 2025 09:16