@yrangana
Contributor

@yrangana yrangana commented Nov 7, 2025

Description

Add support for collection suffixes in QdrantVectorDBStorage to allow creating separate sets of collections for different purposes such as embedding dimensions, environments, testing, and more.

Related Issues

This feature addresses the need to support different embedding dimensions in multi-tenant environments and provides a flexible way to manage separate collection sets without changing the core architecture.

Changes Made

  • Added collection_suffix parameter support in QdrantVectorDBStorage.__post_init__
  • Modified collection naming logic to include the suffix when provided (lightrag_vdb_{namespace}_{suffix})
  • Updated legacy namespace handling to maintain compatibility with data migration
  • Added comprehensive documentation with examples of different suffix use cases
  • Added informative logging when a suffix is used

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (added docstring with examples)
  • Unit tests added (not applicable for this simple feature)

Additional Notes

Usage Example

# Initialize LightRAG with a collection suffix
lightrag = LightRAG(
    vector_db_storage_cls_kwargs={
        "collection_suffix": "768d"  # Or any other suffix
    },
    # Other parameters...
)

@danielaskdd
Collaborator

The LightRAG Server does not appear to contain any code handling the collection_suffix parameter. Could you clarify how collection_suffix is intended to be configured and utilized within the main program? Is the suffix functionality applicable only to vector storage, and how was this design decision considered?

@yrangana
Contributor Author

yrangana commented Nov 9, 2025

Hi @danielaskdd, Thanks for the review! Let me address your questions:

1. How is collection_suffix configured and utilized?

Users configure collection_suffix via the existing vector_db_storage_cls_kwargs mechanism:

from lightrag import LightRAG

# Workspace with 768-dim embeddings
workspace_768 = LightRAG(
    working_dir="./workspace_a",
    embedding_dim=768,
    vector_storage="QdrantVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "collection_suffix": "768d"
    }
)

# Workspace with 1536-dim embeddings
workspace_1536 = LightRAG(
    working_dir="./workspace_b",
    embedding_dim=1536,
    vector_storage="QdrantVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "collection_suffix": "1536d"
    }
)

The parameter flows through LightRAG's initialization to QdrantVectorDBStorage.__post_init__() where it's extracted and applied to collection names.
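A minimal sketch of that flow, assuming a simplified stand-in for the storage class. `SketchQdrantStorage` and its `cls_kwargs` field are hypothetical names used for illustration only; they are not the actual code in qdrant_impl.py.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchQdrantStorage:
    """Hypothetical stand-in for QdrantVectorDBStorage; not the real class."""

    namespace: str
    # Assumed to hold the dict passed as vector_db_storage_cls_kwargs.
    cls_kwargs: dict[str, Any] = field(default_factory=dict)
    collection_name: str = field(init=False)

    def __post_init__(self) -> None:
        # The suffix is optional, so the default name stays unchanged.
        suffix = self.cls_kwargs.get("collection_suffix")
        base = f"lightrag_vdb_{self.namespace}"
        self.collection_name = f"{base}_{suffix}" if suffix else base


default_store = SketchQdrantStorage(namespace="chunks")
suffixed_store = SketchQdrantStorage(
    namespace="chunks", cls_kwargs={"collection_suffix": "768d"}
)
print(default_store.collection_name)
print(suffixed_store.collection_name)
```

Because the suffix is read with `dict.get`, omitting the key leaves the collection name exactly as it was before, which is what keeps the change opt-in.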

2. LightRAG Server integration

"The LightRAG Server does not appear to contain any code handling the collection_suffix parameter."

The LightRAG Server doesn't need special handling for collection_suffix - it's a storage-layer parameter that passes transparently through the existing configuration system, similar to how api_key, port, or other Qdrant-specific parameters are handled. If the server exposes LightRAG configuration, users would simply include it in the vector_db_storage_cls_kwargs dictionary, and it would flow through to the storage backend automatically.

3. Is suffix functionality only applicable to vector storage?

Yes, and only when using QdrantVectorDBStorage.

Why this is needed: After the multitenancy refactoring in PR #2247 (which I originally requested in #2190), all workspaces share the same Qdrant collections and are separated by payload-based filtering. However, this means all workspaces must use the same embedding dimension, since Qdrant collections have fixed vector dimensions.

The problem: LightRAG creates three Qdrant collections when using vector_storage="QdrantVectorDBStorage":

lightrag_vdb_chunks         (text chunks)
lightrag_vdb_entities       (extracted entities)
lightrag_vdb_relationships  (entity relationships)

Without collection_suffix, all workspaces share these three collections and cannot use different embedding dimensions (e.g., workspace A with 768d and workspace B with 1536d would conflict).

The solution: collection_suffix creates separate collection sets per dimension:

lightrag_vdb_chunks_768d         → {workspace_id: "A", workspace_id: "C", ...}
lightrag_vdb_entities_768d       → {workspace_id: "A", workspace_id: "C", ...}
lightrag_vdb_relationships_768d  → {workspace_id: "A", workspace_id: "C", ...}

lightrag_vdb_chunks_1536d         → {workspace_id: "B", workspace_id: "D", ...}
lightrag_vdb_entities_1536d       → {workspace_id: "B", workspace_id: "D", ...}
lightrag_vdb_relationships_1536d  → {workspace_id: "B", workspace_id: "D", ...}

All three collections get the same suffix to keep them synchronized.
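As a rough illustration of the layout above, the sketch below derives one collection set per suffix and groups workspaces by embedding dimension. The `collection_set` helper and the workspace-to-dimension mapping are hypothetical and exist only for this example.

```python
from collections import defaultdict

NAMESPACES = ("chunks", "entities", "relationships")


def collection_set(suffix: str) -> list[str]:
    """Return the three collection names that share one suffix."""
    return [f"lightrag_vdb_{ns}_{suffix}" for ns in NAMESPACES]


# Hypothetical workspace -> embedding dimension assignments.
workspace_dims = {"A": 768, "B": 1536, "C": 768, "D": 1536}

# Workspaces with the same dimension land in the same collection set.
workspaces_by_suffix: dict[str, list[str]] = defaultdict(list)
for workspace_id, dim in workspace_dims.items():
    workspaces_by_suffix[f"{dim}d"].append(workspace_id)

for suffix, workspace_ids in workspaces_by_suffix.items():
    print(suffix, collection_set(suffix), workspace_ids)
```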

Other use cases:

  • Environment separation: collection_suffix="dev"/"staging"/"prod"
  • Testing isolation: collection_suffix="test"

4. How was this design decision considered?

Why vector_db_storage_cls_kwargs instead of a top-level parameter?

  • Storage-specific: Only relevant when using QdrantVectorDBStorage - other vector databases (Milvus, Weaviate) or graph stores (Neo4j) don't need this
  • Consistent: Follows the existing pattern for storage-specific configuration (Qdrant connection strings, API keys, etc.)
  • Flexible: Users can specify custom suffixes (dimension-based, environment-based, or custom logic)
  • Backward compatible: Default behavior (no suffix) remains unchanged
  • Minimal impact: Only requires changes to qdrant_impl.py

Scalability: Qdrant supports 1,000 collections. With ~6 common embedding dimensions × 3 collections each = ~18 collections total, leaving plenty of room for growth while still supporting unlimited workspaces per collection set via payload filtering.

@danielaskdd
Collaborator

Since a collection can only contain vectors of the same dimensionality, manually specifying dimension suffixes is prone to errors. Would it be more reasonable and convenient to automatically append dimension-based suffixes to collections based on the vector dimensions?

@yrangana
Contributor Author

Thanks for the great feedback!
Automatic dimension detection would be convenient and reduce potential errors.

I considered implementing auto-detection initially, but there are a few challenges that make it better suited for a future enhancement:

1. Backward Compatibility
Automatically appending suffixes would be a breaking change for existing Qdrant users. Their collections would suddenly have different names (lightrag_vdb_{namespace} → lightrag_vdb_{namespace}_768d), making existing data inaccessible without migration. The opt-in approach keeps things backward compatible.

2. General Purpose Use Cases
While dimension-based separation is important, collection_suffix is designed to be flexible for other scenarios too:
- Environment separation ("production", "staging")
- Testing isolation ("test-suite-1")
- A/B experiments ("experiment-v2")
- Dimension-based separation ("768d")

3. Implementation Complexity

Auto-detecting dimensions from embedding functions is non-trivial due to various function types. It's doable, but adds complexity.

I think automatic dimension detection would make an excellent Phase 2 feature!

We could add:
- A helper function that inspects embedding_func and generates the suffix automatically
- An optional flag like auto_dimension_suffix=True
- This would work alongside manual suffix for maximum flexibility
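A possible shape for that Phase 2 helper, as a sketch only. The function name `resolve_suffix` and the rule that a manual suffix wins over auto-detection are assumptions, and it only handles embedding functions that expose an `embedding_dim` attribute.

```python
from typing import Any, Optional


def resolve_suffix(
    embedding_func: Any,
    manual_suffix: Optional[str] = None,
    auto_dimension_suffix: bool = False,
) -> Optional[str]:
    """Pick the collection suffix; a manual suffix takes precedence (assumed)."""
    if manual_suffix:
        return manual_suffix
    if not auto_dimension_suffix:
        # Opt-in only, so existing collection names are untouched.
        return None
    dim = getattr(embedding_func, "embedding_dim", None)
    if not isinstance(dim, int) or dim <= 0:
        raise ValueError("embedding_func does not expose a usable embedding_dim")
    return f"{dim}d"


class FakeEmbedding:
    """Hypothetical embedding function object used only for this example."""

    embedding_dim = 768


print(resolve_suffix(FakeEmbedding()))
print(resolve_suffix(FakeEmbedding(), auto_dimension_suffix=True))
print(resolve_suffix(FakeEmbedding(), manual_suffix="test", auto_dimension_suffix=True))
```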

Would you be open to merging this as Phase 1, and we can add auto-detection as a follow-up enhancement if there's demand?

@danielaskdd
Collaborator

Vector collection suffixing is crucial for enabling LightRAG's future online workspace switching and multi-tenancy capabilities. This involves several key considerations:

  • Data Segregation: Data with different embedding dimensions must not be co-located within the same collection/table, and data generated by different embedding models must likewise be segregated into separate collections/tables. Therefore, collection/table suffixes should incorporate both the embedding model's name and its dimension information.
  • Storage Compatibility: The proposed solution must accommodate both Qdrant and PostgreSQL vector storage systems, ensuring compatibility across both types.
  • Historical Data Migration: The implementation plan must facilitate straightforward migration of existing historical data to the new structure.

Based on these considerations, we propose the following recommendations:

  1. Enhance EmbeddingFunc Class: Augment the EmbeddingFunc class definition to include a model_name attribute, in addition to its existing embedding_dim attribute.
  2. Autonomous Suffix Determination: The BaseVectorStorage class already contains an embedding_func property. Leveraging this, Qdrant and PostgreSQL vector stores can autonomously determine their respective collection/table suffix names during initialization, based on the embedding_func's embedding_dim and model_name attributes. This approach eliminates the need for modifications to the LightRAG object's initialization code.
  3. Automated Data Migration: Implement data migration logic such that during storage initialization, the system checks for the existence of collections/tables suffixed with model and dimension information. If such a collection/table is not found, it should be automatically created, and existing historical data should be migrated to it. The detailed data migration logic can be referenced in the setup_collection function within the Qdrant implementation.
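The three recommendations could fit together roughly as follows. This is a sketch under stated assumptions: `SketchEmbeddingFunc`, `derive_suffix`, `plan_initialization`, and the `{model}_{dim}d` suffix format are all hypothetical, and the real migration logic lives in setup_collection in the Qdrant implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchEmbeddingFunc:
    """Hypothetical stand-in for EmbeddingFunc with the proposed model_name."""

    embedding_dim: int
    model_name: Optional[str] = None


def derive_suffix(func: SketchEmbeddingFunc) -> str:
    """Build a suffix from model name and dimension (format is an assumption)."""
    dim_part = f"{func.embedding_dim}d"
    if not func.model_name:
        return dim_part
    # Keep only characters that are safe in collection/table names.
    safe_model = re.sub(r"[^a-z0-9]+", "_", func.model_name.lower()).strip("_")
    return f"{safe_model}_{dim_part}"


def plan_initialization(
    existing: set[str], namespace: str, func: SketchEmbeddingFunc
) -> list[str]:
    """Return the ordered steps storage initialization would take."""
    legacy = f"lightrag_vdb_{namespace}"
    target = f"{legacy}_{derive_suffix(func)}"
    if target in existing:
        return []  # suffixed collection already present, nothing to do
    steps = [f"create {target}"]
    if legacy in existing:
        steps.append(f"migrate {legacy} -> {target}")
    return steps


func = SketchEmbeddingFunc(embedding_dim=1536, model_name="text-embedding-3-small")
print(derive_suffix(func))
print(plan_initialization({"lightrag_vdb_chunks"}, "chunks", func))
```

The sketch only plans the steps. A real migration would also need to confirm that the legacy collection's vector dimension matches the target before copying any data.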

We recommend against incorporating environment-specific suffixes (e.g., _dev, _staging, _prod) into the vector store names. This differentiation can be effectively managed through the existing workspace mechanism within LightRAG, which is capable of providing comprehensive environment segregation across all storage types, rendering separate suffixes redundant.

@BukeLy @LarFii

@yrangana
Contributor Author

yrangana commented Nov 11, 2025

Hi @danielaskdd,

I agree with your long-term vision. However, there's a critical breaking change we need to address immediately:

The Problem

LightRAG currently creates a set of collections fixed to a dimension based on the first workspace. This creates a system-wide limitation:

  1. First workspace created with 1536-dim embedding model → Qdrant collections expect 1536-dim vectors
  2. User creates another workspace with 3072-dim embedding model → Qdrant rejects all vectors (dimension mismatch error)
  3. System is locked to one embedding dimension across all workspaces
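The failure mode in these steps can be reproduced with a toy model. `FixedDimCollection` below is a hypothetical in-memory class that only mimics the fixed-dimension constraint; it does not use the Qdrant client.

```python
class FixedDimCollection:
    """Toy model of a vector collection whose size is fixed at creation."""

    def __init__(self, name: str, dim: int) -> None:
        self.name = name
        self.dim = dim
        self.points: list[tuple[str, list[float]]] = []

    def upsert(self, workspace_id: str, vector: list[float]) -> None:
        # Every vector must match the dimension chosen at creation time.
        if len(vector) != self.dim:
            raise ValueError(
                f"{self.name}: expected {self.dim}-dim vector, got {len(vector)}"
            )
        self.points.append((workspace_id, vector))


# Step 1: the first workspace fixes the shared collection at 1536 dimensions.
shared = FixedDimCollection("lightrag_vdb_chunks", dim=1536)
shared.upsert("workspace_1", [0.0] * 1536)

# Step 2: a second workspace with a 3072-dim model is rejected.
try:
    shared.upsert("workspace_2", [0.0] * 3072)
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)
```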

This breaks multi-embedding-model support entirely and needs an immediate fix.

Regarding environment suffixes: Agreed - workspace-based isolation is the right approach.

@danielaskdd
Collaborator

I understand your concern. As previously mentioned, during startup of the LightRAG Server, data is automatically migrated from the collections without a suffix to those with the proper suffix. After this process, the original unsuffixed collections are deprecated. Going forward, all newly created workspaces will store data in correctly suffixed collections.

@yrangana
Contributor Author

Thank you for clarifying, @danielaskdd. Can you assign someone to this, please?

Labels

enhancement New feature or request
