
Analysis of Performance Issues and Proposed Solution for Context Window (num_ctx) #1732

@renatosalopes


Hello everyone,

I've been reading performance-related issues like #295, #557, and #1518, and I believe I've pinpointed the root cause of the severe CPU/GPU split that occurs when using local Ollama models.

The Problem: VRAM Over-estimation

Like others, I was experiencing frustratingly slow performance on my machine (8GB VRAM GPU, 16GB RAM). When running a small 4B-parameter model, which should easily fit into my VRAM, ollama ps showed an 11GB allocation for the model loaded by Bolt. This forced Ollama to offload most of the model to the CPU, creating a severe performance bottleneck.

This was confusing, because other tools such as Open WebUI or VS Code extensions run the same models on the same machine at 100% GPU without any issues.

Root Cause in Code

After digging into the source code, I found the cause in app/lib/modules/llm/providers/ollama.ts on line 59:

const num_ctx = process.env.DEFAULT_NUM_CTX ? parseInt(process.env.DEFAULT_NUM_CTX) : 32768

The code attempts to read the DEFAULT_NUM_CTX environment variable. If it's not set, it falls back to a hardcoded context window of 32,768 tokens.
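As a standalone illustration (not Bolt's actual module, just the same logic reproduced in isolation), the fallback resolves like this; an unset variable silently becomes 32,768, and a mistyped value becomes NaN:

```ts
// Standalone reproduction of the fallback quoted above (illustrative only).
// - DEFAULT_NUM_CTX unset    -> 32768 (a 32k-token context request)
// - DEFAULT_NUM_CTX="4096"   -> 4096
// - DEFAULT_NUM_CTX="oops"   -> NaN (parseInt fails and nothing guards against it)
const raw = process.env.DEFAULT_NUM_CTX;
const num_ctx = raw ? parseInt(raw, 10) : 32768;

console.log(`Context window requested from Ollama: ${num_ctx} tokens`);
```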

Ollama pre-allocates VRAM based on this num_ctx value, because the KV cache it reserves scales with the context length. For a 32k context, the VRAM estimate skyrockets well beyond the size of the model weights, forcing most of the model onto the CPU even on systems where it would otherwise fit perfectly.
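To make the mechanism concrete, here is roughly what Ollama receives on the wire. This is only an illustrative direct call to Ollama's REST API, not Bolt's actual request path, and the model name and prompt are placeholders:

```ts
// Illustrative request straight to Ollama's /api/generate endpoint.
// options.num_ctx is what drives Ollama's VRAM estimate: a 32k value forces a
// large KV-cache reservation regardless of how small the model itself is.
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen3:4b", // placeholder: any small model that fits in 8GB VRAM
    prompt: "Hello",
    stream: false,
    options: {
      num_ctx: 32768, // the value Bolt currently defaults to
    },
  }),
});

console.log((await response.json()).response);
```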

Suggestion

I understand that a large context window is crucial for Bolt's core functionality—enabling autonomous reasoning and editing across multiple files. However, the current implementation makes the tool almost unusable for a significant portion of the local LLM community with consumer-grade GPUs (e.g., 8GB or 12GB VRAM).

Temporary Workaround:

For anyone else facing this issue, a temporary fix is to set the environment variable before launching Bolt:

DEFAULT_NUM_CTX=4096

Depending on your model size and available VRAM, you may be able to go somewhat higher; I recommend experimenting. This dramatically improves performance by keeping the model fully on the GPU.
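To check whether the workaround actually keeps the model on the GPU, you can compare the size and size_vram fields Ollama reports for running models (the same data ollama ps prints). The helper below is a hypothetical convenience script, not part of Bolt:

```ts
// Hypothetical check: query Ollama's "list running models" endpoint and report
// how much of each loaded model is resident in VRAM.
interface RunningModel {
  name: string;
  size: number;      // total bytes allocated for the loaded model
  size_vram: number; // portion of that allocation resident on the GPU
}

const res = await fetch("http://localhost:11434/api/ps");
const { models } = (await res.json()) as { models: RunningModel[] };

for (const m of models) {
  const gpuShare = m.size > 0 ? (m.size_vram / m.size) * 100 : 0;
  console.log(
    `${m.name}: ${(m.size / 1e9).toFixed(1)} GB allocated, ${gpuShare.toFixed(0)}% on GPU`,
  );
}
```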

Proposed Long-Term Solution:

I would like to suggest making the num_ctx value a configurable setting within the application's UI. This would empower users to balance performance and capability based on their hardware.

This setting could include a brief explanation of the trade-offs:

Higher Context (e.g., 32k): "Ideal for complex, multi-file tasks. Requires a large amount of VRAM (16GB+ recommended)."
Lower Context (e.g., 4k-8k): "Recommended for faster performance on systems with limited VRAM. May impact the model's ability to process very large codebases at once."

This change would greatly improve the user experience and make Bolt.diy accessible to a much wider audience.
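To make the idea more concrete, here is a rough sketch of how such a setting could be resolved in code. It is only a sketch under my own assumptions; the names (ContextPreset, resolveNumCtx) and the preset values are hypothetical, not part of the current codebase:

```ts
// Hypothetical sketch of a UI-configurable context window for the Ollama provider.
// Resolution order: explicit UI value > DEFAULT_NUM_CTX env var > preset default.
type ContextPreset = "low-vram" | "balanced" | "full";

const PRESET_NUM_CTX: Record<ContextPreset, number> = {
  "low-vram": 4096, // keeps small models fully on 8GB GPUs
  balanced: 8192,
  full: 32768,      // the current hardcoded default; comfortable with 16GB+ VRAM
};

function resolveNumCtx(uiValue: number | undefined, preset: ContextPreset): number {
  const envRaw = process.env.DEFAULT_NUM_CTX;
  const envValue = envRaw ? parseInt(envRaw, 10) : undefined;
  const candidate = uiValue ?? envValue ?? PRESET_NUM_CTX[preset];

  // Guard against NaN or implausibly small values before handing this to Ollama.
  return Number.isFinite(candidate) && candidate >= 2048 ? candidate : PRESET_NUM_CTX[preset];
}

// Example: a user on an 8GB GPU picks the "low-vram" preset in the UI.
console.log(resolveNumCtx(undefined, "low-vram")); // -> 4096
```

The exact UI wiring would of course differ; the key point is that whatever value the user chooses ends up in the num_ctx option Bolt sends to Ollama.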

Thanks for creating such an ambitious project. I hope this analysis is helpful!
