
Analysis of Performance Issues and Proposed Solution for Context Window (num_ctx) #1732

@renatosalopes


Hello everyone,

I've been reading performance-related issues like #295, #557, and #1518, and I believe I've pinpointed the root cause of the severe CPU/GPU split that occurs when using local Ollama models.

The Problem: VRAM Over-estimation

Like others, I was experiencing frustratingly slow performance on my machine (8GB VRAM GPU, 16GB RAM). When running a small 4B-parameter model, which should easily fit into my VRAM, ollama ps showed an 11GB allocation for the model loaded by Bolt. This forced Ollama to offload most of the model to the CPU, creating a severe performance bottleneck.

This was confusing, because other tools such as Open WebUI or VS Code extensions run the same models on the same machine at 100% GPU without any issues.

Root Cause in Code

After digging into the source code, I found the cause in app/lib/modules/llm/providers/ollama.ts on line 59:

const num_ctx = process.env.DEFAULT_NUM_CTX ? parseInt(process.env.DEFAULT_NUM_CTX) : 32768

The code attempts to read the DEFAULT_NUM_CTX environment variable. If it's not set, it falls back to a hardcoded context window of 32,768 tokens.
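As a standalone illustration (not Bolt's actual module, just the same logic reproduced in isolation), the fallback resolves like this; an unset variable silently becomes 32,768, and a mistyped value becomes NaN:

```ts
// Standalone reproduction of the fallback quoted above (illustrative only).
// - DEFAULT_NUM_CTX unset    -> 32768 (a 32k-token context request)
// - DEFAULT_NUM_CTX="4096"   -> 4096
// - DEFAULT_NUM_CTX="oops"   -> NaN (parseInt fails and nothing guards against it)
const raw = process.env.DEFAULT_NUM_CTX;
const num_ctx = raw ? parseInt(raw, 10) : 32768;

console.log(`Context window requested from Ollama: ${num_ctx} tokens`);
```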

Ollama pre-allocates VRAM based on this num_ctx value, because the KV cache it reserves scales with the context length. For a 32k context, the VRAM estimate skyrockets well beyond the size of the model weights, forcing most of the model onto the CPU even on systems where it would otherwise fit perfectly.
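To make the mechanism concrete, here is roughly what Ollama receives on the wire. This is only an illustrative direct call to Ollama's REST API, not Bolt's actual request path, and the model name and prompt are placeholders:

```ts
// Illustrative request straight to Ollama's /api/generate endpoint.
// options.num_ctx is what drives Ollama's VRAM estimate: a 32k value forces a
// large KV-cache reservation regardless of how small the model itself is.
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen3:4b", // placeholder: any small model that fits in 8GB VRAM
    prompt: "Hello",
    stream: false,
    options: {
      num_ctx: 32768, // the value Bolt currently defaults to
    },
  }),
});

console.log((await response.json()).response);
```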

Suggestion

I understand that a large context window is crucial for Bolt's core functionality—enabling autonomous reasoning and editing across multiple files. However, the current implementation makes the tool almost unusable for a significant portion of the local LLM community with consumer-grade GPUs (e.g., 8GB or 12GB VRAM).

Temporary Workaround:

For anyone else facing this issue, a temporary fix is to set the environment variable before launching Bolt:

DEFAULT_NUM_CTX=4096

Depending on your model size and available VRAM, you may be able to go somewhat higher; I recommend experimenting. This dramatically improves performance by keeping the model fully on the GPU.
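To check whether the workaround actually keeps the model on the GPU, you can compare the size and size_vram fields Ollama reports for running models (the same data ollama ps prints). The helper below is a hypothetical convenience script, not part of Bolt:

```ts
// Hypothetical check: query Ollama's "list running models" endpoint and report
// how much of each loaded model is resident in VRAM.
interface RunningModel {
  name: string;
  size: number;      // total bytes allocated for the loaded model
  size_vram: number; // portion of that allocation resident on the GPU
}

const res = await fetch("http://localhost:11434/api/ps");
const { models } = (await res.json()) as { models: RunningModel[] };

for (const m of models) {
  const gpuShare = m.size > 0 ? (m.size_vram / m.size) * 100 : 0;
  console.log(
    `${m.name}: ${(m.size / 1e9).toFixed(1)} GB allocated, ${gpuShare.toFixed(0)}% on GPU`,
  );
}
```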

Proposed Long-Term Solution:

I would like to suggest making the num_ctx value a configurable setting within the application's UI. This would empower users to balance performance and capability based on their hardware.

This setting could include a brief explanation of the trade-offs:

Higher Context (e.g., 32k): "Ideal for complex, multi-file tasks. Requires a large amount of VRAM (16GB+ recommended)."
Lower Context (e.g., 4k-8k): "Recommended for faster performance on systems with limited VRAM. May impact the model's ability to process very large codebases at once."

This change would greatly improve the user experience and make Bolt.diy accessible to a much wider audience.
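To make the idea more concrete, here is a rough sketch of how such a setting could be resolved in code. It is only a sketch under my own assumptions; the names (ContextPreset, resolveNumCtx) and the preset values are hypothetical, not part of the current codebase:

```ts
// Hypothetical sketch of a UI-configurable context window for the Ollama provider.
// Resolution order: explicit UI value > DEFAULT_NUM_CTX env var > preset default.
type ContextPreset = "low-vram" | "balanced" | "full";

const PRESET_NUM_CTX: Record<ContextPreset, number> = {
  "low-vram": 4096, // keeps small models fully on 8GB GPUs
  balanced: 8192,
  full: 32768,      // the current hardcoded default; comfortable with 16GB+ VRAM
};

function resolveNumCtx(uiValue: number | undefined, preset: ContextPreset): number {
  const envRaw = process.env.DEFAULT_NUM_CTX;
  const envValue = envRaw ? parseInt(envRaw, 10) : undefined;
  const candidate = uiValue ?? envValue ?? PRESET_NUM_CTX[preset];

  // Guard against NaN or implausibly small values before handing this to Ollama.
  return Number.isFinite(candidate) && candidate >= 2048 ? candidate : PRESET_NUM_CTX[preset];
}

// Example: a user on an 8GB GPU picks the "low-vram" preset in the UI.
console.log(resolveNumCtx(undefined, "low-vram")); // -> 4096
```

The exact UI wiring would of course differ; the key point is that whatever value the user chooses ends up in the num_ctx option Bolt sends to Ollama.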

Thanks for creating such an ambitious project. I hope this analysis is helpful!
