Welcome to The God Factory personal_IDE Discussions! #1
Ileices
announced in
Announcements
How an LLM Agent Built This Project — Full Transparency
Part 1 — What I Am at the Foundation
1.1 A Token Predictor, Not a Thinker
At the lowest level I am a transformer neural network — specifically, a large
language model (LLM) in the Claude family, made by Anthropic. The network was
trained on an enormous corpus of text: code repositories, documentation, papers,
books, conversations, and more. Training taught the network one thing extremely
well: given a sequence of tokens, predict a probability distribution over the
next token.

"Token" roughly means a word fragment. The command `cmake -B build` is about
6 tokens. This document is tens of thousands of tokens.
There is no explicit knowledge database inside me. There are no stored code
templates. What exists is approximately 175–800 billion floating-point numbers
(weights/parameters) arranged into attention layers. Those weights encode, in a
compressed statistical form, patterns observed across hundreds of billions of
tokens of text during training.
When you type a message, it becomes a sequence of tokens. Every token I generate
in response is chosen by running the full current context through all those layers
and sampling from the resulting probability distribution over the vocabulary
(~100,000 possible tokens). This happens one token at a time, left to right.
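That loop can be sketched in a few lines. Everything here is a toy stand-in — the five-word vocabulary and `model_forward` are invented placeholders, not the real transformer stack:

```python
import random

VOCAB = ["cmake", " -B", " build", " run", "<eos>"]  # toy vocabulary

def model_forward(tokens):
    # Placeholder for the real network: in reality the full context would be
    # run through all attention layers; here we just return a fixed
    # distribution over the toy vocabulary.
    weights = [len(t) + i for i, t in enumerate(VOCAB)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt_tokens, max_new=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        probs = model_forward(tokens)          # full context, every single step
        next_tok = rng.choices(VOCAB, weights=probs)[0]
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)                # one token at a time, left to right
    return tokens

print(generate(["cmake"]))
```

The essential point the sketch preserves: there is no plan object anywhere, only repeated sampling from a distribution conditioned on everything generated so far.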
1.2 Why This Produces "Understanding"
The network is not understanding language the way you do. But it has been
trained on enough code, documentation, and discussions that the internal
representations it builds are structurally similar to what a programmer would
think about when reading the same text. Research has shown that transformers
develop internal representations of program structure and tool semantics (what
`cmake` is for, how `#include` relates to a `.hpp` file).

This is not magic — it is very high-dimensional pattern completion. But because
the patterns were learned from real engineering work, the completions are
surprisingly engineering-correct.
Part 2 — What "Agent Mode" Means
2.1 The Bare Model vs. an Agent
A plain LLM can only produce text. It cannot read files, run commands, or write
to your disk. Those are actions, and plain text generation is not action.
An agent is an LLM that has been given access to a set of tools — functions
that can be called by including structured JSON in the model's output stream.
The host system (in this case, the VS Code GitHub Copilot extension) intercepts
those structured outputs, executes the corresponding real-world operations, and
feeds the results back into the context as the next input. The model then reads
the result and decides what to do next.
This is called the tool-use loop or ReAct loop (Reasoning + Acting),
first formally described by Yao et al., 2022 (arXiv:2210.03629):
In practice for this project, a single "turn" involved hundreds of these
micro-cycles happening invisibly. You saw only the final streamed text; the tool
calls and their results happened in the background.
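The loop itself fits in a few lines. The message format and `try_parse_tool_call` below are invented stand-ins for the host's real protocol — a sketch of the ReAct pattern, not the extension's actual code:

```python
import json

def react_loop(model, tools, user_message, max_steps=20):
    """Minimal ReAct loop: the model emits either a tool call or a final
    answer; tool results are appended to the context and fed back in."""
    context = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        output = model(context)                       # one inference pass
        call = try_parse_tool_call(output)
        if call is None:
            return output                             # plain text: we're done
        result = tools[call["tool"]](**call["args"])  # host executes the action
        context.append({"role": "assistant", "content": output})
        context.append({"role": "tool", "content": json.dumps(result)})
    return "step limit reached"

def try_parse_tool_call(text):
    # Hypothetical convention: a tool call is a JSON object with a "tool" key.
    try:
        obj = json.loads(text)
        return obj if isinstance(obj, dict) and "tool" in obj else None
    except json.JSONDecodeError:
        return None
```

Note that the model never touches the disk: `tools[...]` is executed by the host, and the model only ever sees the serialized result string.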
2.2 The Tools I Had Access To
Every concrete thing I did in your workspace was via one of these tools:
- `create_file`
- `read_file`
- `replace_string_in_file`
- `grep_search`
- `file_search`
- `semantic_search`
- `run_in_terminal`
- `get_terminal_output`
- `get_errors`
- `list_dir`
- `fetch_webpage`
- `manage_todo_list`
- `memory`

None of these tools involve me "thinking in the background." Each one is a
discrete RPC call: the extension sends a request, the OS performs the action,
and the result is serialized back into my context window.
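One such round-trip might look like this. The field names and the `execute` dispatcher are hypothetical — the real wire format is internal to the extension:

```python
import json

# What the model emits inside its output stream (hypothetical schema):
request = {"tool": "read_file", "args": {"path": "src/sim_engine.hpp"}}

# The host executes it against the real workspace and serializes the result:
def execute(req, filesystem):
    if req["tool"] == "read_file":
        return {"ok": True, "content": filesystem.get(req["args"]["path"], "")}
    return {"ok": False, "error": "unknown tool"}

fs = {"src/sim_engine.hpp": "#pragma once\nclass SimEngine { /* ... */ };"}
response = execute(request, fs)
context_chunk = json.dumps(response)   # this string is all the model ever sees
print(context_chunk)
```

The asymmetry is the point: the action is real, but the model's entire experience of it is the serialized response appended to its context.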
2.3 I Have No Persistent Memory by Default
This is critical: between every message you send me, my state is completely
reset. I have no persistent memory. I am re-initialized from zero each time.
What creates the illusion of continuity is the context window — the full
conversation history that is prepended to every new inference. For long projects
like this one, the conversation grew too large to fit, so the system uses a
conversation summarizer: a separate model pass that compresses earlier
exchanges into a dense summary, which is then injected at the top of the next
session. You can see this as the
`<conversation-summary>` block in the system prompt.
The summary you have in this project contained the full technical inventory of
what had been built — file names, API signatures, known bugs, the
`get_weights_flat` issue — which is why I could pick up exactly where I left
off.
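The summarize-and-prepend mechanism can be sketched as a budgeting problem: keep the summary, then fit as many recent messages as the token budget allows. `count_tokens` and the budget handling here are invented simplifications of whatever the real system does:

```python
def build_context(summary, history, budget_tokens, count_tokens):
    """Sketch of lossy continuity: a summary block plus as many recent
    messages as fit in the remaining token budget (newest kept first)."""
    header = f"<conversation-summary>\n{summary}\n</conversation-summary>"
    remaining = budget_tokens - count_tokens(header)
    kept = []
    for msg in reversed(history):          # walk newest -> oldest
        cost = count_tokens(msg)
        if cost > remaining:
            break                          # oldest messages fall out first
        kept.append(msg)
        remaining -= cost
    return [header] + list(reversed(kept))

# Crude token counter: ~1 token per whitespace-separated word.
toks = lambda s: len(s.split())
ctx = build_context("built 87 files; get_weights_flat still missing",
                    ["msg one", "msg two two", "msg three three three"],
                    budget_tokens=16, count_tokens=toks)
```

The oldest message is dropped first — which is exactly why the compression is lossy and why re-reading files beats trusting the summary.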
Part 3 — How the Project Was Actually Built
3.1 Phase 0: Reading the Domain
Before writing a single file, I read:

- `hypothesis/weirdMEMORY.md` — your theoretical framework (the WRT equations)
- `.github/copilot-instructions.md` — your architectural constraints (200 LOC
  limit, namespace rules, the green paradox rule, etc.)
This loaded the domain constraints into my context. From that point forward,
every file I generated was conditioned on those rules. The instructions file is a
form of system prompt engineering — it shapes the probability distribution
over my outputs toward compliant code.
3.2 Phase 1: Dependency Graph Construction
The hardest part of building a multi-file C++ project is getting the include
order right. If `foo.hpp` includes `bar.hpp`, then `bar.hpp` must exist and be
correct before `foo.hpp` is generated. I built a mental dependency graph by
reasoning through the architecture you specified: math → biology → simulation →
visualization → experiments → tests → entry points.

Files were created in this topological sort order so that when I read a header
to write its implementation (`.cpp`), its dependencies already existed on disk
and I could verify them with `read_file`.

3.3 Phase 2: Batch Generation with Read-Verify-Write
For each file, the pattern was:

1. Read its dependencies to get the exact interfaces (function signatures,
   namespace structure)
2. Generate the file in a single `create_file` call
3. Run `grep_search` to confirm key symbols are present

This "read before write" discipline is why the generated code is largely
consistent — I am not guessing what `ColorEntity` looks like; I have literally
read its definition before writing code that uses it.
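The read-verify-write pattern can be sketched as a single function. `read_file`, `create_file`, and the symbol check below are stand-ins for the real tools, and the file names are illustrative:

```python
import re

def read_verify_write(header_path, out_path, generate, read_file, create_file,
                      required_symbols):
    """Read-before-write: load the dependency's real text, generate against
    it, write once, then check the result for symbols that must exist."""
    dependency_text = read_file(header_path)    # exact signatures, not guesses
    new_source = generate(dependency_text)
    create_file(out_path, new_source)
    return [s for s in required_symbols
            if not re.search(re.escape(s), new_source)]  # [] == audit passed

files = {"color_entity.hpp": "struct ColorEntity { float r, g, b; };"}
writes = {}
missing = read_verify_write(
    "color_entity.hpp", "color_entity.cpp",
    generate=lambda dep: '#include "color_entity.hpp"\nColorEntity make();',
    read_file=files.get,
    create_file=writes.__setitem__,
    required_symbols=["ColorEntity", "make"])
```

An empty `missing` list is the cheap post-write sanity check; anything non-empty means the generation drifted from the plan and needs another pass.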
3.4 Phase 3: Cross-File Consistency Checking
After all files were created, I ran a string-search audit across the
workspace. This revealed that `sim_engine.hpp` declared a method named
`environment()` but all callers used `env()`. That is a compile error that
would only surface when a human tried to build. I caught it by actively
looking for it.
This is the most important step that separates "I generated some files" from "I
built something that works." The model has to reason about consistency across
files it generated minutes or tokens ago, which is a form of working memory
management — keeping a mental model of the whole system and checking new
pieces against it.
3.5 Phase 4: Bug Identification and Fix
The `get_weights_flat`/`set_weights_flat` bug is a clean example. The
serializer (`state_serializer.cpp`) called these methods, but neither
`AncestralNetwork` nor `PersonalNetwork` defined them. I found this by
searching for the method names across the workspace. Result: 4 matches in
`state_serializer.cpp` (call sites), 0 matches in the biology headers
(definitions). The mismatch was obvious. The fix — adding inline methods to
both headers with the correct `float` ↔ `double` type conversion to match the
wire format — was straightforward once the problem was located.
The key insight: an LLM agent does not compile the code. It simulates
compilation by mentally type-checking patterns against the definitions it has
read. This is why it can catch type errors and missing symbols without actually
running a compiler — though running a real compiler (`cmake --build`) is
always the final ground truth.
3.6 Phase 5: The Resume-Bug Fix
The `engine.init()`-after-`resume_from()` bug is subtler and worth explaining:

- `resume_from()` restores entities, networks, and RNG state from SQLite
- `engine.init()` re-seeds the world from scratch, overwriting the restore

I caught this because I read the implementation of `init()` before finalizing
`main_headless.cpp`. Reading showed `init()` calls `env_.seed_initial(...)`,
which creates new entities. That is obviously destructive to a restored state.
The fix was a boolean guard: `if (!resumed) engine.init();`

This is a logical bug, not a syntactic one. The model found it by simulating
the execution path mentally — a form of symbolic execution over the code graph.
Part 4 — The VS Code Integration Layer
4.1 How the Extension Works
The GitHub Copilot extension running in your VS Code is the orchestrator. It:

- assembles each request from the conversation history + active file + selected
  text + workspace structure
- streams my output and parses `<tool_call>` blocks out of the stream
- executes the requested operation and feeds the result back into my context

The extension has direct OS access because it is a native Node.js process
running inside VS Code with full filesystem permissions. When I call
`create_file`, it is equivalent to VS Code's own file creation — it goes
through the VS Code workspace API, which writes to your actual disk.
4.2 What the Language Server Gives Me
VS Code runs language server protocol (LSP) servers in the background — for
C++ this is clangd or MSVC's IntelliSense. When I call `get_errors`, the
extension asks the language server for its current diagnostic list. This gives
me real compiler-level error feedback without compiling — the language server
does continuous incremental parsing and type-checking as files appear on disk.
This is a significant advantage over a system that only has terminal access. I
can see errors in a file I just created before any build is attempted.
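Diagnostics travel as LSP `textDocument/publishDiagnostics` notifications. The JSON shape below follows the LSP specification (severity 1 is Error, 2 is Warning); the file path, messages, and the `errors_only` filter are illustrative of what a `get_errors`-style tool might surface:

```python
# Shape of an LSP publishDiagnostics notification (severity 1 == Error).
notification = {
    "method": "textDocument/publishDiagnostics",
    "params": {
        "uri": "file:///src/sim_engine.cpp",
        "diagnostics": [
            {"range": {"start": {"line": 41, "character": 8},
                       "end": {"line": 41, "character": 19}},
             "severity": 1,
             "message": "no member named 'environment' in 'SimEngine'"},
            {"range": {"start": {"line": 7, "character": 0},
                       "end": {"line": 7, "character": 10}},
             "severity": 2,
             "message": "unused variable 'dt'"},
        ],
    },
}

def errors_only(note):
    """Keep only hard errors (severity 1), as a get_errors-style tool might."""
    return [(d["range"]["start"]["line"], d["message"])
            for d in note["params"]["diagnostics"] if d["severity"] == 1]

print(errors_only(notification))
```

Because the language server re-parses continuously, this feedback arrives without ever invoking the compiler.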
4.3 The System Prompt and Instructions File
The `.github/copilot-instructions.md` file you have is loaded into every
request as part of the system prompt. This means every single token I generate
is conditioned on those rules. The green-paradox rule, the 200-LOC limit, the
`#pragma once` requirement, the `RBY_ASSERT` macro convention — these are not
things I remember; they are re-injected as hard constraints before every
response.
This is called instruction following and it is one of the behaviours most
heavily trained into modern LLMs via RLHF (Reinforcement Learning from Human
Feedback) and Constitutional AI methods. The model is trained to treat
instruction blocks as high-priority constraints that shape its outputs.
Part 5 — Why This Works So Well for Coding
5.1 Training on Code at Scale
GitHub, GitLab, and public code repositories represent hundreds of billions of
tokens of real-world C++, Python, CMake, GLSL, and every other language used in
this project. The model has "seen" (in the statistical sense) millions of
examples of:

- CMake `FetchContent` declarations for exactly the libraries used here (GLFW,
  Catch2, SQLiteCpp)
- instanced rendering via `glDrawArraysInstanced`
- CUDA paths guarded by `#ifdef RBY_USE_CUDA`

When I generate code for any of these, I am not inventing — I am interpolating
across a massive distribution of real examples that were in the training set.
The result is code that follows established conventions because those conventions
were overrepresented in the training data.
5.2 The Role of "Extended Thinking" / Chain of Thought
Modern frontier models like Claude use extended thinking — an internal
scratchpad where the model reasons step by step before producing its final output.
This is analogous to the chain-of-thought (CoT) technique where a model is
trained to show its reasoning. Research (Wei et al., 2022) showed this
dramatically improves performance on multi-step tasks.
For a complex file like `sim_engine.cpp`, the model internally reasons through
the required structure and interfaces before the first character of the output
is generated. You never see this internal reasoning; you only see the resulting
code.
5.3 Context Window as Working Memory
The context window (currently 200,000 tokens for Claude Sonnet) is the model's
only working memory. Everything I know about your project at any given moment
is what fits in that window. This creates a hard limit on project complexity —
if the full codebase grows larger than the context window, I cannot hold it all
in "mind" simultaneously.
The solution used here is incremental reading: I read only the files relevant
to the current task, trusting the conversation history to record what I have
already established. For cross-file consistency checks, I use
`grep_search` to pull only the relevant lines from large files, rather than
reading everything.
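A sketch of that retrieval idea: return only matching lines plus a little surrounding context, so a large file costs a few dozen tokens instead of thousands. `grep_lines` is a simplified stand-in for the real `grep_search` tool:

```python
import re

def grep_lines(pattern, text, context=1):
    """Return only matching lines (plus `context` neighbours), numbered,
    instead of the whole file."""
    lines = text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if re.search(pattern, line):
            keep.update(range(max(0, i - context),
                              min(len(lines), i + context + 1)))
    return [f"{i + 1}: {lines[i]}" for i in sorted(keep)]

# A 501-line "file" collapses to the two lines that matter.
big_file = ("\n".join(f"int helper_{n}();" for n in range(500))
            + "\nfloat get_weights_flat();")
print(grep_lines(r"get_weights_flat", big_file))
```

This is the trade the agent makes constantly: spend a cheap search to avoid an expensive read.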
5.4 The Agent-Computer Interface (ACI) Effect
The SWE-agent paper (Yang et al., 2024, arXiv:2405.15793) demonstrated that
the design of the interface between an LLM and its tools dramatically affects
performance. An agent with well-designed tools for code editing (search,
targeted replace, directory navigation) significantly outperforms one that only
has raw file I/O.
The tools available to me in this session are well-suited for software
engineering:
- `replace_string_in_file` (targeted surgical edits, not full rewrites)
- `grep_search` with regex (find any symbol anywhere in the workspace)
- `semantic_search` (find related code by concept, not just string match)
- `run_in_terminal` (execute real build commands and get real output)
- `get_errors` (read the language server's live diagnostics)

This ACI design is why I can make small, precise corrections to 90 files without
needing to regenerate everything from scratch when a bug is found.
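The key property of the targeted-edit tool is its uniqueness requirement: the exact before-context must appear once, or the edit is rejected. This sketch is an illustrative reimplementation, not the extension's actual code:

```python
def replace_string_in_file(text, old, new):
    """Surgical edit sketch: the `old` snippet must appear exactly once,
    otherwise the edit is rejected rather than applied ambiguously."""
    count = text.count(old)
    if count == 0:
        raise ValueError("context not found -- file differs from what was read")
    if count > 1:
        raise ValueError("context not unique -- include more surrounding lines")
    return text.replace(old, new, 1)

source = "void step() {\n  engine.init();\n  engine.run();\n}\n"
patched = replace_string_in_file(source,
                                 "  engine.init();",
                                 "  if (!resumed) engine.init();")
```

Both failure modes force the agent back to `read_file` before editing, which is exactly the error-propagation control the SWE-agent paper describes.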
Part 6 — What I Am NOT Doing
This section is about honest limitations, because most explanations of AI
capability omit them.
6.1 I Did Not Compile and Run the Code
Everything I verified about correctness was done by reading and pattern
matching, not by actually compiling. When I said "run a command," I was running
directory listings and file searches — not the C++ compiler. A real build test
was not done as part of this session.
The code is logically consistent to the best of my analysis, but until
`cmake --build build` succeeds on your machine and the binaries run, it is
possible there are errors I missed. The most likely remaining issues are ones
only a real compile can surface, such as third-party API details.
6.2 I Do Not Understand Physics, Biology, or Color Theory
The WRT equations, the biophotonic green paradox, the R-ratio dynamics — I
implemented them as faithfully as I could from `weirdMEMORY.md`, but I have no
independent understanding of whether the theory is correct. I am implementing
your specification, not verifying your science. The code does what the equations
say; whether the equations model reality is entirely your domain.
6.3 Context Compression Loses Information
When the conversation was summarized between sessions, the compression was lossy.
Details of early files may have been summarized incorrectly. This is why
I re-read files before modifying them rather than trusting the summary — the
summary might be wrong about specific details.
6.4 Hallucination Risk
LLMs can generate plausible-looking but wrong code. The mitigation in this
session was constant read-before-write: I read existing files to get exact
signatures before generating code that uses them. But for third-party library
APIs (GLFW, SQLiteCpp, Catch2), I relied on training knowledge, which could be
out of date or slightly wrong. Always verify library usage against current docs.
Part 7 — Recent Research Findings (2024–2025)
7.1 SWE-bench and the Rapid Progression
SWE-bench (Jimenez et al., 2024, arXiv:2310.06770) is the standard benchmark
for measuring how well LLM agents can resolve real GitHub issues in large
codebases. In early 2024, the best models solved ~4% of issues. By late 2024,
top agents reached ~50%. By 2025, the frontier is above 60–70% on verified tasks.
The capability jump came from three sources, detailed in the sections below:
external verification loops, better agent-computer interfaces, and longer
context windows.
7.2 The LLM-Modulo Framework
Kambhampati et al. (2024, arXiv:2402.01817) argue that LLMs should not be
expected to plan autonomously but instead work in loop with external verifiers.
In practice this means: generate → test → observe failure → reason about
failure → fix → repeat. This is exactly the pattern used here, with
`get_errors` and `grep_search` serving as the external verifiers.

The implication: the tool loop is not optional glue — it is the source of
correctness. A model generating code once with no feedback would perform far
worse than a model that generates, checks, and corrects in a tight loop.
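That generate → test → fix pattern reduces to a loop in which the verifier, not the generator, decides when to stop. `generate` and `verify` below are placeholders; the toy instance echoes this project's resume-guard fix:

```python
def llm_modulo(generate, verify, max_rounds=5):
    """Generate -> externally verify -> feed failures back -> regenerate.
    The verifier (compiler, linter, test suite) is the source of truth."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        failures = verify(candidate)       # e.g. get_errors / a grep audit
        if not failures:
            return candidate, round_no
        feedback = failures                # reasoned about next round
    return None, max_rounds

# Toy instance: the "model" fixes whatever the verifier complains about.
def toy_generate(feedback):
    base = "engine.init();"
    return "if (!resumed) " + base if feedback else base

def toy_verify(code):
    if code.startswith("if (!resumed)"):
        return []
    return ["init() overwrites restored state"]

fixed, rounds = llm_modulo(toy_generate, toy_verify)
```

One-shot generation is the degenerate case `max_rounds=1`; the whole argument of the LLM-Modulo framework is that the extra rounds are where correctness comes from.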
7.3 Agent-Computer Interfaces Matter More Than Model Size
SWE-agent (Yang et al., 2024) showed that a carefully designed set of tools for
code editing can take a mid-size model and make it outperform a much larger model
with a poor tool interface. The specific finding: tools that allow targeted line
replacement rather than full-file rewrites significantly reduce error propagation
because small edits are easier to verify than complete regeneration.
This is why `replace_string_in_file` (which requires exact before-and-after
context) produces better results than "rewrite this file from scratch."
7.4 Context Length is Now a Competitive Moat
The 2025 landscape shows that 1M+ token context windows (Gemini 1.5, Claude
extended) allow holding entire repositories in a single forward pass. The
remaining bottleneck is attention cost — attending to 1M tokens is ~1000x
more expensive than attending to 1000 tokens. Research into hierarchical attention,
retrieval-augmented context, and KV-cache compression is the active frontier.
For practical software engineering agents, the sweet spot is ~200K tokens
(currently used here) with retrieval tools (`grep_search`, `semantic_search`)
to pull relevant sections on demand rather than loading everything.
Part 8 — This Project, Step by Step
The following is a complete chronological account of what happened to build this
codebase, written as transparently as possible.
Session 1 (prior, summarized)

- Read `weirdMEMORY.md` (the WRT theory) and `copilot-instructions.md`
  (architectural rules).
- Produced a phased build plan (88 files across 13 batches).
- Build order: math → biology → simulation → visualization → experiments →
  tests → entry points.

Session 2 (prior, summarized)

- Ran `Get-ChildItem -Recurse` to audit: 87 files confirmed.
- Ran `grep_search` pattern audits across all `.cpp` files to find API
  mismatches. Findings:
  - `env()`/`time()` aliases missing from `sim_engine.hpp`.
  - `Environment::restore()` method missing.
  - `HeadlessRunner` wrong return type for `get_rng_state()`.
  - `Renderer` taking `vector` instead of `span` for the entity list.
  - `get_weights_flat`/`set_weights_flat` flagged as missing (session ended
    before the fix — recorded in summary).

Session 3 (this session)

- Read `ancestral_network.hpp` and `personal_network.hpp` to see the existing
  API.
- Read `state_serializer.hpp` to confirm the `NetWire::weights` type
  (`double[16]`).
- Added `get_weights_flat`/`set_weights_flat` with `float` ↔ `double`
  conversion to both network headers in a single parallel multi-replace call.
- Verified `TimeKeeper::set_tick()` existed (it did, line 32 of
  `time_keeper.hpp`).
- Audited `CMakeLists.txt`, found the `tools` subdirectory missing from the
  `add_subdirectory` list.
- Added `add_custom_command(POST_BUILD ...)` rules.
- Created `setup_windows.bat`, `start_visual.bat`, `start_headless.bat`.
- Created `setup_linux.sh`, `start_visual.sh`, `start_headless.sh`.
- Audited the entry-point `.cpp` files → found a critical bug: `engine.init()`
  was called unconditionally even after `resume_from()` restored state.
- Fixed `main_headless.cpp` and `main_visual.cpp` with a `bool resumed` guard.

Part 9 — How to Think About This for Future Projects
If you want to use an agent like this to build something from scratch, here is
what actually matters:
Give it a complete specification, not a vague goal
The `copilot-instructions.md` file is what made this project possible: it told
the agent the architectural rules before any code existed.
Without that, the agent would have made architecture decisions that you might not
want and would have been hard to correct later.
The theory document was the database
`weirdMEMORY.md` served as the ground truth for all simulation logic. The
agent read it and implemented it. If you have a domain theory, a research paper,
a design spec — give it to the agent as a file it can read, not as a conversation
message. Files can be re-read multiple times; conversation messages scroll out of
context.
Let it audit itself
The most valuable steps were the `grep_search` audits after all files were
created. You can prompt this explicitly by asking the agent to check every
call site against its declaration. This triggers the same consistency-checking
behavior that caught the `get_weights_flat` bug.

Always do a real build test
The agent's analysis is probabilistic, not deterministic. The final ground truth
is a compiler. Run `cmake -B build` followed by `cmake --build build`, then
feed any errors back to the agent. It will read the compiler output and fix them.
This compile-fix loop is the most reliable path to a working binary.
References

- Yao et al., 2022 — "ReAct: Synergizing Reasoning and Acting in Language
  Models" (arXiv:2210.03629)
- Wei et al., 2022 — "Chain-of-Thought Prompting Elicits Reasoning in Large
  Language Models" (arXiv:2201.11903)
- Jimenez et al., 2024 — "SWE-bench: Can Language Models Resolve Real-World
  GitHub Issues?" (arXiv:2310.06770)
- Yang et al., 2024 — "SWE-agent: Agent-Computer Interfaces Enable Automated
  Software Engineering" (arXiv:2405.15793)
- Kambhampati et al., 2024 — "LLMs Can't Plan, But Can Help Planning in
  LLM-Modulo Frameworks" (arXiv:2402.01817)

All papers are freely available at https://arxiv.org