Building claude code from scratch

The other day I was casually prompting claude code, and I had this feeling…
That claude code had gotten dumb.
I mean just a few days ago it was breezing through code,
solving fairly high-level tasks without much planning.
It still needed corrections.
But most of the time the first prompt was already heading in the right direction.
And I was able to land on the desired solution within a few iterations.
But over the last few days, I was unable to extract the desired output no matter how much I tried, and eventually had to resort to manual coding.
This was just days before release of opus 4.8

I don’t know what happened.
But I felt this quality downgrade before too.
AFAIR that time also new model were dropping soon.
Also whenever new models come out from these big AI “startups”;
hitting rate limits becomes a frequent occurrence.

This made me realize that I have no control over my dev environment.
Time to fix that.

Recently there has been buzz about local models like qwen3.6 and gemma4 being viable for dev work. Naturally, I wanted to see how much of that hype survived contact with reality.
So I set up openrouter keys and opencode to try them out. But…

So yeah... This side of the ecosystem is still immature and needs some tinkering to make it work. But I have no idea how these systems work, and I can’t keep relying on frontier labs forever.
Fine.. Let's open the hood and see what's actually going on.

First Steps

Okay lets try to understand the fundamentals of claude code and build it from scratch. Unfortunately... that means I need to write Python manually ....like a medieval programmer.… or do I??

Maybe I can use that...

Self-mutating systems are nothing new in computer science.
We shouldn’t have to wait for a claude code-level system to get started. Fundamentally, what claude code does is read the code from filesystem, analyze it, generates update version, and write it back to filesystem.
So.... if we restrict ourselves to a single main.py we can create a similar setup.
Just on a smaller scale, which allow us to prompt our way through the rest.

We can just read this main.py, call an LLM with the whole file content and a prompt which outputs the complete updated file, parse this content and write it back to main.py

Time for a quick test.

Yes... it worked.

Jargons

Before we keep adding features, I want to establish some terminology. Because if you spend enough time in AI Twitter, eventually everyone starts using different words for the same thing.

So an llm is basically a stateless pure function.
It takes in some text and outputs some text.
As functional bros would remind us, a pure function doesn’t have any side effects. Well in this case, no side-effect apart from water, energy and memory crisis.

So... we need to add some stuff around LLMs to make them useful.
Stuff like... tools, memory, retries, context, etc and coordination logic to tie it all together.
That's basically a runtime. People also call it a harness, scaffolding, orchestrator, etc. Yes, the terminology is messy. But for our purpose, lets just stick to runtime.

This runtime can be broadly divided into 2 camps, workflows and agents.

A workflow is fixed execution. Like we have here. Read file → Call model → Write file. The steps and control flow are predetermined

But an agent is different. We don’t specify the exact steps. Instead, we give the system a goal, and it decides what actions to take.

So in a sense it's similar to how imperative and declarative programming differ. Imperative tells the computer how to do something (step-by-step), like C, Go, etc. While declarative tells it what you want to achieve, like SQL, HTML, etc.
This is not a shallow comparison, as it shapes how we interact with these kinds of system. Will explore more on this in later posts.

So right now we have created very primitive a workflow, but eventually we want to make it agentic.

Having Constraints

Now we understand what we are building, lets add some constraints to our project which will give us some architectural direction. Cause constraints beget creativity.

First, capability. The end goal isn't to build an AI demo. The end goal is to build something I would actually want to use. So Claude Code is our rough benchmark.

Second is early failure detection.
Because LLMs are non-deterministic and running it in a loop compounds.
So lets not get into a situation where we are debugging a symptom that's twenty steps removed from the original problem.
Its better to halt early than burning through tokens on wrong solution.

Third is modularity i.e every LLM call should have exactly one job.
One call for planning.
Another one for verifing.
Another one modifing code.
And so on...
If one prompt is trying to do fifteen different things at once, debugging becomes a nightmare. It also makes it much harder to swap models, test improvements, or tune individual pieces later.
So we're going to keep responsibilities separated wherever possible.
This also aligns with the unix philosophy of doing one thing well, and the single responsibility principle from SOLID.

That's enough architecture for one day. We have a direction now. We can add to this as requirements come up.

So going back to our workflow, since we want early failure signals, lets add one more step after writing main.py

Now how should verify work? Here we have 2 options -

either it's independent of the updated code, e.g. lint checks, existing test suites, build checks, etc
or it depends on the updated code, like new unit tests that depend on the interface of the generated code

For now lets simplify our life and go with option 1, because with the second option we would need to ensure the correctness of the tests too.
Keeping build checks, lint checks and any other explicit check that user has mentioned in the instruction.
For that we will make another LLM call with the user instruction and a verification prompt to output json, something like

{
	"build_command": "...",
	"lint_command": "...",
	"test_command": "..."
}

The openrouter apis have response_format where we can provide a schema. This ensures that the response generated by the LLM is in json. Here it is in action...

Lets add a few more convenience things.
We don’t always want to change the codebase, sometimes we just want to ask questions. For that purpose, adding a basic decision gate on the user instruction to check whether it's a query or mutation.
Also, I have been hard-coding the input filepath and output filepath until now. Instead extract them from the user instruction. The decision gate should return

{
	"instruction_type": "query|mutation",
	"input_filepath": "…",
	"output_filepath": "…"
}

Now our workflow would look something like this

For a while, everything felt magical.
I was refactoring code through prompts. Adding features. Cleaning up old functions. The whole thing felt suspiciously smooth.
And then it deleted 90% of main.py

Investigation

Lets see what's happening.
The LLM is able generate the complete output for file content.
But when I was refactoring, the LLM decided to rewrite the prompt.
Now the new prompt contains backticks as part of an example. That's the issue.
The current process_response function is a very basic state machine and doesn’t handle nested backticks very well.
This looks like a good problem to test our runtime on.
Now trying to fix process_response function to handle nested backticks.
Not working.
Revising the prompt.
Still not working.

Here's example prompts that i tried.

"In process_response, handle nested code blocks that occur within strings, comments, etc. Each code block starts with triple backticks followed by a language identifier and ends with triple backticks. Ensure the return type of process_response remains unchanged."

"Update the process_response function in main.py to correctly handle nested code blocks that appear within strings, comments, etc. Each code block starts with triple backticks followed by a language and ends with triple backticks. Ensure the function's return type remains unchanged"

"in process_response, response can have nested code blocks. these nested code blocks occur as a part of strings, comments, etc. fix process_response to handle these cases and outputs a flat list of code blocks and markdown blocks. each code block starts with triple backticks followed by language and ends with just triple backticks. process_response return type should remain same. do not use regex."

"process_response's output can have nested code blocks. these nested code blocks can occur as a part of strings, comments, etc. fix process_response function to handle these cases and output a flat list of code blocks and markdown blocks. maintain a depth count to achieve this. each code block starts with triple backticks followed by language and ends with just triple backticks. process_response return type should remain same."

Weird.
This felt like exactly the sort of problem LLMs are supposed to be absurdly good at. Parse some text. Handle nested delimiters. Generate a small state machine.
Apparently not...
I can give detailed instructions on how to do it, but at that point I might as well manually write the code. In my current claude code usage pattern also, I don’t give instructions more detailed than this.

Parking this for later. The medieval programmer inside me was starting to suspect I should just write the fix manually.

Lets try solving it another way...
This looks like a good place to add self-correcting capabilities to our system.
First I generated test_main.py with some test cases, with the previous run's output file content as one of the test case.
Then I added an instruction to perform testing using this test_main.py file.
Cool, the test cases are working.

Now feeding back pytest’s results into the LLM instruction, thus allowing it to correct its own mistakes.
Lets maintain a variable to store our verification step results and add it to the prompt for updating file content.
Retry this update-verify loop until the verifier is satisfied or max 3 retries.
And with that, we arrive at our first version of an agent.
Now our architecture looks like-

Lets see it in action.

Its still not working. Tried different prompt. But no luck...

I guess the pytest's output doesn't have enough data which makes sense.
It just shows which tests are failing, not the exact test case.
So help our runtime a bit by providing test case and actual output.
Ok now we have done that. Trying again. Third time is a charm 🤞

Congrats 🎉 🎉 We have successfully fixed our first bug using our own agent.
And yeah.. It was just a one line change. Should have just manually written it.

Though its not been a smooth sail. There were lot of issues when working with raw LLMs which I don't have to think when using claude code. Each model has their own weird quirk. Such as

qwen models have default reasoning on, so have to explicitly turn off reasoning
json_object response format doesn't work at all when reasoning is on. some providers weren't able to get json even with reasoning off
there are no consistency/quality assurance in openrouter responses, have to pin provider.
LLMs tend to use regex as their initial approach
even the final result I got is a not through single attempt. have to roll the dice multiple times
etc...

Complaints aside. We have a working agentic loop now.
The result isn't remotely close to Claude Code.
But its a good foundation to further improve upon.
In the next part, we will see how to make it work in an actual codebase.

[devLog]Building claude code for local models from scratch(v0.1)

First Steps

Jargons

Having Constraints

Investigation

Comments