Coding Agents Demystified: Part I - Coding Agent Evaluation Harness
This is Part I of a two-part series on building and improving a coding agent from scratch.
- Part I (this post): Build an eval harness to measure how well models fix bugs
- Part II (upcoming): Use the harness as a reward signal to fine-tune a small model with GRPO
The meta idea here is that the eval harness we create here helps serve Part II by evaluating the quality of the model every step of the way and rewarding the model accordingly. The model gets positive feedback for passing tests and zero for failing.
TL;DR
I created 20 python bug-fixing tasks (or dare I say generated?). Each task is a buggy function paired with a test suite. The model gets the buggy code and has to return a fixed version. The fixed code runs in a Docker container, the tests execute, and the result is pass or fail.
I ran three models (arbitrarily picked) through OpenRouter — google/gemini-2.0-flash, openai/gpt-4o-mini, and qwen/qwen3-coder-next — across all 20 tasks:
Model Tasks Pass Pass Rate 95% CI
google/gemini-2.0-flash 20 20 100.0% [83.9, 100.0]
openai/gpt-4o-mini 20 18 90.0% [69.9, 97.2]
qwen/qwen3-coder-next 20 14 70.0% [48.1, 85.5]
In part 2, we’ll finetune a non-coding-optimized model using GRPO to push that number up.
Note: A fair bit of the code for the harness is AI generated in a very ouroboros like setup.
The nitty gritty
Tasks
A task is a problem statement for the coding agent: here is a buggy function, fix it. +1 if you do, 0 if you don’t. Each task lives in its own directory:
tasks/dict-lookup/
solution.py ← buggy function
test.py ← assertions that pass only on the correct solution
For example:
# tasks/dict-lookup/solution.py
def get_score(scores, player):
"""Return the score for player, or None if they are not in the table."""
# BUG: raises KeyError if player is not in scores, but we don't send this comment to the model
return scores[player]
# tasks/dict-lookup/test.py
from solution import get_score
assert get_score({"ada": 100, "bob": 42}, "ada") == 100
assert get_score({"ada": 100}, "missing") is None # fails with buggy code
assert get_score({}, "anyone") is None # fails with buggy code
The full set is on GitHub.
The bugs cover a range: off-by-one errors, missing edge case handling, wrong string comparison, incorrect algorithm initialization. Two tasks (none-input and string-unicode) tripped up multiple models — both needed reasoning about edge cases not stated in the docstring, which made them the most discriminating.
Getting a completion
OpenRouter handles routing to different model providers with the benefit of needing one API key and client for a variety of models. The runner calls the API with just the buggy code and asks for a json_schema structured response.
Grading in Docker
The grader needs to run arbitrary code returned by an LLM. Running that directly on the host is a bad idea, as in theory, a model could return import os; os.system("rm -rf /") and it would execute with your full permissions. So we need isolation.
Docker gives us:
- a separate filesystem so the model’s code can’t touch the host,
- resource limits so a runaway loop won’t freeze your machine, and
--rmso the container is gone the moment the test finishes.
For each task, we spin up a fresh container and run test.py against the model generated solution.py. Rather than baking the files into the image (which would require a rebuild per task), we write them to a temp directory on the host and volume-mount them in:
def grade(task: dict, solution_code: str) -> GradeResult:
with tempfile.TemporaryDirectory() as tmp_dir:
tmp_path = Path(tmp_dir)
(tmp_path / "solution.py").write_text(solution_code)
(tmp_path / "test.py").write_text(task["test_suite"])
result = subprocess.run([
"docker", "run", "--rm",
"-v", f"{tmp_path / 'test.py'}:/workspace/test.py:ro",
"-v", f"{tmp_path / 'solution.py'}:/workspace/solution.py:ro",
"coding-eval-runner",
], capture_output=True, text=True, timeout=10)
return GradeResult(**json.loads(result.stdout))
Inside the container, docker/runner.py runs test.py as a subprocess and returns JSON to stdout:
def run_task() -> dict:
try:
result = subprocess.run(
["python3", "/workspace/test.py"],
cwd="/workspace",
capture_output=True, text=True, timeout=10,
)
except subprocess.TimeoutExpired:
return {"is_successful": False, "error_type": "timeout", "error_log": None}
if result.returncode == 0:
return {"is_successful": True, "error_type": None, "error_log": None}
return {"is_successful": False, "error_type": "error", "error_log": result.stderr}
Two things that aren’t obvious:
Fresh container per run, shared image. All runs share the same coding-eval-runner image, but each docker run --rm is a new container that gets thrown away after. Swapping the volume mount is how we change what code runs — each task gets its own isolated /tmp/run-xyz/ on the host, mapped to /workspace inside. This means run N can’t bleed into run N+1. Measured overhead: ~0.5s per run, ~30s total for 60 runs, which is fine. In a foundational lab, you would pool containers and route requests accordingly, but this is a toy example.
test.py runs as a subprocess, not via exec(). If the test calls sys.exit(), segfaults, or hangs, exec() would take down runner.py with it and nothing gets returned to the host. Running it as a subprocess means runner.py survives regardless and always sends back a structured result.
The runner loop
for task in tasks:
for model in MODELS:
try:
candidate = get_completion(task, model=model)
result = grade(task, candidate)
f.write(json.dumps({
"model": model,
"task_id": task["id"],
"is_successful": result.is_successful,
"error_type": result.error_type,
"error_log": result.error_log,
}) + "\n")
except Exception:
f.write(json.dumps({
"model": model, "task_id": task["id"],
"is_successful": False, "error_type": "api_error",
"error_log": traceback.format_exc(),
}) + "\n")
f.flush()
f.flush() after every write means a crash mid-run loses nothing already computed. Each task-model pair is wrapped in its own try/except so one API failure doesn’t abort the whole run.
Confidence intervals
We use Wilson score CIs rather than the naive p ± 1.96·√(p(1-p)/n), which breaks at the boundaries — a model that passes all 20 tasks gets [100%, 100%], which overstates certainty.
In practice: Gemini’s [83.9%, 100%] and GPT-4o-mini’s [69.9%, 97.2%] overlap a lot. With 20 tasks you can’t conclude one is actually better. You’d need 50+ tasks to tighten the CIs enough to say anything firm.
How this connects to RL
In Part II, this harness becomes the training loop directly:
- A small model generates a candidate solution for each task
- The grader runs it in Docker and returns
is_successful is_successful = True→ positive reward;False→ zero reward- GRPO updates the model weights toward solutions that pass tests
We don’t use human labeling or a reward model, and only use tests. It’s a good fit for code because the output is verifiable: it either passes the tests or it doesn’t. Part II of the blog post would be a lot more involved if I were training a non coding LLM, lucky me.
How this maps to real-world benchmarks
This harness is structurally ~similar to production coding benchmarks but stripped down. The differences are in scale and scaffolding:
| This harness | SWE-bench | |
|---|---|---|
| Tasks | 20 | 2,294 |
| Scope | Single buggy function | Full repository |
| Input to model | ~10 lines | Thousands of files |
| Output | Fixed function | Multi-file patch |
| Test execution | python test.py | Repo-specific setup, deps, env |
In the real world:
- On the simpler end of the spectrum, HumanEval and MBPP are the most similar to this with isolated functions and self-contained tests. These are also saturated and not as useful beyond sanity checks.
- On the harder side, SWE-bench has real GitHub issues, real repos and the model has to find the relevant code before it can fix anything. This is more relevant to training real coding agents.
The toy version here lets us iterate on the eval infrastructure and RL loop fast, without the overhead of cloning thousands of repos and managing per-task environments.
If you were Anthropic building Claude Code, the benchmark stack would probably look like:
- SWE-bench — 500 human-curated instances, the main signal for real-world issue resolution and what shows up on leaderboards
- LiveCodeBench — competitive programming problems scraped live from Codeforces, LeetCode, AtCoder; harder to game since the problems are new
- BigCodeBench — function-level tasks that require real library calls (numpy, pandas, requests), closer to day-to-day coding than algorithmic puzzles
- Internal evals — every serious lab has private benchmarks; public ones get Goodharted
HumanEval and MBPP are basically just sanity checks now — models have been at 90%+ for years.
The code
Everything is on GitHub. Clone it, add your OpenRouter key, and ./scripts/run_all.sh runs the full eval.
If you have questions or want to discuss, find me on X.