Coding Agents Demystified: Part I - Coding Agent Evaluation Harness


This is Part I of a two-part series on building and improving a coding agent from scratch.

  • Part I (this post): Build an eval harness to measure how well models fix bugs
  • Part II (upcoming): Use the harness as a reward signal to fine-tune a small model with GRPO

The meta idea here is that the eval harness we create here helps serve Part II by evaluating the quality of the model every step of the way and rewarding the model accordingly. The model gets positive feedback for passing tests and zero for failing.

TL;DR

I created 20 python bug-fixing tasks (or dare I say generated?). Each task is a buggy function paired with a test suite. The model gets the buggy code and has to return a fixed version. The fixed code runs in a Docker container, the tests execute, and the result is pass or fail.

I ran three models (arbitrarily picked) through OpenRoutergoogle/gemini-2.0-flash, openai/gpt-4o-mini, and qwen/qwen3-coder-next — across all 20 tasks:

Model                         Tasks    Pass   Pass Rate     95% CI
google/gemini-2.0-flash          20      20     100.0%  [83.9, 100.0]
openai/gpt-4o-mini               20      18      90.0%  [69.9, 97.2]
qwen/qwen3-coder-next            20      14      70.0%  [48.1, 85.5]

In part 2, we’ll finetune a non-coding-optimized model using GRPO to push that number up.

Note: A fair bit of the code for the harness is AI generated in a very ouroboros like setup.

The nitty gritty

Tasks

A task is a problem statement for the coding agent: here is a buggy function, fix it. +1 if you do, 0 if you don’t. Each task lives in its own directory:

tasks/dict-lookup/
    solution.py   ← buggy function
    test.py       ← assertions that pass only on the correct solution

For example:

# tasks/dict-lookup/solution.py
def get_score(scores, player):
    """Return the score for player, or None if they are not in the table."""
    # BUG: raises KeyError if player is not in scores, but we don't send this comment to the model
    return scores[player]  
# tasks/dict-lookup/test.py
from solution import get_score
assert get_score({"ada": 100, "bob": 42}, "ada") == 100
assert get_score({"ada": 100}, "missing") is None  # fails with buggy code
assert get_score({}, "anyone") is None              # fails with buggy code

The full set is on GitHub.

The bugs cover a range: off-by-one errors, missing edge case handling, wrong string comparison, incorrect algorithm initialization. Two tasks (none-input and string-unicode) tripped up multiple models — both needed reasoning about edge cases not stated in the docstring, which made them the most discriminating.

Getting a completion

OpenRouter handles routing to different model providers with the benefit of needing one API key and client for a variety of models. The runner calls the API with just the buggy code and asks for a json_schema structured response.

Grading in Docker

The grader needs to run arbitrary code returned by an LLM. Running that directly on the host is a bad idea, as in theory, a model could return import os; os.system("rm -rf /") and it would execute with your full permissions. So we need isolation.

Docker gives us:

  1. a separate filesystem so the model’s code can’t touch the host,
  2. resource limits so a runaway loop won’t freeze your machine, and
  3. --rm so the container is gone the moment the test finishes.

For each task, we spin up a fresh container and run test.py against the model generated solution.py. Rather than baking the files into the image (which would require a rebuild per task), we write them to a temp directory on the host and volume-mount them in:

def grade(task: dict, solution_code: str) -> GradeResult:
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir)
        (tmp_path / "solution.py").write_text(solution_code)
        (tmp_path / "test.py").write_text(task["test_suite"])

        result = subprocess.run([
            "docker", "run", "--rm",
            "-v", f"{tmp_path / 'test.py'}:/workspace/test.py:ro",
            "-v", f"{tmp_path / 'solution.py'}:/workspace/solution.py:ro",
            "coding-eval-runner",
        ], capture_output=True, text=True, timeout=10)

    return GradeResult(**json.loads(result.stdout))

Inside the container, docker/runner.py runs test.py as a subprocess and returns JSON to stdout:

def run_task() -> dict:
    try:
        result = subprocess.run(
            ["python3", "/workspace/test.py"],
            cwd="/workspace",
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return {"is_successful": False, "error_type": "timeout", "error_log": None}

    if result.returncode == 0:
        return {"is_successful": True, "error_type": None, "error_log": None}

    return {"is_successful": False, "error_type": "error", "error_log": result.stderr}

Two things that aren’t obvious:

Fresh container per run, shared image. All runs share the same coding-eval-runner image, but each docker run --rm is a new container that gets thrown away after. Swapping the volume mount is how we change what code runs — each task gets its own isolated /tmp/run-xyz/ on the host, mapped to /workspace inside. This means run N can’t bleed into run N+1. Measured overhead: ~0.5s per run, ~30s total for 60 runs, which is fine. In a foundational lab, you would pool containers and route requests accordingly, but this is a toy example.

test.py runs as a subprocess, not via exec(). If the test calls sys.exit(), segfaults, or hangs, exec() would take down runner.py with it and nothing gets returned to the host. Running it as a subprocess means runner.py survives regardless and always sends back a structured result.

The runner loop

for task in tasks:
    for model in MODELS:
        try:
            candidate = get_completion(task, model=model)
            result = grade(task, candidate)
            f.write(json.dumps({
                "model": model,
                "task_id": task["id"],
                "is_successful": result.is_successful,
                "error_type": result.error_type,
                "error_log": result.error_log,
            }) + "\n")
        except Exception:
            f.write(json.dumps({
                "model": model, "task_id": task["id"],
                "is_successful": False, "error_type": "api_error",
                "error_log": traceback.format_exc(),
            }) + "\n")
        f.flush()

f.flush() after every write means a crash mid-run loses nothing already computed. Each task-model pair is wrapped in its own try/except so one API failure doesn’t abort the whole run.

Confidence intervals

We use Wilson score CIs rather than the naive p ± 1.96·√(p(1-p)/n), which breaks at the boundaries — a model that passes all 20 tasks gets [100%, 100%], which overstates certainty.

In practice: Gemini’s [83.9%, 100%] and GPT-4o-mini’s [69.9%, 97.2%] overlap a lot. With 20 tasks you can’t conclude one is actually better. You’d need 50+ tasks to tighten the CIs enough to say anything firm.

How this connects to RL

In Part II, this harness becomes the training loop directly:

  1. A small model generates a candidate solution for each task
  2. The grader runs it in Docker and returns is_successful
  3. is_successful = True → positive reward; False → zero reward
  4. GRPO updates the model weights toward solutions that pass tests

We don’t use human labeling or a reward model, and only use tests. It’s a good fit for code because the output is verifiable: it either passes the tests or it doesn’t. Part II of the blog post would be a lot more involved if I were training a non coding LLM, lucky me.

How this maps to real-world benchmarks

This harness is structurally ~similar to production coding benchmarks but stripped down. The differences are in scale and scaffolding:

This harnessSWE-bench
Tasks202,294
ScopeSingle buggy functionFull repository
Input to model~10 linesThousands of files
OutputFixed functionMulti-file patch
Test executionpython test.pyRepo-specific setup, deps, env

In the real world:

  • On the simpler end of the spectrum, HumanEval and MBPP are the most similar to this with isolated functions and self-contained tests. These are also saturated and not as useful beyond sanity checks.
  • On the harder side, SWE-bench has real GitHub issues, real repos and the model has to find the relevant code before it can fix anything. This is more relevant to training real coding agents.

The toy version here lets us iterate on the eval infrastructure and RL loop fast, without the overhead of cloning thousands of repos and managing per-task environments.

If you were Anthropic building Claude Code, the benchmark stack would probably look like:

  • SWE-bench — 500 human-curated instances, the main signal for real-world issue resolution and what shows up on leaderboards
  • LiveCodeBench — competitive programming problems scraped live from Codeforces, LeetCode, AtCoder; harder to game since the problems are new
  • BigCodeBench — function-level tasks that require real library calls (numpy, pandas, requests), closer to day-to-day coding than algorithmic puzzles
  • Internal evals — every serious lab has private benchmarks; public ones get Goodharted

HumanEval and MBPP are basically just sanity checks now — models have been at 90%+ for years.

The code

Everything is on GitHub. Clone it, add your OpenRouter key, and ./scripts/run_all.sh runs the full eval.


If you have questions or want to discuss, find me on X.