The autograder check

After approving instructions, you click Run autograder check. This article walks through the two validation tiers (sandbox and replica), what a pass and a fail mean, why we require 100%, and why we removed the AI auto-fixer.

Written By Alan Gandy

Last updated About 1 month ago

After you approve instructions, a Run autograder check button appears on the Review screen. Clicking it starts the validation phase, the most important quality gate before deploying. This article walks through what runs, what a pass or a fail actually means, and why every assignment must hit 100% before you can ship to students.

Why we validate before deploying

The point of CodeTeach is that what you ship is known-good — the same autograder that GitHub Classroom will run on student submissions has already passed against your reference solution before any student sees the assignment. No surprise failures, no broken workflows, no "but it works on my machine."

Validation runs in two tiers, both inside CodeTeach (not on real GitHub Actions yet — that comes after deploy as a separate post-push check). Both tiers must pass at 100% before the assignment can be deployed.

Tier 1 — Sandbox solution check

The first tier runs your solution against your tests in a Linux container that mirrors what GitHub Actions provides.

This catches:

  • Solution doesn't actually pass its own tests (the test's expected output is wrong, the solution has a bug, or expectations and behaviour have drifted apart)
  • Solution depends on a missing import / package
  • Solution prints debug output that breaks Exact-match tests
  • Solution times out or crashes on any test
  • File encoding issues (Unicode characters, line endings)
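As a rough illustration of the kind of check Tier 1 performs, here is a minimal sketch of an exact-match test runner. It is not CodeTeach's implementation: the names ExactMatchCase and run_exact_match, the 10-second timeout, and the two sample solutions are all assumptions made for the example. It shows how a leftover debug print, a crash, or a timeout each turn into a failure.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExactMatchCase:
    # Hypothetical shape of one exact-match test.
    name: str
    stdin: str
    expected_stdout: str


def run_exact_match(solution_cmd, case, timeout_s=10.0):
    """Run one test against a solution command; return (passed, reason)."""
    try:
        proc = subprocess.run(
            solution_cmd,
            input=case.stdin,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if proc.returncode != 0:
        return False, f"crashed with exit code {proc.returncode}"
    # Exact match: any extra line on stdout, even a debug print, is a failure.
    if proc.stdout != case.expected_stdout:
        return False, "stdout differs from expected output"
    return True, "ok"


# A clean solution and one that leaves a debug print behind (both made up).
CLEAN = [sys.executable, "-c", "print(int(input()) * 2)"]
NOISY = [sys.executable, "-c", "n = int(input()); print('debug', n); print(n * 2)"]

case = ExactMatchCase(name="doubles input", stdin="21\n", expected_stdout="42\n")
clean_result = run_exact_match(CLEAN, case)
noisy_result = run_exact_match(NOISY, case)
```

In this sketch the clean solution passes, while the noisy one fails only because of the extra line on stdout, which is the "debug output breaks Exact-match tests" case from the list above.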

Tier 1 is fast — usually 30–60 seconds. The wizard shows live progress.

Passing means: all tests green, status flips to validated, Continue to Deploy button unlocks.

Failing means: the assignment status flips back to generated with a yellow bounce-back banner explaining what failed. See When validation fails.

Tier 2 — Autograder replica check

If Tier 1 passes, Tier 2 runs the actual GitHub Actions workflow file through nektos/act (a tool that runs Actions workflows locally inside Docker). This is considerably more realistic than Tier 1: it uses the real classroom-resources/autograding-command-grader Action inside Docker, so it closely mirrors how Classroom will actually grade student submissions.

Tier 2 catches things Tier 1 doesn't:

  • Workflow YAML syntax errors
  • POSIX-shell incompatibilities (the real grader uses /bin/sh, not bash, so bash-isms like process substitution, <(...), break)
  • Permissions block missing required keys
  • Step IDs that aren't alphanumeric (the grader uppercases them into environment variable names)
  • Missing workflow_dispatch trigger
  • Missing FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 env flag
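Several items in the list above are static properties of the workflow file, so they can be pictured as a simple lint pass. The sketch below is an assumption-heavy illustration, not what CodeTeach or nektos/act actually runs: lint_workflow is a hypothetical helper, the workflow is an already-parsed dictionary, the env flag is assumed to live in a top-level env block, the test command is assumed to sit under with.command, and the bash-ism patterns are a small, non-exhaustive heuristic.

```python
import re

# Heuristic patterns for common bash-only syntax; deliberately not exhaustive.
BASHISMS = {
    "process substitution <(...)": re.compile(r"<\("),
    "[[ ... ]] test": re.compile(r"\[\["),
}

ENV_FLAG = "FORCE_JAVASCRIPT_ACTIONS_TO_NODE24"


def lint_workflow(workflow):
    """Return a list of human-readable problems found in a parsed workflow."""
    problems = []
    if "workflow_dispatch" not in workflow.get("on", {}):
        problems.append("missing workflow_dispatch trigger")
    # Assumption: the flag is set in the workflow's top-level env block.
    if ENV_FLAG not in workflow.get("env", {}):
        problems.append(f"missing {ENV_FLAG} env flag")
    for job in workflow.get("jobs", {}).values():
        for step in job.get("steps", []):
            step_id = step.get("id")
            if step_id is not None and not re.fullmatch(r"[A-Za-z0-9]+", step_id):
                problems.append(f"step id {step_id!r} is not alphanumeric")
            # Assumption: the grader's test command lives under with.command.
            command = step.get("with", {}).get("command", "")
            for label, pattern in BASHISMS.items():
                if pattern.search(command):
                    problems.append(f"command uses {label}, which /bin/sh may reject")
    return problems


# Illustrative workflow with four of the problems from the list above.
example = {
    "on": {"push": {}},
    "jobs": {
        "grade": {
            "steps": [
                {
                    "id": "test-1",
                    "uses": "classroom-resources/autograding-command-grader@v1",
                    "with": {"command": "diff <(python main.py) expected.txt"},
                }
            ]
        }
    },
}
problems = lint_workflow(example)
```

A static pass like this cannot replace actually running the workflow in Docker, which is why Tier 2 executes the real file rather than only inspecting it.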

Tier 2 is slower — usually 1–2 minutes — because it spins up Docker.

Passing means: all tests pass at 100%, status stays validated.

Failing means: same intervention banner as Tier 1, but the failure is usually about the workflow itself (not the solution). CodeTeach automatically attempts one deterministic, mechanical fix; if that attempt fails, you're bounced to Review.
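The "one attempt, then bounce" behaviour can be sketched as follows. Everything here is hypothetical: validate_with_one_fix, add_dispatch_trigger, and fake_tier2 are stand-ins invented for the example, and the article does not say which mechanical fixes CodeTeach actually applies. The point is only the control flow: a single deterministic rewrite, a single retry, and no loop.

```python
def validate_with_one_fix(workflow, run_tier2, mechanical_fix):
    """Run Tier 2; on failure apply one deterministic fix and retry exactly once."""
    if run_tier2(workflow):
        return "validated", workflow
    fixed = mechanical_fix(workflow)
    # Retry only if the fix changed something; never loop.
    if fixed != workflow and run_tier2(fixed):
        return "validated", fixed
    return "generated", workflow  # bounced back to Review


def add_dispatch_trigger(workflow):
    """Example of a mechanical fix (assumed): add a missing workflow_dispatch trigger."""
    triggers = dict(workflow.get("on", {}))
    triggers.setdefault("workflow_dispatch", {})
    return {**workflow, "on": triggers}


def fake_tier2(workflow):
    """Stand-in for the real Docker run: passes when the trigger is present."""
    return "workflow_dispatch" in workflow.get("on", {})


broken = {"on": {"push": {}}}
status, result = validate_with_one_fix(broken, fake_tier2, add_dispatch_trigger)
```

Because the fix is a pure function of the workflow, running it twice on the same input gives the same output, which is what distinguishes it from the removed AI auto-fixer.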

What runs during validation, in order

When you click Run autograder check:

  1. Status flips to validating
  2. Wizard shows a blue banner: "Running the internal autograder on your solution… This runs the exact same tests GitHub Classroom will use, against your solution in a sandbox that mirrors the GitHub Actions runner. Usually finishes in 30–90 seconds."
  3. Tier 1 runs (sandbox solution check)
  4. If Tier 1 passes, the deterministic workflow builder constructs the GitHub Actions YAML
  5. Tier 2 runs (real workflow in Docker)
  6. If Tier 2 passes, status flips to validated and Continue to Deploy unlocks
  7. If anything fails, status flips back to generated with a reviewIntervention flag and the bounce-back banner appears
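The steps above amount to a small state machine over the assignment status. The sketch below is a simplified model under stated assumptions: the Assignment class, the review_intervention field (standing in for the reviewIntervention flag), and the three callables are illustrative names, not CodeTeach's internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assignment:
    status: str = "generated"
    # Stand-in for the reviewIntervention flag; holds the failure message.
    review_intervention: Optional[str] = None


def bounce(assignment, reason):
    """Step 7: return to 'generated' and record why, so the banner can show it."""
    assignment.status = "generated"
    assignment.review_intervention = reason
    return assignment


def run_autograder_check(assignment, tier1, build_workflow, tier2):
    """Steps 1-7: tier1 and tier2 return True on a 100% pass."""
    assignment.status = "validating"           # step 1
    assignment.review_intervention = None
    if not tier1():                            # step 3
        return bounce(assignment, "Tier 1 (sandbox solution check) failed")
    workflow = build_workflow()                # step 4
    if not tier2(workflow):                    # step 5
        return bounce(assignment, "Tier 2 (autograder replica check) failed")
    assignment.status = "validated"            # step 6
    return assignment


passed = run_autograder_check(Assignment(), lambda: True, lambda: {}, lambda wf: True)
failed = run_autograder_check(Assignment(), lambda: True, lambda: {}, lambda wf: False)
```

Note that the workflow is only built after Tier 1 passes, so a solution bug never reaches the Docker stage.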

The whole thing usually takes 1–3 minutes. The page polls automatically and updates without a refresh.

Why we don't auto-fix anymore

Earlier versions of CodeTeach had an AI auto-fixer that, when validation failed, would automatically try to rewrite the failing artifact and re-validate, in a loop. We removed it.

Why: the auto-fixer produced regressions in real instructor sessions. Failing solutions were getting worse over time (4/5 tests passing → 0/5 after a fix attempt), and the fixer often added bugs the instructor would have to catch later anyway. The same model that produced the failing solution was being asked to fix it with no new information.

Today, when validation fails, you're bounced to Review & Edit with a banner telling you exactly what failed. You fix it (or use the AI Wizard to assist), then re-run validation. This is slower per round, but each fix now comes with new information from someone who knows what the assignment is supposed to do, so it avoids the regressions the auto-fixer introduced.

Why 100%? Can I deploy with one failing test?

No. You can't deploy until validation hits 100%. The reason is simple: every test in the autograder runs on every student submission. If your reference solution can't pass test #4, either the solution has a bug or the test is wrong. Either way, students will hit it. Better to fix it now than to discover it after deployment, when 30 students have already submitted.

If you genuinely don't care about test #4 (e.g., it's testing something out of scope), delete it from the Tests tab — that's the correct fix. Don't try to deploy with a known-failing test "because students might still figure it out."
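The rule is strict enough to state as a one-line predicate. This is a minimal sketch, assuming test results arrive as a mapping from test name to pass/fail; can_deploy and failing_tests are hypothetical names used only for illustration.

```python
def can_deploy(results):
    """True only when there is at least one test and every test passed."""
    return bool(results) and all(results.values())


def failing_tests(results):
    """Names of the tests that still block deployment."""
    return [name for name, passed in results.items() if not passed]


# Four of five tests pass: still not deployable.
results = {"test1": True, "test2": True, "test3": True, "test4": False, "test5": True}
blocked = not can_deploy(results)

# Deleting the out-of-scope test (rather than ignoring it) makes the run 100%.
trimmed = {name: ok for name, ok in results.items() if name != "test4"}
deployable = can_deploy(trimmed)
```

There is no threshold parameter on purpose: 4/5 and 0/5 are treated the same, and the only ways forward are fixing the solution, fixing the test, or deleting the test.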

Lessons from past failures (silent quality boost)

CodeTeach captures lessons from validation runs that fail and then succeed (after you fix something). Those lessons feed forward into future generations: when the AI generates a new assignment, it knows about common failure patterns from past assignments in the same language and avoids them.

You don't see these directly — they live in the database — but they're why the AI gets better at generating Python autograders the more Python autograders have been validated by real instructors.
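One way to picture this feedback loop is a store of lessons keyed by language. The sketch below is purely illustrative: the article does not describe the database schema or how lessons reach the model, so LessonStore, record_fix, and hints_for are invented names, and an in-memory dictionary stands in for the database.

```python
from collections import defaultdict


class LessonStore:
    """Hypothetical in-memory stand-in for the lessons database."""

    def __init__(self):
        self._by_language = defaultdict(list)

    def record_fix(self, language, failure_reason, later_passed):
        """Keep a lesson only when a failed run was followed by a passing one."""
        lessons = self._by_language[language]
        if later_passed and failure_reason not in lessons:
            lessons.append(failure_reason)

    def hints_for(self, language):
        """Lessons that could be supplied when generating in this language."""
        return list(self._by_language.get(language, []))


store = LessonStore()
store.record_fix("python", "debug print broke an exact-match test", later_passed=True)
store.record_fix("python", "solution timed out", later_passed=False)  # never fixed: ignored
store.record_fix("java", "missing import", later_passed=True)
python_hints = store.hints_for("python")
```

Keying by language matches the behaviour described above, where Python lessons improve later Python assignments rather than every assignment.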

Where to go next