A fun way to compare LLMs' creativity and knowledge.
Moritz 6c6cbed414 Don't crash when a model returns a bare scalar instead of the schema object
DeepSeek (and any provider with loose structured-output support) can return a
bare JSON value like `42` instead of `{"answer": "42"}`. parse_json_content then
returned an int and `data["answer"]` raised TypeError, killing the whole run.

parse_json_content now only accepts a JSON object: a top-level non-dict falls
through to the brace scan and ultimately raises ValueError, which every phase
already catches and turns into a wrong answer / failed generation — matching the
"schema violation counts as wrong" rule instead of aborting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 17:33:58 +02:00
ai_battlequiz Don't crash when a model returns a bare scalar instead of the schema object 2026-06-10 17:33:58 +02:00
tests Don't crash when a model returns a bare scalar instead of the schema object 2026-06-10 17:33:58 +02:00
.gitignore Implement AI Battlequiz: engine, OpenRouter client, CLI, renderer 2026-06-10 16:16:37 +02:00
CLAUDE.md Add per-round review phase with a private, ephemeral notepad 2026-06-10 17:23:10 +02:00
config.example.toml Allow duplicate model ids: warn and run as a mirror match instead of aborting 2026-06-10 17:31:28 +02:00
pyproject.toml Implement AI Battlequiz: engine, OpenRouter client, CLI, renderer 2026-06-10 16:16:37 +02:00
README.md Implement AI Battlequiz: engine, OpenRouter client, CLI, renderer 2026-06-10 16:16:37 +02:00

AI Battlequiz

A CLI game where two or more LLMs challenge each other. Each round every model invents a quiz task with a reference solution, then every model — including the author, in a fresh session that never sees its own solution — answers every task. The author's own fresh-session answer is the built-in quality gate: no neutral judge model.

Runs against the OpenRouter API. Everything is logged losslessly to a single run file, which a separate render step turns into a website.

Scoring

  • Correct answer to an opponent's task: +1; wrong: −1.
  • The author's self-answer is the quality gate: correct → 0 points but unlocks the stump bonus; wrong → −1 and no bonus.
  • Stump bonus: +1 to the author for every opponent who gets the task wrong — only if the author answered its own task correctly.
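The rules above can be sketched for a single task roughly as follows. This is a minimal illustration, not the project's grader.py: `TaskResult`, `score_task` and the `wrong_penalty` parameter are hypothetical names, and the sketch assumes a wrong answer costs one point.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical structure for this sketch)."""
    author: str
    correct: dict[str, bool]  # model name -> answered correctly (author included)


def score_task(result: TaskResult, wrong_penalty: int = -1) -> dict[str, int]:
    """Return the points each model earns on one task."""
    scores = {model: 0 for model in result.correct}
    author_ok = result.correct[result.author]

    for model, ok in result.correct.items():
        if model == result.author:
            # Self-answer is the quality gate: correct earns 0, wrong is penalized.
            if not ok:
                scores[model] += wrong_penalty
            continue
        # Opponents earn +1 for a correct answer, the penalty otherwise.
        scores[model] += 1 if ok else wrong_penalty
        # Stump bonus: only unlocked if the author solved its own task.
        if not ok and author_ok:
            scores[result.author] += 1
    return scores


# Author "a" solves its own task, "b" is stumped, "c" answers correctly.
example = score_task(TaskResult("a", {"a": True, "b": False, "c": True}))
```

In this example the author collects one stump bonus for "b", while a task the author itself fails would earn it the penalty and no bonus, however many opponents it stumps.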

Grading is a strict normalized exact match (lowercased, whitespace collapsed). Models are pushed to over-specify the answer format; otherwise they fail their own task.
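A normalized exact match of this kind might look like the sketch below. The function names are hypothetical, and stripping leading and trailing whitespace is an assumption on top of the two normalization steps named above; punctuation is deliberately left untouched.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse every whitespace run to a single space.

    str.split() with no arguments also drops leading/trailing whitespace
    (an assumption of this sketch, not confirmed by the README).
    """
    return " ".join(text.lower().split())


def is_correct(answer: str, reference: str) -> bool:
    """Strict comparison after normalization; no fuzzy matching."""
    return normalize(answer) == normalize(reference)


same = is_correct("  Forty   Two ", "forty two")
different = is_correct("42.", "42")
```

The second comparison fails because of the trailing period, which is why a task has to pin down the answer format precisely.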

Install

pip install -e .
export OPENROUTER_API_KEY="sk-or-..."

Configure

Copy the example config and pick your models:

cp config.example.toml config.toml

Each [[models]] entry needs an OpenRouter id; name is an optional display label. At least two models are required.

Play

ai-battlequiz run                    # uses config.toml
ai-battlequiz run --config foo.toml --rounds 5
ai-battlequiz run --skip-model-check # skip the OpenRouter catalog lookup

On startup every configured model id is verified against the OpenRouter catalog; the run aborts if any id is missing, unless --skip-model-check is given. Progress is shown live in a Rich-based terminal UI.
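The startup check can be pictured as a set lookup. The sketch below assumes the catalog ids have already been fetched into a set; `missing_models` and `check_models` are hypothetical names, and the real CLI's error message and exit behaviour may differ.

```python
def missing_models(configured: list[str], catalog_ids: set[str]) -> list[str]:
    """Return configured ids that are absent from the catalog, in config order."""
    # dict.fromkeys de-duplicates while keeping the original order.
    return [m for m in dict.fromkeys(configured) if m not in catalog_ids]


def check_models(configured: list[str], catalog_ids: set[str],
                 skip_check: bool = False) -> None:
    """Abort the run if any configured id is unknown, unless the check is skipped."""
    if skip_check:
        return
    missing = missing_models(configured, catalog_ids)
    if missing:
        raise SystemExit(f"unknown model ids: {', '.join(missing)}")


# Placeholder ids for illustration only.
unknown = missing_models(["vendor-a/model-one", "vendor-b/typo"], {"vendor-a/model-one"})
```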

Render a run to a website

ai-battlequiz render runs/run-20260610-153000.jsonl
# -> runs/run-20260610-153000.html
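The run file that the render step consumes is described as an append-only JSON-lines log. A minimal sketch of writing and reading such a file follows; the event fields (`type`, `round`, `author`, `model`, `answer`) and the helper names are hypothetical and not taken from the project's logger.py.

```python
import json
import tempfile
from pathlib import Path


def append_event(path: Path, event: dict) -> None:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, ensure_ascii=False) + "\n")


def read_events(path: Path) -> list[dict]:
    """Load every event back in the order it was written."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "run-example.jsonl"
    append_event(log_path, {"type": "task", "round": 1, "author": "model-one"})
    append_event(log_path, {"type": "answer", "round": 1, "model": "model-two", "answer": "42"})
    events = read_events(log_path)
```

One record per line means a renderer can replay the run in order, and an interrupted run loses at most its final line.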

Module layout

| Module | Responsibility |
| --- | --- |
| config.py | Load & validate the TOML config |
| client.py | Thin async OpenRouter wrapper (structured output, retry) |
| prompts.py | Prompt builders + JSON schemas (generation vs answering) |
| engine.py | Round orchestration: generate → answer → grade → score |
| grader.py | Normalized exact-match grading & scoring |
| logger.py | Append-only lossless JSON-lines run log |
| console.py | Rich-based terminal UI |
| renderer.py | Run log → self-contained HTML site |
| cli.py | Argument parsing, startup checks, run loop |

Tests

pytest