A fun way to compare LLMs' creativity and knowledge.

Python 100%


Moritz 6c6cbed414 2026-06-10 17:33:58 +02:00
Don't crash when a model returns a bare scalar instead of the schema object

DeepSeek (and any provider with loose structured-output support) can return a bare JSON value like `42` instead of `{"answer": "42"}`. parse_json_content then returned an int and `data["answer"]` raised TypeError, killing the whole run.

parse_json_content now only accepts a JSON object: a top-level non-dict falls through to the brace scan and ultimately raises ValueError, which every phase already catches and turns into a wrong answer / failed generation, matching the "schema violation counts as wrong" rule instead of aborting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ai_battlequiz	Don't crash when a model returns a bare scalar instead of the schema object	2026-06-10 17:33:58 +02:00
tests	Don't crash when a model returns a bare scalar instead of the schema object	2026-06-10 17:33:58 +02:00
.gitignore	Implement AI Battlequiz: engine, OpenRouter client, CLI, renderer	2026-06-10 16:16:37 +02:00
CLAUDE.md	Add per-round review phase with a private, ephemeral notepad	2026-06-10 17:23:10 +02:00
config.example.toml	Allow duplicate model ids: warn and run as a mirror match instead of aborting	2026-06-10 17:31:28 +02:00
pyproject.toml	Implement AI Battlequiz: engine, OpenRouter client, CLI, renderer	2026-06-10 16:16:37 +02:00
README.md	Implement AI Battlequiz: engine, OpenRouter client, CLI, renderer	2026-06-10 16:16:37 +02:00
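
The behaviour described in the latest commit message above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: only the function name `parse_json_content` and the "object or ValueError" contract come from the commit message; the brace scan shown here (first `{` to last `}`) is an assumed implementation.

```python
import json


def parse_json_content(content: str) -> dict:
    """Return the JSON object found in `content`, or raise ValueError."""
    # First attempt: the whole string is a JSON document.
    try:
        data = json.loads(content)
        if isinstance(data, dict):
            return data
        # A bare scalar or list (e.g. `42`) is not accepted; fall through.
    except json.JSONDecodeError:
        pass

    # Fallback (assumed): scan for the outermost pair of braces.
    start, end = content.find("{"), content.rfind("}")
    if start != -1 and end > start:
        try:
            data = json.loads(content[start:end + 1])
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict):
            return data

    # Callers catch this and count it as a wrong answer or failed generation.
    raise ValueError(f"no JSON object in model output: {content!r}")
```

With this shape, a reply of `42` raises ValueError instead of surfacing later as a TypeError on `data["answer"]`.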

README.md

AI Battlequiz

A CLI game where two or more LLMs challenge each other. Each round, every model invents a quiz task with a reference solution; then every model, including the author (in a fresh session that never sees its own solution), answers every task. The author's own fresh-session answer is the built-in quality gate, so there is no neutral judge model.

Runs against the OpenRouter API. Everything is logged losslessly to a single run file, which a separate render step turns into a website.
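
As a rough sketch of what an append-only JSON-lines run log can look like: one JSON object per line, never rewritten. The class name `RunLogger`, the helper `read_run`, and the record fields (`event`, `round`, `model`, ...) are hypothetical and chosen for illustration; the real `logger.py` may differ.

```python
import json
from pathlib import Path


class RunLogger:
    """Append-only JSON-lines log (hypothetical sketch)."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log(self, event: str, **payload) -> None:
        # One self-contained JSON object per line; existing lines are never edited.
        record = {"event": event, **payload}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_run(path):
    """Load every record of a run file back, in the order it was written."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because every line is a complete record, a render step can replay the file from top to bottom without any other state.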

Scoring

Correct answer to an opponent's task: +1, wrong: −1.
The author's self-answer is the quality gate: correct → 0 points but unlocks the stump bonus; wrong → −1 and no bonus.
Stump bonus: +1 to the author for every opponent who gets the task wrong — only if the author answered its own task correctly.
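
The three rules above can be sketched as a single function that scores one task. The name `score_task` and the input shape (a mapping from model to "was its answer correct") are assumptions for illustration, not the project's actual API.

```python
def score_task(author: str, correct: dict[str, bool]) -> dict[str, int]:
    """Score one task.

    `correct` maps every model, including the author, to whether its
    answer matched the reference solution.
    """
    opponents = [m for m in correct if m != author]

    # Opponents: +1 for a correct answer, -1 for a wrong one.
    scores = {m: (1 if correct[m] else -1) for m in opponents}

    if correct[author]:
        # Quality gate passed: 0 base points, plus +1 per stumped opponent.
        scores[author] = sum(1 for m in opponents if not correct[m])
    else:
        # Quality gate failed: -1 and no stump bonus.
        scores[author] = -1
    return scores


# Author "a" solves its own task, "b" solves it, "c" is stumped.
print(score_task("a", {"a": True, "b": True, "c": False}))
# {'b': 1, 'c': -1, 'a': 1}
```

Note that a task nobody can solve, the author included, costs the author a point rather than earning one.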

Grading is a strict normalized exact match (lowercased, whitespace collapsed). Models are pushed to over-specify the answer format; otherwise they fail their own task.
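
A minimal sketch of such a normalized exact match, assuming that "whitespace collapsed" also means leading and trailing whitespace is dropped. The function names are hypothetical.

```python
def normalize(text: str) -> str:
    # Lowercase, then collapse every run of whitespace to a single space.
    # str.split() with no arguments also strips leading/trailing whitespace
    # (an assumption about the intended behaviour).
    return " ".join(text.lower().split())


def is_correct(answer: str, reference: str) -> bool:
    # Exact match after normalization; no fuzzy or semantic comparison.
    return normalize(answer) == normalize(reference)


print(is_correct("  New   York ", "new york"))  # True
print(is_correct("4.0", "4"))                    # False
```

The second call shows why a task has to pin down its answer format: numerically equal answers that are spelled differently do not match.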

Install

pip install -e .
export OPENROUTER_API_KEY="sk-or-..."

Configure

Copy the example config and pick your models:

cp config.example.toml config.toml

Each [[models]] entry needs an OpenRouter id; name is an optional display label. At least two models are required.

Play

ai-battlequiz run                    # uses config.toml
ai-battlequiz run --config foo.toml --rounds 5
ai-battlequiz run --skip-model-check # skip the OpenRouter catalog lookup

On startup, every configured model id is verified against the OpenRouter catalog; the run aborts if any is missing, unless --skip-model-check is passed. Progress is shown live in the terminal through a Rich-based UI.
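
The startup check can be pictured as a set lookup. The sketch below assumes the catalog ids have already been fetched into a set and leaves the network call out entirely; both function names are hypothetical.

```python
def missing_model_ids(configured: list[str], catalog_ids: set[str]) -> list[str]:
    """Return configured ids absent from the catalog, in config order, without repeats."""
    return [m for m in dict.fromkeys(configured) if m not in catalog_ids]


def check_models(configured: list[str], catalog_ids: set[str], skip_check: bool = False) -> None:
    # Mirrors the --skip-model-check flag: skip the lookup when requested.
    if skip_check:
        return
    missing = missing_model_ids(configured, catalog_ids)
    if missing:
        # Abort before any round is played.
        raise SystemExit("unknown OpenRouter model id(s): " + ", ".join(missing))
```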

Render a run to a website

ai-battlequiz render runs/run-20260610-153000.jsonl
# -> runs/run-20260610-153000.html

Module layout

Module	Responsibility
`config.py`	Load & validate the TOML config
`client.py`	Thin async OpenRouter wrapper (structured output, retry)
`prompts.py`	Prompt builders + JSON schemas (generation vs answering)
`engine.py`	Round orchestration: generate → answer → grade → score
`grader.py`	Normalized exact-match grading & scoring
`logger.py`	Append-only lossless JSON-lines run log
`console.py`	Rich-based terminal UI
`renderer.py`	Run log → self-contained HTML site
`cli.py`	Argument parsing, startup checks, run loop

Tests

pytest
