CTA3 — an LLM solver for community riddles
A full-stack research database and agentic solver for osu!'s hardest community puzzle tournament — 1.45 million scores indexed, every miss forensically analyzed.
"Capture The Achievement" is a community tournament where players race to decode deliberately obscure riddles — answers hide in beatmap metadata, player histories, medal solutions, forum lore, and layered encodings. Solving them requires both a knowledge base nobody has and reasoning nobody has automated. So I built both.
The knowledge base
Ingestion pipelines pull the entire relevant universe into one SQLite database (WAL mode, FTS5 everywhere): 219,440 beatmaps, 23,142 players with per-mode stats, 1,453,429 scores, 347 medals with community solutions, 83,634 comments, tournament rosters and mappools parsed from wiki pages, plus high-signal Reddit history distilled into a facts table. A Next.js frontend over a 15-endpoint FastAPI layer makes the whole thing explorable — search anything, pivot to anything.
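As a rough illustration of the "WAL mode, FTS5" setup, here is a minimal sketch using Python's standard `sqlite3` module. The table names (`beatmaps`, `beatmaps_fts`), the columns, and the helper functions are all hypothetical and are not the project's actual schema. It also assumes the local SQLite build ships with FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3


def open_db(path):
    """Open a SQLite file and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def create_schema(conn):
    """Create a plain table plus an FTS5 index over its text columns.

    Hypothetical schema for illustration only.
    """
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS beatmaps (
            id INTEGER PRIMARY KEY,
            artist TEXT NOT NULL,
            title TEXT NOT NULL,
            creator TEXT NOT NULL
        );
        CREATE VIRTUAL TABLE IF NOT EXISTS beatmaps_fts
            USING fts5(artist, title, creator);
        """
    )


def add_beatmap(conn, beatmap_id, artist, title, creator):
    """Insert one row and mirror its text into the FTS index (rowid = id)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO beatmaps (id, artist, title, creator) VALUES (?, ?, ?, ?)",
            (beatmap_id, artist, title, creator),
        )
        conn.execute(
            "INSERT INTO beatmaps_fts (rowid, artist, title, creator) VALUES (?, ?, ?, ?)",
            (beatmap_id, artist, title, creator),
        )


def search_beatmaps(conn, query, limit=10):
    """Full-text search, best matches first, joined back to the base table."""
    return conn.execute(
        """
        SELECT beatmaps.id, beatmaps.title
        FROM beatmaps_fts
        JOIN beatmaps ON beatmaps.id = beatmaps_fts.rowid
        WHERE beatmaps_fts MATCH ?
        ORDER BY rank
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()
```

Keeping the FTS table's `rowid` equal to the base table's `id` is one simple way to pivot from a text hit back to the structured row; an external-content FTS5 table with triggers would be a reasonable alternative.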
The solver
The interesting engineering is in the solve loop:
- Self-looping LLM agent — the solver decomposes a clue, queries the database through tools, detects encodings (base64, hex, regex-structured layers), forms hypotheses, and re-enters with what it learned. Every step is traced to a per-clue engine log: context, hypotheses, action, timing.
- Engines are A/B-tested, not vibed. Competing solver architectures run against the full 168-clue corpus; results land in an A/B log with summary statistics before an engine earns its place.
- Misses get autopsies. A gap-analysis pass classifies every failed clue by cause: missing data, bad retrieval, reasoning failure, or encoding blindness. Those categories drove the roadmap: each ingestion source above exists because a gap analysis demanded it.
- Embeddings + reasoning traces are first-class artifacts, stored for regression comparison across solver versions.
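The encoding-detection and tracing steps in the loop above can be sketched without any LLM at all. The following is a minimal, hypothetical example that peels hex and base64 layers off a string and records each step with timing. The names `peel_layers`, `TraceStep`, and the detector heuristics are assumptions made for illustration, not the solver's real code, and the real agent presumably interleaves this with database queries and hypothesis generation.

```python
import base64
import binascii
import re
import time
from dataclasses import dataclass

HEX_RE = re.compile(r"(?:[0-9a-fA-F]{2})+")
B64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


@dataclass
class TraceStep:
    step: int          # 1-based position in the peel sequence
    encoding: str      # which detector fired
    output: str        # decoded text after this layer
    elapsed_ms: float  # time spent detecting and decoding this layer


def _as_text(raw):
    """Accept a decode only if it yields non-empty printable UTF-8."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return text if text and text.isprintable() else None


def try_hex(text):
    if not HEX_RE.fullmatch(text):
        return None
    return _as_text(bytes.fromhex(text))


def try_base64(text):
    if len(text) % 4 != 0 or not B64_RE.fullmatch(text):
        return None
    try:
        raw = base64.b64decode(text, validate=True)
    except binascii.Error:
        return None
    return _as_text(raw)


# Hex is checked first because every hex string is also made of
# base64-alphabet characters. These heuristics can misfire on short
# alphanumeric words; a real solver would need stronger checks.
DETECTORS = (("hex", try_hex), ("base64", try_base64))


def peel_layers(clue, max_depth=8):
    """Repeatedly decode until no detector matches; return (text, trace)."""
    current = clue.strip()
    trace = []
    for step in range(1, max_depth + 1):
        started = time.perf_counter()
        for name, decode in DETECTORS:
            decoded = decode(current)
            if decoded is not None:
                elapsed_ms = (time.perf_counter() - started) * 1000
                trace.append(TraceStep(step, name, decoded, elapsed_ms))
                current = decoded
                break
        else:
            break  # no detector matched: treat the text as plain
    return current, trace


if __name__ == "__main__":
    # Build a two-layer clue: plaintext -> base64 -> hex.
    layered = base64.b64encode(b"hidden answer").decode("ascii").encode("ascii").hex()
    answer, trace = peel_layers(layered)
    print(answer)
    for item in trace:
        print(item.step, item.encoding, item.output)
```

The `max_depth` cap is a guard against pathological inputs that keep decoding to something that itself looks encoded. Storing the trace as structured records rather than free text is what makes later comparison across solver versions straightforward.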
Why it matters
This is the whole applied-AI loop in one project: hostile-domain data engineering, retrieval design, agentic tool use, and — rarest in hobby projects — honest measurement of whether the agent is actually getting better. It's also proof that I'll build production-grade infrastructure for something purely because it's hard and fun, which is roughly the job description of a prototyper.
Status & limits: the database and UI are complete and fast; solver accuracy is still a moving number that improves with each engine generation, not a solved problem. The tournament's puzzle authors remain, for now, ahead of the machine, as it should be.