Specimen Report · Python

autoresearch-tsp

phiat/autoresearch-tsp

Autonomous LLM research loop on the Kaggle Traveling Santa 2018 TSP. Inspired by marimo video

# autoresearch-tsp

Two autonomous LLM research loops attacking the [Kaggle Traveling Santa 2018 Prime Paths](https://www.kaggle.com/competitions/traveling-santa-2018-prime-paths/overview) TSP (197,769 cities, 1.1× penalty on every 10th step from a non-prime origin). Same metric, same 5-min budget per cycle, same keep-or-revert mechanic. Different *levers* the agent can pull. See [Provenance](#provenance) for inspiration.

## The two projects

| | [`tsp_heuristic/`](tsp_heuristic/) | [`tsp_neural/`](tsp_neural/) |
|------------------|------------------------------------------|------------------------------------------------|
| **Approach** | classical heuristic search | neural-guided local search |
| **Lever** | algorithm design | model design + integration |
| **Trains a model?** | **no** (pure numpy/numba/scipy) | **yes** (small PyTorch model per cycle) |
| **What's optimised** | a permutation of 197,769 ints (the tour) | a tour, *plus* the move-scorer that helps build it |
| **Deps** | numpy, pandas, sympy, scipy, numba | + `torch` |
| **Pipeline shape** | construction → local search → perturbation → polish (agent mutates every cycle) | construction → local search guided by a learned model (agent mutates every cycle) |
| **Risk** | low; classical literature is deep | high; learned/classical glue is fiddly |

Both share `prepare.py` (the metric is bit-identical) and the same harness shape: agent edits one file (`solve.py`), runs under a fixed 5-min budget, scores against `val_cost`, keeps or reverts, appends to `results.tsv`, repeats.

Each subdir is self-contained. See its `README.md` for setup, its `AGENTS.md` for the tooling inventory, and its `program.md` for the loop's operating rules.

## Layout

```
tsp_heuristic/    classical loop
tsp_neural/       neural-guided loop
autoresearch/     vendored upstream (karpathy's repo, reference only)
AGENTS.md         agent guidance for the outer repo
```

## Kick off the agents

Each loop runs as its **own Claude Code session** inside its subdir.
Both share branch `main`; each agent only ever edits files inside its own subdir, and `just revert` uses `git revert HEAD --no-edit` so a discard from one loop never wipes the other's commits.

### One-time setup

```bash
# Unzip the Kaggle archive into tsp_heuristic/data/ so cities.csv exists.
unzip traveling-santa-2018-prime-paths.zip -d tsp_heuristic/data/

# Install deps for each loop (tsp_neural pulls ~2 GB of PyTorch first time).
(cd tsp_heuristic && uv sync)
(cd tsp_neural && uv sync)
```

### Launch each loop

Open *two* terminals. In each, `cd` into the subdir and start a fresh Claude Code session (`claude`). Paste this prompt verbatim:

> *"Read `AGENTS.md` and `program.md`. Verify smoke test (`just data` then `just run`), then start the experiment loop on `main` per `program.md`. Iterate until I interrupt; never stop on your own."*

(For a fresh `tsp_neural/` setup with no prior checkpoint, additionally note: *"the first ~3 cycles must introduce learning per `program.md` (T1 harvest → M1 train → I1 integrate)"*. Skip if a checkpoint already exists in `checkpoints/`.)

### Monitor progress

```bash
git log --oneline -10                  # interleaved exp: / meta: commits
(cd tsp_heuristic && just status)      # per-loop head + last result + recap-pending
(cd tsp_neural && just status)
xdg-open progress.png                  # or any image viewer
```

The Status table + `progress.png` are refreshed periodically by an out-of-band cron driver (separate from either loop's session).

### Stop a loop

- Interrupt the relevant Claude Code session (Ctrl-C or `/exit`).
- The other loop is unaffected (separate session, separate `solve.py`, no shared `HEAD` operations).

## Status

*As of 2026-04-27 (neural cycle 50, M11 keep: bigger-but-shallower ranker). SOTA = 1,514,000 (Santa 2018 top public-LB). `val_cost` is the official Santa 2018 cost (lower is better). `% SOTA` = `100 − gap`, where `gap = 100 × (val_cost − SOTA) / SOTA`, i.e. the gap expressed as a percentage.
**Δ12** = improvement in `% SOTA` over the last 12 experiments (in percentage points; positive = progressing).*

| Loop | % SOTA | Δ12 | Best `val_cost` | Cycles |
|---|---|---|---|---|
| `tsp_heuristic/` | 97.80% | +0.000pp | 1,547,347 | 68 |
| `tsp_neural/` | 97.57% | +0.038pp | 1,550,769 | 50 |

![Progress](progress.png)

Run-by-run history lives in each subproject's `recaps/` (committed) and `results.tsv` (gitignored, local-only).

## Provenance

- **Inspiration & template**: [karpathy/autoresearch](https://github.com/karpathy/autoresearch).
- **Task**: [Kaggle Traveling Santa 2018 Prime Paths](https://www.kaggle.com/competitions/traveling-santa-2018-prime-paths/).
- **Inspired by this video**: [marimo: autoresearch on the Santa 2018 TSP](https://www.youtube.com/watch?v=bMoNOb0iXpA).

## Forking on different hardware

This run targets **one specific host**: an RTX 4070 (16 GB VRAM) + ample CPU/RAM. The 5-minute per-cycle budget and the various tuning defaults (`K_NEIGHBORS`, model param caps, batch sizes) are sized for that machine. Both `program.md` files reflect that target as written.

If you fork on different hardware, treat the `program.md` hardware lines as your customisation point and adjust derived knobs accordingly:

- **`tsp_heuristic/`** is CPU-bound (numpy / scipy / numba, no GPU). Relevant specs are core count and RAM. A 4-core laptop will struggle with the 5-min budget; consider tightening `K_NEIGHBORS` or disabling expensive Or-opt sweeps.
- **`tsp_neural/`** uses the GPU for PyTorch model training:
  - **Smaller GPU (≤8 GB)**: cap params at ~250K, halve batch size, consider fp16 / bf16 inference (E-class engineering ideas).
  - **Larger GPU (24 GB+)**: train bigger MLPs / shallow transformers, but inner-loop inference latency stays the bottleneck; bigger models need to earn their cost vs the distilled-numba baseline.
  - **CPU-only / MPS**: latency budget will tighten dramatically. May not be worth pursuing the neural differentiator on this hardware vs.
just running `tsp_heuristic/`.

Land hardware-target changes (and any derived knob tweaks) in a `meta:` commit so the trail is auditable.

## Methodology caveat: we started "cheating"

Around **heuristic-loop cycle 38** we began invoking `paper-researcher` with queries scoped to **Santa 2018-specific writeups**, i.e., reading other people's solutions to this exact problem rather than only general TSP / heuristic-search / learned-LKH literature. (The first such invocation actually fell back to generic GLS / LKH / Blazinskas literature per a user instruction, but the *intent* and the era directive crossed the line.)

This contaminates the "can the LLM agent independently solve this from first principles + generic literature" question. Any subsequent `val_cost` lift may have been seeded by an idea the agent recognized from a competition writeup rather than derived autonomously.

We're keeping the door open going forward: experiments and harness-evaluation reports should distinguish the un-contaminated subset (pre-domain-research, plus runs using only `classical` / `hybrid` / `modern-learned` era queries) from the contaminated subset (post-domain-research, runs using `domain-specific` era queries). Note this explicitly in any future write-up.

## License

MIT.
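
## Sketch: reproducing the `% SOTA` column

The `% SOTA` figures in the Status table can be checked by hand. The snippet below is a minimal sketch, not code from this repo: the function name `pct_sota` is hypothetical, and it reads `gap` as a percentage of the reference cost, which is the reading that reproduces the published 97.80% and 97.57%.

```python
SOTA = 1_514_000  # reference cost quoted in the Status section


def pct_sota(val_cost: float, sota: float = SOTA) -> float:
    """Return 100 minus the percentage gap between val_cost and the reference.

    Hypothetical helper for illustration; not part of the repo's harness.
    """
    gap_pct = 100.0 * (val_cost - sota) / sota
    return 100.0 - gap_pct


# Best val_cost values taken from the Status table.
for loop, cost in [("tsp_heuristic/", 1_547_347), ("tsp_neural/", 1_550_769)]:
    print(f"{loop:<15} {pct_sota(cost):.2f}%")
```

A tour that matched the reference cost exactly would score 100%, and every extra 15,140 units of cost (1% of the reference) lowers the score by one percentage point.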
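
## Sketch: the prime-penalty cost

The metric both loops optimise is summarised at the top of this README as a 1.1× penalty on every 10th step from a non-prime origin. The following is a rough sketch of that rule under stated assumptions, not the repo's `prepare.py`: the names `prime_mask` and `santa_cost` are hypothetical, the tour is assumed to be a closed path of city IDs that starts and ends at city 0, and a plain numpy sieve stands in for whatever primality helper the real harness uses.

```python
import numpy as np


def prime_mask(n: int) -> np.ndarray:
    """Boolean mask where mask[i] is True iff i is prime (sieve of Eratosthenes)."""
    mask = np.ones(n, dtype=bool)
    mask[:2] = False  # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if mask[p]:
            mask[p * p :: p] = False
    return mask


def santa_cost(tour: np.ndarray, xy: np.ndarray) -> float:
    """Cost of a closed tour under the prime-penalty rule (illustrative only).

    tour: 1-D int array of city IDs, assumed to start and end at city 0.
    xy:   (n_cities, 2) float array of coordinates indexed by city ID.
    """
    is_prime = prime_mask(len(xy))
    origins, dests = tour[:-1], tour[1:]
    step_len = np.linalg.norm(xy[dests] - xy[origins], axis=1)
    step_no = np.arange(1, len(tour))  # steps are numbered from 1
    # Every 10th step costs 10% more unless it departs from a prime city ID.
    penalised = (step_no % 10 == 0) & ~is_prime[origins]
    return float(np.sum(step_len * np.where(penalised, 1.1, 1.0)))


# Toy instance: 12 cities on a line at x = 0..11.
xy = np.column_stack([np.arange(12, dtype=float), np.zeros(12)])
in_order = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0])
prime_at_10 = np.array([0, 1, 2, 3, 4, 5, 6, 8, 9, 7, 10, 11, 0])
print(santa_cost(in_order, xy), santa_cost(prime_at_10, xy))
```

In the toy instance the in-order tour leaves city 9 (not prime) on its 10th step and pays the surcharge, while the second tour leaves city 7 (prime) on that step and does not. The second tour is still longer overall, which illustrates why the penalty is one term to trade off against raw distance.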