Parallel tempering visibly accelerated the cold replica, yet the measurement could not see far enough to turn that acceleration into a verdict — so exp12 returns Outcome F, not a win.
This is the complete technical record for experiments/exp12-pt-vs-psym/. Here we keep the gates, the windows, and the claim-status discipline. Ran 2026-06-15 on a laptop CPU, float64, ~274 s wall. Gates frozen at pre-commitment 46306c2 (gate-1) and runner a976d80 (gate-2). Reproduce: P0_MODE=full HOST_RAM_GB=8 python3 pt_calibrate.py — MEASURE-ONLY.
The question
exp5 fixed the baseline: on cell C-deep the exact slow-mode timescale of the bare reversible kernel P_sym is , with . Does a reversible parallel-tempering mixture — local sweeps composed with replica swaps , symmetrized so detailed balance holds — cut that slow cluster relative to bare P_sym? This is an operational-tier feasibility precursor, asking whether PT moves the mixing-speed axis at all on this substrate.
The setup
A frozen 8-cell grid over C-deep (R4/R6/R8, primary and convex schedules), plus a C-uni diagnostic and an optional C-deep2 cell. P1 is the reversibility gate (selfadjoint_check_pt.py): at , every reversible kernel — , swaps, , , , , and the actual super-sweep — self-adjoint to ; the deterministic control shows the test has teeth; swap-formula cross-check . Formula-confirmed PASS, frozen, not re-run in the verdict.
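The idea behind a reversibility gate of this kind can be sketched on a toy chain. The block below is a minimal illustration, not the contents of selfadjoint_check_pt.py: the five-state target `pi`, the two Metropolis kernels `A` and `B`, and every helper name are hypothetical. It tests self-adjointness in the pi-weighted inner product through the equivalent detailed-balance condition pi_i P_ij = pi_j P_ji, shows that a plain product of two reversible kernels generally fails while a palindromic composition passes (one common symmetrization, assumed here and not necessarily the one exp12 uses), and includes a deterministic cyclic shift as a control that must fail.

```python
# Toy reversibility gate. All names and numbers are illustrative assumptions.

def metropolis(pi, step):
    """Metropolis kernel on a ring: propose i -> i +/- step, accept with min(1, pi_j / pi_i)."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in ((i + step) % n, (i - step) % n):
            P[i][j] += 0.5 * min(1.0, pi[j] / pi[i])
        P[i][i] += 1.0 - sum(P[i])  # rejected proposals stay put
    return P

def matmul(X, Y):
    """Plain dense matrix product for square lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def balance_defect(pi, P):
    """Largest violation of pi_i P_ij = pi_j P_ji (zero iff P is self-adjoint in L2(pi))."""
    n = len(pi)
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
               for i in range(n) for j in range(n))

pi = [0.10, 0.20, 0.30, 0.25, 0.15]   # assumed non-uniform target
n = len(pi)
A = metropolis(pi, 1)                 # stand-in for a "local sweep"
B = metropolis(pi, 2)                 # stand-in for a second reversible move
AB = matmul(A, B)                     # plain composition
ABA = matmul(AB, A)                   # palindromic (symmetrized) composition
shift = [[1.0 if j == (i + 1) % n else 0.0 for j in range(n)]
         for i in range(n)]           # deterministic control

for name, K in [("A", A), ("B", B), ("A*B", AB), ("A*B*A", ABA), ("shift", shift)]:
    print(f"{name:6s} defect = {balance_defect(pi, K):.2e}")
```

In this toy, `A`, `B`, and the palindrome pass at rounding level, while the plain product and the shift fail by a wide margin. A production gate would compare each defect against a stated tolerance instead of eyeballing the printout.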
The verdict pipeline sets by a doubling rule, then estimates the operational timescale independently from each short operational window at . The A6 adequacy gate must clear before any speedup (P2) is read: in particular the F1 / no-upward-divergence guard requires to have stabilized by the window.
The result
Outcome F on all 8 cells (A6_FAIL). PT did what it was supposed to on the descriptive axis: cut 14–22× on the primary cells (e.g. R6 21×, R8 22×), and projected bulk VAC rates landed at 0.23–0.45 (primary) / 0.06–0.10 (convex). But kept rising across windows on every cell:
| cell | slow-cluster timescale under PT | cut vs. bare P_sym | operational timescale @ successive windows | A6 |
|---|---|---|---|---|
| C-deep R4 primary | 8.0 | 14× | 32 → 68 → 86 | A6_FAIL |
| C-deep R6 primary | 5.3 | 21× | 14 → 32 → 51 | A6_FAIL |
| C-deep R8 primary | 5.0 | 22× | 14 → 28 → 44 | A6_FAIL |
| C-deep R4 convex | 21.5 | 5× | 88 → 215 → 271 | A6_FAIL |
| C-deep R6 convex | 25.5 | 4× | 77 → 206 → 264 | A6_FAIL |
| C-deep R8 convex | 31.2 | 4× | 121 → 243 → 310 | A6_FAIL |
| C-uni R4 (diag.) | 5.8 | 19× | 29 → 56 → 70 | A6_FAIL |
| C-deep2 R4 (opt.) | 4.1 | 22× | 14 → 23 → 26 | A6_FAIL |
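
The three-value window sequences in the table can be checked for deceleration with a few lines of arithmetic. The sketch below copies those sequences verbatim and computes the ratio between consecutive window estimates. The dictionary and function names are hypothetical, and the window lengths are not in the table, so nothing here is normalized per step.

```python
# Window sequences copied from the results table (three estimates per cell).
windows = {
    "C-deep R4 primary": (32, 68, 86),
    "C-deep R6 primary": (14, 32, 51),
    "C-deep R8 primary": (14, 28, 44),
    "C-deep R4 convex": (88, 215, 271),
    "C-deep R6 convex": (77, 206, 264),
    "C-deep R8 convex": (121, 243, 310),
    "C-uni R4 (diag.)": (29, 56, 70),
    "C-deep2 R4 (opt.)": (14, 23, 26),
}

def growth_ratios(seq):
    """Ratios between consecutive window estimates."""
    return [later / earlier for earlier, later in zip(seq, seq[1:])]

for cell, seq in windows.items():
    first, second = growth_ratios(seq)
    trend = "slowing" if second < first else "not slowing"
    print(f"{cell:20s} x{first:.2f} then x{second:.2f} ({trend}, still rising: {second > 1})")
```

The second ratio is smaller than the first on every cell but stays above 1, which is consistent with an approach to a plateau without demonstrating one.
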
Because the binding F1 guard fired, map_outcome mapped every cell A6_FAIL → F (pt_calibrate.py:908). Crucially, A6_FAIL is not UNRESOLVED: stabilized and the windows completed — only the verdict- stabilization sub-check failed.
The speedups are therefore descriptive (non-verdict): they are gated behind the failed A6 and read at the largest still-unstabilized window, so they are not valid P2 verdicts. P4 (Q_op tracks the predictor) passed within the band on every cell with the ratio trending toward 1 — but, again, gated behind A6 and read as descriptive only. The A7 multimodal calibration (m=3, M=2, R=2; extended 4096) validated the dominant VAC gap: sampled-vs-exact ratio 0.997.
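The gate-before-read discipline described above can be illustrated with a small sketch. This is not the map_outcome function in pt_calibrate.py: the `CellReading` type, the `read_cell` helper, and the labels `A6_PASS` and `P2_READABLE` are hypothetical stand-ins. Only `A6_FAIL`, `UNRESOLVED`, and the outcome `F` come from this record.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class CellReading:
    cell: str
    a6_status: str   # hypothetical labels: "A6_PASS", "A6_FAIL", "UNRESOLVED"
    cut: float       # measured timescale cut, e.g. 21.0 for "21x"

def read_cell(r: CellReading) -> Tuple[str, Optional[float], Optional[float]]:
    """Return (outcome, verdict_cut, descriptive_cut).

    The cut is reported as a verdict only when the adequacy gate passed;
    otherwise it is carried along as a descriptive number.
    """
    if r.a6_status == "A6_PASS":
        return ("P2_READABLE", r.cut, None)
    if r.a6_status == "A6_FAIL":
        return ("F", None, r.cut)
    return ("UNRESOLVED", None, r.cut)

r6 = CellReading("C-deep R6 primary", "A6_FAIL", 21.0)
outcome, verdict_cut, descriptive_cut = read_cell(r6)
print(outcome, verdict_cut, descriptive_cut)
```

Keeping the verdict and descriptive slots separate makes it hard to quote a gated number as a P2 result by accident.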
Scope and caveats
This is the honest core: exp12 does not distinguish two readings, and says so. The verdict- is re-estimated from each short window (pt_calibrate.py:774) whose longest span is 82–400 steps, whereas the doubling probe that set ran 1,000–2,000 steps. Because the longer probe trajectories are not reused for verdict , the half-Sokal autocorrelation sum at the short windows is plausibly truncated low and still rising — not failing outright. So the open fork is (i) genuine residual long-memory in the cold-replica observables ( reading below the bare-P_sym 0.78), versus (ii) window-dependent estimation. exp12 does not resolve which. The evidence that is approaching a plateau rather than diverging: the growth is smaller than on every cell, the F1 checks largely pass (R6-primary excepted), and Q_op/predictor trends toward 1.
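Reading (ii) can be made concrete with a deterministic toy. The sketch below assumes a single-exponential autocorrelation function and a lag cap proportional to window length, then evaluates the half-convention sum 1/2 + sum of rho(t) at each cap. The decay time of 50 steps and the cap of one quarter of the window are illustrative assumptions, not values from exp12. The window lengths 82 and 400 echo the short-window spans quoted above, and 2000 echoes the probe length.

```python
import math

def truncated_tau(decay_time, max_lag):
    """Half-convention sum 1/2 + sum_{t=1..max_lag} rho(t), with rho(t) = exp(-t / decay_time)."""
    return 0.5 + sum(math.exp(-t / decay_time) for t in range(1, max_lag + 1))

def exact_tau(decay_time):
    """Infinite-lag limit of the same sum (geometric series)."""
    return 0.5 + 1.0 / (math.exp(1.0 / decay_time) - 1.0)

DECAY_TIME = 50.0    # assumed autocorrelation decay time, in steps
LAG_DIVISOR = 4      # assumed: usable lags capped at window_length // 4

taus = []
for window_length in (82, 400, 2000):
    max_lag = window_length // LAG_DIVISOR
    tau = truncated_tau(DECAY_TIME, max_lag)
    taus.append(tau)
    print(f"window {window_length:5d}  max lag {max_lag:4d}  tau = {tau:6.2f}")

print(f"infinite-lag limit       tau = {exact_tau(DECAY_TIME):6.2f}")
```

Under these assumptions the estimate rises with window length, decelerates, and only reaches its limit at the probe-length window. The toy models lag truncation alone: it ignores sampling noise, and it cannot by itself separate reading (i) from reading (ii).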
Two corrections are recorded without unfreezing the pre-commitment: the cache-aware FSE cost is (not the §7 illustration's ), so the raw -reduction demand at the bar is at R=4/6/8 — but the binding bar, compute-normalized speedup , is unchanged and does not affect the F verdict, which A6 sets first. And the slow-cluster acceleration ratio at R4 is a coarse projected proxy computed with unit VAC-mode weights (pt_calibrate.py:1031) — not evidence that PT "solves the mixing-speed axis."
No tag moves. No GPU authorization, no Route-C verdict, no fundamentality claim. The conditional factorization stays [solid], the operational tier stays [conjectured]; the central-spine A2↔A6 / Risk-ledger interpretation is HELD pending separate explicit conferral. This entry records the measured outcome only.
What this feeds: a future design must size verdict- from the longer probe-length trajectories (or from a -convergence criterion rather than -stability) before PT's – cut can be read as a clean P2 verdict; Route C remains a reasonable candidate but exp12 does not establish that the A6 obstruction has moved to a new intrinsic limit. The lineage continues from exp11's λ-sweep on the objective axis and exp5's spectral baseline on the mixing axis.