An at-scale operational read that looked like a clean F4-fail was never inside a valid regime: the PT chain it relied on does not mix the trained model.
This is the complete technical record for experiments/exp16-operational-validation/. Read it through its erratum. Here we keep the numbers, the gates, and exactly what is withdrawn.
The erratum, first
The run reused exp15's build_alpha_programs verbatim, and that builder has a bug: the per-replica PT local kernels were built from the init weights while the swap energy used the trained weights. This inconsistent chain has a cold replica that does not target . So the run's central mechanism — "the R4-PT negative phase mixes near-ideally at scale (, ), leaving the plain-Gibbs positive phase as the F4 bottleneck" — rests on a chain that never mixed the trained DTM.
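The mismatch described above (local kernels built from one weight set, swap energy evaluated on another) is the kind of inconsistency a cheap build-time guard can catch. The sketch below is illustrative only: `PTProgram`, `fingerprint`, and `check_pt_consistency` are hypothetical names, not the API of `build_alpha_programs`, and the weight vectors are made-up placeholders.

```python
import hashlib
import struct
from dataclasses import dataclass
from typing import Sequence


def fingerprint(weights: Sequence[float]) -> str:
    """Hash a flat weight vector so two parameter sets can be compared cheaply."""
    packed = struct.pack(f"{len(weights)}d", *weights)
    return hashlib.sha256(packed).hexdigest()


@dataclass(frozen=True)
class PTProgram:
    """Hypothetical record of which weights each part of a PT chain was built from."""
    kernel_weights_id: str  # weights behind the per-replica local kernels
    swap_weights_id: str    # weights behind the swap energy


def check_pt_consistency(program: PTProgram) -> None:
    """Raise if the local kernels and the swap energy disagree on the weights."""
    if program.kernel_weights_id != program.swap_weights_id:
        raise ValueError(
            "PT chain is inconsistent: local kernels and swap energy "
            "were built from different weights"
        )


# Placeholder weight vectors, for illustration only.
init_w = [0.0, 0.0, 0.0]
trained_w = [0.7, -1.2, 0.3]

# The failure mode from the erratum: init-weight kernels, trained-weight swaps.
buggy = PTProgram(fingerprint(init_w), fingerprint(trained_w))
# A consistent build: both parts see the trained weights.
fixed = PTProgram(fingerprint(trained_w), fingerprint(trained_w))

check_pt_consistency(fixed)  # passes silently
```

Calling `check_pt_consistency(buggy)` raises `ValueError`. A guard of this shape would have failed the build before any measurement was taken. It only asserts that both parts share weights, not that those weights are the trained ones.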
A measure-only re-check with a trained-weight refresh (real ) returns LADDER-INADEQUATE-TRAINED: with correct weights the R4 ladder fails to mix — swap-acceptance max 0.0099 against the band , round-trips 0 — upstream of and . Therefore was an init-weight artifact, and the F4 ratio, the P4 drift, and the "negative phase passed at scale" table are withdrawn as trained-DTM evidence. Corrected ground truth lives in experiments/exp15-recheck-trained-weights/.
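For readers unfamiliar with the two ladder diagnostics named above, here is a minimal sketch of how per-pair swap acceptance and round-trip counts are commonly computed. It is not the re-check's code: the function names are hypothetical, the counts and rung path are synthetic, and the band passed in is a labeled placeholder, not the registered band.

```python
from typing import List, Sequence, Tuple


def swap_acceptance_rates(accepted: Sequence[int],
                          proposed: Sequence[int]) -> List[float]:
    """Acceptance rate for each adjacent-rung pair (0.0 if nothing was proposed)."""
    return [a / p if p > 0 else 0.0 for a, p in zip(accepted, proposed)]


def count_round_trips(rung_path: Sequence[int], n_rungs: int) -> int:
    """Count cold -> hot -> cold traversals of one replica's rung-index path.

    Rung 0 is the cold end, rung n_rungs - 1 the hot end.
    """
    trips = 0
    heading_up = None  # unknown until the replica first touches the cold rung
    for idx in rung_path:
        if idx == 0:
            if heading_up is False:  # came back down after reaching the top
                trips += 1
            heading_up = True
        elif idx == n_rungs - 1 and heading_up is True:
            heading_up = False
    return trips


def ladder_adequate(rates: Sequence[float],
                    band: Tuple[float, float],
                    round_trips: int) -> bool:
    """A ladder passes only if every pair sits in the band and trips occurred."""
    lo, hi = band
    return round_trips > 0 and all(lo <= r <= hi for r in rates)


# Synthetic data, for illustration only.
rates = swap_acceptance_rates(accepted=[3, 1, 0], proposed=[1000, 1000, 1000])
stuck_path = [0, 0, 1, 0, 0, 1, 1, 0]  # never reaches the hot rung
trips = count_round_trips(stuck_path, n_rungs=4)
HYPOTHETICAL_BAND = (0.2, 0.6)  # placeholder, NOT the registered band
verdict = ladder_adequate(rates, HYPOTHETICAL_BAND, trips)
```

With near-zero acceptance and a replica that never reaches the hot end, `trips` is 0 and `verdict` is `False`. That is the shape of the LADDER-INADEQUATE-TRAINED outcome, though the real gate's thresholds may differ.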
What still stands: the A2 certificate, the convergence-certified empirical (the Risk-3 / P1 reference work — that gate does not depend on the PT local kernel), and the measure-only discipline itself. No tag moves — this was always MEASURE-ONLY.
The question (as originally posed)
Does the operational factorization hold at DTM scale, driven by exp15's PT cold chain? The conditional factorization is [solid]; the operational tier is [conjectured]. A validation-positive was the only outcome that could feed a researcher-conferred validated. None of what follows changes that — but the run never produced one anyway.
The setup
H200 (Lightning.ai), 2026-06-17, EXP16_MODE=full, wall 1.349 GPU-h under the 3 h cap. backend=gpu, jax=0.10.1, patch_live=True, decision gate PROCEED, probe_rng_isolated=True. K-grid , all . F4 is the registered subdominance leg: must be .
The result (now withdrawn)
The originally-reported preconditions all "passed": A2 self-adjoint check ; calibration Cal-STABLE with (rel vs exp15's ); the P1 reference converged to relchange , cos , two-ref 0.0148 — clearing the Risk-3 shared-bias check that exp3 could never reach (, ). The F4 table:
| K | | | | |
|---|---|---|---|---|
| 100 | 0.826 | 51.87 | 96.57 | 0.537 |
| 300 | 2.024 | 93.06 | 289.7 | 0.321 |
| 1000 | 5.391 | 146.7 | 965.7 | 0.152 |
The reading was: (PT near-ideal ESS), while plain-Gibbs falls much slower, so by the positive phase dominates and F4 crosses 0.3. P4 slid because grows linearly in while measured saturates (), floored by the unmodeled positive phase. All of this assumed a mixing PT chain it did not have. The re-check shows the ladder does not mix the trained model, so the entire branch is invalid.
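The arithmetic behind the table can be checked directly. The column labels are missing from this copy, so the sketch below refers to columns by position. That the last column is the ratio of the third to the fourth is an inference from the numbers, not a statement from the run's code. The check confirms internal consistency only and says nothing about validity, since the table is withdrawn.

```python
# Rows copied from the (withdrawn) F4 table: K, then columns 2-5 by position.
rows = [
    (100, 0.826, 51.87, 96.57, 0.537),
    (300, 2.024, 93.06, 289.7, 0.321),
    (1000, 5.391, 146.7, 965.7, 0.152),
]

# Column 5 recomputed as column 3 / column 4, rounded to the table's precision.
ratios = [round(col3 / col4, 3) for _, _, col3, col4, _ in rows]

# Column 4 divided by K: a constant slope means column 4 is linear in K.
slopes = [round(col4 / k, 4) for k, _, _, col4, _ in rows]

# Growth of column 3 across a 10x increase in K: well below 10x means saturation.
col3_growth = rows[-1][2] / rows[0][2]

reported = [row[4] for row in rows]
```

Here `ratios` reproduces the reported last column, `slopes` is the same value at every K, and `col3_growth` stays under 3 across a tenfold increase in K. This matches the narrative of a linear quantity outrunning a saturating one.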
Scope and caveats
This was never evidence against — and now it is not evidence for the failure mechanism either. The factorization was neither validated nor refuted. The original framing called it a regime-precondition violation (F4 broke at scale); the erratum demotes even that, because the operational read fell outside any valid regime when the kernel does not mix. No overclaim survives: not to the plateau (this was ), not to the training Gibbs kernel, and now not to "negative phase mixes well at scale."
What this feeds
The intended follow-up — give the positive phase a PT-grade estimator so also falls — is suspended pending a ladder that actually mixes the trained DTM. The real effect: this run sharpens Risk 4 (the operational-seam / mixing risk), now grounded in the corrected experiments/exp15-recheck-trained-weights/.
What this feeds: a sharpened Risk 4 and a gated re-run — the operational validation of stays [conjectured], behind a clean validation-positive that does not yet exist.