Exp 16 — Operational Validation (Withdrawn by Erratum)

An at-scale operational read that looked like a clean F4-fail turned out to never have been inside a valid regime — the PT chain it relied on does not mix the trained model.

This is the complete technical record for experiments/exp16-operational-validation/. Read it through its erratum. Here we keep the numbers, the gates, and exactly what is withdrawn.

The erratum, first

The run reused exp15's build_alpha_programs verbatim, and that builder has a bug: the per-replica PT local kernels were built from the init weights, while the swap energy used the trained weights. The cold replica of this inconsistent chain does not target $\pi_\theta$. So the run's central mechanism, "the R4-PT negative phase mixes near-ideally at scale ($MSE_{neg} \propto 1/K$, $F1_{median}\approx 0.99$), leaving the plain-Gibbs positive phase as the F4 bottleneck", rests on a chain that never mixed the trained DTM.
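Why the mismatch matters can be seen on a toy, not on the DTM. The sketch below is an illustration under loud assumptions: a three-state system, two replicas, uniform-proposal Metropolis local moves, and hypothetical energies `E_TRAINED` and `E_INIT` standing in for trained and init weights. None of these names or values come from build_alpha_programs. It computes the exact stationary cold marginal of one PT sweep (local, local, swap) by power iteration, once with consistent energies and once with the exp15-style mix (local kernel on init, swap on trained).

```python
import math
from itertools import product

def boltzmann(energy, beta=1.0):
    """Normalized Boltzmann weights exp(-beta * E) over a finite state space."""
    w = [math.exp(-beta * e) for e in energy]
    z = sum(w)
    return [v / z for v in w]

def metropolis_matrix(energy, beta):
    """Single-replica Metropolis kernel with a uniform proposal over all states."""
    n = len(energy)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                T[x][y] = min(1.0, math.exp(-beta * (energy[y] - energy[x]))) / n
        T[x][x] = 1.0 - sum(T[x])  # rejected or self-proposed mass stays put
    return T

def cold_marginal(e_local, e_swap, betas=(1.0, 0.5), iters=2000):
    """Stationary cold-replica marginal of a two-replica PT sweep.

    e_local drives the local kernels, e_swap drives the swap acceptance.
    A correct chain passes the same energy for both.
    """
    n = len(e_local)
    states = list(product(range(n), repeat=2))  # (cold state, hot state)
    index = {s: i for i, s in enumerate(states)}
    t_cold = metropolis_matrix(e_local, betas[0])
    t_hot = metropolis_matrix(e_local, betas[1])
    p = [1.0 / len(states)] * len(states)
    for _ in range(iters):
        q = [0.0] * len(states)          # local move on the cold replica
        for (x0, x1), i in index.items():
            for y0 in range(n):
                q[index[(y0, x1)]] += p[i] * t_cold[x0][y0]
        p, q = q, [0.0] * len(states)    # local move on the hot replica
        for (x0, x1), i in index.items():
            for y1 in range(n):
                q[index[(x0, y1)]] += p[i] * t_hot[x1][y1]
        p, q = q, [0.0] * len(states)    # swap move between the replicas
        for (x0, x1), i in index.items():
            a = min(1.0, math.exp((betas[0] - betas[1]) * (e_swap[x0] - e_swap[x1])))
            q[index[(x1, x0)]] += p[i] * a
            q[i] += p[i] * (1.0 - a)
        p = q
    return [sum(p[index[(x0, x1)]] for x1 in range(n)) for x0 in range(n)]

E_TRAINED = [0.0, 1.0, 2.0]  # hypothetical "trained" energies
E_INIT = [2.0, 1.0, 0.0]     # hypothetical "init" energies

target = boltzmann(E_TRAINED)
consistent = cold_marginal(E_TRAINED, E_TRAINED)
mixed = cold_marginal(E_INIT, E_TRAINED)  # local on init, swap on trained
err_consistent = max(abs(a - b) for a, b in zip(consistent, target))
err_mixed = max(abs(a - b) for a, b in zip(mixed, target))
```

On this toy the consistent chain recovers the trained target to numerical precision, while the mixed chain's cold marginal is visibly off. It says nothing quantitative about the DTM; it only shows that a swap rule on one energy cannot repair local kernels built on another.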

A measure-only re-check with a trained-weight refresh (real $t=200$) returns LADDER-INADEQUATE-TRAINED: with correct weights the R4 ladder fails to mix (swap-acceptance max 0.0099 against the band $[0.15, 0.60]$, round-trips 0), and that failure sits upstream of $T_O$ and $MSE_{neg}$. Therefore $MSE_{neg} \propto 1/K$ was an init-weight artifact, and the F4 ratio, the P4 drift, and the "negative phase passed at scale" table are withdrawn as trained-DTM evidence. Corrected ground truth lives in experiments/exp15-recheck-trained-weights/.
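The ladder check quoted above can be sketched as a small gate. This is a minimal illustration, not the re-check's actual code: the function name `ladder_adequate` and the exact rule (every adjacent-pair acceptance inside the band, and at least one round trip) are assumptions. The record itself gives only the maximum acceptance, the band, and the round-trip count.

```python
def ladder_adequate(pair_acceptances, round_trips, band=(0.15, 0.60)):
    """Assumed adequacy rule for a PT temperature ladder.

    pair_acceptances: swap-acceptance rates for adjacent replica pairs.
    round_trips: number of completed cold -> hot -> cold traversals.
    """
    lo, hi = band
    in_band = all(lo <= a <= hi for a in pair_acceptances)
    return in_band and round_trips > 0

# Reported trained-weight numbers: best pair at 0.0099, zero round trips.
# Passing only the maximum is enough here, since it is already below the band.
trained_ok = ladder_adequate([0.0099], round_trips=0)
```

With the reported numbers the gate fails on both counts, which is why the failure is read as upstream of any $MSE_{neg}$ measurement.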

What still stands: the A2 certificate, the convergence-certified empirical $g_{ref}$ (the Risk-3 / P1 reference work — that gate does not depend on the PT local kernel), and the measure-only discipline itself. No tag moves — this was always MEASURE-ONLY.

The question (as originally posed)

Does the operational factorization $Q_{op}\approx Q_{struct}^{\perp}$ hold at DTM scale, driven by exp15's PT cold chain? The conditional factorization is [solid]; the operational tier is [conjectured]. A validation-positive was the only outcome that could feed a researcher-conferred $\to$ validated. None of what follows changes that — but the run never produced one anyway.

The setup

H200 (Lightning.ai), 2026-06-17, EXP16_MODE=full, wall 1.349 GPU-h under the 3 h cap. backend=gpu, jax=0.10.1, patch_live=True, decision gate PROCEED, probe_rng_isolated=True. K-grid $K\in\{100, 300, 1000\}$, all $\gg \hat\tau^* = 0.55$. F4 is the registered subdominance leg: $F4 = MSE_{pos}/MSE_{neg}$ must be $\le 0.3$.

The result (now withdrawn)

The originally reported preconditions all "passed": the A2 self-adjoint check at $<10^{-10}$; calibration Cal-STABLE with $T_O^*=12662$ (rel $1.9\times10^{-4}$ vs exp15's $\approx 1.266\times10^4$); and the P1 reference converged to relchange $5.2\times10^{-5}$, cos $0.99999$, two-ref 0.0148, clearing the Risk-3 shared-bias check that exp3 could never reach ($relchange=0.234$, $cos\approx0.97$). The F4 table:

| K | $F4 = MSE_{pos}/MSE_{neg}$ | $Q_{op}$ | $Q_{struct}^{\perp}$ | $P4 = Q_{op}/Q_{struct}^{\perp}$ |
|---|---|---|---|---|
| 100 | 0.826 | 51.87 | 96.57 | 0.537 |
| 300 | 2.024 | 93.06 | 289.7 | 0.321 |
| 1000 | 5.391 | 146.7 | 965.7 | 0.152 |

The reading was: $MSE_{neg}\propto 1/K$ (PT near-ideal ESS), while plain-Gibbs $MSE_{pos}$ falls much more slowly, so F4 already exceeds the 0.3 bound at $K=100$ (0.826) and by $K=300$ the positive phase dominates ($F4 = 2.024 > 1$). P4 slid because $Q_{struct}^{\perp}=(K/2)\lVert g\rVert^2/T_O^*$ grows linearly in $K$ while measured $Q_{op}$ grows only sublinearly ($\approx K^{0.46}$), floored by the unmodeled positive phase. All of this assumed a mixing PT chain that the run did not have. The re-check shows the ladder does not mix the trained model, so the entire $MSE_{neg}$ branch is invalid.
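The arithmetic behind that reading can be re-derived from the table alone. The sketch below only recomputes from the rounded table values; the helper `loglog_slope` and the variable names are illustrative, and nothing here reinstates the withdrawn numbers as trained-DTM evidence.

```python
import math

# Values copied from the F4 table (withdrawn as trained-DTM evidence).
K = [100, 300, 1000]
F4 = [0.826, 2.024, 5.391]
Q_OP = [51.87, 93.06, 146.7]
Q_STRUCT = [96.57, 289.7, 965.7]
F4_BOUND = 0.3  # registered subdominance bound: F4 <= 0.3

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. a power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

p4 = [round(q / s, 3) for q, s in zip(Q_OP, Q_STRUCT)]  # P4 = Q_op / Q_struct
f4_pass = [f <= F4_BOUND for f in F4]                   # gate per K
slope_op = loglog_slope(K, Q_OP)          # measured Q_op exponent in K
slope_struct = loglog_slope(K, Q_STRUCT)  # structural prediction exponent in K
```

The P4 column reproduces to three decimals, the F4 gate fails at every $K$ on the grid, $Q_{struct}^{\perp}$ has a log-log slope close to 1, and the fitted $Q_{op}$ exponent lands between 0.4 and 0.5, in line with the quoted $\approx K^{0.46}$ given the table's rounding.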

Scope and caveats

This was never evidence against $Q_{op}\approx Q_{struct}^{\perp}$ — and now it is not evidence for the failure mechanism either. The factorization was neither validated nor refuted. The original framing called it a regime-precondition violation (F4 broke at scale); the erratum demotes even that, because the operational read fell outside any valid regime when the kernel does not mix. No overclaim survives: not to the $t=2000$ plateau (this was $t=200$ ), not to the training Gibbs kernel, and now not to "negative phase mixes well at scale."

What this feeds

The intended follow-up — give the positive phase a PT-grade estimator so $MSE_{pos}$ also falls $\approx 1/K$ — is suspended pending a ladder that actually mixes the trained DTM. The real effect: this run sharpens Risk 4 (the operational-seam / mixing risk), now grounded in the corrected experiments/exp15-recheck-trained-weights/.

In short: a sharpened Risk 4 and a gated re-run. The operational validation of $Q_{op}\approx Q_{struct}^{\perp}$ stays [conjectured], behind a clean $t=200$ validation-positive that does not yet exist.