Thermodynamic Machine Learning · MMXXVI
Experiment · 17.VI.MMXXVI · Read 4 min

Exp 16 — Operational Validation (Withdrawn by Erratum)

Entry 18

An at-scale operational read that looked like a clean F4-fail turned out to never have been inside a valid regime — the PT chain it relied on does not mix the trained model.

This is the complete technical record for experiments/exp16-operational-validation/. Read it through its erratum. Here we keep the numbers, the gates, and exactly what is withdrawn.

The erratum, first

The run reused exp15's build_alpha_programs verbatim, and that builder has a bug: the per-replica PT local kernels were built from the init weights while the swap energy used the trained weights. The cold replica of this inconsistent chain does not target $\pi_\theta$. So the run's central mechanism, "the R4-PT negative phase mixes near-ideally at scale ($MSE_{neg} \propto 1/K$, $F1_{median}\approx 0.99$), leaving the plain-Gibbs positive phase as the F4 bottleneck", rests on a chain that never mixed the trained DTM.
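To make the failure mode concrete, here is a minimal toy sketch, not the exp15 builder. It assumes a three-state system, two replicas, and an exact heat-bath local kernel, so the cold replica's law after one sweep is the local product law pushed through a single Metropolis swap. The energies `E_TRAINED` and `E_INIT`, the inverse temperatures, and all function names are hypothetical stand-ins chosen for illustration.

```python
import math
from itertools import product


def boltzmann(energy, beta):
    """Normalized Boltzmann weights exp(-beta * E) over a finite state space."""
    weights = [math.exp(-beta * e) for e in energy]
    z = sum(weights)
    return [w / z for w in weights]


def cold_marginal_after_sweep(kernel_energy, swap_energy, betas=(1.0, 0.5)):
    """Cold-replica law after a heat-bath local step followed by one swap proposal."""
    n = len(kernel_energy)
    b_cold, b_hot = betas
    p_cold = boltzmann(kernel_energy, b_cold)  # local kernel resamples the cold replica
    p_hot = boltzmann(kernel_energy, b_hot)    # local kernel resamples the hot replica
    cold = [0.0] * n
    for a, b in product(range(n), repeat=2):
        mass = p_cold[a] * p_hot[b]
        # Metropolis swap acceptance, evaluated with the swap energy
        acc = min(1.0, math.exp((b_cold - b_hot) * (swap_energy[a] - swap_energy[b])))
        cold[b] += mass * acc          # accepted: the cold replica now holds state b
        cold[a] += mass * (1.0 - acc)  # rejected: the cold replica keeps state a
    return cold


def total_variation(p, q):
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


E_TRAINED = [0.0, 1.0, 2.0]  # hypothetical trained energies
E_INIT = [0.0, 0.0, 0.0]     # hypothetical init energies (flat)

target = boltzmann(E_TRAINED, 1.0)  # what the cold replica should sample
tv_consistent = total_variation(cold_marginal_after_sweep(E_TRAINED, E_TRAINED), target)
tv_inconsistent = total_variation(cold_marginal_after_sweep(E_INIT, E_TRAINED), target)
```

When kernel and swap share one energy, the swap satisfies detailed balance against the product law and the cold marginal equals the target up to rounding. When they disagree, the swap step alone cannot pull the cold replica onto the trained distribution, and the total-variation gap stays well away from zero.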

A measure-only re-check with a trained-weight refresh (real $t=200$) returns LADDER-INADEQUATE-TRAINED: with correct weights the R4 ladder fails to mix, with swap-acceptance max 0.0099 against the band $[0.15, 0.60]$ and round-trips 0, a failure upstream of $T_O$ and $MSE_{neg}$. Therefore $MSE_{neg} \propto 1/K$ was an init-weight artifact, and the F4 ratio, the P4 drift, and the "negative phase passed at scale" table are withdrawn as trained-DTM evidence. Corrected ground truth lives in experiments/exp15-recheck-trained-weights/.
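The two ladder diagnostics quoted above can be sketched as follows. This is an illustrative guess at the shape of such a check, not the re-check's actual code: the function names and the rule "every adjacent-pair acceptance inside the band, plus at least one round trip" are assumptions. Only the band $[0.15, 0.60]$ and the reported values (max acceptance 0.0099, zero round trips) come from the record.

```python
def count_round_trips(rung_path, n_rungs):
    """Count cold -> hot -> cold round trips along one replica's rung trajectory."""
    trips = 0
    reached_hot = False
    for rung in rung_path:
        if rung == n_rungs - 1:
            reached_hot = True      # replica touched the hottest rung
        elif rung == 0 and reached_hot:
            trips += 1              # and came back to the coldest rung
            reached_hot = False
    return trips


def ladder_mixes(pair_accept, round_trips, band=(0.15, 0.60)):
    """Hypothetical adequacy rule: acceptances in band and at least one round trip."""
    lo, hi = band
    in_band = all(lo <= a <= hi for a in pair_accept)
    return in_band and round_trips > 0


# Values reported by the re-check: the best swap acceptance is 0.0099, round trips are 0.
recheck_ok = ladder_mixes([0.0099], round_trips=0)
```

With the reported numbers the check fails on both counts: the best acceptance sits far below the lower band edge, and no replica completes a round trip.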

What still stands: the A2 certificate, the convergence-certified empirical $g_{ref}$ (the Risk-3 / P1 reference work; that gate does not depend on the PT local kernel), and the measure-only discipline itself. No tag moves: this was always MEASURE-ONLY.

The question (as originally posed)

Does the operational factorization $Q_{op}\approx Q_{struct}^{\perp}$ hold at DTM scale, driven by exp15's PT cold chain? The conditional factorization is [solid]; the operational tier is [conjectured]. A validation-positive was the only outcome that could feed a researcher-conferred $\to$ validated. None of what follows changes that, but the run never produced one anyway.

The setup

H200 (Lightning.ai), 2026-06-17, EXP16_MODE=full, wall 1.349 GPU-h under the 3 h cap. backend=gpu, jax=0.10.1, patch_live=True, decision gate PROCEED, probe_rng_isolated=True. K-grid $K\in\{100, 300, 1000\}$, all $\gg \hat\tau^* = 0.55$. F4 is the registered subdominance leg: $F4 = MSE_{pos}/MSE_{neg}$ must be $\le 0.3$.

The result (now withdrawn)

The originally-reported preconditions all "passed": A2 self-adjoint check $<10^{-10}$; calibration Cal-STABLE with $T_O^*=12662$ (rel $1.9\times10^{-4}$ vs exp15's $\approx 1.266\times10^4$); the P1 reference converged to relchange $5.2\times10^{-5}$, cos $0.99999$, two-ref 0.0148, clearing the Risk-3 shared-bias check that exp3 could never reach ($relchange=0.234$, $cos\approx0.97$). The F4 table:

| $K$ | $F4 = MSE_{pos}/MSE_{neg}$ | $Q_{op}$ | $Q_{struct}^{\perp}$ | $P4 = Q_{op}/Q_{struct}^{\perp}$ |
|---|---|---|---|---|
| 100 | 0.826 | 51.87 | 96.57 | 0.537 |
| 300 | 2.024 | 93.06 | 289.7 | 0.321 |
| 1000 | 5.391 | 146.7 | 965.7 | 0.152 |
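As a sanity check on the table's internal arithmetic only (it says nothing about the trained DTM, since these numbers are withdrawn), the P4 column can be recomputed from the $Q$ columns and the F4 column compared with the registered threshold. The row values are copied from the table; the function and variable names are illustrative.

```python
F4_MAX = 0.3  # registered subdominance threshold: F4 must be <= 0.3

# (K, F4, Q_op, Q_struct_perp, reported P4), copied from the table
ROWS = [
    (100, 0.826, 51.87, 96.57, 0.537),
    (300, 2.024, 93.06, 289.7, 0.321),
    (1000, 5.391, 146.7, 965.7, 0.152),
]


def recheck(rows, f4_max=F4_MAX):
    """Recompute P4 = Q_op / Q_struct_perp and evaluate the F4 gate per row."""
    report = []
    for k, f4, q_op, q_struct, p4_reported in rows:
        p4 = q_op / q_struct
        report.append({
            "K": k,
            "P4": round(p4, 3),
            "P4_matches": abs(p4 - p4_reported) < 5e-4,  # agrees to 3 decimals
            "F4_pass": f4 <= f4_max,
        })
    return report


report = recheck(ROWS)
```

Every recomputed P4 agrees with the reported column to three decimals, and no row clears the F4 gate.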

The reading was: $MSE_{neg}\propto 1/K$ (PT near-ideal ESS), while plain-Gibbs $MSE_{pos}$ falls much more slowly, so by $K=300$ the positive phase dominates ($F4>1$), with $F4$ above the 0.3 threshold at every $K$ in the table. P4 slid because $Q_{struct}^{\perp}=(K/2)\lVert g\rVert^2/T_O^*$ grows linearly in $K$ while measured $Q_{op}$ saturates ($\approx K^{0.46}$), floored by the unmodeled positive phase. All of this assumed a mixing PT chain that the run did not have. The re-check shows the ladder does not mix the trained model, so the entire $MSE_{neg}$ branch is invalid.
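The two growth rates in that reading can be checked against the table with a log-log least-squares fit. This is a sketch over only three grid points, so the fitted exponent is rough; the data are the withdrawn table values and the helper name is illustrative.

```python
import math


def loglog_slope(ks, values):
    """Least-squares slope of log(value) against log(K), i.e. a power-law exponent."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


K_GRID = [100, 300, 1000]
Q_OP = [51.87, 93.06, 146.7]       # measured Q_op from the table
Q_STRUCT = [96.57, 289.7, 965.7]   # Q_struct_perp from the table

slope_op = loglog_slope(K_GRID, Q_OP)          # sublinear, in the neighborhood of the quoted exponent
slope_struct = loglog_slope(K_GRID, Q_STRUCT)  # essentially 1, i.e. linear in K
```

The gap between the two exponents is what drives P4 downward across the grid: the denominator grows linearly while the numerator grows roughly like the square root of $K$.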

Scope and caveats

This was never evidence against $Q_{op}\approx Q_{struct}^{\perp}$, and now it is not evidence for the failure mechanism either. The factorization was neither validated nor refuted. The original framing called it a regime-precondition violation (F4 broke at scale); the erratum demotes even that, because the operational read fell outside any valid regime when the kernel does not mix. No overclaim survives: not to the $t=2000$ plateau (this was $t=200$), not to the training Gibbs kernel, and now not to "negative phase mixes well at scale."

What this feeds

The intended follow-up, giving the positive phase a PT-grade estimator so $MSE_{pos}$ also falls $\approx 1/K$, is suspended pending a ladder that actually mixes the trained DTM. The real effect: this run sharpens Risk 4 (the operational-seam / mixing risk), now grounded in the corrected experiments/exp15-recheck-trained-weights/.


What this feeds: a sharpened Risk 4 and a gated re-run. The operational validation of $Q_{op}\approx Q_{struct}^{\perp}$ stays [conjectured], behind a clean $t=200$ validation-positive that does not yet exist.

— fin. —