The first GPU-scale run of the operational-tracking test, where a real MNIST model's glacial mixing turned the headline prediction into an untested — not a confirmed or refuted — claim.
The question
Does the computable predictor track the operational gradient SNR at scale, on a genuinely trained model rather than a toy substrate? The factorization conjecture says inside the assumption package; the prior experiments lived on exactly-diagonalizable substrates. Exp 3 is the first attempt to read the ratio off a real Denoising Thermodynamic Model.
The setup
One rented H100 80GB (Lightning Studio), dtm-replication @ 7c22d19 plus a CPU-audit patch, EXP3_MODE=full. Wall time 24,028 s ≈ 6.67 h, inside the declared 8 GPU-h budget. The full record is in experiments/exp3-htdml-embedding/ (report.md, results_full.json, exp3_full.log).
Scope (re-freeze 2026-05-30c): predictions P1–P5 on the single default 60_12 graph, single-input conditional — fixed seed-0 input, all 32 chains share it, varying only MC seed and hidden init. epochs gradient updates. The negative kernel is deterministic alternating-scan block-Gibbs (neg_kernel.scan), i.e. A2-excluded / F3-regime: every tracking number sits outside the proof-sketch reversibility regime regardless of verdict. The HTDML ladder and P6/P7 were deferred — this substrate has no NN embedding to attach them to.
The result
The binding fact is mixing. The MNIST DTM negative kernel is extremely slow: , essentially constant across ( mean ). So the frozen and the largest memory-feasible TRAJ_LEN=3000 / warmup=1000 are deeply under-equilibrated: and , so traj_adequate_50tau and burn_adequate_5tau are False at every checkpoint.
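A minimal sketch of how the two adequacy flags could be evaluated. It assumes, from the flag names alone and not from the experiment code, that traj_adequate_50tau requires the trajectory to span at least 50 autocorrelation times and burn_adequate_5tau requires the warmup to span at least 5. The function names, argument names, and the τ values passed in are hypothetical and are not measurements from this run.

```python
def equilibration_flags(tau, traj_len=3000, warmup=1000,
                        traj_factor=50, burn_factor=5):
    """Return the two adequacy flags for an autocorrelation time `tau`.

    `tau`, `traj_len` and `warmup` are in the same unit (sampler sweeps).
    The 50x / 5x factors are an assumption read off the flag names.
    """
    return {
        "traj_adequate_50tau": traj_len >= traj_factor * tau,
        "burn_adequate_5tau": warmup >= burn_factor * tau,
    }


def max_adequate_tau(traj_len=3000, warmup=1000,
                     traj_factor=50, burn_factor=5):
    """Largest tau for which both flags would still be True."""
    return min(traj_len / traj_factor, warmup / burn_factor)


if __name__ == "__main__":
    # Hypothetical tau values, chosen only to exercise both thresholds.
    for tau in (10, 100, 1000):
        print(tau, equilibration_flags(tau))
    print("largest tau passing both checks:", max_adequate_tau())
```

Under this assumed reading, TRAJ_LEN=3000 can satisfy the trajectory criterion only for τ up to 60 sweeps, and warmup=1000 can satisfy the burn-in criterion only for τ up to 200 sweeps.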
In that regime the predictor does not track. Because by construction, while the empirical is near-flat / sub-linear in (), the ratio climbs monotonically with :
| checkpoint | | | |
|---|---|---|---|
| | 0.37 | 0.99 | 2.94 |
| | 0.54 | 1.41 | 3.18 |
| | 0.68 | 2.13 | 6.05 |
| | 0.60 | 1.84 | 5.46 |
Tracking holds inside the band only at and busts at (ratio up to 6). The P3a-median is in-band only by averaging across — an artifact, not a pass. Two further degraders, both pre-registered: kref_reference_unequilibrated_risk=True (the -reference moved 23%, kref_g_relchange — Risk 3 circumvented, not closed); and truncation inflating . Verdict: P5 weak / under-powered / partially Risk-3-live — not a clean at-scale tracking confirmation.
The cleanly-measured result is P2 (F4): the positive (data) phase is strongly subdominant. Median (per-checkpoint ); . The negative phase dominates the estimator variance, as the Q-program assumes. → Risk 4 sharpened (stays open).
The rest: P1 NOT PASSED — CLT-sanity median , growing with ( at ), violated upward by autocorrelation truncation, a regime failure not a formula failure (the estimator was validated against AR(1) in review). P3a PASS but only as a mean- statement: at all checkpoints, yet , so the slow modes are not in the fixed-θ regime. P4 clause-1 NOT PASSED via the structurally over-stable CDF proxy (card_rel_change median ; is of observables — no low-rank structure); clause-2 is construction-confirmed, not ε-measured.
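To illustrate what validating an autocorrelation estimator against AR(1) can look like, here is a hedged sketch. It is not the estimator used in this experiment. It uses a standard self-consistent window (stop summing at the first lag M with M ≥ c·τ) and compares against the AR(1) closed form τ_int = (1 + φ)/(1 − φ), under the convention τ_int = 1 + 2 Σ ρ(t). All function names, the window constant c, and the chosen φ are assumptions made for illustration.

```python
import numpy as np


def ar1_chain(phi, n, rng):
    """Stationary AR(1) chain x[t] = phi * x[t-1] + eps[t], unit-variance noise."""
    x = np.empty(n)
    x[0] = rng.standard_normal() / np.sqrt(1.0 - phi**2)  # stationary start
    eps = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x


def ar1_tau_int(phi):
    """Closed-form integrated autocorrelation time of AR(1)."""
    return (1.0 + phi) / (1.0 - phi)


def integrated_autocorr_time(x, c=5.0):
    """Estimate tau_int = 1 + 2 * sum_t rho(t) with a self-consistent window."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Autocovariance via zero-padded FFT (biased normalisation by n).
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(f * np.conj(f))[:n] / n
    rho = acov / acov[0]
    tau = 1.0
    for m in range(1, n):
        tau += 2.0 * rho[m]
        if m >= c * tau:  # window has reached c times the running estimate
            break
    return tau


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = 0.9
    chain = ar1_chain(phi, 200_000, rng)
    print("closed form:", ar1_tau_int(phi))
    print("estimate   :", integrated_autocorr_time(chain))
```

A check of this kind only validates the formula on chains much longer than τ. It says nothing about how the estimate behaves when the trajectory is short relative to the autocorrelation time, which is the regime this run was in.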
Scope and caveats
This does not show that the factorization fails — it shows the test was equilibration-limited. A faithful at-scale read needs and chain lengths (a re-freeze plus heavier compute), out of scope here. It is also single-input conditional only, not batch-operational; per-input sensitivity is a registered follow-up. And it is in the A2-excluded alternating-scan regime, outside the proof sketch. One suggestive aside, against the forecaster: the exp1/exp2 observable-orthogonality mechanism partially transfers (, gradient largely orthogonal to the temporal slow mode; ) — but under heavy under-equilibration the lag-1 surrogate is unreliable, so suggestive, not conclusive. No tag flip in any outcome. The factorization and stay [conjectured].
What this feeds: the at-scale claim is neither confirmed nor refuted — untested at adequate equilibration — which motivates the reversible-kernel mixing investigation and a future re-freeze; P2's clean PASS sharpens Risk 4.