Exp 3 — At Scale: The Equilibration Limit · Thermodynamic Machine Learning

The first GPU-scale run of the operational-tracking test, where a real MNIST model's glacial mixing left the headline prediction untested: neither confirmed nor refuted.

The question

Does the computable predictor $Q_{struct}^{\perp}$ track the operational gradient SNR $Q_{op}$ at scale, on a genuinely trained model rather than a toy substrate? The factorization conjecture says $Q_{op} \approx Q_{struct}^{\perp}$ inside the assumption package; the prior experiments lived on exactly-diagonalizable substrates. Exp 3 is the first attempt to read the ratio off a real Denoising Thermodynamic Model.

The setup

One rented H100 80GB (Lightning Studio), dtm-replication @ 7c22d19 plus a CPU-audit patch, EXP3_MODE=full. Wall time 24,028 s ≈ 6.67 h, inside the declared 8 GPU-h budget. The full record is in experiments/exp3-htdml-embedding/ (report.md, results_full.json, exp3_full.log).

Scope (re-freeze 2026-05-30c): predictions P1–P5 on the single default 60_12 graph, single-input conditional $\pi_\theta(\cdot\,|\,\tilde{x})$ — fixed seed-0 input, all 32 chains share it, varying only MC seed and hidden init. $T = 2000$ epochs $= 122{,}000$ gradient updates. The negative kernel is deterministic alternating-scan block-Gibbs (neg_kernel.scan), i.e. A2-excluded / F3-regime: every tracking number sits outside the proof-sketch reversibility regime regardless of verdict. The HTDML ladder and P6/P7 were deferred — this substrate has no NN embedding to attach them to.

The result

The binding fact is mixing. The MNIST DTM negative kernel is extremely slow: $\tau_{max} \approx 486\text{–}500$, essentially constant across $t = 200 \to 2000$ ($\tau_f$ mean $24\text{–}48$). So the frozen $K_{ref} = 4000$ and the largest memory-feasible TRAJ_LEN=3000 / warmup=1000 are deeply under-equilibrated: $50\,\tau \approx 25{,}000 \gg 3000$ and $5\,\tau \approx 2500 > 1000$, so traj_adequate_50tau and burn_adequate_5tau are False at every checkpoint.
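The adequacy arithmetic above can be sketched in a few lines. This is a minimal illustration, not the experiment's code: the function name, the `>=` comparison, and the dictionary layout are assumptions; only the flag names, the 50τ / 5τ factors, and the numbers come from the text.

```python
def equilibration_flags(tau_max, traj_len, warmup, traj_factor=50, burn_factor=5):
    """Compare chain length and burn-in against multiples of the slowest
    autocorrelation time. Hypothetical helper; flag names mirror the text."""
    traj_needed = traj_factor * tau_max
    burn_needed = burn_factor * tau_max
    return {
        "traj_needed": traj_needed,
        "burn_needed": burn_needed,
        "traj_adequate_50tau": traj_len >= traj_needed,
        "burn_adequate_5tau": warmup >= burn_needed,
    }


# Both ends of the reported tau_max range, with the run's TRAJ_LEN and warmup.
for tau_max in (486, 500):
    print(tau_max, equilibration_flags(tau_max, traj_len=3000, warmup=1000))
```

Under these assumptions both flags come out False across the whole reported $\tau_{max}$ range, and the chain length would need to grow by roughly a factor of eight before the 50τ criterion could pass.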

In that regime the predictor does not track. Because $Q_{struct}^{\perp} = (K/2)\|g\|^2 / T_O \propto K$ by construction, while the empirical $Q_{op} = \|g\|^2 / \mathbb{E}\|\hat g(K) - g\|^2$ is near-flat / sub-linear in $K$ ($Q_{op} \approx 0.60\text{–}1.05$), the ratio $Q_{struct}^{\perp}/Q_{op}$ climbs monotonically with $K$:

| checkpoint | $K{=}100$ | $K{=}300$ | $K{=}1000$ |
|---|---|---|---|
| $t{=}200$ | 0.37 | 0.99 | 2.94 |
| $t{=}500$ | 0.54 | 1.41 | 3.18 |
| $t{=}1000$ | 0.68 | 2.13 | 6.05 |
| $t{=}2000$ | 0.60 | 1.84 | 5.46 |
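Why the ratio grows with $K$ follows from dividing the two definitions quoted above. The symbol $R(K)$ is introduced here only for illustration and is not notation from the experiment record:

$$
R(K) \;\equiv\; \frac{Q_{struct}^{\perp}}{Q_{op}}
\;=\; \frac{(K/2)\,\|g\|^2 / T_O}{\|g\|^2 / \mathbb{E}\|\hat g(K) - g\|^2}
\;=\; \frac{K\,\mathbb{E}\|\hat g(K) - g\|^2}{2\,T_O}.
$$

If the estimator error fell as $\mathbb{E}\|\hat g(K) - g\|^2 \propto 1/K$, as it would for a well-equilibrated chain, $R(K)$ would be constant in $K$. When the error is nearly flat in $K$ instead, $R(K)$ grows roughly in proportion to $K$, which matches the pattern in the table: a tenfold increase in $K$ raises the ratio by about six to nine times.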

Tracking holds inside the $c=3$ band $[1/3, 3]$ only at $K \approx 100\text{–}300$ and leaves it at $K=1000$ for every checkpoint after $t{=}200$ (ratio up to 6.05; the $t{=}200$ value of 2.94 sits just under the upper edge). The P3a-median $Q_{struct}^{\perp}/Q_{op} = 1.624 \in [1/3, 3]$ is in-band only because it pools across $K$: an artifact, not a pass. Two further degraders, both pre-registered: kref_reference_unequilibrated_risk=True (the $t{=}500$ $g$-reference moved 23%, kref_g_relchange $= 0.026/0.234/0.144/0.100$; Risk 3 circumvented, not closed); and $\tau_{int}$ truncation inflating $Q_{struct}^{\perp}$. Verdict: P5 weak / under-powered / partially Risk-3-live, not a clean at-scale tracking confirmation.
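The pooling effect can be checked directly from the table. The sketch below is an illustration under stated assumptions: it uses the two-decimal table values rather than the unrounded results, and it assumes the reported median is taken over all twelve checkpoint-by-$K$ cells, which the text does not spell out. All variable and function names are hypothetical.

```python
from statistics import median

# Ratios Q_struct_perp / Q_op copied from the table (rounded to 2 decimals).
# Outer key: checkpoint t; inner key: K.
RATIOS = {
    200:  {100: 0.37, 300: 0.99, 1000: 2.94},
    500:  {100: 0.54, 300: 1.41, 1000: 3.18},
    1000: {100: 0.68, 300: 2.13, 1000: 6.05},
    2000: {100: 0.60, 300: 1.84, 1000: 5.46},
}
C = 3.0  # band factor: in-band means 1/C <= ratio <= C


def in_band(ratio, c=C):
    """True when the ratio lies inside [1/c, c]."""
    return 1.0 / c <= ratio <= c


ks = (100, 300, 1000)

# Median over all twelve cells (assumed pooling scheme).
pooled_median = median(r for row in RATIOS.values() for r in row.values())

# Per-K view: band membership at each checkpoint, and the per-K median.
per_k_in_band = {k: [in_band(RATIOS[t][k]) for t in RATIOS] for k in ks}
per_k_median = {k: median(RATIOS[t][k] for t in RATIOS) for k in ks}

print("pooled median:", round(pooled_median, 3), "in band:", in_band(pooled_median))
for k in ks:
    print(f"K={k}: median={per_k_median[k]:.3f}, in band per checkpoint={per_k_in_band[k]}")
```

With the rounded table values the pooled median comes out near 1.625, consistent with the reported 1.624, while the $K=1000$ column on its own has a median well above 3. Splitting by $K$ rather than pooling is what exposes the trend.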

The cleanly-measured result is P2 (F4): the positive (data) phase is strongly subdominant. Median $F4_{ratio} = MSE_{pos}/MSE_{neg} = 0.046 \le 0.3$ (per-checkpoint $0.032/0.030/0.069/0.059$); $\text{mse}_{pos} \approx 0.004\text{–}0.011 \ll \text{mse}_{neg} \approx 0.11\text{–}0.16$. The negative phase dominates the estimator variance, as the Q-program assumes. → Risk 4 sharpened (stays open).

The rest: P1 NOT PASSED, with CLT-sanity median $2.125$, growing with $K$ ($1.17/2.65/7.49$ at $t{=}2000$), violated upward by autocorrelation truncation: a regime failure, not a formula failure (the estimator was validated against AR(1) in review). P3a PASS but only as a mean-$\tau_{int}$ statement: $F5_{ratio} \le 0.1$ at all checkpoints, yet $\|\Delta\theta\|\cdot\tau_{max}/K_{train} \approx 0.40\cdot500/400 \approx 0.50 > 0.1$, so the $\tau{\sim}500$ slow modes are not in the fixed-θ regime. P4 clause-1 NOT PASSED via the structurally over-stable CDF proxy (card_rel_change median $0.559 > 0.5$; $|C^*|$ is $\sim 7000\text{–}12500$ of $25{,}200$ observables, so no low-rank structure); clause-2 is construction-confirmed, not ε-measured.
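To illustrate how truncation alone can bias an autocorrelation-based variance estimate, consider an AR(1) process, whose autocorrelation is $\rho(k) = \phi^k$ and whose integrated autocorrelation time is $1 + 2\sum_{k \ge 1} \phi^k = (1+\phi)/(1-\phi)$. The sketch below compares that exact value with the same sum cut off at a finite lag. It is a toy model, not the experiment's estimator: treating $\tau \approx 500$ as an exponential autocorrelation time, the window sizes, and all names are assumptions made for illustration.

```python
import math


def ar1_tau_int(phi):
    """Exact integrated autocorrelation time 1 + 2*sum_{k>=1} phi**k of AR(1)."""
    return (1.0 + phi) / (1.0 - phi)


def ar1_tau_int_truncated(phi, window):
    """The same sum cut off at lag `window` (closed-form geometric partial sum)."""
    return 1.0 + 2.0 * phi * (1.0 - phi ** window) / (1.0 - phi)


def captured_fraction(phi, window):
    """Share of the true tau_int that a lag window of this size recovers."""
    return ar1_tau_int_truncated(phi, window) / ar1_tau_int(phi)


TAU_EXP = 500.0                  # assumed exponential autocorrelation time
PHI = math.exp(-1.0 / TAU_EXP)   # AR(1) coefficient giving rho(k) = exp(-k / TAU_EXP)

# Hypothetical window sizes, from far too short to the 50*tau scale.
for window in (100, 500, 2500, 25_000):
    print(f"window={window:>6}: captured fraction = {captured_fraction(PHI, window):.3f}")
```

In this toy model a window of about one $\tau$ recovers under two thirds of the true $\tau_{int}$, and the shortfall only becomes negligible at windows of several $\tau$. An underestimated $\tau_{int}$ makes a CLT variance prediction too small, which is the direction of the P1 violation described above.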

Scope and caveats

This does not show that the factorization fails; it shows the test was equilibration-limited. A faithful at-scale read needs $K_{ref}$ and chain lengths $\gtrsim 50\,\tau \approx 25{,}000$ (a $K_{ref}$ re-freeze plus heavier compute), out of scope here. It is also single-input conditional only, not batch-operational; per-input sensitivity is a registered follow-up. And it is in the A2-excluded alternating-scan regime, outside the proof sketch. One suggestive aside, against the forecaster: the exp1/exp2 observable-orthogonality mechanism partially transfers ($r_{yy} = 0.123/0.056/0.032/0.024$, gradient largely orthogonal to the temporal slow mode; $\tau_m/\tau_f > 2.5$), but under heavy under-equilibration the lag-1 surrogate is unreliable, so this is suggestive, not conclusive. No tag flip in any outcome. The factorization and $Q_{struct}^{\perp}$ stay [conjectured].

What this feeds: the at-scale $Q_{op} \approx Q_{struct}^{\perp}$ claim is neither confirmed nor refuted — untested at adequate equilibration — which motivates the reversible-kernel mixing investigation and a future $K_{ref}$ re-freeze; P2's clean PASS sharpens Risk 4.