Thermodynamic Machine Learning · MMXXVI
Foundations18.VI.MMXXVIRead 6 min

The Trainability Theorem (the spine)

Entry 19

This is the spine of the program: the precise statement of the thermodynamic-trainability quantity QQ, the corrected observable-projected target it factorizes into, and the discipline that keeps a proven conditional from being mistaken for a validated law.

The question

For one reverse-process EBM layer of a Denoising Thermodynamic Model (DTM) — energy EθE_\theta, model law πθ(x)=eEθ(x)/Zθ\pi_\theta(x) = e^{-E_\theta(x)}/Z_\theta on x{1,1}Nx \in \{-1,1\}^N, sampled by a reversible Gibbs kernel PθP_\theta — does it train? The training gradient component (DTM paper Eq. (14)) is a data-minus-model difference ga:=Edata[fa]Eπθ[fa]g_a := \mathbb{E}_{\text{data}}[f_a] - \mathbb{E}_{\pi_\theta}[f_a] of the per-parameter observable fa:=aEθf_a := \partial_a E_\theta. All the mixing pain is in the model (negative) phase, estimated by Gibbs sampling. The question is whether one scalar predicts when that estimate is too noisy to descend on.

The setup

What you measure is the operational quantity — the gradient SNR, squared:

Qop:=g2/Eg^g2.Q_{op} := \lVert g \rVert^2 \,/\, \mathbb{E}\lVert \hat{g} - g \rVert^2.

This carries the [solid] tag. Qop1Q_{op} \gg 1 means SGD descends; Qop1Q_{op} \lesssim 1 is the training plateau of DTM paper Fig. 5(b). Crucially this is an estimation plateau — the recoverability of gg collapses, the estimator MSE swamps the signal — not signal-extinction. That is the deliberate contrast with the quantum barren plateau, where the true gradient variance itself vanishes. Same phenomenology, different mechanism.

The estimator is canonical burn-in BB + window-KK: discard BB steps, average faf_a over the next KK. Standard reversible-MCMC machinery (Levin–Peres 2017, Lemma 12.2 / Thm 12.21) then gives a geometric bias (1γ)B\sim (1-\gamma)^B and a variance Var[g^a](2wa/(γK))Varπ[fa]\mathrm{Var}[\hat g_a] \approx (2 w_a/(\gamma K))\,\mathrm{Var}_\pi[f_a], with slow-mode weight waw_a, gap γ:=1σ2\gamma := 1-\sigma_2, and ESSaγK/(2wa)\mathrm{ESS}_a \approx \gamma K/(2 w_a).

The result — and why the obvious form is wrong

The historical factorization, [conjectured], was the single-gap product

Qstruct(θ,K):=γθK2R(θ),R(θ):=g2awaVarπ[fa],Q_{struct}(\theta, K) := \frac{\gamma_\theta K}{2}\,R(\theta), \qquad R(\theta) := \frac{\lVert g \rVert^2}{\sum_a w_a \mathrm{Var}_\pi[f_a]},

read as ESS×R\mathrm{ESS} \times R, computable without training to convergence and differentiable in the couplings JJ (the HTDML property). It is superseded as written. Exact diagonalization (experiments/exp1-exact-diag/) and the RBM smoke test (experiments/exp2-thrml-smoke/) showed the γ=1σ2\gamma = 1-\sigma_2 anchor is the wrong object for two model-driven reasons (exp2 reproduced both under block-Gibbs, so they are not single-site artifacts):

  1. Mis-anchored. Under the Z2\mathbb{Z}_2 spin-flip symmetry of the b=0b=0 EBMs, the slowest mode φ2\varphi_2 is odd and exactly orthogonal to the even gradient observables fij=xixjf_{ij} = -x_i x_j — so f^a,220\hat f_{a,2}^2 \approx 0 (overlap 3.5×1017\le 3.5\times 10^{-17} in exp2) and the single-γ\gamma RR is a divide-by-symmetry-zero, over-predicting QopQ_{op} by 1026\sim 10^{26}103010^{30}.
  2. Clustered. The observable-relevant slow structure clusters; a single relevant gap tracks QopQ_{op} in 23/48 exp1 cells, the multi-mode correction in 45/48.

The corrected target is QopQstructQ_{op} \approx Q_{struct}^{\perp} (observable-projected, multi-mode). It restricts attention to the modes the gradient sees, C(O):={j2:af^a,j2>εaVarπ[fa]}C^*(O) := \{j \ge 2 : \sum_a \hat f_{a,j}^2 > \varepsilon \sum_a \mathrm{Var}_\pi[f_a]\}, builds the aggregate timescale TO:=aτint[fa]Varπ[fa]T_O := \sum_a \tau_{int}[f_a]\,\mathrm{Var}_\pi[f_a] (half-Sokal τint\tau_{int}), and the harmonic-mean gap γeff\gamma_{eff} over C(O)C^*(O). The predictor is

Qstruct(θ,K):=K2g2TO=γeffK2Reff,Reff:=g2ajC(O)f^a,j2.Q_{struct}^{\perp}(\theta, K) := \frac{K}{2}\frac{\lVert g \rVert^2}{T_O} = \frac{\gamma_{eff} K}{2}\,R_{eff}, \qquad R_{eff} := \frac{\lVert g \rVert^2}{\sum_a \sum_{j \in C^*(O)} \hat f_{a,j}^2}.

Scope and caveats — the two-tier tag

This is the precise part. The factorization carries a split tag (researcher-conferred 2026-06-01):

  • Conditional factorization — [solid]. In regime A1–A8 + plateau (γeff0\gamma_{eff} \to 0) + F4, Qop=Qstruct(1+o(1))Q_{op} = Q_{struct}^{\perp}(1+o(1)). This is a written proof across six obligations O1–O6, each adversarially verified. O1.c is flipped to proven-here — the wiki's first and only terminal-tagged block (the projection-vs-conditioning SNR invariance). O2–O6 are [solid] assemblies closing to Levin–Peres 2017 + Younes 1999 (whose §7 asymptotic-variance object S(θ)S(\theta) grounds what TOT_O is, no more). The numerical chain needs no A9.
  • Operational / unconditional claim — [conjectured]. "QopQstructQ_{op} \approx Q_{struct}^{\perp} on a real DTM at the KK one runs" is gated on A7 (overlapping-bulk relaxation) and KτintK \gg \tau_{int} — both assumed. The conditional is vacuous on a real DTM's plateau until those gates are met.

Neither tier is validated. The supporting evidence is construction-confirmation on small/moderate controlled models: exp1 (45/48 cells) + exp2 (92–99% across m=416m=4\to 16, both kernels). At scale it does not hold up: experiments/exp3-htdml-embedding/ on the real 60_12 MNIST DTM is untested at adequate equilibration (τmax486\tau_{max} \approx 486500500, 50τ25,00050\tau \approx 25{,}000 \gg the feasible window; the linear-in-KK predictor over-predicts, Qstruct/QopQ_{struct}^{\perp}/Q_{op} reaches \sim5–6 at K=1000K=1000). The re-freeze on a reversible kernel (experiments/exp4-reversible-50tau/) found τ^\hat\tau UNRESOLVEDτmaxL\tau_{max} \propto L (166.7 to 5,280 over L=1,00032,000L=1{,}000\to 32{,}000, τ/L0.16\tau/L \approx 0.16 constant) — so KτintK \gg \tau_{int} is unreachable, not merely unmet (P0-HALT). exp6 reproduced τL\tau \propto L at every checkpoint t{25,50,100,200}t \in \{25,50,100,200\}.

The honest reading is the A2 ↔ A6 structural-obstruction observation: reversibility (A2, load-bearing for O4/O5) and KτintK \gg \tau_{int} (A6) are not contradictory, but they are operationally antagonistic in exactly the multimodal/plateau regime the theorem is about — satisfying A2 pushes the sampler toward the slow plateau where A6 is hardest. Whether this is fundamental or scale-dependent stays open: at small scale it is escapable (exp7 found a crossable sweet spot, 25/64 cells; exp12 cut τmax\tau_{max} 14–22×\times with reversible parallel tempering), but the trained-DTM PT run (exp16) was withdrawn for an init-weight kernel bug, so the at-scale reversal is not established.

The six-entry risk ledger records each threat [open] with a mitigation: (1) slow-mode cluster (the big one — fires, drove the correction); (2) differentiability of γ\gamma through the embedding (O5.c partial — γeff\gamma_{eff} is smooth across within-cluster crossings via the cluster Riesz projector, but the C(O)C^*(O)-membership boundary stays open); (3) QopQ_{op} circularity at the plateau (circumvented, live where the reference is unconverged — exp3's KrefK_{ref} moved \sim23%); (4) positive phase not free (confirmed subdominant, F4F4 median 0.046 on exp3; the at-scale reversal withdrawn); (5) the regime gate (binding; τL\tau \propto L may make it unsatisfiable at scale); (6) canonical-estimator choice (burn-in + window fixed; PCD sits outside as a moving-target tracking-lag family).

What this feeds

Everything else in the program feeds or tests this page: the MET corollary reads its three Ragone-shaped factors (γ0\gamma \to 0, R0R \to 0, budget starvation) through the corrected projected quantities; the HTDML objective (exp8/exp9/exp11) exercises the differentiable-in-JJ property; the at-scale validation path (exp14 → a GPU DTM PT P0) is the still-pending route to → validated.


What this feeds: the spine that every literature bridge, parent translation, and experiment ladder either supports or attacks — and the standing open step is an at-scale run with A7 + KτintK \gg \tau_{int} met on a reversible kernel.

Sources

  • Jelinčič et al. 2025, Denoising Thermodynamic Models, arXiv:2510.23972 (Eq. (14), Fig. 5b, App. E/G/H).
  • Levin & Peres 2017, Markov Chains and Mixing Times (Lemma 12.2, Thm 12.21).
  • Younes 1999, Markovian stochastic-approximation convergence (§7, Thm 3 — the S(θ)S(\theta) variance object).
  • Hinton 2002, contrastive divergence (the data-minus-model gradient lineage).
— fin. —