The Trainability Theorem (the spine)

This is the spine of the program: the precise statement of the thermodynamic-trainability quantity $Q$ , the corrected observable-projected target it factorizes into, and the discipline that keeps a proven conditional from being mistaken for a validated law.

The question

For one reverse-process EBM layer of a Denoising Thermodynamic Model (DTM) — energy $E_\theta$ , model law $\pi_\theta(x) = e^{-E_\theta(x)}/Z_\theta$ on $x \in \{-1,1\}^N$ , sampled by a reversible Gibbs kernel $P_\theta$ — does it train? The training gradient component (DTM paper Eq. (14)) is a data-minus-model difference $g_a := \mathbb{E}_{\text{data}}[f_a] - \mathbb{E}_{\pi_\theta}[f_a]$ of the per-parameter observable $f_a := \partial_a E_\theta$ . All the mixing pain is in the model (negative) phase, estimated by Gibbs sampling. The question is whether one scalar predicts when that estimate is too noisy to descend on.

The setup

What you measure is the operational quantity — the gradient SNR, squared:

$Q_{op} := \lVert g \rVert^2 \,/\, \mathbb{E}\lVert \hat{g} - g \rVert^2.$

This carries the [solid] tag. $Q_{op} \gg 1$ means SGD descends; $Q_{op} \lesssim 1$ is the training plateau of DTM paper Fig. 5(b). Crucially this is an estimation plateau — the recoverability of $g$ collapses, the estimator MSE swamps the signal — not signal-extinction. That is the deliberate contrast with the quantum barren plateau, where the true gradient variance itself vanishes. Same phenomenology, different mechanism.

The estimator is canonical burn-in $B$ + window- $K$ : discard $B$ steps, average $f_a$ over the next $K$ . Standard reversible-MCMC machinery (Levin–Peres 2017, Lemma 12.2 / Thm 12.21) then gives a geometric bias $\sim (1-\gamma)^B$ and a variance $\mathrm{Var}[\hat g_a] \approx (2 w_a/(\gamma K))\,\mathrm{Var}_\pi[f_a]$ , with slow-mode weight $w_a$ , gap $\gamma := 1-\sigma_2$ , and $\mathrm{ESS}_a \approx \gamma K/(2 w_a)$ .

The result — and why the obvious form is wrong

The historical factorization, [conjectured], was the single-gap product

$Q_{struct}(\theta, K) := \frac{\gamma_\theta K}{2}\,R(\theta), \qquad R(\theta) := \frac{\lVert g \rVert^2}{\sum_a w_a \mathrm{Var}_\pi[f_a]},$

read as $\mathrm{ESS} \times R$ , computable without training to convergence and differentiable in the couplings $J$ (the HTDML property). It is superseded as written. Exact diagonalization (experiments/exp1-exact-diag/) and the RBM smoke test (experiments/exp2-thrml-smoke/) showed the $\gamma = 1-\sigma_2$ anchor is the wrong object for two model-driven reasons (exp2 reproduced both under block-Gibbs, so they are not single-site artifacts):

Mis-anchored. Under the $\mathbb{Z}_2$ spin-flip symmetry of the $b=0$ EBMs, the slowest mode $\varphi_2$ is odd and exactly orthogonal to the even gradient observables $f_{ij} = -x_i x_j$ — so $\hat f_{a,2}^2 \approx 0$ (overlap $\le 3.5\times 10^{-17}$ in exp2) and the single- $\gamma$ $R$ is a divide-by-symmetry-zero, over-predicting $Q_{op}$ by $\sim 10^{26}$ – $10^{30}$ .
Clustered. The observable-relevant slow structure clusters; a single relevant gap tracks $Q_{op}$ in 23/48 exp1 cells, the multi-mode correction in 45/48.

The corrected target is $Q_{op} \approx Q_{struct}^{\perp}$ (observable-projected, multi-mode). It restricts attention to the modes the gradient sees, $C^*(O) := \{j \ge 2 : \sum_a \hat f_{a,j}^2 > \varepsilon \sum_a \mathrm{Var}_\pi[f_a]\}$ , builds the aggregate timescale $T_O := \sum_a \tau_{int}[f_a]\,\mathrm{Var}_\pi[f_a]$ (half-Sokal $\tau_{int}$ ), and the harmonic-mean gap $\gamma_{eff}$ over $C^*(O)$ . The predictor is

$Q_{struct}^{\perp}(\theta, K) := \frac{K}{2}\frac{\lVert g \rVert^2}{T_O} = \frac{\gamma_{eff} K}{2}\,R_{eff}, \qquad R_{eff} := \frac{\lVert g \rVert^2}{\sum_a \sum_{j \in C^*(O)} \hat f_{a,j}^2}.$

Scope and caveats — the two-tier tag

This is the precise part. The factorization carries a split tag (researcher-conferred 2026-06-01):

Conditional factorization — [solid]. In regime A1–A8 + plateau ( $\gamma_{eff} \to 0$ ) + F4, $Q_{op} = Q_{struct}^{\perp}(1+o(1))$ . This is a written proof across six obligations O1–O6, each adversarially verified. O1.c is flipped to proven-here — the wiki's first and only terminal-tagged block (the projection-vs-conditioning SNR invariance). O2–O6 are [solid] assemblies closing to Levin–Peres 2017 + Younes 1999 (whose §7 asymptotic-variance object $S(\theta)$ grounds what $T_O$ is, no more). The numerical chain needs no A9.
Operational / unconditional claim — [conjectured]. " $Q_{op} \approx Q_{struct}^{\perp}$ on a real DTM at the $K$ one runs" is gated on A7 (overlapping-bulk relaxation) and $K \gg \tau_{int}$ — both assumed. The conditional is vacuous on a real DTM's plateau until those gates are met.

Neither tier is validated. The supporting evidence is construction-confirmation on small/moderate controlled models: exp1 (45/48 cells) + exp2 (92–99% across $m=4\to 16$ , both kernels). At scale it does not hold up: experiments/exp3-htdml-embedding/ on the real 60_12 MNIST DTM is untested at adequate equilibration ( $\tau_{max} \approx 486$ – $500$ , $50\tau \approx 25{,}000 \gg$ the feasible window; the linear-in- $K$ predictor over-predicts, $Q_{struct}^{\perp}/Q_{op}$ reaches $\sim$ 5–6 at $K=1000$ ). The re-freeze on a reversible kernel (experiments/exp4-reversible-50tau/) found $\hat\tau$ UNRESOLVED — $\tau_{max} \propto L$ (166.7 to 5,280 over $L=1{,}000\to 32{,}000$ , $\tau/L \approx 0.16$ constant) — so $K \gg \tau_{int}$ is unreachable, not merely unmet (P0-HALT). exp6 reproduced $\tau \propto L$ at every checkpoint $t \in \{25,50,100,200\}$ .

The honest reading is the A2 ↔ A6 structural-obstruction observation: reversibility (A2, load-bearing for O4/O5) and $K \gg \tau_{int}$ (A6) are not contradictory, but they are operationally antagonistic in exactly the multimodal/plateau regime the theorem is about — satisfying A2 pushes the sampler toward the slow plateau where A6 is hardest. Whether this is fundamental or scale-dependent stays open: at small scale it is escapable (exp7 found a crossable sweet spot, 25/64 cells; exp12 cut $\tau_{max}$ 14–22 $\times$ with reversible parallel tempering), but the trained-DTM PT run (exp16) was withdrawn for an init-weight kernel bug, so the at-scale reversal is not established.

The six-entry risk ledger records each threat [open] with a mitigation: (1) slow-mode cluster (the big one — fires, drove the correction); (2) differentiability of $\gamma$ through the embedding (O5.c partial — $\gamma_{eff}$ is smooth across within-cluster crossings via the cluster Riesz projector, but the $C^*(O)$ -membership boundary stays open); (3) $Q_{op}$ circularity at the plateau (circumvented, live where the reference is unconverged — exp3's $K_{ref}$ moved $\sim$ 23%); (4) positive phase not free (confirmed subdominant, $F4$ median 0.046 on exp3; the at-scale reversal withdrawn); (5) the regime gate (binding; $\tau \propto L$ may make it unsatisfiable at scale); (6) canonical-estimator choice (burn-in + window fixed; PCD sits outside as a moving-target tracking-lag family).

What this feeds

Everything else in the program feeds or tests this page: the MET corollary reads its three Ragone-shaped factors ( $\gamma \to 0$ , $R \to 0$ , budget starvation) through the corrected projected quantities; the HTDML objective (exp8/exp9/exp11) exercises the differentiable-in- $J$ property; the at-scale validation path (exp14 → a GPU DTM PT P0) is the still-pending route to → validated.

What this feeds: the spine that every literature bridge, parent translation, and experiment ladder either supports or attacks — and the standing open step is an at-scale run with A7 + $K \gg \tau_{int}$ met on a reversible kernel.

Sources

Jelinčič et al. 2025, Denoising Thermodynamic Models, arXiv:2510.23972 (Eq. (14), Fig. 5b, App. E/G/H).
Levin & Peres 2017, Markov Chains and Mixing Times (Lemma 12.2, Thm 12.21).
Younes 1999, Markovian stochastic-approximation convergence (§7, Thm 3 — the $S(\theta)$ variance object).
Hinton 2002, contrastive divergence (the data-minus-model gradient lineage).