Theory · Thermodynamic Machine Learning

One number decides whether an energy-based model on a thermodynamic sampler learns or stalls. Almost everything in this program is the story of computing that number honestly: what it is, why the textbook proxy for it is wrong by up to thirty orders of magnitude, and the wall that stands between the proof and a real machine.

The quantity

Training a reverse-process EBM layer means descending a data-minus-model gradient, g_a = E_data[f_a] − E_π[f_a], whose model phase you can only estimate from finite, correlated samples; all the mixing pain lives in that phase. Writing ĝ for the resulting finite-sample estimate of g, one scalar predicts when the estimate is too noisy to descend on: Q = ‖g‖² / E‖ĝ − g‖², the squared signal-to-noise ratio of the gradient (solid). When Q ≫ 1, SGD descends; when Q ≲ 1, learning stalls on a plateau. The distinction worth keeping in view: this is an estimation plateau, where the estimator MSE swamps a still-nonzero gradient and the signal is no longer recoverable from the samples you can afford. It is not the signal extinction of the quantum barren plateau, where the true gradient variance itself vanishes. Same phenomenology, different machine.
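To make the definition of Q concrete, here is a minimal sketch that estimates it by brute force on a toy problem. Everything about the toy is an assumption made for illustration: the sampler is replaced by unit-variance AR(1) noise with correlation rho, the data phase is treated as exact, and the names gradient_snr and toy_gradient_estimates are hypothetical. It is not the program's measurement pipeline.

```python
import numpy as np

def gradient_snr(g_true, g_hats):
    """Q = ||g||^2 / E||g_hat - g||^2, with E replaced by a mean over replicates."""
    g_true = np.asarray(g_true, dtype=float)
    g_hats = np.asarray(g_hats, dtype=float)          # shape (replicates, dim)
    mse = np.mean(np.sum((g_hats - g_true) ** 2, axis=1))
    return float(g_true @ g_true / mse)

def toy_gradient_estimates(data_mean, model_mean, rho, K, replicates, rng):
    """Hypothetical stand-in for a sampler (assumption): each model-phase
    observable is its true mean plus unit-variance AR(1) noise, averaged over K steps."""
    dim = len(model_mean)
    noise = rng.standard_normal((replicates, dim))    # stationary start
    total = noise.copy()
    for _ in range(K - 1):
        fresh = rng.standard_normal((replicates, dim))
        noise = rho * noise + np.sqrt(1.0 - rho**2) * fresh
        total += noise
    model_phase = model_mean + total / K              # finite-sample estimate of E_pi[f]
    return data_mean - model_phase                    # g_hat, one row per replicate

rng = np.random.default_rng(0)
data_mean = np.array([0.1, -0.1, 0.1, -0.1])          # data phase assumed exact here
model_mean = np.zeros(4)
g = data_mean - model_mean                            # true gradient of the toy

q_fast = gradient_snr(g, toy_gradient_estimates(data_mean, model_mean, 0.0, 200, 2000, rng))
q_slow = gradient_snr(g, toy_gradient_estimates(data_mean, model_mean, 0.95, 200, 2000, rng))
print(f"uncorrelated samples: Q = {q_fast:.2f}; strongly correlated samples: Q = {q_slow:.3f}")
```

The true gradient and the window K are identical in both runs; only the correlation of the samples changes, and that alone is enough to push Q from above 1 to well below it.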

The spectral gap is the wrong object

The obvious move is to anchor Q to the single Gibbs spectral gap γ = 1 − σ₂ and read it as effective-sample-size times signal — the form the program first shipped as conjectured. It is superseded as written, and the failure is not subtle. Under the ℤ₂ spin-flip symmetry of the bias-free models the slowest mixing mode φ₂ is odd, while the gradient observables f_ij = −x_ix_j are even, so their overlap is exactly zero — measured at ≤ 3.5×10⁻¹⁷. The single-gap predictor divides by that symmetry-zero and over-predicts Q by 10²⁶–10³⁰×. And the slow structure is clustered, not solitary: a single relevant gap tracks the true Q in only 23/48 exact-diagonalization cells. One number cannot stand in for a cluster.
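The symmetry argument can be checked by exact diagonalization on a model small enough to enumerate. The sketch below is illustrative only and rests on assumptions not taken from the program: four spins, a uniform ferromagnetic coupling beta_J = 0.5, and a random-scan Gibbs kernel. It builds the kernel, extracts the slowest mode φ₂, and evaluates its π-weighted overlap with one even observable f_01 = −x_0x_1.

```python
import itertools
import numpy as np

n, beta_J = 4, 0.5            # toy size and ferromagnetic coupling (illustrative assumptions)
states = np.array(list(itertools.product([-1, 1], repeat=n)))
index = {tuple(s): k for k, s in enumerate(states)}

# Bias-free Gibbs weights: log pi(x) = beta_J * sum_{i<j} x_i x_j + const.
pair_sum = (states.sum(axis=1) ** 2 - n) / 2.0
pi = np.exp(beta_J * pair_sum)
pi /= pi.sum()

# Random-scan Gibbs kernel: pick a site uniformly, resample it from its conditional.
P = np.zeros((2**n, 2**n))
for k, s in enumerate(states):
    for i in range(n):
        t = s.copy()
        t[i] = -t[i]
        j = index[tuple(t)]
        p_flip = pi[j] / (pi[j] + pi[k])      # conditional probability of the flipped value
        P[k, j] += p_flip / n
        P[k, k] += (1.0 - p_flip) / n

# The kernel is reversible, so D^{1/2} P D^{-1/2} is symmetric and eigh applies.
d = np.sqrt(pi)
S = d[:, None] * P / d[None, :]
eigvals, eigvecs = np.linalg.eigh((S + S.T) / 2.0)   # eigenvalues in ascending order
sigma2 = eigvals[-2]
phi2 = eigvecs[:, -2] / d                            # slowest non-constant eigenfunction

# Global spin flip x -> -x as a permutation of state indices.
flip = np.array([index[tuple(-s)] for s in states])
phi2_is_odd = np.allclose(phi2[flip], -phi2)

# Even gradient observable and its L2(pi) overlap with the slow mode.
f01 = -states[:, 0] * states[:, 1]
overlap = np.sum(pi * phi2 * f01)

# Slowest even non-constant mode: the first one an even observable can overlap with.
even_sigma = next(
    eigvals[m] for m in range(2**n - 2, -1, -1)
    if np.allclose((eigvecs[:, m] / d)[flip], eigvecs[:, m] / d)
)
print(f"phi2 odd: {phi2_is_odd}, |overlap| = {abs(overlap):.1e}")
print(f"sigma2 = {sigma2:.4f}, slowest even eigenvalue = {even_sigma:.4f}")
```

In this toy the overlap vanishes to rounding error, and the eigenvalue that governs the even observables sits strictly below σ₂, so a timescale read off γ = 1 − σ₂ describes a mode those observables never see.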

Projection, not conditioning

The fix restricts attention to the modes the gradient actually sees, the cluster C*(O), and aggregates a timescale T_O = Σ_a τ_int[f_a]·Var_π[f_a], giving the observable-projected predictor Q_struct^⊥ = (K/2)·‖g‖²/T_O, where K is the length of the sampling window. The load-bearing distinction is in the name: the projection Π_C is an orthogonal projection in the function space L²(π), not a conditioning on a region of state space. For the free ℤ₂ flip there is no fixed-point set, so “the even configurations” is a category error: the slow mode is killed by L²(π)-orthogonality alone. That invariance is O1.c, the program’s first and only proven-here lemma. The corrected predictor tracks the operational gradient-SNR in 45/48 exact-diag cells and 92–99% across RBMs and, being differentiable in the couplings, it can be optimized rather than merely measured.
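The predictor can also be estimated directly from a trace of the observables. The sketch below is a minimal version under stated assumptions: τ_int follows the convention τ_int = 1/2 + Σ_{t≥1} ρ(t), which is the one consistent with the K/2 prefactor (the variance of a K-sample mean is then about 2·τ_int·Var/K); the sum is truncated at the first non-positive autocorrelation estimate; and the function names and the AR(1) test trace are hypothetical. Because the autocorrelation of f_a only involves modes that f_a overlaps, this estimate never picks up a mode the observable is orthogonal to.

```python
import numpy as np

def tau_int(series):
    """Integrated autocorrelation time, 1/2 + sum_{t>=1} rho(t) (assumed convention),
    truncated at the first non-positive autocorrelation estimate."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = x @ x / n
    if var == 0.0:
        return 0.5
    tau = 0.5
    for t in range(1, n // 2):
        rho_t = (x[:-t] @ x[t:]) / ((n - t) * var)
        if rho_t <= 0.0:
            break
        tau += rho_t
    return tau

def q_struct_perp(g, f_samples, K):
    """(K/2) * ||g||^2 / T_O, with T_O = sum_a tau_int[f_a] * Var[f_a]."""
    f_samples = np.asarray(f_samples, dtype=float)    # shape (steps, n_observables)
    g = np.asarray(g, dtype=float)
    T_O = sum(tau_int(f_samples[:, a]) * f_samples[:, a].var()
              for a in range(f_samples.shape[1]))
    return 0.5 * K * (g @ g) / T_O

# Hypothetical trace: two unit-variance AR(1) observables with correlation rho.
rng = np.random.default_rng(0)
steps, rho = 50_000, 0.5
f = np.empty((steps, 2))
f[0] = rng.standard_normal(2)
for t in range(1, steps):
    f[t] = rho * f[t - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal(2)

tau_hat = tau_int(f[:, 0])
q_hat = q_struct_perp(np.array([0.1, 0.1]), f, K=1000)
print(f"tau_int estimate = {tau_hat:.2f}, predicted Q at K = 1000: {q_hat:.2f}")
```

For an AR(1) trace the convention above gives τ_int = (1 + ρ) / (2(1 − ρ)), which is 1.5 at ρ = 0.5, so the estimate can be checked against a closed form before the function is trusted on real sampler output.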

The reversibility–mixing wall

The factorization Q_op ≈ Q_struct^⊥ is a written proof over six obligations O1–O6 — but only as a conditional, in a regime that needs both reversibility (A2) and a window far longer than the integrated autocorrelation, K ≫ τ_int (A6). On a real machine those two demands fight. On the trained 60_12 MNIST DTM the reversible chain shows τ ∝ L (τ/L ≈ 0.16, constant from L = 1,000 to 32,000) — so K ≫ τ_int is not merely unmet but unreachable — and an optimal equal-acceptance parallel-tempering ladder needs R* = 136 reversible rungs against a 96-rung budget: a thermodynamic-length cost wall. Whether the wall is fundamental or only an artifact of scale stays open — small models escape it; the at-scale reversal is not yet established.
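The arithmetic behind "unreachable" can be spelled out in a few lines. This is a sketch under an assumption the text does not state: L is read as the length of the run, and the window K cannot exceed it. Only the figures quoted above are used.

```python
tau_over_L = 0.16                       # measured ratio quoted in the text
ratios = []
for L in (1_000, 32_000):               # endpoints of the quoted range
    tau = tau_over_L * L                # autocorrelation time grows with the run
    ratios.append(L / tau)              # largest K / tau if K <= L (assumption)
    print(f"L = {L:>6}: tau = {tau:7.0f}, K/tau at most {L / tau:.2f}")

rungs_needed, rung_budget = 136, 96     # R* and the available ladder, from the text
shortfall = rungs_needed - rung_budget
print(f"ladder shortfall: {shortfall} rungs ({rungs_needed / rung_budget:.2f}x the budget)")
```

Under that reading the ratio K/τ is capped at 1/0.16 however long the chain runs, so a longer run buys no progress toward K ≫ τ_int.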

The claim-status discipline

Every statement in the program carries exactly one of four tags — solid, conjectured, proven-here, validated — that move only by explicit conferral, never by self-flip. The discipline that matters here: the conditional factorization is solid and one lemma is proven-here, but the operational claim that Q_struct^⊥ tracks Q on a real DTM at a feasible K stays conjectured, and nothing in this pillar is validated. The supporting evidence is construction-confirmation on controlled models; at scale the gates are open. The point of the theory is not to hide that gap but to name precisely where it sits.

Read the full statement

The summaries above are the load-bearing claims; the derivations, assumptions, and obligations live in the notebook.

[Read the paper][The proof program][All experiments][Notebook]