Field notes from a research program where the question is not "can we build a bigger model?" but "can the thing we build actually be trained?" The writings below are attempts to work that out in the open.
Our acceleration theorem needs a reversible sampler and a flood of decorrelated samples per step. On a real trained MNIST model, those two demands are antagonistic — and we have the curve that shows it.
The textbook way to predict whether a thermodynamic model trains — read off its slowest sampling mode — can over-predict the gradient SNR by a factor of ten octillion. Under spin-flip symmetry the slowest mode is invisible to the gradient. Here is why, and the fix.
The most convenient thing to watch in a sampler — a single scalar like the magnetization — is the wrong thing to watch. It can mistime your gradient by 72x, and the failure is exactly where you most need the reading.
Parallel tempering is supposed to fail when adjacent replicas stop overlapping. On a trained MNIST energy model it fails differently: the specific heat is so uneven along the ladder that no uniform temperature step satisfies every rung at once.
Two expensive GPU runs concluded parallel tempering mixed a trained MNIST energy model near-ideally. A third run found the sampler's local kernels were silently built from the untrained weights — the real swap-acceptance was 0.0099, and nothing mixed.
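To make the swap-acceptance figure above concrete, here is a minimal sketch of the standard Metropolis swap rule used in parallel tempering, in plain Python. The function name `swap_acceptance` and its argument names are illustrative assumptions; this is not the code from the runs described in the entry.

```python
import math


def swap_acceptance(beta_i, beta_j, energy_i, energy_j):
    """Probability of accepting a state swap between two tempering replicas.

    Replica i sits at inverse temperature beta_i and currently holds a state
    with energy energy_i; likewise for replica j. The standard Metropolis rule
    is min(1, exp((beta_i - beta_j) * (energy_i - energy_j))).
    """
    log_ratio = (beta_i - beta_j) * (energy_i - energy_j)
    # Clamp at 0 before exponentiating so large positive values cannot overflow.
    return math.exp(min(0.0, log_ratio))
```

Averaging this probability over many proposed swaps between one adjacent pair of rungs gives that pair's swap-acceptance rate. A value near 0.01 means roughly one proposal in a hundred succeeds, so states barely move along the ladder.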
The primer named Q; this is the refined object behind it. An observable-projected, multi-mode, differentiable predictor read from the sampler's spectrum — tracking the gradient SNR in 45 of 48 exact-diagonalization cells where a single gap manages 23.
The primer argued that trainability comes down to a single number, Q — the squared signal-to-noise ratio of the gradient. This is how you actually measure it: the estimator and its bias, a three-rung ladder from exact diagonalization to GPU scale, and the pre-registered result that plateau onset tracks the Gibbs spectral gap.
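As a rough illustration of what "squared signal-to-noise ratio of the gradient" can mean in practice, the sketch below computes a plug-in estimate for one scalar gradient component from repeated noisy samples. It assumes Q is the squared mean divided by the per-sample variance; the entry's own estimator may be defined differently (for example, per batch or summed over parameters). The name `squared_snr` and the `debias` flag are hypothetical.

```python
from statistics import fmean, variance


def squared_snr(samples, debias=True):
    """Plug-in estimate of (mean gradient)^2 / (per-sample gradient variance).

    `samples` holds repeated noisy estimates of one scalar gradient component.
    This definition of Q is an assumption made for illustration only.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a variance")
    mean = fmean(samples)
    var = variance(samples)  # unbiased sample variance (n - 1 denominator)
    if var == 0.0:
        raise ValueError("zero sample variance: the ratio is undefined")
    signal_sq = mean * mean
    if debias:
        # The squared sample mean overestimates the squared true mean by
        # var / n on average, so subtract that term. The corrected value can
        # come out negative when the signal is buried in noise.
        signal_sq -= var / n
    return signal_sq / var
```

The `debias` switch shows why the bias matters: with few samples and a weak signal, the uncorrected ratio stays positive even when the true gradient is zero.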
We dropped the trainability predictor into the loss as a regularizer and turned the knob. The diagnostic soared a billionfold. The model stopped learning. A clean Goodhart's-law negative result, traced across the whole dose ladder.
We audited eight thermodynamic-ML substrates for one property — can a network produce differentiable energy-model couplings? Every one fails a different gate. So we built the missing piece, and a numerical trap nearly sank it at an eigenvalue gap of 1e-16.
Some energy-based models learn and others stall on a plateau, and the difference is rarely the architecture — it is whether the gradient signal survives the sampling noise. A field guide to thermodynamic trainability: the quantity Q, the Gibbs spectral gap, the fight between mixing and expressivity, and the p-bit hardware where it all comes down to a single thermal coin-flip.
A public notebook on whether energy-based models running on physical thermodynamic hardware can actually be trained. The bottleneck is no longer the model — it is whether the thing that runs on thermal noise emits a gradient you can still hear. This is where that gets worked out, in the open.
Plant a handful of modes in an RBM and the observable-relevant gap drops 100-fold — 0.65 to 0.006. We read it straight off the exact spectrum, so it cannot be a burn-in or finite-sample lie.