Brando Miranda

Score Matching: Training EBMs Without Ever Computing Z

2026-06-09T00:00:00-07:00

Brando Miranda — June 2026 · ~8 min read

Warning: this post is a draft — content may change and errors may remain.

TL;DR. An EBM defines $p_\theta(x) = e^{-E_\theta(x)}/Z_\theta$, where $Z_\theta$ is intractable. But $Z_\theta$ is a sum over all configurations, so it does not depend on the particular $x$ you evaluate at, and differentiating $\log p_\theta$ with respect to the input kills it: $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$. Score matching turns this observation into a training principle: match the model’s score $\nabla_x \log p_\theta$ to the data’s score $\nabla_x \log p^*$. Because normalization removes exactly one degree of freedom, matching scores forces matching distributions, and $Z$ never appears. The resulting loss is the Fisher divergence $D^{F}_{p^*}(p^* \Vert p_\theta)$, training is (any) gradient descent on it, and every choice in the update rule (SGD vs. AdamW vs. Muon vs. Shampoo) is an open experimental question for EBMs. One thing gets swept under the rug: the objective contains the data’s score, which we don’t have. Fixing that introduces a Hessian-trace term whose alleged intractability I’ll interrogate in the next post.

The problem: $Z$ is the enemy

Recap from the previous post. An energy-based model scores whole configurations with a learned energy $E_\theta : X^{T_x} \to \mathbb{R}$, and probabilities — if you insist on them — require the partition function:

\[p_\theta(x) \;=\; \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad Z_\theta \;=\; \sum_{\tilde x \,\in\, X^{T_x}} e^{-E_\theta(\tilde x)},\]

a sum (integral, in the continuous case) with $V^{T_x}$ terms. Maximum likelihood drags it right back in:

\[\nabla_\theta \log p_\theta(x) \;=\; -\nabla_\theta E_\theta(x) \;-\; \nabla_\theta \log Z_\theta,\]

and that second term costs you either $Z_\theta$ itself or MCMC samples from $p_\theta$ — the classical contrastive-divergence tax (Song & Kingma, 2021).

But notice what we actually want from an EBM: the ability to say “this $x$ is more plausible than that $\tilde x$” — to learn $E_\theta$ — without ever computing the normalization. The normalization is bookkeeping. Can we fit the model and skip the bookkeeping?

One observation (a tool, not yet a method)

Look at $Z_\theta$ again. It sums $x$ out. Whatever number it equals, it is a constant with respect to $x$. Therefore:

\[\nabla_x \log p_\theta(x) \;=\; \nabla_x \Big( -E_\theta(x) - \log Z_\theta \Big) \;=\; -\nabla_x E_\theta(x).\]

Differentiating with respect to the input kills the partition function. That’s the whole observation. I’m not yet claiming anything about what to do with it — think of it as a tool we just picked up.
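A quick numerical check of the identity above, as a minimal sketch. The double-well energy, the integration range, and every function name here are assumptions made for illustration; none of it comes from the series' code. The sketch computes $\log Z$ by brute-force quadrature in one dimension, differentiates the normalized log-density by finite differences, and compares the result with $-\nabla_x E$, which never sees $Z$.

```python
import math

def energy(x):
    # Hypothetical 1-D double-well energy, chosen only for illustration.
    return 0.25 * x**4 - 0.5 * x**2

def grad_energy(x):
    # Analytic derivative of the energy above.
    return x**3 - x

def log_partition(lo=-6.0, hi=6.0, n=20001):
    # Trapezoid-rule estimate of log Z = log of the integral of exp(-E(x)) on [lo, hi].
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * math.exp(-energy(lo + i * h))
    return math.log(total * h)

LOG_Z = log_partition()

def log_p(x):
    # Normalized log-density: this is the quantity that needs Z.
    return -energy(x) - LOG_Z

def score_fd(x, eps=1e-5):
    # Central finite difference of the normalized log-density.
    return (log_p(x + eps) - log_p(x - eps)) / (2 * eps)

for x in (-1.3, 0.2, 0.9):
    # The two columns should agree: log Z cancels in the difference.
    print(x, score_fd(x), -grad_energy(x))
```

Changing `lo`, `hi`, or `n` changes `LOG_Z` but should leave both printed columns unchanged up to finite-difference error, which is the point of the observation.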

This gradient has a name, and the name is reserved (which is why I refuse to call the energy “the score”): the score of a density $p$ is

\[s(x) \;:=\; \nabla_x \log p(x).\]

So the model’s score is $s_\theta(x) = -\nabla_x E_\theta(x)$ (computable with one backward pass, no $Z$) and the data’s score is $s^*(x) = \nabla_x \log p^*(x)$, where I write $p^* := p_{\mathrm{data}}$ throughout.

Fine print: $\nabla_x$ presumes $x$ lives in a continuous space ($\mathbb{R}^D$, e.g. an embedding space). For literal token sequences the gradient is undefined and you need either a continuous relaxation or a discrete variant of what follows (ratio matching, Hyvärinen 2007; concrete score matching, Meng et al. 2022). Flagged, deferred — it’s a real design decision for the Lean setting, not a footnote I get to wave away forever.

The crazy idea: make the scores match

Here is the move. Demand that the model’s score agree with the data’s score everywhere:

\[\nabla_x \log p_\theta(x) \;\approx\; \nabla_x \log p^*(x) \qquad \forall x.\]

Geometrically: picture the two log-density surfaces over the same domain, and require their slope fields to be identical at every point — the surfaces are parallel everywhere.

This looks insufficient. Two functions with identical gradients everywhere are equal only up to a constant:

\[\log p_\theta(x) \;=\; \log p^*(x) \;+\; C.\]

For generic functions, $C$ is a genuine leftover degree of freedom, and “parallel” $\ne$ “equal.” So why would anyone expect score matching to pin down the distribution?

Parallel ⇒ equal — but only for probability distributions

Because normalization already spent that degree of freedom. Intuitively: when you normalize — when you impose $\sum_x p(x) = 1$ — you remove exactly one degree of freedom from the function class. Score matching fixes the function up to exactly one degree of freedom. The two halves click.

The proof is almost embarrassing. Exponentiate $\log p_\theta(x) = \log p^*(x) + C$ and sum over $x$ (integrate, in the continuous case):

\[1 \;=\; \sum_x p_\theta(x) \;=\; \sum_x e^{\log p^*(x) + C} \;=\; e^{C} \sum_x e^{\log p^*(x)} \;=\; e^{C} \cdot 1 \quad\Longrightarrow\quad C = 0.\]

If you like it even smaller: if $p_1 + p_2 = 1$ must hold, a global constant multiplying the distribution has nowhere to hide. So:

\[s_\theta(x) = s^*(x)\ \ \forall x \quad\Longrightarrow\quad p_\theta = p^*.\]

Matching scores is sufficient to match distributions — and we never touched $Z$. Note what made this work: both objects are probability distributions. For unnormalized functions the argument dies immediately. (Modulo the usual regularity conditions — full support and smoothness, and “everywhere” meaning $p^*$-almost everywhere; see Hyvärinen 2005, Theorem 2.)
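The same argument can be checked numerically on a toy discrete distribution. This is a minimal sketch under stated assumptions: the three-outcome $p^*$, the offset `C`, and the helper `normalize` are hypothetical choices for illustration. The shifted log-surface carries total mass $e^{C}$, so it is a probability distribution only when $C = 0$, and normalizing it recovers $p^*$ exactly.

```python
import math

def normalize(log_weights):
    # Turn unnormalized log-weights into a probability vector (stable softmax).
    m = max(log_weights)
    exps = [math.exp(w - m) for w in log_weights]
    z = sum(exps)
    return [e / z for e in exps]

p_star = [0.5, 0.3, 0.2]                      # toy data distribution
log_p_star = [math.log(p) for p in p_star]

C = 1.7                                       # arbitrary offset: a "parallel" log-surface
shifted = [lp + C for lp in log_p_star]

# Total mass of the shifted function is e^C, so its log recovers C.
implied_C = math.log(sum(math.exp(s) for s in shifted))

# Imposing normalization removes the offset and returns p_star.
p_from_shifted = normalize(shifted)
print(implied_C, p_from_shifted)
```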

The dumbest norm you can think of: the Fisher divergence

A condition is not a loss. To train, we penalize the score mismatch with the most naive penalty available, the squared $L_2$ norm, averaged over the data:

\[D^{F}_{p^*}\!\left(p^* \,\Vert\, p_\theta\right) \;:=\; \mathbb{E}_{x \sim p^*}\!\left[\,\tfrac{1}{2}\,\big\Vert\, \nabla_x \log p_\theta(x) \;-\; \nabla_x \log p^*(x) \,\big\Vert_2^2 \,\right].\]

This object is the Fisher divergence, and minimizing it is score matching (Hyvärinen, 2005).
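For intuition, the divergence can be evaluated in a toy case where the data score is known. The sketch below assumes both $p^*$ and $p_\theta$ are one-dimensional Gaussians, an assumption made purely so that both scores have closed forms; in a real problem the data score is unavailable. The function names and parameter values are hypothetical. It compares a Monte Carlo average under $p^*$ with the closed-form value.

```python
import random

def gaussian_score(x, mu, sigma):
    # Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2.
    return -(x - mu) / sigma**2

def fisher_divergence_mc(mu_star, sigma_star, mu, sigma, n=100_000, seed=0):
    # Monte Carlo estimate of E_{x ~ p*}[ 0.5 * (s_theta(x) - s*(x))^2 ].
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_star, sigma_star)
        diff = gaussian_score(x, mu, sigma) - gaussian_score(x, mu_star, sigma_star)
        total += 0.5 * diff * diff
    return total / n

def fisher_divergence_exact(mu_star, sigma_star, mu, sigma):
    # The score gap is linear in x, a*x + b, so its second moment is closed-form.
    a = 1.0 / sigma_star**2 - 1.0 / sigma**2
    b = mu / sigma**2 - mu_star / sigma_star**2
    return 0.5 * (a**2 * sigma_star**2 + (a * mu_star + b) ** 2)

mc = fisher_divergence_mc(0.0, 1.0, 1.0, 2.0)
exact = fisher_divergence_exact(0.0, 1.0, 1.0, 2.0)
print(mc, exact)
```

Swapping the roles of the two Gaussians changes both the scores' gap and the distribution carrying the expectation, so the value generally changes too.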

A note on my notation

Standard texts write $D_F(p_{\mathrm{data}} \Vert p_\theta)$ — or hide the whole thing inside a loss symbol $\mathcal{J}(\theta)$ — and leave implicit that the expectation is taken under the first argument. That implicitness matters: the Fisher divergence is not symmetric (which is why it’s a divergence and not a distance), and which distribution you average under is a modeling decision, not a typographic afterthought. So I subscript it:

\[D^{F}_{p^*}(\,\cdot\, \Vert \,\cdot\,) \qquad \text{where the subscript names the distribution carrying the expectation.}\]

I’d write the KL divergence the same way if I could rewrite the textbooks. Slightly redundant, never ambiguous.

Training is just descent on $D^F$

Everything after this point is ordinary deep learning. Vanilla gradient descent:

\[\theta^{<t+1>} \;:=\; \theta^{<t>} \;-\; \eta\, \nabla_\theta\, D^{F}_{p^*}\!\left(p^* \,\Vert\, p_{\theta^{<t>}}\right).\]

But it’s 2026, and nobody ships raw SGD. So abstract the update rule:

\[\theta^{<t+1>} \;=\; H\!\Big(\theta^{<t>},\ F\!\big(-\eta\, \nabla_\theta D^{F}_{p^*}(p^* \Vert p_\theta)\big)\Big),\]

where $F$ transforms the raw gradient — first/second-moment estimates in Adam/AdamW (Kingma & Ba 2015; Loshchilov & Hutter 2019), Kronecker-factored preconditioning in Shampoo (Gupta et al. 2018), orthogonalized momentum in Muon (Jordan et al. 2024) — and $H$ folds the transformed step into the iterate (momentum buffers, decoupled weight decay, schedules). Plain SGD is the special case $F = \mathrm{id}$, $H = \mathrm{add}$.
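The abstraction can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the function names, the heavy-ball momentum variant of $H$, and the stand-in quadratic loss (used in place of $D^F$ so the example stays self-contained). It is a sketch of the $F$/$H$ split, not an implementation of Adam, Shampoo, or Muon.

```python
def identity_F(raw_step):
    # F = id: leave the raw step -eta * grad untouched.
    return raw_step

def add_H(theta, step):
    # H = add: fold the step into the iterate by plain addition.
    return [t + s for t, s in zip(theta, step)]

def make_momentum_H(beta=0.9):
    # A stateful H that keeps a heavy-ball momentum buffer (one illustrative choice).
    buf = []
    def H(theta, step):
        if not buf:
            buf.extend(step)
        else:
            buf[:] = [beta * b + s for b, s in zip(buf, step)]
        return [t + b for t, b in zip(theta, buf)]
    return H

def update(theta, grad, eta, F, H):
    # theta_next = H(theta, F(-eta * grad))
    raw_step = [-eta * g for g in grad]
    return H(theta, F(raw_step))

def toy_grad(theta):
    # Gradient of the stand-in loss 0.5 * ||theta||^2 (not the Fisher divergence).
    return list(theta)

theta0 = [1.0, -2.0]
theta_sgd = update(theta0, toy_grad(theta0), eta=0.1, F=identity_F, H=add_H)

momentum_H = make_momentum_H()
theta_mom = list(theta0)
for _ in range(200):
    theta_mom = update(theta_mom, toy_grad(theta_mom), 0.1, identity_F, momentum_H)
print(theta_sgd, theta_mom)
```

Keeping $F$ and $H$ as separate callables means an optimizer sweep only has to swap two arguments while the objective and the training loop stay fixed.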

Here’s the part I find genuinely under-explored. The score-matching literature was largely built before, or in benign neglect of, the modern optimizer stack. Hyvärinen (2005) predates Adam by a decade, and the EBM-training lineage doesn’t systematically sweep 2026-grade optimizers against score objectives. Every cell in $\{\text{objective: } D^F\} \times \{F:\ \text{SGD, momentum, AdamW, Shampoo, Muon}\}$ is cheap to run and, as far as I can tell, mostly unrun.

Research questions (tracked as issues at github.com/brando90/free-energy):

RQ1: sweep before invent. Fix $E_\theta$, data, batch size, and step budget. Sweep $F \in \{\text{SGD, SGD+momentum, AdamW, Shampoo, Muon}\}$ on $D^{F}_{p^*}$. Does modern preconditioning change whether/what score matching trains, or only how fast?
RQ2: objective × optimizer interaction. $\nabla_\theta D^{F}$ is the gradient of a gradient mismatch: it contains mixed $\partial^2 / \partial\theta\,\partial x$ structure that plain MLE gradients don’t have. Does that structure favor, or break, particular preconditioners? Is there a bespoke $F$ for score objectives?
RQ3: then innovate. Only after the sweep do we get to design a new $F$/$H$ for $D^F$. The baselines are the alibi for the invention.

What I swept under the rug (next post)

Look back at the definition of $D^{F}_{p^*}$. It contains $\nabla_x \log p^*(x)$, the score of the data distribution. We don’t have $p^*$. We have samples from it.

The classical resolution (Hyvärinen, 2005) is an integration-by-parts identity that rewrites $D^F$ — up to a constant that doesn’t depend on $\theta$ — purely in terms of $p_\theta$ and samples from $p^*$. The price is a new term: the trace of the Hessian of the model’s log-density,

\[\mathrm{tr}\!\left(\nabla_x^2 \log p_\theta(x)\right) \;=\; -\,\mathrm{tr}\!\left(\nabla_x^2 E_\theta(x)\right).\]

Song & Kingma’s tutorial treats this trace as the computational bottleneck of score matching — the thing that pushes you toward denoising score matching (Vincent, 2011) and sliced score matching (Song et al., 2019). And look: the full Hessian is quadratic in the dimension — clearly out. But the trace is a linear number of terms, and yet it’s still declared impractical. I have questions about that claim — what exactly the per-term cost is, what 2026 autodiff and hardware change, and where the exact-vs-stochastic-estimator crossover actually sits. That’s the next post.
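To make the integration-by-parts form concrete before the cost discussion, here is a minimal one-dimensional sketch. In 1-D, Hyvärinen's identity reads $D^F = \mathbb{E}_{x \sim p^*}\big[\partial_x s_\theta(x) + \tfrac{1}{2} s_\theta(x)^2\big] + \text{const}$, where the derivative term plays the role of the Hessian trace. The Gaussian model family, the synthetic samples, and the function names are assumptions chosen for illustration. For this family the sample mean and variance should minimize the sample loss, and the data score is never evaluated.

```python
import random

def implicit_sm_loss(samples, mu, var):
    # Model: N(mu, var). Score s(x) = -(x - mu) / var, and ds/dx = -1 / var,
    # which is the 1-D "Hessian trace" of the model log-density.
    total = 0.0
    for x in samples:
        s = -(x - mu) / var
        total += -1.0 / var + 0.5 * s * s
    return total / len(samples)

rng = random.Random(0)
samples = [rng.gauss(2.0, 1.5) for _ in range(20_000)]  # stand-in for data from p*

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

loss_at_fit = implicit_sm_loss(samples, mean, var)
loss_shifted_mean = implicit_sm_loss(samples, mean + 0.5, var)
loss_wrong_var = implicit_sm_loss(samples, mean, 2.0 * var)
print(loss_at_fit, loss_shifted_mean, loss_wrong_var)
```

In $D$ dimensions the derivative term becomes a sum of $D$ second derivatives of the energy, which is the term whose cost is in question.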

Appendix A — Notation

Symbol	Meaning
$p^*$	The data distribution; $p^* := p_{\mathrm{data}}$ (the paper’s notation).
$E_\theta$	Energy function; $-E_\theta(x)$ is the unnormalized confidence for configuration $x$.
$Z_\theta$	Partition function $\sum_{\tilde x} e^{-E_\theta(\tilde x)}$ — constant in $x$, which is the whole trick.
$s(x)$	The score of a density: $s(x) := \nabla_x \log p(x)$. Model: $s_\theta = -\nabla_x E_\theta$. Data: $s^* = \nabla_x \log p^*$. Note: gradient w.r.t. the input $x$, not the parameters $\theta$.
$D^{F}_{p^*}(p^* \Vert p_\theta)$	Fisher divergence; the subscript names the distribution the expectation is taken under (it is not symmetric).
$\theta^{<t>}$	Parameters at optimization step $t$.
$\eta$	Step size.
$F$, $H$	The update-rule abstraction: $F$ transforms the raw gradient (Adam moments, Shampoo/Muon preconditioning); $H$ folds the step into the iterate (momentum, weight decay, schedules). SGD: $F=\mathrm{id}$, $H=\mathrm{add}$.

References

BibTeX for the references

@article{hyvarinen2005estimation,
  author  = {Hyv{\"a}rinen, Aapo},
  title   = {Estimation of Non-Normalized Statistical Models by Score Matching},
  journal = {Journal of Machine Learning Research},
  volume  = {6},
  pages   = {695--709},
  year    = {2005}
}
@misc{song2021how,
  author = {Song, Yang and Kingma, Diederik P.},
  title  = {How to Train Your Energy-Based Models},
  year   = {2021},
  eprint = {2101.03288},
  archivePrefix = {arXiv}
}
@article{vincent2011connection,
  author  = {Vincent, Pascal},
  title   = {A Connection Between Score Matching and Denoising Autoencoders},
  journal = {Neural Computation},
  volume  = {23},
  number  = {7},
  pages   = {1661--1674},
  year    = {2011}
}
@inproceedings{song2019sliced,
  author    = {Song, Yang and Garg, Sahaj and Shi, Jiaxin and Ermon, Stefano},
  title     = {Sliced Score Matching: A Scalable Approach to Density and Score Estimation},
  booktitle = {UAI},
  year      = {2019}
}
@article{hyvarinen2007extensions,
  author  = {Hyv{\"a}rinen, Aapo},
  title   = {Some Extensions of Score Matching},
  journal = {Computational Statistics \& Data Analysis},
  volume  = {51},
  number  = {5},
  pages   = {2499--2512},
  year    = {2007}
}
@inproceedings{meng2022concrete,
  author    = {Meng, Chenlin and Choi, Kristy and Song, Jiaming and Ermon, Stefano},
  title     = {Concrete Score Matching: Generalized Score Matching for Discrete Data},
  booktitle = {NeurIPS},
  year      = {2022}
}
@inproceedings{kingma2015adam,
  author    = {Kingma, Diederik P. and Ba, Jimmy},
  title     = {Adam: A Method for Stochastic Optimization},
  booktitle = {ICLR},
  year      = {2015}
}
@inproceedings{loshchilov2019decoupled,
  author    = {Loshchilov, Ilya and Hutter, Frank},
  title     = {Decoupled Weight Decay Regularization},
  booktitle = {ICLR},
  year      = {2019}
}
@inproceedings{gupta2018shampoo,
  author    = {Gupta, Vineet and Koren, Tomer and Singer, Yoram},
  title     = {Shampoo: Preconditioned Stochastic Tensor Optimization},
  booktitle = {ICML},
  year      = {2018}
}
@misc{jordan2024muon,
  author = {Jordan, Keller and Jin, Yuchen and Boza, Vlado and You, Jiacheng and Cesista, Franz and Newhouse, Laker and Bernstein, Jeremy},
  title  = {Muon: An Optimizer for Hidden Layers in Neural Networks},
  year   = {2024},
  howpublished = {\url{https://kellerjordan.github.io/posts/muon/}}
}
@misc{miranda2026whyebms,
  author = {Miranda, Brando},
  title  = {Why Energy-Based Models? The Toy AR-vs-EBM Argument},
  year   = {2026},
  month  = {June},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/06/09/why-energy-based-models-the-toy-ar-vs-ebm-argument.html}},
  note   = {Blog post}
}

If you’d like to cite this post:

@misc{miranda2026scorematching,
  author = {Miranda, Brando},
  title  = {Score Matching: Training EBMs Without Ever Computing Z},
  year   = {2026},
  month  = {June},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/06/09/score-matching-training-ebms-without-z.html}},
  note   = {Blog post}
}

Why Energy-Based Models? The Toy AR-vs-EBM Argument

2026-06-09T00:00:00-07:00

Brando Miranda — June 2026 · ~6 min read

Warning: this post is a draft — content may change and errors may remain.

TL;DR. Autoregressive models and sequence-level energy-based models owe the same debt — the partition function $Z$ — on different payment plans. AR pays $Z$ in $T_x$ installments of $O(V)$ each (one softmax per token); a sequence-level EBM owes one balloon payment of $O(V^{T_x})$ (a single normalization over all sequences). The installment plan is exactly what makes AR cheap, and the per-token factorization it requires is exactly what the error-compounding critique attacks. This post is the toy version of that tradeoff I use to explain why EBMs exist at all — plus my hypothesis for why AR works in practice anyway (frontier labs buy the error rate down with scale), and what that implies an academic lab should do instead.

The autoregressive contract

An autoregressive (AR) model commits to the factorization

\[p_\theta(x) \;=\; \prod_{t=1}^{T_x} p_\theta\!\left(x^{<t>} \,\middle|\, x^{<1:t-1>}\right),\]

where $x = (x^{<1>}, \dots, x^{<T_x>})$ is a sequence over a vocabulary $X = \{x_1, \dots, x_V\}$ with $|X| = V$. Each conditional is a softmax:

\[p_\theta\!\left(x^{<t>} = v \,\middle|\, x^{<1:t-1>}\right) \;=\; \frac{e^{f_\theta(v;\, x^{<1:t-1>})}}{Z_\theta\!\left(x^{<1:t-1>}\right)}, \qquad Z_\theta\!\left(x^{<1:t-1>}\right) \;=\; \sum_{v' \in X} e^{f_\theta(v';\, x^{<1:t-1>})}.\]

Concretely, the model’s head emits a length-$V$ vector

\[\left[\; \frac{e^{f(v_1)}}{Z_\theta},\; \frac{e^{f(v_2)}}{Z_\theta},\; \dots,\; \frac{e^{f(v_V)}}{Z_\theta} \;\right],\]

and the normalizer $Z_\theta$ is a sum of $V$ terms. Computing this vector — and its $Z$ — costs $O(V)$ per step (suppressing hidden-dimension factors). Run it for the whole sequence and you pay

\[O(V \cdot T_x).\]

The thing that makes AR cheap is worth saying out loud: the normalization axis is always a single token slot. You never normalize over sequences — only over the alphabet, one position at a time.
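A toy version of the installment plan, as a minimal sketch. The `logits` function is a hypothetical stand-in for $f_\theta$ (any deterministic function of the prefix would do), and the sizes `V` and `T` are kept tiny so that summing over all sequences is feasible as a sanity check. Each step normalizes over $V$ entries only, yet the resulting sequence probabilities still sum to one.

```python
import itertools
import math

V, T = 3, 4  # toy vocabulary size and sequence length

def logits(prefix):
    # Hypothetical stand-in for f_theta(v; prefix), one logit per vocabulary entry.
    return [math.sin(1.0 + v + 0.7 * sum(prefix) + len(prefix)) for v in range(V)]

def ar_log_prob(x):
    # Sum of per-token log-softmaxes; each step normalizes over V entries only.
    total, exp_terms = 0.0, 0
    for t in range(len(x)):
        f = logits(x[:t])
        log_z_t = math.log(sum(math.exp(fv) for fv in f))  # V terms in this normalizer
        exp_terms += V
        total += f[x[t]] - log_z_t
    return total, exp_terms

logp, cost = ar_log_prob((0, 2, 1, 1))

# Sanity check only: enumerating all V**T sequences is exactly what AR avoids.
total_mass = sum(
    math.exp(ar_log_prob(x)[0]) for x in itertools.product(range(V), repeat=T)
)
print(logp, cost, total_mass)
```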

The energy-based contract

An energy-based model (EBM) drops the normalization requirement from the model class. You learn an energy function

\[E_\theta : X^{T_x} \to \mathbb{R}\]

which assigns one scalar to the whole sequence. (I’d rather call $-E_\theta$ a confidence score; the physics name stuck, so energy it is.) For example, $x = [\text{“the”}, \text{“hat”}, \text{“hi”}, \dots]$ and $-E_\theta(x) = 0.3$. Any neural network is a legal $E_\theta$: a two-layer MLP $-E_\theta(x) = \sigma\!\left(x^\top W^{(1)}\right) W^{(2)}$, or a full transformer. The EBM question is orthogonal to the architecture question.

Crucially, $-E_\theta(x)$ is not a probability. It tells you how confident the model is about $x$, with no promise that confidences sum to anything. If you insist on probabilities, you must normalize:

\[p_\theta(x) \;=\; \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad Z_\theta \;=\; \sum_{\tilde{x} \,\in\, X^{T_x}} e^{-E_\theta(\tilde{x})}.\]

That sum runs over every possible sequence: $V^{T_x}$ terms. The cost is

\[O\!\left(V^{T_x}\right),\]

which is not “expensive.” It is intractable, full stop.
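The balloon payment can be seen directly on a toy problem. This is a minimal sketch: the energy (a penalty on adjacent repeated tokens) and the sizes are hypothetical choices made so that brute-force enumeration finishes and the answer has a closed form to check against. The last line only counts digits of $V^{T_x}$ for an assumed vocabulary of 50,000 and length 100; it does not attempt the sum.

```python
import itertools
import math

def energy(x):
    # Hypothetical sequence-level energy: +1 for every adjacent repeated token.
    return sum(1.0 for a, b in zip(x, x[1:]) if a == b)

def brute_force_Z(V, T):
    # Enumerate every sequence in X^T and accumulate exp(-E): V**T terms.
    z, terms = 0.0, 0
    for x in itertools.product(range(V), repeat=T):
        z += math.exp(-energy(x))
        terms += 1
    return z, terms

z, terms = brute_force_Z(V=3, T=4)

# For this particular chain-structured energy, Z also has a closed form.
z_closed = 3 * (math.exp(-1.0) + 2) ** 3

# Number of decimal digits in V**T for an assumed V = 50,000 and T = 100.
digits = len(str(50_000 ** 100))
print(z, z_closed, terms, digits)
```

The closed form exists only because this toy energy factorizes over neighbouring pairs; a generic learned $E_\theta$ offers no such shortcut.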

Same debt, different payment plans

	normalization axis	when you pay	total cost of $Z$
AR	the vocabulary, one slot at a time	every token	$O(V \cdot T_x)$
sequence EBM	all of $X^{T_x}$, at once	once (if ever)	$O(V^{T_x})$

Both model classes write down $e^{\text{score}}/Z$. Neither escapes $Z$; they defer it differently. AR’s softmax is a clever payment plan, not an exemption — a point I’ll keep returning to in this series, because nearly every proposed “alternative to softmax” turns out to relocate $Z$ rather than eliminate it.

Then why would anyone sign the expensive contract?

Because the installment plan has a hidden fee. The headline objection to AR — the Exponential Error Compounding Argument (LeCun, 2022) — goes: if each generated token independently steps off the “correct manifold” with probability $\varepsilon$, and errors are unrecoverable, then

\[\Pr\!\left[\, x^{<t>} \in \text{correct},\ \forall\, t \le T_x \,\right] \;\approx\; (1-\varepsilon)^{T_x} \;\longrightarrow\; 0\]

exponentially fast in the length of the generated object. Notice what is being blamed: the per-token factorization — the very design choice that made AR’s $Z$ cheap. The model commits to one token at a time and, in the blind-rollout picture, never gets to revise.

A sequence-level EBM does not factorize over time. Its type signature, $X^{T_x} \to \mathbb{R}$, judges the whole object at once. There is no per-step commitment to compound — by construction. So the toy tradeoff is:

AR: linear-cost normalization, exposure to compounding. EBM: holistic judgment, an intractable $Z$.

One more reason this framing matters to me specifically: the Lean kernel is a hand-built energy function. It maps a whole candidate proof to $\{\text{valid}, \text{invalid}\}$, an energy in $\{0, \infty\}$ if you like. Judging complete objects rather than keystrokes is the native mode of formal verification, which is a large part of why Lean is my testbed for this program.

Fine print: is the compounding argument actually true?

The algebra is one line and it is fine. The contestable part is the error model: constant $\varepsilon$, independent across steps, unrecoverable. I wrote a separate post on exactly this — AR Error Compounding — Real or Fiction? — whose punchline is that under a hard verifier with recovery (backtrack/resample), a recoverable-Markov error process can fit reality far better than the geometric one, and the right contrast becomes AR-without-verifier vs. AR-with-verifier rather than AR vs. EBM. Empirically, what does collapse with problem size is compositional depth, not raw token count (Dziri et al., 2023).

So, to be precise: treat $(1-\varepsilon)^{T_x}$ as a motivation, not a theorem. The error-compounding axis and the partition-function axis are independent claims, and conflating them is the most common confusion I see in EBM discussions. The deeper pro-EBM case lives elsewhere — inference as energy minimization against a verifier, energies composing additively, $Z$ canceling in energy differences — and deserves its own post.

My hypothesis: frontier labs buy $\varepsilon$ down with scale

Here is the hypothesis I find most plausible for why AR systems work in practice despite the compounding story. Frontier labs drive $\varepsilon$ down by brute force — more data, higher-quality data, more compute, heavy post-training — until the usable horizon ($\sim 1/\varepsilon$ tokens) exceeds the trajectory lengths users actually need. The circumstantial evidence: in 2022 a human had to babysit essentially every model step; in 2026, multi-step agentic trajectories are routine. Nothing about the architecture changed in kind. $\varepsilon$ changed.

Two consequences for an academic lab:

We cannot compete on $\varepsilon$-suppression-by-scale. That game is won with data and dollars we don’t have.
The interesting question is at fixed resources. Same model size, same data, same compute: does the EBM contract buy a better error-vs-compute frontier than the AR contract — or at minimum, a clean scientific account of the pros and cons?

And one strategic corollary: since pretrained open-weight LLMs already embody billions of dollars of $\varepsilon$-suppression, the rational first move is not to train an EBM from scratch. It is to convert a pretrained LLM into an EBM — keep the digested data, change the contract. That conversion problem (call it grafting) is where my group is starting.

The catch, and the next post

To run the fixed-resources comparison we have to train the EBM, and the obvious objective — maximum likelihood — needs $\log Z_\theta$, the $V^{T_x}$-term monster. The escape hatch is one of my favorite observations in machine learning: $Z_\theta$ does not depend on the $x$ you evaluate at, so differentiating with respect to the input kills it. Building a training principle out of that observation is called score matching, and it’s the subject of the next post.

Appendix A — Notation

Symbol	Meaning
$X$	The vocabulary (alphabet) $\{x_1, \dots, x_V\}$; $V = \lvert X \rvert$.
$x$, $x^{<t>}$	A sequence $x \in X^{T_x}$ and its token at position $t$.
$T_x$	Length of the modeled object $x$ (unconditional setting). In the conditional setting of the error-compounding post, the exponent variable is the output length $T_y$; the prompt length never enters the exponent.
$\varepsilon$	Per-step unrecoverable error probability. (Written $e$ in the earlier post; renamed here to avoid collision with the exponential base.)
$f_\theta(v; \cdot)$	The AR model’s logit for token $v$ given the context.
$E_\theta$	Energy function $X^{T_x} \to \mathbb{R}$; $-E_\theta(x)$ is an unnormalized confidence score for the whole sequence.
$Z_\theta$	Partition function. AR: $\sum_{v \in X} e^{f_\theta(v)}$ per step ($V$ terms). EBM: $\sum_{\tilde x \in X^{T_x}} e^{-E_\theta(\tilde x)}$ ($V^{T_x}$ terms).
AR	Autoregressive factorization $p(x) = \prod_t p(x^{<t>} \mid x^{<1:t-1>})$.
EBM	Energy-based model: scores configurations with $E_\theta(x)$; probabilities only via $e^{-E_\theta}/Z_\theta$.

References

BibTeX for the references

@misc{lecun2022path,
  author = {LeCun, Yann},
  title  = {A Path Towards Autonomous Machine Intelligence},
  year   = {2022},
  howpublished = {\url{https://openreview.net/pdf?id=BZ5a1r-kVsf}}
}
@incollection{lecun2006tutorial,
  author = {LeCun, Yann and Chopra, Sumit and Hadsell, Raia and Ranzato, Marc'Aurelio and Huang, Fu Jie},
  title  = {A Tutorial on Energy-Based Learning},
  booktitle = {Predicting Structured Data},
  publisher = {MIT Press},
  year   = {2006}
}
@misc{song2021how,
  author = {Song, Yang and Kingma, Diederik P.},
  title  = {How to Train Your Energy-Based Models},
  year   = {2021},
  eprint = {2101.03288},
  archivePrefix = {arXiv}
}
@misc{miranda2026arerrorcompounding,
  author = {Miranda, Brando},
  title  = {Autoregressive Models + LLMs Exponential Error-Compounding Argument --- Is It Real or Fiction?},
  year   = {2026},
  month  = {May},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/05/26/ar-error-compounding-real-or-fiction.html}},
  note   = {Blog post}
}
@inproceedings{dziri2023faith,
  author = {Dziri, Nouha and Lu, Ximing and Sclar, Melanie and others},
  title  = {Faith and Fate: Limits of Transformers on Compositionality},
  booktitle = {NeurIPS},
  year   = {2023}
}
@inproceedings{du2024ired,
  author = {Du, Yilun and Mao, Jiayuan and Tenenbaum, Joshua B.},
  title  = {Learning Iterative Reasoning through Energy Diffusion},
  booktitle = {ICML},
  year   = {2024}
}
@misc{gladstone2025ebt,
  author = {Gladstone, Alexi and Nanduru, Ganesh and Islam, Md Mofijul and others},
  title  = {Energy-Based Transformers are Scalable Learners and Thinkers},
  year   = {2025},
  eprint = {2507.02092},
  archivePrefix = {arXiv}
}

If you’d like to cite this post:

@misc{miranda2026whyebms,
  author = {Miranda, Brando},
  title  = {Why Energy-Based Models? The Toy AR-vs-EBM Argument},
  year   = {2026},
  month  = {June},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/06/09/why-energy-based-models-the-toy-ar-vs-ebm-argument.html}},
  note   = {Blog post}
}

Autoregressive Models + LLMs Exponential Error-Compounding Argument — Is It Real or Fiction?

2026-05-26T00:00:00-07:00

Brando Miranda — May 2026 · ~3 min read

TL;DR. The Exponential Error Compounding Argument against autoregressive language models says that if each generated token has an independent unrecoverable error probability $e$, then the chance of producing a fully correct length-$T_y$ object is $(1 - e)^{T_y}$ — which goes to zero exponentially fast in $T_y$. This post asks whether real verifier-guided systems actually behave like that. The algebra is fine. The empirical question is whether the independence + unrecoverable assumption survives in trained, verifier-guided AR systems. If a recoverable-error model fits the data better than $(1 - e)^{T_y}$, the argument is not false algebraically; it is false as a model of the system we actually run.

The argument has a name

The headline objection to autoregressive language models — “errors compound, so long-form generation collapses” — has a name worth keeping. Call it the Exponential Error Compounding Argument. The most-cited articulation is in Yann LeCun, A Path Towards Autonomous Machine Intelligence (2022); the same shape appears in classical EBM writeups such as LeCun et al., A Tutorial on Energy-Based Learning (2006). The equation is one line:

\[P(\text{success at length } T_y) \;=\; (1 - e)^{T_y} \;=\; \exp\!\bigl(T_y \log(1 - e)\bigr).\]

where $T_y$ is the output sequence length the model is asked to generate (the prompt length $T_x$ is a separate variable that does not appear in the exponent). The exponential form makes the decay rate explicit: for small $e$, the half-life in $T_y$ is roughly $\log 2 / e$. At $e = 1\%$ that is about $70$ tokens; at $e = 0.1\%$ it is about $700$. Either way, the bound says that fully correct long-form generation is asymptotically impossible.
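The half-life figures can be reproduced in a few lines. This is a minimal sketch with hypothetical function names; it compares the exact half-life $\log(1/2)/\log(1-e)$ with the small-$e$ approximation $\log 2 / e$ used above.

```python
import math

def success_prob(e, T_y):
    # Geometric model: every one of T_y steps must be correct.
    return (1.0 - e) ** T_y

def half_life(e):
    # Exact real-valued T_y at which (1 - e)^T_y equals 1/2.
    return math.log(0.5) / math.log(1.0 - e)

for e in (0.01, 0.001):
    # Exact half-life next to the small-e approximation log(2) / e.
    print(e, half_life(e), math.log(2) / e)
```

The two columns differ by less than one token at both error rates, so the rounded figures of about 70 and 700 hold either way.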

The hard part is deciding whether this equation describes autoregressive agents we actually deploy — or just a simplified blind rollout.

The assumptions are the experiment

The bound is mathematically valid. What is contestable is the error model it assumes:

Per-token error probability $e$ is constant across positions.
Per-token errors are independent.
Errors are unrecoverable — once a step is wrong, the trajectory stays off-manifold.

Drop any of these and the geometric curve loosens — often dramatically. A handwritten note I keep going back to phrases this exactly:

“If independence is true, this example shows — as $T_y$ gets large — is an upper bound to LeCun (but we can prob fix that).”

That is the hinge. The point of this experiment is not to argue with $(1 - e)^{T_y}$; it is to measure whether the assumptions feeding it survive a hard verifier and a real trained model. The hypothesis under test is sharper than “are LLMs good?”:

Do autoregressive model-plus-verifier systems behave like independent unrecoverable error processes?

If a recoverable-Markov process (a 2-state chain $\{\text{on-manifold},\ \text{off-manifold}\}$ with a nonzero per-step recovery probability) fits success-vs-length curves better than the geometric model, then the right contrast is not AR vs. EBM. It is AR-without-verifier vs. AR-with-verifier (recovery changes the exponent). That is a different research program from “abandon autoregressive models.”
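A minimal sketch of how the two error models diverge. It rests on assumptions that are not specified in this post: "success" is read here as being on-manifold at the final step, the error probability `e` and recovery probability `r` are constant, and the trajectory starts on-manifold. The function name and the numbers are hypothetical. Setting `r = 0` recovers the geometric curve; any `r > 0` makes the curve level off at $r/(e+r)$ instead of decaying to zero.

```python
def on_manifold_prob(e, r, T_y):
    # Two-state chain: on -> off with probability e, off -> on with probability r.
    # Returns the probability of being on-manifold after T_y steps.
    p_on = 1.0  # assumption: the trajectory starts on-manifold
    for _ in range(T_y):
        p_on = p_on * (1.0 - e) + (1.0 - p_on) * r
    return p_on

geometric = on_manifold_prob(0.01, 0.0, 100)         # no recovery
recoverable = on_manifold_prob(0.01, 0.05, 100)      # with recovery
long_run = on_manifold_prob(0.01, 0.05, 5000)        # approaches r / (e + r)
print(geometric, recoverable, long_run)
```

A different success criterion, such as requiring a verifier-accepted final object, would change the curve, so this illustrates the shape of the contrast rather than the experiment itself.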

Appendix A — Notation

Symbol	Meaning
$T_y$	Length of the output sequence the AR model generates (proof steps, code tokens, words, tactics). The exponent in $(1 - e)^{T_y}$ is this $T_y$.
$T_x$	Length of the input / prompt the model conditions on. Not the primary axis of the error-compounding claim — included for symmetry; the model conditions on $T_x$ context tokens to produce $T_y$ output tokens.
$e$	Per-step “unrecoverable error” probability under the geometric model: assumed independent across steps and never repaired.
$(1 - e)^{T_y}$	The geometric prediction: probability that all $T_y$ output steps are simultaneously correct under the independent-unrecoverable error model. Equivalently $\exp\bigl(T_y \log(1 - e)\bigr)$.
$p$	Constant pass probability — the trivial baseline that ignores length entirely.
recoverable-Markov	A 2-state chain $\{\text{on-manifold},\ \text{off-manifold}\}$ with a nonzero per-step recovery probability; the alternative to “errors are unrecoverable.”
AR	Autoregressive: the factorization $p(x_{1:T_y}) = \prod_{t} p(x_t \mid x_{<t})$.
EBM	Energy-based model: scores configurations with $E_\theta(x)$ and normalizer $Z_\theta = \sum_x \exp(-E_\theta(x))$.
verifier	A hard checker (e.g., the Lean type-checker) that returns valid / invalid on a generated step or object, enabling recovery via backtrack / resample.

References

If you’d like to cite this post:

@misc{miranda2026arerrorcompounding,
  author = {Miranda, Brando},
  title  = {Autoregressive Models + LLMs Exponential Error-Compounding Argument --- Is It Real or Fiction?},
  year   = {2026},
  month  = {May},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/05/26/ar-error-compounding-real-or-fiction.html}},
  note   = {Blog post}
}

Metallica Goes Jazz: A 2010 High School Arrangement, Reanimated by AI

2026-05-07T00:00:00-07:00

Brando Miranda — May 2026 · ~6 min read

TL;DR. In 2010, as a high schooler at Greengates, I arranged Metallica’s The Day That Never Comes by hand for the school orchestra. Sixteen years later, I fed that same arrangement into Suno and asked it to recast the piece in the styles of Chet Baker, Cannonball Adderley, Charlie Parker, Joe Henderson, Stan Getz, and Sonny Rollins. Here is the first take. The point isn’t novelty. The point is that AI lets you keep playing with old work in ways your past self never had the chops — or the time, or the ensemble — to attempt.

Greengates, 2010

I was sixteen, and I wanted my school orchestra to play Metallica. Specifically The Day That Never Comes — the 2008 ballad off Death Magnetic with the long clean intro, the slow burn into distortion, and that anthemic final solo that I had been listening to on repeat. The orchestra was not a metal band. It was a school orchestra. So I sat down in Sibelius and wrote out parts: strings carrying the clean intro lines, brass and winds handling the rhythmic hits, percussion holding the long-form structure. It took weeks. I had no idea what I was doing.

I cannot overstate how primitive my workflow was. Sibelius 4. A laptop. A pirated MIDI rip of the song to reference by ear. No engraving expertise, no orchestration class, no internet community to ask. Just me, a deadline, and a stubborn belief that this song would sound good with strings. The PDF and .sib files still exist in the experiment folder on GitHub, with public copies linked below. The audio below is the Sibelius playback of the score. It is not a real ensemble recording. It is what 2010-me thought the arrangement should sound like, rendered by a sample library that, charitably, did its best:

Your browser does not support inline audio. Download the MP3.

The full score (PDF) is here:

If the embed does not render in your browser, open the PDF directly.

Why jazz, why now

I have always loved jazz. I wrote about that back in 2019 — jazz and improvisation are two activities close to my heart, and the deepest skill in either is internalizing structure so completely that you can leave it on purpose without falling apart. I am confident I can play a Charlie Parker line — I’ve played Confirmation before. I do not have a working quintet on call. I cannot book Stan Getz for an afternoon. But the curiosity has always been there: what would my Metallica arrangement sound like if a 1957 Blue Note session got hold of it?

That question used to be a thought experiment. Now it is a Suno prompt.

The prompts

I sat down and wrote six style prompts, one for each player whose voice I wanted to hear interpret the piece. The prompts live alongside the arrangement files in agents.md, tuned for Suno V5 — short, era-tagged, instrument-explicit, and aggressively non-rock to keep the model from falling back on power chords. A representative one:

Bebop, 200 BPM, virtuosic alto saxophone with rapid eighth-note runs and chromatic passing tones, walking upright bass, busy bebop drums with snare comping, piano with rootless voicings and tritone substitutions, frantic and intricate, 1940s bebop quintet.

That is the Charlie Parker prompt — the one behind the first take below. The others — Chet Baker, Cannonball, Joe Henderson, Stan Getz, Sonny Rollins — follow the same shape: subgenre, BPM, lead instrument and tone, rhythm-section behavior, mood, era. Each one is a hypothesis about how the harmonic skeleton of the song would refract through a different musical sensibility.

First listen

The first take is here on Suno. I will let you form your own opinion before I form mine for you. What I will say is that it captured something I genuinely could not have produced on my own — not in 2010, and not now. Whether it captured what I wanted is a different question, and that gap is exactly the interesting part. I am going to keep iterating: more prompts, more takes, more comparisons across players. This is take one of an experiment, not a finished product.

What this is really about

I have written before that AI is a force of nature, and the rational response is to merge with it rather than fight it. That post was about research and peer review and career stakes. This one is the lighter cousin. It is the same thesis applied to a personal artifact from sixteen years ago.

Here is the thing: if you had told sixteen-year-old me, while I was hand-engraving viola parts at midnight, that one day I would be able to type six sentences and hear what The Day That Never Comes sounds like in the voice of Charlie Parker, I would not have believed you. Not because the technology seemed impossible — I had no model for the technology at all — but because the gap between intent and execution felt like the thing that defined being an amateur. You wanted to hear it; you couldn’t; that was the deal.

That deal is over. The gap between I wonder what this would sound like and here is what it sounds like has collapsed for an enormous range of creative tasks. That collapse is uncomfortable for people who built their identity inside the gap. I get it. But I would rather use the new tools to revisit old work, ask better questions, and explore the corners of my own taste that were previously gated behind skills I did not have the years to develop.

The arrangement is the same arrangement. The structure I wrote down at sixteen has not changed. What has changed is that I can now hear that structure performed by ensembles that exist only in prompt-space, and use those performances to learn something about what I was reaching for back then.

That is enough for a Wednesday afternoon.

All files for this experiment (downloadable):

If you’d like to cite this post:

@misc{miranda2026metallicajazz,
  author = {Miranda, Brando},
  title  = {Metallica Goes Jazz: A 2010 High School Arrangement, Reanimated by {AI}},
  year   = {2026},
  month  = {May},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/05/07/metallica-goes-jazz.html}},
  note   = {Blog post}
}

Acknowledgments

Thanks to my friend Audriix — a professional pop artist and songwriter based in the Bay Area — for cheering this experiment on and for the conversation that nudged me toward the next iteration of it. Go listen to her music.

The Em Dash Is Only an AI Fingerprint If You Didn’t Already Use It ;)

2026-05-04T00:00:00-07:00

Brando Miranda — May 2026 · ~1 min read

TL;DR. The em dash is only an AI fingerprint if it was not already part of the writer’s fingerprint. Some of us were dash people long before ChatGPT made punctuation suspicious. I got mine from William Zinsser’s On Writing Well, where the dash is treated as a normal, useful piece of nonfiction craft. The mark is not the giveaway. The baseline is.

There is a small, ambient annoyance I have been carrying for the past year, and I want to put it down.

People keep telling me that my writing “sounds like AI” because I use em dashes. The accusation is usually delivered as a compliment-shaped jab: “great post, but you should maybe dial back the em dashes, it makes it look generated.” I understand where the suspicion comes from. Long-form models do over-use the em dash, and pattern-matching on punctuation is a cheap heuristic for spotting low-effort text.

The problem is not that the heuristic is always false. The problem is that it ignores the author’s baseline. If someone never used em dashes, then suddenly starts writing every paragraph like a LinkedIn ghostwriter with a punctuation sponsorship, fine, raise an eyebrow. But if someone was already using them before the current AI panic, the dash is not an AI fingerprint. It is just a fingerprint.

In other words: the em dash is only an AI fingerprint if you did not already use it.

I bought William Zinsser’s On Writing Well years before GPT-3, before ChatGPT, before “ChatGPT-style writing” was a phrase anyone said out loud. There is a chapter called “Bits & Pieces,” and inside it Zinsser gives the dash full citizenship in good English. The em dash did not arrive with AI. Some of us were using it back when the hot productivity stack was underlining a paperback and feeling profound about it.

Zinsser’s first example is simple: “We decided to keep going—it was only 100 miles more and we could get there in time for dinner.” The dash is not decoration there. It turns the sentence forward and gives the reason. His other use is the parenthetical aside: a thought inside a thought, without stopping the sentence cold.

That is why I use it. Not because it is a vibe, not because it performs intelligence, and not because it carries some secret modern signal. I use it because sometimes the sentence wants a turn, and the dash is the cleanest turn available.

Calling that an AI fingerprint is like calling clean indentation a Copilot fingerprint. The tool was a tool before the panic started. The flattering version is that cool people like me were simply early.

A practical request: when you see writing you suspect of being generated, do not start with one punctuation mark in isolation. Compare it to the author’s prior writing. Then look at the thinking. Is there a claim in the post that the author committed to? Is there a counter-argument they actually wrestled with? Is there a sentence that could only have been written by someone with skin in the game? Those are the fingerprints worth checking.

In the meantime, I am keeping my em dashes. Zinsser got there first, I got there before the panic, and the dash remains undefeated ;)

If you’d like to cite this post:

@misc{miranda2026emdash,
  author = {Miranda, Brando},
  title  = {The Em Dash Is Only an AI Fingerprint If You Didn't Already Use It ;)},
  year   = {2026},
  month  = {May},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/05/04/em-dash-not-an-ai-fingerprint.html}},
  note   = {Blog post}
}

Embrace AI or Be Left Behind — By People, Not Machines

2026-04-27T00:00:00-07:00

Brando Miranda — April 2026 · ~6 min read

TL;DR. The real risk of ignoring AI isn’t that machines replace you — it’s that people who use them will. AI is a force of nature: trained on the world’s data, improving relentlessly. Adapting isn’t optional enthusiasm; it’s survival. I’d rather merge with the wave than fight it.

A conversation with my collaborator Dan this week crystallized something I’ve felt for a while but hadn’t written down. He mentioned wanting to opt out of AI-assisted peer review, half-joking that he didn’t trust me not to use Claude Code for mine. I told him what I tell everyone: I personally feel morally obliged to always say AI is allowed — if not encouraged — to help. Otherwise I’d be inconsistent with my own values, and worse, I’d be promoting a future I think is dangerous.

That sounds strong. Let me explain what I mean.

Trained on the world’s data

These models were trained on the entire data of the world. Think about that for a moment. Every textbook, every paper, every forum post, every codebase — distilled into a system that can reason across all of it simultaneously. How do we expect to compete with that? It’s ridiculous to believe we can, at least on the axis of breadth. A single human who has read a thousand papers is impressive. A model that has ingested millions is operating at a fundamentally different scale.

This isn’t a reason for despair. It’s a reason to change the game. The human advantage was never breadth — it was taste, intent, and knowing what question to ask. But breadth matters enormously for execution, and on execution these systems are already better than the average practitioner in most domains. That includes peer review, code generation, and mathematical reasoning — three things I care about deeply.

A force of nature

I keep coming back to the same phrase: AI is a force of nature. Not because I think it’s mystical or beyond understanding, but because its trajectory has the same quality as other forces we can’t individually opt out of. You don’t negotiate with compound interest. You don’t vote on whether Moore’s Law applies to you. And you don’t get to decide that language models trained on the sum of human knowledge won’t affect your field. They will. The question is whether you’re positioned to benefit or positioned to be displaced.

Once these systems digest the whole universe of human knowledge — and they’re close — I don’t understand why we’d doubt them. It’s just the way things are. It’s better to accept reality and merge with it than to fight a pointless battle.

Left behind — by whom?

Here’s the part people get wrong. The fear isn’t that AI replaces you. Not directly. The fear is that people who embrace AI will leave you behind.

A researcher who uses AI to draft, review, iterate, and verify will produce more, catch more errors, and explore more ideas than one who doesn’t. A formal methods group that integrates language models into proof search will make faster progress than one that spends years grinding for a 0.02% improvement on a benchmark no one uses. The technology doesn’t care about your preferences. But the people who adopt it will outpace the people who don’t, and that gap compounds.

I don’t want to be left behind. And to put it more bluntly — being left behind in a world that moves this fast isn’t a career inconvenience. It’s closer to an existential risk. Not from AI itself, but from the people who wield it while you’re still deciding whether to try.

Why I feel morally obliged

I say “morally obliged” because I think there’s a responsibility that comes with seeing where things are heading. If I believed AI tools were harmful, I’d say so. But I don’t. I think they’re the most powerful amplifier of human capability we’ve ever built, and that restricting access to them — or discouraging their use — actively harms the people who listen.

When a student asks me whether they should use AI for their research, the answer is always yes. When a reviewer asks whether AI-assisted review is acceptable, the answer is always yes. Not because I think the tools are perfect — they aren’t — but because refusing to use them doesn’t make you more rigorous. It makes you slower. And slower, in a competitive landscape that moves at model-training speed, means fewer ideas explored, fewer papers written, and fewer opportunities to do the work that matters.

If I told my students not to use AI, I’d be promoting a future where they’re less competitive, less productive, and less prepared. That’s not a future I’m willing to endorse. It’s inconsistent with everything I believe about technology and human progress.

Adaptation is the only strategy

I think about this through the lens of my own research. I work on AI for formal verification in Lean 4. I co-founded Stanford AI for Lean because I believe the intersection of AI and formal methods is where the future of mathematics lives. I built a multi-agent workflow because I wanted to use AI agents seriously — not as toys, but as structured components of a rigorous process.

Every one of these decisions was an adaptation. I saw where the wave was going, and I chose to ride it instead of watching it from shore. Sometimes that means my tools are better than me at specific subtasks, and that’s fine. The point was never to be the best at everything — it was to be the person who knows how to direct the best tools at the right problems.

That’s the real skill now. Not whether you can write code faster than a model — you can’t. Not whether you can review a paper more carefully than an ensemble of models — you probably can’t. But whether you can choose the right problem, formulate the right question, and structure a process that catches errors before they matter. That’s still ours. For now.

Why I’m here in the first place

I should also say plainly: I find this incredible. The fact that we built systems that learn from data, generalize beyond what they were shown, and now reason across nearly the entire written record of humanity — it’s astonishing. I’m here because I’m fascinated by how it actually works, not only because the wave is unavoidable. Curiosity got me into this field long before it was fashionable.

I’ve been at this since 2012, training base learning algorithms back when neural nets were still a slightly disreputable thing to bring up at a serious ML group meeting. I can pinpoint the exact moment I committed to it: a clip from Andrew Ng’s Stanford machine learning course where the VC-dimension generalization bound goes up on the board. That equation made the argument concrete for me. Learning is a quantifiable thing. Generalization is bounded by structure plus data. If you can automate the loop — hypothesize, test, update — you can automate the scientific method itself. And if you can automate the scientific method, you can solve intelligence; and if you can solve intelligence, you can in principle solve any problem science is capable of solving. That was the argument that hooked me, and it still is.

That conviction is why “force of nature” doesn’t read as fatalism to me. It reads as the field finally catching up to what the theory implied a long time ago.

I have no choice

People sometimes ask why I’m so enthusiastic about AI. The honest answer is: I have no choice. Not in the sense that someone is forcing me, but in the sense that the alternative — pretending these tools don’t change everything — is self-defeating. I embrace AI because the cost of not embracing it is higher than the cost of any mistake I might make along the way.

It’s a force of nature. The only rational response is to learn to work with it.

If you’d like to cite this post:

@misc{miranda2026embraceai,
  author = {Miranda, Brando},
  title  = {Embrace {AI} or Be Left Behind --- By People, Not Machines},
  year   = {2026},
  month  = {April},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/04/27/embrace-ai-or-be-left-behind.html}},
  note   = {Blog post}
}

Asking the Right Question: Formal Methods as Scalable Oversight

2026-04-22T00:00:00-07:00

Brando Miranda — April 2026 · ~3–4 min read

TL;DR. In the era of AI agents, formal methods, and Lean specifically, are finally within practical reach for scalable oversight of complex arguments. Let the verifier check the answer; let the human check the question. Deciding what to ask, knowing what we actually want, is the part we may not be able to outsource.

A conversation broke out in my lab channel this week about whether AI coding assistants are actually helping us do better research, or just helping us feel like we are. It’s a question I keep coming back to, and I want to write down where I’ve landed.

The worry

A colleague laid out the concern cleanly. Long-term, he thought the usefulness of these tools splits along three axes: how much you already know, what you choose to spend their output on, and how well you parse what they produce. His sharpest observation was the third. Models generate text faster than any human can read it. Over time that speeds up the surface and slows down the understanding. You skim, you trust, you move on. Another colleague echoed it from a different angle: he’d noticed a pattern of confidence without understanding, students sure they’d found something, unable to explain why, unable to catch errors when they appeared.

I don’t think either of them is wrong. I’ve felt the same pull, and I include myself on the list of guilty parties. But the picture is incomplete, because it treats “checking” as a single unstructured activity: one human eye against an avalanche of generated text. That framing makes the problem look unwinnable. It isn’t, at least not in every domain.

The need for simple verification

In mathematics, the act of checking has a structure we don’t always take advantage of. A proof of Fermat’s Last Theorem is long, intricate, and beyond most of us to audit line by line. But the statement of Fermat’s Last Theorem is short. If the theorem is written in Lean, the kernel checks the proof automatically and deterministically. What a human actually has to verify is not the hundreds of pages of argument but a single claim: that the Lean statement faithfully expresses the informal theorem.

This is still not trivial. Autoformalization is hard, and a wrong statement with a valid proof is worse than no proof at all. But checking a translation is dramatically cheaper than checking a proof. The complex reasoning gets offloaded to a deterministic verifier. The human is freed to do the one thing that was always ours to do: decide whether the question is the right question. The quality of research, like the quality of life, depends on the quality of questions one asks.

That, to me, is what scalable oversight looks like. Not a human racing a language model through a wall of text, but a human checking the short, human-readable end of an argument while a formal system checks the long, mechanical end.
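
To make the asymmetry concrete, here is a minimal sketch of what the short, human-checkable end might look like in Lean 4. The name `FermatStatement` and this particular encoding over `Nat` are assumptions made for illustration; this is not presented as the definition any library uses, and nothing here proves the theorem.

```lean
-- Hypothetical encoding of the informal statement; core Lean 4 only, no Mathlib.
-- This definition is the part a human has to read and agree with.
def FermatStatement : Prop :=
  ∀ n a b c : Nat, n > 2 → a > 0 → b > 0 → c > 0 → a ^ n + b ^ n ≠ c ^ n

-- A single concrete instance that the kernel checks on its own.
example : 1 ^ 3 + 1 ^ 3 ≠ 2 ^ 3 := by decide
```

The reviewer's job is to confirm that the hypotheses and the conclusion say what the informal theorem says. For instance, the positivity conditions have to be there: with `a = 0` and `c = b` the equation holds trivially, so dropping them makes the statement false. Any proof of `FermatStatement`, however long, is the kernel's job.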

And here’s what’s new: in this era of AI agents, formal methods are within practical reach in a way they weren’t five years ago, and Lean specifically is the tool. There is no comparable alternative, and the momentum is unambiguous. AWS verifies its Cedar authorization language in Lean. DeepMind’s AlphaProof reached IMO silver-medal performance proving theorems in Lean (Nature, 2025). Harmonic and Axiom were both founded with Lean as their backbone. The community library Mathlib is past 1.6M lines and includes the formal verification of Tao’s Polynomial Freiman–Ruzsa Conjecture and parts of Scholze’s Liquid Tensor Experiment. Coordinating this ecosystem is the non-profit Lean FRO, which is fighting to stay funded; if you care about trustworthy software and AI, it is worth supporting. What used to be a niche enterprise for a handful of specialists is starting to look like a workflow the rest of us can adopt, which is exactly why scalable oversight through formal methods is a live question now and wasn’t five years ago.

How far it goes

If you believe that most of what we care about can be expressed mathematically, a lot of oversight problems reduce to “are you asking the right question?” And here’s my hot take: I think “most of what we care about” includes things people usually assume are off-limits. Love included. Probability theory is surprisingly accommodating, and I don’t see a principled reason to exempt feelings, values, or relationships from formalization in some language rich enough to hold them. The claim is not that a Lean file replaces a relationship; it’s that the structure of what we mean by one is not, in principle, beyond expression. That’s not a small residue to be left with. What remains is the most important thing a researcher does. But it’s a residue we’re equipped to handle, because it’s the part that depends on intent, taste, and knowing what you actually want rather than symbol pushing.

There are limits, at least in theory. A brief conversation with a friend reminded me that some true statements have no proof: Gödel incompleteness, the halting problem, and diagonalization arguments are all roughly the same shape of result. These impossibilities apply to discrete, countable systems. If reality is that kind of object, the residue matters. If reality is continuous, or uncountably infinite in some way our formal systems don’t capture, the impossibility may not bite in the same way. And here’s the honest part: which regime we’re in strikes me as unfalsifiable. We don’t get to check whether the universe is the kind of structure these theorems quantify over. So the worry is real in theory but unknowable in practice, and I’d rather say that out loud than paper over it.

What this means for the worry

The concerns my colleagues raised don’t go away under this view; they sharpen. If the human job is to ask the right question and verify the formalization, then the human has to actually understand the question and the formalization. Shallow reading is still a problem; confidence without understanding is still a problem. But the target of understanding shifts. I don’t need to hold every line of a generated proof in my head. I need to hold the statement, the definitions, and the assumptions. That’s a much smaller object, and it’s the object that actually encodes what I meant.

This is why I find formal methods so generative as a research direction, and why I care about Lean beyond its role in mathematics. It gives us a way to scale human oversight by shrinking what humans have to oversee, without pretending we’ve removed humans from the loop. We haven’t. We’re still the ones who decide what counts as the right question. Could machines eventually learn to ask it for us? Maybe they can, maybe they can’t, I genuinely don’t know. But if there’s a limit, I suspect it’s not a capability limit; it’s a substrate one. These systems aren’t running on biological neurons, they aren’t human, and knowing what a human actually wants may be inseparable from being one. Wanting isn’t obviously a thing you can outsource.

Addendum: the OpenAI geometry result

Several people asked how this post relates to OpenAI’s May 2026 report, An OpenAI model has disproved a central conjecture in discrete geometry. My reaction is mostly that it makes the point more urgent. I don’t understand the construction well enough to comment on the math, and the philosophical difference between a counterexample and a “normal proof” is not the main thing. The oversight lesson is: as model-generated arguments become increasingly superhuman, informal expert review will not scale by itself. Lean-style formalization becomes more important, not less, because it lets us reduce the human job to the part that still has to be human: checking that the formal statement is the theorem we meant.

Appendix A: Lean is also a real programming language

One reason industry is converging on Lean rather than treating it as a math-only artifact is that it is a full-fledged programming language, not a proof tool with a side of code. Lean 4 compiles to native code via C, the standard library (batteries) ships with hash maps, datetime, and concurrency primitives, and the IO monad covers the system-call surface: files, processes, threads, refs, environment. Lean is implemented in Lean. In 2025 it won the SIGPLAN Programming Languages Software Award for “significant impact on mathematics, hardware and software verification, and AI.” For a guided tour of the language as a language, see Functional Programming in Lean.

The point, for this post: when you formalize a system in Lean, you are not crossing a chasm into “the math world.” You are writing a program in a language whose compiler happens to also be a proof checker. That is why the same artifact can both run and be verified, and why formal methods stop being exotic and start being a normal thing a working engineer can do.
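
As a small illustration of one artifact that both runs and is verified, here is a sketch in core Lean 4. The function `double` and the theorem name are made up for this example, and it assumes a recent Lean 4 toolchain where the `omega` tactic is available without extra imports.

```lean
-- An ordinary function: it compiles and runs like any other program.
def double (n : Nat) : Nat := n + n

-- A property of that same function, checked by the kernel when the file is built.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega

-- Entry point for the compiled executable.
def main : IO Unit :=
  IO.println s!"double 21 = {double 21}"
```

The definition that `main` executes is the same definition the theorem talks about, so there is no separate model of the code to keep in sync with the code.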

If you noticed the em dashes: yes, I have always used them, people should still use them, and they do not mean the writing was ML-generated; I wrote a short note on that here: The Em Dash Is Only an AI Fingerprint If You Didn’t Already Use It ;).

Discuss

I shared this post on X/Twitter here: discussion thread. Replies, disagreements, pointers, and follow-up questions are welcome there.

If you’d like to cite this post:

@misc{miranda2026formaloversight,
  author = {Miranda, Brando},
  title  = {Asking the Right Question: Formal Methods as Scalable Oversight},
  year   = {2026},
  month  = {April},
  howpublished = {\url{https://brando90.github.io/brandomiranda/2026/04/22/formal-methods-scalable-oversight.html}},
  note   = {Blog post}
}

Two TLDRs Are Better Than One: A Small Prompt-Design Fix That Matters More Than It Looks

2026-04-16T00:00:00-07:00

Brando Miranda — April 2026 · ~5 min read

TL;DR. A short post about a tiny change I made to my agents-config rules — adding a second TLDR to the top of every agent response — and why the tiny change is actually a lesson about how position on the page interacts with chain-of-thought.

The annoying problem

I ask every agent in my workflow to end its response with a **TLDR:** line. It’s in the Hard Rules. The rule exists because once you run agents in parallel across byobu panes, across SSH sessions, across phone previews, you stop reading full responses — you skim. A one-line summary at the end is the thing that lets me decide, in half a second, whether to look closer.

The problem: Claude Code has a prefix-s mode that collapses the response. In collapsed view the end of the response is exactly what you don’t see. So the TLDR — the entire reason the rule exists — is hidden precisely when I most want it.

Obvious fix: put the TLDR at the top.

Except that breaks the reason I put it at the bottom in the first place.

Why putting the summary at the top is worse than it sounds

The reason I ended responses with a TLDR rather than opening with one is chain-of-thought. If the agent writes the summary first, the summary is generated from the prompt alone — before the model has worked through the actual problem. That summary is a prediction. It’s the model’s best guess at what it’s about to conclude, not what it actually concluded after reasoning through the task.

The summary written after the response is different. By the time the model writes it, it has already produced the reasoning, caught its own mistakes in the middle of the response, and narrowed down which of two approaches actually wins. The end-of-response TLDR is the CoT-informed TLDR. It’s the one that matches the body of the response.

If you replace the end TLDR with a top TLDR, you lose that. If you put TLDRs at both positions without thinking about it, the model will almost always just copy the top one to the bottom — anchoring bias is strong, and LLMs really dislike contradicting themselves inside a single response. You end up with two TLDRs that are both the worse TLDR.

The fix

The rule I added is slightly weird, and the weirdness is the whole point:

Open every response with **TLDR-start:** and close with **TLDR-end:** (1–2 sentences each). The top one is a fast preview. The bottom one is authoritative: write it last, from the actual response content, ignoring what TLDR-start said. If the reasoning changed your conclusion, TLDR-end should reflect that — divergence between the two is expected and fine.

Two things are load-bearing here. The first is renaming the labels. Calling them TLDR-start and TLDR-end (instead of two things both called TLDR:) gives the model permission to let them differ. A rule that says “write TLDR, then TLDR” reads as a request for the same summary twice, so any difference between the two feels like a consistency violation. A rule that explicitly distinguishes a preview from an authoritative final is a rule the model can actually obey.

The second is the explicit instruction to ignore TLDR-start when writing TLDR-end. Without it, the anchoring effect wins: whatever the model forecasted at the top leaks into what it writes at the bottom, even after a long reasoning chain. You have to name the trap out loud.
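
If you want to gate on the rule rather than trust it, a format check is cheap. The following is a minimal sketch, not part of agents-config: the function name `check_dual_tldr` is hypothetical, and it assumes each TLDR sits on a single line, with TLDR-start on the first non-empty line and TLDR-end on the last. The verbatim-copy check is only a heuristic for the anchoring failure described above; identical summaries are not always wrong.

```python
START = "**TLDR-start:**"
END = "**TLDR-end:**"


def check_dual_tldr(response: str) -> list[str]:
    """Return a list of problems; an empty list means the format looks right."""
    # Keep only non-empty lines so leading/trailing blank lines do not matter.
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return ["empty response"]

    problems = []
    if not lines[0].startswith(START):
        problems.append("response does not open with TLDR-start")
    if not lines[-1].startswith(END):
        problems.append("response does not close with TLDR-end")

    if not problems:
        top = lines[0][len(START):].strip()
        bottom = lines[-1][len(END):].strip()
        # Heuristic: a word-for-word copy suggests the bottom summary was
        # anchored on the top one instead of written from the body.
        if top == bottom:
            problems.append("TLDR-end is a verbatim copy of TLDR-start")
    return problems


if __name__ == "__main__":
    sample = "**TLDR-start:** Probably A.\n\nSome reasoning.\n\n**TLDR-end:** Actually B."
    print(check_dual_tldr(sample))
```

A check like this only covers the shape of the response. Whether TLDR-end actually reflects the body still needs a reader or a stronger gate.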

What this is really about

The reason I’m bothering to write this up is not the TLDR specifically. It’s that this is one of a growing list of small prompt-design fixes where the position of a thing on the page changes the quality of the thing, not just its visibility.

Other examples from my own workflow:

Asking an agent to list risks before proposing a solution produces different risks than asking it after. Before: cautious, abstract, generic. After: grounded in the actual proposed approach, but biased toward risks the solution already handles.
Asking for a plan, then having the agent critique its own plan in a new section, produces better plans than asking for “a plan with caveats” in one pass.
Putting “think step by step” at the top versus at the bottom of a prompt is not the same prompt.

All of these are the same lesson: an LLM response is not a document with a fixed meaning. It is a sequence, and what appears earlier conditions what appears later. Once you take that seriously, a rule like “always end with a TLDR” is not really a formatting preference. It is a specification of when you want the model to commit to its conclusion — before reasoning, or after.
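
One way to write this down, as a sketch in standard autoregressive notation. The symbols are labels introduced here, not anything from the rules file: $x$ is the prompt, $y_t$ the $t$-th response token, $s$ the TLDR-start, $b$ the body, and $e$ the TLDR-end. A response of $T$ tokens is generated as

\[p(y_{1:T} \mid x) \;=\; \prod_{t=1}^{T} p(y_t \mid x,\, y_{<t}),\]

and splitting the response into its three segments gives

\[p(s, b, e \mid x) \;=\; p(s \mid x)\; p(b \mid x, s)\; p(e \mid x, s, b).\]

The top summary $s$ is conditioned on the prompt alone, while the bottom summary $e$ is conditioned on the body too. It is also conditioned on $s$, which is where the anchoring comes from and why the rule has to tell the model to ignore TLDR-start.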

The connection to everything else

In the workflow I wrote about last week, the headline idea was correctness gating: agents produce, gates check. Gates are the big lever. But the small lever — where on the page you ask the model to summarize, critique, or commit — is, in aggregate, doing a lot of work too. Both levers point at the same thing: the output is not just a function of the model, it is a function of the shape of the interaction.

Agent output gets better when the rules take the shape of reasoning seriously. Dual-position TLDR is a five-line rule. It is also a small argument for the broader bet: the structure of how you ask matters as much as what you ask.

agents-config is open source under Apache 2.0. The dual TLDR rule lives in INDEX_RULES.md Hard Rule 4.

Velocity in Research

2026-04-16T00:00:00-07:00

Brando Miranda — April 2026 · ~5 min read

Warning: this post is a draft — content may change and errors may remain.

TL;DR. Research feels slow because we grade ourselves on whether experiments worked — something largely outside our control. The better metric is velocity: the rate at which you reduce uncertainty in a chosen direction. Vectoring picks the riskiest question; velocity answers it as cheaply as possible. Be ruthless about core versus periphery (fake, mock, or skip everything inessential), and lean on $v = d/t$ — the fastest way to go faster is almost never doing more, it’s doing less, faster, and ignoring the rest.

Most of my students arrive with the same complaint. Research feels slow. Weeks disappear, experiments fail, deadlines pass, and the advisor starts to look worried. I’ve been there myself, more than once. The instinct is to work harder — longer hours, more code, a bigger system. That instinct is almost always wrong.

What fixed it for me was rethinking what progress even means in a research project. Research is a project with high uncertainty. If you knew the answer, it wouldn’t be research. And if the result of your experiment is outside your control, then “it worked” is a terrible metric for how well you’re doing your job. The better metric is velocity — how quickly you’re learning — and once I internalized that, the swamp stopped feeling like failure and started feeling like the actual work.

Why progress is the wrong metric

The swamp is my shorthand for the middle of any serious project. The model won’t train. The proof won’t close. The engineering keeps cropping up new problems. From the outside it looks like nothing is happening, and from the inside it feels the same. If you grade yourself on whether the experiment worked, you will be miserable, because whether it worked is not something you can fully control.

Velocity flips this. Velocity is the rate at which you reduce uncertainty in the direction you chose to explore. A week where you tried five ideas and all of them failed is a high-velocity week, as long as you actually learned something from each one. A week where you wrote a thousand lines of clean code and produced zero experimental results is a low-velocity week, no matter how good the code looks.

Gowers — the mathematician — put it well. Research is an iterative process of exploration, not a linear path from idea to result. Classes trained us to put in X units of effort and get X points back. Research does not work like that. The sooner you stop pretending it does, the sooner you get unstuck.

Vectoring sets the question, velocity answers it

Vectoring and velocity are different jobs. Vectoring, a term I borrow from Michael Bernstein’s CS197 at Stanford, means picking the most uncertain assumption in your project: the one that, if wrong, kills everything downstream. It’s an abstract question: is assumption X actually valid? Is the metric the reason LLMs look like they have emergent jumps?

Velocity is what you do after vectoring. You have the question; now how do you answer it this week, with the least possible work? This separation matters because it’s easy to conflate them and end up prototyping the wrong thing fast, or the right thing slowly. Pick the question first. Then sprint.

The emergence-in-LLMs paper is my favorite example. The vector was: maybe those sudden jumps in capability aren’t intrinsic to scaling — maybe they’re an artifact of the scoring metric. Fine. Now how do you test that as quickly as humanly possible? One line of code changes the metric from accuracy to token edit distance. Modular arithmetic instead of Persian QA, because you can generate modular arithmetic examples automatically in Python — no hiring Persian speakers, no new dataset. GPT-3.5 through the OpenAI API instead of training anything yourself, because the cluster and the GPUs and the out-of-memory errors are all someone else’s problem. A plot in a week. That is velocity.
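
To make "generate modular arithmetic examples automatically in Python" concrete, here is a minimal sketch. It is not the paper's code: the function name `make_modular_examples`, the modulus 97, and the prompt wording are all assumptions chosen for illustration.

```python
import random


def make_modular_examples(n_examples, modulus=97, seed=0):
    """Return a list of (prompt, answer) pairs for modular addition.

    The modulus and the prompt template are illustrative assumptions.
    """
    rng = random.Random(seed)  # local RNG so the dataset is reproducible
    examples = []
    for _ in range(n_examples):
        a = rng.randrange(modulus)
        b = rng.randrange(modulus)
        prompt = f"What is ({a} + {b}) mod {modulus}?"
        answer = str((a + b) % modulus)
        examples.append((prompt, answer))
    return examples


examples = make_modular_examples(5)
for prompt, answer in examples:
    print(prompt, "->", answer)
```

The point is the cost profile: the dataset is free, arbitrarily large, and has unambiguous answers, so the only thing left to vary is the metric.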

Core versus periphery

The single most useful frame I teach is core versus periphery. The core is what you genuinely need to answer this week’s question. The periphery is everything else — the user interface, the full evaluation suite, the beautiful abstractions, the secondary features — and your job is to fake it, mock it, or skip it.

Your approach should be necessarily incomplete. Run evaluations on a subset. Use a smaller dataset. If you need a theoretical result, code up a simulation first: Python is faster than math, and the simulation tells you whether the theorem is worth proving. If the question is “does this approach even work on a toy case,” build the toy case with pen and paper.

This runs against every instinct you picked up in school. In undergrad, you finish the problem set, you submit all the parts, you get points. In research you get rewarded for answering one question well, and punished for engineering a system that never produces an experiment. Don’t engineer. Answer questions. The gold is a plot you reacted to: something that interacted with reality and told you whether your hypothesis was true. Everything else is decoration.

The v = d/t trick

Velocity is distance over time. If your velocity is low, you have two options: cover more distance, or spend less time. Covering more distance means becoming a better engineer, learning more skills, working longer hours. That helps linearly at best, and you’re usually already close to your skill ceiling in the short run.

Spending less time is where the real gains are. One over t changes very quickly. When t gets small, 1/t blows up. When t goes to zero — when you simply don’t do the thing — velocity goes to infinity. That is why experienced researchers don’t answer every email. That is why advisors sometimes go dark. Not doing something is the fastest way to make progress on everything else.

So: lower your fidelity. Strip the periphery. Spit out the ugly draft. The first pass is not supposed to be perfect — it’s supposed to exist, so you can iterate. Perfection and permanence are the enemies of the first pass.

Habits that keep velocity up

A few practices I actually run on myself:

Walks without headphones. I bike to lab and code in my head on the way. On a stuck problem, I walk and don’t listen to anything. The brain solves problems in the background if you stop shoving input into it.

Present your work often. Lab meetings, office hours, a friend in a different subfield. Other people think differently than you and will ask questions you literally cannot generate from the inside.

Many pots, not the perfect pot. There’s an old parable about a ceramics class in which one group was graded on a single perfect pot and the other on quantity. The quantity group made better pots because they iterated, got feedback, and calibrated. Research is the same. Aim for many failed experiments this week, not one perfect one next month.

Don’t pivot out of the swamp. If you switch projects because you’re stuck, you’ll be stuck again in the new one, because the stuckness came from doing research, not from this particular project. The only way out is to prototype faster inside the swamp.

Write the experiment in one line when you can. If you can reuse code from an earlier experiment by swapping a metric or a dataset, do that. If you can use an API instead of standing up infrastructure, do that. Friction is the enemy.

Takeaways

Redefine success as learning, not as positive results. The swamp stops being a signal you’re failing and starts being a signal you’re in research.
Separate vectoring from velocity. Pick the riskiest question first; then figure out the cheapest way to answer it this week.
Be ruthless about core versus periphery. Fake, mock, or skip everything that isn’t strictly required to answer the question. Incompleteness is the point.
Lean on 1/t. The fastest way to increase velocity is almost never doing more — it’s doing less, faster, and ignoring the rest.

If you’re stuck right now, walk home without your headphones and ask yourself one thing: what is the actual question this week, and what is the smallest possible experiment that would answer it? Then go do that experiment tomorrow. The rest can wait.

References

Michael S. Bernstein. CS197: Computer Science Research — Stanford University. The vectoring / velocity framing for navigating research projects comes from this course. cs197.stanford.edu · Bernstein’s homepage.
Rylan Schaeffer, Brando Miranda, Sanmi Koyejo. Are Emergent Abilities of Large Language Models a Mirage? Neural Information Processing Systems (NeurIPS), 2023 — Outstanding Main Track Paper Award & Oral. arXiv:2304.15004 · OpenReview. The one-line metric change (accuracy → token edit distance) that this post uses as the canonical high-velocity example.
W. T. Gowers. The Two Cultures of Mathematics and related writing on research as iterative exploration. gowers.wordpress.com.
David Bayles & Ted Orland. Art & Fear: Observations on the Perils (and Rewards) of Artmaking (1993). The quantity-over-perfection ceramics-class parable.

Vectoring: Attack the Biggest Uncertainty First

2026-04-16T00:00:00-07:00

Brando Miranda — April 2026 · ~6 min read

Warning: this post is a draft — content may change and errors may remain.

TL;DR. The hardest part of research is choosing what to attack next when everything is uncertain. Vectoring (Michael Bernstein’s term) is the answer: pick the single dimension where, if you are wrong, the whole project falls apart — usually your main hypothesis, the central assumption behind the objective you actually care about, which you don’t yet know is true — and reduce that uncertainty first with the cheapest experiment that actually answers it. Split the work into core (the central assumption you test now) and periphery (scaffold it with the laziest thing that works). Worked examples: the Mirage emergent-abilities paper and a study of internet trolling. Vector first, then velocity.

Most research advice I got early on was useless because it assumed I already knew what to work on. The hardest part of a research project is not executing a plan — it is choosing what to attack next when everything is uncertain and nothing is proven. The single most important technique I teach in CS197 for this is what Michael Bernstein at Stanford calls vectoring: identify the direction of biggest uncertainty in your project, and reduce it first.

This is one of two skills that separate researchers who ship from researchers who flounder. The other is velocity — the speed at which you reduce risk once you have picked a direction. They operate in a tight loop. Pick the wrong vector and velocity does not save you; move too slowly on the right vector and the project dies anyway. Today I want to focus on the picking.

Research is not a project spec

The cartoon version of research — have an idea, build it, publish — is a lie. Papers read that way because papers are written to communicate a finished result. The process that produced them looked nothing like the abstract. It looked like things on fire, an ostrich running loose, and somebody yelling about what they just learned.

The trap is treating a research goal like a homework set or a project spec. You take the concept, break it into parts — build the model, collect the data, write the proof, design the interface — and execute all of them in parallel. You spend weeks perfecting something before you know whether the core idea works at all. When the fatal flaw shows up — and it always shows up — you have built a pile of scaffolding around an assumption that was never tested. That is what I mean when I say do not build your life on a lie.

Research is iterative exploration, not a linear path from idea to result. Your currency is experiments, not effort. A startup’s cash is the paying customer. A researcher’s cash is reality coming back and telling you whether your hypothesis survives.

The vectoring heuristic, stated cleanly

Here is the rule. Pick the single dimension where, if you are wrong, the whole project falls apart — and reduce the uncertainty there first, as fast as you can, using the cheapest experiment that actually answers the question.

That sentence does a lot of work. Three parts:

One dimension. Not five. The more dimensions you try to optimize at once, the harder gradient descent becomes — on neural nets and on careers. Pick one.
Biggest risk, most essential assumption. This is usually your main hypothesis — the central claim your whole project rests on, the thing tied to the objective you are actually passionate about — and the catch is you don’t yet know if it’s true. That unknown is exactly what makes it the riskiest direction. If the vector you chose turned out false, would the project die? If not, you picked wrong.
Cheapest experiment that answers it. The point is not rigor for its own sake. The point is to learn. If you can answer the question with a manual check of thirty data points in thirty minutes, do that before you train a classifier.

The output of a vectoring decision is a clear split between core and periphery. The core is what you attack. The periphery — infrastructure, polish, anything not testing the central assumption — gets scaffolded with the laziest thing that works. Reuse code. Draw the interface on paper. Hand-label ten examples. The periphery exists to let the core experiment run.

A walkthrough: emergent abilities in LLMs

The cleanest example I teach is the Mirage paper — the one arguing that “emergent abilities” of large language models are not a fundamental property of scale but an artifact of how we measure.

The prior belief everyone held was that as you scaled model size, certain capabilities appeared as sharp, unpredictable jumps. The authors had a hunch this was wrong. Their hypothesis: the jumps are caused by other factors — the scoring metric, the size of the test set, or the sparse sampling of model scales on the x-axis.

Three candidate vectors. All important. All unknown. Which one do you pick?

Vector A: the scoring metric. Change the last layer of evaluation — from exact string match to something smoother like token edit distance or average accuracy.
Vector B: the test set size. If your benchmark has three examples and the model is small, the probability it gets anything right is near zero, which looks like a sudden jump when scale finally clears the threshold. Testing this means building bigger benchmarks — in something like Persian-language QA, hiring speakers, writing questions, validating them.
Vector C: sparse sampling of model scales. Add more points on the x-axis, especially at the large end. Training one foundation model costs at least a million dollars.

Vector A wins. Not because it is the most important in isolation — all three matter — but because it is the one you can reduce now, cheaply, without new data and without a compute budget the size of a small country. You change one layer in the evaluation pipeline. That is it. And if the emergent jumps vanish when the metric changes, you have already learned the headline result of the paper.
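
A minimal Python sketch of what that metric swap looks like, under stated assumptions: each digit is treated as one token (real tokenizers differ), the function names are hypothetical, and the predictions are made up to show the effect rather than taken from any model.

```python
def exact_match(pred, target):
    """All-or-nothing score: 1.0 only if every token is right."""
    return 1.0 if pred == target else 0.0


def token_edit_distance(pred, target):
    """Levenshtein distance between two token lists (lower is better)."""
    prev = list(range(len(target) + 1))  # distances from the empty prefix
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            substitution = prev[j - 1] + (0 if p == t else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


target = list("24680")
# Hypothetical outputs from progressively better models.
preds = [list("13579"), list("24579"), list("24689"), list("24680")]

exact_scores = [exact_match(p, target) for p in preds]
edit_scores = [token_edit_distance(p, target) for p in preds]
print(exact_scores)  # [0.0, 0.0, 0.0, 1.0]
print(edit_scores)   # [5, 3, 1, 0]
```

Under exact match the first three outputs are indistinguishable and the fourth looks like a sudden jump. Under edit distance the same four outputs show steady improvement.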

This is the coupling between vectoring and velocity. Sometimes the right vector is not literally the biggest risk — it is the biggest risk you can actually attack this week. Creativity in choosing a vector often means finding the version of the question that a one-week experiment can answer.

Another walkthrough: trolling on the internet

A non-ML example, because vectoring generalizes. The common assumption is that online trolling is driven by a small number of antisocial users. The hypothesis a team explored: normal people also troll, when triggered. The dataset was roughly 16 million CNN comments with moderator flags.

Candidate vectors:

Check a subset of the data manually — read thirty flagged comments and see what kind of person left them.
Train a classifier to predict who will troll, then compare weights on personal history versus post context.
Analyze whether the same person trolls more on angry topics than boring ones.

The right first vector is the manual check. Thirty minutes of reading gives you direct evidence for the central question: do normal-looking accounts troll when the thread is heated? Training a classifier is expensive and one layer removed from the hypothesis. The topic-comparison analysis is indirect. Reading the data is direct, cheap, and targets the main claim. This is also what good ML researchers do constantly — they read their data and model outputs, especially when things break.

Assumption mapping and how to scope a vector

When I am stuck choosing, I draw the assumption map on paper. X-axis: known to unknown. Y-axis: unimportant to important. Every open question goes on the grid. The target quadrant is upper-right: important and unknown. Revisit this map weekly. I have friends in biology who pipetted for a month before looking up and asking why am I doing this. Writing is the cheapest form of articulation I know, and articulation forces the question to the surface.

A good vector is achievable in a one-to-two-week sprint. “Can normal people be responsible for trolling online” is not a vector — that is the whole project. “Can normal people be responsible for trolling on CNN.com in flagged comments under politically charged articles” is closer. If a week of focused work cannot give you a preliminary answer, you just restated the project. Scope harder.

One trick I use on myself: put the result on a calendar as if I have to present it to my advisor next week. Often I do. Sometimes I play advisor to myself. Either way, the deadline forces creativity about what can be cut.

Takeaways

Research is not a project spec. Do not execute in parallel across every dimension. You will perfect the wrong thing.
Vector, then velocity. Identify the single biggest uncertainty whose failure would kill the project. Reduce it with the cheapest experiment that actually answers the question.
Core versus periphery. Scaffold everything that is not the core experiment. Reuse code. Draw on paper. Hand-label ten examples.
Iterate honestly. When a vector resolves, new unknowns appear — that is the job. Reprioritize. Unexpected results are a gift, not a failure.

Iteration beats planning. Hemingway, Picasso, every researcher I admire — all wrong on the first try. Assume you will be too, and make your first try cheap.

References

Michael S. Bernstein. CS197: Computer Science Research — Stanford University. Vectoring is Bernstein’s term, taught in CS197; this post is my take on it. cs197.stanford.edu · Bernstein’s homepage.
Rylan Schaeffer, Brando Miranda, Sanmi Koyejo. Are Emergent Abilities of Large Language Models a Mirage? Neural Information Processing Systems (NeurIPS), 2023 — Outstanding Main Track Paper Award & Oral. arXiv:2304.15004 · OpenReview. The Mirage emergent-abilities walkthrough in this post.
Justin Cheng, Michael Bernstein, Cristian Danescu-Niculescu-Mizil, Jure Leskovec. Anyone Can Become a Troll: Causes of Trolling Behavior in Online Discussions. CSCW, 2017. arXiv:1702.01119. The ~16M CNN-comments trolling study used as the non-ML walkthrough.

Symbol	Meaning
$X$	The vocabulary (alphabet) $\{x_1, \dots, x_V\}$; $V = \lvert X \rvert$.
$x$, \(x^{<t>}\)	A sequence $x \in X^{T_x}$ and its token at position $t$.
$T_x$	Length of the modeled object $x$ (unconditional setting). In the conditional setting of the error-compounding post, the exponent variable is the output length $T_y$; the prompt length never enters the exponent.
$\varepsilon$	Per-step unrecoverable error probability. (Written $e$ in the earlier post; renamed here to avoid collision with the exponential base.)
$f_\theta(v; \cdot)$	The AR model’s logit for token $v$ given the context.
$E_\theta$	Energy function $X^{T_x} \to \mathbb{R}$; $-E_\theta(x)$ is an unnormalized confidence score for the whole sequence.
$Z_\theta$	Partition function. AR: $\sum_{v \in X} e^{f_\theta(v)}$ per step ($V$ terms). EBM: $\sum_{\tilde x \in X^{T_x}} e^{-E_\theta(\tilde x)}$ ($V^{T_x}$ terms).
AR	Autoregressive factorization \(p(x) = \prod_t p(x^{<t>} \mid x^{<1:t-1>})\).
EBM	Energy-based model: scores configurations with $E_\theta(x)$; probabilities only via $e^{-E_\theta}/Z_\theta$.
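
The $V^{T_x}$ term count in the $Z_\theta$ row can be checked by brute force on a toy case. The sketch below is illustrative only: `toy_energy` is a made-up energy (one unit per adjacent pair of differing tokens), not a learned $E_\theta$, and enumeration is feasible only because $V$ and $T_x$ are tiny.

```python
import itertools
import math


def toy_energy(seq):
    """Hypothetical energy: +1 for every adjacent pair of differing tokens."""
    return sum(1.0 for a, b in zip(seq, seq[1:]) if a != b)


def partition_function(vocab_size, length, energy):
    """Sum exp(-E(x)) over every sequence in X^T; return (Z, number of terms)."""
    z = 0.0
    terms = 0
    for seq in itertools.product(range(vocab_size), repeat=length):
        z += math.exp(-energy(seq))
        terms += 1
    return z, terms


V, T = 3, 4
z, terms = partition_function(V, T, toy_energy)
print(terms)  # 81, i.e. V ** T

# Probabilities are only defined through Z: p(x) = exp(-E(x)) / Z.
total_prob = sum(
    math.exp(-toy_energy(seq)) / z
    for seq in itertools.product(range(V), repeat=T)
)
print(round(total_prob, 6))  # 1.0
```

An autoregressive model normalizes over $V$ logits at each step instead. With $V = 50{,}000$ and $T_x = 100$, the enumeration above would need $50{,}000^{100}$ terms, which is why the EBM sum is treated as intractable.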