Tilting at Windmills II

Part I of this series applies REINFORCE to nonce prediction and finds that the expected gradient is exactly zero: SHA-256’s avalanche effect makes the score of a sampled nonce independent of each of its bits, so every parameter update is averaged noise. That post closes by proposing a Decision Transformer, on the theory that improving a nonce step by step might be easier than generating a good one outright. This post covers that attempt and six more, spanning 2021 to 2025. Each gets a section rather than its own post, because where the series eventually lands reframes all of them, and later parts get there. The sections are grouped by idea rather than by strict date.

First, though, comes the prehistory, because the windmills predate the series.

Before the windmills

Before asking which nonce produces a low hash, there is an even more naive question: can a network simply invert the hash? The first attempt I still have is a Python 2 script from December 2017. It wires a PPO agent (a stack of 91 dense layers of 512 SELU units) to a live Bitcoin node over RPC: getblocktemplate supplies real work, the agent observes the 640 header bits and the 256 bits of the current block hash, emits 32 booleans as a nonce, and a submitblock call stands ready for the day a hash lands under the real network target. Ten payout addresses are configured in advance, which captures the spirit of the enterprise. I had not yet thought carefully about what one-in-a-few-trillion odds mean for an agent whose only feedback is failure.

In 2018 I point the same appetite at elliptic curves: given the two coordinates of a secp256k1 public key, predict the 256 bits of the private key, using 64 blocks of dense layers with the input re-injected at every block, and a curriculum that samples keys of geometrically distributed bit length so that short keys (which a network could at least memorise) dominate early training.

By 2020 the target is the hash itself, reduced to make the question fair: hash a random 16-byte preimage through a single-round SHA-256 and train roughly 150 dense layers, input re-injected throughout, to emit the preimage bits under binary cross-entropy. A companion notebook reimplements SHA-256 in TensorFlow bitwise ops and logs all 64 rounds of internal state, and a smaller network is trained on the transition from one round’s state to the next. In mid-2021 there is a brief detour through Ethereum’s Ethash, where a DDPG agent outputs 64 continuous values, interpreted as bit probabilities for the 64-bit nonce, and is rewarded with the best score among ten thousand sampled nonces per step.

None of this works, which is by now expected. What I take from this period is a narrower question, the one Part I finally puts properly: set aside inversion, and ask whether anything at all can be learned about which inputs the forward direction favours. Everything below is an answer to that question, attempted and rejected.

Maybe it needs to be sequential

Part I ends by asking whether the avalanche effect leaves any local structure to exploit even though it leaves no global structure: rather than generating a good nonce wholesale, learn to improve an existing one. The Decision Transformer¹ is the natural tool for this in late 2021. It reframes reinforcement learning as supervised sequence modelling: log trajectories of (return-to-go, state, action), train a causal transformer to predict actions, then prompt it at inference time with a high desired return and let it emit the actions that historically accompanied success.

The game walks a cursor across the nonce. The cursor starts at the most significant bit and advances one position per turn toward the least significant; at each turn the model makes one binary decision, flip the focused bit or leave it. The episode runs at most 32 moves and ends early if the hash drops below a target $\kappa$ set at “first byte zero”, roughly a one-in-256 event for a random nonce. The observation contains the 67 varying header bytes (current nonce included), the full 256-bit hash of the current candidate, and a one-hot cursor position, so the model sees the consequence of every previous flip before committing to the next one: exactly the step-by-step feedback Part I asks for. Training data comes from a uniform random policy, with return-to-go computed as the reverse cumulative reward, and the model learns to predict the logged action conditioned on state and return.²

The behaviour-cloning loss never leaves the neighbourhood of $\ln 2 \approx 0.693$ , which for a binary action is exactly chance, across training on hundreds of thousands of fresh random trajectories. Return conditioning is supposed to do the heavy lifting here: among trajectories that ended well, the actions that led there should be recoverable. But which flips led to good outcomes is itself uniform noise. Flipping the focused bit rescrambles the entire hash regardless of which bit is focused, what the previous flips were, or what order anything happened in. Autoregressive generation changes only how the policy is factorised; the sparse, uncorrelated reward survives the sequential framing untouched.

The objective is the problem

By late 2023 the suspicion has shifted to the loss function. REINFORCE weights log-probabilities by a raw score, and raw scores from a hash are mostly noise; perhaps a loss that asks a gentler question can extract a gentler signal. The model is now a small attention network in JAX that reads the 640 header bits, a mask marking the handful of bits it is free to choose (six, in these configurations), and a one-hot round count that keeps a reduced-round curriculum available, and outputs a probability for every free bit. Three objectives are tried in sequence.

The first is reward-weighted cross-entropy: cross-entropy toward the bits of each sampled nonce, scaled by that nonce’s mean-centred score, so the model is pulled toward what scored well and pushed from what scored badly. The second is pairwise preference with a frozen reference model, the same construction as direct preference optimization:³ for a winner and loser nonce on the same stem, maximise the logistic sigmoid $\sigma$ of how much more the policy prefers the winner than the reference does. The third drops the reference and weights a plain Bradley-Terry⁴ comparison, $-\log \sigma(\log p_w - \log p_l)$ , by the winning margin.

The thinking runs: if absolute scores are too noisy, compare pairs; if pairwise training drifts, anchor it to a reference. Each is a standard remedy for a real pathology of policy-gradient training. Neither pathology is the problem here. For a fixed stem, whether one random nonce beats another is a fair coin toss carrying no information about either nonce’s bits, and the Bradley-Terry likelihood of independent coin tosses is maximised by indifference. The ranking losses train exactly as designed and converge faithfully to the model that declines to express a preference, which samples identically to uniform. These are three ways of asking the data for a correlation, and reshaping the question cannot create a correlation the data does not contain. The evaluation harness samples nonces from model and from uniform across round counts 5 through 63 and compares score distributions; we eventually retire the whole family of losses to backstory.

What if it’s not ML at all

Perhaps the problem is gradients as such. A SAT solver needs no gradient, only structure, and SHA-256 is after all a finite circuit. In mid-2022 we encode it as Z3⁵ BitVec constraints: every word a 32-bit vector, rotations and XORs exact, and the nonce the lone unknown, a single z3.BitVec("nonce", 32) inserted as word 3 of the second message block while z3.simplify constant-folds everything else as the rounds unroll. Given the eight working registers after 10 compression rounds, the solver inverts the circuit and returns the planted nonce in seconds. At 11 rounds, the nonce’s eighth, the solve does not return in hours: a thousandfold cliff in a single round, and the circuit complexity of even partial SHA-256 is out of a SAT solver’s reach almost as soon as the nonce starts mixing.

The exercise pays for itself in structural insight. The diffusion pieces of SHA-256 are individually trivial: $\Sigma_0$ turns out to be a bijection on 32-bit words, and Z3 hands us its exact inverse as an XOR of 32 constants.⁶ The hardness lives entirely in how the linear pieces compose with Ch, Maj, and the carry chains of modular addition.

The encoding also exposes a decomposition the later parts of this series lean on. The nonce occupies bytes 76 to 80 of the header, which is word 3 of the second message block, so the first three rounds of the nonce-bearing compression are nonce-free and precomputable; round 3 is the first place the nonce touches the state. We isolate that boundary: a step1 function emits a 352-bit snapshot (the eight registers entering the round, the two words it updates, and one schedule word) and a small dense network, a dozen 32-unit ReLU layers, is trained to map the snapshot back to the 32 nonce bits. This experiment is rigged in our favour: the snapshot determines the nonce exactly, by a short chain of modular subtractions any pocket calculator could perform.⁷ Over 100,000 batches of 8,192 examples, more than 800 million in all, the cross-entropy sits at $\ln 2$ plus $1.4 \times 10^{-6}$ . The information is present, the inversion has a closed form one round deep, and the architecture extracts none of it.

Treat it as a game

Concurrently with the Decision Transformer, in late 2021, the other dominant paradigm of the era gets its turn: AlphaZero-style⁸ search, in Julia. The game starts from a random nonce and permits 32 moves, each XOR-flipping one nonce bit (so any nonce is reachable in any order); reaching a hash below $\kappa$ wins (+1), running out of moves loses (-1), and $\kappa$ starts loose, with a ratchet designed to tighten it as play improves (the surviving runs leave the ratchet disabled). A policy-and-value network reads a 672-bit observation (608 header bits, 32 nonce bits, a one-hot move counter) through four dense layers, emitting 32 action logits and a tanh value estimate, and guides a PUCT-driven Monte Carlo tree search with Dirichlet noise at the root, self-play workers feeding a state pool across GPUs.

On paper this is a two-player game, one player improving the hash and one worsening it. In the implementation the second player needs no network, because the avalanche effect plays that side for free: every flip the first player makes is answered by a complete rescrambling of the position. This is where the analogy to board games dies. AlphaZero’s leverage comes from the value network learning positional evaluation, which works because advantage in chess or Go persists: a good position remains good after a move. Here no partial nonce is closer to a solution than any other, so the probability of winning from any state depends only on the number of moves remaining, and the optimal value function is a constant.⁹ The Q-values in the tree are rollout noise around that constant, the PUCT selection rule reduces to prior times visit-count bookkeeping, and the search expands uniformly. Monte Carlo tree search guided by a noise oracle is exhaustive search with administrative overhead.

Back to RL, properly this time

In December 2024 REINFORCE returns with three years of better tooling: transformer backbones in Flax, 1,024 sampled nonces per stem (2,048 in the byte variant), mean-baseline advantages in all but the masked variant, weight decay, gradient clipping. Three action parameterisations are tried: 32 independent Bernoulli bit probabilities, as in Part I; four 256-way categorical distributions, one per nonce byte, in case bytes are the natural unit; and a variant that scores each digest bit separately and masks the reward to a chosen subset of output bits, a curriculum over what counts as success rather than over the inputs. All three train on freshly sampled stems rather than a fixed dataset, and the bit-level variant scales its output probabilities by $1 - 1/e$ so that no bit is ever sampled as a certain one and the policy keeps exploring rather than saturating. The score distributions of sampled nonces sit on the uniform baseline throughout, as before.

The result is available in advance from Part I’s argument. The zero-gradient reasoning never mentions the network: it is a statement about $E[\text{score} \cdot \nabla_\theta \log \pi]$ for any policy $\pi$ however parameterised, and an expectation that is zero through a feedforward stack is zero through an attention stack. Running the experiment anyway has a certain value as hygiene: it confirms that the wall is where the theory says it is and has not moved with the state of the art.

What does SHA-256 actually do to probability

After this many failures the productive question stops being “which method next” and becomes “what exactly does this function do to the probability mass our models push through it”. So we instrument the function itself. Every bit becomes a probability: XOR of independent soft bits is $a + b - 2ab$ , AND is $ab$ , and 32-bit addition becomes a soft ripple-carry adder propagating a soft carry.¹⁰ The whole pipeline is differentiable end to end under reverse-mode autodiff, so we can hand SHA-256 a nonce that is 70% one and watch what it does, and ask for gradients at any point.

A first cut in JAX dies on numerical range, lacking the precision the question demands, and Julia takes over. There, arbitrary-precision floats let the mantissa grow by powers of two, from $2^{16}$ bits per value up a ladder that tops out at sixteen megabytes for a single intermediate value.¹¹ Intermediate values other than 0, 0.5, and 1.0 appear only at extremes like $10^{-20}$ : by round 9, six rounds after the nonce enters the state, every soft bit has saturated to certainty or to a perfect coin flip, the partial information in between (rounds 3 through 8) never touches more than about a tenth of the state bits, and the leftover deviations are visible at all only because the precision is absurd. Any uncertainty in any input bit cascades to total uncertainty in every output bit, because XOR against an unknown operand turns any distribution into a coin flip and the carry chain spreads one uncertain bit across a whole word within a round. And the gradient of the expected hash value with respect to the soft nonce bits is numerically zero for any individual block; an attempted objective taking the minimum over a hundred blocks fares no better.

This is the empirical companion to Part I’s algebraic argument, and it closes the casefile on every gradient method above. The gradient is zero in expectation, and now we can also see why no estimator gets lucky: the probability mass itself saturates at maximum entropy almost immediately, even within the first of the two SHA-256 applications (the probe’s second application sits commented out, never needed). Nothing differentiable survives to round 64, because nothing differentiable survives round 9.

Cheat a little

Part I dismisses supervised learning as circular: labels are winning nonces, and producing winning nonces is the problem being solved. That argument is correct for the network target and wrong for a weaker label. Every stem has a best nonce among its $2^{32}$ candidates, and finding it requires no luck at all, only the willingness to hash all four billion. On a GPU this is a jax.lax.scan over batches that tile the nonce space, a jax.vmap across each batch, and a lexicographic-minimum reduction that carries the best digest and its nonce through the scan.¹² It is the same loop mining hardware runs; the only difference is keeping the argmin instead of comparing against a target. This oracle arrives in January 2025 and changes the structure of the project: supervised learning has labels now, at a price of $2^{32}$ hashes each.

The first thing to do with an oracle is the experiment that was impossible in Part I. The key space shrinks to $2^d$ nonces so that every training example can be labelled exhaustively on the fly; random stems go in, and a transformer (up to 512 attention blocks, 68 million parameters in the deepest configuration) trains to predict the $d$ bits of the best nonce under cross-entropy. The training logs are unambiguous: the loss settles at 0.6931 and stays there for the remainder of every run, across model widths from 32 to 512, depths from 32 to 512 blocks, and key spaces of $2^8$ and $2^1$ . The $d = 1$ runs are the cleanest statement of the result this series has produced so far. Given a stem and two candidate nonces, zero and one, predict which hashes lower; the answer is supplied exactly, for every stem, by exhaustive evaluation; and after roughly twelve million labelled stems the network still predicts a coin flip.

This is the structural turning point the series has been heading toward. Every section above asks some variant of “can reinforcement learning find signal in a sparse, uncorrelated reward”, and the answer is consistently no, for reasons that are ultimately about the reward and never about the method. With the oracle the question becomes “given perfect labels, is there anything stem-conditional about where the best nonce lives at all”, and for the direct representation the $d = 1$ run answers that too. What remains is a structure search: any proposed structure can now be tested against exact ground truth instead of argued about.

What’s next

With exact labels per stem the hunt becomes systematic: propose a feature of the round function, score it against the oracle, discard it, repeat. Nearly every proposed feature dies this way, which is the point of having the oracle. One candidate eventually survives: a hand-built score on the first carry layer of the round function, computed from $T_1$ and $T_2$ , the two summands of the round’s final addition that the stage-wise decomposition above already isolates. It improves on random nonce sampling by a margin that turns out to have a closed form, and it comes with a catch. Part three covers what it is, why it works, and what the catch costs.

L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 15084–15097. We start from the reference implementation (a GPT-2 backbone with the positional embeddings replaced by timestep embeddings, and separate linear heads embedding return, state, and action into a shared sequence), but the model we actually train distils the framing to its core: a deep residual MLP predicting the logged action from the concatenated observation and return-to-go, which preserves the return-conditioning mechanism while removing the attention machinery as a variable. ↩
The reward is shaped rather than binary: for a final hash $x$ against target $t$ it is $\min(2 - \log_2(4x/t),\ 256)$ when $x \le t$ and a matching logarithmic penalty when $x > t$ , clipped to $[-256, 256]$ , so trajectories that end close to the target are distinguishable from trajectories that end far away. The final evaluation loss on held-out trajectories is 0.697 against the chance value $\ln 2 \approx 0.6931$ . Cloning a uniform random behaviour policy yields $\ln 2$ trivially when actions are unpredictable; the entire content of the Decision Transformer bet is that conditioning on return makes the good actions predictable, so the gap between 0.697 and 0.693 is the measured value of that bet. ↩
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Advances in Neural Information Processing Systems, vol. 36, 2023. The loss in these notebooks is the same construction with hashes in place of human preferences: $-\log \sigma\big((\log p_w - \log p_w^{\text{ref}}) - (\log p_l - \log p_l^{\text{ref}})\big)$ , where winners and losers are nonces ranked by score on the same stem and the reference is a frozen copy of the policy. I adopt the form within months of the paper appearing. ↩
R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. The method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952. ↩
L. de Moura and N. Bjørner, “Z3: An efficient SMT solver,” in Tools and Algorithms for the Construction and Analysis of Systems, 2008, pp. 337–340. The encoding calls z3.simplify after every symbolic round, so concrete subexpressions collapse as they form and the formula handed to the solver contains only the genuinely nonce-dependent structure; the message schedule beyond word 16 is also symbolic, since the nonce feeds the schedule recurrence. The notebook’s first version recorded a 12-round solve (planted nonce 3270573274), which later review traced to a transcription bug, $\Sigma_1$ reading register $d$ instead of $e$ ; on the corrected circuit the measured profile is 8 rounds in 0.08 seconds, 10 rounds in 6.3 seconds, and a timeout past two hours at 11. ↩
Asking Z3 for the preimages of each single-bit output of $\Sigma_0(x) = (x \ggg 2) \oplus (x \ggg 13) \oplus (x \ggg 22)$ yields 32 singleton solutions; because $\Sigma_0$ is linear over GF(2), the inverse of an arbitrary $y$ is the XOR of the singleton preimages of its set bits, making the map a bijection (a bulk uniqueness count over a sixty-fourth of the input space corroborates this). The same construction inverts $\Sigma_1$ and the schedule’s $\sigma$ functions; only Ch, Maj, and addition destroy information. ↩
From the snapshot’s registers $(a, \ldots, h)$ entering round 3 and the round’s output word $T_1 + T_2$ : compute $T_2 = \Sigma_0(a) + \text{Maj}(a, b, c)$ directly, subtract to recover $T_1$ , then $w_3 = T_1 - h - \Sigma_1(e) - \text{Ch}(e, f, g) - K_3 \pmod{2^{32}}$ , and $w_3$ is the nonce. Four modular subtractions and three bitwise evaluations of values present in the input. That a dense ReLU network fails to represent this after 800 million examples says nothing about SHA-256 and a great deal about modular arithmetic as an inductive-bias problem, a theme that returns when the series reaches carry structure. ↩
D. Silver, J. Schrittwieser, K. Simonyan, et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, pp. 354–359, 2017. ↩
A uniform random hash falls below $\kappa$ with probability $p = \kappa / 2^{256}$ , and since every flip produces an effectively fresh uniform hash, the win probability from any state with $m$ moves remaining is $1 - (1-p)^m$ regardless of the header or nonce bits. A value function that depends only on the move counter contains no information the search can use to order actions, which is what “no positional evaluation exists” means operationally. ↩
The XOR identity is implemented as $a + b - 2ab$ rather than the algebraically equal $a(1-b) + b(1-a)$ , because forming $(1-x)$ explicitly loses precision exactly where it matters, when $x$ saturates toward 1. The adder computes the half-adder sum and carry vectorised, then propagates a soft carry-in bit serially from the least significant position, multiplying probabilities along the chain. ↩
ArbFloat types from Julia’s ArbNumerics, with the working precision left as a compile-time constant. The source’s comment ladder steps through $2^{16}$ , $2^{17}$ , $2^{18}$ , $2^{19}$ , $2^{20}$ , and $2^{21}$ bits of precision, annotated from roughly half a megabyte to 16 MiB per value. Gradients flow through the same representation via a compiled ReverseDiff tape, so a single gradient evaluation at the top of the ladder is a memory event as much as a computation. ↩
The lexicographic minimum is itself a small scan: walk the 32 digest bytes, keep the candidate set achieving the minimum at each position, and stop when one candidate remains. Batches of $2^{10}$ to $2^{12}$ nonces are hashed in parallel under vmap, and the outer scan carries (best digest, best nonce) across batches, so the full $2^{32}$ sweep compiles to a single XLA program. In the supervised experiment the swept space is restricted to $2^d$ so that every gradient step can afford exhaustive labels for each of its stems. ↩