Tilting at Windmills III

This is the third post in Tilting at Windmills, a series where we take machine learning approaches, point them at the Bitcoin mining problem, and examine how they fail.¹ Part I shows that the expected gradient through double-SHA256 is exactly zero, so nothing that follows gradients can learn anything about which nonces hash low. Part II surveys the broader campaign and ends at the same wall viewed from the supervised side: given exact best-nonce labels from a brute-force JAX oracle, a transformer still cannot learn which of two candidate nonces hashes lower. What survives that post is the oracle itself, since rewritten as a CUDA kernel that evaluates reduced-round SHA-256 at billions of candidates per second on a single GPU, 355 times faster than the JAX path it replaces.²

The oracle changes the economics of asking questions. A hypothesis that previously cost a training run now costs seconds, and the labels are exact rather than estimated. It also lets us control difficulty: rather than working through the full double hash, we pose reduced-round problems, where the first hash stops $r$ rounds after the nonce enters the compression function and the leading state word stands in for the hash. Depth $r$ becomes a knob, from two rounds of mixing up to the full block. So instead of asking whether a model can learn the whole function, we can ask something narrower: is there structure anywhere inside the round function, even one layer in, that distinguishes good nonces from bad ones before the full hash is computed? This post covers the one place we find some, what it is worth, and what happens when we try to carry it to the surface.

One round, two additions

SHA-256’s compression function runs 64 rounds over eight 32-bit working variables $a$ through $h$ . Each round computes two temporaries and shifts the state along:³

\begin{aligned} T_1 &= h + \Sigma_1(e) + \text{Ch}(e,f,g) + K_r + W_r \\ T_2 &= \Sigma_0(a) + \text{Maj}(a,b,c) \\ a' &= T_1 + T_2 \\ e' &= d + T_1 \end{aligned}

The operations here come in three kinds. The $\Sigma$ functions are XORs of rotations: linear over $GF(2)$ , and transparent to any analysis that works bit by bit. $\text{Ch}$ and $\text{Maj}$ are bit-local: output bit $i$ depends only on input bits at position $i$ . Modular addition is the one operation that couples bit positions, and it couples them in a single direction, low to high, through the carry chain. If the round function has any shape worth exploiting, the additions are where it has to live.

The discovery arrives by ablation rather than design. The first probe is generic: a polytope classifier, stem-conditional halfspaces over the raw round-output bits, fit to brute-force-mined low-hash nonces. It returns null across three configurations. Exposing $T_1$ and $T_2$ of the output round as separate features, rather than letting the model see only their sum, lifts the same architecture to $+17.13\% \pm 1.54\%$ on held-out stems: the first statistically significant improvement over random selection the series has produced. The follow-up ablation is the deflating one. A ranker with no model at all, which simply scores each candidate by the sum of the high bytes of the two summands, beats the learned polytope outright once the candidate pool is large enough.⁴ The model’s entire contribution is feature exposure; the learning is replaceable by one line of arithmetic:

\text{score}(n) = (T_1 \gg 24) + (T_2 \gg 24)

where $T_1$ and $T_2$ are the two summands of the round that produces the target state word. A score of zero is a certificate: both summands are below $2^{24}$ , their sum is below $2^{25}$ , and no carry can reach the high byte. The output word is guaranteed to land in the bottom $1/128$ th of its range, and the guarantee costs one comparison on top of arithmetic we have already done.

Sixty-seven percent

The protocol: for each stem, draw $N = 65{,}536$ candidate nonces, score each one, keep the 64 with the smallest scores, hash those, and take the minimum. The baseline keeps 64 random candidates instead. Over 2,048 held-out stems, the ranker’s mean minimum hash beats the baseline’s by +67.81% at depth 2, and by +68.84%, +67.97%, and +67.08% at depths 3, 5, and 10. The improvement does not move with depth, which is the first hint that it has a clean explanation.

It does: for a random candidate, the probability that both high bytes are zero is $1/65{,}536$ . Conditioned on that event, $T_1$ and $T_2$ are independent and uniform on $[0, 2^{24})$ , so their sum follows a triangular distribution on $[0, 2^{25})$ with mean $2^{24}$ , against an unconditional mean near $2^{31}$ . The number of score-zero candidates in a pool of $N$ is approximately Poisson with mean $\mu = N/65{,}536$ , and the expected minimum of $\mu$ triangular samples follows from order statistics. Dividing by the random baseline (the expected minimum of 64 uniform draws on $[0, 2^{32})$ is $2^{32}/65$ ) gives

\text{improvement}(N) \approx 1 - 65\sqrt{\frac{\pi}{2N}} \approx 1 - \frac{81.4}{\sqrt{N}}

with no dependence on depth anywhere in the derivation. The formula matches measurement to within half a percentage point at every $N$ tested, from 4,096 to 1,048,576. It even tracks its own failure regime: at $N = 4{,}096$ the pool usually contains no score-zero candidate at all, the ranker fills its top 64 with biased medium-score picks, and it performs worse than random, measured at -27.83% against a predicted -27.29%.⁵

The derivation assumes very little. The only property of SHA-256 it uses is that the high bytes of $T_1$ and $T_2$ are independent and uniform, which is the bit-mixing behaviour the designers explicitly aim for. Everything after that assumption is the order statistics of a known distribution: the improvement is the expected minimum of $N/65{,}536$ triangular samples, with nothing learned anywhere in it. The finding is that this much of the output’s distribution is readable before the final addition completes, through a certificate one byte wide on each summand.

The cost accounting kills the result as an attack even as it confirms it as a finding. Scoring a candidate means resolving the carries in $T_1$ and $T_2$ , which means computing essentially the entire hash: the measured cost ratio is 1.08. So the fair comparison is against brute force over the same $N$ , and brute force wins everywhere: +98.41% at $N = 4{,}096$ where the ranker is negative, +99.90% at $N = 65{,}536$ against the ranker’s +67.81%. The ranker is dominated at every $N$ tested and at every $N$ the formula projects, and the mean is the flattering metric: mining pays only for tail hits below a difficulty threshold, where the ranker finds none.⁶ What we have is a structural fact about SHA-256’s round arithmetic with a one-line closed form, and not a way to make money. The interesting question becomes whether the structure can be reached more cheaply than it was found.

Trying to cash it in

The cost problem is per-candidate arithmetic. If a generator could map the stem directly to high-scoring nonces, all the SHA-aware work would move into training, where it amortises, and inference would be a forward pass. One accounting rule shapes everything that follows: at training time, any SHA-aware feature is fair game, however expensive, since its cost amortises over the deployed model’s lifetime; at inference time, only stem-level conditioning is free, and any feature computed per candidate must be priced in hash-equivalents and compared against brute force at equal compute immediately. Two branches try to thread this needle.

The first is a discrete normalising flow: a stack of coupling layers over $\{0,1\}^{32}$ , conditioned on the stem, intended to act as a bijection between nonce space and a rank-like coordinate $z$ . Train it so that low $z$ corresponds to low hash, then at inference enumerate $z = 0, 1, 2, \ldots$ and invert. The flow passes its sanity checks on synthetic XOR targets (Spearman $\rho$ of 0.88 and 0.61). On the real reduced-round target it learns to recognise mined low-hash nonces perfectly (AUC 1.0 against random) while learning nothing about ordering them: $\rho = 0.02$ against the true rank. The economics are no kinder: one flow inversion costs about as much as a hundred oracle evaluations, so each proposed candidate would need to be a hundred times better than random before the method breaks even.⁷

The diagnostic that ends the branch is the one we should have run first. Call it Oracle A: put the answer in the conditioning. Supply each candidate’s true rank as an input feature and ask only that the flow’s output coordinate correlate with it. The XOR-coupling flow fails it in all three configurations tried, and the alternatives fail with it: direct regression, stacked additive couplings, and a masked autoregressive flow all land within noise of zero. The only architecture that passes is a single additive shift, $z = x + f(c) \bmod 2^{32}$ : $\rho = 0.66$ under the per-bit loss, and $\rho = 0.9964$ once a ranking loss lets it learn a scaled shift over the full range. Stacking shifts only destabilises it, so capacity is not what limits the branch; what no architecture finds is a way to produce the rank from the stem alone at inference, which is the part the oracle was supplying for free.⁸ The second branch, teacher distillation (an expensive SHA-aware teacher supervising a stem-only student), is specified with Oracle A as its entry gate and never earns its way past it. The one experiment from that branch which does run to completion is the hand score itself, priced fairly: score many candidates, verify the top few, at equal total compute against brute force. It is dominated in every cell of the budget grid, with bootstrap confidence intervals excluding equality everywhere either method scores at all.

Nothing learned in either branch matches the hand score, and the hand score does not pay.

The second hash

What remains is to characterise the structure properly: establish how deep it goes, and whether it survives composition into the hash Bitcoin actually uses.

Depth comes first: the closed form has no depth term, so the prediction is that the concentration appears at every round inside a SHA-256 block. The sweep confirms it: at depths 3, 5, 10, 16, 32, 48, and 61, measured improvement sits within one percentage point of the predicted +68.18%, and depth 61 is the full first-SHA block.⁹ Every one of the 64 rounds ends in the same two additions, and the ranker reads the same structure off all of them.

But the compression output is not the mining hash. Two boundaries stand between them. First, the SHA-256 finalisation adds the chaining variable $s_{b0}$ (the digest state after the nonce-free first 64 bytes of the header) to the compression output, producing the first-SHA digest.¹⁰ Second, that 32-byte digest goes through an entire second SHA-256. Each boundary gets an experiment, on the same 2,048-stem fixture, with every pipeline validated bit-exact against hashlib before any data is collected.¹¹

Experiment B reads the carry-aware selection out at all three levels. At the compression output the finding reproduces: +68.64% [+67.09, +70.19]. At the first-SHA digest the improvement is -2877% [-3031, -2724]: the selected arm’s mean minimum digest is nearly thirty times the random arm’s. The mechanism is exact: the ranker’s chosen candidates cluster near zero before the chaining-variable addition, so after it they cluster near $s_{b0}[0]$ , a uniformly random per-stem offset, while the random arm stays uniform and keeps its natural floor. At the SHA-256d output the measurement is +2.91% [-2.85, +8.33], a confidence interval straddling zero, with a rank correlation between score and final hash of -0.00003.

The level-B catastrophe, though, is the scorer’s fault rather than the boundary’s. $s_{b0}$ is known per stem, and an offset-aware scorer could target the compression output at $-s_{b0}$ instead of zero. So rather than patch the scorer and iterate, we close every scorer at once. Experiment B-prime replaces the carry-aware score with a perfect oracle on the entire first hash: for each stem, select the 64 nonces with the lexicographically smallest first-SHA digests out of all 65,536. This is the offset-aware patch perfected: selecting by the digest itself is everything that targeting $-s_{b0}$ aspires to, and more. At the first digest this scores +99.90%, the theoretical maximum for any 64-of-65,536 selector. At the SHA-256d output it scores -1.80% [-8.11, +4.26]. The rank correlation between first-digest order and final output is +0.0006 with a confidence interval containing zero, and the tail hit-rates at every threshold tested are within sampling noise of the random arm. At ten times the stems the intervals tighten threefold, to $+0.59\%$ $[-1.24, +2.54]$ for the carry-aware score and $+0.13\%$ $[-1.80, +2.13]$ for the oracle, and the null holds. Cheating completely on the first hash, with a selector no real method could ever match, transfers nothing through the second.

That is the sharpest result the series has produced so far, and it closes the thread. The carry-aware structure is single-block-internal: present at every round inside one compression function, erased by the first composition on top of it. No selector that ranks candidates by first-SHA digest order, up to and including a perfect oracle, produces any signal at the mining hash. The only door this leaves open is a method that models the second SHA directly from the first digest, and any such method starts from scratch against brute force at full double-hash economics.

What’s next

Two questions fall out of the wreckage. The first is whether the carry structure has any company: shallower features, message-schedule signatures, selection at the stem level rather than the nonce level, anything at all with measurable transfer. Part IV pre-registers that campaign, with kill criteria written down before the experiments run (a habit Oracle A taught the hard way), and follows it to ground. The second question is the one I find harder to put down: why is the one structure that exists sitting in the final addition, exactly one carry layer before the output word, and why does composition destroy it so completely and so fast? Parts V and VI take that question seriously, because the shape of the failure is starting to look less like bad luck and more like a property of the function.

See more on the series at the cryptography tag page. ↩
The reduced-round problem family: the nonce enters the first SHA’s second block as message-schedule word $W_3$ , so the midstate after the three nonce-free rounds is fixed per stem. Depth $r$ means $r$ further compression rounds, with the leading state word taken as the hash; $r = 61$ covers all 64 rounds of the block. The CUDA kernel evaluates 2 to 12 billion candidates per second on an RTX 3090 depending on depth (a measured 355x over the JAX path at the experiment-relevant size), which is what makes the brute-force baselines cheap: the full baseline grid for this post’s held-out fixture, 8.6 billion evaluations, takes 36 seconds end to end. ↩
National Institute of Standards and Technology, “Secure hash standard (SHS),” FIPS PUB 180-4, Aug. 2015. The notation here follows the standard: $\Sigma_0$ and $\Sigma_1$ are each the XOR of three rotations, $\text{Ch}(e,f,g)$ selects $f$ or $g$ bitwise by $e$ , and $\text{Maj}(a,b,c)$ takes the bitwise majority. $K_r$ is a round constant and $W_r$ the message-schedule word. ↩
The ablation chain in numbers, all on held-out stems: polytope over raw round-output bits, three configurations, -0.25% to +0.86%, all null; an MLP scorer over the same features, $+2.57\% \pm 3.42\%$ , null; the same polytope with $T_1$ and $T_2$ exposed as separate uint32 features, $+17.13\% \pm 1.54\%$ . At $N = 16{,}384$ candidates the no-learning ranker scores +36.16% against the learned polytope’s $+24.02\% \pm 1.72\%$ . The polytope retains one distinction: at $N = 4{,}096$ , where the ranker is at -27.83%, the polytope still manages +17%, interpolating better when score-zero candidates are absent. Interpolation is no rescue: brute force at $N = 4{,}096$ scores +98.41%, so the regime where the polytope beats the ranker is still a regime where neither matters. ↩
The small- $N$ regime is where $\mu = N/65{,}536 < 1$ . With no score-zero candidate in the pool, the top 64 fill with score-1-and-above candidates, whose hashes are bounded away from zero by the same arithmetic that makes score-zero candidates good: a high-byte sum of $s$ concentrates the output near $s \cdot 2^{24}$ , which for the medium scores that dominate a small pool’s top-K is worse than the natural minimum of 64 uniform draws. Neither form of the theory strictly covers this regime: the Poisson mixture needs $\mu \gtrsim 1$ , and the asymptotic formula in the body assumes $\mu \gg 1$ . The quoted $-27.29\%$ comes from the asymptotic formula running outside its own assumptions, and its landing within half a point of the measurement is agreement the theory does not claim credit for. ↩
Mean minimum and tail hit-rate also come apart, which matters because mining pays only for tail hits. Score zero certifies the output word lands in $[0, 2^{25})$ , but a mining-style threshold at $2^{16}$ occupies the bottom $1/512$ of that interval, and within the certified region candidates are effectively uniform, so threshold hits among the selected 64 occur at chance. A spot check on one stem’s 65,536 candidates: the single hit below $2^{16}$ has score 255 and sits at the median of the score ordering. The equal-compute re-ranker’s hit-rate at that threshold is 0.000 at every budget from 64 to 65,536. The score also misses the wraparound path, $T_1 + T_2 \approx 2^{32}$ , which produces a near-zero word with a large high-byte sum, and the miss is not small: a wraparound-aware variant selects nearly as well as brute force over the same pool (+98.39% at $N = 4{,}096$ , where the one-sided score is negative), because the two pre-carry high bytes plus the wrap state determine the output’s high byte exactly. Economically nothing changes: scoring still costs a full hash per candidate, and brute force at equal compute still dominates. ↩
Measured on a dedicated RTX 3090 at like-for-like batch sizes: one flow inversion costs 105.5 oracle evaluations at batch 1,024, and the best configuration found (4 layers, hidden 128, batch 65,536) still costs 88. At a compute budget of $B$ oracle-equivalents per stem, the flow’s effective candidate count is roughly $B/90$ before any quality argument begins. ↩
Oracle A asks less than it appears to: a conditional bijection cannot literally output the supplied rank while ignoring its input, since distinct inputs under the same conditioning must map to distinct outputs. The test requires only correlation between the output coordinate and the supplied rank, which a bijection is free to achieve, and the failures held across 5,000 to 15,000 training steps and varied loss weightings. The single-shift pass also needed care: integer regression on $z$ has gradient proportional to $2^{-i}$ in bit $i$ , so the low bits never train; the pass came from per-bit binary cross-entropy on the shift logits. Measured rank correlations against a pass gate of 0.3: XOR coupling $0.18 \pm 0.25$ , $0.01 \pm 0.06$ , and $-0.01 \pm 0.22$ across its three configurations; direct regression $-0.02$ ; stacked additive couplings $0.02$ ; the masked autoregressive flow $-0.03$ . Stacked shifts under the ranking loss score anywhere from 0.13 to 0.85 depending on depth and seed, against the single shift’s 0.97 to 1.00. ↩
Depth 2 on this fixture measures +62.51%, the one depth outside the one-point band, and the reason is the statistics rather than the structure: at $N = 65{,}536$ , $\mu = 1$ , so most of the between-stem variance is in whether a stem has any score-zero candidate at all ( $P = 1 - e^{-1} \approx 0.63$ ), and the theory’s interval at this operating point spans +49% to +68%. The same depth on a different 2,048-stem fixture measures +67.81%; both runs sit inside the interval, and a spot-check run reproducing the earlier run’s exact metric convention on this fixture confirms the gap is fixture variance, not implementation. ↩
$s_{b0}$ is itself a SHA-256 chaining state: the digest state after compressing the header’s first 64 bytes, which contain no nonce, so it is a per-stem constant. The finalisation adds it to the compression output word-wise mod $2^{32}$ , and word-wise modular addition of a per-stem constant does not preserve order across stems: each stem’s selected cluster lands near its own $s_{b0}[0]$ , which averages $2^{31}$ over stems. ↩
All three experiments in this section share one fixture: 2,048 held-out stems, 65,536 consecutive nonces per stem, 64 selections per arm, with the random arm’s nonces fixed by seed. The CuPy schedule and round pipeline is validated bit-exact against the CUDA oracle (12/12 trials) and the full double-hash pipeline against Python’s hashlib.sha256(hashlib.sha256(header)) (16/16 trials) before any data collection. Wall times: 28 to 40 seconds per depth for the sweep, 129 seconds for Experiment B, 192 seconds for B-prime. ↩

One round, two additions

Sixty-seven percent

Trying to cash it in

The second hash

What’s next

Footnotes