Tilting at Windmills I
The First Thing to Try
This is the first post in a series called Tilting at Windmills, where we take machine learning approaches, point them at the Bitcoin mining problem, and examine how they fail.¹ Bitcoin mining is useful for this because it is genuinely unsolvable by any method other than exhaustive search, which means failure is guaranteed and the interesting question is always why.
Bitcoin mining is a guessing game. You have a block of transaction data and you need to find a number (the nonce) such that when you hash the block together with that nonce, the result is smaller than a target value. The only known way to find such a nonce is to try numbers until one works. Neural networks are famous for finding non-obvious patterns in data: they have learned to recognise faces, predict protein structures, and generate text. The natural first question is whether a network could learn which nonces are more likely to produce a lucky hash. That is what this post investigates.
The mining problem
A Bitcoin block header is 80 bytes. The last four bytes are the nonce. The preceding 76 bytes (the “stem”) describe the block: which transactions it contains, which block preceded it, when it was created.²
Mining means hashing those 80 bytes through SHA-256 twice and checking whether the result falls below the current difficulty target. The target is set so that roughly one in every few trillion nonces succeeds. There are only 2³² possible nonces (about four billion), so a mining chip exhausts the entire nonce space in milliseconds and then has to vary the timestamp or swap transactions to get a fresh stem to search.
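As a concrete sketch (Python's hashlib, an all-zero placeholder stem, and an artificially easy target; function names are ours):

```python
import hashlib

def check_nonce(stem: bytes, nonce: int, target: int) -> bool:
    """Double-SHA256 the 80-byte header; pass if the digest, read as a
    little-endian integer (Bitcoin's convention), falls below the target."""
    header = stem + nonce.to_bytes(4, "little")   # 76-byte stem + 4-byte nonce
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little") < target

def mine(stem: bytes, target: int, limit: int = 2**32):
    """Exhaustive search: the only known way to find a valid nonce."""
    for nonce in range(limit):
        if check_nonce(stem, nonce, target):
            return nonce
    return None  # nonce space exhausted; a real miner varies the stem and retries
```

With a target of 2²⁵⁵ half of all nonces pass, so the loop stops almost immediately; at real difficulty the same loop would sweep all 2³² candidates and usually find nothing.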
The important property of this process: there is no such thing as “almost right.” A hash that is twice the target is no better than one that is a billion times the target. Each nonce either passes or it fails, and you can only find out by hashing it. There is no gradient pointing toward the good nonces, no sense in which one nonce is “warmer” than another before you hash it. But what if a network with enough capacity could find a pattern in the nonce-to-hash relationship that isn’t apparent to direct inspection?
Why it seemed worth trying
SHA-256 was designed to hash inputs of arbitrary length. Bitcoin’s double-SHA256 does not fully exercise that generality. The first round hashes the 80-byte block header. The second round hashes the 32-byte output of the first round, a short fixed-length input that skips the padding and length-encoding steps the full algorithm requires for longer inputs. The security properties of SHA-256 are established for the full algorithm; how much of that strength carries over when only part of it runs is less certain.
The input space is constrained in other ways. Block stems are not arbitrary byte sequences: version fields are constant within a protocol era, timestamps increment predictably, the bits field moves slowly with difficulty adjustments. The nonce always occupies the same four bytes at the end of the stem. This makes the problem narrower than the classical preimage problem³, which SHA-256 was specifically designed to resist. The problem here is adjacent: given a stem with known statistical regularities, assign higher probability to the four-byte nonces that produce a low-valued hash. Satoshi Nakamoto’s choice of double-SHA256 over single-round was publicly justified as protection against length-extension attacks and general cryptographic conservatism, which suggests an awareness that the simpler construction left questions open.⁴ SHA-256 has no formal mathematical proof of security; its strength is empirical, built on decades of failed cryptanalysis rather than a reduction to a provably hard problem.
The difficulty is intuitive rather than formal. And the track record of intuitive difficulty as a reliable guide to what neural networks can learn has not held up well: playing chess at superhuman level, predicting protein structure from sequence, and generating coherent language all seemed like they should be beyond reach, and each turned out to be a matter of scale and architecture rather than fundamental impossibility.
The approach
The most direct application of machine learning would be supervised learning: given a block stem as input, output the nonce that produces a valid hash. The problem is that generating training examples requires finding valid nonces, which requires mining blocks: exactly the problem we are trying to solve. There is no way around this circularity with a fully supervised approach.⁵
There is a more indirect route. Instead of predicting a specific nonce, a model can output a probability distribution over nonces: for each of the 32 nonce bits, it outputs a number between 0 and 1 representing its belief that the bit should be 1. You then sample many nonces from that distribution, hash each one, and score it. The scores tell you which samples were good. You update the model to make high-scoring nonces more probable and low-scoring ones less probable. This is the REINFORCE⁶ algorithm applied to nonce prediction.
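A minimal sketch of one such update, with the policy pared down to 32 raw probabilities and `scores_fn` standing in for the hash-and-score step (both simplifications ours, not the post's actual training loop):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(p, scores_fn, n_samples=256, lr=0.1):
    """One REINFORCE update on a 32-bit Bernoulli policy.

    p: 32 per-bit probabilities; scores_fn: maps an (n_samples, 32)
    bit matrix to one score per sampled nonce.
    """
    bits = (rng.random((n_samples, 32)) < p).astype(float)  # sample nonces
    scores = scores_fn(bits)
    advantage = scores - scores.mean()                      # baseline for variance
    grad_logp = (bits - p) / (p * (1 - p))                  # d/dp of log-likelihood
    grad = (advantage[:, None] * grad_logp).mean(axis=0)
    return np.clip(p + lr * grad, 1e-3, 1 - 1e-3)
```

On the mining problem the scores are uncorrelated with the sampled bits, so `grad` is pure noise; substitute any score that genuinely depends on the bits and the loop converges in a handful of steps.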
For scoring, we need something smoother than pass/fail. We construct a score over [0, 1] that deliberately concentrates its sensitivity on the most significant bits of the hash: gaining a leading zero in the first byte contributes far more than a gain in any later byte. This is a curriculum choice: if there is any learnable signal at all, it should appear most strongly in the easiest sub-problem first, which is getting the leading bits right. A uniformly random nonce scores around 0.1 to 0.2. A valid mined block scores close to 1.0.⁷
The model takes the 63 variable bytes of the block stem, unpacked bit by bit to a 504-dimensional input, and outputs 32 sigmoid values representing per-bit nonce probabilities.⁸ We model individual bits rather than predicting over full bytes or the 32-bit integer directly, because the score function weights each bit geometrically: the leading bits of the hash matter far more than the trailing ones, and the output space should reflect that structure.
The result
Loss is flat throughout all 200 training steps. The model’s score distribution is indistinguishable from sampling uniformly at random, and it stays that way regardless of batch size, accumulation steps, or network depth.
Why it failed this time around
SHA-256 belongs to a family of cryptographic hash functions designed around one core property: the avalanche effect. Flip any single bit of the input, and roughly half of the 256 output bits flip in response, not the same half every time but a different, unpredictable set depending on the specific input.⁹
To see what this implies, consider any two nonces that differ by a single bit. Hash each one. The two outputs share almost no structure. Their scores are uncorrelated: knowing that one nonce scored well tells you nothing about whether its neighbour will.
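This is cheap to verify. Hashing two nonces that differ in one bit, on a placeholder all-zero stem:

```python
import hashlib

def double_sha(header: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

stem = bytes(76)  # placeholder stem; any stem shows the same behaviour
a = double_sha(stem + (42).to_bytes(4, "little"))
b = double_sha(stem + (43).to_bytes(4, "little"))  # differs from 42 in one bit

# Hamming distance between the two 256-bit outputs: expect roughly 128.
diff = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

The count lands near 128 of 256 bits: maximal uncertainty about the second output given the first.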
The score landscape over the four-billion nonce space is pure noise: every point is independent of every adjacent point. Gradient-based learning works by following gradients uphill through structured terrain, and a landscape of pure noise has no terrain to climb.
This is what breaks the approach. REINFORCE reinforces the bits that appeared in high-scoring samples. But which bits those are is uniformly random and uncorrelated with the stem, because the high-scoring nonces are themselves randomly distributed. Sampling more nonces reduces the variance of this reinforcement noise, but the noise averages to zero however many samples you take. The expected gradient is zero for every model parameter, regardless of what probabilities the model assigns.¹⁰ There is no loss shaping, no reward weighting, and no increase in batch size that can produce a non-zero gradient in expectation when the underlying mapping has no learnable structure.
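The zero-expectation claim can be checked numerically with stand-in distributions: score-weight the log-likelihood gradient of one Bernoulli bit, once with scores independent of the bit and once with scores that depend on it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
p = 0.7                                   # model's probability that the bit is 1
bits = (rng.random(n) < p).astype(float)
grad_logp = (bits - p) / (p * (1 - p))    # score-function term for one Bernoulli bit

# Case 1: scores independent of the bit (the avalanche situation).
indep = (rng.random(n) * grad_logp).mean()

# Case 2: scores that actually depend on the bit.
dep = (bits * grad_logp).mean()
```

More samples shrink the spread of the first estimate around zero but never move its mean.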
The avalanche effect is a design requirement of SHA-256. A hash function without it would leak information about inputs through its outputs, making collisions easier to find and undermining its cryptographic uses. The property that makes SHA-256 useful as proof-of-work is exactly the property that makes it unlearnable.
What we found along the way
Running this experiment with Adam rather than SGD produced an instructive failure mode. Loss drops immediately and the model appears to converge, but inspection shows the output probabilities have all collapsed to 0.5, producing maximum-entropy distributions that score identically to random. Adam maintains per-parameter estimates of gradient variance, and when the true gradient is near zero, this normalisation amplifies noise into large updates, driving sigmoid outputs toward saturation and collapsing the sampling distribution. SGD has no such normalisation and fails more transparently: loss stays flat, nothing moves, and the absence of signal is legible directly from the training curve.¹¹
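The amplification can be reproduced without any model: drive a single parameter with pure zero-mean gradient noise through hand-rolled Adam and SGD updates (textbook rules, default hyperparameters; a toy one-parameter setup, not the experiment's network) and compare average step sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_update(optimizer, steps=2000, lr=1e-3, noise=1e-4):
    """Average |parameter update| when gradients are pure zero-mean noise."""
    theta, m, v, total = 0.0, 0.0, 0.0, 0.0
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        g = noise * rng.standard_normal()            # no true signal
        if optimizer == "adam":
            m = b1 * m + (1 - b1) * g                # first-moment EMA
            v = b2 * v + (1 - b2) * g * g            # second-moment EMA
            step = lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
        else:                                        # plain SGD
            step = lr * g
        theta -= step
        total += abs(step)
    return total / steps

adam_step, sgd_step = mean_update("adam"), mean_update("sgd")
```

Adam's steps come out orders of magnitude larger than SGD's, because the variance normalisation divides the tiny noisy gradient by its own tiny scale.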
The slowest part of the training loop was the hashing itself, so we built a JAX implementation of double-round SHA-256, JIT-compiled and vmap-compatible for GPU scoring. Since the input is always exactly 80 bytes, the padding and length-encoding fields are pre-computed constants rather than dynamically derived. The first SHA-256 block covers bytes 0–63 of the header, which contain no nonce, so its message schedule and compression round can be computed once per stem and shared across all nonce samples rather than recomputed for each one.¹²
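The same first-block sharing can be sketched on CPU with nothing but hashlib, whose `copy()` clones the internal state after the nonce-free first 64 bytes (function name ours):

```python
import hashlib

def hash_nonces(header76: bytes, nonces):
    """Double-SHA256 many nonces against one stem, hashing the
    nonce-free first 64 bytes only once."""
    assert len(header76) == 76
    base = hashlib.sha256(header76[:64])   # first SHA-256 block: no nonce inside
    digests = []
    for n in nonces:
        h = base.copy()                    # clone the shared midstate
        h.update(header76[64:] + n.to_bytes(4, "little"))
        digests.append(hashlib.sha256(h.digest()).digest())
    return digests
```

Each candidate needs three SHA-256 compressions in total (two for the padded 80-byte header, one for the second round); sharing the first block saves one of the three.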
What’s next
The curriculum scoring focused the gradient signal on the first output bits, but we still needed a gradient across all 32 nonce bits simultaneously. That gradient does not exist. What if we reduced the problem further, shifting from “what does a high-scoring nonce look like” to the narrower question of “given this nonce, what does a slightly better one look like”?
That reframing changes the target from generating a good nonce wholesale to learning to improve on an existing one. It is a smaller step, and potentially an easier one: the model no longer needs to understand the full nonce space, only to learn a local improvement operator. Part two applies a Decision Transformer to this idea, treating nonce discovery as a sequence of small moves rather than a single prediction, and using the hash score at each step as feedback before committing to the next. Whether the avalanche effect leaves any learnable local structure to exploit is a separate question from whether it leaves any global structure, and one worth asking.
Footnotes
1. See more on the series at the cryptography tag page. ↩
2. A Bitcoin block header is 80 bytes: version (4), prev_hash (32), merkle_root (32), timestamp (4), bits (4), nonce (4). The first 76 form the stem. ↩
3. A preimage attack on a hash function H finds an input x such that H(x) = y for a given target y. A second-preimage attack finds a different input x′ ≠ x such that H(x′) = H(x) for a given x. A collision attack finds any pair x ≠ x′ where H(x) = H(x′). SHA-256 is designed to resist all three. ↩
4. S. Nakamoto, “Re: Hash() function not secure,” BitcoinTalk, Jul. 16, 2010. [Online]. Available: https://satoshi.nakamotoinstitute.org/posts/bitcointalk/threads/102/ ↩
5. One could use winning (stem, nonce) pairs from the blockchain as training data, but two problems arise. First, selection bias: the stems that appear on the blockchain are precisely the ones that turned out to have accessible nonces, which may make them structurally unrepresentative of stems in general. The failed attempts (stems where miners exhausted the full nonce space and found nothing) are never recorded. Second, scale: at the time of writing the blockchain contains approximately 716,600 blocks, each with one winning nonce. Models tackling comparably hard pattern-recognition problems are typically trained on datasets orders of magnitude larger, and there is no reason to expect the nonce-prediction problem to be easier. ↩
6. R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, pp. 229–256, 1992. REINFORCE estimates the policy gradient as ∇J = E[R · ∇ log π(a)]: the gradient of each action’s log-probability is scaled by the return R it received, making high-return actions more probable and low-return actions less so. Our loss uses score² rather than the raw score as the return, squaring to amplify the rare high-scoring samples and give them disproportionate influence on the gradient estimate. This is a reward-shaping heuristic rather than strict REINFORCE, but the structure is the same: sample actions from the policy, observe outcomes, reinforce what worked. ↩
7. Given bᵢ as the i-th byte of the reversed double-SHA256 output (Bitcoin convention: lower value = better hash), let zᵢ be the number of leading zero bits in bᵢ, and define

   score = (1/9) · Σᵢ zᵢ · 9⁻ⁱ

   The leading 1/9 normalises the sum: each zᵢ is at most 8 and Σᵢ 9⁻ⁱ approaches 9/8, so the maximum sum is 9 and a perfect hash (all zero bytes) gives score 1; when all bytes are 255, every zᵢ is 0 and the score takes its minimum of 0.
   The per-byte decay of 9 is a curriculum parameter. With this steep decay, nearly all the score’s sensitivity sits in the first byte: a leading-zero gain in byte 0 adds roughly 0.11 to the score, the same gain in byte 1 adds roughly 0.012, and by byte 2 the contribution is negligible. The training signal is concentrated on the easiest sub-problem. Replacing 9 with a smaller constant (say 2) flattens the decay and spreads attention across more bytes, allowing the curriculum to progress once signal is established in the first bits. The null result here holds at the steepest setting, the most sensitive version of the problem. ↩
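One formula consistent with every constant quoted in this footnote (maximum sum 9, per-byte gains of roughly 1/9 and 1/81, perfect score 1) is score = (1/9) Σᵢ zᵢ 9⁻ⁱ, where zᵢ counts the leading zero bits of byte i of the reversed digest; a sketch under that reading:

```python
import numpy as np

def leading_zero_bits(byte: int) -> int:
    """Number of leading zero bits in one byte (0..8)."""
    return 8 - byte.bit_length()

def score(digest: bytes) -> float:
    """Curriculum score over the reversed double-SHA256 digest:
    score = (1/9) * sum_i z_i * 9**(-i), where z_i counts the leading
    zero bits of byte i (byte 0 = most significant under Bitcoin's
    reversed-hash convention)."""
    z = np.array([leading_zero_bits(b) for b in digest[::-1]], dtype=float)
    return float((z @ 9.0 ** -np.arange(z.size)) / 9.0)
```

At this decay byte 0 dominates: a uniformly random digest scores about (1/9) · E[z₀] ≈ 0.12, in line with the 0.1 to 0.2 range quoted in the text.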
8. The 63 variable bytes are: bytes 4–26 of prev_hash (23 bytes; the remaining 9 are zero-padded in this experiment), merkle_root (32 bytes), timestamp (4 bytes), and bits (4 bytes). Version is fixed and dropped. Model architecture: an embedding layer projects the 504-dim input to a 512-dim hidden space, followed by 12 residual blocks. Each block applies three 512-dim linear layers with ReLU activations on the first two, then adds the block input as a skip connection. A final linear layer projects to 32 outputs, each passed through sigmoid. Training samples a batch of stems and a batch of nonces per stem, accumulated over 32 steps before each parameter update; the effective batch of (stem, nonce) pairs per update is the product of the three. ↩
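For concreteness, a numpy sketch of a forward pass with these shapes (random He-style weights; the small initialisation on each block's last layer is our own choice to keep activations bounded at depth, not something the post specifies):

```python
import numpy as np

rng = np.random.default_rng(3)

def init(dim_in=504, hidden=512, n_blocks=12, n_out=32):
    """Random parameters; layer shapes follow the description above."""
    def lin(a, b, scale=1.0):
        return scale * rng.normal(0.0, (2.0 / a) ** 0.5, (a, b)), np.zeros(b)
    return {
        "embed": lin(dim_in, hidden),
        # small last-layer init per block keeps the residual stream bounded
        "blocks": [[lin(hidden, hidden), lin(hidden, hidden),
                    lin(hidden, hidden, 0.05)] for _ in range(n_blocks)],
        "head": lin(hidden, n_out),
    }

def forward(params, x_bits):
    """x_bits: (batch, 504) of {0, 1}; returns (batch, 32) bit probabilities."""
    w, b = params["embed"]
    h = x_bits @ w + b
    for (w1, b1), (w2, b2), (w3, b3) in params["blocks"]:
        z = np.maximum(h @ w1 + b1, 0.0)       # ReLU on the first two layers
        z = np.maximum(z @ w2 + b2, 0.0)
        z = z @ w3 + b3                        # linear third layer
        h = h + z                              # residual skip connection
    w, b = params["head"]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))  # per-bit sigmoid
```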
9. The avalanche property is sometimes formalised as the strict avalanche criterion (SAC): flipping any single input bit should flip each output bit with probability exactly 1/2. SHA-256 does not satisfy SAC in the formal sense but approximates it closely enough in practice that no statistical distinguisher for the nonce-to-hash mapping is known. ↩
10. More precisely: let pⱼ be the model’s probability that nonce bit j is 1, and let aⱼ be the sampled value of that bit. The expected REINFORCE gradient with respect to pⱼ is E[score · ∂ log P(aⱼ)/∂pⱼ]. By the avalanche property, the score is independent of aⱼ in expectation for any fixed stem, so the expectation factors as E[score] · E[∂ log P(aⱼ)/∂pⱼ], and the second factor is identically zero. The same argument applies to every parameter via the chain rule. ↩
11. The collapse produces a misleadingly low cross-entropy loss, since a maximum-entropy distribution over bits assigns probability 0.5 to every bit and incurs the minimum possible binary cross-entropy regardless of whether the nonces it implies are any good. ↩
12. I would expect current Bitcoin ASIC chips to apply the same shortcut, precomputing the nonce-independent first block once per stem and pipelining only the second block across nonce candidates. The fixed-input nature of the problem makes this a straightforward and significant optimisation at any level of implementation. ↩
@misc{hollows2022tiltinga,
author = {Hollows, Peter},
title = {{Tilting at Windmills I}},
year = {2022},
month = jan,
url = {https://dojo7.com/2022/01/01/tilting-1-the-first-thing-to-try/}
}