Tilting at Windmills VI

This experiment is Part III of a three-paper series:

Part I — The Amortization Envelope: Where a SHA-256d Mining Advantage Can and Cannot Live: mining cannot beat brute force unless SHA-256 is broken.
Part II — Carry Depth: A Circuit-Independent Coordinate of ARX, and the One-Round Advantage Cliff: the 386-layer coordinate and the one-round cliff.
Part III — Neural Distinguishers Expire on Carry Composition: this post.

Protocol, code, and results on GitHub at captainpete/carry-adder.

Part V steps away from training runs and makes the case on paper: every path from the nonce to the final hash passes through layers of 32-bit modular addition, 386 of them for Bitcoin’s double SHA-256, and the carry chains inside those additions are what erases local structure. The anchor sweep measures this directly. A hand-built score that reads the internal state at round 30 and adds the top bytes of the round’s two modular sums holds a 0.886 selection advantage one adder layer downstream, and loses essentially all of it within a single round.

That measurement carries a standing objection, and Part V ends by raising it. The cliff is measured with one hand-built feature, and “a human looked and found nothing” is a weak form of evidence in cryptanalysis. The machine-learning era supplies a specific counterexample: Gohr shows at CRYPTO 2019 that a neural distinguisher trained on round-reduced Speck32/64, an ARX cipher whose only nonlinear operation is the same modular addition, finds local structure that human cryptanalysts had missed.¹ If a network can do that to Speck, the polite question is what one finds at our read point.

Before the series can end, a learned attacker has to be given a fair shot at the wall, and fair has a technical meaning here. The protocol, the power gates, and the escalation rules are written down before any training run, and the results count whichever way they fall. This post is the record of that experiment: the strongest learned attack the series mounts, run under pre-registered rules, and the result that closes the series.

The protocol

The setup mirrors the paper’s anchor sweep. A fixed message stem (sixteen words derived from a seeded SHA-256 digest) fills the schedule, with word W[3] playing the nonce; the state advances through thirty rounds of a single SHA-256 compression from the IV, and round 30 is the read point. The network’s task is prediction at depth: given everything available at the read point, predict a state observable $j$ rounds deeper. At $j=0$ one round has been applied, the same one-adder-layer point where the hand score earns its 0.886; at $j=1$ the target sits a full round (roughly three carry layers) past the $j=0$ point, and so on up to $j=4$ .

The inputs include the schedule words consumed by the rounds being predicted, so the target is a deterministic function of the input. Failure cannot be blamed on missing information, only on the learnability of composed carry layers at fixed network depth. The split is mining-relevant: training nonces $0$ to $2^{20}-1$ , test nonces the next $2^{19}$ , disjoint but drawn from the same stem.

Three design choices do the epistemic work. First, a positive control: the cell at $j=0$ , where signal provably exists, must clear a power gate before any null at $j \geq 1$ means anything. A null result is evidence about SHA-256 only if the pipeline can detect signal where signal is known to be present; if the positive control fails, the experiment is unpowered, and the nulls say something about the optimiser rather than the hash. Second, a noise model with teeth: the reported statistic is the network’s maximum test advantage over all epoch evaluations, deliberately anti-conservative (the attacker gets its best epoch), which strengthens any null that survives it; the gate for a candidate signal is set at five standard errors of that max statistic, and a shuffle control with permuted training labels calibrates the noise ceiling empirically.² Third, an escalation rule: any candidate signal at $j \geq 1$ triggers a rerun with a fresh seed and four times the data, and only a confirmed signal counts as refuting the cliff. The grid is three stems by five depths, and the interpretation of every outcome is committed in advance.

The registered model is a residual MLP trained with Adam: fifty epochs, batch 8192, learning rate $10^{-3}$ dropping tenfold for the last fifth of training.³ The architecture matters less than what the network is given to look at.

Four amendments

The protocol permits amendments under one constraint: each must be documented before any $j \geq 1$ result is interpreted. Four turn out to be necessary, and together they read as a lab notebook on what SHA-256 is made of.

Amendment 1. The first smoke run fails its positive control outright: asked to predict bit 31 of the new $a$ register one round past the read point, the network reaches an advantage of 0.008 where the hand score’s threshold classifier sits at 0.499. The diagnosis is that predicting a raw top bit from raw input bits requires SGD to learn 32-bit modular arithmetic from scratch, a parity-shaped optimisation barrier that has nothing to do with the question under test. The redesign regresses the top byte of the target word instead and scores with the paper’s own metric: select the $1/256$ fraction of test nonces with the smallest prediction and measure how much smaller their mean top byte is than the population’s. The hand-score baseline under the new metric reproduces the paper’s number (0.8896 against the published 0.886), which validates the pipeline end to end.

Amendment 2. The positive control fails again: 0.013 against the hand score’s 0.8896. This time the network is spending its capacity learning the round function’s GF(2)-linear and bit-local operations ( $\Sigma_0$ , $\Sigma_1$ , Ch, Maj) from raw bits, and those are exactly the operations the paper classifies as depth-0, carry-free. The faithful operationalisation of the carry/local attack class is to hand the attacker every depth-0-computable quantity for free and test how far SGD sees past the carry layers, which are the claimed wall. The four derived words join the inputs, and the $j=0$ target becomes “top byte of a sum of given operands”: exactly one carry layer.

Amendment 3. Still a failure, though a nearer miss: $j=0$ maxes at 0.064, with the training loss plateauing at the target’s variance, the signature of nothing learned. The root cause is that the target is modular: the top byte is $\lfloor 256 \cdot \mathrm{frac}(\sum \text{values}) \rfloor$ , and modular reduction has zero linear correlation with any input direction, the known hard case for MSE and SGD. This is the paper’s own thesis rendered as an optimisation pathology, but an experiment whose attacker cannot represent the obvious computable baseline is unpowered, so the attacker gets the standard remedies from the neural-arithmetic literature: Fourier features at byte harmonics (sine and cosine of $2\pi \cdot 2^m \cdot \text{value}$ for $m \in \{0, \dots, 7\}$ , so that modular sums become products of given rotations), and a 256-way softmax head with cross-entropy loss, which handles the wrap where regression does not.⁴

Amendment 4. Diagnostics narrow the remaining failure to a known open problem: at feasible budgets, SGD does not synthesise exact 32-bit modular addition. Gated multiplicative networks, the architecture built for this, simply memorise the $2^{20}$ training examples, reaching a training cross-entropy of 0.09 with zero test advantage, and weight decay does not flip them to generalisation within budget.⁵ That difficulty is orthogonal to the question being asked, and it is itself an observation in the paper’s favour; Gohr’s networks never synthesised cipher arithmetic either, they exploited statistical signal in data they were given. The final design therefore hands the network $T_1$ and $T_2$ of the read round as feature words: the attacker computes round 30 in full, which is everything an attacker standing at the read point can compute exactly, and which is precisely the anchor sweep’s own read point (the hand score is a function of $T_1$ and $T_2$ ). With that, $j=0$ lies in the feature span and the power gate tests the pipeline, as it should; $j \geq 1$ becomes the genuine reach question, since the operands of later rounds require resolving the previous round’s carries.

By the time the positive control passes, the feature set has become a restatement of the paper’s taxonomy. Everything carry-free is given away for free, the read round’s sums are given away because the attacker can compute them, and the only thing left between the network and the target is carry.

The result

The positive control passes. At $j=0$ the network reaches a mean selection advantage of 0.9978 across the three stems (range 0.9972 to 0.9990), against the hand score’s 0.8841, re-measured on the same three stems and test split (per-stem scatter of about 0.005 covers the spread between this figure, the smoke run’s 0.8896, and the paper’s 24-stem 0.8856). The network beats the hand score in every stem: it learns the carry-in corrections that the truncated $(T_1 \gg 24) + (T_2 \gg 24)$ score discards, making it a strictly stronger member of the carry/local attack class than the one the paper measures the cliff with. The learned feature search is doing its job, and doing it better than the human one, where there is signal to find.

One round past the read point the advantage disappears, and it stays gone through every deeper cell.

depth	net max advantage (mean of 3 stems)	hand score	verdict
$j=0$ (1 adder layer)	0.9978	0.8841	positive control passes
$j=1$ (1 round)	0.0291	~0.01	null
$j=2$	0.0260	~0.01	null
$j=3$	0.0304	~0.01	null
$j=4$	0.0309	~0.01	null

All twelve $j \geq 1$ cells sit below the 0.065 gate; the largest is 0.0372, and the per-cell range is 0.0211 to 0.0372. The shuffle control, trained on permuted labels, comes in at 0.0212, inside that band, which pins the $j \geq 1$ readings as the pipeline’s max-over-epochs noise floor rather than residual signal. The unbiased statistic confirms it: the pooled final-epoch advantage over the twelve cells is $-0.0040$ , with a 95% confidence interval of $[-0.0079, -0.0001]$ .⁶ No cell qualifies as a candidate signal, so the escalation rule never fires.

Reach, not power

A null at one model size proves little; the $j=1$ null could be a power limit rather than a reach limit, and the follow-up sweep in scale_j1.py scales each axis independently to check. There is a methodological trap here: the max-over-epochs statistic is upward-biased, and the bias grows with capacity and with the number of epoch evaluations, so a fixed gate would manufacture false signal as scale increases. Every cell is therefore paired with a shuffle-label run of the identical configuration, and the verdict per cell is the gap between the real run’s max and its own shuffle’s max.

Capacity runs from 0.4M to 205.7M parameters across five model sizes; the gaps range from $-0.032$ to $+0.013$ , none approaching the 0.02 threshold. The largest model is excluded as invalid (its test predictions are bit-identical from epoch 35 onward, a constant-output collapse, and a rerun at lower learning rate with gradient clipping collapses again, its loss pinned at $\ln 256$ throughout⁷), which caps the defensible capacity range at 90x, from 0.4M to 35.7M parameters. Data runs from $2^{16}$ to $2^{22}$ training samples, a factor of 64; the gaps are $+0.009$ , $+0.003$ , $+0.013$ , $-0.014$ , with the largest-data cell sitting below its own shuffle ceiling. Training time runs to 400 epochs, eight times the base budget, as a probe for a late generalisation transition: the advantage at epochs 10, 25, 50, 100, 200, and 400 reads $-0.004$ , $+0.005$ , $+0.021$ , $+0.005$ , $+0.013$ , $-0.018$ , and the final evaluation is the most negative of the six. No axis shows a dose-response, so the $j=1$ null reads as a reach limit rather than a power limit: more network, more data, and more time all fail to move the learned attacker past one round.

The ladder

The sharpest of the follow-up runs strips away SHA-256 entirely and presents the mechanism in its barest form. The task is synthetic: predict the top byte of a sum of $k$ uniformly random 32-bit words modulo $2^{32}$ , with the same features, the same network, and the same pipeline, for $k$ from 2 to 7. Between the inputs and the answer sit only $k-1$ chained modular additions; the round function, the schedule, and every other piece of cryptographic structure are gone.

The network handles two operands at 0.9996 and three at 0.9983, and then dies: $k=4$ through $k=7$ are all dead, with maxima between 0.025 and 0.041 and final-epoch advantages near zero. Carry composition, by itself, on data with no SHA-256 structure at all, is where learning stops, somewhere between two and three chained additions. Probing the boundary does not move it: a curriculum that masters $k=3$ before fine-tuning on $k=4$ reaches 0.024, and doubling the Fourier bank reaches the same floor. Narrowing the operands does: at widths of 8 to 28 bits, $k=4$ seeds either solve the task almost perfectly or sit at the floor, a bimodal outcome with nothing in between, and at the full 32 bits zero of eight seeds escape.⁸ The death is an optimisation cliff rather than a smooth capacity limit. This calibrates the interpretation of the main grid: the $j \geq 1$ nulls demonstrate that this attacker class dies on carry composition, and they cannot certify any SHA-256-specific hardness beyond that point, because the attacker never gets past the carries to find out.⁹

Two follow-ups settle what that wall is. The first tests the precision reading the width cliff suggests. If the narrow-width seeds win by estimating the real-valued sum, float32’s 24-bit mantissa should cap that shortcut exactly as the operand width crosses it, so rebuild the $k=4$ , width-32 task in full double precision and look. Paired seed-for-seed against single precision, with identical operands and identical initialisation, the two arms are null together: zero of six seeds learn in either, the double-precision arm tracking the single to within about $0.01$ at every seed. Full operand resolution changes nothing. The wall is a trainability barrier, not a numerical one, and the analog-shortcut story is disfavoured.

The second cuts against the easy misreading of the ladder, that $k=4$ is where carry composition becomes intrinsically hard. It is where this learner stops, which is not the same thing. The carry chain for $k$ summands is Holte’s matrix, and its second eigenvalue is $\tfrac12$ for every $k$ ;¹⁰ adding operands only introduces faster-decaying modes, so nothing in the carry’s own mixing distinguishes four operands from two or three. What ends at $k=4$ is gradient descent’s reach over a fixed feature set, not an information boundary in the sum, the same verdict the precision test returns from the other side: once the analog shortcut is gone only the exact bit-level carry circuit remains, and that is the thing the attacker cannot synthesise.

The remaining follow-ups round out the picture, and not all of them favour the experiment. A replication of the full design at read point 31 reproduces the pass-then-cliff pattern exactly ( $j=0$ at 0.9988; $j=1$ at 0.0284, below all three of its own shuffle ceilings; $j=2$ at 0.0206), so the wall tracks the carry boundary rather than anything special about round 30. An interleaved train/test split at $j=1$ is also null (0.0297 against shuffle maxima up to 0.033), closing the concern that sequential nonce ranges might form a structured coset.

The convolutional arm takes longer to settle. The first run fails as an experiment rather than as an attack: a Gohr-faithful one-dimensional convolution over bit positions, the natural inductive bias for translation-equivariant carry chains, collapses at epoch 14 of its $j=1$ run, and its $j=0$ power control reaches only 0.058 against a gate of 0.5. By our own methodology that null is uninterpretable, so the architecture gets the fair runs it had not had: a stabilised variant (warmup, cosine decay, pre-norm residual blocks) trains without collapse and reaches 0.063 and 0.027 on the power control at two learning rates, and a dilated variant whose receptive field spans the whole carry chain reaches 0.036. Three architectures, stable optimisation throughout, and the same floor on a task the MLP passes at 0.99: the convolutional bias is not under-trained here, it is wrong for carries, built for the positional locality of XOR differentials that carry chains do not have. The objection closes by exhaustion. We report the first run’s failure plainly because a pre-registered protocol that only reports its flattering cells is not worth pre-registering.

A last probe turned up a second, separate way the learner fails. Holding $k=4$ at full width and changing which byte of the sum is the target, every read-out is dead, including the low byte, which is just an 8-bit modular addition of the operands’ low bytes and is perfectly learnable on its own. Stripping the feature set finds the culprit: with all 32 bits present the task stays at the floor, and only when the 24 irrelevant high bits are removed does a seed revive. The network cannot pull the relevant sub-circuit out of distractor inputs. That is a different failure from carry depth, and unlike carry depth it is plausibly fixable, since attention or an explicit feature-selection stage would likely defeat it, so it is reported as attacker phenomenology rather than as another wall. It leaves the main result untouched: the ladders predict the top byte, which every operand bit reaches through the carry chain, so there are no distractors there to hide behind.

Where the series ends

Part I opens with a guarantee: failure is certain, and the interesting question is always why. Six parts later the why has a name, carry depth; a count, 386 layers between the miner’s nonce and the hash that gets compared to the target; and a shape, a cliff one round wide at the strongest read point we can find. This post adds the check that does not depend on the limits of one person’s feature engineering: the strongest learned attacker we can build, handed every carry-free quantity for free and every sum the read point can compute exactly, beats the hand-built score where signal exists and agrees with it completely about where the signal ends.

Three qualifications bound that agreement. This is a one-shot learned predictor, the direct analogue of Gohr’s distinguisher; it does not rule out machine learning used as a search heuristic over algebraic representations, which is where Gohr’s strongest attacks actually lived, and which the paper scopes out rather than claims to close. It is the simplified single-compression read point, faithful to the round function’s mechanism rather than the full double-hash mining map. And every null here means “consistent with zero at this power” under gates fixed in advance, with the one escalation path never triggered because nothing ever cleared a gate.

The series is named for attacking imaginary giants, and the joke has aged into something more literal than intended. We charge the windmill from six directions across six posts, and it stands; on inspection, the tower turns out to be load-bearing masonry, 386 courses of it, and every attacker we send, hand-built or learned, falls off at the first course past wherever it starts climbing. That is the answer the series set out to find, and with the papers now carrying the proofs, this seems like the natural place to dismount.

A. Gohr, “Improving attacks on round-reduced Speck32/64 using deep learning,” CRYPTO 2019. Speck32/64 is an ARX cipher: its round function uses only modular addition, rotation, and XOR, so its sole source of nonlinearity is the same carry mechanism at issue here. Two differences between Gohr’s setting and ours matter later in this post: his networks are one-dimensional convolutions over bit positions operating on ciphertext-difference pairs, and his strongest attacks combine the learned distinguisher with classical key-ranking and multi-round search rather than using the network alone. ↩
The selection advantage is $1 - \overline{b}_{\text{sel}} / \overline{b}_{\text{all}}$ , where $\overline{b}_{\text{sel}}$ is the mean top byte over the $1/256$ fraction of test nonces with the smallest prediction and $\overline{b}_{\text{all}}$ is the mean over all of them. With $2^{19}$ test nonces the selected set is 2048, giving a standard error of about 0.013 on the advantage, so the candidate-signal gate of 0.065 is five standard errors on the max-over-epochs statistic. The gate governs only the escalation decision: by itself it excludes per-cell true advantages of roughly 0.04 to 0.05 and no smaller, which is why the quantitative conclusion rests on the pooled final-epoch confidence interval rather than on the gate. The hand score’s $\sim 0.01$ at $j \geq 1$ in the results table is its noise floor at this experiment’s three-stem size; Part V’s 24-stem sweep bounds the same quantity at 0.0011. ↩
The architecture is a GELU residual MLP, width 1024 with three residual blocks and a 256-way softmax head, about 7.4M parameters. The registration says “~2.4M parameters”; that figure is wrong arithmetic in the registration document itself (caught in external review and corrected in place, with the registered text left as written), not a design change, and no gate depended on it. ↩
Each input word contributes 50 features: its 32 bits, its value scaled to $[0,1)$ , its top byte scaled to $[0,1)$ , and sine and cosine at eight byte harmonics. Under the final design the input is fifteen words at $j=0$ : eight state words, the four carry-free derived words $\Sigma_0(a)$ , $\Sigma_1(e)$ , $\mathrm{Ch}(e,f,g)$ , $\mathrm{Maj}(a,b,c)$ , the two sums $T_1$ and $T_2$ , and one schedule word per predicted round. ↩
This is the obstacle the neural-arithmetic literature studies under the name “grokking”: networks trained on modular arithmetic memorise long before they generalise, if they generalise at all within budget. The diagnostics here reproduce the memorisation half (train cross-entropy 0.09, test advantage zero at $2^{20}$ samples) and never observe the transition; the 400-epoch probe in the scaling section looks for it directly at $j=1$ and finds nothing. ↩
Cells within a stem share test nonces, so treating the twelve cells as fully independent is optimistic; under any treatment of that correlation the data exclude a persistent learned advantage above roughly 0.01 at $j \geq 1$ . The $j=1$ cells alone pool to $-0.0060$ with a 95% interval of $[-0.0126, +0.0007]$ . ↩
$\ln 256 \approx 5.55$ is the cross-entropy of a uniform prediction over the 256 possible top-byte values: the rerun never learned anything at all, including the training set. ↩
The bimodal seed behaviour is the measured fact: at $k=4$ a seed either solves the task almost perfectly or sits at the floor, with nothing in between. The natural explanation, that float32’s 24-bit mantissa caps an analog shortcut exactly as the operand width crosses it, is tested in the two paragraphs after the ladder plot and disfavoured. ↩
The ladder runs the same width-1024, three-block network for 30 epochs on $2^{20}$ training and $2^{18}$ test samples per rung. The result also brackets the main experiment from both sides: the $j=0$ pass shows the attacker can learn one two-operand addition through this feature set, while its repeated failure during the amendments to compute $T_1$ and $T_2$ from their seven constituent words (about five chained additions) sits past the ladder’s death point, exactly where the ladder says it should. ↩
J. Holte, “Carries, combinatorics, and an amazing matrix,” American Mathematical Monthly 104(2), 1997; the same matrix governs the base- $b$ riffle shuffle, P. Diaconis and J. Fulman, “Carries, shuffling, and an amazing matrix,” 2009. The base-2 carry chain for $k$ summands has eigenvalues $\{2^{-j} : 0 \le j < k\}$ , so its spectral gap stays $\tfrac12$ no matter how many operands are added. ↩