Tilting at Windmills V

This post walks two parts of a three-paper series:

Part I — The Amortization Envelope: Where a SHA-256d Mining Advantage Can and Cannot Live: the reduction and the bounded sharing envelope.
Part II — Carry Depth: A Circuit-Independent Coordinate of ARX, and the One-Round Advantage Cliff: the 386-layer coordinate and the anchor sweep.

Reproduction code on GitHub at captainpete/carry-adder.

Four posts of this series are a catalogue of failures.¹ REINFORCE gets a gradient that is exactly zero in expectation. A Decision Transformer discovers that predicting nonce bits one at a time is no easier than predicting them all at once. Monte Carlo tree search finds it has no value function to learn. A SAT solver handles 10 rounds of SHA-256 and dies at 11. And the carry-aware ranker of Part III, the only method in the series that finds any structure at all, finds it exactly one carry layer in and loses it at the first composition on top of it. Each failure has its own local explanation, and Part IV assembled them into an impossibility argument that ends by naming a common mechanism. This post is that mechanism in full: hardness is the cost of resolving carries, and the distance every attack has to cover is measured in carry layers. The argument is now written up as the first two parts of a three-paper series with self-contained reproduction code; this post walks them at blog altitude, and the papers carry the proofs.

The one nonlinearity that travels

SHA-256 is built from a small inventory of operations: bit rotations and shifts, XOR (and the $\Sigma_0, \Sigma_1, \sigma_0, \sigma_1$ functions composed from rotations and XOR), the bitwise choice and majority functions $\mathrm{Ch}$ and $\mathrm{Maj}$ , and 32-bit modular addition. Classify each by two questions: is it linear over $\mathrm{GF}(2)$ , and does it couple different bit positions?

Rotations, shifts, XOR, and the $\Sigma$ / $\sigma$ functions are all linear over $\mathrm{GF}(2)$ : every output bit is a parity of input bits. Linear structure presents no obstacle to an attacker, because a linear system is solvable in polynomial time. $\mathrm{Ch}$ and $\mathrm{Maj}$ are nonlinear, but bit-local: output bit $i$ depends only on the three input bits at position $i$ , so whatever confusion they create stays at its own bit position. Modular addition is the one operation that is both nonlinear and position-coupling, and the coupling is entirely the work of the carry chain. Writing $\gamma$ for the carry vector, with $\gamma_0 = 0$ and $\gamma_{i+1} = \mathrm{Maj}(a_i, b_i, \gamma_i)$ ,

$a + b \bmod 2^{32} \;=\; a \oplus b \oplus \gamma(a, b).$

The $a \oplus b$ term is linear. Every nonlinear, cross-position effect of addition lives in $\gamma$ : a high bit of the sum depends on all the lower input bits, through the carry-in, and on nothing else.² If the carries were known, addition would be pure XOR.

Introduce an auxiliary variable for every nonlinear-gate output in SHA-256d: each carry bit and each $\mathrm{Ch}$ / $\mathrm{Maj}$ term. Fix all of them, and the entire double hash collapses to an affine map over $\mathrm{GF}(2)$ , at which point “find an input whose hash lands below the target” becomes a solvable linear system. All of the hardness of mining is the cost of resolving those nonlinear bits, and since $\mathrm{Ch}$ and $\mathrm{Maj}$ never move information between positions, the carries are the part that determines how far an attacker’s influence can reach.

That motivates a definition. The carry depth between a controllable input and a target output is the number of modular-addition layers on the dependency path between them. It is a property of the function’s dataflow, not of any particular circuit that implements it,³ and the claim of the paper is that it is the structural coordinate that decides whether a proof-of-work function is attackable: a two-round reduced hash has carry depth of a few layers and falls to straightforward algebra, and the question for SHA-256d is what the number actually is.

Counting to 386

The accounting is a static dataflow analysis, an integer depth propagated through the dependency graph of the double hash; it computes no hashes and touches no data. Each round of the compression function computes $T_1$ (a five-operand modular sum), $T_2$ (a two-operand sum), and then the new state words $a' = T_1 + T_2$ and $e' = d + T_1$ . Each of the 48 expanded message-schedule words is a four-operand sum. The Davies-Meyer feed-forward at the end of each compression adds the initial state back into the final state, wordwise, so it is itself a layer of carries. The miner’s controllable bits are the block-1 message words (the nonce sits in word $W_3$ , and the extranonce, version bits, and timestamp cover most of the rest), so depth is measured from there to the second hash’s output.

Two conventions in the count are deliberately conservative. Multi-operand sums are combined by the minimum-depth associative tree, which credits the attacker with the shallowest legal circuit; a sequential or nonce-only accounting gives a somewhat larger number. And $\mathrm{Ch}$ / $\mathrm{Maj}$ count as depth zero, since they inject no carry, so the figure is a floor on the nonlinear separation rather than an estimate of it.

The result: one compression (64 rounds plus feed-forward) is 194 adder layers deep, roughly 3 layers per round, and the full SHA-256d path from controllable input to output is 386 layers. Along that path sit 1200 modular additions, which at 31 carry bits per 32-bit add is about 37,200 carry bits.⁴ The script that computes this (repro/carry_depth.py) is dependency-free Python and runs instantly, which is a pleasant property in a field where reproduction usually starts with a GPU queue.

The two feed-forwards and the second hash contribute more separation than their round count suggests, because each acts as an independent carry reset: a fresh full-width operand is added across the whole state, injecting a new carry chain into every word at once. Satoshi’s choice of a double hash, commonly attributed to length-extension protection, also happens to double the carry distance an attacker has to cross.

Two kinds of structure

Total mining cost factors into two numbers: the count of candidates evaluated, and the compressions paid per candidate after every possible computation is shared. Part IV walks both factors to their limits: in the random-oracle model brute force is query-optimal on the first,⁵ a bound that survives precomputation (salting by the previous block’s hash defeats it⁶) and quantum search (Grover’s quadratic speedup, provably nothing more⁷), while the sharing envelope on the second is exhausted by midstate caching and ASICBoost’s roughly 20%,⁸ with the second hash unshareable below a $2^{128}$ collision. What the paper adds is the structural reading: SHA-256d contains two kinds of structure, and each cost factor belongs to one of them.

The message schedule is a fixed recurrence, the same for every block, which makes its computation shareable across candidates; every sharing win ever found, midstate and ASICBoost alike, lives in this layer. The round function is the other layer, and it is pseudorandom: its output cannot be predicted, so the candidate-count factor cannot be reduced by being clever about which candidates to try. Part IV’s localisation result, reached at the end of the empirical campaign, is this decomposition seen from the outside: the schedule layer is where the wins were, and the round layer is where all the series’ attacks went to die.

The candidate-count claim also transfers to the real SHA-256 by a reduction rather than by assertion: a miner that beats brute force by more than a constant factor succeeds at a rate no algorithm can achieve against an ideal oracle, so running it and watching whether it beats the bound is a distinguishing experiment against SHA-256.⁹ An unconditional proof that mining requires $2^D$ work is out of reach for anyone: for the fixed function it would be an explicit circuit lower bound, and in the asymptotic framing it would exhibit a one-way function and so separate $\mathsf{P}$ from $\mathsf{NP}$ . What the reduction buys instead is localisation. The entire residual risk is a single well-defined object: a global algebraic shortcut that resolves the target output without resolving the carries on the path. By the reduction, that object is a SHA-256 distinguisher.

Measuring the cliff

The depth accounting fixes one number, 386, and the structural reading says where advantage could live; the margin needs a second number, the advantage $\rho$ retained per adder layer. Before measuring it we have a prediction. Treat a single adder’s operand bits as uniform and independent, and the carry is a two-state Markov chain whose bias away from a fair coin halves with every bit position;¹⁰ chaining adders composes these contractions, suggesting that exploitable advantage decays geometrically, roughly $2^{-\Theta(\text{carry depth})}$ . The assumption does not hold inside SHA-256, where the state words are structured and correlated, so this is a heuristic to test rather than a theorem, and the test is the anchor sweep.

The instrument is the strongest local exploit on the books: Part III’s carry-aware score, $(T_1 \gg 24) + (T_2 \gg 24)$ , read at an interior round of the compression function (round 30, in a simplified single-compression setup with one message word playing the nonce). The score strongly predicts that the top byte of state word $a$ is small one adder layer downstream; the sweep selects the best 1/256 of candidates by score and then watches how that control decays as the observed target moves deeper, round by round. Over 24 stems (here, synthetic message blocks rather than real headers) and $2^{22}$ candidates each, the advantage at one adder layer is $0.8856 \pm 0.0005$ : the selected candidates’ mean top byte is about a ninth of the population’s.¹¹ One round deeper, about three adder layers, the retained advantage is at most $0.0011$ , consistent with zero.

The Markov heuristic predicts a slope, and the measurement comes back as a cliff. There is no geometric tail to fit; $\rho \approx 0$ , a one-layer reset. The mechanism is visible in the dataflow. At the next round, state word $a$ enters $\Sigma_0$ , whose three rotations scatter the controlled top byte across all 32 bit positions, and the result is then re-randomised by a fresh carry chain. The round function is a bijection, invertible given the message word, so the information survives in full; what the round destroys is accessibility. The structure the score controlled still exists, delocalised across the state where no local observable can read it.

The paper is deliberate about what it does not do with these numbers: it does not compound $0.0011$ over the remaining layers to manufacture an impressively tiny advantage at the output, because $0.0011$ is a detection limit consistent with zero rather than a measured decay rate. The claim needs no extrapolation. The strongest local attack we tested is below detection after one round, the read point sits at round 30, and the output is 128 rounds from the input, so the advantage expires roughly 100 rounds before reaching it. The sweep reproduces in about two seconds on a bare Python interpreter (repro/anchor_sweep.py --pure, which checks itself against hashlib first and shows the same cliff at a higher noise floor); the full-scale run uses a GPU and takes tens of minutes.

Rereading the series

With the coordinate in hand, the series’ failures stop being separate stories. Part I’s gradient is zero in expectation because the score of a nonce is read at the output, 386 carry layers from the bits the model controls, and no local correlation survives a distance the measured reach of one layer cannot begin to cover. The Decision Transformer’s reframing changes the order in which nonce bits are chosen, and carry depth does not depend on the order in which you choose input bits. MCTS needs a value function, an estimate that one partial nonce is closer to success than another, and “closer” is a depth-1 notion being asked about a depth-386 target; with no positional evaluation to learn, the tree search degenerates to random rollouts. The probability-collapse experiment watches the same mechanism from inside: push a single uncertain bit into a differentiable SHA-256 and the uncertainty cascades through the carry chains until every output bit is a fair coin, with intermediate probabilities surviving only at extremes like $10^{-20}$ . Z3’s profile is the cost statement made literal: resolving $d$ unknown carries costs about $2^d$ , each round adds roughly three layers of them, and at the nonce’s eighth round the budget runs out a thousandfold. And the ranker’s score, re-read by the anchor sweep, is the boundary case that makes the wall visible from both sides: it lives at carry depth 1, where the advantage is a measured 0.886, and it is dead at depth 4, one round later. The six failures are the same wall, approached from different angles.

What’s next

One objection survives all of this, and it deserves better than a dismissal. The carry-aware score is a single hand-built feature, and the recent history of cryptanalysis includes a pointed lesson about hand-built features: Gohr’s neural distinguishers found structure in round-reduced Speck that decades of human analysis had missed.¹² Perhaps the cliff is not a property of the round function but a property of our score, and a network given everything computable at the read point would carry some advantage through where the hand score carries none. That is a well-posed experiment with a pre-registerable protocol, and it is the right way to end the series; the final part runs it.

Parts I through IV are collected at the cryptography tag page, along with the rest of the series as it appears. ↩
The carry recurrence is the finite state of addition viewed as an S-function, the standard framework for differential analysis of ARX constructions: N. Mouha, V. Velichkov, C. De Cannière, and B. Preneel, “The differential analysis of S-functions,” SAC 2010. The observation that the carry is the sole nonlinearity of ARX appears in R. E. Field and B. C. Jones, “Using carry-truncated addition to analyze add-rotate-xor hash algorithms,” 2013, arXiv:1303.4448; their carry-truncation depth is a local approximation parameter on a single addition, where carry depth is a global count along the whole function’s dataflow. ↩
Carry depth is not multiplicative (AND) depth, the measure used in FHE and MPC circuit optimisation. The standard Bristol-format circuit for one SHA-256 compression has AND-depth 1607, far above 194, because AND-depth counts every nonlinear gate on the path, including the bit-local $\mathrm{Ch}$ / $\mathrm{Maj}$ gates and every gate of a ripple-carry adder. It is also a property of a chosen circuit rather than of the function: the same modular addition has linear AND-depth as ripple-carry but logarithmic depth as a parallel-prefix adder, since the carry-merge operator is associative, and published heuristics shrink a 128-bit adder from AND-depth 255 to 9 without altering its logic. P. Aubry, S. Carpov, and R. Sirdey, “Faster homomorphic encryption is not enough: Improved heuristic for multiplicative depth minimization of Boolean circuits,” CT-RSA 2020; N. P. Smart et al., “‘Bristol Fashion’ MPC circuits,” https://nigelsmart.github.io/MPC-Circuits/. ↩
Per compression: 48 expanded schedule words at 3 pairwise adds each (144), 64 rounds at 7 adds each ( $T_1$ takes 4, $T_2$ , $a'$ , and $e'$ one apiece; 448), plus 8 feed-forward adds, totalling 600. The path through SHA-256d crosses two such compressions (block 1 of the first hash, then the second hash; block 0 is nonce-independent and contributes depth 0), giving 1200 adds and $1200 \times 31 = 37{,}200$ carry bits. For depth, the first compression’s output words sit at 194 layers and the second hash lifts the maximum to 386. The two compressions thus contribute 194 and 192 layers; the totals come straight out of the script rather than from doubling. ↩
J. Garay, A. Kiayias, and N. Leonardos, “The Bitcoin backbone protocol: Analysis and applications,” EUROCRYPT 2015, is the standard idealisation under which proof-of-work hardness is proved. The bound itself is a union-bound argument: each fresh query lands in the target set with probability $2^{-D}$ independent of all prior responses, so the first-success time stochastically dominates a $\mathrm{Geometric}(2^{-D})$ variable, whose mean is $2^D$ . ↩
The auxiliary-input random-oracle framework models an adversary with arbitrary precomputed advice and shows salting generically defeats it; with a salt as wide as the function’s output, the advantage collapses to the non-preprocessing bound up to constants. D. Unruh, “Random oracles and auxiliary input,” CRYPTO 2007; S. Coretti, Y. Dodis, S. Guo, and J. Steinberger, “Random oracles and non-uniformity,” EUROCRYPT 2018; Y. Dodis, S. Guo, and J. Katz, “Fixing cracks in the concrete: Random oracles with auxiliary input, revisited,” EUROCRYPT 2017. Classical time-memory trade-offs (M. E. Hellman, IEEE Trans. Inf. Theory, 1980; P. Oechslin, CRYPTO 2003) need a fixed target function for one table to amortise over all instances, and a fresh per-block salt makes each block a distinct function. ↩
C. H. Bennett, E. Bernstein, G. Brassard, and U. Vazirani, “Strengths and weaknesses of quantum computing,” SIAM J. Comput. 26(5), 1997, proves Grover optimal. The result carries to mining specifically: A. Cojocaru, J. Garay, A. Kiayias, F. Song, and P. Wallden, “Quantum multi-solution Bernoulli search with applications to Bitcoin’s post-quantum security,” Quantum 7:944, 2023. At current difficulty the quadratic factor is also economically idle, since a coherent SHA-256d Grover oracle costs orders of magnitude more per query than a classical ASIC gate-pass. ↩
T. Hanke, “AsicBoost: A speedup for Bitcoin mining,” 2016, arXiv:1604.00575. The covert variant engineers expansion collisions through the Merkle root; the overt variant rolls version bits, described in “nVersion bits for general purpose use,” BIP 320, 2018. ↩
Formalised via indifferentiability: U. Maurer, R. Renner, and C. Holenstein, TCC 2004. One subtlety: plain Merkle-Damgård is not indifferentiable from a random oracle because of length extension (J.-S. Coron, Y. Dodis, C. Malinaud, and P. Puniya, “Merkle-Damgård revisited: How to construct a hash function,” CRYPTO 2005), and the outer application in SHA-256d is what seals the inner chaining value behind a second, length-committed invocation. The literature approaches the composition from both sides without covering it: Coron et al. prove Merkle-Damgård indifferentiable on prefix-free (hence fixed-length) inputs, and Y. Dodis, T. Ristenpart, J. Steinberger, and S. Tessaro, “To hash or not to hash again? (In)differentiability results for $H^2$ and HMAC,” CRYPTO 2012, prove only weak bounds for the unrestricted double hash, via an attack that needs digest-length inputs the 80-byte mining domain forbids. No published theorem covers SHA-256d on its fixed-length domain, so the paper states indifferentiability of the mining map as an explicit assumption; the reduction is conditional on it. ↩
With operand bits uniform and independent, $\Pr[\gamma_{i+1}{=}1 \mid \gamma_i{=}0] = \tfrac14$ and $\Pr[\gamma_{i+1}{=}1 \mid \gamma_i{=}1] = \tfrac34$ : a chain with stationary distribution $(\tfrac12, \tfrac12)$ and second eigenvalue $\tfrac12$ , so starting from the known $\gamma_0 = 0$ the bias decays as $\Pr[\gamma_i = 0] = \tfrac12 + 2^{-(i+1)}$ . The carry-in is a fair coin within a handful of bit positions of the anchor. ↩
The advantage at $j$ rounds downstream is $1 - \mu_{\text{sel}}/\mu_{\text{all}}$ , where $\mu_{\text{sel}}$ is the mean top byte of state word $a$ over the best 1/256 of candidates by score and $\mu_{\text{all}}$ is the mean over all candidates. Per-stem noise is about $0.005$ ; across 24 stems the standard error on the mean is about $10^{-3}$ , which sets the detection limit quoted for the deeper rounds. The per-stem values one round deeper scatter symmetrically about zero rather than sitting at a small consistent positive value. ↩
A. Gohr, “Improving attacks on round-reduced Speck32/64 using deep learning,” CRYPTO 2019. ↩