Tilting at Windmills IV

Part III leaves the series in an unfamiliar position: holding a real result. The carry-aware score predicts the top byte of the SHA-256 compression output well enough to improve the mean minimum hash of selected candidates by 67% over random selection, and the effect is solid, reproducible, and indifferent to how many rounds precede it. It is also exactly one layer deep. The chaining-variable addition at the end of the first hash erases it, and even a perfect oracle over the first digest’s ordering, granted by construction at no cost, selects winning nonces no better than chance once the second hash runs.¹ The structure is real; it just does not compose.

What remains after that is a short list rather than an open field. If an advantage exists anywhere in this problem, it hides in one of three places: in shallow algebraic structure near the hash boundaries that the carry-aware score did not capture; in the choice of which stems receive search effort at all; or in the per-candidate cost of the hashing itself. This post works through the list, one experiment per surface, and then assembles what is left into a single statement.

The method also changes here. The early notebooks in this series ran until something looked interesting, which is how exploratory work goes, and also how false discoveries are manufactured. After this many parts, the most likely way to obtain a positive result is to fool ourselves into one. So every experiment below is pre-registered: the hypotheses, the feature families, the pass gates, and the multiple-testing corrections are written down and locked before any confirmation data is evaluated. Families cannot be added mid-run, thresholds cannot be adjusted after the fact, and an anomaly that fails to reproduce under a fresh seed is noise however suggestive it looks. Each experiment specifies in advance what would count as a finding, which is what the early notebooks did not do.

A census of shallow features

The search is for a composable, stem-conditioned coordinate system: features that control round-level algebraic structure (carries, Boolean gate outputs, message-schedule words), in the hope that enrichment at low round counts composes forward through more rounds and eventually through the second hash. Three framings are ruled out before starting, each already proven unproductive: training a generic network to predict SHA-256d from scratch, optimising first-hash output and assuming it bridges to mining, and declaring victory on improvement at an intermediate state. Success is measured on final SHA-256d output under priced inference economics, and the investigation is oracle-first: before any model is trained, the features themselves, read from an exact execution trace, must demonstrate value. If no feature family ranks candidates better than chance with the answers in hand, no decoder follows.²

Where could such structure live? Part I’s avalanche argument rules out the deep interior: a feature read thirty rounds from the output is separated from it by enough mixing that it behaves as noise. The defensible surfaces are the shallow windows around the boundaries: the last four rounds of the first hash (rounds 60–63), the chaining-variable addition that closes it, and the first five rounds of the second hash (rounds 0–4). Six feature families are pre-registered to cover this window.³ F1 takes the full per-byte carry vectors of the round-63 additions, generalising Part III’s two-byte score. F2 collects joint carry signatures across rounds 60–63, structure no single-round score can see. F3 crosses the Boolean gate outputs Ch and Maj with the carry bits. F4 reads the message-schedule words $W[60..63]$ , a feature source distinct from the state words entirely. F5 records the carry signature of the second hash’s first five rounds, a function of the first digest but a different function than the ordering already ruled out. F6 isolates the carry pattern of the chaining-variable addition itself: the exact boundary that destroyed the Part III score, treated as a structural object in its own right. Between them the six cover the shallow window’s feature types (state carries, joint carries, Boolean hybrids, schedule words, and both boundaries), which is what lets a six-fold null close the surface rather than merely dent it.

The protocol is fixed before anything is fit. The 2048 test stems split deterministically into a 1024-stem discovery half and a 1024-stem confirmation half, persisted to disk first. Scorers are restricted to ridge regression or a shallow random forest; deep networks, PCA, and per-stem fits are forbidden, because the object under test is the feature family rather than a model’s capacity to overfit it. The trace generator supplying the features is checked bit-exact against a reference implementation before use: 16 of 16 trials, 992 individual scalar comparisons. Confirmation is one-shot per family: 65,536 candidates per stem, the 64 best by the family’s score, hashed for real, and compared against a shared random arm on the rate of easy-target hits. The gate, locked in advance: tail-hit enrichment of at least 1.10x, with the Bonferroni-corrected one-sided bootstrap confidence bound above 1.00.⁴

All six families fail. Point enrichments land between 1.01x and 1.07x; every confidence lower bound falls between 0.82 and 0.88, well short of 1.00. The cluster of small positive point estimates is one shared accident rather than six weak signals: the random arm undershoots its theoretical hit rate by about a percentage point, lifting every ratio computed against it.⁵ The method arms are statistically indistinguishable from the random arm, and from each other.

The entire census, from parity check to final bootstrap, costs about nine minutes of single-GPU time. The expensive part is everything around the compute: locking the family list, persisting the split, restricting the scorers, fixing the gate. Most of the experiment is discipline, which is the correct allocation when the expected result is null and the cost of a false positive is months of decoder work built on sand.

Maybe the stems are not all equal

The second surface abandons nonce prediction altogether and asks a question the series has not asked before: given many possible stems, which ones deserve to be mined? Define $Q(\text{stem})$ as the expected yield of a full nonce sweep over that stem. The ASICs keep doing exact SHA-256d over the whole nonce space; $Q$ only decides where they spend their time, which changes what a model needs to achieve. A cheap per-stem evaluation controls a work unit of $2^{32}$ hashes, so even a small, stable yield shift between stem classes would be operationally worth having. The hypothesis is that stems are not exchangeable: that cheap partitions of stem space exist whose minimum-hash distribution is shifted enough to exploit.

Testing this needs a label and an instrument. The label is the minimum hash over $K = 10{,}000$ sampled nonces per stem, transformed into extreme-value coordinates $y = -\log_2(\text{min\_hash} / 2^{256})$ , whose distribution under the null is known.⁶ Over 30,000 uniform-random stems the empirical mean and variance of $y$ sit within half a percent of the Gumbel reference, and eight stem-only partition families across 114 buckets show no Bonferroni-corrected lift on this negative control, which is what the instrument should report when fed noise. The positive control is the more interesting half: a known yield shift of $\delta = \log_2 1.5$ injected onto a fixed class of stems, which the pipeline must recover. It does, at both tested quantiles in every batch.⁷ A null from an instrument is only worth something if the instrument demonstrably could have found a signal.

The labels also need a stem source, and the obvious one is wrong in an instructive way. The blockchain holds hundreds of thousands of winning headers, each a stem paired with a known-good nonce: a ready-made dataset, except that every stem on it is conditioned on having won the mining race. Stems whose sweeps yield nothing are never recorded anywhere. Part I flagged this as a footnote about training data; here it disqualifies the source outright, because any yield shift measured on winners is selection bias by construction. So the stems come instead from a generator the experiment owns: a frozen, auditable simulator of operational stem generation, with template epochs, extranonce counters, timestamp windows, and PRF-derived merkle roots, which emits its own generator coordinates as candidate features.⁸

The sweep runs on three disjoint batches of 30,000 stems each: discovery, confirmation, and a fresh holdout. The feature catalogue is 52 families, 810 buckets in total, spanning midstate bytes, schedule words, stem-only carry summaries, and the generator’s own coordinates. Discovery ranks the top 40 bucket records by lift with no significance gate at all, and discovery duly finds things; the best record shows a 1.52x lift. Confirmation, Bonferroni-corrected, kills all 40. The holdout has nothing left to test. On the injected-control side of the same run, 26 records survive confirmation and 24 survive the holdout, every one an alias of the injected epoch class, so the funnel passes signal and blocks noise in the same run. An earlier phase of the pipeline produced exactly one marginal anomaly on real labels (confidence lower bound 1.003, about as marginal as an anomaly can be), and a re-run under a second seed returned it to noise, which is the pre-registered fate of anomalies that do not reproduce.

The verdict is exchangeability: under the tested generator and feature catalogue, stems carry no exploitable yield structure. Every cheap partition’s minimum-hash distribution matches the null, and $Q$ would have nothing to allocate. The result is narrow, closing this generator version under these 810 partitions, but it lands exactly where the pseudorandomness of the hash says it must.

What is actually left

With output prediction dead and stem selection dead, the synthesis that closes the project starts from an accounting identity. Total mining cost factors as the number of candidates any algorithm must evaluate, times the compression calls each candidate costs after maximal sharing. Every conceivable approach attacks one factor or the other.

The left factor is locked by the random-oracle bound: model the compression function as a random oracle, and any algorithm making $q$ queries finds a sub-target header with probability at most $(q+1) \cdot 2^{-D}$ , where $2^{-D}$ is the target density; the expected query count to first success is $2^D(1-o(1))$ , and sequential brute force achieves exactly $2^D$ .⁹ Brute force is query-optimal up to a vanishing factor, with no constant-factor slack hiding anywhere. Two riders close the realistic loopholes. Precomputation, in the rainbow-table or preprocessed-advice sense, is defeated by salting: every header commits to the previous block’s hash, a fresh 256-bit value unpredictable until that block is found, and the auxiliary-input bounds say offline work cannot target a salted search.¹⁰ Quantum search gets Grover’s quadratic speedup and provably no more.¹¹ In the standard model, the idealisation converts into a reduction: a miner beating $2^D$ by a super-constant factor succeeds above the random-oracle ceiling, which is by definition a distinguisher against SHA-256. Beating the search factor and breaking the hash are the same event.

So every economically meaningful win lives in the right factor: per-candidate computation sharing. Hashing one header costs exactly three compression calls, and the envelope is set by which of them can be shared across distinct candidates. The first block of the first hash contains no nonce and is shared across the entire sweep; this is midstate caching, universal in every miner, with amortised cost near zero (Part I guessed at this in a footnote, and it is indeed how the hardware works). The second block contains the nonce and must be recomputed per candidate, except for its message-schedule expansion, a fixed recurrence that small batches of candidates can be engineered to share; that region is ASICBoost, with reported gains around 20%.¹² The second hash is effectively unshareable: two candidates share it only by sharing a first-hash digest, which is a SHA-256 collision at around $2^{128}$ work, above the entire mining cost at any realistic difficulty.¹³ Every candidate pays for one full second-hash compression, and no engineering changes that.

That is the whole map: every win anyone has found against brute force lives in the sharing factor; the sharing factor has exactly one region with headroom, and a deployed technique already occupies it. The novel directions this series spent its years on, predicting outputs and selecting stems, attack the locked factor, and they are now closed twice over: empirically by the nulls above, structurally by the localisation argument.

What survives all of this is one object, and it can be named exactly: a global algebraic shortcut through SHA-256 that resolves the target output without resolving the carries on the path to it. By the reduction, such a shortcut is a distinguisher, so finding one means breaking SHA-256, a result that would matter far beyond mining. Excluding it unconditionally is beyond anyone’s reach: a proof that no shortcut exists would establish SHA-256 as a one-way function, and the existence of one-way functions implies $P \neq NP$ .¹⁴ The ceiling therefore reads: mining cannot beat brute force unless SHA-256 itself breaks, and the residual risk is localised to that single well-defined object.

What’s next

The synthesis also names the mechanism behind the series’ failures. Modular addition is XOR plus a carry vector, and the carry chain is the only operation in SHA-256 that couples bit positions nonlinearly; whatever local structure an attack establishes, the next addition’s carries re-randomise it. The zero gradient of Part I, the one-layer ranker of Part III, and the six dead families and 810 dead buckets of this post all hit the same mechanism. Part V counts the adder layers standing between the nonce and the output, measures how fast advantage decays across them, and finds the wall steeper than even the carry chain’s own arithmetic suggests.

The oracle experiment grants a hypothetical attacker the ability to select candidates whose first-hash digest is lexicographically small, for free. Selection quality at the first digest: +99.90% over random. Selection quality at the final SHA-256d output: -1.8%, with rank correlation indistinguishable from zero. The second hash decorrelates input order from output order completely, so no function of the first digest’s ordering can serve as a mining ranker, whatever computes it. ↩
The proposal pre-commits four go/no-go gates before any model training: (1) does any lifted carry/Boolean/schedule feature show oracle enrichment that transfers to later rounds; (2) can a decoder realise the feature from stem-only inputs on held-out stems; (3) does the enrichment survive the first-digest boundary and the second hash; (4) does batched multi-stem inference beat batched brute force on tail hits at equal compute. The census here is gate 1; failing it means gates 2 through 4 never run, which makes it the cheapest possible place to fail. ↩
Feature dimensions and scorers, for the record. F1: 8 binary features, the per-byte carry-outs of the two round-63 additions $T_1 + T_2$ and $d + T_1$ , scored by ridge regression. F2: 8 binary features, the byte-0 carries of both additions across rounds 60–63, ridge. F3: 36 features after bit expansion (Ch and Maj top bytes, the two carry bits, and AND/OR products of the Ch byte with each carry bit), shallow random forest of depth 4 with 32 trees. F4: 16 byte features, $W[60..63]$ at 4 bytes each, ridge. F5: 10 binary features, byte-0 carries of second-SHA rounds 0–4, ridge. F6: 16 binary features, the overflow and byte-0 carry of each of the 8 word-wise additions in the chaining-variable add, ridge. All ridge penalties are fit by cross-validation within the discovery half only. ↩
The easy target is a hash below $2^{248}$ (top eight bits zero); the measured rate is the fraction of stems whose chosen 64 candidates contain at least one such hash. Six families at a familywise $\alpha = 0.05$ give a Bonferroni-adjusted one-sided level of $0.05/6 \approx 0.0083$ , hence a 99.17% confidence bound. The bootstrap is paired: a single random arm is drawn once and shared across all six families and all resamples, so family-to-family comparisons are not contaminated by independent sampling noise in the baseline. Harder thresholds are reported as diagnostics only; at 1024 stems they are too sparse to gate on. ↩
The random arm’s observed hit rate is 0.2090 against a theoretical 0.2212 for 64 draws at a per-draw hit probability of $2^{-8}$ . Dividing each method arm’s rate by an undershooting baseline inflates all six ratios by the same few percent, which is why the point estimates cluster at 1.01–1.07 rather than at 1.00. The confidence intervals, computed on paired differences against that same arm, are not fooled. ↩
For $K$ independent uniform draws the transformed minimum $Y$ is asymptotically Gumbel with $E[Y] \approx \log_2 K + \gamma / \ln 2$ , giving 14.120 at $K = 10^4$ against a measured 14.116 (variance: 3.424 expected, 3.409 measured). The transform matters because raw minimum hashes span hundreds of binary orders of magnitude; in $y$ coordinates a yield shift becomes an additive location shift, and quantile thresholds become calibrated hit-rate tests. The primary thresholds are the median and the 90th percentile; the 99th is diagnostic only, since at these bucket sizes it is underpowered. ↩
The injection adds $\delta = \log_2 1.5 \approx 0.585$ to the transformed yield of stems whose epoch index is 3 mod 8, which by Gumbel arithmetic predicts tail lifts of roughly 1.45x at the 90th-percentile threshold and 1.29x at the median. Measured recovery across the three batches: 1.395x, 1.454x, and 1.413x at q90, and 1.254x, 1.278x, and 1.233x at q50, all six confidence lower bounds above 1.20. ↩
Each epoch fixes the version, bits, a PRF-derived previous-block hash, and a template seed; 100 stems per epoch vary an extranonce counter and a timestamp offset within a 60-second window, with the merkle root derived as SHA-256d of the template seed and extranonce. The generator is frozen under a named master seed before any labelling, and emits epoch, extranonce, timestamp, and merkle-prefix coordinates as features, so that generator-level structure, if any exists, is directly testable. A null closes this generator version specifically; a different operational generator (pool work-queue logs, say) would require the labelling and validation to be repeated from scratch. ↩
Proof sketch: each fresh query to a random oracle returns a uniform 256-bit value independent of everything previously revealed, so each query, and the final output if it goes unqueried, hits a target set of density $2^{-D}$ with probability exactly $2^{-D}$ ; a union bound over the $q+1$ events gives the ceiling, and the expectation follows from the geometric distribution of the first hitting time. The bound is information-theoretic within the model: no candidate ordering, feature, or filter does better, which is the formal content of every empirical null in this series. ↩
D. Unruh, “Random oracles and auxiliary input,” in Proc. CRYPTO 2007; S. Coretti, Y. Dodis, S. Guo, and J. Steinberger, “Random oracles and non-uniformity,” in Proc. EUROCRYPT 2018. The framework bounds an adversary holding arbitrary precomputed advice; for a salted search problem the advice term is negligible when the salt is fresh and on the order of 256 bits, which the previous-block hash supplies. The only advice worth computing is advice computed after the prior block is known, and that is just online mining, racing everyone else. ↩
C. H. Bennett, E. Bernstein, G. Brassard, and U. Vazirani, “Strengths and weaknesses of quantum computing,” SIAM Journal on Computing, vol. 26, no. 5, pp. 1510–1523, 1997. The bound makes Grover’s $\Theta(2^{D/2})$ optimal for unstructured search, so no exponential quantum break exists in the ideal model. The speedup is also economically irrelevant at present scale: a coherent SHA-256d Grover oracle costs many orders of magnitude more per query than a classical ASIC pass. ↩
T. Hanke, “AsicBoost: A speedup for Bitcoin mining,” arXiv:1604.00575, 2016. The technique constructs small batches of candidates that collide in the second block’s message-schedule expansion (varying version bits or merkle-tree structure rather than recomputing the expansion per candidate), so the expansion, a sizeable fraction of one compression’s work, is computed once per batch. Its batch members still have distinct first-hash digests, so it buys nothing on the second hash. ↩
Mining cost is $2^D$ compressions with the difficulty exponent $D$ well below 128, so manufacturing colliding first-hash digests in order to share second-hash work costs more than the mining it would save. The one-compression-per-candidate floor rests on collision resistance, the mildest cryptographic assumption anywhere in the argument. ↩
A function that is hard to invert on average is a one-way function, and a proven one-way function implies $P \neq NP$ . An unconditional “mining cannot beat brute force” theorem would therefore carry famous open problems as corollaries, which calibrates how much should be expected of any impossibility claim, including this one. ↩

A census of shallow features

Maybe the stems are not all equal

What is actually left

What’s next

Footnotes