Backtesting honestly: what survives when you remove hindsight
We went looking for the hindsight in our own backtest, found it, and rebuilt the test without it. The honest version is less flashy — and far more trustworthy.
A backtest is only as honest as its worst look-ahead
The most valuable thing you can do with a backtest is try to break it. Our regime classifier reads 20 indicators across a macro panel and an on-chain panel. Inside each panel the indicators are not equal-weighted — they are weighted by inverse correlation: an indicator that moves differently from the rest of its panel earns more weight, a redundant one earns less. It's a sound diversification idea, and it's standard practice.
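As a rough illustration of the weighting idea, here is a minimal sketch. The exact formula is not given in this piece, so the definition used here (weight proportional to the inverse of an indicator's mean absolute correlation with the rest of its panel) is an assumption, as are the function name and the toy data.

```python
import numpy as np
import pandas as pd

def inverse_correlation_weights(panel: pd.DataFrame) -> pd.Series:
    """Assumed definition: weight_i is proportional to 1 / mean |corr(i, others)|,
    normalised so the weights in the panel sum to 1."""
    corr = panel.corr().abs().to_numpy()
    n = corr.shape[0]
    # Mean absolute correlation with the *other* indicators (drop the diagonal of 1s).
    mean_abs = (corr.sum(axis=1) - 1.0) / (n - 1)
    raw = 1.0 / np.maximum(mean_abs, 1e-6)  # floor avoids division by zero
    return pd.Series(raw / raw.sum(), index=panel.columns)

# Toy panel: "a" and "b" are near-duplicates, "c" is independent of both.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
panel = pd.DataFrame({
    "a": base + 0.1 * rng.normal(size=500),
    "b": base + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})
weights = inverse_correlation_weights(panel)
print(weights.round(3))
```

In this toy panel the independent indicator `c` ends up with the largest weight, and the two redundant ones share the remainder.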
But it raised a question worth taking seriously: when were those weights computed? In the original backtest, the weighting used the correlation structure of the most recent three years of data — and then applied that single weight vector to all eight years of history, including 2018. A 2018 regime call was being scored with weights derived from 2024–2026 correlations that did not exist yet. That is look-ahead bias. It's subtle, it's easy to ship by accident, and it's exactly the kind of thing a backtest should be stress-tested for. So we did.
Three ways to weigh the same signal
Everything else held identical — same raw data, same percentile ranking, same 0.33/0.67 buckets, same 5-day confirmation smoothing, same constant 2.6%/yr cash leg, same 10 bps rebalance cost. The only thing we varied was how the panel weights are computed:
- Static inverse-correlation — the original method. One weight vector from the latest 3-year window, applied to all history. Contains look-ahead.
- Equal weight (1/N) — every indicator in a panel weighted the same. No correlation input at all, so nothing to look ahead with. A robustness baseline.
- Walk-forward inverse-correlation — the honest version. On each day, the weights use only the trailing three years ending that day. Weights evolve slowly as the window slides, and once a day's weight is set it is never revised, because no future data was ever used to compute it.
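The three schemes above could be sketched as follows. This is an illustration under stated assumptions, not the committed script: the helper names, the window expressed in rows, and the inverse-correlation formula (inverse of mean absolute correlation with the other indicators) are all hypothetical.

```python
import numpy as np
import pandas as pd

def inv_corr_weights(window: pd.DataFrame) -> np.ndarray:
    """Assumed formula: inverse of mean |correlation| with the other columns."""
    corr = window.corr().abs().to_numpy()
    n = corr.shape[0]
    mean_abs = (corr.sum(axis=1) - 1.0) / (n - 1)
    raw = 1.0 / np.maximum(mean_abs, 1e-6)
    return raw / raw.sum()

def static_weights(panel: pd.DataFrame, window: int) -> pd.DataFrame:
    # One vector from the LATEST window, copied onto every date (look-ahead).
    w = inv_corr_weights(panel.iloc[-window:])
    return pd.DataFrame(np.tile(w, (len(panel), 1)),
                        index=panel.index, columns=panel.columns)

def equal_weights(panel: pd.DataFrame) -> pd.DataFrame:
    # 1/N everywhere: no correlation input, so nothing to look ahead with.
    return pd.DataFrame(1.0 / panel.shape[1],
                        index=panel.index, columns=panel.columns)

def walk_forward_weights(panel: pd.DataFrame, window: int) -> pd.DataFrame:
    # Row t uses only the `window` rows ending at t; earlier rows stay NaN.
    out = pd.DataFrame(np.nan, index=panel.index, columns=panel.columns)
    for t in range(window - 1, len(panel)):
        out.iloc[t] = inv_corr_weights(panel.iloc[t - window + 1 : t + 1])
    return out

# Point-in-time check on random data: truncating the future must not
# change any weight that was already computed.
rng = np.random.default_rng(1)
panel = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
full = walk_forward_weights(panel, window=100)
truncated = walk_forward_weights(panel.iloc[:200], window=100)
static = static_weights(panel, window=100)
print(np.allclose(full.iloc[99:200], truncated.iloc[99:]))
```

The final comparison is the property that matters: the walk-forward weights for the first 200 rows are identical whether or not rows 200 onward exist, which is not true of the static vector.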
The comparison, in one picture
The numbers — ETH / cash
| Strategy | Static | Equal 1/N | Walk-forward |
|---|---|---|---|
| Composite | 8.0× sh 0.74 | 12.8× sh 0.83 | 10.2× sh 0.79 |
| Conservative | 21.7× sh 1.06 | 13.9× sh 0.93 | 6.9× sh 0.75 |
| Aggressive | 16.3× sh 0.89 | 13.5× sh 0.85 | 8.3× sh 0.75 |
| 100% ETH HODL | 2.7× sh 0.56 | 2.7× sh 0.56 | 2.7× sh 0.56 |
Each cell shows the final multiple over ~8 years, followed by the Sharpe ratio (sh). Cash and cost assumptions are held constant here, so the multiples differ slightly from those on /regime; the cross-method comparison is the point.
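For readers who want to see how a final multiple and a Sharpe ratio might be derived from a daily exposure series, here is a minimal sketch. The 2.6%/yr cash leg and 10 bps cost come from the text; the one-day execution lag, the cost charged per unit of exposure change, the 365-day annualisation, and the Sharpe measured on returns in excess of cash are assumptions, and `evaluate` is a hypothetical name.

```python
import numpy as np
import pandas as pd

CASH_RATE = 0.026   # flat annual cash yield (from the text)
COST = 0.0010       # 10 bps per unit of exposure changed (from the text)
DAYS = 365          # assumed annualisation factor for a 7-day market

def evaluate(asset_returns: pd.Series, exposure: pd.Series):
    """Return (final multiple, annualised Sharpe) for a 0..1 exposure series."""
    cash_daily = (1 + CASH_RATE) ** (1 / DAYS) - 1
    # Yesterday's call earns today's return (assumed one-day lag).
    held = exposure.shift(1).fillna(0.0)
    turnover = held.diff().abs().fillna(held.abs())
    daily = held * asset_returns + (1 - held) * cash_daily - COST * turnover
    multiple = float((1 + daily).prod())
    excess = daily - cash_daily
    std = excess.std()
    sharpe = float(excess.mean() / std * np.sqrt(DAYS)) if std > 0 else float("nan")
    return multiple, sharpe

# Synthetic returns, for illustration only.
rng = np.random.default_rng(2)
idx = pd.date_range("2018-01-01", periods=730, freq="D")
returns = pd.Series(rng.normal(0.001, 0.04, size=730), index=idx)
hodl_multiple, hodl_sharpe = evaluate(returns, pd.Series(1.0, index=idx))
cash_multiple, cash_sharpe = evaluate(returns, pd.Series(0.0, index=idx))
print(round(cash_multiple, 4))
```

An all-cash exposure over two years compounds to 1.026 squared, which is a quick sanity check on the cash leg; its Sharpe is undefined because the excess return has no variance.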
The numbers — diversified 50/50 ETH+SP500
| Strategy | Static | Equal 1/N | Walk-forward |
|---|---|---|---|
| Composite | 5.6× sh 0.83 | 7.3× sh 0.94 | 6.0× sh 0.88 |
| Conservative | 7.2× sh 1.16 | 6.4× sh 1.06 | 4.2× sh 0.85 |
| Aggressive | 7.6× sh 0.98 | 7.6× sh 0.97 | 5.0× sh 0.82 |
| 50/50 HODL | 5.3× sh 0.67 | 5.3× sh 0.67 | 5.3× sh 0.67 |
What survived the honest test
This is the encouraging part, and it's the part that matters most. The findings that hold across all three weighting methods are the ones built on real signal, not on the weighting choice:
- The regime overlay beats buy-and-hold on risk-adjusted return under every method and on every portfolio. Composite Sharpe is 0.74 / 0.83 / 0.79 (static / equal / walk-forward) on ETH vs HODL's 0.56; on the diversified blend it is 0.83 / 0.94 / 0.88 vs 0.67. The direction never flips.
- Drawdown control is real and weighting-independent. On the diversified portfolio the regime overlay caps the worst drawdown near -38% versus HODL's -66% — under static, equal, and walk-forward alike. The protection in coordinated sell-offs owes nothing to the weighting.
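- The composite is the steady performer. Its ETH multiple moves least across methods (8.0× → 12.8× → 10.2×): the highest figure is 1.6 times the lowest, against roughly 3.1 times for conservative and 2.0 times for aggressive. It is also the only strategy that does not lose ground when static weights are swapped for walk-forward ones. The averaging that can look like a weakness is exactly what makes it robust.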
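The drawdown figures quoted above are worst peak-to-trough declines. A minimal sketch of that calculation, assuming an equity curve of positive values (the function name and the sample curve are illustrative only):

```python
import numpy as np

def max_drawdown(equity) -> float:
    """Worst peak-to-trough decline of an equity curve, as a negative fraction."""
    equity = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(equity)  # highest value seen so far
    return float((equity / running_peak - 1.0).min())

# Falls from a peak of 150 to 90 before recovering: a drawdown of about -40%.
print(max_drawdown([100, 150, 90, 120, 200]))
```

A curve that only ever rises returns 0.0, so the measure is always zero or negative.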
What was hindsight — and why we're glad we checked
The static method made the conservative rule look spectacular on ETH: 21.7×, more than double the composite. Under honest walk-forward weights that becomes 6.9× — and it now sits below the plain composite (10.2×) on both return and Sharpe. Most of conservative's apparent dominance was the backtest quietly knowing the final correlation structure in advance.
That is not a disappointment — it is the system working. The entire point of stress-testing a backtest is to find the result that doesn't survive contact with point-in-time data before it informs a decision, not after. The cost of finding this now is a less dramatic headline. The cost of not finding it would have been trusting a number that hindsight built.
What we're doing about it
- Walk-forward weighting becomes the standard for backtests. Weights are computed point-in-time from the trailing window only. It is the strictly more correct method and it also removes the data-revision instability we wrote about earlier — a past day's call can no longer be rewritten by future data, because future data was never used to make it.
- The composite stays the headline strategy. It is the rule that is strong and stable under honest evaluation. Conservative and aggressive remain interesting as risk-posture variants, reported on walk-forward numbers, with no claim of dominance.
- Every comparison is re-run on the honest basis. The numbers in this piece are generated by a committed, re-runnable script from open data — same discipline as the rest of /regime.
Takeaways
The substance is intact and arguably stronger for being tested: a two-panel regime overlay improves risk-adjusted return and materially reduces drawdown versus buy-and-hold, on a single asset and on a diversified blend, regardless of how you weight the indicators. The piece of the earlier story that didn't survive — the eye-popping conservative multiple on ETH — was hindsight, and we'd rather publish the version that survives scrutiny than the version that flatters. That is the whole job.
Considerations
- Still backtested. Walk-forward removes look-ahead in the weights; it does not remove the hindsight in choosing which 20 indicators and which transforms to use. Out-of-sample live performance will differ.
- Short sample. 2018–2026 covers only about two crypto cycles. A longer sample would tighten every magnitude here.
- Constant cash assumption. A flat 2.6%/yr cash leg is used so the only variable across methods is the weighting; the production snapshot uses the real daily T-bill series, which shifts absolute multiples modestly.