Each strategy is run across the same set of independently dealt 8-deck shoes (strategies reset per shoe).
Per-shoe outputs are aggregated into a paired bootstrap CI on stake-weighted ROI; the same shoe-index
resample is applied across all strategies so the diff against the flat_banker baseline preserves
cross-strategy correlation. Two-sided bootstrap p-values are corrected via Bonferroni and Benjamini-Hochberg
across the N−1 non-baseline comparisons.
Verdicts apply a practical-significance threshold (0.10 percentage points): a p<0.01 result with effect below the threshold is labeled BUSTED, not CONFIRMED. The intent is to prevent large-N "statistically distinguishable but practically meaningless" effects from being honored as wins.