Backtest Overfitting and Statistical Validation: Backtesting Methodology

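A Wednesday afternoon, at the investment committee of a Shanghai quant fund in the style of 明汯 (Minghong) or 幻方 (High-Flyer). A researcher presents a momentum strategy: the L1 engine is event-driven (clean), and every item on the L2 realism checklist passes (PIT data, a survivorship-free CSI 300 universe, fills at the next bar's open, 10 bps two-way cost, no shorting). The reported Sharpe ratio over 2014-2023 is 2.5. The head of risk asks only one question: "How many parameter combinations did you run?" The researcher answers: "About a thousand: lookback 5 / 10 / 21 / 63 / 126 / 252, holding 1 / 5 / 10 / 21, CSI 300 quintile / decile / vigintile / full sample, rebalancing daily / weekly / biweekly / monthly." The head of risk types one line into her notebook, sqrt(2 * ln(1000)) - 0.577 / sqrt(2 * ln(1000)), and the screen prints 3.56. She turns the screen around: "Under the null hypothesis that every cell has a true Sharpe of 0, the expected best Sharpe from a 1000-cell search is about 3.56. Your 2.5 is below the null. Take it back and redo it." The researcher leaves the meeting. The Sharpe ratio itself is not the bug; the parameter sweep is. L1 and L2 cleaned up the engine; this lesson teaches three statistical validation methods that quantify the inflation a parameter sweep adds to the headline number, turning the best-of-grid Sharpe from an overstated figure into a credible estimate.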

The parameter-sweep problem

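A typical quant research project does not test one strategy; it tests a grid. A momentum strategy on the CSI 300 has at least four parameter dimensions: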

lookback window      ∈ {5, 10, 21, 63, 126, 252}              (6 values)
holding period       ∈ {1, 5, 10, 21}                          (4 values)
universe filter      ∈ {top-quintile, top-decile, top-vigintile, all}  (4 values)
rebalance frequency  ∈ {daily, weekly, biweekly, monthly}      (4 values)

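The full grid is 6 × 4 × 4 × 4 = 384 cells, which grows to ~1000 once volume-filter or sector-neutral variants are added. ML-heavy strategies (GBDT, deep networks) routinely run 10000+ hyperparameter cells. The backtest Sharpe of the best cell in the grid is structurally biased upward, even if the true Sharpe of every cell is 0. This bias is the multiple-testing inflation from 4.2.1 L3, now specialized to the parameter dimension.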
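To make the cell count concrete, here is a minimal sketch that enumerates the four axes listed above as a Cartesian product. The dictionary keys and the helper name `enumerate_grid` are illustrative choices, not part of any engine described in this course.

```python
from itertools import product

# Parameter axes as listed above; key names are illustrative.
GRID_AXES = {
    "lookback": [5, 10, 21, 63, 126, 252],
    "holding": [1, 5, 10, 21],
    "universe": ["top-quintile", "top-decile", "top-vigintile", "all"],
    "rebalance": ["daily", "weekly", "biweekly", "monthly"],
}

def enumerate_grid(axes):
    """Return one dict per parameter cell (Cartesian product of all axes)."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]

cells = enumerate_grid(GRID_AXES)
print(len(cells))   # 6 * 4 * 4 * 4 = 384
print(cells[0])
```

Every extra axis multiplies the count, which is how a four-axis grid of 384 cells reaches roughly 1000 once filter variants are added.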

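To make the framing clear: this is not a bug in the backtest engine. L1 + L2 already cleaned up the engine. It is a bug in how the engine is used, specifically in the parameter-search procedure that chooses which result gets reported. The fix is not a better engine but a statistical correction: quantify the bias and produce a credible estimate from the parameter-sweep backtest.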

The three canonical statistical validation methods for a parameter-sweep backtest, in order:

1. walk-forward parameter validation         — best params on [t-L, t], applied on [t, t+H]; honest test-window Sharpe is the estimate
2. deflated Sharpe at the backtest layer     — Bailey-Lopez de Prado formula with N = parameter_grid_size; convention DSR > 0.95 credible
3. probabilistic backtest overfitting (PBO)  — Bailey-Borwein-Lopez-de-Prado-Zhu combinatorial-CV; convention PBO < 0.2 credible, 0.2-0.5 cautious, > 0.5 consistent with pure overfitting

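Every parameter-sweep backtest must report these three metrics alongside the headline best-in-grid Sharpe. The four-row table is the deliverable that turns "best-cell Sharpe 2.5" into an answer to "is this strategy real?"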

Walk-forward parameter validation

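Conceptually the simplest of the three methods. At each step t, select the best parameters on the training window [t-L, t] and apply them on the test window [t, t+H]. The Sharpe over the concatenated test windows is the honest estimate.

This is the rolling out-of-sample validation from 4.2.1 L2, applied to parameter selection rather than only model fitting: at each step, the "model" is the strategy picked out by the parameters. Typical configuration: L = 252 (one trading year as the training window); H = 63 (one quarter as the test window); step size equal to H, so that test windows do not overlap. A 10-year dataset with L = 252 and H = 63 yields about 36 test windows ((2520 - 252) / 63 = 36); annualizing the pooled test-window returns gives the walk-forward Sharpe.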
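The window layout described above can be sketched as index triples. This is only an illustration: the helper name `walk_forward_windows` is hypothetical, and the 2520-day length assumes roughly 252 trading days per year over 10 years.

```python
def walk_forward_windows(n_days, train_window=252, test_window=63):
    """Return (train_start, train_end, test_end) index triples; test windows do not overlap."""
    windows = []
    # The last valid split point is n_days - test_window, hence the "+ 1" on the stop bound.
    for t in range(train_window, n_days - test_window + 1, test_window):
        windows.append((t - train_window, t, t + test_window))
    return windows

windows = walk_forward_windows(2520)   # assumed: 10 years x 252 trading days
print(len(windows))                    # 36
print(windows[0], windows[-1])         # (0, 252, 315) (2205, 2457, 2520)
```

Each triple reads as: train on rows [train_start, train_end), then test on rows [train_end, test_end).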

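The walk-forward Sharpe is usually lower than the headline best-in-grid Sharpe, because the parameters that are best on train are not always best on test (parameter instability). For a 1000-cell momentum grid with a headline best-in-grid Sharpe of 2.5, the walk-forward Sharpe typically lands in the 1.0-1.5 range, a further 40-60% drop on top of the L2 realism tax.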

import numpy as np

def walk_forward_parameter_validation(returns_matrix, dates, train_window=252, test_window=63):
    # returns_matrix shape: (T dates, N parameter cells); dates has length T
    # step by test_window so test windows are non-overlapping
    test_returns = []
    # "+ 1" keeps the final test window when it ends exactly on the last date
    for t in range(train_window, len(dates) - test_window + 1, test_window):
        train_slice = returns_matrix[t - train_window : t]
        test_slice = returns_matrix[t : t + test_window]
        # identify the parameter cell with the highest Sharpe on the training window
        train_sharpe = train_slice.mean(axis=0) / train_slice.std(axis=0) * np.sqrt(252)
        best_cell_train = int(np.argmax(train_sharpe))
        # evaluate that cell on the test window
        test_returns.append(test_slice[:, best_cell_train])
    # pool every test window into one out-of-sample return series
    test_returns_concat = np.concatenate(test_returns)
    wf_sharpe = test_returns_concat.mean() / test_returns_concat.std() * np.sqrt(252)
    return wf_sharpe

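The function returns the annualized Sharpe of the concatenated test-window returns. This number is what the strategy actually earned in pseudo-out-of-sample windows; parameter-sweep inflation cannot leak in, because at every step the parameters are chosen using only data that precedes the test window.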
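One way to see why the walk-forward number is harder to inflate is to run both estimators on pure noise. The sketch below is self-contained and uses a compact walk-forward helper of its own. The i.i.d. Gaussian returns, the 1% daily volatility, and the seed are assumptions chosen only for illustration.

```python
import numpy as np

def annualized_sharpe(x, axis=0):
    """Annualized Sharpe of daily returns, assuming 252 trading days per year."""
    return x.mean(axis=axis) / x.std(axis=axis) * np.sqrt(252)

def walk_forward_sharpe(returns_matrix, train_window=252, test_window=63):
    """Pick the best cell on each training window, score it on the following test window."""
    picked = []
    n_days = returns_matrix.shape[0]
    for t in range(train_window, n_days - test_window + 1, test_window):
        train = returns_matrix[t - train_window:t]
        best = int(np.argmax(annualized_sharpe(train)))
        picked.append(returns_matrix[t:t + test_window, best])
    return float(annualized_sharpe(np.concatenate(picked)))

rng = np.random.default_rng(0)
# Null world: 2520 days x 1000 cells, every cell has true Sharpe 0.
null_returns = rng.normal(0.0, 0.01, size=(2520, 1000))

headline = float(annualized_sharpe(null_returns).max())   # best-in-grid over the full sample
wf = walk_forward_sharpe(null_returns)
print(f"best-in-grid Sharpe: {headline:.2f}")
print(f"walk-forward Sharpe: {wf:.2f}")
```

Under these assumptions the best-in-grid figure is positive purely because of selection, while the walk-forward figure scatters around zero. The exact values depend on the seed and on the sample length.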

Deflated Sharpe at the backtest layer

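The Bailey-López de Prado deflated Sharpe formula (introduced in 4.2.1 L3) is applied here to the parameter grid. Setup: the researcher reports the best Sharpe across N parameter combinations. Under the null hypothesis (every combination has a true Sharpe of 0), the expected best-of-N Sharpe, expressed in units of the standard deviation of the Sharpe estimates across trials (taken as 1 here), is approximately: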

E[max Sharpe over N trials] ≈ sqrt(2 * ln N) - γ / sqrt(2 * ln N)

where γ ≈ 0.577 is the Euler-Mascheroni constant
(the constant from extreme-value theory for Gaussian maxima)

N = 20    -> 2.21
N = 100   -> 2.84
N = 1000  -> 3.56
N = 10000 -> 4.16

E[\max \text{Sharpe over } N] \approx \sqrt{2 \ln N} - \frac{\gamma}{\sqrt{2 \ln N}}, \qquad \gamma \approx 0.577
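The closed-form approximation above is easy to evaluate directly. A minimal sketch follows; the function name `expected_max_sharpe` is illustrative, and the result is in units of the standard deviation of the Sharpe estimates across trials, which the formula implicitly takes as 1.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials):
    """Closed-form approximation sqrt(2 ln N) - gamma / sqrt(2 ln N)."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    root = math.sqrt(2.0 * math.log(n_trials))
    return root - EULER_GAMMA / root

for n in (20, 100, 1000, 10000):
    print(n, round(expected_max_sharpe(n), 2))
```

Because the dependence on N runs through sqrt(ln N), multiplying the grid size by ten raises the benchmark by a roughly constant step rather than by a constant factor.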

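This is the headline calculation the head of risk used in the opening to send the 2.5 Sharpe back. Under the null hypothesis that the true Sharpe is 0 everywhere in the grid, a 1000-cell search is expected to print a best-in-grid Sharpe of about 3.56. A reported 2.5 is below the null: the strategy did not even beat random noise.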

The deflated Sharpe is the observed Sharpe after a probabilistic correction. The full formula:

deflated_Sharpe ≈ observed_Sharpe - E[max Sharpe over N under null]    (rough operational shorthand)

DSR(observed_Sharpe, N, sigma_S, skew, kurt, T)                         (full citation form)

# Bailey & López de Prado, Journal of Portfolio Management 2014
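# sigma_S = cross-sectional standard deviation of the Sharpe ratio across the N cells
# skew, kurt = third and fourth moments of the return series
# T = sample size (number of observation periods)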

\text{DSR} \approx \text{observed\_Sharpe} - E[\max \text{Sharpe over } N \mid H_0]

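DSR returns the probability that the observed Sharpe is truly above some threshold (usually 0). Convention: DSR > 0.95 is treated as credible (a 95% probability that the observed Sharpe is not a multiple-testing artifact). For a 1000-cell momentum grid with a headline best-in-grid Sharpe of 2.5, the deflated Sharpe typically lands around 0.8: the residual after subtracting the ~3.56 expected best-of-1000 and weighting by the cross-sectional Sharpe variance and the return-series moments.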
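The probability form of DSR can be sketched with the standard library alone. The code follows the Bailey-López de Prado expression as it is commonly stated (a quantile-based benchmark for the expected maximum, then a probabilistic Sharpe test against that benchmark). Treat it as an illustration to check against the paper, not as a reference implementation. The function names and the input values in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649
_STD_NORMAL = NormalDist()

def expected_max_sharpe_null(n_trials, sigma_s):
    """Expected best-of-N Sharpe under the null, scaled by the cross-trial std sigma_s."""
    q1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    q2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sigma_s * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)

def deflated_sharpe_ratio(observed_sharpe, n_trials, sigma_s, skew, kurt, n_obs):
    """Probability that the true Sharpe exceeds the null best-of-N benchmark.

    observed_sharpe and sigma_s are per-period (not annualized); kurt is the
    raw kurtosis (3.0 for a normal distribution); n_obs is the sample length T.
    """
    sr0 = expected_max_sharpe_null(n_trials, sigma_s)
    variance_term = 1.0 - skew * observed_sharpe + (kurt - 1.0) / 4.0 * observed_sharpe ** 2
    z = (observed_sharpe - sr0) * math.sqrt(n_obs - 1) / math.sqrt(variance_term)
    return _STD_NORMAL.cdf(z)

# Hypothetical inputs: headline 2.5 annualized, converted to a daily Sharpe.
daily_sharpe = 2.5 / math.sqrt(252)
sigma_s_daily = 0.75 / math.sqrt(252)   # assumed dispersion of Sharpe across cells
dsr = deflated_sharpe_ratio(daily_sharpe, 1000, sigma_s_daily, skew=0.0, kurt=3.0, n_obs=2520)
print(f"DSR = {dsr:.3f}")
```

The result is sensitive to sigma_S: the same headline Sharpe can clear or fail the 0.95 convention depending on how dispersed the Sharpe estimates are across the grid, which is why the full citation form takes sigma_S as an explicit argument.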

Probability of backtest overfitting (PBO)

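The most powerful and most compute-intensive of the three methods. The Bailey-Borwein-López de Prado-Zhu procedure (2017) estimates the probability that the in-sample-best strategy performs below the test-set median out of sample. The steps: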

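1. Take the parameter grid (N cells) and the daily return matrix: rows are dates (T rows), columns are parameter combinations (N columns).
2. Cut the T dates into S contiguous, non-overlapping blocks (typically S = 16).
3. For each of the C(S, S/2) = C(16, 8) = 12870 ways of choosing S/2 blocks for training and the remaining S/2 for testing:
- On the training set, find the parameter combination with the highest Sharpe (the in-sample best).
- Rank this combination on the test set.
- Record the test rank as a fraction of the total number of combinations, counted from the top (near 0 = the in-sample best is also the out-of-sample best; 1 = the in-sample best is the out-of-sample worst).
4. PBO is the share of subset choices in which the in-sample best lands below the test-set median (test rank > 0.5).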
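The combinatorial step above can be sketched on block indices alone, before any returns are involved. The generator name `cscv_splits` is an illustrative choice.

```python
from itertools import combinations
from math import comb

def cscv_splits(n_blocks):
    """Yield (train_blocks, test_blocks) index tuples for every half/half split."""
    half = n_blocks // 2
    for train_idx in combinations(range(n_blocks), half):
        test_idx = tuple(i for i in range(n_blocks) if i not in train_idx)
        yield train_idx, test_idx

print(comb(16, 8))   # 12870
first_train, first_test = next(cscv_splits(16))
print(first_train, first_test)
```

Each split uses every block exactly once, either for training or for testing, so train and test always have the same length.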

PBO has an interpretable scale:

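PBO < 0.2          credible: the in-sample best reliably wins out of sample
0.2 <= PBO < 0.5   cautious: the in-sample best wins more often than it loses out of sample, but the overfitting margin is not negligible
0.5 <= PBO < 0.7   backtest consistent with pure overfitting: out of sample, the in-sample best is no better than a coin flip
PBO >= 0.7         adversarial overfitting: the parameter-search procedure is actively selecting against out-of-sample performance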
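The scale above maps directly onto a small lookup, which is handy when PBO is written into an automated report. The function name is illustrative; the band edges are the ones in the table.

```python
def pbo_credibility_band(pbo):
    """Map a PBO value in [0, 1] to the credibility band from the scale above."""
    if not 0.0 <= pbo <= 1.0:
        raise ValueError("pbo must lie in [0, 1]")
    if pbo < 0.2:
        return "credible"
    if pbo < 0.5:
        return "cautious"
    if pbo < 0.7:
        return "consistent with pure overfitting"
    return "adversarial overfitting"

print(pbo_credibility_band(0.35))   # cautious
```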

A 1000-cell momentum grid typically has a PBO in the 0.3-0.5 range, depending on signal strength. Reference implementation:

from itertools import combinations

import numpy as np

def probabilistic_backtest_overfitting(returns_matrix, n_blocks=16):
    # returns_matrix shape: (T dates, N parameter cells)
    T, N = returns_matrix.shape
    block_size = T // n_blocks   # any remainder rows at the end are dropped
    blocks = [returns_matrix[i * block_size : (i + 1) * block_size] for i in range(n_blocks)]
    half = n_blocks // 2
    test_ranks = []
    for train_idx in combinations(range(n_blocks), half):
        train_set = set(train_idx)
        train_data = np.vstack([blocks[i] for i in range(n_blocks) if i in train_set])
        test_data = np.vstack([blocks[i] for i in range(n_blocks) if i not in train_set])
        train_sharpe = train_data.mean(axis=0) / train_data.std(axis=0)
        in_sample_best_idx = int(np.argmax(train_sharpe))
        test_sharpe = test_data.mean(axis=0) / test_data.std(axis=0)
        # fraction of cells the in-sample-best strictly beats on the test set
        frac_beaten = (test_sharpe < test_sharpe[in_sample_best_idx]).sum() / N
        # rank from the top as a fraction of N: near 0 = best on test, 1 = worst on test
        test_rank = 1.0 - frac_beaten
        test_ranks.append(test_rank)
    # PBO = fraction of subset choices where in-sample-best lands below test median
    pbo = float(np.mean(np.array(test_ranks) > 0.5))
    return pbo

# Bailey, Borwein, López de Prado, Zhu (2017)

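With a 10-year dataset, n_blocks = 16, and N = 1000, this is 12,870 splits × 1000 cells of Sharpe computations on each of the train and test halves: heavy, but it finishes in a few minutes on a single machine. Production implementations (for example mlfinlab, a package that implements methods from López de Prado's work) vectorize across columns and accelerate the inner loop with numba.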
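A quick sanity check for any PBO implementation is to run it on data where the answer is known. The sketch below is a compact, self-contained variant with 8 blocks (70 splits) so that it runs in seconds. It compares a pure-noise grid against a grid in which one cell has a genuine edge. The grid size, the volatility, and the 0.003 daily edge are assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np

def pbo_cscv(returns_matrix, n_blocks=8):
    """Share of half/half splits where the in-sample-best cell ranks in the bottom half on test."""
    n_days, n_cells = returns_matrix.shape
    block_size = n_days // n_blocks
    blocks = [returns_matrix[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    below_median = []
    for train_idx in combinations(range(n_blocks), n_blocks // 2):
        test_idx = [i for i in range(n_blocks) if i not in train_idx]
        train = np.vstack([blocks[i] for i in train_idx])
        test = np.vstack([blocks[i] for i in test_idx])
        best = int(np.argmax(train.mean(axis=0) / train.std(axis=0)))
        test_sharpe = test.mean(axis=0) / test.std(axis=0)
        # Fraction of cells that the in-sample-best strictly beats on the test half.
        frac_beaten = float((test_sharpe < test_sharpe[best]).mean())
        below_median.append(frac_beaten < 0.5)
    return float(np.mean(below_median))

rng = np.random.default_rng(1)
noise = rng.normal(0.0, 0.01, size=(2000, 200))   # every cell has true Sharpe 0
signal = noise.copy()
signal[:, 0] += 0.003                             # hypothetical: cell 0 has a real daily edge

pbo_noise = pbo_cscv(noise)
pbo_signal = pbo_cscv(signal)
print(f"PBO, pure noise    : {pbo_noise:.2f}")
print(f"PBO, one real cell : {pbo_signal:.2f}")
```

With a strong genuine cell, the in-sample best is the same cell on every split and stays at the top on test, so PBO collapses toward 0. On pure noise the in-sample best has no reason to stay above the test median, and the estimate is noisy because the splits share data.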

The four-row reporting table

The credibility deliverable of a parameter-sweep backtest is not a single number but a table with at least four rows:

1. headline best-in-grid Sharpe
2. walk-forward Sharpe
3. deflated Sharpe (N = grid_size)
4. PBO (S = 16 blocks)
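The four rows can be carried around as a small record so that no metric is reported without the others. This is a minimal sketch: the class name `SweepReport` and its fields are hypothetical, and the numbers are illustrative values in the ranges quoted in this lesson.

```python
from dataclasses import dataclass

@dataclass
class SweepReport:
    headline_sharpe: float
    walk_forward_sharpe: float
    deflated_sharpe: float
    pbo: float
    grid_size: int
    n_blocks: int = 16

    def rows(self):
        """Return the four (label, value) rows in reporting order."""
        return [
            ("headline best-in-grid Sharpe", self.headline_sharpe),
            ("walk-forward Sharpe", self.walk_forward_sharpe),
            (f"deflated Sharpe (N = {self.grid_size})", self.deflated_sharpe),
            (f"PBO (S = {self.n_blocks} blocks)", self.pbo),
        ]

    def render(self):
        """Format the rows as an aligned plain-text table."""
        width = max(len(label) for label, _ in self.rows())
        return "\n".join(f"{label:<{width}} : {value:.2f}" for label, value in self.rows())

report = SweepReport(2.5, 1.0, 0.8, 0.35, grid_size=1000)
print(report.render())
```

Keeping the grid size and block count inside the record means the table always states the N and S that the deflated Sharpe and PBO were computed with.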

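Canonical reporting template: "I backtested N parameter combinations on [universe] from [start] to [end]; I report the headline / walk-forward / deflated Sharpe and PBO; based on the joint reading, I recommend deploy / paper-trade / abandon."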

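When the three follow-up metrics tell the same story (low walk-forward + low deflated + high PBO), the backtest is severely overfit and the headline Sharpe should be discarded. When they conflict, the researcher needs to investigate rather than blindly accept or reject. A high walk-forward Sharpe with a low deflated Sharpe suggests that the train-best is also reliable on test, but that the overall best-in-grid comes from a different cell (often one that happened to win big in a single period). A high walk-forward Sharpe with a high PBO suggests a structural problem in the parameter-search procedure itself.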

Worked example: a 1000-cell momentum grid on the CSI 300 (510300)

Run the full grid through the honest L1 + L2 engine and realism checklist, on the 510300 CSI 300 ETF over 2014-2023:

parameter axes  : lookback × holding × universe × rebalance = 384 cells, extended to ~1000 with filter variants
headline Sharpe : ~2.5     (best in grid)
walk-forward    : ~1.0
deflated Sharpe : ~0.8     (with N = 1000)
PBO             : ~0.35    (cautious; in-sample-best lands at or above the test median in ~65% of subset choices)

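Reading: every follow-up metric pulls the credible Sharpe estimate down. The "true" Sharpe of this momentum strategy is in the ~0.8-1.0 range, not the headline 2.5. The strategy is probably deployable (PBO < 0.5), but capacity should be sized on the realistic Sharpe, not the inflated one. When the researcher from the opening returns to the investment committee, this four-row table is what they bring.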

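China's quant private-fund industry had converged on L3-style multi-metric reporting by 2018-2020. Quant AUM surged to a peak on the order of RMB 1 trillion over 2018-2021, and performance weakened over 2022-2024, with a substantial part attributed to capacity overfitting: strategies that ran well at RMB 100-500 million no longer worked at RMB 5-10 billion. 明汯 (Minghong) and 幻方 (High-Flyer) have both discussed multi-metric reporting discipline at public investor meetings; small and mid-sized funds write PBO and deflated Sharpe into their roadshow materials as a credibility signal for fundraising. Chinese quant grids often carry extra dimensions (IPO-window filters, ST-designation handling, price-limit handling, position policy during trading halts) that push N from 1000 to 2000-5000, and the multiple-testing penalty grows accordingly.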

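Discipline rule: every parameter-sweep backtest reports walk-forward + deflated Sharpe + PBO; the headline best-in-grid Sharpe on its own is misleading; the multi-metric report is the credibility evidence consumed by the deployment handoff in the next lesson, L4.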

Formula Explorer

E[\max\text{Sharpe over } N] \approx \sqrt{2 \ln N} - \frac{\gamma}{\sqrt{2 \ln N}}, \quad \gamma \approx 0.577

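The expected best-of-N Sharpe under the null hypothesis. Plugging in N = 1000 and γ ≈ 0.577, the formula returns about 3.56. Plugging in N = 10000 returns about 4.16. Growth is sub-logarithmic but steady; an ML hyperparameter sweep of 10000-100000 cells can produce a nominal Sharpe of 4-5 under the null alone.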

Exercise

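You are running a 1000-cell momentum strategy grid on the 510300 CSI 300 ETF over the window 2014-01-01 to 2023-12-31. Grid parameter dimensions: lookback ∈ {5, 10, 21, 63, 126, 252}, holding period ∈ {1, 5, 10, 21}, universe filter ∈ {top-quintile, top-decile, top-vigintile, all}, rebalance frequency ∈ {daily, weekly, biweekly, monthly}. The headline best-in-grid Sharpe is 2.5. Perform four computations and report the results.

(i) Using E[max Sharpe over N] ≈ sqrt(2 * ln N) - γ / sqrt(2 * ln N) with N = 1000 and γ ≈ 0.577, compute the expected best-of-N Sharpe under the null; report the value (it should be ≈ 3.56); state whether the headline 2.5 is structurally consistent with pure overfitting (yes: the headline is below the expected best-of-N under the null, which suggests the strategy did not even beat the null).

(ii) Implement the walk_forward_parameter_validation function with train_window=252 and test_window=63, run it on a synthetic return matrix that reproduces the headline 2.5, and report the walk-forward Sharpe (a typical momentum grid should land in the ~0.8-1.2 range).

(iii) Implement the probabilistic_backtest_overfitting function with n_blocks=16, run it on the same return matrix, and report PBO (it should land in the ~0.3-0.5 range); state which credibility band it falls in (< 0.2 credible / 0.2-0.5 cautious / > 0.5 consistent with pure overfitting).

(iv) Assemble the four-row reporting table (headline, walk-forward, deflated, PBO) and give a one-sentence deployment recommendation (deploy / paper-trade / abandon) based on the joint reading.

Report the four answers as a single table.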

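Hint

For (i): 2 * ln(1000) ≈ 13.816, sqrt(13.816) ≈ 3.717, 0.577 / 3.717 ≈ 0.155, so the result is about 3.56. Other approximations to the expected maximum give somewhat different values, but the point is the structural conclusion that 2.5 < expected best-of-N.

Hint

For (iii): there are C(16, 8) = 12870 subset choices. For each one, find the argmax column on train (the in-sample best), compute the rank of that same column on test, and record it as a fraction; PBO = the share of the 12870 choices whose rank is > 0.5 (below the median).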

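Bridge to L4

You now have a credible Sharpe ratio estimate of ~0.8-1.0, the joint result of the L1 honest engine, the L2 realism checklist, and the L3 statistical validation corrections. This is the input to the deployment ladder. L4 teaches the four artifacts that turn the L3 deliverable into a live strategy: the ten-section backtest readout report (deflated Sharpe and PBO are mandatory sections), the four-stage deployment ladder (backtest → paper trade → shadow trade → full live), the four-dimension deviation decomposition for daily backtest-vs-live reconciliation, and the four canonical kill-switch policies. Statistical validation produces the credibility numbers; the deployment handoff turns the numbers into a track record. Concepts used throughout this lesson, including the Sharpe ratio, information ratio, maximum drawdown, alpha decay, and factor models, all carry over into L4.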

