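Research Toolchain and Reproducibility: Research Workflow and Discipline

Six months after the original research PR shipped, a senior researcher at a private quant fund hands the report to a junior teammate. "Rerun this. The portfolio manager wants to know whether the signal still works on 2024 data." The junior pulls the notebook off the shared drive, opens it, and hits the first error immediately: ImportError: cannot import name 'X' from 'pandas'. The local pandas is 2.4; the notebook was written against 1.5. The junior runs pip install pandas==1.5, and another cell breaks because the installed numpy is now incompatible. After two hours of dependency wars the notebook runs, but its numbers do not match the report. The seed was never recorded; the data file on the shared drive has been touched since the original run; the git commit SHA of the original code state cannot be found. The "result" cannot be reproduced. The senior researcher walks back to the portfolio manager. "We don't have a reproducible signal. We have the memory of a signal." This lesson is the module's engineering capstone: the layer that makes the L1 experiment log auditable, makes the L2 test-set lock enforceable, and lets another researcher verify the L3 DSR six months later. By the end of this section you should be able to produce a research PR that a teammate can rerun from a single command.

The six-layer canonical research stack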

1. notebook vs script                       - Jupyter for exploration; .py for production; transition within two weeks
2. version-pinned dependencies              - pyproject.toml + uv.lock / poetry.lock / requirements.txt with hashes
3. seeded RNGs                              - every randomness source seeded; seed logged per run
4. git + feature-branch + pull-request workflow - every project a feature branch; result is a PR
5. experiment tracking                      - mlflow / wandb / SQLite log of hyperparameters / seed / data window / metrics / artefact path / git_commit_sha
6. code-review checklist                    - five binary checks enforced at PR review

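Rule: every research result must be reproducible from a single command, given the locked dependencies, the locked data snapshot, the logged seed, and the git commit SHA. The six layers are the engineering plumbing that makes this rule enforceable. Layer (1) separates idea generation (the notebook) from the result claim (the script); layer (2) freezes the runtime; layer (3) freezes randomness; layer (4) packages a result as a versioned diff; layer (5) records lineage; layer (6) enforces the discipline at merge time.

The eight artefacts of a research PR

Every research result that enters paper trading ships as a pull request. The PR packages eight artefacts in this order: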

(a) notebook                           - frozen at result-claim moment
(b) production script                  - regeneratable from CLI
(c) experiment log                     - CSV / SQLite / mlflow run-ids
(d) pre-registration document          - the L1 six-field template, committed at project start
(e) in-sample result
(f) single out-of-sample evaluation result
(g) multiple-testing correction        - Bonferroni / BH-FDR / DSR with N counter from the experiment log
(h) write-up                           - the curated narrative for human review

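Rule: if the PR is not merged, the research result does not stand; if the code-review checklist does not pass, the PR is not merged. The eight artefacts close the chain from the earlier lessons: (d) ties to the L1 pre-registration; (e) and (f) tie to the L2 discipline of touching the test set exactly once; (g) ties to the L3 corrections; (c) is the source of the trial counter used in (g); (a) and (b) are the L4 notebook-versus-script discipline; (h) is the face the result shows to human readers.

Experiment log schema

The minimal schema for the experiment log is a single runs table in SQLite: nine columns, in this order, with fixed SQL types: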

CREATE TABLE runs (
    run_id              TEXT    PRIMARY KEY,
    timestamp           TEXT    NOT NULL,
    hyperparameters_json TEXT   NOT NULL,
    seed                INTEGER NOT NULL,
    data_window         TEXT    NOT NULL,
    metric_name         TEXT    NOT NULL,
    metric_value        REAL    NOT NULL,
    artefact_path       TEXT,
    git_commit_sha      TEXT    NOT NULL
);

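The table name runs, the nine column names in this order (run_id, timestamp, hyperparameters_json, seed, data_window, metric_name, metric_value, artefact_path, git_commit_sha), and the SQL types are byte-identical across regional editions of this lesson. The git_commit_sha column is what makes the log unforgeable: every metric is bound to one specific code state. A row claiming a Sharpe of 2.0 at some git_commit_sha can be checked at any time: git checkout <sha>, restore the data snapshot, set the seed, rerun, compare. If the metric does not match, the row is a lie. That property is what gives the code-review checklist below real teeth.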
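As a minimal sketch of how the schema itself enforces part of the discipline, the snippet below creates the runs table in an in-memory SQLite database and shows that a row without a git_commit_sha is rejected by the NOT NULL constraint. The helper names make_log and try_insert, the run ids, and the all-zero SHA are hypothetical placeholders for illustration, not part of the lesson's code.

```python
import sqlite3

# The lesson's nine-column schema, unchanged.
SCHEMA = """
CREATE TABLE runs (
    run_id               TEXT    PRIMARY KEY,
    timestamp            TEXT    NOT NULL,
    hyperparameters_json TEXT    NOT NULL,
    seed                 INTEGER NOT NULL,
    data_window          TEXT    NOT NULL,
    metric_name          TEXT    NOT NULL,
    metric_value         REAL    NOT NULL,
    artefact_path        TEXT,
    git_commit_sha       TEXT    NOT NULL
)
"""


def make_log() -> sqlite3.Connection:
    """Create an in-memory experiment log (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def try_insert(conn: sqlite3.Connection, row: tuple) -> bool:
    """Insert one row; return False if a schema constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_log()
# Placeholder values; the 40-zero SHA stands in for a real commit hash.
good = ("run-001", "2024-04-01T09:30:00+0800", '{"lookback": 5}', 42,
        "2015-01-01/2021-12-31", "sharpe", 1.1, None, "0" * 40)
no_sha = ("run-002",) + good[1:8] + (None,)

accepted_good = try_insert(conn, good)
accepted_no_sha = try_insert(conn, no_sha)
print(accepted_good, accepted_no_sha)
```

The nullable artefact_path is accepted as None, while the missing SHA is not; a metric with no code state attached never reaches the log.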

The five-item code-review checklist

1. test set evaluated exactly once                    - search the experiment log for runs that touched the test set; expect exactly one
2. seed logged for every run                          - search the log for null seeds; expect zero
3. data window justified in pre-registration          - the window in the pre-registration document matches the window in the result
4. universe is survivorship-bias-free                 - the universe definition references `universe(date, symbol)` from 4.1.1 L4
5. multiple-testing correction applied with the experiment-log N - DSR / Bonferroni / BH-FDR reported alongside the headline metric using the actual count of trials from the log

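Rule: any failed item blocks the merge until it is fixed; this is the engineering implementation of L1 + L2 + L3. Check (1) is the engineering teeth behind the L2 test-set lock. Check (2) is the reproducibility floor: a run without a logged seed cannot be reproduced byte for byte. Check (3) is the pre-registration consistency check: a result whose window differs from the pre-registration is either an unlabelled deviation (which adds one to the trial counter used for the multiple-testing correction) or a documentation bug, and in either case the PR cannot merge until the discrepancy is resolved. Check (4) is the survivorship-free universe from 4.1.1 L4; check (5) makes the L3 correction auditable by using the N from the experiment log.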
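Checks (1), (2), and the N used by check (5) can be read straight off the log. The sketch below is one possible way to do that, under a loudly stated assumption: the lesson does not say how a test-set run is marked, so the code assumes a hypothetical convention in which data_window starts with "test:" for test-set runs, and it counts the remaining rows as in-sample trials. The demo table is a reduced stand-in with only the columns the queries need.

```python
import sqlite3


def audit_log(conn: sqlite3.Connection, test_prefix: str = "test:") -> dict:
    """Run the log-based review checks against a `runs` table.

    Assumption (not from the lesson): test-set runs are labelled by a
    data_window value that starts with `test_prefix`.
    """
    pattern = test_prefix + "%"
    test_touches = conn.execute(
        "SELECT COUNT(*) FROM runs WHERE data_window LIKE ?", (pattern,)
    ).fetchone()[0]
    null_seeds = conn.execute(
        "SELECT COUNT(*) FROM runs WHERE seed IS NULL"
    ).fetchone()[0]
    # Assumed definition of N: every logged run that is not a test-set run.
    n_trials = conn.execute(
        "SELECT COUNT(*) FROM runs WHERE data_window NOT LIKE ?", (pattern,)
    ).fetchone()[0]
    return {
        "test_touches": test_touches,
        "null_seeds": null_seeds,
        "n_trials": n_trials,
        "check_1_pass": test_touches == 1,  # test set evaluated exactly once
        "check_2_pass": null_seeds == 0,    # seed logged for every run
    }


# Reduced demo table: only the columns the audit reads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, seed INTEGER, data_window TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        ("r1", 1, "is:2015-2021"),
        ("r2", 2, "is:2015-2021"),
        ("r3", 3, "is:2015-2021"),
        ("r4", 3, "test:2022-2023"),
    ],
)
report = audit_log(conn)
print(report)
```

Checks (3) and (4) compare the log against documents outside the database, so they stay manual in this sketch.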

Code: seed everything

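def seed_all(seed: int) -> None:
    """Seed every randomness source the project might touch.

    Two runs with the same seed must produce byte-identical metrics on the same
    hardware and the same lock-file environment.
    """
    import os
    import random
    import numpy as np
    random.seed(seed)
    np.random.seed(seed)  # seeds NumPy's legacy global RNG only
    # PYTHONHASHSEED is read at interpreter startup: this assignment affects
    # child processes, not the running one. Export it before launching Python
    # to pin str/bytes hashing here too. Valid values: 0 to 4294967295.
    os.environ['PYTHONHASHSEED'] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch is optional
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # tensorflow is optional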

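The function name seed_all, the parameter seed: int, the randomness sources (random, numpy, PYTHONHASHSEED, torch, torch.cuda, tensorflow), and the rule "two runs with the same seed must produce byte-identical metrics on the same hardware and the same lock-file environment" are byte-identical across regional editions of this lesson. Wrapping the optional dependencies (torch, tensorflow) in try / except ImportError keeps the function safe in environments where those libraries are missing.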
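One caveat is worth a sketch: CPython reads PYTHONHASHSEED once, at interpreter startup, so assigning os.environ['PYTHONHASHSEED'] inside seed_all pins string hashing only for processes launched afterwards. The example below shows the variable taking effect in a child interpreter; the helper name child_hash and the sample string are hypothetical.

```python
import os
import subprocess
import sys


def child_hash(seed: int, text: str = "momentum") -> str:
    """Return hash(text) as computed by a fresh interpreter started with
    PYTHONHASHSEED=seed (hypothetical helper for illustration)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", f"print(hash({text!r}))"], env=env
    )
    return out.decode().strip()


# Two separate interpreters, same hash seed: the string hash is identical.
first = child_hash(42)
second = child_hash(42)
print(first == second)
```

For the process that runs the experiment itself, the practical consequence is to set the variable in the launching shell or in the Makefile target that wraps the single repro command.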

Code: log one run

def log_run(run_id, hyperparameters, seed, data_window,
            metric_name, metric_value, artefact_path):
    """Insert one row into the experiments.db `runs` table.

    Retrieves the current git commit SHA via subprocess; binds the metric to the
    specific code state so the row cannot be faked retroactively.
    """
    import json
    import sqlite3
    import subprocess
    import time
    # HEAD does not capture uncommitted edits, so commit before running.
    # Outside a git repository this raises CalledProcessError and nothing is logged.
    sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
    conn = sqlite3.connect('experiments.db')
    try:
        # `with conn` commits on success and rolls back if the INSERT raises.
        with conn:
            conn.execute(
                "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
                (run_id, time.strftime('%Y-%m-%dT%H:%M:%S%z'),
                 json.dumps(hyperparameters), seed, data_window,
                 metric_name, metric_value, artefact_path, sha),
            )
    finally:
        conn.close()  # close the connection even when the INSERT fails

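The function name log_run, the parameter names, the SQLite file name experiments.db, the expression subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip() for retrieving the git commit SHA, and the rule "every metric written into any report must be traceable back to a row in this table" are byte-identical across regional editions of this lesson.

Code: pyproject.toml skeleton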

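[project]
name = "alpha-research-momentum"
version = "0.1.0"
description = "5-day momentum signal research for 510300"
requires-python = ">=3.11"
# PEP 621: dependencies is an array of requirement strings, not a table.
dependencies = [
    "numpy>=1.26",
    "pandas>=2.2",
    "scikit-learn>=1.4",
    "matplotlib>=3.8",
    "mlflow>=2.10",
    "jupyter>=1.0",
]

# uv writes uv.lock next to this file by default; no [tool.uv] entry is needed for that.

The TOML keys ([project], dependencies), the dependency names (numpy, pandas, scikit-learn, matplotlib, mlflow, jupyter), and the Python version requirement (>=3.11) are byte-identical across regional editions of this lesson. Rule: commit the lock file (uv.lock) to git together with pyproject.toml; deployment installs from the lock file with uv sync --locked. Without a lock file, "version-pinned dependencies" is only a slogan: numpy>=1.26 admits any future numpy release, including breaking ones, whereas the lock file freezes the full transitive closure.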

The four canonical inputs for reproducing a result

1. the locked dependency set   - uv.lock or poetry.lock or requirements.txt with hashes, committed to git
2. the locked data snapshot    - S3 / OSS partitioned parquet with a fixed version tag, referenced in the config
3. the logged seed             - from the experiment log
4. the git commit SHA          - from the experiment log; reproduce by `git checkout <sha>`

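Rule: every research result must be reproducible from a single command given these four inputs; if another researcher holding all four still cannot reproduce it, the result is not reproducible and therefore is not a research result. The single command is typically python run_experiment.py --config=<config-id> or make repro-<run-id>; the four inputs configure the deterministic universe in which that command runs.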
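One way to make the four inputs explicit is a small manifest that refuses to produce the repro command while any input is missing. This is a sketch under assumptions: the class ReproInputs, its field names, the snapshot tag, the config id, and the all-zero SHA are hypothetical; only the command shape python run_experiment.py --config=... comes from the lesson.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ReproInputs:
    """The four canonical inputs; None means the input was never recorded."""
    lock_file: Optional[str]          # e.g. "uv.lock", committed to git
    data_snapshot_tag: Optional[str]  # fixed version tag of the data snapshot
    seed: Optional[int]               # from the experiment log
    git_commit_sha: Optional[str]     # from the experiment log


def missing_inputs(inputs: ReproInputs) -> list[str]:
    """Names of inputs that are absent. A seed of 0 is valid, so test for None."""
    return [f.name for f in fields(inputs) if getattr(inputs, f.name) is None]


def repro_command(inputs: ReproInputs, config_id: str) -> str:
    """Return the single repro command, or raise if the result is not reproducible."""
    missing = missing_inputs(inputs)
    if missing:
        raise ValueError(f"not reproducible, missing inputs: {missing}")
    return f"python run_experiment.py --config={config_id}"


# Placeholder values for illustration only.
complete = ReproInputs("uv.lock", "snapshot-v1", 0, "0" * 40)
incomplete = ReproInputs("uv.lock", "snapshot-v1", None, "0" * 40)

print(repro_command(complete, "momentum-510300"))
print(missing_inputs(incomplete))
```

The incomplete manifest mirrors the opening anecdote: with the seed unrecorded there is no command to hand to a teammate.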

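The capstone view: this lesson ties together the L1 experiment log, the L2 test-set lock, and the L3 trial counter and DSR, and packages them into the engineering plumbing that makes the whole stack reproducible. The capstone exercise below is built around that integrated design.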

Formula Explorer

\text{repro\_score} = w_1 \cdot \text{lock} + w_2 \cdot \text{snapshot} + w_3 \cdot \text{seed} + w_4 \cdot \text{sha}
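The lesson does not define the weights or the four terms, so the following is only one possible reading: each term is a 0/1 indicator of whether that input is available, and the weights are equal and sum to 1. Under that assumption, any score below 1 means at least one input is missing and the result is not reproducible.

```python
def repro_score(lock: bool, snapshot: bool, seed: bool, sha: bool,
                weights: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of four availability indicators.

    Assumptions (not from the lesson): indicators are 0/1 and the default
    weights are equal, so a fully reproducible result scores exactly 1.0.
    """
    terms = (lock, snapshot, seed, sha)
    return sum(w * float(t) for w, t in zip(weights, terms))


full = repro_score(True, True, True, True)
no_seed = repro_score(True, True, False, True)
print(full, no_seed)
```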

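Discipline summary

A research project whose four canonical inputs can all be restored from a single command is reproducible; a project whose inputs are scattered across personal laptops, untagged shared-drive folders, and unpinned conda environments is not. A strategy's Sharpe ratio, its information ratio against the benchmark, the maximum drawdown of its NAV curve, its post-deployment alpha decay, its factor-model attribution to value / quality / momentum, its stress tests against the 2007-2008, 2015 crash, 2018 trade-war, 2020 pandemic, and 2022 property drawdowns, and the mean-variance and portfolio optimisation downstream in 4.4: each of these is trustworthy only when it is bound to a git_commit_sha in the experiment log. Without the binding, a metric is a memory; with it, a metric is a research result.

The buy side adopted this stack between 2015 and 2018: the leading quant private funds (明汯, 幻方量化, 中诚, 灵均, 九坤投资), the quant departments of mutual-fund houses (天弘 / 富国 / 华夏 / 嘉实 / 工银瑞信), and sell-side systematic desks (中信 systematic, 华泰 QIS). The Chinese edition of López de Prado's 2018 Advances in Financial Machine Learning and the 2016 Harvey-Liu-Zhu paper set the academic floor; the requirements of AMAC (the Asset Management Association of China) on research logs and information disclosure for private funds set the regulatory floor; the engineering rigour of the leading quant funds set the practitioner floor.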

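Exercise

You are creating a research PR for a study of a 5-day momentum signal on 510300, the CSI 300 ETF. Produce four engineering artefacts in order and report your answers.

(i) Write a complete pyproject.toml skeleton listing the six top-level dependencies (numpy>=1.26, pandas>=2.2, scikit-learn>=1.4, matplotlib>=3.8, mlflow>=2.10, jupyter>=1.0) and the Python version requirement (>=3.11), and state in one sentence what the uv.lock file does that pyproject.toml does not.

(ii) Write the SQL CREATE TABLE runs (...) for the minimal experiment-log schema, with its nine columns (run_id, timestamp, hyperparameters_json, seed, data_window, metric_name, metric_value, artefact_path, git_commit_sha) and their types; state in one sentence why git_commit_sha is the column that makes the log unforgeable.

(iii) Write a Python function seed_all(seed: int) that seeds random, numpy, PYTHONHASHSEED, and torch when available; demonstrate that calling it twice with seed=42 and evaluating np.random.rand(3) after each call returns the same three numbers.

(iv) Apply the five-item code-review checklist to a hypothetical research PR that claims an out-of-sample Sharpe of 1.8 on the 510300 2022-2023 window with N=10 trials. For each of the five checks, state whether the claim is a PASS or a FAIL (assume the experiment log shows: the test set was touched once, all 10 runs have a logged seed, the data window matches the pre-registration, the universe comes from 4.1.1 L4, and DSR = 0.92 is reported with N=10), and justify each verdict in one sentence.

Report all four answers as one packaged set of artefacts.

Hint

pyproject.toml declares dependency ranges; uv.lock freezes the exact versions plus the hash closure, guaranteeing a byte-identical install. git_commit_sha binds the metric to a code state, which can be re-verified with git checkout <sha>.

Hint

N=10 with DSR=0.92: the correction is applied correctly and the DSR sits in the suggestive band. All five checks PASS. The PR can merge; with a suggestive (not strong) DSR the signal should go to shadow trading.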

A day in the research stack: how the layers compose

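A concrete walk-through of the everyday research loop, invoking the layers of the canonical stack in order. On Monday the researcher opens a new feature branch, research/momentum-510300-2024-q2. The pre-registration document is committed first: the L1 six-field template, filled in with the threshold rule Sharpe > 1.0 → paper-trade for one quarter (conservative size unless DSR > 0.95); otherwise abandon, and with the trial counter initialised to 1. pyproject.toml and uv.lock go into the same commit; the next person runs uv sync --locked and gets exactly the same versions of numpy, pandas, scikit-learn, matplotlib, mlflow, and jupyter as the original researcher.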

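Tuesday is in-sample exploration on the 2015-2021 window. The researcher opens an EDA notebook; the data/test/ partition is locked at the filesystem level, so the test set cannot be opened by accident. For every variant tried, the notebook calls log_run(...) to insert a row into experiments.db with the seed, data window, metric, and current git commit SHA. By Friday afternoon the experiment log holds fifty rows; the counter that stood at 1 at pre-registration now stands at 50, and that count is the N the L3 DSR formula sees.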

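The following Monday the researcher freezes in-sample tuning, unseals the test set exactly once, and computes the out-of-sample Sharpe and DSR using N=50 from the experiment log. The result: Sharpe 1.4, DSR 0.88, suggestive but not strong. The decision rule pre-registered at project start fires automatically: Sharpe > 1.0 → shadow trading for one quarter; DSR suggestive → conservative position size. The researcher opens a research PR packaging the eight artefacts: notebook, production script, experiment-log dump, pre-registration document, in-sample result, single out-of-sample result, multiple-testing correction with N=50, and write-up. The PR template requires all five review checks to be ticked; check one confirms the single test-set touch; check two confirms that all 50 seeds are logged; check three confirms that the window matches the pre-registration. The PR merges on Friday; the shadow-trading book opens on Monday.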

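Six months later the portfolio manager asks whether the signal still works on 2024 data. A new researcher pulls up the merged PR, runs git checkout <sha> on the merge commit, runs uv sync --locked against the committed uv.lock, takes the seed from the experiment log, points the script at the new 2024 data snapshot, and reruns the production script. The 2024 Sharpe comes in at 0.6; the signal has decayed. The decision rule fires again: abandon. Both runs are reproducible from a single command; both runs are auditable end to end. This is what the six layers, eight artefacts, SQLite schema, five review checks, and four inputs produce in combination. Remove any one of them and the six-month follow-up cannot be done.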

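Reference card

The components assembled in this lesson, in order:

Inline-code listing: the six-layer canonical research tool stack.
Inline-code listing: the eight artefacts of a research PR.
Fenced sql code block: the minimal experiment-log schema (the nine-column runs table).
Inline-code listing: the five-item code-review checklist.
Fenced python code block: seed_all(seed: int).
Fenced python code block: log_run(run_id, hyperparameters, seed, data_window, metric_name, metric_value, artefact_path).
Fenced toml code block: the pyproject.toml skeleton with its six dependencies.
Inline-code listing: the four canonical inputs for reproducing a result.
FormulaExplorer: the composite reproducibility-score formula.
Exercise: the four-artefact research PR capstone exercise, with two progressive Hints.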

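Module wrap-up

The module is complete. Its four lessons combine into one durable discipline. L1 puts the workflow into a written contract: seven stages, a six-field pre-registration, four artefacts, four diagnostic questions. L2 makes the mechanics of data discipline precise: three partitions, four splitting methods, four leakage modes, a five-item leakage-detection checklist. L3 turns the trial counter into a deflated probability: the Bailey-López de Prado formula, three corrections, DSR threshold bands, five forms of p-hacking. L4 wraps the previous three lessons in engineering: a six-layer stack, eight PR artefacts, a SQLite schema with git_commit_sha, a five-item code-review checklist, four reproduction inputs. A research project that honours all four lessons produces a result a teammate can rerun from a single command, complete with a trial counter, a DSR, and a written hypothesis. A project that skips any one lesson produces a memory. The next module, 4.2.2 (signal construction), is about what signal to build; this module is about how to test a signal honestly, so that the buy side trusts what you ship.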
