Deployment, pybind11, and End-to-End Latency: Trading Systems in C++

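On a Friday afternoon, the head of low latency at an HFT fund walks into the engineering room and asks the team that wrote the L1 / L2 / L3 trading binary one question: "It runs correctly on the dev box. Now it has to go into a rack next to the matching engine at the CFFEX Zhangjiang COLO, and we have to promise the desk an end-to-end P99.9 under 3 µs. What is left to do?" The gap between "it compiles" and "the desk dares to use it" is the deployment story. Four layers pull together: compile flags (PGO + -O3 -march=icelake-server -flto), runtime library substitution (LD_PRELOAD=libjemalloc.so), kernel boot parameters (isolcpus + nohz_full + rcu_nocbs), and process pinning (numactl + chrt --rr). On top of that, the seam to the research environment has to be closed so that quant researchers can call your pricer from Python (pybind11), and you need to be able to measure the whole path in a closed loop with SO_TIMESTAMPING hardware NIC timestamps on both ends, so that the budget is proven rather than asserted. L4 is the capstone: it turns the previous three lessons into something the desk can actually put live.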

The four-layer deployment stack

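The recipe is a stack: each layer builds on the one beneath it. Skip one layer and you leave free headroom on the table; skip two and tail latency grows by an order of magnitude.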

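Layer	Mechanism	Command / config	Typical contribution
Compile flags	`-O3 -march=icelake-server -flto -DNDEBUG` + two-pass PGO	Full recipe below	~70% of total headroom
Runtime library substitution	`LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so` replaces glibc malloc	Preload at launch	~5% more
Kernel boot parameters	`isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7` on the GRUB cmdline	One-time + reboot	~10% more (large effect on P99.9)
Process pinning + real-time priority	`numactl --cpunodebind=0 --membind=0` + `chrt --rr 50`	Every launch	~15% more (large effect on tail latency)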

# Step 1: build with profile generation.
g++ -std=c++17 -O3 -march=icelake-server -flto -DNDEBUG \
    -fprofile-generate=/tmp/pgo \
    -o trading_binary_pgogen \
    main.cpp orderbook.cpp feed_handler.cpp strategy.cpp risk.cpp router.cpp

# Step 2: run against a representative workload (replay an ITCH morning).
./trading_binary_pgogen --replay /data/itch/sse_l2_morning_replay.bin --runtime 600

# Step 3: rebuild with the collected profile data.
g++ -std=c++17 -O3 -march=icelake-server -flto -DNDEBUG \
    -fprofile-use=/tmp/pgo \
    -o trading_binary \
    main.cpp orderbook.cpp feed_handler.cpp strategy.cpp risk.cpp router.cpp

# Step 4: GRUB kernel cmdline (one-time, requires reboot).
# Edit /etc/default/grub:
#   GRUB_CMDLINE_LINUX="... isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"
# Then: sudo update-grub && sudo reboot

# Step 5: production launch with full pinning.
sudo LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
     numactl --cpunodebind=0 --membind=0 \
     chrt --rr 50 \
     ./trading_binary --core 6 --feed 233.54.12.1:26477

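Line by line, here is what each item buys you. -O3 turns on aggressive inlining and loop transformations. -march=icelake-server lets the compiler emit AVX-512 instructions for the microarchitecture actually deployed at the CFFEX Zhangjiang COLO (Intel Xeon Gold 6342 / 6354, Ice Lake-SP); a generic -march=x86-64 binary can leave 20% of the performance unused. -flto enables link-time optimisation, so that L1's add_order can be inlined into L2's book.apply even though the two live in different .cpp files. -DNDEBUG compiles out assert calls.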

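Two-pass PGO is the single largest win in this recipe. The first build inserts instrumentation that records branch directions and call counts. You then run it against a representative workload (a morning-open ITCH replay of the CSI 300 ETF 510300.SH is the standard choice, since the opening auction, continuous matching, and steady state all sit in one trace) and build again with -fprofile-use=/tmp/pgo. The optimiser now knows that 99% of order-book updates land within ±50 ticks of the inside price; it lays out the array-indexed ladder code accordingly, and the branch predictor cooperates. Measured result: total runtime 30–60% faster than a plain -O3 build.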

Runtime library substitution

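LD_PRELOAD=libjemalloc.so swaps the global malloc for jemalloc: lower fragmentation, better multi-threaded scaling, and lower P99 allocation latency. On the hot path you should not be calling malloc at all (the L3 pool allocator owns the order arena), but the periphery (logging, statistics accumulators, the occasional cold-path std::string) still hits the global allocator, and jemalloc shaves microseconds off those tails. jemalloc is the default at Meta and tcmalloc is the default at Google; both are credible production choices, and most active domestic prop desks have also converged on jemalloc after A/B comparisons.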

Kernel boot parameters

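isolcpus=2-7 removes cores 2 through 7 from the Linux scheduler's general pool, so the kernel will not schedule other processes onto them. nohz_full=2-7 turns off the periodic timer-tick interrupt on those cores whenever a core has a single runnable task (with CONFIG_HZ=1000 the tick otherwise fires every millisecond to update scheduler accounting); without it, your strategy thread is interrupted 1000 times per second, each time costing 1–3 µs of cache disruption. rcu_nocbs=2-7 offloads RCU callback execution to cores outside the isolated set, so the kernel cannot run an RCU cleanup pass on your strategy core. Once the three parameters are in GRUB_CMDLINE_LINUX in /etc/default/grub and the machine has rebooted, the strategy thread on core 6 sees essentially no kernel preemption, which is usually the dominant source of P99.9 jitter on the machine.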
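A quick way to sanity-check the three parameters after a reboot is to parse the kernel command line and confirm that the strategy core appears in all three CPU lists. The following is a minimal sketch, not part of the lesson's tooling: the helper names are hypothetical, and the sample command line is made up for illustration (on a live host you would read the text from /proc/cmdline).

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Expand a kernel CPU-list string such as "2-7" or "2-5,7" into a set of core ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        # isolcpus may carry flag words (e.g. "domain", "managed_irq"); skip anything non-numeric.
        if not part or not part[0].isdigit():
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def isolation_params(cmdline: str) -> dict[str, set[int]]:
    """Pull isolcpus / nohz_full / rcu_nocbs out of a kernel command line string."""
    wanted = ("isolcpus", "nohz_full", "rcu_nocbs")
    found: dict[str, set[int]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            found[key] = parse_cpu_list(value)
    return found


def core_is_fully_isolated(cmdline: str, core: int) -> bool:
    """True only if the core is listed under all three parameters."""
    params = isolation_params(cmdline)
    return all(core in params.get(name, set())
               for name in ("isolcpus", "nohz_full", "rcu_nocbs"))


# Illustrative command line (an assumption, not captured from a real host).
sample_cmdline = "BOOT_IMAGE=/vmlinuz ro isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"
print(core_is_fully_isolated(sample_cmdline, 6))   # strategy core
print(core_is_fully_isolated(sample_cmdline, 1))   # housekeeping core
```

Checking all three lists together matters because a core that is in isolcpus but missing from nohz_full still takes the periodic tick.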

tail_p999 = base_latency + max_kernel_preemption
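The formula can be made concrete with a small deterministic simulation. This sketch assumes illustrative numbers (a 1500 ns base latency and a 2000 ns preemption hitting one event in 500); they are not measurements, and the function names are hypothetical. The percentile index follows the same n * 999 / 1000 convention as the report() function later in this lesson.

```python
def percentile(sorted_ns, num, den):
    """Element n * num / den of an already sorted sample (integer division)."""
    return sorted_ns[len(sorted_ns) * num // den]


def simulate(base_ns, preempt_ns, every, n=100_000):
    """Every `every`-th event absorbs one kernel preemption of `preempt_ns`."""
    return sorted(base_ns + (preempt_ns if i % every == 0 else 0) for i in range(n))


# Non-isolated core: 0.2% of events are hit by a 2 µs preemption (assumed figures).
noisy = simulate(base_ns=1500, preempt_ns=2000, every=500)
# Isolated core: the preemption term is gone.
quiet = simulate(base_ns=1500, preempt_ns=0, every=500)

for name, sample in (("noisy", noisy), ("quiet", quiet)):
    mean = sum(sample) / len(sample)
    print(name,
          "median", percentile(sample, 50, 100),
          "p99.9", percentile(sample, 999, 1000),
          "mean", mean)
```

In the noisy case the median stays at the base latency and the mean moves by only a few nanoseconds, while P99.9 lands on base_latency + max_kernel_preemption.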

Process pinning + real-time priority

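numactl --cpunodebind=0 --membind=0 pins the process to NUMA node 0, so CPU and memory sit on the same socket. On a dual-socket COLO server a cross-socket memory access is 80–120 ns slower than a same-socket one; for a read that would otherwise take about 10 ns, that is roughly a 10x penalty. chrt --rr 50 runs the process under SCHED_RR real-time scheduling at priority 50; only another real-time task at priority ≥ 50 can take the core from it, which almost never happens on a hardened trading host. Combined with isolcpus, this keeps the strategy thread uninterrupted from open to close.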
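The same two controls are reachable from inside a process through the scheduler system calls, which is handy for a research-side replay harness that wants to mimic the production launch line. The sketch below uses Python's standard os module on Linux; the function names are hypothetical, and the target CPU is simply picked from the currently allowed set as a stand-in for the isolated strategy core.

```python
import os


def pin_current_thread(cpu: int) -> set[int]:
    """Restrict the calling thread to one CPU and return the affinity the kernel reports."""
    os.sched_setaffinity(0, {cpu})        # pid 0 means "the calling thread"
    return set(os.sched_getaffinity(0))


def try_round_robin(priority: int = 50) -> bool:
    """Request SCHED_RR at `priority`; returns False if the process lacks the privilege."""
    try:
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(priority))
        return True
    except OSError:
        return False


if hasattr(os, "sched_setaffinity"):      # these calls exist on Linux only
    original = os.sched_getaffinity(0)
    target = min(original)                # stand-in for the isolated strategy core
    print(pin_current_thread(target) == {target})
    os.sched_setaffinity(0, original)     # restore the original mask
```

try_round_robin is defined but deliberately not called here: switching to a real-time policy normally needs root or CAP_SYS_NICE, which is why the production launch line runs under sudo.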

pybind11: the research-to-production seam

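Researchers work in Python: pandas for data wrangling, numpy for vectorised math, scipy for statistics. They want to call your C++ pricer and order book from a Jupyter notebook without writing a line of C++. pybind11 is that seam: one .cpp file with a PYBIND11_MODULE macro compiles to a .so that Python imports like any other module. Fund research environments, client-facing pricing models at foreign banks, and the mainstream sell-side quant libraries expose their C++ kernels to Python this way.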

// quant_lib_bindings.cpp — compiles to quant_lib.so via CMake + pybind11.
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <pybind11/stl.h>

#include "orderbook.hpp"   // L1
#include "strategy.hpp"    // L3 MeanRevMidStrategy
#include "risk.hpp"        // L3 RiskManager

namespace py = pybind11;

PYBIND11_MODULE(quant_lib, m) {
    m.doc() = "Trading-systems-in-cpp module 3.4.5: C++ <-> Python research seam";

    py::class_<OrderBook>(m, "OrderBook")
        .def(py::init<std::int32_t, std::int32_t, std::int32_t>(),
             py::arg("min_price_ticks"), py::arg("max_price_ticks"), py::arg("tick_size"))
        .def("best_bid_ticks", &OrderBook::best_bid_ticks)
        .def("best_ask_ticks", &OrderBook::best_ask_ticks)
        .def("mid_ticks", [](const OrderBook& b) {
            return (b.best_bid_ticks() + b.best_ask_ticks()) / 2;
        });

    py::class_<MeanRevMidStrategy::Decision>(m, "Decision")
        .def_readonly("side",        &MeanRevMidStrategy::Decision::side)
        .def_readonly("price_ticks", &MeanRevMidStrategy::Decision::price_ticks)
        .def_readonly("qty",         &MeanRevMidStrategy::Decision::qty)
        .def_readonly("emit",        &MeanRevMidStrategy::Decision::emit);

    // Pure decision function — same numerics as the production OnTick path, exposed for
    // Python research to verify or backtest without the framework around it.
    m.def("decide",
          [](const OrderBook& book, std::int32_t last_trade_ticks,
             std::int32_t threshold_bps, std::uint32_t lot_size) {
              return MeanRevMidStrategy::decide(book, last_trade_ticks,
                                                threshold_bps, lot_size);
          },
          py::arg("book"), py::arg("last_trade_ticks"),
          py::arg("threshold_bps") = 5, py::arg("lot_size") = 100,
          py::call_guard<py::gil_scoped_release>(),
          "Compute the strategy decision for a (book, last_trade) pair. Releases the GIL.");

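    // NumPy interop: a vectorised batch-pricing entry point (zero-copy in for C-contiguous float64).
    // bs_price is assumed to be declared by one of the project headers included above.
    // The output array is allocated while the GIL is held; only the pure C++ loop runs without it,
    // because creating or destroying Python objects with the GIL released is undefined behaviour.
    m.def("price_batch",
          [](py::array_t<double, py::array::c_style | py::array::forcecast> spots,
             double strike, double vol, double ttm) {
              const auto buf = spots.unchecked<1>();
              py::array_t<double> out(buf.shape(0));
              auto out_buf = out.mutable_unchecked<1>();
              {
                  py::gil_scoped_release release;   // no Python objects are touched in this scope
                  for (py::ssize_t i = 0; i < buf.shape(0); ++i) {
                      out_buf(i) = bs_price(buf(i), strike, vol, ttm);
                  }
              }
              return out;
          },
          py::arg("spots"), py::arg("strike"), py::arg("vol"), py::arg("ttm"),
          "Vectorised Black-Scholes pricing over a NumPy array of spots. Zero-copy in. "
          "Releases the GIL around the pricing loop.");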
}

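Three idioms deserve a mention. py::array_t<double, py::array::c_style | py::array::forcecast> accepts a NumPy array of doubles and exposes its data pointer directly to C++; when the input is already a C-contiguous float64 array, no data is copied across the language boundary. The forcecast flag makes the pybind11 caster convert the input (one copy) when it is float32 or has non-standard strides. Releasing the Python GIL, via py::call_guard<py::gil_scoped_release>() or a scoped py::gil_scoped_release, is essential for vectorised numerical work; otherwise a long-running price_batch blocks every other Python thread. Code that runs with the GIL released must not create or destroy Python objects (a py::array_t included), so any output array has to be allocated while the GIL is still held. The lambda inside m.def is the callable that pybind11 wraps; it can implement the logic in place or delegate to a C++ method.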

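A cross-language numerical consistency check on the Python side, confirming that the C++ pricer and a scipy reference implementation agree to within floating-point round-off: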

import numpy as np
import quant_lib   # compiled from quant_lib_bindings.cpp via CMake + pybind11

# Build an OrderBook on the Python side.
book = quant_lib.OrderBook(min_price_ticks=1, max_price_ticks=1_000_000, tick_size=1)
# (In a real test, populate the book from a recorded snapshot here.)

# Cross-language pricing identity: the same decide() called from Python must produce
# the same Decision the C++ binary produced during the live replay.
decision = quant_lib.decide(book, last_trade_ticks=450_00, threshold_bps=5, lot_size=100)
assert decision.emit is True or decision.emit is False           # exhaustive boolean
if decision.emit:
    assert decision.side in (0, 1)                                # 0 = Buy, 1 = Sell
    assert decision.qty == 100

# Vectorised batch-pricing: zero-copy NumPy interop.
spots = np.linspace(440.0, 460.0, 10_000).astype(np.float64)
prices_cpp = quant_lib.price_batch(spots, strike=450.0, vol=0.20, ttm=0.25)

# Reference Python pricer using scipy.stats.norm — same math the C++ bs_price() implements.
from scipy.stats import norm
def bs_price_py(s, k, vol, ttm):
    d1 = (np.log(s / k) + 0.5 * vol * vol * ttm) / (vol * np.sqrt(ttm))
    d2 = d1 - vol * np.sqrt(ttm)
    return s * norm.cdf(d1) - k * norm.cdf(d2)
prices_py = bs_price_py(spots, 450.0, 0.20, 0.25)

# The C++ and Python prices must agree to floating-point round-off.
assert np.allclose(prices_cpp, prices_py, atol=1e-6), \
    f"C++/Python pricing mismatch: max abs diff = {np.max(np.abs(prices_cpp - prices_py))}"
print(f"OK: cross-language pricing identity verified on {len(spots)} spots")

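The cross-language identity assertion catches the low-level mistakes that are easiest to make: a dropped minus sign in the Black-Scholes formula, vol * vol * ttm written as vol * vol + ttm on one side, a tick-unit conversion off by an order of magnitude in either direction. np.allclose(..., atol=1e-6) tolerates IEEE rounding noise while staying sensitive to real logic differences; note that the default rtol=1e-5 still applies, so the effective tolerance is atol + rtol * |reference|. On the CSI 300 ETF production line, researchers use this seam as the entry point for daily backtests, and every pre-launch numerical consistency test runs on it.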
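To see how far one of these slips moves a price relative to the tolerance, the sketch below prices an at-the-money call with the zero-rate formula used in bs_price_py and with a deliberately broken variant that writes vol * vol + ttm. It is an illustration under stated assumptions: the function names are hypothetical, and math.erf stands in for scipy.stats.norm.cdf so the snippet has no third-party dependencies.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(s: float, k: float, vol: float, ttm: float) -> float:
    """Zero-rate Black-Scholes call price."""
    d1 = (math.log(s / k) + 0.5 * vol * vol * ttm) / (vol * math.sqrt(ttm))
    d2 = d1 - vol * math.sqrt(ttm)
    return s * norm_cdf(d1) - k * norm_cdf(d2)


def bs_call_buggy(s: float, k: float, vol: float, ttm: float) -> float:
    """Deliberately wrong: vol * vol + ttm instead of vol * vol * ttm."""
    d1 = (math.log(s / k) + 0.5 * (vol * vol + ttm)) / (vol * math.sqrt(ttm))
    d2 = d1 - vol * math.sqrt(ttm)
    return s * norm_cdf(d1) - k * norm_cdf(d2)


good = bs_call(450.0, 450.0, 0.20, 0.25)
bad = bs_call_buggy(450.0, 450.0, 0.20, 0.25)

# Effective np.allclose tolerance with atol=1e-6 and the default rtol=1e-5.
effective_tol = 1e-6 + 1e-5 * abs(good)

print(f"correct={good:.4f} buggy={bad:.4f} "
      f"gap={abs(good - bad):.4f} tolerance={effective_tol:.2e}")
```

The gap between the two prices is many orders of magnitude above the tolerance, which is why a loose-looking allclose check is still enough to catch this class of bug.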

tick-to-trade: wire-to-wire measurement

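Tick-to-trade latency is the wire-to-wire interval from the RX timestamp of the market-data packet at the NIC to the TX timestamp of the order packet at the NIC. Everything in between counts: the kernel network stack, the feed handler, the order-book update, the strategy's OnTick, the risk checks, FIX serialisation, and the kernel TX path. Only hardware NIC timestamps on both ends give a trustworthy measurement; software timestamps from clock_gettime miss the kernel-stack segment entirely.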

// tick_to_trade_measurement.cpp — wire-to-wire latency via hardware NIC timestamps.
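#include <time.h>               // struct timespec
#include <linux/errqueue.h>     // struct scm_timestamping
#include <linux/net_tstamp.h>   // SOF_TIMESTAMPING_* flags
#include <sys/socket.h>
#include <algorithm>            // std::sort
#include <cstdint>
#include <vector>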

// Enable hardware NIC timestamping on both RX and TX sockets.
// The NIC itself must also be switched into timestamping mode (SIOCSHWTSTAMP ioctl).
bool enable_hw_timestamping(int sock_fd) {
    int flags = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_TX_HARDWARE
              | SOF_TIMESTAMPING_RAW_HARDWARE;
    return ::setsockopt(sock_fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags)) == 0;
}

// Extract the hardware RX timestamp from a recvmsg control-message buffer.
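std::int64_t extract_hw_rx_ns(const struct msghdr& msg) {
    // CMSG_NXTHDR takes a non-const msghdr*, so drop constness for the iteration only.
    auto* mh = const_cast<struct msghdr*>(&msg);
    for (auto* cm = CMSG_FIRSTHDR(mh); cm != nullptr; cm = CMSG_NXTHDR(mh, cm)) {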
        if (cm->cmsg_level == SOL_SOCKET && cm->cmsg_type == SCM_TIMESTAMPING) {
            auto* ts = reinterpret_cast<const struct scm_timestamping*>(CMSG_DATA(cm));
            return static_cast<std::int64_t>(ts->ts[2].tv_sec) * 1'000'000'000LL
                 + static_cast<std::int64_t>(ts->ts[2].tv_nsec);
        }
    }
    return -1;   // no HW timestamp available — fail loudly in production
}

struct TickToTradeSample {
    std::uint64_t cl_ord_id;
    std::int64_t  rx_hw_ns;    // hardware RX timestamp of the trigger packet
    std::int64_t  tx_hw_ns;    // hardware TX timestamp of the order packet
    std::int64_t  delta_ns() const noexcept { return tx_hw_ns - rx_hw_ns; }
};

// Offline reporting: sort and emit median / P99 / P99.9 / max in nanoseconds.
struct Percentiles { std::int64_t p50, p99, p999, pmax; };
Percentiles report(std::vector<std::int64_t> deltas_ns) {
    if (deltas_ns.empty()) return {};   // nothing collected: all-zero report instead of UB
    std::sort(deltas_ns.begin(), deltas_ns.end());
    const auto n = deltas_ns.size();
    return {
        deltas_ns[n * 50  / 100],
        deltas_ns[n * 99  / 100],
        deltas_ns[n * 999 / 1000],
        deltas_ns.back(),
    };
}

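The scm_timestamping struct delivered with SCM_TIMESTAMPING carries three timespec values: the software timestamp (ts[0]), the hardware timestamp converted to the system clock domain (ts[1], deprecated in current kernels), and the raw hardware timestamp (ts[2]). You want the raw slot: NIC-clock nanoseconds with no kernel conversion. Match the RX timestamp of the triggering ITCH packet to the TX timestamp of the corresponding order packet by cl_ord_id and you have one tick-to-trade sample. Collect a million of them, sort, and report percentiles.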
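For offline analysis of captured control messages, the slot selection described above can be sketched in a few lines. This is an illustration under a stated assumption: the payload is laid out as three struct timespec values of two signed 64-bit fields each (the 64-bit Linux layout), and an all-zero slot is treated as "no timestamp". The function name and the sample payloads are hypothetical.

```python
import struct

NS_PER_SEC = 1_000_000_000

# Assumed layout: ts[0], ts[1], ts[2], each (tv_sec, tv_nsec) as signed 64-bit integers,
# native byte order, 48 bytes in total.
SCM_TIMESTAMPING_FMT = "=6q"


def raw_hw_ns(cmsg_data: bytes) -> int:
    """Return ts[2] (raw hardware time) in nanoseconds, or -1 if the slot is missing or empty."""
    if len(cmsg_data) < struct.calcsize(SCM_TIMESTAMPING_FMT):
        return -1
    fields = struct.unpack_from(SCM_TIMESTAMPING_FMT, cmsg_data)
    sec, nsec = fields[4], fields[5]      # ts[2].tv_sec, ts[2].tv_nsec
    if sec == 0 and nsec == 0:
        return -1
    return sec * NS_PER_SEC + nsec


# Synthetic payloads for illustration only.
with_hw = struct.pack(SCM_TIMESTAMPING_FMT, 1_700_000_000, 111, 0, 0, 1_700_000_037, 250)
software_only = struct.pack(SCM_TIMESTAMPING_FMT, 1_700_000_000, 111, 0, 0, 0, 0)

print(raw_hw_ns(with_hw))
print(raw_hw_ns(software_only))
```

Returning -1 for an empty ts[2] mirrors the C++ extract_hw_rx_ns convention, so a software-only timestamp is never silently mixed into a hardware-clock sample set.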

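Two traps to watch for. First, the TX hardware timestamp is delivered asynchronously through the socket's error queue: you fetch it with recvmsg(MSG_ERRQUEUE) after the send, and it is not available the moment send returns. Ignoring this loses more than 30% of TX timestamps. Second, the NIC clock and the system clock live in two separate clock domains, so cross-packet matching must stay in one domain (take ts[2] for both RX and TX); mixing domains produces spurious differences of tens of milliseconds. In production, a prop desk typically synchronises the NIC clock over PTP to a GPS grandmaster to within ±200 ns, which is what makes cross-host end-to-end measurements meaningful.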
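Because TX timestamps arrive late and can go missing, the offline join is worth writing so that it reports what it could not pair instead of silently dropping it. A minimal sketch, assuming both inputs are dictionaries keyed by cl_ord_id and holding raw hardware nanoseconds from the same NIC clock; the function name and the sample values are hypothetical.

```python
def pair_samples(rx_hw_ns: dict[int, int], tx_hw_ns: dict[int, int]):
    """Join RX and TX raw hardware timestamps on cl_ord_id.

    Returns (deltas_ns, missing_tx_ids): one delta per matched order, plus the ids
    whose TX timestamp never came back from the error queue.
    """
    deltas, missing = [], []
    for cl_ord_id, rx in sorted(rx_hw_ns.items()):
        tx = tx_hw_ns.get(cl_ord_id)
        if tx is None:
            missing.append(cl_ord_id)
        else:
            deltas.append(tx - rx)
    return deltas, missing


# Illustrative values only: order 2 never received its TX timestamp.
rx = {1: 1_000, 2: 5_000, 3: 9_000}
tx = {1: 2_400, 3: 10_650}

deltas, missing = pair_samples(rx, tx)
print(deltas, missing, f"missing fraction = {len(missing) / len(rx):.0%}")
```

Tracking the missing fraction alongside the percentiles makes an error-queue handling bug visible as a data-quality number rather than as a quietly biased latency report.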

End-to-end budget

Broken down by component, with typical median values on a tuned Xeon Gold 6342:

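Component	Source lesson	Median latency (ns)	Notes
feed-parse	L2 receive and parse	100–300	`recvmsg` + header parse + dispatch
book-update	L1 array-indexed ladder	50–200	one pool bump + ~4 pointer writes
signal-compute	L3 strategy `OnTick`	strategy-dependent (mean-rev mid ~50 ns; complex vectorised features 1–10 µs)	the only row that varies
risk-check	L3 six gates	50–100	kill-switch ~1 ns; per-symbol lookup dominates at 30–60 ns
order-serialise	L3 router NEWORDERSINGLE	FIX text 100–200; SBE binary / OUCH ~50	depends on wire format
NIC-send	kernel-bypass TX	200–500	OpenOnload / DPDK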

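End-to-end budgets: HFT (cross-spread arbitrage, latency-sensitive market making) 1–5 µs; options market making (Black-Scholes-on-tick costs more) 5–20 µs; cross-market intraday 20–100 µs; institutional execution 100 µs – 10 ms. Reporting convention: median / P99 / P99.9 / max in nanoseconds; never the mean. The mean blends warm-cache and cold-cache events together; the desk only cares how bad the worst 0.1% of ticks are, because that 0.1% is where the alpha gets eaten.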
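Summing the component medians from the table gives a quick feel for how much of the HFT budget the median path consumes. The figures below are copied from the table, with signal-compute fixed at the mean-rev mid value of about 50 ns; the dictionary and function names are hypothetical.

```python
# (low, high) median latency per component in nanoseconds, taken from the table above.
BUDGET_NS = {
    "feed-parse":      (100, 300),
    "book-update":     (50, 200),
    "signal-compute":  (50, 50),    # mean-rev mid strategy
    "risk-check":      (50, 100),
    "order-serialise": (50, 200),   # SBE / OUCH at the low end, FIX text at the high end
    "NIC-send":        (200, 500),
}


def total_budget_ns(budget):
    """Sum the low and high ends of every component range."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = total_budget_ns(BUDGET_NS)
hft_ceiling_ns = 5_000   # upper end of the 1–5 µs HFT budget

print(f"median path: {low}-{high} ns, headroom to HFT ceiling: {hft_ceiling_ns - high} ns")
```

Adding medians like this only bounds the typical path; the tail is governed by the P99.9 behaviour discussed above, not by the sum of per-component medians.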

Closing the loop

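Module 3.4 ends here. You now have the full vertical skill set: modern C++17 (3.4.1 / 3.4.2), single-threaded measurement and SIMD (3.4.3), correct multithreading plus kernel-bypass networking plus FIX / ITCH parsing (3.4.4), and the ability to integrate those components into a deployable trading system (3.4.5). The next course, 3.5, goes one level up: how to build the operations platform that runs forty such binaries in parallel across CFFEX Zhangjiang / SSE Pudong / SZSE Futian, how to pass exchange certification audits, and how the fund's risk layer composes across instances. For now, the desk has something it can take live on Monday.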

Exercises

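(a) Build the assembled L1 + L2 + L3 trading binary with the full production recipe: first g++ -std=c++17 -O3 -march=icelake-server -flto -DNDEBUG -fprofile-generate=/tmp/pgo, then replay a morning ITCH sample for 600 seconds to populate /tmp/pgo, then rebuild with -fprofile-use=/tmp/pgo. Run the resulting binary against the same ITCH file once more and measure wall-clock time with time ./trading_binary --replay sample.itch. Compare against a baseline (-O3 only, no -march=icelake-server, no -flto, no PGO). Report the speedup percentage in one line. Expected: the full recipe is 30-60% faster than the baseline.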

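(b) Verify the runtime library substitution: launch once without LD_PRELOAD and once with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so. Both times, early in main, read the process's own /proc/self/maps and print the lines matching libjemalloc or libc; do this in-process rather than through system("cat /proc/self/maps"), because inside a child process /proc/self refers to the child. Confirm that jemalloc is loaded in the second run but not in the first. In one sentence, explain why the custom arena from 3.4.3 L3 is still useful even with jemalloc as the global allocator.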

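(c) Verify that the kernel boot parameters took effect: after rebooting with isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 on the GRUB cmdline, check cat /sys/devices/system/cpu/isolated, then run ps -eLo psr,comm | grep -E "^ *[2-7] " (the trailing space keeps cores 20-79 out of the match). Confirm that the only threads on cores 2-7 are your own pinned threads and the per-CPU kernel threads that exist on every core regardless of isolation (ksoftirqd/N, migration/N, kworker/N:*); no ordinary user-space threads should appear there. In one sentence, explain what nohz_full does beyond isolcpus.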

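(d) Following the worked example, compile the pybind11 binding .cpp into quant_lib.so with CMake. From a Python REPL, import quant_lib, construct an OrderBook, call quant_lib.decide(book, last_trade_ticks=450_00, threshold_bps=5, lot_size=100), and verify that the returned Decision matches the one the C++ binary would emit for the same input. Then call quant_lib.price_batch(np.linspace(440, 460, 10000), 450.0, 0.20, 0.25) and assert (np.allclose(..., atol=1e-6)) that the C++ result agrees with a pure-Python Black-Scholes built on scipy.stats.norm. In one sentence, explain why the price_batch entry point needs to release the GIL (py::gil_scoped_release) while the decide entry point, provided it touches no Python objects, can do without it.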

(e) Run the L4 capstone: assemble L1 + L2 + L3 into one binary, launch it with the full LD_PRELOAD + isolcpus + numactl + chrt recipe (sudo LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so numactl --cpunodebind=0 --membind=0 chrt --rr 50 ./trading_binary --core 6 --feed 233.54.12.1:26477 --replay sample.itch), enable SO_TIMESTAMPING on the RX and TX sockets, collect one million TickToTradeSample records during the replay, sort delta_ns offline, and report median / P99 / P99.9 / max in nanoseconds. Expected on a tuned Intel Xeon Gold 6342: median ~1.5 µs, P99 ~3 µs, P99.9 ~5 µs, max ~10-20 µs. In one sentence, explain why P99.9 (rather than the mean) is the operationally meaningful summary statistic for the tick-to-trade latency distribution.

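Hint

On (a): the PGO instrumentation run needs --runtime 600 so that the sample is large enough; a run shorter than 60 seconds yields a sparse profile, and the second-pass build gains little. Expect the improvement to come mainly from -march and PGO.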

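Hint

On (e): P99.9 is the desk's worst tick out of every 1000; the mean is dominated by warm-cache ticks. A strategy that misses alpha on the worst 0.1% of ticks misses it every day, and the mean will not show it.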