PyO3 and Python Interop: Rust Interop and Productionization

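A researcher at a private fund pushes over a Jupyter notebook: they swept 5 million (S, K, σ, t) parameter combinations across the CSI 300 constituents, aiming at a sensitivity analysis for implied-volatility surface fitting. Pure Python plus scipy.stats.norm.cdf took 47 minutes. He wants this step under 5 minutes, but strategy iteration must stay in his hands, in the notebook; researchers do not write Cargo.toml and do not touch build.rs. "Give me a pricer_rs.so. I import pricer_rs and it works, numpy arrays in, numpy arrays out, and I never convert formats by hand." This is the highest-leverage scenario for Rust interop in quant work in 2026: research stays in Python / Jupyter, production runs on Rust, and PyO3 + the numpy crate + maturin is the industry-standard glue. Polars, the Pydantic core, Cryptography, and Tokenizers all take this route: one Rust implementation, one Python entry point, and zero-copy access to numpy buffers. This lesson walks you through handing the Rust BS pricer to the researcher as a PyO3 extension module, covering the #[pymodule] entry point, #[pyfunction] / #[pyclass] functions and classes, releasing the GIL with py.allow_threads, numpy zero-copy, and the maturin develop / maturin build --release wheel workflow.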

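Extension package skeleton: `Cargo.toml` and the crate type

A PyO3 extension is a Rust library crate that compiles to a dynamic library; what Python sees as a "module" is that library. cargo emits .so on Linux, .dylib on macOS, and .dll on Windows, and maturin installs it under the name Python expects (.so on Linux and macOS, .pyd on Windows). Two points matter: [lib] crate-type = ["cdylib"] makes cargo produce a C-style dynamic library, and pyo3's extension-module feature turns off direct linking against libpython (which the maturin pipeline requires).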

[package]
name = "pricer_rs"
version = "0.1.0"
edition = "2021"

[lib]
name = "pricer_rs"
crate-type = ["cdylib"]

[dependencies]
pyo3 = { version = "0.21", features = ["extension-module"] }
numpy = "0.21"
ndarray = "0.15"
thiserror = "1"
# bs_call in src/lib.rs calls libm::erf, so the crate must be declared here.
libm = "0.2"

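The pyo3 0.21 line, released in early 2024, is where PyO3 moved to the Bound<'py, T> smart pointer. The earlier &PyAny borrowed-reference API still works there, but it is deprecated and the project recommends Bound; this lesson uses Bound throughout.

`#[pymodule]` entry point: this is the "module" Python sees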

use pyo3::prelude::*;
use pyo3::exceptions::PyValueError;
use numpy::{IntoPyArray, PyArray1, PyReadonlyArray1};

fn bs_call(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> f64 {
    let d1 = ((s / k).ln() + (r + 0.5 * sigma * sigma) * t) / (sigma * t.sqrt());
    let d2 = d1 - sigma * t.sqrt();
    fn phi(x: f64) -> f64 {
        0.5 * (1.0 + libm::erf(x * std::f64::consts::FRAC_1_SQRT_2))
    }
    s * phi(d1) - k * (-r * t).exp() * phi(d2)
}

#[pyfunction]
fn price(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> PyResult<f64> {
    if t <= 0.0 || sigma <= 0.0 {
        return Err(PyValueError::new_err("t and sigma must be > 0"));
    }
    Ok(bs_call(s, k, r, sigma, t))
}

#[pyfunction]
fn price_european_call_batch<'py>(
    py: Python<'py>,
    s_array: PyReadonlyArray1<'py, f64>,
    k: f64, r: f64, sigma: f64, t: f64,
) -> PyResult<Bound<'py, PyArray1<f64>>> {
    let s_slice: &[f64] = s_array.as_slice()?;
    let result: Vec<f64> = py.allow_threads(|| s_slice.iter().map(|&s| bs_call(s, k, r, sigma, t)).collect());
    Ok(result.into_pyarray_bound(py))
}

#[pyclass]
pub struct BlackScholesPricer {
    // #[pyo3(get)] exposes each field as a read-only Python attribute.
    #[pyo3(get)]
    pub r: f64,
    #[pyo3(get)]
    pub sigma: f64,
}

#[pymethods]
impl BlackScholesPricer {
    #[new]
    fn new(r: f64, sigma: f64) -> Self {
        BlackScholesPricer { r, sigma }
    }
    fn price_call(&self, s: f64, k: f64, t: f64) -> PyResult<f64> {
        if t <= 0.0 || self.sigma <= 0.0 {
            return Err(PyValueError::new_err("t and sigma must be > 0"));
        }
        Ok(bs_call(s, k, self.r, self.sigma, t))
    }
}

#[pymodule]
fn pricer_rs(_py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(price, m)?)?;
    m.add_function(wrap_pyfunction!(price_european_call_batch, m)?)?;
    m.add_class::<BlackScholesPricer>()?;
    Ok(())
}

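Four things sit side by side in one file: bs_call is the pure-Rust kernel; #[pyfunction] price is the scalar entry point; #[pyfunction] price_european_call_batch is the numpy batch entry point; #[pyclass] BlackScholesPricer is the stateful class entry point. The final #[pymodule] fn pricer_rs(...) is the registration function Python calls when it imports the module: m.add_function registers functions and m.add_class registers classes.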

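Error mapping: Rust `Result` ↔ Python `Exception`

PyResult<T> is an alias for Result<T, PyErr>. There are two straightforward ways to map a Rust error to a Python exception:

Return Err(PyValueError::new_err("t and sigma must be > 0")) inside the function, and Python receives a ValueError. Other common constructors: PyTypeError::new_err, PyRuntimeError::new_err, PyIndexError::new_err.
For a custom error type, write impl From<MyRustError> for PyErr; the function signature keeps returning PyResult<T>, and ? converts automatically.

What the researcher sees in the notebook is a familiar ValueError: t and sigma must be > 0, not an unreadable panic backtrace.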

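Releasing the GIL: `py.allow_threads`

Python's Global Interpreter Lock serializes all Python bytecode execution: as long as your Rust extension holds the GIL, no other Python thread can make progress. py.allow_threads(|| { ... }) releases the GIL on entry to the closure and reacquires it on exit. The production rule: if the Rust work takes more than ~10 µs and does not touch Python objects, release the GIL. There are two reasons: (a) Python threads can make progress while your Rust code runs; (b) the GIL is process-wide, so other extensions are waiting on it too.

The py.allow_threads(|| s_slice.iter().map(...).collect()) in price_european_call_batch is a textbook release: s_slice is the raw buffer behind the numpy array (not a Python object), and neither the iteration nor the arithmetic needs the Python lock. For the BS batch here, an array of 50,000 elements is roughly 0.5 ms of compute, far above the 10 µs threshold, so the GIL must be released.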

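numpy zero-copy: `PyReadonlyArray1` and the buffer protocol

The numpy crate exposes a numpy array's raw memory to Rust without copying, in the same spirit as Python's buffer protocol (under the hood it reads the array's data pointer through NumPy's C API). PyReadonlyArray1<'py, f64> is a read-only view, PyReadwriteArray1<'py, f64> is a writable view, and PyArray1<f64> is the array object type you can own and return. Common conversions:

array.as_slice()? → &[f64], which feeds straight into the hot path of the Rust algorithm (it returns an error if the array is not contiguous).
Vec::into_pyarray_bound(py) / ndarray::Array1::into_pyarray_bound(py) → the numpy array that the Python side receives.

Zero-copy means that when a 1-million-element np.float64 array enters Rust there is no 8 MB memory copy and no heap allocation; you are simply lent a pointer and a length. Replace that with s_array.to_vec() (an explicit copy) and you pay roughly 1 ms at the boundary. Wrapped around a kernel in the 10 µs range, that is a 100× overhead.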
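
The borrow-versus-copy distinction can be seen without NumPy or Rust. The sketch below is only an analogy built on the standard library: array.array stands in for a numpy array and memoryview stands in for the borrowed view that PyReadonlyArray1 gives Rust. The variable names are illustrative and not part of pricer_rs.

```python
from array import array

# One contiguous block of float64 values, standing in for a numpy array.
spots = array("d", [95.0, 97.5, 100.0, 102.5, 105.0])

# Zero-copy: memoryview borrows the existing buffer (pointer, length, format).
view = memoryview(spots)

# Explicit copy: a new allocation holding the same values.
copied = array("d", spots)

# A write to the original is visible through the view but not through the copy.
spots[0] = 1.0
shares_memory = view[0] == 1.0
copy_is_detached = copied[0] == 95.0

print(view.format, view.itemsize, view.nbytes)  # d 8 40
print(shares_memory, copy_is_detached)          # True True
```

The view costs the same regardless of array length, while the copy scales with the number of bytes; that scaling is the boundary overhead described above.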

Consuming from Python: `maturin develop` + `import`

The researcher's code does not change at all:

# Install: maturin develop in the crate dir, then in Python:
import numpy as np
import pricer_rs

# Scalar call
p = pricer_rs.price(s=100.0, k=100.0, r=0.05, sigma=0.20, t=0.25)
assert abs(p - 4.6151) < 0.001, p

# Batch call (zero-copy through the buffer protocol)
s_array = np.array([95.0, 97.5, 100.0, 102.5, 105.0], dtype=np.float64)
prices = pricer_rs.price_european_call_batch(s_array, k=100.0, r=0.05, sigma=0.20, t=0.25)
assert prices.shape == (5,)
print(prices)

# Class usage
pricer = pricer_rs.BlackScholesPricer(r=0.05, sigma=0.20)
for s in [95.0, 100.0, 105.0]:
    print(s, pricer.price_call(s=s, k=100.0, t=0.25))

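pricer_rs.price(100.0, 100.0, 0.05, 0.20, 0.25) returns $\approx 4.6151$ for the at-the-money case (tolerance $0.001$), the same regression constant as in L2, carried over from the worked examples in 3.4.1 / 3.5.1.

The `maturin` workflow: from development to distribution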
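
The regression constant can be cross-checked without building the extension. The following is a minimal pure-Python sketch of the same closed-form call price, using math.erf in place of libm::erf; the names norm_cdf and bs_call_reference are hypothetical helpers for this check and are not part of pricer_rs.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function (same identity as the Rust phi)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_reference(s: float, k: float, r: float, sigma: float, t: float) -> float:
    """Black-Scholes European call price, mirroring bs_call plus its input guard."""
    if t <= 0.0 or sigma <= 0.0:
        raise ValueError("t and sigma must be > 0")
    d1 = (math.log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


# ATM case from the text: d1 = 0.175, d2 = 0.075.
reference = bs_call_reference(100.0, 100.0, 0.05, 0.20, 0.25)
assert abs(reference - 4.6151) < 0.001, reference
```

Once the extension is installed, comparing pricer_rs.price against bs_call_reference on a handful of inputs is a cheap consistency test between the research and production paths.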

# 1. Create and activate a venv
$ python -m venv .venv && source .venv/bin/activate

# 2. Install maturin (one-time, into the active virtualenv)
$ pip install maturin

# 3. Develop loop (compile + install + use)
$ maturin develop          # debug build, instant install
$ maturin develop --release # release build, ~10-20s, optimised

# 4. Build wheel for distribution
$ maturin build --release
$ ls target/wheels/
pricer_rs-0.1.0-cp311-cp311-linux_x86_64.whl

# 5. Publish to PyPI (reads the token from the MATURIN_PYPI_TOKEN env var)
$ maturin publish

# 6. CI matrix via PyO3/maturin-action (in .github/workflows/release.yml):
#    uses: PyO3/maturin-action@v1
#    with:
#      command: build
#      target: ${{ matrix.target }}
#      args: --release --strip --interpreter python3.10 python3.11 python3.12

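maturin develop is the heart of the development loop: it compiles and installs into the venv in place, pip install -e style, and the next import pricer_rs in a fresh Python process picks up the new build. maturin build --release produces a wheel whose filename encodes the Python version (cp311 = CPython 3.11), the ABI tag, and the platform (linux_x86_64). Distribution to PyPI usually pairs PyO3/maturin-action (or cibuildwheel) with GitHub Actions to run a build matrix, producing wheels such as manylinux2014_x86_64, musllinux_1_2_x86_64, macosx_11_0_arm64, and win_amd64.

Type stubs (`.pyi`): letting mypy / pyright / IDEs see your API

A .so is opaque to Python type checkers: mypy --strict your_user_code.py reports that the pricer_rs module is installed but missing library stubs or a py.typed marker. Ship a pricer_rs.pyi stub file alongside it, in the same directory as the .so, and the tools can see the API signatures: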
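
The tag layout of a wheel filename can be made concrete with a few lines of Python. This is a minimal sketch under the assumption that the name has exactly five dash-separated fields (the optional build tag is not handled); parse_wheel_filename and WheelTags are illustrative names, not part of maturin or pip.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its five fields (no build tag support)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dash-separated fields, got {len(parts)}")
    return WheelTags(*parts)


tags = parse_wheel_filename("pricer_rs-0.1.0-cp311-cp311-linux_x86_64.whl")
print(tags.python_tag, tags.abi_tag, tags.platform_tag)  # cp311 cp311 linux_x86_64
```

pip uses these three tags to decide whether a wheel is installable on the current interpreter and platform, which is why a release needs one wheel per combination in the build matrix.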

# pricer_rs.pyi
"""Black-Scholes pricing extension implemented in Rust."""
import numpy as np
import numpy.typing as npt

def price(s: float, k: float, r: float, sigma: float, t: float) -> float:
    """Black-Scholes price of a European call option."""
    ...

def price_european_call_batch(
    s_array: npt.NDArray[np.float64],
    k: float, r: float, sigma: float, t: float,
) -> npt.NDArray[np.float64]:
    """Batch BS pricing over an array of underlying prices. Releases the GIL during computation."""
    ...

class BlackScholesPricer:
    r: float
    sigma: float
    def __init__(self, r: float, sigma: float) -> None: ...
    def price_call(self, s: float, k: float, t: float) -> float: ...
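
Stubs drift when the Rust signatures change and nobody updates the .pyi. One hedged way to catch that in CI is to parse the stub with the standard ast module and compare parameter names against what the built module reports. The sketch below covers only the parsing half; STUB_SOURCE is an abbreviated copy of the stub above and stub_signatures is a hypothetical helper.

```python
import ast

# Abbreviated copy of pricer_rs.pyi; in practice, read the real file from disk.
STUB_SOURCE = '''
def price(s: float, k: float, r: float, sigma: float, t: float) -> float: ...

class BlackScholesPricer:
    def __init__(self, r: float, sigma: float) -> None: ...
    def price_call(self, s: float, k: float, t: float) -> float: ...
'''


def stub_signatures(source: str) -> dict[str, list[str]]:
    """Map each top-level function and class method to its parameter names."""
    signatures: dict[str, list[str]] = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            signatures[node.name] = [a.arg for a in node.args.args]
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    key = f"{node.name}.{item.name}"
                    signatures[key] = [a.arg for a in item.args.args]
    return signatures


sigs = stub_signatures(STUB_SOURCE)
print(sigs["price"])  # ['s', 'k', 'r', 'sigma', 't']
```

The other half of the check would compare these names with inspect.signature(pricer_rs.price).parameters on the installed module, assuming the build exposes a text signature for each function.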

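Industry practice: Polars (heavily used at ByteDance Volcano Engine and by internal teams at TiKV) is the benchmark for this pattern: one Rust implementation + a Python entry point + numpy zero-copy + .pyi. The Pydantic v2 core (pydantic_core is Rust + PyO3), Cryptography, and Tokenizers (Hugging Face) all follow the same path. The standard pattern at private funds in China is that researchers import the PyO3 module you deliver inside a notebook, entirely unaware that Rust sits underneath.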

Exercise

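(a) Set up the PyO3 project: cargo new --lib pricer_rs; write Cargo.toml with [lib] crate-type = ["cdylib"] and the dependencies pyo3 = { version = "0.21", features = ["extension-module"] } + numpy = "0.21" + libm = "0.2" (bs_call uses libm::erf); write src/lib.rs containing the bs_call helper + #[pyfunction] fn price(...) + #[pyfunction] fn price_european_call_batch(...) + #[pyclass] BlackScholesPricer + the #[pymodule] entry point, as shown above. In a fresh virtual environment (python -m venv .venv && source .venv/bin/activate), run pip install maturin && maturin develop.
(b) Verify from Python: import pricer_rs; call pricer_rs.price(100.0, 100.0, 0.05, 0.20, 0.25); assert the result is $\approx 4.6151$ within tolerance $0.001$.
(c) Verify the batch path: s_array = np.linspace(80.0, 120.0, 1000) (1000 underlying prices); call prices = pricer_rs.price_european_call_batch(s_array, k=100.0, r=0.05, sigma=0.20, t=0.25); assert prices.shape == (1000,) and prices.dtype == np.float64. Use timeit to compare against a pure NumPy / scipy.stats.norm.cdf implementation (write your own or borrow one from 3.3.x); report the speedup (expect 5-50×, depending on whether the NumPy baseline is vectorized).
(d) Verify that releasing the GIL matters: in Python, spawn two threading.Thread instances, each calling price_european_call_batch once on an array of 10_000_000 elements; measure the wall-clock time of the two threads running in parallel against a single-thread baseline. With py.allow_threads: two threads in parallel should take $\approx$ the single-thread wall time (not 2×). Without allow_threads: the two threads are serialized by the GIL and the wall time is $\approx$ 2× the single-thread figure. Report both numbers.
(e) Verify the class API: pricer = pricer_rs.BlackScholesPricer(r=0.05, sigma=0.20); for s in [95.0, 100.0, 105.0]: print(s, pricer.price_call(s=s, k=100.0, t=0.25)); check that the three prices increase monotonically with s.
(f) Build the wheel: maturin build --release; look at the wheel name in target/wheels/; pip install target/wheels/pricer_rs-0.1.0-cp311-cp311-linux_x86_64.whl (substitute your actual wheel name) into a fresh venv; the import succeeds without the source crate.
(g) Write pricer_rs.pyi (as above) and place it next to the installed extension module; verify that mypy --strict your_user_code.py can see the API.
(h) Three sentences each, no implementation: (i) why py.allow_threads is indispensable for any Rust extension doing real CPU work, and what the production rule is (answer: release the GIL when the Rust work exceeds ~10 µs and does not touch Python objects); (ii) why the numpy crate's borrowed-view path is zero-copy, and what replacing it with s_array.to_vec() costs on a 1-million-element array (answer: an 8 MB f64 memory copy + a heap allocation, ~1 ms wasted at the boundary); (iii) why .pyi type stubs matter for a production PyO3 extension (answer: Python type checkers treat the .so as opaque; IDE autocompletion and mypy --strict only work with a .pyi).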

Hint

Common reasons the first maturin develop fails: you are not inside a venv, or Cargo.toml is missing [lib] crate-type = ["cdylib"]. Run which python && which pip first to confirm you are in the venv.
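
The same venv check can be done from inside Python, which helps when the notebook kernel and the shell disagree about which interpreter is active. This is a small sketch; in_virtualenv is an illustrative helper, and it only inspects the running interpreter, not the environment maturin itself will detect.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv (prefix differs from base)."""
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a venv:", in_virtualenv())
```

If this prints False in the kernel where the import fails, the extension was most likely installed into a different environment than the one the notebook is running.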

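Hint

Why the GIL test tends to go wrong: Python will not load the same extension twice in one process. So the "with / without allow_threads" variants need to be built under two crate names (e.g. pricer_rs_gil and pricer_rs_no_gil), or switched between two code paths with a feature flag in the source.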
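
A small timing harness keeps the two measurements comparable. The sketch below is an assumption-laden stand-in: wall_time and parallel_ratio are hypothetical helpers, and the demo workload is time.sleep (which releases the GIL) because the extension is not available here.

```python
import threading
import time
from typing import Callable


def wall_time(fn: Callable[[], object], n_threads: int) -> float:
    """Run fn once in each of n_threads threads; return elapsed wall-clock seconds."""
    threads = [threading.Thread(target=fn) for _ in range(n_threads)]
    start = time.perf_counter()
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return time.perf_counter() - start


def parallel_ratio(fn: Callable[[], object], n_threads: int = 2) -> float:
    """Near 1.0: the calls overlapped. Near n_threads: they were serialized."""
    single = wall_time(fn, 1)
    return wall_time(fn, n_threads) / single


# Stand-in workload: time.sleep releases the GIL, so the ratio should be near 1.
ratio = parallel_ratio(lambda: time.sleep(0.05))
print(f"2-thread / 1-thread wall time: {ratio:.2f}")
```

For the exercise, swap the lambda for a call to price_european_call_batch on a preallocated 10_000_000-element array, once per build variant, and compare the two ratios.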

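The five fenced code blocks in this lesson (the Cargo.toml configuration, src/lib.rs with its three kinds of entry point, the Python-side consumer, the maturin workflow, and the .pyi type stub) form the minimal complete skeleton that takes any PyO3 extension from zero to release; rename things and reuse it directly.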

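Deep dive: the workflow split between research and production

The split between research and production workflows at private and public funds in China has stabilized by 2026. Research side = Python + Jupyter + pandas / Polars + numpy + scikit-learn / PyTorch + matplotlib: researchers iterate quickly, visualize, and keep their reasoning trail in notebooks. Production side = Rust + tokio + in-house matching / risk control: once live, latency is controllable, memory is safe, and dependencies can be audited. There are only two standard ways to glue the two sides together: (1) rewrite the research-side algorithm as a Rust extension with PyO3, so the researcher imports it in a notebook and calls the Rust implementation; (2) persist production-side market data / matching results to disk via Parquet / Arrow IPC, so researchers consume them in the other direction when they run backtests after hours. This lesson covers (1); (2) belongs to the 3.6 data engineering topic.

Selection decision: when to reach for PyO3

PyO3's ROI depends on whether the Python-side bottleneck is the CPU. Three typical scenarios: (a) CPU-bound, purely numerical, vectorizable: the worked example in this lesson (batch BS pricing) falls here. NumPy already takes common operations of this kind down to SIMD level, so what PyO3 buys you is mainly (i) avoiding Python interpreter overhead, (ii) releasing the GIL so multiple threads can run in parallel, and (iii) the biggest speedups (commonly 10-100×) when the algorithm is not in NumPy's ufunc library (complex control flow, nested loops). (b) IO-heavy but needing a complex state machine: FIX protocol parsing, incremental order-book updates, market-data decoding. The researcher writes the prototype in Python, you rewrite the state machine with PyO3, and the Python side shrinks to a short loop, for evt in stream: book.apply(evt). (c) Reusing libraries from the Rust ecosystem: Polars (DataFrame), the Pydantic v2 core, Cryptography, Tokenizers. The Python user gets the pip install polars experience, with Rust underneath.

The first and third categories go straight down the PyO3 + numpy path; the second often needs tokio + pyo3-async-runtimes so the Rust async runtime can interoperate with the Python event loop, which is a separate topic.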

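GIL deep dive: boundary conditions of `py.allow_threads`

Inside the py.allow_threads(|| { ... }) closure you must never touch any Python object: PyAny, PyList, PyDict, Bound<'py, T>, none of them. The reason: while the GIL is released another Python thread may be modifying those objects, and your concurrent access would be a data race. Only two kinds of input are safe: (1) a plain &[T] already taken out of a numpy array via as_slice()?, which is the address of the buffer's bytes, not a Python object; (2) Rust scalars already converted with .extract::<f64>()? and the like. Only once the closure exits and allow_threads returns can you go back to working with Python objects.

Second boundary: a panic inside the closure unwinds to the Python side after the GIL is reacquired and surfaces as a pyo3_runtime.PanicException. A researcher seeing one is not a production incident, but it is not a good look either; production code returns errors as Result inside the closure and translates the Err into an ordinary exception such as PyValueError::new_err(...) outside it.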

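Industry practice: a reading path through comparable projects

Reading source code is the most efficient way to master PyO3. Three strongly recommended repositories:

Polars (github.com/pola-rs/polars): a DataFrame engine with a Rust core and Python / Node.js / R entry points. The crates/polars-python subdirectory is a living dictionary of PyO3 usage: #[pyclass] wrapping DataFrame, #[pymethods] exposing hundreds of methods, heavy use of numpy zero-copy, and py.allow_threads releasing the GIL on every CPU-heavy expression evaluation path. ByteDance Volcano Engine, internal teams at TiKV, and several leading private funds in China run Polars in production.
pydantic-core (github.com/pydantic/pydantic-core): the core data validation engine of Pydantic v2, Rust + PyO3. It shows how to expose schemas with #[pyclass], accept arbitrary Python input with Bound<'py, PyAny>, and derive custom exceptions from PyErr.
tokenizers (github.com/huggingface/tokenizers): Hugging Face's BPE / WordPiece / Unigram (SentencePiece-style) tokenizer implementations, Rust core + PyO3 + Node.js entry points. It shows a moderately complex state machine + Vec<u32> output + multi-language bindings.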

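A slice of the real workflow at private funds in China

The research side is driven by researchers, one JupyterHub instance per person, with wildly varying code style, fast iteration, and plenty of bugs; its output is research ideas, not production code. The production side is taken over by a separate engineering team that rewrites in Rust, measures latency, plans capacity, runs backtests to verify consistency, and then goes live. The PyO3 extension is the key middle layer: it lets the research side call a production-quality implementation directly, avoiding the bizarre situation where "research runs in 47 minutes, production runs in 47 milliseconds, and the two results differ by 5%". With both sides on the same implementation, that class of consistency problem goes away.

Typical PyO3 deployments in China

Parts of MindSpore's automatic differentiation kernels go through Rust + PyO3 and interoperate with the Huawei Ascend backend.
In the PaddlePaddle ecosystem, performance-sensitive data preprocessing operators are implemented in Rust and exposed to training pipelines through PyO3.
Quant research platforms such as JoinQuant, Ricequant, and Bigquant have Rust acceleration in practice on their backend data-slicing and factor-computation paths.
The ByteDance Feishu (Lark) collaboration engine and the Volcano Engine ML Serving backend follow a similar path.
PyO3 components have been spotted in the DiDi data governance SDK and the Meituan SRE toolchain.
Tencent WeChat backend, Alibaba Taobao risk control, JD Logistics route planning, and Pinduoduo recommendation recall all have experience landing Rust operators.

CSDN, Juejin, SegmentFault, InfoQ China, and the WeChat official account Rust 语言中文社区 (Rust Chinese community) publish dozens of practice write-ups each year on PyO3 / maturin / Polars in Chinese settings.

The next lesson (L4, "Production Rust: FIX, Kernel Bypass, and PGO") is the production-hardening finale of this module and of the whole 3.5 topic. It takes the matching engine from 3.5.3 L4 and adds a FIX 4.4 session state machine, tokio-uring standing in for kernel bypass, cargo pgo profile-guided compilation, tracing + tokio-console observability, and a deployment checklist of cargo deny + cargo audit + jemallocator + reproducible builds, compressing it into one production binary that can be deployed to an NY4 / Zhangjiang colo.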
