Giving Python Algorithms the Wings of Performance: pybind11 in Practice

Reposted from: https://zhuanlan.zhihu.com/p/444805518
Author: jesonxiang (向乾彪), backend development engineer, Tencent TEG

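1. Background

Today, AI algorithm development, and training in particular, is done mostly in Python, and mainstream AI frameworks such as TensorFlow and PyTorch all provide rich Python interfaces. As the saying goes, life is short, I use Python. However, Python is a dynamic, interpreted language without a mature JIT solution, and its multi-core concurrency is limited in compute-intensive scenarios, so it struggles to meet the demands of high-performance real-time serving directly. Where performance requirements are strict, C/C++ is still needed. But requiring algorithm engineers to write every online inference service in C++ is very costly: it hurts development efficiency and wastes resources. A lightweight way to combine Python with core code written in C++ would therefore preserve both development efficiency and service performance. This article describes how pybind11 was used to accelerate Python algorithms in Tencent Advertising's multimedia AI, along with some lessons learned along the way.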

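2. Industry solutions

2.1 The native approach

Python officially provides the Python/C API, which makes it possible to write Python libraries in C. Here is a snippet to get a feel for it: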

static PyObject *
spam_system(PyObject *self, PyObject *args)
{
    const char *command;
    int sts;
    if (!PyArg_ParseTuple(args, "s", &command))
        return NULL;
    sts = system(command);
    return PyLong_FromLong(sts);
}

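As you can see, the porting cost is very high: every basic type has to be converted by hand into the binding types wrapped by the CPython interpreter. It is no surprise that the official Python website also recommends third-party solutions [1].

2.2 Cython

Cython mainly bridges Python and C and makes it easy to write C extensions for Python. The Cython compiler translates Python code into C code, and that C code can call the Python/C API. In essence, Cython is Python with C data types. It is used in numpy and in our company's tRPC-Python framework.

Drawbacks:

Cython-specific syntax (cdef and so on) has to be added by hand, so porting and reuse are costly
Extra files such as setup.py and *.pyx are needed before the Python code can be turned into faster C code
The level of C++ support is questionable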

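2.3 SWIG

SWIG mainly addresses interoperation between other high-level languages and C/C++. It supports more than a dozen programming languages, including Java, C#, JavaScript, and Python. To use it, you define the interface in a *.i file and then run a tool that generates the cross-language glue code. Because it targets so many languages, however, its performance on the Python side is not great.

It is worth noting that TensorFlow also used SWIG to wrap its Python interface in its early days. Precisely because SWIG suffered from mediocre performance, complicated builds, and obscure, hard-to-read binding code, TensorFlow switched from SWIG to pybind11 in 2019 [2].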

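2.4 Boost.Python

Boost, the open-source library collection widely used in C++, also provides Python binding. It uses macros and metaprogramming to simplify calls into the Python API. Its biggest drawback is the dependency on the enormous Boost library: the compile-time and dependency burden is heavy, and using it only for Python binding feels like swatting a mosquito with an anti-aircraft gun.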

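2.5 pybind11

pybind11 can be thought of as a slimmed-down library modeled on Boost.Python that provides only Python and C++ binding, with considerable advantages over Boost.Python in binary size and compile speed. Its C++ support is very good and it builds on various C++11 features, which may be where the "11" suffix comes from.

pybind11 infers type information through C++ compile-time introspection, minimizing the tedious boilerplate traditionally needed to write Python extension modules. It automatically converts common data types such as STL data structures, smart pointers, classes, overloaded functions, and instance methods to Python, and functions can accept and return custom data types by value, pointer, or reference.

Features:

Lightweight and single-purpose: focused on C++ and Python binding, with concise glue code
Works well with common C++ data types such as STL containers and Python libraries such as numpy, with no manual conversion cost
Header-only: no extra code generation is needed, bindings are established at compile time, and the binary is smaller
Supports modern C++ features, handles C++ overloading and inheritance, and is easy to debug
Backed by thorough official documentation and used in many well-known open-source projects

"Talk is cheap, show me your code." Three lines of code are all it takes to create a binding: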

PYBIND11_MODULE(libcppex, m) {
    m.def("add", [](int a, int b) -> int { return a + b; });
}

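3. Calling C++ from Python

3.1 Starting with the GIL

The GIL (Global Interpreter Lock) allows only one thread per process to use the interpreter at any moment, so multithreading cannot truly use multiple cores. Because the thread holding the lock automatically releases the GIL when it reaches waiting operations such as I/O-bound calls, multithreading is still effective for I/O-bound services. For CPU-bound work, however, only one thread can actually compute at a time, and the impact on performance is easy to imagine.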
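The I/O half of that claim is easy to observe with the standard library alone. The sketch below is only an illustration: the helper names are made up for this example, and time.sleep stands in for a blocking I/O call during which CPython releases the GIL.

```python
import threading
import time


def blocking_wait(seconds: float) -> None:
    # time.sleep stands in for blocking I/O; CPython releases the GIL while waiting.
    time.sleep(seconds)


def run_in_threads(worker, args, thread_count: int) -> float:
    """Run worker(*args) in thread_count threads and return the wall time in seconds."""
    threads = [threading.Thread(target=worker, args=args) for _ in range(thread_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_in_threads(blocking_wait, (0.2,), 4)
    # The four waits overlap, so the total stays near one wait, not four.
    print(f"4 threads x 0.2 s wait took {elapsed:.2f} s")
```

If the waits were serialized by the GIL, the total would approach 0.8 s; because the lock is dropped during the wait, it stays close to 0.2 s.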

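It must be pointed out that the GIL is not a flaw of the Python language itself but a thread-safety lock introduced by CPython, the interpreter Python uses by default. When we say Python has a GIL, we really mean the CPython interpreter. So if we can find a way around the GIL, can we get a good speedup? The answer is yes. One option is to switch to another interpreter such as PyPy, but its compatibility with mature C extension libraries is not good enough and the maintenance cost is high. The other option is to wrap the compute-intensive code in a C/C++ extension and release the GIL while it runs.

3.2 Python algorithm performance optimization

pybind11 provides an interface for releasing the GIL manually on the C++ side. We therefore only need to rewrite the compute-intensive code in C++, release the GIL before it runs, and re-acquire it afterwards, and the multi-core computing power of the Python algorithm is unlocked. Besides explicitly calling the interface to release the GIL, it is also possible to hand the compute-intensive code to other C++ threads for asynchronous execution inside C++, which likewise sidesteps the GIL and uses multiple cores.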
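The same pattern already exists in the standard library, which makes it easy to try before writing any C++. hashlib is implemented in C and, according to its documentation, releases the GIL while hashing data larger than 2047 bytes. The sketch below is an analogy rather than pybind11 code; the function names and payload size are chosen only for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# 8 MiB payload, far above the 2047-byte threshold at which hashlib drops the GIL.
PAYLOAD = b"x" * (8 * 1024 * 1024)


def digest(data: bytes) -> str:
    # The hashing loop runs in C with the GIL released, like the C++ extension in this article.
    return hashlib.sha256(data).hexdigest()


def digest_in_threads(data: bytes, thread_count: int) -> list:
    """Hash the same payload once per thread; the C-level work can overlap across cores."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(digest, [data] * thread_count))


if __name__ == "__main__":
    results = digest_in_threads(PAYLOAD, 4)
    print(len(results), "digests, all equal:", len(set(results)) == 1)
```

A pybind11 extension that wraps its hot loop in py::gil_scoped_release behaves the same way from the point of view of the Python threads calling it.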

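The following compares performance before and after the C++ extension, using one million great-circle distance computations between cities as the example:

C++ side: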

#include <math.h>
#include <stdio.h>
#include <time.h>
// pybind11.h is sufficient for an extension module; embed.h is only needed
// when embedding the interpreter in a C++ program.
#include <pybind11/pybind11.h>

namespace py = pybind11;

const double pi = 3.1415926535897932384626433832795;

double rad(double d) {
    return d * pi / 180.0;
}

double geo_distance(double lon1, double lat1, double lon2, double lat2, int test_cnt) {
    // Release the GIL for the rest of this function. It is re-acquired
    // automatically when `release` goes out of scope, so no explicit
    // gil_scoped_acquire is needed before returning.
    py::gil_scoped_release release;
    double a, b, s;
    double distance = 0;
    for (int i = 0; i < test_cnt; i++) {
        double radLat1 = rad(lat1);
        double radLat2 = rad(lat2);
        a = radLat1 - radLat2;
        b = rad(lon1) - rad(lon2);
        s = pow(sin(a/2),2) + cos(radLat1) * cos(radLat2) * pow(sin(b/2),2);
        distance = 2 * asin(sqrt(s)) * 6378 * 1000;
    }
    return distance;
}

PYBIND11_MODULE(libcppex, m) {
    m.def("geo_distance", &geo_distance,
          R"pbdoc(Compute geography distance between two places.)pbdoc");
}

Python caller side:

import sys
import time
import math
import threading
from libcppex import geo_distance  # the C++ extension built from the code above


def rad(d):
    return d * 3.1415926535897932384626433832795 / 180.0


def geo_distance_py(lon1, lat1, lon2, lat2, test_cnt):
    distance = 0
    for i in range(test_cnt):
        radLat1 = rad(lat1)
        radLat2 = rad(lat2)
        a = radLat1 - radLat2
        b = rad(lon1) - rad(lon2)
        s = math.sin(a/2)**2 + math.cos(radLat1) * math.cos(radLat2) * math.sin(b/2)**2
        distance = 2 * math.asin(math.sqrt(s)) * 6378 * 1000
    print(distance)
    return distance


def call_cpp_extension(lon1, lat1, lon2, lat2, test_cnt):
    res = geo_distance(lon1, lat1, lon2, lat2, test_cnt)
    print(res)
    return res


if __name__ == "__main__":
    threads = []
    test_cnt = 1000000
    test_type = sys.argv[1]        # 'p' = pure Python, 'c' = C++ extension
    thread_cnt = int(sys.argv[2])  # number of threads to start
    start_time = time.time()
    for i in range(thread_cnt):
        if test_type == 'p':
            t = threading.Thread(target=geo_distance_py,
                                 args=(113.973129, 22.599578, 114.3311032, 22.6986848, test_cnt,))
        elif test_type == 'c':
            t = threading.Thread(target=call_cpp_extension,
                                 args=(113.973129, 22.599578, 114.3311032, 22.6986848, test_cnt,))
        threads.append(t)
        t.start()
    for thread in threads:
        thread.join()
    print('calc time = %d' % int((time.time() - start_time) * 1000))
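Both implementations above evaluate the haversine formula for great-circle distance. Written out, with latitudes and longitudes converted to radians (the symbols are chosen here for illustration; the Earth radius of 6378 km is the constant used in the code):

```latex
\begin{aligned}
s &= \sin^2\!\left(\frac{\varphi_1 - \varphi_2}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_1 - \lambda_2}{2}\right) \\
d &= 2R \arcsin\sqrt{s}, \qquad R = 6378 \times 1000\ \text{m}
\end{aligned}
```

Here the latitudes correspond to lat1 and lat2 and the longitudes to lon1 and lon2. Every loop iteration recomputes the same distance, so test_cnt only controls how much work each thread performs.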

Performance comparison:

Single-thread time: Python 964 ms, C++ 7 ms

$ python test.py p 1
38394.662146601186
calc time = 964
$ python test.py c 1
38394.662146601186
calc time = 7

10-thread time: Python 18681 ms, C++ 13 ms

$ python test.py p 10
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
calc time = 18681
$ python test.py c 10
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
38394.662146601186
calc time = 13

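CPU utilization:

Python threads cannot compute on multiple cores at the same time, so utilization is equivalent to a single core
C++ can saturate all 10 CPU cores of the DevCloud machine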
[Figures: CPU utilization during the Python run and the C++ run]

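Conclusion:

For compute-intensive code, simply switching to a C++ implementation already yields a solid speedup. With multithreading and the GIL released, multiple cores are fully used, performance scales close to linearly, and resource utilization rises sharply. Python multiprocessing can also be used to exploit multiple cores in practice, but models keep growing and now often reach tens of gigabytes. Memory usage becomes excessive, and the context-switching overhead between processes plus the performance gap of the language itself still leave it well behind the C++ extension approach.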
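The ratios implied by the measurements can be worked out directly. The short sketch below only does arithmetic on the four timings reported above; the function and key names are illustrative, and since the C++ figures are a few milliseconds measured at millisecond resolution, the ratios should be read as rough magnitudes.

```python
# Wall-clock timings in milliseconds, copied from the benchmark output in this article.
TIMINGS_MS = {
    1: {"python": 964, "cpp": 7},
    10: {"python": 18681, "cpp": 13},
}


def summarize(timings: dict) -> dict:
    """Return the speedup and per-thread cost for each thread count."""
    rows = {}
    for threads, t in sorted(timings.items()):
        rows[threads] = {
            "speedup": t["python"] / t["cpp"],
            "python_ms_per_thread": t["python"] / threads,
            "cpp_ms_per_thread": t["cpp"] / threads,
        }
    return rows


if __name__ == "__main__":
    for threads, row in summarize(TIMINGS_MS).items():
        print(threads, {k: round(v, 1) for k, v in row.items()})
```

With ten threads the Python run takes roughly 19 times as long as the single-thread run for ten times the work, while the C++ run takes less than twice as long, which is what the near-linear scaling claim refers to.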

Note: the test demo above is on GitHub at https://github.com/jesonxiang/cpp_extension_pybind11. The test environment was a container with 10 CPU cores; feel free to run the benchmark yourself.

3.3 Build environment

Build command:

g++ -Wall -shared -std=gnu++11 -O2 -fvisibility=hidden -fPIC -I./ perfermance.cc -o libcppex.so `python3-config --cflags --ldflags --libs`

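If the Python environment is not configured correctly, this command may fail with an error.

The Python dependencies here are supplied automatically by python3-config --cflags --ldflags --libs; you can run that command on its own first to check that they are configured correctly. python3-config requires python3-dev to work, which can be installed with: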

apt install python3-dev
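A quick check from inside Python can complement python3-config. The sketch below uses only the standard library; the function name is made up for this example. It reports where the interpreter expects its headers, whether Python.h is actually present there, and which file suffixes the interpreter accepts for extension modules.

```python
import importlib.machinery
import os
import sysconfig


def extension_build_info() -> dict:
    """Collect the interpreter settings that matter when compiling an extension."""
    include_dir = sysconfig.get_paths()["include"]
    return {
        "python_version": sysconfig.get_python_version(),
        "include_dir": include_dir,
        # Python.h is what the development package provides.
        "has_python_h": os.path.isfile(os.path.join(include_dir, "Python.h")),
        # File suffixes the import system accepts for compiled modules.
        "extension_suffixes": list(importlib.machinery.EXTENSION_SUFFIXES),
    }


if __name__ == "__main__":
    for key, value in extension_build_info().items():
        print(f"{key}: {value}")
```

If has_python_h is False, the compile step is likely to fail on a missing Python.h, which is the situation the package installation above is meant to fix.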

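4. Calling Python from C++

pybind11 is usually used to wrap C++ code with a Python interface, but the reverse, calling Python from C++, is also supported. You only need to #include <pybind11/embed.h>; internally it works by embedding the CPython interpreter. It is simple to use and quite readable, and looks very much like calling the Python API directly. For example, calling a few methods on a numpy array looks like this: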

// C++
pyVec = pyVec.attr("transpose")().attr("reshape")(pyVec.size());
# Python
pyVec = pyVec.transpose().reshape(pyVec.size)

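Below, we use the high-performance C++ GPU frame-extraction shared library (.so) we developed as the example. Besides exposing the frame-extraction interface to Python, it also has to call back into Python to report extraction progress and deliver frame data.

Python-side callback: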

import os
# DecoderWrapper comes from the C++ extension module (its import is not shown in the original)

def on_decoding_callback(task_id: str, progress: int):
    print("decoding callback, task id: %s, progress: %d" % (task_id, progress))

if __name__ == "__main__":
    decoder = DecoderWrapper()
    decoder.register_py_callback(os.getcwd() + "/decode_test.py",
                                 "on_decoding_callback")

C++ side: registering the interface and calling back into Python:

#include <string>
#include <pybind11/embed.h>

namespace py = pybind11;

// Assumption: py_callback is a py::object declared elsewhere in the project.
// SoInfo/LogError, the error codes, and the py_get_module_* helpers are also
// project code that is not shown here.
int DecoderWrapper::register_py_callback(const std::string &py_path,
                                         const std::string &func_name) {
    int ret = 0;
    const std::string &pyPath = py_get_module_path(py_path);
    const std::string &pyName = py_get_module_name(py_path);
    SoInfo("get py module name: %s, path: %s", pyName.c_str(), pyPath.c_str());

    // Hold the GIL while touching Python objects. It is released again
    // automatically when `acquire` goes out of scope.
    py::gil_scoped_acquire acquire;
    try {
        py::object sys = py::module::import("sys");
        sys.attr("path").attr("append")(py::str(pyPath.c_str()));  // directory of the Python script
        // py::module::import throws py::error_already_set on failure
        // instead of returning NULL, so failures are handled in the catch.
        py::module pyModule = py::module::import(pyName.c_str());
        if (py::hasattr(pyModule, func_name.c_str())) {
            py_callback = pyModule.attr(func_name.c_str());
        } else {
            ret = PYTHON_FUNC_NOT_FOUND_ERROR;
        }
    } catch (py::error_already_set const &PythonErr) {
        LogError("Failed to load pyModule: %s", PythonErr.what());
        ret = PYTHON_FILE_NOT_FOUND_ERROR;
    }
    return ret;
}

int DecoderListener::on_decoding_progress(std::string &task_id, int progress) {
    if (py_callback) {  // an empty (default-constructed) py::object tests false
        // Acquire outside the try block so the GIL is still held in the handlers.
        py::gil_scoped_acquire acquire;
        try {
            py_callback(task_id, progress);
        } catch (py::error_already_set const &PythonErr) {
            LogError("caught Python exception: %s", PythonErr.what());
        } catch (const std::exception &e) {
            LogError("caught exception: %s", e.what());
        } catch (...) {
            LogError("caught unknown exception");
        }
    }
    return 0;  // assumed success code; the original had no return statement
}
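For readers more at home in Python, the registration logic above corresponds roughly to the following pure-Python steps: add the script's directory to sys.path, import the module by name, and look up the function. This is a sketch of the equivalent behaviour, not the author's code; load_callback and the demo module name are hypothetical.

```python
import importlib
import os
import sys
import tempfile


def load_callback(py_path: str, func_name: str):
    """Import the module at py_path and return its func_name attribute, or None."""
    module_dir = os.path.dirname(os.path.abspath(py_path))
    module_name = os.path.splitext(os.path.basename(py_path))[0]
    if module_dir not in sys.path:
        sys.path.append(module_dir)  # same effect as sys.path.append on the C++ side
    module = importlib.import_module(module_name)  # raises ImportError if missing
    if not hasattr(module, func_name):
        return None
    return getattr(module, func_name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "decode_demo_module.py")
        with open(script, "w") as f:
            f.write(
                "def on_decoding_callback(task_id, progress):\n"
                "    return 'task %s at %d%%' % (task_id, progress)\n"
            )
        callback = load_callback(script, "on_decoding_callback")
        print(callback("demo-task", 50))
```

The C++ version differs mainly in that it must hold the GIL for each of these steps, and in that a failed import surfaces as a py::error_already_set exception.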

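5. Data type conversion

5.1 Class member functions

Binding a class and its member functions requires constructing an object first, so it takes two steps: first wrap the constructor, then register how the member functions are accessed. Static methods and member variables can also be bound with def_static and def_readwrite; see the official documentation [3] for details.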

#include <iostream>
#include <string>
#include <pybind11/pybind11.h>

class Hello {
public:
    Hello() {}
    void say(const std::string s) {
        std::cout << s << std::endl;
    }
};

PYBIND11_MODULE(py2cpp, m) {
    m.doc() = "pybind11 example";
    pybind11::class_<Hello>(m, "Hello")
        // Constructor binding for the C++ constructor. If it is missing or its
        // parameters do not match, constructing the object from Python fails.
        .def(pybind11::init<>())
        .def("say", &Hello::say);
}
/* Python usage:
   c = py2cpp.Hello()
   c.say("hello")
*/

5.2 STL containers

pybind11 supports automatic conversion of STL containers; you only need to include the extra header <pybind11/stl.h>. The automatic conversions include: std::vector<>/std::list<>/std::array<> to a Python list; std::set<>/std::unordered_set<> to a Python set; and std::map<>/std::unordered_map<> to a dict. Conversions for std::pair<> and std::tuple<> are already provided by <pybind11/pybind11.h>.

#include <iostream>
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

class ContainerTest {
public:
    ContainerTest() {}
    void Set(std::vector<int> v) {
        mv = v;
    }
private:
    std::vector<int> mv;
};

PYBIND11_MODULE(py2cpp, m) {
    m.doc() = "pybind11 example";
    pybind11::class_<ContainerTest>(m, "CTest")
        .def(pybind11::init<>())
        .def("set", &ContainerTest::Set);
}
/* Python usage:
   c = py2cpp.CTest()
   c.set([1, 2, 3])
*/

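5.3 Passing bytes and string types

When a C++ function returns a std::string, pybind11 converts it to a Python 3 str by decoding it as UTF-8. If serialized protobuf data is passed from C++ to Python as a string, this produces an error such as "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte".

Solution: pybind11 provides py::bytes, a binding type for non-text data: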

m.def("return_bytes",
    []() {
        std::string s("\xba\xd0\xba\xd0");  // Not valid UTF-8
        return py::bytes(s);  // Return the data without transcoding
    }
);
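The failure can be reproduced on the Python side without any C++ at all, using the same four bytes as the snippet above. The helper name below is illustrative only.

```python
# The same bytes as in the C++ example; they do not form valid UTF-8.
RAW = b"\xba\xd0\xba\xd0"


def try_decode_utf8(data: bytes):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


if __name__ == "__main__":
    text, error = try_decode_utf8(RAW)
    print("decoded:", text)
    print("error:", error)
    # Kept as bytes, the payload is passed through untouched.
    print("as bytes:", RAW, "length:", len(RAW))
```

Binary payloads such as serialized protobuf messages should therefore stay as bytes end to end and only be interpreted by the code that knows their format.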

5.4 Smart pointers

std::unique_ptr: pybind11 supports direct conversion:

std::unique_ptr<Example> create_example() {
    return std::unique_ptr<Example>(new Example());
}

m.def("create_example", &create_example);

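std::shared_ptr: take special care not to return raw pointers directly. When a class is registered with a std::shared_ptr holder, pybind11 wraps a returned raw pointer in a new, independent shared_ptr, so the same object ends up with two owners. The get_child function below therefore triggers a memory access error (such as a segmentation fault) when called from Python; returning std::shared_ptr<Child> instead avoids the problem.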

class Child { };

class Parent {
public:
    Parent() : child(std::make_shared<Child>()) { }
    Child *get_child() { return child.get(); }  /* Hint: ** DON'T DO THIS ** */
private:
    std::shared_ptr<Child> child;
};

PYBIND11_MODULE(example, m) {
    py::class_<Child, std::shared_ptr<Child>>(m, "Child");

    py::class_<Parent, std::shared_ptr<Parent>>(m, "Parent")
        .def(py::init<>())
        .def("get_child", &Parent::get_child);
}

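5.5 Converting between cv::Mat and numpy

When frame-extraction results are returned to Python, pybind11 does not currently convert the cv::Mat data structure automatically, so the binding between C++ cv::Mat and Python numpy has to be handled by hand. The conversion code is as follows: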

/* Python -> C++ Mat */
cv::Mat numpy_uint8_3c_to_cv_mat(py::array_t<uint8_t>& input) {
    if (input.ndim() != 3)
        throw std::runtime_error("3-channel image must be 3 dims ");
    py::buffer_info buf = input.request();
    // The Mat wraps the numpy buffer without copying, so the array must outlive it.
    cv::Mat mat(buf.shape[0], buf.shape[1], CV_8UC3, (uint8_t*)buf.ptr);
    return mat;
}

/* C++ Mat -> numpy */
py::array_t<uint8_t> cv_mat_uint8_3c_to_numpy(cv::Mat& input) {
    // No base object is passed, so pybind11 copies the pixel data into the new array.
    py::array_t<uint8_t> dst = py::array_t<uint8_t>({input.rows, input.cols, 3}, input.data);
    return dst;
}

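5.6 Zero copy

Cross-language calls generally incur some performance overhead, especially when passing large blocks of data. pybind11 therefore also supports passing the data address, which avoids copying large blocks in memory and improves performance considerably.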

class Matrix {
public:
    Matrix(size_t rows, size_t cols) : m_rows(rows), m_cols(cols) {
        m_data = new float[rows*cols];
    }
    float *data() { return m_data; }
    size_t rows() const { return m_rows; }
    size_t cols() const { return m_cols; }
private:
    size_t m_rows, m_cols;
    float *m_data;
};

py::class_<Matrix>(m, "Matrix", py::buffer_protocol())
    .def_buffer([](Matrix &m) -> py::buffer_info {
        return py::buffer_info(
            m.data(),                               /* Pointer to buffer */
            sizeof(float),                          /* Size of one scalar */
            py::format_descriptor<float>::format(), /* Python struct-style format descriptor */
            2,                                      /* Number of dimensions */
            { m.rows(), m.cols() },                 /* Buffer dimensions */
            { sizeof(float) * m.cols(),             /* Strides (in bytes) for each index */
              sizeof(float) }
        );
    });
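The fields passed to py::buffer_info (item size, format, number of dimensions, shape, strides) are the same ones Python's buffer protocol exposes through memoryview. The sketch below uses only the standard library to show a 2 x 3 float view over a flat array that shares memory with it; as_matrix_view is a name invented for this example and is not part of pybind11.

```python
from array import array


def as_matrix_view(flat: array, rows: int, cols: int) -> memoryview:
    """Expose a flat float array as a rows x cols view without copying."""
    if len(flat) != rows * cols:
        raise ValueError("rows * cols must match the number of elements")
    # memoryview can only reshape via a byte view, hence the two casts.
    return memoryview(flat).cast("B").cast("f", shape=[rows, cols])


if __name__ == "__main__":
    data = array("f", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    view = as_matrix_view(data, 2, 3)
    # These attributes mirror the arguments of py::buffer_info.
    print(view.format, view.itemsize, view.ndim, view.shape, view.strides)
    view[1, 2] = 42.0   # write through the view ...
    print(data[5])      # ... and the underlying array sees the change
```

Because the view and the array share one block of memory, the write is visible on both sides, which is the behaviour def_buffer gives a consumer of the Matrix class.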

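6. Adoption and industry usage

We have already deployed the approach above in algorithms such as the color-extraction services and the high-performance GPU frame extraction of the advertising multimedia AI team, with very good speedups. Across the industry, most AI computing frameworks on the market, such as TensorFlow, PyTorch, Alibaba's X-DeepLearning, and Baidu's PaddlePaddle, use pybind11 to wrap their C++ code for Python, so its stability and performance have been widely validated.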