
PySpark - Testing the Python Package


PySpark - Python Package Management

PySpark provides a way to strip the Python environment out of the container image. As is well known, when optimizing Docker images, reducing the image size saves resources and improves efficiency, so shipping the Python environment separately is a way to slim the image down considerably.

In this post, I mainly use conda to package the Python environment and test submitting jobs to a k8s cluster in both client and cluster mode.

Environment preparation

  • A k8s cluster
  • NFS - PVC (an NFS PVC configured on the k8s cluster)
  • An image from which Spark tasks can be submitted

Start a container

  1. Start a container from the image that can submit Spark tasks
  2. Enter the container and do the preparation (the required Python environment can be packaged with conda inside the container ahead of time)

1. Package the Python environment with conda

  1. Install conda
  2. Create the required Python environment with conda
# python=XXX, where XXX is the desired Python version
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack python=XXX
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
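
As an aside, the packed archive is not tied to spark-submit flags: since Spark 3.1 it can also be attached programmatically through the spark.archives configuration. A minimal sketch along the lines of Spark's Python package management docs, assuming pyspark_conda_env.tar.gz sits in the submission directory:

import os
from pyspark.sql import SparkSession

# Executors use the interpreter from the unpacked archive; the name after
# '#' is the directory the archive is unpacked under in each executor's
# working directory, hence the relative path.
os.environ['PYSPARK_PYTHON'] = "./pyspark_conda_env/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",
    "pyspark_conda_env.tar.gz#pyspark_conda_env").getOrCreate()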

2. Test code: app.py

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql import SparkSession


def main(spark):
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"))

    @pandas_udf("double")
    def mean_udf(v: pd.Series) -> float:
        return v.mean()

    print(df.groupby("id").agg(mean_udf(df['v'])).collect())


if __name__ == "__main__":
    main(SparkSession.builder.getOrCreate())
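
For reference, with the sample data above a successful run should print the per-id means, something like:

[Row(id=1, mean_udf(v)=1.5), Row(id=2, mean_udf(v)=6.0)]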

client mode

Run command

In client mode, both PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON must be set.

  1. PYSPARK_PYTHON: the Python path inside the pods after the pyspark_conda_env.tar.gz uploaded via --archives is unpacked (the fragment after # names the directory the archive is unpacked under, which is why the relative path ./pyspark_conda_env/bin/python works)
  2. PYSPARK_DRIVER_PYTHON: the Python path after pyspark_conda_env.tar.gz is unpacked inside the container
  3. In client mode, the Python path specified by PYSPARK_DRIVER_PYTHON is only used on the driver side, so it only needs to exist in the container we just started; it does not need to exist on the other nodes.
# Unpack the archive for the driver (conda-pack tarballs have no top-level directory, so create one first)
mkdir -p pyspark_conda_env && tar -zxvf pyspark_conda_env.tar.gz -C pyspark_conda_env
export PYSPARK_DRIVER_PYTHON=/ppml/trusted-big-data-ml/pyspark_conda_env/bin/python # do not set this in cluster mode
export PYSPARK_PYTHON=./pyspark_conda_env/bin/python
# Submit the Spark job
${SPARK_HOME}/bin/spark-submit \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--deploy-mode client \
--conf spark.driver.host=${LOCAL_HOST} \
--master ${RUNTIME_SPARK_MASTER} \
--conf spark.kubernetes.executor.podTemplateFile=/ppml/trusted-big-data-ml/spark-executor-template.yaml \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.kubernetes.executor.deleteOnTermination=false \
--archives ./pyspark_conda_env.tar.gz#pyspark_conda_env \
local:///ppml/trusted-big-data-ml/app.py

Execution analysis

Roughly speaking, in client mode the driver runs inside the container we just started, so it uses the locally unpacked environment directly, while the pyspark_conda_env.tar.gz passed via --archives is served to the executors from the driver's file server and unpacked into each executor's working directory; spark.kubernetes.file.upload.path is not involved.

cluster mode

Run command

In cluster mode, only PYSPARK_PYTHON needs to be set.

  1. PYSPARK_PYTHON: the Python path inside the pod after pyspark_conda_env.tar.gz is unpacked
  2. The conda-packed pyspark_conda_env.tar.gz is shipped to the driver through the shared file system specified by spark.kubernetes.file.upload.path
export PYSPARK_PYTHON=./pyspark_conda_env/bin/python
# Submit the Spark job
${SPARK_HOME}/bin/spark-submit \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--deploy-mode cluster \
--master ${RUNTIME_SPARK_MASTER} \
--conf spark.kubernetes.executor.podTemplateFile=/ppml/trusted-big-data-ml/spark-executor-template.yaml \
--conf spark.kubernetes.driver.podTemplateFile=/ppml/trusted-big-data-ml/spark-driver-template.yaml \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.kubernetes.executor.deleteOnTermination=false \
--conf spark.kubernetes.file.upload.path=/ppml/trusted-big-data-ml/work/data/shaojie \
--archives ./pyspark_conda_env.tar.gz#pyspark_conda_env \
local:///ppml/trusted-big-data-ml/work/data/shaojie/app.py

Running a job in cluster mode is also a chance to look at what spark.kubernetes.file.upload.path does. Here, --archives takes a comma-separated list of archives (tar, jar, zip and other dependency packages) that will be unpacked into the working directory of each executor.
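
As a quick illustration of those --archives semantics (hypothetical example: a data.tar.gz containing words.txt, submitted with --archives data.tar.gz#data), task code running on an executor can open the unpacked files through the relative directory named after the #:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def first_line(_):
    # '--archives data.tar.gz#data' unpacks the archive into a directory
    # named 'data' inside each executor's working directory
    with open(os.path.join("data", "words.txt")) as f:
        return [f.readline().strip()]

print(spark.sparkContext.parallelize([0], 1).flatMap(first_line).collect())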

Execution analysis

22-07-29 02:05:58 INFO SparkContext:57 - Added archive file:/ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz#pyspark_conda_env at spark://app-py-36a4238247b3d72e-driver-svc.default.svc:7078/files/pyspark_conda_env.tar.gz with timestamp 1659060357319
22-07-29 02:05:58 INFO Utils:57 - Copying /ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz to /tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz
INFO fork chmod is forbidden !!!/tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz
22-07-29 02:05:58 INFO SparkContext:57 - Unpacking an archive file:/ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz#pyspark_conda_env from /tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz to /var/data/spark-3024b9ad-8e4d-4b2a-b51a-aee8f54d5a46/spark-8acb95d2-599d-4e2f-8203-c1f3455c4c7f/userFiles-994a18bf-12cd-4d98-b3e9-1035f741fe67/pyspark_conda_en
  1. SparkContext first adds all archives uploaded under spark.kubernetes.file.upload.path to the driver. (The path specified by spark.kubernetes.file.upload.path must be on a shared file system accessible to all parties: HDFS, NFS, etc.)
  2. The package is copied from the shared path to a local path inside the driver
  3. The archive copied into the driver is then unpacked to the designated location
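
Tying the three steps back to the log above (all paths taken from the log):

# 1. spark-submit uploads the archive into a per-submission subdirectory of the configured upload path
/ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz
# 2. the driver copies the uploaded archive to its local temp directory
/tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz
# 3. the driver unpacks it under its userFiles directory (see the last log line)
/var/data/spark-3024b9ad-8e4d-4b2a-b51a-aee8f54d5a46/spark-8acb95d2-599d-4e2f-8203-c1f3455c4c7f/userFiles-994a18bf-12cd-4d98-b3e9-1035f741fe67/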

questions

  1. When submitting archives in client mode, is spark.kubernetes.file.upload.path really not needed?
  2. Are the archives submitted in cluster mode mainly used to start the driver?
  3. The documentation states that the value of spark.kubernetes.file.upload.path should be a remote store
