
I recently spent some time experimenting with Spark on K8s and hit a few problems along the way; this post summarizes them.

Environment

Hadoop cluster: deployed on physical machines

Spark: running on K8s

Requirement

The goal: use Spark to read LZO files stored on a remote Hadoop cluster.
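To make the goal concrete, here is a minimal sketch of the target workload, run from the spark-shell. The namenode address, file path, and jar location are illustrative assumptions, not from the original post; LzoTextInputFormat comes from the hadoop-lzo project used throughout this post.

spark-shell --jars hadoop-lzo-0.4.21-SNAPSHOT.jar <<'EOF'
// read an .lzo file with the hadoop-lzo input format
val lzo = sc.newAPIHadoopFile(
  "hdfs://namenode:8020/data/input.lzo",            // assumed remote HDFS path
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])
lzo.map(_._2.toString).take(10).foreach(println)
EOF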

Problem

Building a container from the Dockerfile that Spark officially provides works fine, but the data in my test environment is LZO-compressed, so Spark fails with a native-library error as soon as it reads the data:

19/09/20 06:02:46 WARN LzoCompressor: java.lang.UnsatisfiedLinkError: Cannot load liblzo2.so.2 (liblzo2.so.2: cannot open shared object file: No such file or directory)!
19/09/20 06:02:46 ERROR LzoCodec: Failed to load/initialize native-lzo library
19/09/20 06:02:46 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: native-lzo library not available
    at com.hadoop.compression.lzo.LzopCodec.createDecompressor(LzopCodec.java:104)
    at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:89)
    at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:104)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:168)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)

Solution

Spark has supported running on Kubernetes since version 2.3.0. The Dockerfile that ships with the Spark source uses openjdk:8-alpine as its base image. Because the image is Alpine-based and the Dockerfile installs none of the native libraries needed to read LZO, the job fails when it reads LZO files.

The fix:

1. Install the lzo dependency in Alpine:

apk add lzo --no-cache
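After the install you can verify that the shared object named in the error message is now present (a quick sanity check; /usr/lib is assumed to be where Alpine's lzo package puts it):

ls -l /usr/lib/liblzo2.so.2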

2. Recompile the hadoop-lzo native library inside the Alpine container.

Download the hadoop-lzo source from https://github.com/twitter/hadoop-lzo, copy it into the Alpine container, and run:

mvn clean package -Dmaven.test.skip=true

If this errors out, install the build environment first:

# the original post's repository URLs were lost to image-link mangling by the
# hosting site; substitute an Alpine mirror of your choice below
echo "<alpine-mirror-url>" > /etc/apk/repositories
echo "<alpine-mirror-url>" >> /etc/apk/repositories
echo "<alpine-mirror-url>" >> /etc/apk/repositories
apk update --no-cache
apk add gcc --no-cache
apk add g++ --no-cache
apk add lzo --no-cache
apk add lzo-dev --no-cache
apk add make --no-cache

Then recompile. Once the build succeeds, look under target for the native output!
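As a rough guide, the artifacts usually land in the following places; the platform segment of the native path (shown here as Linux-amd64-64) is an assumption that varies with OS and architecture:

# the jar produced by the build
ls target/hadoop-lzo-0.4.21-SNAPSHOT.jar
# the native libgplcompression libraries; the platform directory name varies
ls target/native/Linux-amd64-64/lib/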

Built against Alpine v3.9, the build produces:

the hadoop-lzo.jar dependency (hadoop-lzo-0.4.21-SNAPSHOT.jar)

the lzo native library files (the libgplcompression.* files)

Copy these files out of the container first; they are needed shortly when submitting the job.
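One way to copy them out is docker cp; the container name (lzo-build) and the in-container paths below are placeholders for wherever you ran the build:

docker cp lzo-build:/hadoop-lzo/target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./glplib/
docker cp lzo-build:/hadoop-lzo/target/native/Linux-amd64-64/lib/. ./glplib/gplnative/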

3. Point the job at the lzo native libraries and the hadoop-lzo.jar dependency at submit time.

Below is the modified Dockerfile. It needs hadoop-lzo-0.4.21-SNAPSHOT.jar and the hadoop-lzo native libraries, laid out like this:

.
├── gplnative
│   ├── libgplcompression.a
│   ├── libgplcompression.la
│   ├── libgplcompression.so
│   ├── libgplcompression.so.0
│   └── libgplcompression.so.0.0.0
└── hadoop-lzo-0.4.21-SNAPSHOT.jar

The Dockerfile becomes:

FROM openjdk:8-alpine

ARG spark_jars=jars
ARG gpl_libs=glplib
ARG img_path=kubernetes/dockerfiles

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

# NOTE: the three repository URLs in the original post were lost to image-link
# mangling by the hosting site; substitute an Alpine mirror of your choice
RUN set -ex && \
    echo "<alpine-mirror-url>" > /etc/apk/repositories && \
    echo "<alpine-mirror-url>" >> /etc/apk/repositories && \
    echo "<alpine-mirror-url>" >> /etc/apk/repositories && \
    apk upgrade --no-cache && \
    apk add --no-cache bash tini libc6-compat && \
    apk add --no-cache lzo && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd

COPY ${spark_jars} /opt/spark/jars
COPY ${gpl_libs}/hadoop-lzo-0.4.21-SNAPSHOT.jar /opt/spark/jars
COPY ${gpl_libs}/gplnative /opt/gplnative
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY conf /opt/spark/conf
COPY ${img_path}/spark/entrypoint.sh /opt/
COPY examples /opt/spark/examples
COPY data /opt/spark/data

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir

ENTRYPOINT [ "/opt/entrypoint.sh" ]
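With the jar and native libraries staged as above, build the image from the top-level directory of the Spark distribution, just as the Dockerfile's own comment describes (the tag spark-lzo:latest is an arbitrary choice):

docker build -t spark-lzo:latest -f kubernetes/dockerfiles/spark/Dockerfile .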

Then add these parameters when submitting the job:

--conf spark.executor.extraLibraryPath=/opt/gplnative \
--conf spark.driver.extraLibraryPath=/opt/gplnative \
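For context, a complete submit command might look like this sketch; the API server URL, application name, executor count, image tag, and application jar are assumptions, not from the original post:

# placeholders: kube-apiserver address, image tag, and the application jar
bin/spark-submit \
  --master k8s://https://kube-apiserver:6443 \
  --deploy-mode cluster \
  --name lzo-read-test \
  --conf spark.kubernetes.container.image=spark-lzo:latest \
  --conf spark.executor.instances=2 \
  --conf spark.executor.extraLibraryPath=/opt/gplnative \
  --conf spark.driver.extraLibraryPath=/opt/gplnative \
  local:///opt/spark/examples/jars/my-lzo-app.jar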

At this point, Spark on K8s can read LZO files!
