Posted to users@zeppelin.apache.org by Patrik Iselind <Pa...@axis.com> on 2020/10/07 14:56:03 UTC
Using separate SPARK_HOME in Zeppelin
Hi,
I'm trying to build a Docker image for Zeppelin from which I'll be able to use a standalone Spark cluster. For this, I understand that I need to include a Spark installation in the image and point the SPARK_HOME environment variable at it. I think I've done this correctly, but it doesn't seem to work. I hope someone on this list can spot what I'm missing.
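To show what I mean by "done correctly", here is roughly how I verify the variable from outside (the container name "zeppelin" is just what I call it here; paths follow the Dockerfiles below):

```shell
# Hypothetical sanity check: source the same env file Zeppelin sources,
# then confirm SPARK_HOME is set and that the bundled Spark actually runs.
docker exec zeppelin bash -c \
  'source /opt/zeppelin/conf/zeppelin-env.sh && echo "$SPARK_HOME" && "$SPARK_HOME/bin/spark-submit" --version'
```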
I have a base image for Zeppelin:
```Dockerfile for zeppelin:alpine
FROM alpine:3.8
ARG DIST_MIRROR=http://archive.apache.org/dist/zeppelin
ARG VERSION=0.8.2
ENV ZEPPELIN_HOME=/opt/zeppelin \
JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk \
PATH=$PATH:/usr/lib/jvm/java-1.8-openjdk/jre/bin:/usr/lib/jvm/java-1.8-openjdk/bin
RUN apk add --no-cache bash curl jq openjdk8 py3-pip && \
ln -s /usr/bin/python3 /usr/bin/python && \
mkdir -p ${ZEPPELIN_HOME} && \
curl ${DIST_MIRROR}/zeppelin-${VERSION}/zeppelin-${VERSION}-bin-all.tgz | tar xvz -C ${ZEPPELIN_HOME} && \
mv ${ZEPPELIN_HOME}/zeppelin-${VERSION}-bin-all/* ${ZEPPELIN_HOME} && \
rm -rf ${ZEPPELIN_HOME}/zeppelin-${VERSION}-bin-all && \
rm -rf *.tgz
EXPOSE 8080
VOLUME ${ZEPPELIN_HOME}/logs \
${ZEPPELIN_HOME}/notebook
WORKDIR ${ZEPPELIN_HOME}
CMD ./bin/zeppelin.sh run
```
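For completeness, assuming the file above is saved as Dockerfile.alpine (a filename I'm choosing here), I build and tag the base image like this; the tag must match the FROM line of the second Dockerfile:

```shell
# Build the Zeppelin base image (tag matches "FROM zeppelin:alpine" below).
docker build -t zeppelin:alpine -f Dockerfile.alpine .
```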
On top of this base image, I add Spark 3.0.1, copied from the same Bitnami image that my Spark cluster is using.
```Dockerfile for zeppelin:latest
FROM docker.io/bitnami/spark:3.0.1-debian-10-r32 AS sparkimage
FROM zeppelin:alpine
COPY --from=sparkimage /opt/bitnami/spark /opt/spark
RUN cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh && \
echo "export SPARK_HOME=/opt/spark" >> conf/zeppelin-env.sh && \
echo "export PYTHONPATH=\$SPARK_HOME/python/" >> conf/zeppelin-env.sh && \
echo "export PYTHONPATH=\$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:\$PYTHONPATH" >> conf/zeppelin-env.sh && \
echo "export PYSPARK_PYTHON=python3" >> conf/zeppelin-env.sh && \
echo "export PYSPARK_DRIVER_PYTHON=python3" >> conf/zeppelin-env.sh
RUN cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml
# Since 0.8.2, the Zeppelin server binds to 127.0.0.1 by default instead of
# 0.0.0.0. Set the zeppelin.server.addr property or the ZEPPELIN_ADDR env
# variable to change this.
ENV ZEPPELIN_ADDR="0.0.0.0"
```
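The two PYTHONPATH entries appended above work because Python can import modules directly from a zip archive placed on sys.path, which is how Spark's bundled python/lib/py4j-*-src.zip is made importable. A minimal, self-contained sketch of that mechanism (with a synthetic zip standing in for the real py4j archive):

```python
# Demonstrate zip imports: Python's zipimport machinery lets sys.path entries
# point at .zip files, exactly like adding Spark's py4j src zip to PYTHONPATH.
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
zpath = os.path.join(tmp, "py4j-0.10.9-src.zip")  # name mimics Spark's bundled zip
with zipfile.ZipFile(zpath, "w") as z:
    z.writestr("demo_mod.py", "VERSION = '0.10.9'\n")  # fake module inside the zip

sys.path.insert(0, zpath)  # equivalent to prepending the zip to PYTHONPATH
import demo_mod            # imported straight from inside the archive

print(demo_mod.VERSION)    # -> 0.10.9
```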
Now I start zeppelin:latest and make no changes to the interpreters at all; none are needed to reproduce my issue. Later, once starting a pyspark interpreter works, I intend to set spark.master to spark://spark-master:7077.
Open a new notebook.
```example
%python
import pyspark
print(pyspark.version.__version__)
```
prints
```output
3.0.1
```
This is exactly what I expect.
Now comes the troublesome part.
```example
%pyspark
print(sc)
```
prints
```output
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at org.apache.zeppelin.spark.BaseSparkScalaInterpreter.getUserJars(BaseSparkScalaInterpreter.scala:382)
at org.apache.zeppelin.spark.SparkScala211Interpreter.open(SparkScala211Interpreter.scala:71)
at org.apache.zeppelin.spark.NewSparkInterpreter.open(NewSparkInterpreter.java:102)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:62)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.spark.PySparkInterpreter.getSparkInterpreter(PySparkInterpreter.java:664)
at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:260)
at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:194)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:616)
at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
What am I missing to get the %pyspark interpreter to work?
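One detail I noticed while debugging: the stack trace comes from SparkScala211Interpreter, while the Spark 3.0.1 distribution is built against Scala 2.12, so a binary mismatch would be one explanation for a NoSuchMethodError on scala.Predef. Here is roughly how I compare the two Scala versions inside the container (paths follow the Dockerfiles above; the exact jar layout under the Zeppelin install is an assumption on my part):

```shell
# Scala version the copied Spark distribution was built with
# (Spark 3.0.x reports "Using Scala version 2.12.x" here).
/opt/spark/bin/spark-submit --version

# Inspect the jars Zeppelin 0.8.2 ships for its spark interpreter,
# to see which Scala line they target.
ls /opt/zeppelin/interpreter/spark/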
If anything is unclear, don't hesitate to ask more.
===========
Patrik Iselind, IDD