Posted to users@zeppelin.apache.org by Gianluigi Barbarano <gi...@gmail.com> on 2015/08/21 12:35:28 UTC
Use Zeppelin in CDH cluster, using Yarn
Hi all,
I want to use Zeppelin in my CDH cluster to run Spark and PySpark code through YARN.
My environment is:
CDH 5.4.2 (spark and yarn installed)
Zeppelin 0.5.0
I've built Zeppelin in this way:
"mvn clean package -Pspark-1.3 -Ppyspark -Dhadoop.version=2.6.0-cdh5.4.2 -Phadoop-2.6 -Pyarn -DskipTests"
and installed it on a node of my CDH cluster.
I've set the following environment variables:
export ZEPPELIN_HOME=/opt/incubator-zeppelin
export ZEPPELIN_PORT=7979
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HIVE_CONF_DIR="/etc/hive/conf"
export HIVECLASSPATH=$(find /opt/cloudera/parcels/CDH/lib/hive/lib/ -name '*.jar' -print0 | sed 's/\x0/:/g')
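As an aside, the find/-print0/sed pipeline above just joins every jar under the Hive lib directory into one colon-separated classpath string. The same transformation as a small Python sketch, in case anyone wants to sanity-check what ends up in HIVECLASSPATH (the build_classpath helper name is mine, not part of Zeppelin or CDH):

```python
import os

def build_classpath(root):
    """Join every *.jar found under `root` (recursively) into a
    colon-separated classpath string, like the find | sed pipeline."""
    jars = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.endswith(".jar"):
                jars.append(os.path.join(dirpath, name))
    return ":".join(jars)
```

Running it against /opt/cloudera/parcels/CDH/lib/hive/lib/ should print the same set of jars the shell pipeline produces (ordering aside).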
My zeppelin-env.sh is:
export MASTER=yarn
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=512m -Dspark.cores.max=1"
export HADOOP_CONF_DIR="/etc/hadoop/conf.cloudera.yarn"
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
Spark-Jar:
/opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
When I execute a simple PySpark snippet in my notebook:
sc.textFile("/user/example/test.txt").count()
I see the new application in my YARN console and Zeppelin reports it as RUNNING, but nothing happens: I never receive a result and I can't execute any other code. There are no errors in my YARN logs.
In Zeppelin log:
INFO [2015-08-21 12:15:13,740] ({pool-1-thread-2} SchedulerFactory.java[jobStarted]:132) - Job paragraph_1440152108501_1888952717 started by scheduler remoteinterpreter_695055854
INFO [2015-08-21 12:15:13,745] ({pool-1-thread-2} Paragraph.java[jobRun]:189) - run paragraph 20150821-121508_181152709 using null org.apache.zeppelin.interpreter.LazyOpenInterpreter@91f60ce
INFO [2015-08-21 12:15:13,776] ({pool-1-thread-2} RemoteInterpreterProcess.java[reference]:108) - Run interpreter process /opt/incubator-zeppelin/bin/interpreter.sh -d /opt/incubator-zeppelin/interpreter/spark$
INFO [2015-08-21 12:15:15,346] ({pool-1-thread-2} RemoteInterpreter.java[init]:144) - Create remote interpreter org.apache.zeppelin.spark.PySparkInterpreter
INFO [2015-08-21 12:15:15,416] ({pool-1-thread-2} RemoteInterpreter.java[init]:144) - Create remote interpreter org.apache.zeppelin.spark.SparkInterpreter
INFO [2015-08-21 12:15:15,423] ({pool-1-thread-2} RemoteInterpreter.java[init]:144) - Create remote interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
INFO [2015-08-21 12:15:15,425] ({pool-1-thread-2} RemoteInterpreter.java[init]:144) - Create remote interpreter org.apache.zeppelin.spark.DepInterpreter
INFO [2015-08-21 12:15:15,440] ({pool-1-thread-2} Paragraph.java[jobRun]:206) - RUN : sc.textFile("/user/admin/1.txt").count()
INFO [2015-08-21 12:15:25,146] ({qtp1379078592-48} NotebookServer.java[onMessage]:112) - RECEIVE << PING
INFO [2015-08-21 12:15:58,657] ({Thread-33} NotebookServer.java[broadcast]:264) - SEND >> NOTE
INFO [2015-08-21 12:15:58,800] ({Thread-34} NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
INFO [2015-08-21 12:15:59,314] ({Thread-34} NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
INFO [2015-08-21 12:15:59,829] ({Thread-34} NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
INFO [2015-08-21 12:16:00,344] ({Thread-34} NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
INFO [2015-08-21 12:16:00,857] ({Thread-34} NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
INFO [2015-08-21 12:16:01,370] ({Thread-34} NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
INFO [2015-08-21 12:16:01,883] ({Thread-34} NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
INFO [2015-08-21 12:16:02,397] ({Thread-34} NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
If I execute similar code with %spark, I see the application running in the YARN console, but after a few seconds Zeppelin gives me this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, cdhlva03.gcio.unicredit.eu): ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
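For context on the ExecutorLostFailure above: on YARN this very often means the NodeManager killed the container for exceeding its memory allocation, because YARN accounts for the JVM heap plus an off-heap overhead on top of spark.executor.memory. A rough sketch of the Spark 1.x sizing rule (the 384 MB floor is from the Spark docs; the overhead factor varied across 1.x releases, so treat both defaults here as assumptions):

```python
def yarn_container_mb(executor_mem_mb, overhead_factor=0.07, floor_mb=384):
    """Approximate total YARN container size requested per Spark executor:
    JVM heap (spark.executor.memory) plus off-heap overhead
    (spark.yarn.executor.memoryOverhead, defaulting to
    max(floor, factor * heap) in Spark 1.x)."""
    overhead = max(floor_mb, int(executor_mem_mb * overhead_factor))
    return executor_mem_mb + overhead
```

With the 512 MB executors configured above, this asks YARN for roughly 896 MB per container; if the executor's real footprint grows past that allocation, the NodeManager kills the container and the driver reports it exactly like this, as a lost executor.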
Do you have any ideas?
Thanks a lot in advance
Re: Use Zeppelin in CDH cluster, using Yarn
Posted by moon soo Lee <mo...@apache.org>.
Hi,
I'm not sure it'll help, but can you try adding the following (change the path to your actual location)?

export SPARK_YARN_USER_ENV="PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip"
Thanks,
moon
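For what it's worth, SPARK_YARN_USER_ENV in Spark 1.x is a comma-separated list of KEY=VALUE pairs that the YARN client copies into each container's environment, which is why it can push PYTHONPATH out to the PySpark executors. Conceptually it is parsed roughly like this sketch (parse_user_env is an illustrative name, not Spark's actual code):

```python
def parse_user_env(value):
    """Split a comma-separated KEY=VALUE list into a dict.
    Split on the first '=' only, so values may themselves
    contain '=' or ':' (as a PYTHONPATH value does)."""
    env = {}
    for pair in value.split(","):
        if "=" in pair:
            key, val = pair.split("=", 1)
            env[key.strip()] = val.strip()
    return env
```

Note the pairs are comma-separated, so a value containing a literal comma would not survive this format; colon-separated paths like PYTHONPATH are fine.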
On Mon, Aug 24, 2015 at 7:40 AM Gianluigi Barbarano <
gianluigi.barbarano@gmail.com> wrote:
Fwd: Use Zeppelin in CDH cluster, using Yarn
Posted by Gianluigi Barbarano <gi...@gmail.com>.
Hi,
could some of you help me?
thanks a lot!
---------- Forwarded message ----------
From: Gianluigi Barbarano <gi...@gmail.com>
Date: 2015-08-21 12:35 GMT+02:00
Subject: Use Zeppelin in CDH cluster, using Yarn
To: users@zeppelin.incubator.apache.org