Posted to user@hadoop.apache.org by Bhushan Pathak <bh...@gmail.com> on 2017/06/13 11:51:49 UTC

Install Spark on a Hadoop cluster

Hello,

I have a 3-node Hadoop cluster - one master & 2 slaves. I want to integrate
Spark with this Hadoop setup so that Spark uses YARN for job scheduling &
execution.

Hadoop version : 2.7.3
Spark version : 2.1.0

I have read various documentation & blog posts, and my understanding so far
is that -
1. I need to install Spark [download the tar.gz file & extract it] on all 3
nodes
2. On the master node, update spark-env.sh as follows -
SPARK_DAEMON_JAVA_OPTS=-Dspark.driver.port=53411
HADOOP_CONF_DIR=/home/hadoop/hadoop-2.7.3/etc/hadoop
SPARK_MASTER_HOST=192.168.10.44

3. On the master node, update the slaves file to list the IPs of the 2 slave
nodes [a sketch of this file is after this list]
4. On the master node, update spark-defaults.conf as follows -
spark.master                    spark://192.168.10.44:7077
spark.serializer                org.apache.spark.serializer.KryoSerializer

5. Repeat steps 2 - 4 on the slave nodes as well
6. HDFS & YARN services are already running [the commands I use to verify
this are after this list]
7. Directly use spark-submit to submit jobs to YARN, command to use is -
$ ./spark-submit --master yarn --deploy-mode cluster \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=file:///tmp/spark-events \
    --class org.sparkexample.WordCountTask \
    /home/hadoop/first-example-1.0-SNAPSHOT.jar a /user/hadoop
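
For step 3, this is roughly what I plan to put in conf/slaves on the master
node [the two IPs below are placeholders, not my actual slave IPs] -
192.168.10.45
192.168.10.46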
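
For step 6, I verify that HDFS & YARN are up with commands like these [jps on
every node, the other two on the master] -
$ jps                    # NameNode/ResourceManager on master, DataNode/NodeManager on slaves
$ hdfs dfsadmin -report  # reports the live datanodes
$ yarn node -list        # lists the registered NodeManagers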
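
Once the step 7 command is submitted, I plan to track the job with the yarn
CLI [<application id> below is whatever ID the ResourceManager assigns] -
$ yarn application -list                     # shows the app, its state & final status
$ yarn logs -applicationId <application id>  # aggregated container logs once the app finishes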

Please let me know whether this is correct. Am I missing something?

Thanks
Bhushan Pathak