Posted to commits@zeppelin.apache.org by mo...@apache.org on 2016/02/24 17:41:22 UTC

incubator-zeppelin git commit: Fix pyspark to work on yarn mode when spark version is lower than or equal to 1.4.x

Repository: incubator-zeppelin
Updated Branches:
  refs/heads/master 617eb947b -> d16ec20fc


Fix pyspark to work on yarn mode when spark version is lower than or equal to 1.4.x

### What is this PR for?
pyspark.zip and py4j-\*.zip must be distributed to the YARN nodes for pyspark to work, but this has not been happening since #463 because the [`if (pythonLibs.length == pythonLibUris.size())`](https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java#L329) condition is never true. This PR fixes the issue by changing the condition to `pythonLibUris.size() == 2`, where 2 is the number of libraries to distribute: pyspark.zip and py4j-\*.zip.

In addition, yarn-install documentation has been updated.
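
A minimal, self-contained sketch of the changed condition follows. It is an illustration only, not the code in `SparkInterpreter.java`: a plain `Map` stands in for `SparkConf`, `String.join` stands in for Guava's `Joiner`, and the class name, the `addPythonLibs` helper, and the file paths are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PySparkDistFilesSketch {
    // pyspark.zip and py4j-*.zip: the two libraries the PR description refers to.
    static final int EXPECTED_PYTHON_LIBS = 2;

    // Hypothetical helper: appends the Python libraries to spark.yarn.dist.files
    // only when both of them were found. The Map is a stand-in for SparkConf.
    static void addPythonLibs(Map<String, String> conf, List<String> pythonLibUris) {
        if (pythonLibUris.size() != EXPECTED_PYTHON_LIBS) {
            return; // nothing to distribute; leave the configuration untouched
        }
        String joined = String.join(",", pythonLibUris);
        String existing = conf.get("spark.yarn.dist.files");
        if (existing == null || existing.isEmpty()) {
            conf.put("spark.yarn.dist.files", joined);
        } else {
            conf.put("spark.yarn.dist.files", existing + "," + joined);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Hypothetical locations; real paths depend on the Spark installation.
        List<String> libs = Arrays.asList(
            "file:/opt/spark/python/lib/pyspark.zip",
            "file:/opt/spark/python/lib/py4j-src.zip");
        addPythonLibs(conf, libs);
        System.out.println(conf.get("spark.yarn.dist.files"));
    }
}
```

Comparing against the fixed count keeps the check independent of how many candidate paths were probed, which appears to be the intent of the change described above.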

### What type of PR is it?
Bug Fix

### Is there a relevant Jira issue?
No, but the issue was reported on the [user mailing list](http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Can-t-get-Pyspark-1-4-1-interpreter-to-work-on-Zeppelin-0-6-td2229.html#a2259) by Ian Maloney.

### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

Author: Mina Lee <mi...@nflabs.com>

Closes #736 from minahlee/fix/pyspark_on_yarn and squashes the following commits:

e588f7b [Mina Lee] Merge branch 'master' of https://github.com/apache/incubator-zeppelin into fix/pyspark_on_yarn
2710c46 [Mina Lee] [DOC] Remove invalid information of installation location
c544dec [Mina Lee] [DOC] Remove redundant Zeppelin build information from yarn_install.md [DOC] Guide users to set SPARK_HOME to use spark in yarn mode [DOC] Change spark version to the latest in yarn config example [DOC] Add note that spark for cdh4 doesn't support yarn [DOC] Remove spark properties `spark.home` and `spark.yarn.jar` from doc which doesn't work on zeppelin anymore [DOC] Fix typos [DOC] Add info that embedded spark doesn't work on yarn mode anymore when Spark version is 1.5.0 or higher in README.md
6465ba8 [Mina Lee] Change  condition to make pyspark, py4j libraries be distributed to yarn executors


Project: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/commit/d16ec20f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/tree/d16ec20f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/diff/d16ec20f

Branch: refs/heads/master
Commit: d16ec20fcf7c69a97bd90b3faac634098dc58214
Parents: 617eb94
Author: Mina Lee <mi...@nflabs.com>
Authored: Tue Feb 23 13:35:19 2016 +0900
Committer: Lee moon soo <mo...@apache.org>
Committed: Wed Feb 24 08:44:50 2016 -0800

----------------------------------------------------------------------
 README.md                                       |   2 +-
 docs/install/install.md                         |  14 +-
 docs/install/yarn_install.md                    | 132 ++++---------------
 .../apache/zeppelin/spark/SparkInterpreter.java |   8 +-
 4 files changed, 40 insertions(+), 116 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index ce5926f..cca45d4 100644
--- a/README.md
+++ b/README.md
@@ -104,7 +104,7 @@ minor version can be adjusted by `-Dhadoop.version=x.x.x`
 ##### -Pyarn (optional)
 
 enable YARN support for local mode
-
+> YARN for local mode is not supported for Spark v1.5.0 or higher. Set SPARK_HOME instead.
 
 ##### -Ppyspark (optional)
 

http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/docs/install/install.md
----------------------------------------------------------------------
diff --git a/docs/install/install.md b/docs/install/install.md
index 38752f5..b86c5bb 100644
--- a/docs/install/install.md
+++ b/docs/install/install.md
@@ -22,9 +22,9 @@ limitations under the License.
 
 
 ## Zeppelin Installation
-Welcome to your first trial to explore Zeppelin ! 
+Welcome to your first trial to explore Zeppelin!
 
-In this documentation, we will explain how you can install Zeppelin from **Binary Package** or build from **Source** by yourself. Plus, you can see all of Zeppelin's configurations in the **Zeppelin Configuration** section below.
+In this documentation, we will explain how you can install Zeppelin from **Binary Package** or build from **Source** by yourself. Plus, you can see all of Zeppelin's configurations in the [Zeppelin Configuration](install.html#zeppelin-configuration) section below.
 
 ### Install with Binary Package
 
@@ -32,9 +32,17 @@ If you want to install Zeppelin with latest binary package, please visit [this p
 
 ### Build from Zeppelin Source
 
-You can also build Zeppelin from the source. Please check instructions in `README.md` in [Zeppelin github](https://github.com/apache/incubator-zeppelin/blob/master/README.md). 
+You can also build Zeppelin from the source.
 
+#### Prerequisites for build
+ * Java 1.7
+ * Git
+ * Maven(3.1.x or higher)
+ * Node.js Package Manager
 
+If you don't have requirements prepared, please check instructions in [README.md](https://github.com/apache/incubator-zeppelin/blob/master/README.md) for the details.
+
+<a name="zeppelin-configuration"> </a>
 ## Zeppelin Configuration
 
 You can configure Zeppelin with both **environment variables** in `conf/zeppelin-env.sh` and **java properties** in `conf/zeppelin-site.xml`. If both are defined, then the **environment variables** will be used priorly.

http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/docs/install/yarn_install.md
----------------------------------------------------------------------
diff --git a/docs/install/yarn_install.md b/docs/install/yarn_install.md
index 723291f..dd86467 100644
--- a/docs/install/yarn_install.md
+++ b/docs/install/yarn_install.md
@@ -20,7 +20,7 @@ limitations under the License.
 {% include JB/setup %}
 
 ## Introduction
-This page describes how to pre-configure a bare metal node, build & configure Zeppelin on it, configure Zeppelin and connect it to existing YARN cluster running Hortonworks flavour of Hadoop. It also describes steps to configure Spark & Hive interpreter of Zeppelin. 
+This page describes how to pre-configure a bare metal node, configure Zeppelin and connect it to existing YARN cluster running Hortonworks flavour of Hadoop. It also describes steps to configure Spark & Hive interpreter of Zeppelin.
 
 ## Prepare Node
 
@@ -44,84 +44,16 @@ Its assumed in the rest of the document that zeppelin user is indeed created and
 
 ### List of Prerequisites
 
- * CentOS 6.x
- * Git
- * Java 1.7 
- * Apache Maven
- * Hadoop client.
- * Spark.
+ * CentOS 6.x, Mac OSX, Ubuntu 14.X
+ * Java 1.7
+ * Hadoop client
+ * Spark
  * Internet connection is required. 
 
-Its assumed that the node has CentOS 6.x installed on it. Although any version of Linux distribution should work fine. The working directory of all prerequisite pacakges is /home/zeppelin/prerequisites, although any location could be used.
-
-#### Git
-Intall latest stable version of Git. This document describes installation of version 2.4.8
-
-```bash
-yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel
-yum install  gcc perl-ExtUtils-MakeMaker
-yum remove git
-cd /home/zeppelin/prerequisites
-wget https://github.com/git/git/archive/v2.4.8.tar.gz
-tar xzf git-2.0.4.tar.gz
-cd git-2.0.4
-make prefix=/home/zeppelin/prerequisites/git all
-make prefix=/home/zeppelin/prerequisites/git install
-echo "export PATH=$PATH:/home/zeppelin/prerequisites/bin" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-git --version
-```
-
-Assuming all the packages are successfully installed, running the version option with git command should display
-
-```bash
-git version 2.4.8
-```
-
-#### Java
-Zeppelin works well with 1.7.x version of Java runtime. Download JDK version 7 and a stable update and follow below instructions to install it.
-
-```bash
-cd /home/zeppelin/prerequisites/
-#Download JDK 1.7, Assume JDK 7 update 79 is downloaded.
-tar -xf jdk-7u79-linux-x64.tar.gz
-echo "export JAVA_HOME=/home/zeppelin/prerequisites/jdk1.7.0_79" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-echo $JAVA_HOME
-```
-Assuming all the packages are successfully installed, echoing JAVA_HOME environment variable should display
-
-```bash
-/home/zeppelin/prerequisites/jdk1.7.0_79
-```
-
-#### Apache Maven
-Download and install a stable version of Maven.
-
-```bash
-cd /home/zeppelin/prerequisites/
-wget ftp://mirror.reverse.net/pub/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
-tar -xf apache-maven-3.3.3-bin.tar.gz 
-cd apache-maven-3.3.3
-export MAVEN_HOME=/home/zeppelin/prerequisites/apache-maven-3.3.3
-echo "export PATH=$PATH:/home/zeppelin/prerequisites/apache-maven-3.3.3/bin" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-mvn -version
-```
-
-Assuming all the packages are successfully installed, running the version option with mvn command should display
-
-```bash
-Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T04:57:37-07:00)
-Maven home: /home/zeppelin/prerequisites/apache-maven-3.3.3
-Java version: 1.7.0_79, vendor: Oracle Corporation
-Java home: /home/zeppelin/prerequisites/jdk1.7.0_79/jre
-Default locale: en_US, platform encoding: UTF-8
-OS name: "linux", version: "2.6.32-358.el6.x86_64", arch: "amd64", family: "unix"
-```
+It's assumed that the node has CentOS 6.x installed on it. Although any version of Linux distribution should work fine.
 
 #### Hadoop client
-Zeppelin can work with multiple versions & distributions of Hadoop. A complete list [is available here.](https://github.com/apache/incubator-zeppelin#build) This document assumes Hadoop 2.7.x client libraries including configuration files are installed on Zeppelin node. It also assumes /etc/hadoop/conf contains various Hadoop configuration files. The location of Hadoop configuration files may vary, hence use appropriate location.
+Zeppelin can work with multiple versions & distributions of Hadoop. A complete list is available [here](https://github.com/apache/incubator-zeppelin#build). This document assumes Hadoop 2.7.x client libraries including configuration files are installed on Zeppelin node. It also assumes /etc/hadoop/conf contains various Hadoop configuration files. The location of Hadoop configuration files may vary, hence use appropriate location.
 
 ```bash
 hadoop version
@@ -134,32 +66,21 @@ This command was run using /usr/hdp/2.3.1.0-2574/hadoop/lib/hadoop-common-2.7.1.
 ```
 
 #### Spark
-Zeppelin can work with multiple versions Spark. A complete list [is available here.](https://github.com/apache/incubator-zeppelin#build) This document assumes Spark 1.3.1 is installed on Zeppelin node at /home/zeppelin/prerequisites/spark.
-
-## Build
+Spark is supported out of the box and to take advantage of this, you need to Download appropriate version of Spark binary packages from [Spark Download page](http://spark.apache.org/downloads.html) and unzip it.
+Zeppelin can work with multiple versions of Spark. A complete list is available [here](https://github.com/apache/incubator-zeppelin#build).
+This document assumes Spark 1.6.0 is installed at /usr/lib/spark.
+> Note: Spark should be installed on the same node as Zeppelin.
 
-Checkout source code from [git://git.apache.org/incubator-zeppelin.git](git://git.apache.org/incubator-zeppelin.git).
+> Note: Spark's pre-built package for CDH 4 doesn't support yarn.
 
-```bash
-cd /home/zeppelin/
-git clone git://git.apache.org/incubator-zeppelin.git
-```
-Zeppelin package is available at `/home/zeppelin/incubator-zeppelin` after the checkout completes.
-
-### Cluster mode
+#### Zeppelin
 
-As its assumed Hadoop 2.7.x is installed on the YARN cluster & Spark 1.3.1 is installed on Zeppelin node. Hence appropriate options are chosen to build Zeppelin. This is very important as Zeppelin will bundle corresponding Hadoop & Spark libraries and they must match the ones present on YARN cluster & Zeppelin Spark installation. 
-
-Zeppelin is a maven project and hence must be built with Apache Maven.
-
-```bash
-cd /home/zeppelin/incubator-zeppelin
-mvn clean package -Pspark-1.3 -Dspark.version=1.3.1 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests
-```
-Building Zeppelin for first time downloads various dependencies and hence takes few minutes to complete. 
+Checkout source code from [git://git.apache.org/incubator-zeppelin.git](https://github.com/apache/incubator-zeppelin.git) or download binary package from [Download page](https://zeppelin.incubator.apache.org/download.html).
+You can refer [Install](install.html) page for the details.
+This document assumes that Zeppelin is located under `/home/zeppelin/incubator-zeppelin`.
 
 ## Zeppelin Configuration
-Zeppelin configurations needs to be modified to connect to YARN cluster. Create a copy of zeppelin environment XML
+Zeppelin configuration needs to be modified to connect to YARN cluster. Create a copy of zeppelin environment shell script.
 
 ```bash
 cp /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh.template /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh 
@@ -168,9 +89,10 @@ cp /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh.template /home/zeppeli
 Set the following properties
 
 ```bash
-export JAVA_HOME=/home/zeppelin/prerequisites/jdk1.7.0_79
-export HADOOP_CONF_DIR=/etc/hadoop/conf
+export JAVA_HOME="/usr/java/jdk1.7.0_79"
+export HADOOP_CONF_DIR="/etc/hadoop/conf"
 export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.1.0-2574"
+export SPARK_HOME="/usr/lib/spark"
 ```
 
 As /etc/hadoop/conf contains various configurations of YARN cluster, Zeppelin can now submit Spark/Hive jobs on YARN cluster form its web interface. The value of hdp.version is set to 2.3.1.0-2574. This can be obtained by running the following command
@@ -196,7 +118,7 @@ bin/zeppelin-daemon.sh stop
 ```
 
 ## Interpreter
-Zeppelin provides to various distributed processing frameworks to process data that ranges from Spark, Hive, Tajo, Ignite and Lens to name a few. This document describes to configure Hive & Spark interpreters.
+Zeppelin provides various distributed processing frameworks to process data that ranges from Spark, Hive, Tajo, Ignite and Lens to name a few. This document describes to configure Hive & Spark interpreters.
 
 ### Hive
 Zeppelin supports Hive interpreter and hence copy hive-site.xml that should be present at /etc/hive/conf to the configuration folder of Zeppelin. Once Zeppelin is built it will have conf folder under /home/zeppelin/incubator-zeppelin.
@@ -209,7 +131,7 @@ Once Zeppelin server has started successfully, visit http://[zeppelin-server-hos
 Click on Save button. Once these configurations are updated, Zeppelin will prompt you to restart the interpreter. Accept the prompt and the interpreter will reload the configurations.
 
 ### Spark
-Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spark is installed at /home/zeppelin/prerequisites/spark. Look for Spark configrations and click edit button to add the following properties
+It was assumed that 1.6.0 version of Spark is installed at /usr/lib/spark. Look for Spark configurations and click edit button to add the following properties
 
 <table class="table-configuration">
   <tr>
@@ -223,11 +145,6 @@ Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spa
     <td>In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.</td>
   </tr>
   <tr>
-    <td>spark.home</td>
-    <td>/home/zeppelin/prerequisites/spark</td>
-    <td></td>
-  </tr>
-  <tr>
     <td>spark.driver.extraJavaOptions</td>
     <td>-Dhdp.version=2.3.1.0-2574</td>
     <td></td>
@@ -237,11 +154,6 @@ Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spa
     <td>-Dhdp.version=2.3.1.0-2574</td>
     <td></td>
   </tr>
-  <tr>
-    <td>spark.yarn.jar</td>
-    <td>/home/zeppelin/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar</td>
-    <td></td>
-  </tr>
 </table>
 
 Click on Save button. Once these configurations are updated, Zeppelin will prompt you to restart the interpreter. Accept the prompt and the interpreter will reload the configurations.

http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
----------------------------------------------------------------------
diff --git a/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java b/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
index a905fb7..1923186 100644
--- a/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
+++ b/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
@@ -326,7 +326,10 @@ public class SparkInterpreter extends Interpreter {
       }
     }
     pythonLibUris.trimToSize();
-    if (pythonLibs.length == pythonLibUris.size()) {
+
+    // Distribute two libraries(pyspark.zip and py4j-*.zip) to workers
+    // when spark version is less than or equal to 1.4.1
+    if (pythonLibUris.size() == 2) {
       try {
         String confValue = conf.get("spark.yarn.dist.files");
         conf.set("spark.yarn.dist.files", confValue + "," + Joiner.on(",").join(pythonLibUris));
@@ -339,7 +342,8 @@ public class SparkInterpreter extends Interpreter {
       conf.set("spark.submit.pyArchives", Joiner.on(":").join(pythonLibs));
     }
 
-    // Distributes needed libraries to workers.
+    // Distributes needed libraries to workers
+    // when spark version is greater than or equal to 1.5.0
     if (getProperty("master").equals("yarn-client")) {
       conf.set("spark.yarn.isPython", "true");
     }