Posted to commits@zeppelin.apache.org by mo...@apache.org on 2016/02/24 17:41:22 UTC
incubator-zeppelin git commit: Fix pyspark to work on yarn mode when spark version is lower than or equal to 1.4.x
Repository: incubator-zeppelin
Updated Branches:
refs/heads/master 617eb947b -> d16ec20fc
Fix pyspark to work on yarn mode when spark version is lower than or equal to 1.4.x
### What is this PR for?
pyspark.zip and py4j-\*.zip should be distributed to yarn nodes for pyspark to function, but this hasn't been working since #463 because the [`if (pythonLibs.length == pythonLibUris.size())`](https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java#L329) condition will never be true. This PR fixes the issue by changing the condition to `pythonLibUris.size() == 2`, where the integer 2 refers to pyspark.zip and py4j-\*.zip.
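As a hedged illustration (not the actual `SparkInterpreter` source; class and method names here are hypothetical), the fixed condition distributes the archives exactly when both expected zips were resolved to URIs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fixed check: distribute the python libraries to
// YARN executors only when both archives (pyspark.zip and py4j-*.zip) were
// actually resolved to URIs under SPARK_HOME.
public class PysparkDistCheck {

    // Fixed condition from this PR: compare against the two expected archives,
    // not against the length of a separate local array (which, per the PR
    // description, never matched the resolved URI list after #463).
    static boolean shouldDistribute(List<String> pythonLibUris) {
        return pythonLibUris.size() == 2;
    }

    public static void main(String[] args) {
        List<String> uris = new ArrayList<>();
        uris.add("file:/usr/lib/spark/python/lib/pyspark.zip");
        uris.add("file:/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip");
        System.out.println("distribute=" + shouldDistribute(uris));
    }
}
```

When the condition holds, the URIs are appended to `spark.yarn.dist.files` (comma-joined), as the diff below shows.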
In addition, yarn-install documentation has been updated.
### What type of PR is it?
Bug Fix
### Is there a relevant Jira issue?
No, but the issue was reported on the [user mailing list](http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Can-t-get-Pyspark-1-4-1-interpreter-to-work-on-Zeppelin-0-6-td2229.html#a2259) by Ian Maloney
### Questions:
* Do the license files need updating? No
* Are there breaking changes for older versions? No
* Does this need documentation? No
Author: Mina Lee <mi...@nflabs.com>
Closes #736 from minahlee/fix/pyspark_on_yarn and squashes the following commits:
e588f7b [Mina Lee] Merge branch 'master' of https://github.com/apache/incubator-zeppelin into fix/pyspark_on_yarn
2710c46 [Mina Lee] [DOC] Remove invalid information of installation location
c544dec [Mina Lee] [DOC] Remove redundant Zeppelin build information from yarn_install.md [DOC] Guide users to set SPARK_HOME to use spark in yarn mode [DOC] Change spark version to the latest in yarn config example [DOC] Add note that spark for cdh4 doesn't support yarn [DOC] Remove spark properties `spark.home` and `spark.yarn.jar` from doc which doesn't work on zeppelin anymore [DOC] Fix typos [DOC] Add info that embedded spark doesn't work on yarn mode anymore when Spark version is 1.5.0 or higher in README.md
6465ba8 [Mina Lee] Change condition to make pyspark, py4j libraries be distributed to yarn executors
Project: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/commit/d16ec20f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/tree/d16ec20f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/diff/d16ec20f
Branch: refs/heads/master
Commit: d16ec20fcf7c69a97bd90b3faac634098dc58214
Parents: 617eb94
Author: Mina Lee <mi...@nflabs.com>
Authored: Tue Feb 23 13:35:19 2016 +0900
Committer: Lee moon soo <mo...@apache.org>
Committed: Wed Feb 24 08:44:50 2016 -0800
----------------------------------------------------------------------
README.md | 2 +-
docs/install/install.md | 14 +-
docs/install/yarn_install.md | 132 ++++---------------
.../apache/zeppelin/spark/SparkInterpreter.java | 8 +-
4 files changed, 40 insertions(+), 116 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index ce5926f..cca45d4 100644
--- a/README.md
+++ b/README.md
@@ -104,7 +104,7 @@ minor version can be adjusted by `-Dhadoop.version=x.x.x`
##### -Pyarn (optional)
enable YARN support for local mode
-
+> YARN for local mode is not supported for Spark v1.5.0 or higher. Set SPARK_HOME instead.
##### -Ppyspark (optional)
http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/docs/install/install.md
----------------------------------------------------------------------
diff --git a/docs/install/install.md b/docs/install/install.md
index 38752f5..b86c5bb 100644
--- a/docs/install/install.md
+++ b/docs/install/install.md
@@ -22,9 +22,9 @@ limitations under the License.
## Zeppelin Installation
-Welcome to your first trial to explore Zeppelin !
+Welcome to your first trial to explore Zeppelin!
-In this documentation, we will explain how you can install Zeppelin from **Binary Package** or build from **Source** by yourself. Plus, you can see all of Zeppelin's configurations in the **Zeppelin Configuration** section below.
+In this documentation, we will explain how you can install Zeppelin from **Binary Package** or build from **Source** by yourself. Plus, you can see all of Zeppelin's configurations in the [Zeppelin Configuration](install.html#zeppelin-configuration) section below.
### Install with Binary Package
@@ -32,9 +32,17 @@ If you want to install Zeppelin with latest binary package, please visit [this p
### Build from Zeppelin Source
-You can also build Zeppelin from the source. Please check instructions in `README.md` in [Zeppelin github](https://github.com/apache/incubator-zeppelin/blob/master/README.md).
+You can also build Zeppelin from the source.
+#### Prerequisites for build
+ * Java 1.7
+ * Git
+ * Maven(3.1.x or higher)
+ * Node.js Package Manager
+If you don't have requirements prepared, please check instructions in [README.md](https://github.com/apache/incubator-zeppelin/blob/master/README.md) for the details.
+
+<a name="zeppelin-configuration"> </a>
## Zeppelin Configuration
You can configure Zeppelin with both **environment variables** in `conf/zeppelin-env.sh` and **java properties** in `conf/zeppelin-site.xml`. If both are defined, then the **environment variables** will take precedence.
http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/docs/install/yarn_install.md
----------------------------------------------------------------------
diff --git a/docs/install/yarn_install.md b/docs/install/yarn_install.md
index 723291f..dd86467 100644
--- a/docs/install/yarn_install.md
+++ b/docs/install/yarn_install.md
@@ -20,7 +20,7 @@ limitations under the License.
{% include JB/setup %}
## Introduction
-This page describes how to pre-configure a bare metal node, build & configure Zeppelin on it, configure Zeppelin and connect it to existing YARN cluster running Hortonworks flavour of Hadoop. It also describes steps to configure Spark & Hive interpreter of Zeppelin.
+This page describes how to pre-configure a bare metal node, configure Zeppelin and connect it to existing YARN cluster running Hortonworks flavour of Hadoop. It also describes steps to configure Spark & Hive interpreter of Zeppelin.
## Prepare Node
@@ -44,84 +44,16 @@ Its assumed in the rest of the document that zeppelin user is indeed created and
### List of Prerequisites
- * CentOS 6.x
- * Git
- * Java 1.7
- * Apache Maven
- * Hadoop client.
- * Spark.
+ * CentOS 6.x, Mac OSX, Ubuntu 14.X
+ * Java 1.7
+ * Hadoop client
+ * Spark
* Internet connection is required.
-Its assumed that the node has CentOS 6.x installed on it. Although any version of Linux distribution should work fine. The working directory of all prerequisite pacakges is /home/zeppelin/prerequisites, although any location could be used.
-
-#### Git
-Intall latest stable version of Git. This document describes installation of version 2.4.8
-
-```bash
-yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel
-yum install gcc perl-ExtUtils-MakeMaker
-yum remove git
-cd /home/zeppelin/prerequisites
-wget https://github.com/git/git/archive/v2.4.8.tar.gz
-tar xzf git-2.0.4.tar.gz
-cd git-2.0.4
-make prefix=/home/zeppelin/prerequisites/git all
-make prefix=/home/zeppelin/prerequisites/git install
-echo "export PATH=$PATH:/home/zeppelin/prerequisites/bin" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-git --version
-```
-
-Assuming all the packages are successfully installed, running the version option with git command should display
-
-```bash
-git version 2.4.8
-```
-
-#### Java
-Zeppelin works well with 1.7.x version of Java runtime. Download JDK version 7 and a stable update and follow below instructions to install it.
-
-```bash
-cd /home/zeppelin/prerequisites/
-#Download JDK 1.7, Assume JDK 7 update 79 is downloaded.
-tar -xf jdk-7u79-linux-x64.tar.gz
-echo "export JAVA_HOME=/home/zeppelin/prerequisites/jdk1.7.0_79" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-echo $JAVA_HOME
-```
-Assuming all the packages are successfully installed, echoing JAVA_HOME environment variable should display
-
-```bash
-/home/zeppelin/prerequisites/jdk1.7.0_79
-```
-
-#### Apache Maven
-Download and install a stable version of Maven.
-
-```bash
-cd /home/zeppelin/prerequisites/
-wget ftp://mirror.reverse.net/pub/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
-tar -xf apache-maven-3.3.3-bin.tar.gz
-cd apache-maven-3.3.3
-export MAVEN_HOME=/home/zeppelin/prerequisites/apache-maven-3.3.3
-echo "export PATH=$PATH:/home/zeppelin/prerequisites/apache-maven-3.3.3/bin" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-mvn -version
-```
-
-Assuming all the packages are successfully installed, running the version option with mvn command should display
-
-```bash
-Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T04:57:37-07:00)
-Maven home: /home/zeppelin/prerequisites/apache-maven-3.3.3
-Java version: 1.7.0_79, vendor: Oracle Corporation
-Java home: /home/zeppelin/prerequisites/jdk1.7.0_79/jre
-Default locale: en_US, platform encoding: UTF-8
-OS name: "linux", version: "2.6.32-358.el6.x86_64", arch: "amd64", family: "unix"
-```
+It's assumed that the node has CentOS 6.x installed on it, although any Linux distribution should work fine.
#### Hadoop client
-Zeppelin can work with multiple versions & distributions of Hadoop. A complete list [is available here.](https://github.com/apache/incubator-zeppelin#build) This document assumes Hadoop 2.7.x client libraries including configuration files are installed on Zeppelin node. It also assumes /etc/hadoop/conf contains various Hadoop configuration files. The location of Hadoop configuration files may vary, hence use appropriate location.
+Zeppelin can work with multiple versions & distributions of Hadoop. A complete list is available [here](https://github.com/apache/incubator-zeppelin#build). This document assumes Hadoop 2.7.x client libraries including configuration files are installed on Zeppelin node. It also assumes /etc/hadoop/conf contains various Hadoop configuration files. The location of Hadoop configuration files may vary, hence use appropriate location.
```bash
hadoop version
@@ -134,32 +66,21 @@ This command was run using /usr/hdp/2.3.1.0-2574/hadoop/lib/hadoop-common-2.7.1.
```
#### Spark
-Zeppelin can work with multiple versions Spark. A complete list [is available here.](https://github.com/apache/incubator-zeppelin#build) This document assumes Spark 1.3.1 is installed on Zeppelin node at /home/zeppelin/prerequisites/spark.
-
-## Build
+Spark is supported out of the box. To take advantage of this, you need to download the appropriate version of the Spark binary package from the [Spark Download page](http://spark.apache.org/downloads.html) and unzip it.
+Zeppelin can work with multiple versions of Spark. A complete list is available [here](https://github.com/apache/incubator-zeppelin#build).
+This document assumes Spark 1.6.0 is installed at /usr/lib/spark.
+> Note: Spark should be installed on the same node as Zeppelin.
-Checkout source code from [git://git.apache.org/incubator-zeppelin.git](git://git.apache.org/incubator-zeppelin.git).
+> Note: Spark's pre-built package for CDH 4 doesn't support yarn.
-```bash
-cd /home/zeppelin/
-git clone git://git.apache.org/incubator-zeppelin.git
-```
-Zeppelin package is available at `/home/zeppelin/incubator-zeppelin` after the checkout completes.
-
-### Cluster mode
+#### Zeppelin
-As its assumed Hadoop 2.7.x is installed on the YARN cluster & Spark 1.3.1 is installed on Zeppelin node. Hence appropriate options are chosen to build Zeppelin. This is very important as Zeppelin will bundle corresponding Hadoop & Spark libraries and they must match the ones present on YARN cluster & Zeppelin Spark installation.
-
-Zeppelin is a maven project and hence must be built with Apache Maven.
-
-```bash
-cd /home/zeppelin/incubator-zeppelin
-mvn clean package -Pspark-1.3 -Dspark.version=1.3.1 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests
-```
-Building Zeppelin for first time downloads various dependencies and hence takes few minutes to complete.
+Check out the source code from [git://git.apache.org/incubator-zeppelin.git](https://github.com/apache/incubator-zeppelin.git) or download a binary package from the [Download page](https://zeppelin.incubator.apache.org/download.html).
+You can refer to the [Install](install.html) page for details.
+This document assumes that Zeppelin is located under `/home/zeppelin/incubator-zeppelin`.
## Zeppelin Configuration
-Zeppelin configurations needs to be modified to connect to YARN cluster. Create a copy of zeppelin environment XML
+Zeppelin configuration needs to be modified to connect to YARN cluster. Create a copy of zeppelin environment shell script.
```bash
cp /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh.template /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh
@@ -168,9 +89,10 @@ cp /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh.template /home/zeppeli
Set the following properties
```bash
-export JAVA_HOME=/home/zeppelin/prerequisites/jdk1.7.0_79
-export HADOOP_CONF_DIR=/etc/hadoop/conf
+export JAVA_HOME="/usr/java/jdk1.7.0_79"
+export HADOOP_CONF_DIR="/etc/hadoop/conf"
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.1.0-2574"
+export SPARK_HOME="/usr/lib/spark"
```
As /etc/hadoop/conf contains various configurations of the YARN cluster, Zeppelin can now submit Spark/Hive jobs on the YARN cluster from its web interface. The value of hdp.version is set to 2.3.1.0-2574. This can be obtained by running the following command
@@ -196,7 +118,7 @@ bin/zeppelin-daemon.sh stop
```
## Interpreter
-Zeppelin provides to various distributed processing frameworks to process data that ranges from Spark, Hive, Tajo, Ignite and Lens to name a few. This document describes to configure Hive & Spark interpreters.
+Zeppelin provides various distributed processing frameworks to process data, ranging from Spark, Hive, Tajo, Ignite and Lens to name a few. This document describes how to configure the Hive & Spark interpreters.
### Hive
Zeppelin supports Hive interpreter and hence copy hive-site.xml that should be present at /etc/hive/conf to the configuration folder of Zeppelin. Once Zeppelin is built it will have conf folder under /home/zeppelin/incubator-zeppelin.
@@ -209,7 +131,7 @@ Once Zeppelin server has started successfully, visit http://[zeppelin-server-hos
Click on Save button. Once these configurations are updated, Zeppelin will prompt you to restart the interpreter. Accept the prompt and the interpreter will reload the configurations.
### Spark
-Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spark is installed at /home/zeppelin/prerequisites/spark. Look for Spark configrations and click edit button to add the following properties
+It is assumed that Spark 1.6.0 is installed at /usr/lib/spark. Look for the Spark configurations and click the edit button to add the following properties
<table class="table-configuration">
<tr>
@@ -223,11 +145,6 @@ Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spa
<td>In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.</td>
</tr>
<tr>
- <td>spark.home</td>
- <td>/home/zeppelin/prerequisites/spark</td>
- <td></td>
- </tr>
- <tr>
<td>spark.driver.extraJavaOptions</td>
<td>-Dhdp.version=2.3.1.0-2574</td>
<td></td>
@@ -237,11 +154,6 @@ Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spa
<td>-Dhdp.version=2.3.1.0-2574</td>
<td></td>
</tr>
- <tr>
- <td>spark.yarn.jar</td>
- <td>/home/zeppelin/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar</td>
- <td></td>
- </tr>
</table>
Click on Save button. Once these configurations are updated, Zeppelin will prompt you to restart the interpreter. Accept the prompt and the interpreter will reload the configurations.
http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
----------------------------------------------------------------------
diff --git a/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java b/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
index a905fb7..1923186 100644
--- a/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
+++ b/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
@@ -326,7 +326,10 @@ public class SparkInterpreter extends Interpreter {
}
}
pythonLibUris.trimToSize();
- if (pythonLibs.length == pythonLibUris.size()) {
+
+ // Distribute two libraries(pyspark.zip and py4j-*.zip) to workers
+ // when spark version is less than or equal to 1.4.1
+ if (pythonLibUris.size() == 2) {
try {
String confValue = conf.get("spark.yarn.dist.files");
conf.set("spark.yarn.dist.files", confValue + "," + Joiner.on(",").join(pythonLibUris));
@@ -339,7 +342,8 @@ public class SparkInterpreter extends Interpreter {
conf.set("spark.submit.pyArchives", Joiner.on(":").join(pythonLibs));
}
- // Distributes needed libraries to workers.
+ // Distributes needed libraries to workers
+ // when spark version is greater than or equal to 1.5.0
if (getProperty("master").equals("yarn-client")) {
conf.set("spark.yarn.isPython", "true");
}