Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/06/01 02:59:13 UTC

[jira] [Commented] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH

    [ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309149#comment-15309149 ] 

liyunzhang_intel commented on PIG-4903:
---------------------------------------

[~sriksun]:
After investigating the Spark code, let me explain why the current code excludes spark-yarn* explicitly.

The current code in bin/pig:
{code}
################# ADDING SPARK DEPENDENCIES ##################
# Spark typically works with a single assembly file. However this
# assembly isn't available as a artifact to pull in via ivy.
# To work around this short coming, we add all the jars barring
# spark-yarn to DIST through dist-files and then add them to classpath
# of the executors through an independent env variable. The reason
# for excluding spark-yarn is because spark-yarn is already being added
# by the spark-yarn-client via jarOf(Client.Class)
for f in $PIG_HOME/lib/*.jar; do
    if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
        # Exclude spark-assembly.jar from shipped jars, but retain in classpath
        SPARK_JARS=${SPARK_JARS}:$f;
    else
        SPARK_JARS=${SPARK_JARS}:$f;
        SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
    fi
done
CLASSPATH=${CLASSPATH}:${SPARK_JARS}

export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
export SPARK_JARS=${SPARK_YARN_DIST_FILES}
export SPARK_DIST_CLASSPATH

{code}

In this code, we do the following (all the dependency jars are under $PIG_HOME/lib/ and $PIG_HOME/lib/spark/):

* Step 1: add all dependency jars to the classpath of Pig.
* Step 2: add all dependency jars (excluding spark-yarn*.jar) to SPARK_YARN_DIST_FILES (we ship all these jars to the distributed cache).
* Step 3: add all dependency jars (excluding spark-yarn*.jar) to SPARK_DIST_CLASSPATH.


Step 2 and Step 3 upload the dependency jars to HDFS (via the distributed cache) and list them in SPARK_DIST_CLASSPATH, so that these jars are later included in the classpath of the YARN containers.
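For illustration, if $PIG_HOME were /opt/pig and the lib directory held, say, pig-core-h2.jar and spark-network-shuffle_2.10-1.6.1.jar (these names are just examples, not taken from the build), the loop above would leave the variables looking roughly like this:
{code}
# Illustrative values only (jar names and paths are assumptions):
SPARK_JARS=:/opt/pig/lib/pig-core-h2.jar:/opt/pig/lib/spark-network-shuffle_2.10-1.6.1.jar
# after the trailing sed strips the leading comma:
SPARK_YARN_DIST_FILES=file:///opt/pig/lib/pig-core-h2.jar,file:///opt/pig/lib/spark-network-shuffle_2.10-1.6.1.jar
# ${PWD} is kept literal (it is escaped in the loop) so it expands later in the container's working directory:
SPARK_DIST_CLASSPATH=:${PWD}/pig-core-h2.jar:${PWD}/spark-network-shuffle_2.10-1.6.1.jar
{code}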

*Why do we need to exclude spark-yarn*.jar?*
In [org.apache.spark.deploy.yarn.Client#prepareLocalResources|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L440],
Spark copies resources such as dependency jars to the distributed cache. [sparkJar(conf: SparkConf)|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1136] returns the jar that contains yarn.Client.class:

{code}
private def sparkJar(conf: SparkConf): String = {
    if (conf.contains(CONF_SPARK_JAR)) {
      conf.get(CONF_SPARK_JAR)
    } else if (System.getenv(ENV_SPARK_JAR) != null) {
      logWarning(
        s"$ENV_SPARK_JAR detected in the system environment. This variable has been deprecated " +
          s"in favor of the $CONF_SPARK_JAR configuration variable.")
      System.getenv(ENV_SPARK_JAR)
    } else {
      SparkContext.jarOfClass(this.getClass).getOrElse(throw new SparkException("Could not "
        + "find jar containing Spark classes. The jar can be defined using the "
        + "spark.yarn.jar configuration option. If testing Spark, either set that option or "
        + "make sure SPARK_PREPEND_CLASSES is not set."))
    }
  }

{code}
Here in Pig on Spark, SparkContext.jarOfClass(org.apache.spark.deploy.yarn.Client) resolves to spark-yarn*.jar when we put the separate Spark dependency jars on the Pig classpath. We therefore need to exclude spark-yarn*.jar from SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH; otherwise spark-yarn*.jar is uploaded twice, and the duplicated jar causes a problem ([SPARK-1921|https://issues.apache.org/jira/browse/SPARK-1921]).
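As a quick sanity check, one can scan the jars on the Pig classpath for the yarn Client class to see which jar jarOfClass would resolve to (a rough sketch; the directory layout is assumed from the description above):
{code}
# Rough check (directory layout assumed): find the jar that contains
# org.apache.spark.deploy.yarn.Client -- i.e. the jar that
# SparkContext.jarOfClass(Client.class) would return.
for j in $PIG_HOME/lib/*.jar $PIG_HOME/lib/spark/*.jar; do
    if unzip -l "$j" 2>/dev/null | grep -q 'org/apache/spark/deploy/yarn/Client.class'; then
        echo "yarn.Client found in: $j"
    fi
done
{code}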

We can improve this code as follows:

* Require end users to specify SPARK_HOME, locate spark-assembly*.jar there, and append it to the Pig classpath.
* SparkContext.jarOfClass(this.getClass) then returns spark-assembly*.jar, because spark-assembly*.jar is on the Pig classpath. As a result, spark-assembly*.jar is copied to the distributed cache, and we no longer need to add the separate Spark dependency jars such as spark-core*.jar and spark-yarn*.jar (the jars under $PIG_HOME/lib/spark) to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH.
{code}
################# ADDING SPARK DEPENDENCIES ##################
# Please specify SPARK_HOME first so that we can locate $SPARK_HOME/lib/spark-assembly*.jar,
# we will add spark-assembly*.jar to the classpath
if [ -n "$SPARK_HOME" ]; then
    echo "Using Spark Home: " ${SPARK_HOME}
    SPARK_JARS=`ls ${SPARK_HOME}/lib/spark-assembly*`
fi

for f in $PIG_HOME/lib/*.jar; do
        SPARK_JARS=${SPARK_JARS}:$f;
        SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done

CLASSPATH=${CLASSPATH}:${SPARK_JARS}

export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
export SPARK_JARS=${SPARK_YARN_DIST_FILES}
export SPARK_DIST_CLASSPATH
################# ADDING SPARK DEPENDENCIES ##################
{code}

I have tested this successfully in both yarn-client and local modes.
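A test run would look roughly like the following (the variable names, paths, and master value are assumptions about the local setup; adjust as needed):
{code}
# Sketch of a test invocation (paths and master value are assumptions):
export SPARK_HOME=/opt/spark          # lets bin/pig locate $SPARK_HOME/lib/spark-assembly*.jar
export SPARK_MASTER=yarn-client       # or "local" for local mode
$PIG_HOME/bin/pig -x spark -f myscript.pig
{code}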


> Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>
> There are some comments about bin/pig on https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as a artifact to pull in via ivy.
> # To work around this short coming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all the Spark dependency jars, such as spark-network-shuffle_2.10-1.6.1.jar, to the distributed cache (SPARK_YARN_DIST_FILES) and then add them to the classpath of the executors (SPARK_DIST_CLASSPATH). Actually we need not copy all these dependency jars to SPARK_DIST_CLASSPATH, because they are all included in spark-assembly.jar and spark-assembly.jar is uploaded with the Spark job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)