Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/05/27 02:09:13 UTC

[jira] [Comment Edited] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH

    [ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303337#comment-15303337 ] 

liyunzhang_intel edited comment on PIG-4903 at 5/27/16 2:08 AM:
----------------------------------------------------------------

[~sriksun]: thanks for your reply. Here is my understanding of the code you provided:
1. SPARK_JARS includes all the dependency jars under $PIG_HOME/lib/ and $PIG_HOME/lib/spark/; we need to add those jars to the classpath of pig.
2. SPARK_YARN_DIST_FILES includes all the dependency jars that need to be shipped.
3. SPARK_DIST_CLASSPATH includes all the dependency jars that the executors will later need in spark on yarn mode.
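
For reference, the tail of the bin/pig snippet quoted at the bottom of this issue consumes these three variables roughly like this (a condensed sketch of those lines, not the exact script):

{code}
# pig's own JVM sees everything, including spark-assembly.jar
CLASSPATH=${CLASSPATH}:${SPARK_JARS}
# files to ship to the YARN distributed cache (strip the leading comma)
export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
# extra classpath entries for the executors in spark on yarn mode
export SPARK_DIST_CLASSPATH
{code}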


In the code you provided above, I don't understand one point:
1. Why do we need to exclude spark-yarn.jar from the shipped jars? Can you explain this in more detail? I'm currently reading the spark code to understand it.

I found that we only need to ship the jars under $PIG_HOME/lib/ and add spark-assembly.jar to the classpath of pig to make it run successfully:
  
{code}
if [ -n "$SPARK_HOME" ]; then
    echo "Using Spark Home: " ${SPARK_HOME}
    # keep spark-assembly.jar on pig's local classpath only; it is not shipped here
    SPARK_JARS=`ls ${SPARK_HOME}/lib/spark-assembly*`
fi

# ship only the jars directly under $PIG_HOME/lib/ (nothing from $PIG_HOME/lib/spark/)
for f in $PIG_HOME/lib/*.jar; do
        SPARK_JARS=${SPARK_JARS}:$f;
        SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
{code}
It is strange that spark-assembly.jar is uploaded automatically with this code, while in PIG-4667 only spark-yarn.jar is uploaded. If spark-assembly.jar is uploaded automatically, we do not need to ship the jars under $PIG_HOME/lib/spark/.
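
A quick way to sanity-check whether the jars under $PIG_HOME/lib/spark/ are really redundant (a rough sketch; the paths and the use of unzip are my own assumptions, adjust to the local layout):

{code}
# for each dependency jar under $PIG_HOME/lib/spark/, look up one of its class
# entries inside spark-assembly.jar; if it is found there, that jar need not be shipped
ASSEMBLY=`ls ${SPARK_HOME}/lib/spark-assembly*.jar`
for f in ${PIG_HOME}/lib/spark/*.jar; do
    # take the first .class entry of the dependency jar as a probe
    cls=`unzip -l "$f" | awk '/\.class$/ {print $NF; exit}'`
    if [ -n "$cls" ] && unzip -l "$ASSEMBLY" | grep -qF "$cls"; then
        echo "`basename $f`: contained in spark-assembly.jar"
    else
        echo "`basename $f`: NOT found in spark-assembly.jar"
    fi
done
{code}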


was (Author: kellyzly):
[~sriksun]: thanks for your reply. Here is my understanding of the code you provided:
1. SPARK_JARS includes all the dependency jars under $PIG_HOME/lib/ and $PIG_HOME/lib/spark/; we need to add those jars to the classpath of pig.
2. SPARK_YARN_DIST_FILES includes all the dependency jars that need to be shipped.
3. SPARK_DIST_CLASSPATH includes all the dependency jars that the executors will later need in spark on yarn mode.


In the code you provided above, I don't understand the following 2 points:
1. Why do we need to exclude spark-yarn.jar from the shipped jars? Can you explain this in more detail? I'm currently reading the spark code to understand it.
2. I found that we only need to ship the jars under $PIG_HOME/lib/ and add spark-assembly.jar to the classpath of pig to make it run successfully:
  
{code}
if [ -n "$SPARK_HOME" ]; then
    echo "Using Spark Home: " ${SPARK_HOME}
    SPARK_JARS=`ls ${SPARK_HOME}/lib/spark-assembly*`
fi

for f in $PIG_HOME/lib/*.jar; do
        SPARK_JARS=${SPARK_JARS}:$f;
        SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
{code}
It is strange that spark-assembly.jar is uploaded automatically with this code, while in PIG-4667 only spark-yarn.jar is uploaded. If spark-assembly.jar is uploaded automatically, we do not need to ship the jars under $PIG_HOME/lib/spark/.

> Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>
> There are some comments about bin/pig on https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as a artifact to pull in via ivy.
> # To work around this short coming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy every spark dependency jar, such as spark-network-shuffle_2.10-1.6.1.jar, to the distributed cache (SPARK_YARN_DIST_FILES) and then add them to the classpath of the executors (SPARK_DIST_CLASSPATH). Actually we need not copy all these dependency jars to SPARK_DIST_CLASSPATH, because all of them are already included in spark-assembly.jar, and spark-assembly.jar is uploaded with the spark job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)