Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/06/01 03:19:13 UTC

[jira] [Updated] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH

     [ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4903:
----------------------------------
    Attachment: PIG-4903.patch

[~sriksun], [~xuefuz], [~mohitsabharwal], [~pallavi.rao] and [~kexianda]:
Rohini suggested in the [review board|https://reviews.apache.org/r/45667/#comment198955] that we use the spark-assembly jar rather than separate Spark dependency jars (under PIG_HOME/lib/spark/):
{quote}
    This is not a good idea. If I remember correctly, spark-assembly.jar is 128MB+. If you are copying all the individual jars that it is made up of to distcache for every job, it will suffer bad performance as copy to hdfs and localization by NM will be very costly.

    Like Tez you can have users copy the assembly jar to hdfs and specify the hdfs location. This will ensure there is only one copy in hdfs and localization is done only once per node by node manager.
{quote}
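
For context, the Tez-like flow described above would look roughly like the following on the user's side; the HDFS path and jar version here are illustrative, and in Spark 1.x the assembly location would be handed to the spark-yarn client via the spark.yarn.jar property:
{code}
# One-time setup per cluster (illustrative path and version):
hadoop fs -mkdir -p /apps/spark
hadoop fs -put $SPARK_HOME/lib/spark-assembly-1.6.1-hadoop2.6.0.jar /apps/spark/

# Per job, point the spark-yarn client at the HDFS copy so the
# NodeManager localizes it once per node rather than once per job:
# spark.yarn.jar=hdfs:///apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar
{code}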

So there are two options:
1. Use separate Spark dependency jars (the current approach).
2. Require the user to specify SPARK_HOME so that we can locate spark-assembly*.jar (the approach in PIG-4903.patch); see the sketch below.
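
A minimal sketch of what option 2 could look like in bin/pig, assuming SPARK_HOME points at a standard Spark 1.x distribution where the assembly jar lives under lib/ (the variable name SPARK_ASSEMBLY_JAR is illustrative, not necessarily what the attached patch uses):
{code}
# Sketch of option 2 (not the actual patch): fail fast if SPARK_HOME is
# unset, then pick up the assembly jar from its standard location.
if [ -z "$SPARK_HOME" ]; then
    echo "Error: SPARK_HOME must be set so Pig can locate spark-assembly*.jar" >&2
    exit 1
fi
SPARK_ASSEMBLY_JAR=`ls $SPARK_HOME/lib/spark-assembly*.jar 2>/dev/null | head -1`
if [ -z "$SPARK_ASSEMBLY_JAR" ]; then
    echo "Error: no spark-assembly*.jar found under $SPARK_HOME/lib" >&2
    exit 1
fi
# Only the assembly jar goes on the classpath; the individual Spark
# dependency jars no longer need to be shipped through the distcache.
CLASSPATH=${CLASSPATH}:${SPARK_ASSEMBLY_JAR}
{code}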

Please give me your suggestions.


> Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>         Attachments: PIG-4903.patch
>
>
> There are some comments about bin/pig on https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all the Spark dependency jars (e.g. spark-network-shuffle_2.10-1.6.1.jar) to the distributed cache (SPARK_YARN_DIST_FILES) and then add them to the executor classpath (SPARK_DIST_CLASSPATH). Actually we need not ship all these dependency jars, because they are already included in spark-assembly.jar, and spark-assembly.jar is uploaded with the Spark job.
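
For illustration, a minimal sketch of how the loop above could drop the redundant copies, assuming every jar covered by the assembly matches a spark-* pattern under PIG_HOME/lib (the pattern is illustrative, not the attached patch):
{code}
for f in $PIG_HOME/lib/*.jar; do
    # Skip shipping individual spark-* dependency jars to the cluster:
    # their classes are already inside spark-assembly.jar, which is
    # uploaded with the Spark job anyway.
    if [[ $f == $PIG_HOME/lib/spark-* ]]; then
        CLASSPATH=${CLASSPATH}:$f    # still visible on the client side
        continue
    fi
    SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
{code}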



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)