Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/05/24 03:45:12 UTC

[jira] [Commented] (PIG-4667) Enable Pig on Spark to run on Yarn Client mode

    [ https://issues.apache.org/jira/browse/PIG-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297618#comment-15297618 ] 

liyunzhang_intel commented on PIG-4667:
---------------------------------------

[~sriksun]: The community is now reviewing Pig on Spark; below is part of the [feedback|https://reviews.apache.org/r/45667/#review134255] from the community
about the following code in bin/pig:
{code}
################# ADDING SPARK DEPENDENCIES ##################
# Spark typically works with a single assembly file. However, this
# assembly isn't available as an artifact to pull in via ivy.
# To work around this shortcoming, we add all the jars barring
# spark-yarn to DIST through dist-files and then add them to the
# classpath of the executors through an independent env variable.
# The reason for excluding spark-yarn is that spark-yarn is already
# added by the spark-yarn-client via jarOf(Client.Class)

for f in $PIG_HOME/lib/spark/*.jar; do
    if [[ $f == $PIG_HOME/lib/spark/spark-yarn* ]]; then
        # Exclude spark-yarn.jar from shipped jars, but retain in classpath
        SPARK_JARS=${SPARK_JARS}:$f;
    else
        SPARK_JARS=${SPARK_JARS}:$f;
        SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
    fi
done

for f in $PIG_HOME/lib/*.jar; do
    SPARK_JARS=${SPARK_JARS}:$f;
    SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
CLASSPATH=${CLASSPATH}:${SPARK_JARS}

#SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},$PIG_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
#SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:$PIG_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
export SPARK_JARS=${SPARK_YARN_DIST_FILES}
export SPARK_DIST_CLASSPATH
################# ADDING SPARK DEPENDENCIES ##################
{code}
Rohini left the following comment:
{quote}
This is not a good idea. If I remember correctly, spark-assembly.jar is 128MB+. If you are copying all the individual jars that it is made up of to distcache for every job, it will suffer bad performance as copy to hdfs and localization by NM will be very costly.
Like Tez you can have users copy the assembly jar to hdfs and specify the hdfs location. This will ensure there is only one copy in hdfs and localization is done only once per node by node manager.
{quote}
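
If I understand the suggestion correctly, Spark 1.x already exposes a knob for this: spark.yarn.jar can point at an hdfs: location of the assembly, and YARN then localizes it once per node instead of Pig shipping jars for every job. A minimal sketch of what that could look like in bin/pig is below; the SPARK_ASSEMBLY_JAR variable and passing the property through PIG_OPTS are assumptions for illustration, not an agreed interface:
{code}
# One-time setup by the user (example path, not a convention):
#   hadoop fs -mkdir -p /apps/spark
#   hadoop fs -put spark-assembly-1.6.0-hadoop2.6.0.jar /apps/spark/

# Sketch in bin/pig: point Spark at the single hdfs copy instead of
# shipping every jar through the distributed cache for every job.
if [ -n "$SPARK_ASSEMBLY_JAR" ]; then
    # spark.yarn.jar is a Spark 1.x property; whether a -D passed via
    # PIG_OPTS reaches SparkConf here is an assumption.
    PIG_OPTS="$PIG_OPTS -Dspark.yarn.jar=$SPARK_ASSEMBLY_JAR"
fi
{code}
The user would then run, e.g., export SPARK_ASSEMBLY_JAR=hdfs:///apps/spark/spark-assembly-1.6.0-hadoop2.6.0.jar before invoking pig.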

Can we replace all the jars in $PIG_HOME/lib/spark/ with spark-assembly.jar, if we let end users copy spark-assembly.jar to $PIG_HOME/lib/ instead of downloading all the dependency jars from ivy?
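
For example, the loop over $PIG_HOME/lib/spark/*.jar above could then be replaced by something like the following (a rough sketch, assuming the user drops a single spark-assembly-*.jar into $PIG_HOME/lib/):
{code}
# Sketch: ship one user-provided assembly instead of the
# individual ivy-resolved jars under $PIG_HOME/lib/spark/.
for f in $PIG_HOME/lib/spark-assembly-*.jar; do
    SPARK_JARS=${SPARK_JARS}:$f
    SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
CLASSPATH=${CLASSPATH}:${SPARK_JARS}
{code}
Note this still uploads the 128MB+ assembly to the distributed cache for every job, so to fully address Rohini's concern it would need to be combined with the hdfs-hosted copy sketched above.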




> Enable Pig on Spark to run on Yarn Client mode
> ----------------------------------------------
>
>                 Key: PIG-4667
>                 URL: https://issues.apache.org/jira/browse/PIG-4667
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Srikanth Sundarrajan
>            Assignee: Srikanth Sundarrajan
>             Fix For: spark-branch
>
>         Attachments: PIG-4667-logs.tgz, PIG-4667-v1.patch, PIG-4667.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)