Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/07/06 02:14:11 UTC
[jira] [Commented] (PIG-4797) Optimization for join/group case for spark mode

    [ https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363626#comment-15363626 ] 

liyunzhang_intel commented on PIG-4797:
---------------------------------------

To explain in more detail why we cannot put JoinGroupOptimizerSpark before CombinerOptimizer:
The CombinerOptimizer collapses POGlobalRearrange and POPackage into POReduceBy, as follows:
{code:title=origin}
     foreach (using algebraicOp)
       -> packager
          -> globalRearrange
              -> localRearrange
{code}
{code:title=After CombinerOptimization}
     foreach (using algebraicOp.Final)
       -> reduceBy (uses algebraicOp.Intermediate)
          -> localRearrange
             -> foreach (using algebraicOp.Initial)
                 -> CombinerRearrange
{code}
The JoinGroupOptimizerSpark collapses POLocalRearrange, POGlobalRearrange, and POPackage into POJoinGroupSpark:
{code:title=origin}
     foreach
       -> packager
          -> globalRearrange
              -> localRearrange
                  -> load
{code}
{code:title=JoinGroupOptimization}
     foreach
       -> joinGroupSpark
          -> load
{code}
If JoinGroupOptimizerSpark runs first, it rewrites the spark plan so that the POGlobalRearrange and POPackage operators no longer exist, and the plan is no longer suitable for the CombinerOptimizer.
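
The ordering problem can be illustrated with a toy model. This is only a sketch, not Pig code: a plan is modelled as a plain list of operator names taken from the diagrams above, and the two functions (hypothetical names `combiner_optimizer` and `join_group_optimizer`) are simple pattern rewrites standing in for the real optimizers.

```python
# Toy model: a plan is a list of operator names, root first, leaf last.
# The names mirror the diagrams above; they are NOT Pig's real classes.

def replace_first(plan, pattern, replacement):
    """Rewrite the first contiguous match of pattern; return (new_plan, matched)."""
    n = len(pattern)
    for i in range(len(plan) - n + 1):
        if plan[i:i + n] == pattern:
            return plan[:i] + replacement + plan[i + n:], True
    return list(plan), False

# Pattern and result of the (simplified) combiner rewrite.
COMBINER_PATTERN = ["foreach(algebraicOp)", "packager",
                    "globalRearrange", "localRearrange"]
COMBINER_RESULT = ["foreach(algebraicOp.Final)", "reduceBy(algebraicOp.Intermediate)",
                   "localRearrange", "foreach(algebraicOp.Initial)",
                   "combinerRearrange"]

# Pattern and result of the (simplified) join/group rewrite.
JOIN_GROUP_PATTERN = ["packager", "globalRearrange", "localRearrange"]
JOIN_GROUP_RESULT = ["joinGroupSpark"]

def combiner_optimizer(plan):
    return replace_first(plan, COMBINER_PATTERN, COMBINER_RESULT)

def join_group_optimizer(plan):
    return replace_first(plan, JOIN_GROUP_PATTERN, JOIN_GROUP_RESULT)

origin = ["foreach(algebraicOp)", "packager",
          "globalRearrange", "localRearrange", "load"]

# Order A: combiner first, then join/group.
plan_a, combined_a = combiner_optimizer(origin)
plan_a, joined_a = join_group_optimizer(plan_a)

# Order B: join/group first, then combiner.
plan_b, joined_b = join_group_optimizer(origin)
plan_b, combined_b = combiner_optimizer(plan_b)

print("combiner first  :", plan_a, "combiner applied:", combined_a)
print("join/group first:", plan_b, "combiner applied:", combined_b)
```

In this model, running the join/group rewrite first removes the packager and globalRearrange entries, so the combiner pattern has nothing left to match. Running the combiner first keeps the combiner rewrite available for algebraic plans.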


> Optimization for join/group case for spark mode
> -----------------------------------------------
>
>                 Key: PIG-4797
>                 URL: https://issues.apache.org/jira/browse/PIG-4797
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: liyunzhang_intel
>              Labels: spork
>         Attachments: Join performance analysis.pdf, PIG-4797.patch, PIG-4797_2.patch, PIG-4797_3.patch, PIG-4797_5.patch
>
>
> There is a big performance difference in join between spark and mr mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
>             date:chararray, open:float, high:float, low:float,
>             close:float, volume:int, adj_close:float);
> divs  = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
>             date:chararray, dividends:float);
> jnd   = join daily by (exchange, symbol), divs by (exchange, symbol);
> store jnd into './join.out';
> {code}
> join.sh
> {code}
> mode=$1
> start=$(date +%s)
> ./pig -x $mode  $PIG_HOME/bin/join.pig
> end=$(date +%s)
> execution_time=$(( $end - $start ))
> echo "execution_time:$execution_time"
> {code}
> The execution time:
> || ||mr||spark||
> |join|20 sec|79 sec|
> You can download the test data NYSE_daily and NYSE_dividends from https://github.com/alanfgates/programmingpig/blob/master/data/


