Posted to dev@pig.apache.org by "Pallavi Rao (JIRA)" <ji...@apache.org> on 2016/02/18 11:18:18 UTC
[jira] [Commented] (PIG-4797) Analyze JOIN performance and improve the same.
[ https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152085#comment-15152085 ]
Pallavi Rao commented on PIG-4797:
----------------------------------
Errata: There was NO performance difference between CoGroupRDD and cogroup operator.
> Analyze JOIN performance and improve the same.
> ----------------------------------------------
>
> Key: PIG-4797
> URL: https://issues.apache.org/jira/browse/PIG-4797
> Project: Pig
> Issue Type: Improvement
> Components: spark
> Reporter: Pallavi Rao
> Assignee: Pallavi Rao
> Labels: spork
> Attachments: Join performance analysis - Google Docs.pdf
>
>
> There is a big performance difference in join between spark and mr mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
> date:chararray, open:float, high:float, low:float,
> close:float, volume:int, adj_close:float);
> divs = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
> date:chararray, dividends:float);
> jnd = join daily by (exchange, symbol), divs by (exchange, symbol);
> store jnd into './join.out';
> {code}
> join.sh
> {code}
> mode=$1
> start=$(date +%s)
> ./pig -x $mode $PIG_HOME/bin/join.pig
> end=$(date +%s)
> execution_time=$(( $end - $start ))
> echo "execution_time:"$execution_time
> {code}
> The execution time:
> || ||mr||spark||
> |join|20 sec|79 sec|
> You can download the test data (NYSE_daily and NYSE_dividends) from https://github.com/alanfgates/programmingpig/blob/master/data/