Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/02/18 03:16:18 UTC

[jira] [Updated] (PIG-4797) Analyze JOIN performance and improve the same.

     [ https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4797:
----------------------------------
    Description: 
There is a large performance difference in JOIN between Spark mode and MR mode.
{code}
daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
            date:chararray, open:float, high:float, low:float,
            close:float, volume:int, adj_close:float);
divs  = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
            date:chararray, dividends:float);
jnd   = join daily by (exchange, symbol), divs by (exchange, symbol);
store jnd into './join.out';
{code}

join.sh
{code}
mode=$1
start=$(date +%s)
./pig -x "$mode" "$PIG_HOME/bin/join.pig"
end=$(date +%s)
execution_time=$(( end - start ))
echo "execution_time:$execution_time"
{code}

run in MR:
sh join.sh mr 
20 seconds

run in Spark:
sh join.sh spark
79 seconds
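
To make the comparison easier to repeat, the timing logic in join.sh could be wrapped in small helpers. This is only a sketch: the function names time_cmd and slowdown are hypothetical, and it assumes awk is available on the test machine. It reports whole-second wall-clock time, like the script above.

```shell
#!/bin/sh
# Hypothetical helper: print the wall-clock seconds a command takes.
time_cmd() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Hypothetical helper: ratio of a duration to a baseline, two decimals.
slowdown() {
    awk -v base="$1" -v other="$2" 'BEGIN { printf "%.2f\n", other / base }'
}

# Figures reported above: 20 s in MR mode, 79 s in Spark mode.
slowdown 20 79   # prints 3.95
```

A run for one mode would then look like time_cmd ./pig -x spark "$PIG_HOME/bin/join.pig". With the numbers above, Spark mode is roughly 4x slower than MR mode for this join.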


> Analyze JOIN performance and improve the same.
> ----------------------------------------------
>
>                 Key: PIG-4797
>                 URL: https://issues.apache.org/jira/browse/PIG-4797
>             Project: Pig
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>
> There is a large performance difference in JOIN between Spark mode and MR mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
>             date:chararray, open:float, high:float, low:float,
>             close:float, volume:int, adj_close:float);
> divs  = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
>             date:chararray, dividends:float);
> jnd   = join daily by (exchange, symbol), divs by (exchange, symbol);
> store jnd into './join.out';
> {code}
> join.sh
> {code}
> mode=$1
> start=$(date +%s)
> ./pig -x "$mode" "$PIG_HOME/bin/join.pig"
> end=$(date +%s)
> execution_time=$(( end - start ))
> echo "execution_time:$execution_time"
> {code}
> run in MR:
> sh join.sh mr 
> 20 seconds
> run in Spark:
> sh join.sh spark
> 79 seconds


