Posted to dev@pig.apache.org by "Ying He (JIRA)" <ji...@apache.org> on 2010/01/07 01:24:54 UTC
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

    [ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797412#action_12797412 ] 

Ying He commented on PIG-480:
-----------------------------

I ran more performance tests. The results show that performance depends
on the nature of the data. If the data is skewed, performance is very
bad in the combiner case. If the data is uniform, the combiner case gets
the largest performance gain. Each test runs a join followed by a
group by statement.

For skewed data, the result is much better if I use skewed join. I
think the reason for the bad performance on skewed data is that the map
plan of the second job is moved into the reducer of the first job. If
the data is skewed, a single reducer has to execute the extra logic for
all of its tuples, whereas without this patch that logic would be
executed across multiple mappers. So we lose parallelism there. The more
skewed the data is, the worse the performance gets.
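
As a rough illustration of the reasoning above, here is a toy model in
Python. It is only a sketch under stated assumptions, not Pig or Hadoop
code: the function names, the modulo partitioner, and the key counts are
all hypothetical. It compares the number of tuples the busiest reducer
must push through the extra logic (patch) with the number each mapper
would handle if the same tuples were split evenly (trunk).

```python
import math


def max_reducer_load(key_counts, num_reducers):
    """Tuples handled by the busiest reducer.

    Assumption: key i goes to reducer i % num_reducers, a stand-in
    for a hash partitioner. All tuples of a key land on one reducer.
    """
    loads = [0] * num_reducers
    for key, count in enumerate(key_counts):
        loads[key % num_reducers] += count
    return max(loads)


def even_mapper_load(key_counts, num_mappers):
    """Tuples per mapper when input is split evenly, regardless of key."""
    return math.ceil(sum(key_counts) / num_mappers)


# Hypothetical key distributions, 1000 tuples each.
uniform = [100] * 10
skewed = [910] + [10] * 9

for name, counts in (("uniform", uniform), ("skewed", skewed)):
    print(name,
          "busiest reducer:", max_reducer_load(counts, 5),
          "per mapper:", even_mapper_load(counts, 5))
```

In this model the uniform input gives the same load either way, while
the skewed input leaves one reducer with most of the tuples, which is
the loss of parallelism described above.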

1. skewed data

combiner
          job 1         job 2         total
patch     7min 53sec    1min 1sec     8min 54sec
trunk     4min 43sec    1min 37sec    6min 20sec

combiner and using skewed join
          job 1         job 2         total
patch     1min 55sec    1min 1sec     2min 56sec
trunk     1min 44sec    1min 40sec    3min 24sec

no combiner
          job 1         job 2         total
patch     2min 26sec    2min 28sec    4min 54sec
trunk     1min 25sec    3min 24sec    4min 49sec

no combiner and using skewed join
          job 1         job 2         total
patch     1min 17sec    3min 5sec     4min 22sec
trunk     59sec         3min 7sec     4min 6sec

2. uniform data

combiner
          job 1         job 2         total
patch     6min 48sec    3min 43sec    10min 31sec
trunk     7min 32sec    7min 3sec     14min 35sec

no combiner
          job 1         job 2         total
patch     1min 25sec    2min 25sec    3min 50sec
trunk     1min 24sec    2min 28sec    3min 52sec

Each group of tests may use different data, so don't compare results across groups.
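
To make the within-group comparison easier to read, the totals above can
be turned into a patch/trunk ratio. The Python sketch below does only
that arithmetic; the helper names and the group labels are made up for
illustration, and the timings are copied from the tables. A ratio below
1 means the patch was faster in that group.

```python
import re


def to_seconds(duration):
    """Convert a string such as '7min 53sec' or '59sec' to seconds."""
    minutes = re.search(r"(\d+)min", duration)
    seconds = re.search(r"(\d+)sec", duration)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


def patch_to_trunk_ratio(patch_total, trunk_total):
    """Ratio of patch wall time to trunk wall time for one test group."""
    return to_seconds(patch_total) / to_seconds(trunk_total)


# (patch total, trunk total) per group, taken from the tables above.
TOTALS = {
    "skewed, combiner": ("8min 54sec", "6min 20sec"),
    "skewed, combiner, skewed join": ("2min 56sec", "3min 24sec"),
    "skewed, no combiner": ("4min 54sec", "4min 49sec"),
    "skewed, no combiner, skewed join": ("4min 22sec", "4min 6sec"),
    "uniform, combiner": ("10min 31sec", "14min 35sec"),
    "uniform, no combiner": ("3min 50sec", "3min 52sec"),
}

for group, (patch, trunk) in TOTALS.items():
    print(f"{group}: {patch_to_trunk_ratio(patch, trunk):.2f}")
```

Because each group may use different data, the ratios are meaningful
only within a group, not across groups.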


> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> -------------------------------------------------------
>
>                 Key: PIG-480
>                 URL: https://issues.apache.org/jira/browse/PIG-480
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>            Assignee: Ying He
>         Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch
>
>
> For Pig jobs that consist of two or more MR jobs, use the identity mapper wherever possible in the second and subsequent MR jobs. The identity mapper is about 50% faster than a Pig empty map job because it doesn't parse the data.
