Posted to dev@pig.apache.org by "Ying He (JIRA)" <ji...@apache.org> on 2010/01/07 01:24:54 UTC
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a
chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797412#action_12797412 ]
Ying He commented on PIG-480:
-----------------------------
I ran more performance tests. The results show that performance depends
on the nature of the data. With skewed data, performance is very bad in
the combiner case. With uniform data, the combiner case gets the largest
performance gain. The test script does a join followed by a group by.
For skewed data, using skewed join gives a much better result. I think
the reason for the bad performance on skewed data is that the map plan
of the second job is moved into the reducer of the first job. If the
data is skewed, a single reducer has to execute the extra logic for all
of its tuples, whereas without this patch that logic would be executed
in multiple mappers. So we lose parallelism. The more skewed the data
is, the worse the performance gets.
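The parallelism argument above can be illustrated with a toy cost model. This is only a sketch, not Pig code: the function names, the per-reducer tuple counts, and the per-tuple cost are all made-up assumptions, and the model ignores the read and parse cost that the patch saves in the second job.

```python
# Toy cost model, not Pig code. All loads and costs are hypothetical.

def stage_time(tuples_per_task, cost_per_tuple):
    """A parallel stage finishes when its slowest task finishes."""
    return max(tuples_per_task) * cost_per_tuple


def split_evenly(total, tasks):
    """Spread `total` tuples over `tasks` tasks as evenly as possible."""
    base, extra = divmod(total, tasks)
    return [base + (1 if i < extra else 0) for i in range(tasks)]


def compare(reducer_loads, num_mappers, extra_cost):
    """Return (patch, trunk) time spent on the second job's map logic.

    patch: the logic runs in job 1's reducers, so it inherits their key skew.
    trunk: the logic runs in job 2's mappers, assumed to read even splits.
    """
    total = sum(reducer_loads)
    patch = stage_time(reducer_loads, extra_cost)
    trunk = stage_time(split_evenly(total, num_mappers), extra_cost)
    return patch, trunk


skewed = [970, 10, 10, 10]       # one hot key dominates a single reducer
uniform = [250, 250, 250, 250]   # keys spread evenly over the reducers

print(compare(skewed, 4, 1.0))   # (970.0, 250.0)
print(compare(uniform, 4, 1.0))  # (250.0, 250.0)
```

In this model the extra logic costs the same under patch and trunk when the reducers are balanced, but under skew the patch pays for the hottest reducer while trunk still pays only for an average-sized split.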
1. skewed data

combiner
        job 1         job 2         total
patch   7min 53sec    1min 1sec     8min 54sec
trunk   4min 43sec    1min 37sec    6min 20sec

combiner and using skewed join
        job 1         job 2         total
patch   1min 55sec    1min 1sec     2min 56sec
trunk   1min 44sec    1min 40sec    3min 24sec

no combiner
        job 1         job 2         total
patch   2min 26sec    2min 28sec    4min 54sec
trunk   1min 25sec    3min 24sec    4min 49sec

no combiner and using skewed join
        job 1         job 2         total
patch   1min 17sec    3min 5sec     4min 22sec
trunk   59sec         3min 7sec     4min 6sec

2. uniform data

combiner
        job 1         job 2         total
patch   6min 48sec    3min 43sec    10min 31sec
trunk   7min 32sec    7min 3sec     14min 35sec

no combiner
        job 1         job 2         total
patch   1min 25sec    2min 25sec    3min 50sec
trunk   1min 24sec    2min 28sec    3min 52sec

Each group of tests may use different data, so don't compare results across groups.
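To make the patch versus trunk comparison within each group easier to read, the totals reported above can be converted to seconds and expressed as a relative change. The following is a small helper sketch. The function names and the dictionary layout are my own assumptions; only the durations are copied from the measurements.

```python
import re


def to_seconds(text):
    """Convert a duration such as '7min 53sec' or '59sec' to seconds."""
    minutes = re.search(r"(\d+)min", text)
    seconds = re.search(r"(\d+)sec", text)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


def change_percent(patch, trunk):
    """Change of the patch total relative to trunk (negative means faster)."""
    p, t = to_seconds(patch), to_seconds(trunk)
    return round(100.0 * (p - t) / t, 1)


# test group -> (patch total, trunk total), copied from the reported timings
totals = {
    "skewed, combiner": ("8min 54sec", "6min 20sec"),
    "skewed, combiner + skewed join": ("2min 56sec", "3min 24sec"),
    "skewed, no combiner": ("4min 54sec", "4min 49sec"),
    "skewed, no combiner + skewed join": ("4min 22sec", "4min 6sec"),
    "uniform, combiner": ("10min 31sec", "14min 35sec"),
    "uniform, no combiner": ("3min 50sec", "3min 52sec"),
}

for name, (patch, trunk) in totals.items():
    print(f"{name}: {change_percent(patch, trunk):+.1f}%")
```

Each percentage compares patch and trunk inside one group only, which is consistent with the caveat that different groups may have used different data.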
> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> -------------------------------------------------------
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
> Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use the identity mapper wherever possible in the second and subsequent MR jobs. The identity mapper is about 50% faster than a Pig empty map job because it doesn't parse the data.