Posted to dev@pig.apache.org by "Ying He (JIRA)" <ji...@apache.org> on 2009/12/03 21:19:21 UTC
[jira] Updated: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

     [ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ying He updated PIG-480:
------------------------

    Attachment: PIG_480.patch

Patch to use an identity map.

An IdentityMapOptimizer is applied when an MR plan contains at least 2 MR jobs. It evaluates each MR job: if its reducer uses POStore to write a tmp file, and the mapper of the next MR job contains only a POLoad that loads the tmp file and a POLocalRearrange, then the POLocalRearrange of the next mapper is moved up into the reducer of this MR job, and the mapper of the next MR job is changed to use an identity map.
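The rewrite rule above can be sketched on a toy plan model. This is only an illustration in Python, not the code in the patch: the MRJob class, the operator strings, and apply_identity_map are hypothetical stand-ins for Pig's real plan classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MRJob:
    """Toy stand-in for one map-reduce job; not Pig's real operator class."""
    map_ops: List[str]
    reduce_ops: List[str]
    identity_map: bool = False


def apply_identity_map(jobs: List[MRJob]) -> List[MRJob]:
    """Apply the rewrite to each adjacent pair of jobs, in place."""
    if len(jobs) < 2:
        return jobs  # the optimizer only applies to plans with >= 2 jobs
    for cur, nxt in zip(jobs, jobs[1:]):
        # Condition 1: this job's reducer ends by storing a tmp file.
        stores_tmp = bool(cur.reduce_ops) and cur.reduce_ops[-1] == "POStore(tmp)"
        # Condition 2: the next mapper only loads that file and rearranges.
        trivial_map = nxt.map_ops == ["POLoad(tmp)", "POLocalRearrange"]
        if stores_tmp and trivial_map:
            # Move the rearrange up so it runs just before the store.
            cur.reduce_ops.insert(len(cur.reduce_ops) - 1, "POLocalRearrange")
            nxt.map_ops = ["POLoad(tmp)"]
            nxt.identity_map = True
    return jobs
```

In this toy model a job whose mapper does anything besides load and rearrange is left untouched, which mirrors the "only contains" condition in the description.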

In this case, the reducer of the MR job outputs (key, tuple) pairs to the tmp file using a different OutputFormat, PigBinaryValueOutputFormat. It uses a different record writer to write the data. The format of each record is:

delimiter (3 bytes: 0x01, 0x02, 0x03)
key
length of byte[] for tuple
byte[] for tuple
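
The layout above can be illustrated with a small round trip. This is a sketch in Python, not the Java record writer in the patch. The description does not say how the key or the length field is encoded, so the length-prefixed key and the 4-byte big-endian lengths are assumptions, and write_record and read_record are hypothetical names.

```python
import io
import struct

DELIMITER = bytes([0x01, 0x02, 0x03])  # 3-byte record delimiter from the description


def write_record(out, key: bytes, tuple_bytes: bytes) -> None:
    """Write one record: delimiter, key, tuple length, tuple bytes."""
    out.write(DELIMITER)
    # Assumption: the key is written as a length-prefixed byte string.
    out.write(struct.pack(">I", len(key)))
    out.write(key)
    # Assumption: the tuple length is a 4-byte big-endian integer.
    out.write(struct.pack(">I", len(tuple_bytes)))
    out.write(tuple_bytes)


def read_record(inp):
    """Read one record; the tuple is returned as raw bytes, not parsed."""
    marker = inp.read(3)
    if not marker:
        return None  # end of stream
    if marker != DELIMITER:
        raise ValueError("bad record delimiter")
    (key_len,) = struct.unpack(">I", inp.read(4))
    key = inp.read(key_len)
    (tuple_len,) = struct.unpack(">I", inp.read(4))
    return key, inp.read(tuple_len)
```

Because the tuple length is stored ahead of the tuple bytes, a reader can copy the tuple without understanding its contents.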

The next MR job, which uses the identity map, uses a different InputFormat, PigBinaryValueInputFormat, which returns a different RecordReader that reads the data in as (key, tuple) pairs, with the tuple kept in byte[] form. The identity map does nothing except pass each (key, tuple) pair through and write it to disk. When the reducer picks the pairs up, the tuple is deserialized for processing.

The reason for doing this is performance. Because the tuples read in and written out by the identity map stay in byte[] form, we save one deserialization and one serialization of each tuple in the mapper.
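
The saving can be seen in a minimal sketch of the pass-through. Again this is illustrative Python rather than the patch's code: pickle is only a stand-in for Pig's tuple serialization, and identity_map and reduce_side are hypothetical names.

```python
import pickle


def identity_map(records):
    """Pass (key, tuple_bytes) pairs through without touching the bytes."""
    for key, tuple_bytes in records:
        yield key, tuple_bytes  # no deserialize/serialize round trip here


def reduce_side(records):
    """Group by key and deserialize each tuple only when the reducer needs it."""
    groups = {}
    for key, tuple_bytes in records:
        # pickle.loads stands in for Pig's tuple deserialization.
        groups.setdefault(key, []).append(pickle.loads(tuple_bytes))
    return groups
```

The only place a tuple is turned back into an object is reduce_side, so the map stage costs a byte copy per record instead of a parse and a re-encode.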

An example use case follows:

a = load 'f' as (id, v);
b = load 's' as (id, v);
c = join a by id, b by id;
d = group c by a::id;
dump d;

This example contains 2 MR jobs. After optimization, the first job outputs (key, tuple) pairs, and the second job uses an identity map.


> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> -------------------------------------------------------
>
>                 Key: PIG-480
>                 URL: https://issues.apache.org/jira/browse/PIG-480
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>         Attachments: PIG_480.patch
>
>
> For scripts that compile to two or more MR jobs, use the identity mapper wherever possible in the second and subsequent MR jobs. The identity mapper is about 50% faster than a Pig empty map job because it doesn't parse the data.
