Posted to mapreduce-issues@hadoop.apache.org by "Allen Wittenauer (JIRA)" <ji...@apache.org> on 2015/03/10 03:54:38 UTC

[jira] [Updated] (MAPREDUCE-1939) split reduce compute phase into two threads one for reading and another for computing

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated MAPREDUCE-1939:
----------------------------------------
    Fix Version/s:     (was: 0.20.2)

> split reduce compute phase into two threads one for reading and another for computing
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1939
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1939
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.2
>            Reporter: wangxiaowei
>
> It is known that a reduce task is made up of three phases: shuffle, sort, and reduce. During the reduce phase, the reduce function first reads a record from disk or memory, processes it, and finally writes the result to HDFS. To turn this serial process into a parallel one, I split the reduce phase into two threads, called producer and consumer respectively. The producer reads records from disk, and the consumer processes the records the producer has read. I use two buffers: while the producer is filling one buffer, the consumer reads from the other. In theory the two phases overlap, so the overall reduce time can be shortened.
> I wonder why Hadoop does not already implement this. Are there potential problems with such an idea?
> I have already implemented a prototype. The producer just reads raw bytes from disk and leaves the transformation into real key and value objects to the consumer. The results are not good: only about a 13% improvement in time. I think this is related to the buffer size and to how the time is split between the two threads. Maybe the consumer thread takes too long, so the producer has to wait until the next buffer is available.
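
For reference, here is a minimal, self-contained Java sketch of the double-buffering idea described in the quoted report. It is not the reporter's actual prototype: the names DoubleBufferedReduceSketch, Chunk, and RecordProcessor are hypothetical, and RecordProcessor stands in for the key/value deserialization plus the user's reduce() call. A java.util.concurrent.Exchanger swaps a filled buffer for an empty one at each step, so reading and computing can overlap.

import java.io.DataInputStream;
import java.io.IOException;
import java.util.concurrent.Exchanger;

public class DoubleBufferedReduceSketch {

    /** One of the two buffers swapped between the threads. */
    static final class Chunk {
        final byte[] data = new byte[64 * 1024];
        int length;          // number of valid bytes; -1 signals end of input
    }

    /** Hypothetical hook standing in for key/value decoding plus reduce(). */
    interface RecordProcessor {
        void process(byte[] data, int length);
    }

    public static void run(DataInputStream in, RecordProcessor processor)
            throws InterruptedException {
        Exchanger<Chunk> exchanger = new Exchanger<>();

        // Producer: reads raw bytes into its current chunk, then trades it
        // for the chunk the consumer has just finished processing.
        Thread producer = new Thread(() -> {
            Chunk current = new Chunk();
            try {
                while (true) {
                    current.length = in.read(current.data);
                    boolean done = current.length < 0;
                    current = exchanger.exchange(current);
                    if (done) {
                        return;           // end-of-input sentinel handed off
                    }
                }
            } catch (IOException | InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer: processes the chunk handed over by the producer while
        // the producer is already filling the other buffer.
        Chunk current = new Chunk();
        current.length = 0;
        while (true) {
            current = exchanger.exchange(current);
            if (current.length < 0) {
                break;                    // producer reached end of input
            }
            processor.process(current.data, current.length);
        }
        producer.join();
    }
}

As the report suggests, the benefit of this overlap depends on the buffer size and on how evenly the work is split: if the consumer (deserialization plus reduce) dominates, the producer simply blocks at the exchange point and the speedup stays small.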



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)