Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2008/05/26 19:58:02 UTC

[jira] Created: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

The reduce task should not flush the in memory file system before starting the reducer
--------------------------------------------------------------------------------------

                 Key: HADOOP-3446
                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Owen O'Malley
            Assignee: Owen O'Malley
            Priority: Critical


In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3446:
------------------------------------

    Hadoop Flags: [Reviewed]  (was: [Reviewed, Incompatible change])

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch, 3446-7.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-2.patch

This changes reduce as follows:

* Instead of specifying {{fs.inmemory.size.mb}}, map outputs will consume {{mapred.copy.inmem.percent}} of the maximum heap size as returned by {{Runtime.maxMemory()}}, defaulting to 0.7. Since {{mapred.child.java.opts}} defaults to 200mb and {{fs.inmemory.size.mb}} defaults to 75mb, this might be considered an incompatible change.
* The memory threshold at which the in-memory merge starts during the shuffle is now user-configurable ({{mapred.inmem.merge.usage}}), defaulting to the old value of 0.66. {{mapred.inmem.merge.threshold}} still controls the maximum number of segments.
* Instead of performing a final in-memory merge, the segments are left in memory. At the beginning of the sort phase, the ReduceCopier is queried for an Iterator to feed the reduce. A user-configurable property {{mapred.reduce.inmem.percent}} determines the maximum size of the segments that may be merged from memory during the reduce, relative to the ShuffleRamManager threshold. If the retained segments exceed this threshold, they must be written to disk before the reduce starts. If there are sufficient segments already on disk to require intermediate merges, the spilled segments will be rolled into the first merge; otherwise they will be merged to disk. The merge feeding the reduce will contain all the segments, from RAM and from disk, that fit below the in-memory reduce threshold. So given:
{noformat}
+----+ <- Max heap memory (e.g. -Xmx512m) (H)
|    |
|----| <- mapred.copy.inmem.percent (C)
|    |
|    |
|----| <- mapred.reduce.inmem.percent (R)
|    |
+----+
{noformat}
The maximum memory used for copying map output will be {{H*C}}, while the minimum memory available to the reduce will be {{H*(1-C*R)}}.
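
To make the arithmetic concrete, here is a minimal, hypothetical Java sketch (not the actual ReduceTask code) that computes these budgets from {{Runtime.maxMemory()}} and the two fractions proposed in this patch:
{noformat}
// Hypothetical helper illustrating the budgets above; the property
// names are the ones proposed in this patch, the class itself is not
// part of the patch.
public class ShuffleMemoryBudget {
  public static void main(String[] args) {
    long maxHeap = Runtime.getRuntime().maxMemory(); // H, e.g. -Xmx512m

    float copyPercent = 0.7f;   // mapred.copy.inmem.percent (C)
    float reducePercent = 0.5f; // mapred.reduce.inmem.percent (R)

    // The shuffle may hold at most H*C bytes of map output in memory
    long shuffleBudget = (long) (maxHeap * copyPercent);

    // Segments retained for the reduce may occupy at most R of that
    // budget (H*C*R), leaving at least H*(1-C*R) for the reducer
    long retainedLimit = (long) (shuffleBudget * reducePercent);
    long reducerMinFree = maxHeap - retainedLimit;

    System.out.println("shuffle budget (H*C):      " + shuffleBudget);
    System.out.println("retained segments (H*C*R): " + retainedLimit);
    System.out.println("min free for reduce:       " + reducerMinFree);
  }
}
{noformat}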

This passes all unit tests on my machine. I'll work on measuring its performance and post the results presently.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-1.patch

This passes mapred/hdfs tests and patch validation on my machine and doesn't break LocalJobRunner (unlike 3446-0).

{noformat}
     [exec] -1 overall.

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.

     [exec]     -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
{noformat}

The "javadoc warning" is from:
{noformat}
    [javadoc] javadoc: warning - Multiple sources of package comments found for package "org.apache.commons.logging"
    [javadoc] javadoc: warning - Multiple sources of package comments found for package "org.apache.commons.logging.impl"
{noformat}

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627627#action_12627627 ] 

Hadoop QA commented on HADOOP-3446:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389329/3446-4.patch
  against trunk revision 691099.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3158/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3158/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3158/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3158/console

This message is automatically generated.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-3446:
----------------------------------

    Status: Open  (was: Patch Available)

You need to add some tests for this. You should also make some Forrest edits to explain the usage of the config variables.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Patch Available  (was: Open)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627175#action_12627175 ] 

Chris Douglas commented on HADOOP-3446:
---------------------------------------

I'll add unit tests/docs with the next patch.

As a benchmark, I ran RandomWriter on 19 TaskTrackers at 1GB/node, followed by several sort runs. The max heap was set to 512MB, {{mapred.copy.inmem.percent}} to 0.8, and {{dfs.replication}} to 1. The times recorded are the min/max/avg times for the reduce, measured from the end of the shuffle to the end of the reduce.

Params are formatted as {{io.sort.factor/mapred.inmem.merge.threshold/mapred.inmem.merge.usage/mapred.reduce.inmem.percent}}; a configuration sketch for one of the rows follows the table.

|| Params || Min || Max || Avg || Notes ||
| 100/0/1.0/1.0 | 8.35 | 57.775 | 23.1603 | Never hits disk |
| 9/15/1.0/0.01 | 11.164 | 67.569 | 38.0216 | Spills several times, merges some in-memory segments during intermediate merge | 
| 100/0/1.0/0.5 | 11.215 | 74.59 | 33.8571 | Spills some segments to disk before starting reduce |
| 100/0/1.0/0.0 | 17.184 | 88.479 | 59.5489 | Spills all segments to disk before starting reduce |
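
For one of the rows above (100/0/1.0/0.5), a hedged sketch of the corresponding job setup against the {{JobConf}} API; the property names are the ones introduced by this patch, the wrapper class is hypothetical:
{noformat}
import org.apache.hadoop.mapred.JobConf;

public class SortBenchmarkConf {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // merge fan-in for the sort phase
    conf.setInt("io.sort.factor", 100);
    // 0 disables the segment-count trigger for in-memory merges
    conf.setInt("mapred.inmem.merge.threshold", 0);
    // start the in-memory merge only when the copy buffer is full
    conf.setFloat("mapred.inmem.merge.usage", 1.0f);
    // retain up to half of the copy buffer in memory for the reduce
    conf.setFloat("mapred.reduce.inmem.percent", 0.5f);
    return conf;
  }
}
{noformat}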



> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-3.patch

Added a unit test. I'm not sure where documentation for the new parameters belongs...

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

      Resolution: Fixed
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Incompatible change])
          Status: Resolved  (was: Patch Available)

I just committed this.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch, 3446-7.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Patch Available  (was: Open)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Patch Available  (was: Open)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch, 3446-7.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628598#action_12628598 ] 

Hadoop QA commented on HADOOP-3446:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389455/3446-5.patch
  against trunk revision 692335.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3184/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3184/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3184/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3184/console

This message is automatically generated.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Open  (was: Patch Available)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch, 3446-7.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-7.patch

Merged with trunk.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch, 3446-7.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Patch Available  (was: Open)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629667#action_12629667 ] 

Chris Douglas commented on HADOOP-3446:
---------------------------------------

The test failure is not related.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch, 3446-7.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-3446:
----------------------------------

    Hadoop Flags: [Incompatible change]

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Open  (was: Patch Available)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-6.patch

Changed the config var names and the semantics of the reduce percentage, and updated the documentation and tests to reflect this.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629660#action_12629660 ] 

Hadoop QA commented on HADOOP-3446:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389772/3446-7.patch
  against trunk revision 693587.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3222/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3222/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3222/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3222/console

This message is automatically generated.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch, 3446-7.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Patch Available  (was: Open)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-4.patch

Added documentation

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas reassigned HADOOP-3446:
-------------------------------------

    Assignee: Chris Douglas  (was: Owen O'Malley)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Fix Version/s: 0.19.0
           Status: Patch Available  (was: Open)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610240#action_12610240 ] 

Owen O'Malley commented on HADOOP-3446:
---------------------------------------

Unfortunately, the patch that I used is now hopelessly out of date after HADOOP-2095. Someone will try to get a new solution for this into 0.19.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Critical
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-3446:
----------------------------------

    Status: Open  (was: Patch Available)

This looks good, but I think we should define the new parameter mapred.reduce.inmem.percent as a percent of the total heap size rather than a percent of mapred.copy.inmem.percent. 

I'd also change the names to:
mapred.reduce.input.buffer.percent
mapred.shuffle.input.buffer.percent
mapred.shuffle.merge.percent

Other than that, it looks good.
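
Under that proposal, the reduce-side limit would be computed directly against the heap; a hypothetical sketch using the suggested name (assuming {{conf}} is the task's JobConf):
{noformat}
// Proposed semantics: the reduce input buffer is a fraction of the
// total heap rather than a fraction of the shuffle buffer.
long maxHeap = Runtime.getRuntime().maxMemory();
float bufPct = conf.getFloat("mapred.reduce.input.buffer.percent", 0.0f);
// bytes of map output that may stay in memory when the reduce starts
long retainedLimit = (long) (maxHeap * bufPct);
{noformat}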

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627046#action_12627046 ] 

Hadoop QA commented on HADOOP-3446:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389141/3446-2.patch
  against trunk revision 690142.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3143/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3143/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3143/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3143/console

This message is automatically generated.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-0.patch

I tested this on a 100-node cluster (98 TaskTrackers) using sort. Given 300MB/node of data, a sufficiently large {{io.sort.mb}} and {{fs.inmemory.size.mb}}, {{io.sort.spill.percent}}=1.0, {{fs.inmemory.merge.threshold}}=0, and {{mapred.inmem.usage}}=1.0, each reduce took an average of 121 seconds when reading from disk vs. 79 seconds when merging and reducing from memory. While the sort with the patch finished the job in 8 minutes instead of 9, both runs had slow TaskTrackers that threw off the running time.

This also includes some similar changes to MapTask, letting the record and serialization buffer soft limits be configured separately.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Critical
>         Attachments: 3446-0.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Open  (was: Patch Available)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Open  (was: Patch Available)

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630061#action_12630061 ] 

Arun C Murthy commented on HADOOP-3446:
---------------------------------------

+1

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch, 3446-7.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Status: Patch Available  (was: Open)

Submitting to Hudson.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628835#action_12628835 ] 

Hadoop QA commented on HADOOP-3446:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389602/3446-6.patch
  against trunk revision 692597.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3193/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3193/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3193/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3193/console

This message is automatically generated.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch, 3446-6.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-5.patch

Moved unrelated changes to MapTask into a separate JIRA (HADOOP-4063) and changed some LinkedLists to ArrayLists.

> The reduce task should not flush the in memory file system before starting the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Chris Douglas
>            Priority: Critical
>             Fix For: 0.19.0
>
>         Attachments: 3446-0.patch, 3446-1.patch, 3446-2.patch, 3446-3.patch, 3446-4.patch, 3446-5.patch
>
>
> In the case where the entire reduce inputs fit in ram, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better if we merged from the ramfs and any spills to feed the reducer its input.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.