You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Aaron Kimball (JIRA)" <ji...@apache.org> on 2009/10/06 01:47:31 UTC

[jira] Created: (MAPREDUCE-1059) distcp can generate uneven map task assignments

distcp can generate uneven map task assignments
-----------------------------------------------

                 Key: MAPREDUCE-1059
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: distcp
            Reporter: Aaron Kimball
            Assignee: Aaron Kimball
         Attachments: MAPREDUCE-1059.patch

distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763163#action_12763163 ] 

Aaron Kimball commented on MAPREDUCE-1059:
------------------------------------------

This adds a deprecation warning for an additional use of {{org.apache.hadoop.mapred.FileSplit}}. Seeing as how these deprecation warnings are left unsuppressed in the rest of DistCp (as they should be), this is an expected outcome. Core test failure is still unrelated to this patch.

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1059:
-------------------------------------

    Attachment: MAPREDUCE-1059.3.patch

I realized that the problem is much simpler than I was making it out to be. The split points inserted into the existing file are simply done at points which have nothing to do with the user's chosen target map task size. It always uses the hardcoded constant MAX_BYTES_PER_MAP. This patch changes this to query the configuration for distcp.bytes.per.map.

To verify this works: run a distcp job transferring 100 files of 1 KB: {{hadoop distcp -Ddistcp.bytes.per.map=100 srcpath dstpath}}.

On a pseudo-distributed cluster, Hadoop will generate 20 tasks (the task max of 20 tasks/node will limit the target number of tasks). Then ten of these tasks will transfer files, ten will transfer none because their splits contained no sync points. With this patch, all 20 tasks will transfer 5 files each.


> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.3.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1059:
-------------------------------------

    Status: Open  (was: Patch Available)

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1059:
-------------------------------------

    Status: Patch Available  (was: Open)

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Venkatesh S (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766137#action_12766137 ] 

Venkatesh S commented on MAPREDUCE-1059:
----------------------------------------

> In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.
Not sure if this is the case. It sorts the files in descending order so larger files get copied before getting to smaller files. Since all the files are verified for copy at setup, I suspect that there will be empty maps. 

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762492#action_12762492 ] 

Hadoop QA commented on MAPREDUCE-1059:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12421354/MAPREDUCE-1059.patch
  against trunk revision 819740.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 2 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/141/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/141/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/141/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/141/console

This message is automatically generated.

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1059:
-------------------------------------

    Attachment: MAPREDUCE-1059.patch

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762452#action_12762452 ] 

Aaron Kimball commented on MAPREDUCE-1059:
------------------------------------------

The situation is demonstrated most effectively when a large number of small files are to be transferred, interspersed with a few larger files. SequenceFile sync points are not distributed through the transfer list file in the correct locations. Consequently, many mappers do not have any work to do, as their transfer lists are subsumed by adjacent map tasks in the file.

This patch causes the transfer list file to be rewritten as part of the splitting process, with sync() points inserted at the correct boundaries between map tasks.

This patch also includes tests of the split algorithm which apply the splitting process to several different sets of file sizes. The old process can generate lists of splits in which up to 75% of the mappers do no work. The new process guarantees a better distribution of files over mappers.


> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793401#action_12793401 ] 

Aaron Kimball commented on MAPREDUCE-1059:
------------------------------------------

Failing contrib tests are streaming regressions. No new unit tests for this patch because it is a small patch and testing this behavior with unit tests would be challenging. A functional test plan was outlined in my previous comment. 

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.3.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762795#action_12762795 ] 

Aaron Kimball commented on MAPREDUCE-1059:
------------------------------------------

(Note that "fixing" testMapCount involves changing the expectation of the test case; distcp now generates 3 splits of equal size, rather than "4 or 5" as was expected earlier.)

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1059:
-------------------------------------

    Status: Open  (was: Patch Available)

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762830#action_12762830 ] 

Hadoop QA commented on MAPREDUCE-1059:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12421468/MAPREDUCE-1059.2.patch
  against trunk revision 819740.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 5 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 2331 javac compiler warnings (more than the trunk's current 2330 warnings).

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/144/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/144/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/144/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/144/console

This message is automatically generated.

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1059:
-------------------------------------

    Status: Patch Available  (was: Open)

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.3.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1059:
-------------------------------------

    Status: Patch Available  (was: Open)

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796714#action_12796714 ] 

Hudson commented on MAPREDUCE-1059:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #196 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/196/])
    

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.3.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1059:
-------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

+1

I committed this. Thanks, Aaron!

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.3.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792802#action_12792802 ] 

Hadoop QA commented on MAPREDUCE-1059:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428394/MAPREDUCE-1059.3.patch
  against trunk revision 892411.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/222/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/222/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/222/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/222/console

This message is automatically generated.

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.3.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1059:
-------------------------------------

    Attachment: MAPREDUCE-1059.2.patch

The testHftpAccessControl error is unrelated. This patch fixes the other two unit test errors and the findbugs warning.

> distcp can generate uneven map task assignments
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.patch
>
>
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.