Posted to mapreduce-issues@hadoop.apache.org by "Peter Romianowski (JIRA)" <ji...@apache.org> on 2009/12/16 23:34:18 UTC

[jira] Created: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Massive performance problem with DistCp and -delete
---------------------------------------------------

                 Key: MAPREDUCE-1305
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1305
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: distcp
    Affects Versions: 0.20.1
            Reporter: Peter Romianowski
            Assignee: Peter Romianowski


*First problem*

In org.apache.hadoop.tools.DistCp#deleteNonexisting we serialize FileStatus objects when the path is all we need.

The performance problem comes from org.apache.hadoop.fs.RawLocalFileSystem.RawLocalFileStatus#write, which tries to retrieve file permissions by issuing an "ls -ld <path>" for each file, which is painfully slow.

Changed that to just serialize Path and not FileStatus.
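
As an illustration of the first problem (not the actual Hadoop code), here is a minimal Java sketch that uses only the JDK. The class LazyStatus, its loadPermission method, and the lookup counter are hypothetical stand-ins: loadPermission plays the role of the expensive "ls -ld <path>" lookup that is triggered when a status object is written. The sketch only shows that writing the path alone avoids one lookup per record.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration; none of these names come from Hadoop. */
final class PathOnlySerializationSketch {

  /** Counts simulated expensive lookups (stand-in for "ls -ld <path>"). */
  static final AtomicInteger expensiveLookups = new AtomicInteger();

  /** Hypothetical status record whose permission is loaded when written. */
  static final class LazyStatus {
    final String path;

    LazyStatus(String path) {
      this.path = path;
    }

    String loadPermission() {
      expensiveLookups.incrementAndGet(); // the costly step, paid per record
      return "rw-r--r--";
    }

    void write(DataOutputStream out) throws IOException {
      out.writeUTF(path);
      out.writeUTF(loadPermission());
    }
  }

  /** Serializes full status records; returns the number of bytes written. */
  static int writeStatuses(List<String> paths) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      for (String p : paths) {
        new LazyStatus(p).write(out);
      }
      out.flush();
      return buf.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Serializes only the paths; no permission lookup is triggered. */
  static int writePathsOnly(List<String> paths) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      for (String p : paths) {
        out.writeUTF(p);
      }
      out.flush();
      return buf.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList("/backup/a", "/backup/b", "/backup/c");
    writeStatuses(paths);
    System.out.println("lookups, full status: " + expensiveLookups.get());
    expensiveLookups.set(0);
    writePathsOnly(paths);
    System.out.println("lookups, path only:   " + expensiveLookups.get());
  }
}
```

In this toy model the number of lookups grows with the number of records only in the full-status variant, which is the effect the report describes for local file systems.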

*Second problem*

To delete the files we invoke the "hadoop" command line tool with option "-rmr <path>", again once for each file.

Changed that to dstfs.delete(path, true)
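
For readers unfamiliar with the idea, the change amounts to calling a recursive delete through the file system API in-process instead of routing each path through a command-line entry point. The sketch below is an analogy written against java.nio.file only, not against Hadoop's FileSystem class; the class name InProcessDeleteSketch and the helper makeSampleTree are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical illustration using the JDK; not Hadoop's FileSystem API. */
final class InProcessDeleteSketch {

  /** Recursively deletes root in-process; returns false if it did not exist. */
  static boolean deleteRecursively(Path root) {
    if (!Files.exists(root)) {
      return false;
    }
    try {
      Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
            throws IOException {
          Files.delete(file);
          return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult postVisitDirectory(Path dir, IOException exc)
            throws IOException {
          if (exc != null) {
            throw exc;
          }
          Files.delete(dir); // children are gone by now, so this succeeds
          return FileVisitResult.CONTINUE;
        }
      });
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates a small temporary tree (root/sub/f.txt) to delete. */
  static Path makeSampleTree() {
    try {
      Path root = Files.createTempDirectory("delete-sketch");
      Path sub = Files.createDirectories(root.resolve("sub"));
      Files.write(sub.resolve("f.txt"), new byte[] {1});
      return root;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path root = makeSampleTree();
    System.out.println("deleted: " + deleteRecursively(root));
    System.out.println("still exists: " + Files.exists(root));
  }
}
```

The point of the analogy is only that one API call handles a whole subtree without per-path argument parsing or command dispatch.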

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Peter Romianowski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Romianowski updated MAPREDUCE-1305:
-----------------------------------------

    Attachment: MAPREDUCE-1305.patch

We do not even need the absolute path serialized. Using NullWritable now.

Patch is against trunk, rev 891812


[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Peter Romianowski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Romianowski updated MAPREDUCE-1305:
-----------------------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804366#action_12804366 ] 

Chris Douglas commented on MAPREDUCE-1305:
------------------------------------------

The patch looks good, but this:

bq. To delete the files we invoke the "hadoop" command line tool with option "-rmr <path>". Again, for each file.

Doesn't appear to be fixed. Are you planning to address this in a separate issue?


[jira] Commented: (MAPREDUCE-1305) Running distcp with -delete incurs avoidable penalties

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833430#action_12833430 ] 

Hudson commented on MAPREDUCE-1305:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #234 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/234/])
    . Improve efficiency of distcp -delete. Contributed by Peter Romianowski


> Running distcp with -delete incurs avoidable penalties
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-1305
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1305
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 0.20.1
>            Reporter: Peter Romianowski
>            Assignee: Peter Romianowski
>             Fix For: 0.22.0
>
>         Attachments: M1305-1.patch, M1305-2.patch, MAPREDUCE-1305.patch


[jira] Updated: (MAPREDUCE-1305) Running distcp with -delete incurs avoidable penalties

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated MAPREDUCE-1305:
----------------------------------------------

    Hadoop Flags: [Reviewed]

+1 patch looks good.


[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1305:
-------------------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Peter Romianowski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791654#action_12791654 ] 

Peter Romianowski commented on MAPREDUCE-1305:
----------------------------------------------

Yes, I am using distcp against file:/// to back up some crucial files to a NAS.

The better question would be: why serialize FileStatus if the Path is all we need?


[jira] Commented: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794068#action_12794068 ] 

Hadoop QA commented on MAPREDUCE-1305:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428316/MAPREDUCE-1305.patch
  against trunk revision 893469.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/console

This message is automatically generated.


[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Peter Romianowski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Romianowski updated MAPREDUCE-1305:
-----------------------------------------

    Attachment: MAPRED-1305.patch

Patch is against trunk.


[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1305:
-------------------------------------

    Status: Open  (was: Patch Available)


[jira] Updated: (MAPREDUCE-1305) Running distcp with -delete incurs avoidable penalties

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1305:
-------------------------------------

    Summary: Running distcp with -delete incurs avoidable penalties  (was: Massive performance problem with DistCp and -delete)


[jira] Commented: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791855#action_12791855 ] 

Hadoop QA commented on MAPREDUCE-1305:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428234/MAPRED-1305.patch
  against trunk revision 891524.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/208/console

This message is automatically generated.


[jira] Commented: (MAPREDUCE-1305) Running distcp with -delete incurs avoidable penalties

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832541#action_12832541 ] 

Hadoop QA commented on MAPREDUCE-1305:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12435423/M1305-2.patch
  against trunk revision 908321.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/console

This message is automatically generated.


[jira] Commented: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831893#action_12831893 ] 

Koji Noguchi commented on MAPREDUCE-1305:
-----------------------------------------

bq. Is supporting Trash useful for DistCp users running with -delete?

To me, yes.
I've seen many of our users delete their files accidentally.
Trash has saved us a great deal of time.

I'd like to request that the Trash part stay if it does not cause much of a performance problem.


[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1305:
-------------------------------------

    Attachment: M1305-1.patch

Modified Peter's patch to remove FsShell invocations.

That part isn't actually horrible, performance-wise; it reuses the instance, so while there's certainly avoidable overhead in parsing and whatnot, it's not forking a process or anything too notable. It also supports the Trash, which may be useful/appreciated.

Is supporting Trash useful for DistCp users running with -delete?


[jira] Commented: (MAPREDUCE-1305) Running distcp with -delete incurs avoidable penalties

Posted by "Peter Romianowski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832575#action_12832575 ] 

Peter Romianowski commented on MAPREDUCE-1305:
----------------------------------------------

Thanks, Chris, for removing the calls to FsShell. I've been very busy lately, so I did not manage to compile the patch.


[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1305:
-------------------------------------

    Attachment: M1305-2.patch

*nod* This version uses Trash directly, avoiding FsShell but keeping the existing behavior.
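
As a rough illustration of the pattern described here, the sketch below removes each path in-process: it tries the trash first and falls back to a permanent delete, with no shell process spawned per file. It assumes that "existing behavior" means trash-first with delete as the fallback. `FileOps` and `InMemoryFileOps` are hypothetical stand-ins so the sketch runs without a cluster; they are not Hadoop classes and this is not the code from the attached patch. In Hadoop the corresponding calls would presumably be `Trash.moveToTrash(Path)` and `FileSystem.delete(Path, boolean)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: FileOps and InMemoryFileOps are hypothetical, not Hadoop classes. */
final class InProcessDeleteSketch {

    /** Hypothetical slice of a file system API: just the two calls the sketch needs. */
    interface FileOps {
        /** Returns false when trash is disabled or the path could not be moved. */
        boolean moveToTrash(String path);

        /** Permanently removes the path; returns false if nothing was removed. */
        boolean delete(String path, boolean recursive);
    }

    /** Toy in-memory implementation used only to make the sketch runnable. */
    static final class InMemoryFileOps implements FileOps {
        final Set<String> files = new HashSet<>();
        final List<String> trash = new ArrayList<>();
        private final boolean trashEnabled;

        InMemoryFileOps(boolean trashEnabled, String... paths) {
            this.trashEnabled = trashEnabled;
            files.addAll(Arrays.asList(paths));
        }

        @Override
        public boolean moveToTrash(String path) {
            if (!trashEnabled || !files.remove(path)) {
                return false;
            }
            trash.add(path);
            return true;
        }

        @Override
        public boolean delete(String path, boolean recursive) {
            // The toy model has no directories, so the recursive flag is ignored.
            return files.remove(path);
        }
    }

    /**
     * Removes every path through direct API calls: trash first, permanent
     * delete as the fallback. Returns the number of paths actually removed.
     */
    static int removeAll(FileOps fs, List<String> paths) {
        int removed = 0;
        for (String path : paths) {
            if (fs.moveToTrash(path) || fs.delete(path, true)) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        InMemoryFileOps fs = new InMemoryFileOps(true, "/dst/a", "/dst/b");
        int removed = removeAll(fs, Arrays.asList("/dst/a", "/dst/b"));
        System.out.println(removed + " paths removed, " + fs.trash.size() + " in trash");
    }
}
```

The point of the design is that the per-file cost becomes a method call (plus whatever RPC the file system itself needs) rather than launching the "hadoop" command line tool once per path.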




[jira] Commented: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Allen Wittenauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791642#action_12791642 ] 

Allen Wittenauer commented on MAPREDUCE-1305:
---------------------------------------------

Hmm. Why would RawLocalFileSystem be used at all?  Are you using distcp against file://?




[jira] Commented: (MAPREDUCE-1305) Running distcp with -delete incurs avoidable penalties

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833364#action_12833364 ] 

Hudson commented on MAPREDUCE-1305:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #236 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/236/])
    Improve efficiency of distcp -delete. Contributed by Peter Romianowski





[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Peter Romianowski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Romianowski updated MAPREDUCE-1305:
-----------------------------------------

    Attachment:     (was: MAPRED-1305.patch)




[jira] Updated: (MAPREDUCE-1305) Running distcp with -delete incurs avoidable penalties

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1305:
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.22.0
           Status: Resolved  (was: Patch Available)

I committed this. Thanks, Peter!




[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1305:
-------------------------------------

    Status: Patch Available  (was: Open)




[jira] Updated: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1305:
-------------------------------------

    Status: Open  (was: Patch Available)

