Posted to common-dev@hadoop.apache.org by "David Phillips (JIRA)" <ji...@apache.org> on 2008/10/03 23:23:44 UTC
[jira] Created: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Improve FsShell -du/-dus and FileSystem.getContentSummary efficiency
--------------------------------------------------------------------
Key: HADOOP-4339
URL: https://issues.apache.org/jira/browse/HADOOP-4339
Project: Hadoop Core
Issue Type: Bug
Components: fs
Affects Versions: 0.18.1
Reporter: David Phillips
FsShell.du has two inefficiencies:
* calling getContentSummary twice for each top-level item rather than calling it once and saving the result
* calling getContentSummary for files rather than using the size it already has in FileStatus
getContentSummary has one:
* calling itself for files rather than using the length it already has in FileStatus
Every call to getContentSummary results in a call to getFileStatus, which may be expensive (e.g. NativeS3FileSystem has both network latency and actual monetary cost).
The simple solution:
* FsShell.du calls once per item and saves the ContentSummary
* FsShell.du uses FileStatus.getLen for files
* getContentSummary only calls itself for directories
Another solution, rather than adding special casing to callers, is to add a getContentSummary that takes a FileStatus.
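For illustration, the difference between the approaches above might be sketched as follows. This is a toy model, not the Hadoop code: Status and ToyFs are hypothetical stand-ins for FileStatus and FileSystem, only the total length is computed (not a full ContentSummary), and the statusCalls counter stands for the expensive getFileStatus round trips.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: Status and ToyFs are hypothetical stand-ins, not Hadoop classes.
class ContentSummarySketch {

    /** Stand-in for FileStatus: a path, a directory flag, and a length. */
    static final class Status {
        final String path;
        final boolean dir;
        final long len;

        Status(String path, boolean dir, long len) {
            this.path = path;
            this.dir = dir;
            this.len = len;
        }
    }

    /** Toy in-memory file system that counts getFileStatus calls. */
    static final class ToyFs {
        private final Map<String, Status> entries = new HashMap<>();
        private final Map<String, List<Status>> children = new HashMap<>();
        int statusCalls = 0;

        void add(String parent, Status s) {
            entries.put(s.path, s);
            if (parent != null) {
                children.computeIfAbsent(parent, k -> new ArrayList<>()).add(s);
            }
        }

        Status getFileStatus(String path) {
            statusCalls++; // each call stands for a potentially expensive round trip
            return entries.get(path);
        }

        List<Status> listStatus(String path) {
            return children.getOrDefault(path, Collections.emptyList());
        }

        /** Behaviour the issue describes: recurse into every child, files included. */
        long lengthNaive(String path) {
            Status s = getFileStatus(path);
            if (!s.dir) {
                return s.len;
            }
            long total = 0;
            for (Status child : listStatus(path)) {
                total += lengthNaive(child.path);
            }
            return total;
        }

        /** Simple solution: recurse only into directories; files use the length in hand. */
        long lengthDirsOnly(String path) {
            Status s = getFileStatus(path);
            if (!s.dir) {
                return s.len;
            }
            long total = 0;
            for (Status child : listStatus(path)) {
                total += child.dir ? lengthDirsOnly(child.path) : child.len;
            }
            return total;
        }

        /** Alternative: accept a status the caller already holds, so no lookup is needed. */
        long lengthFromStatus(Status s) {
            if (!s.dir) {
                return s.len;
            }
            long total = 0;
            for (Status child : listStatus(s.path)) {
                total += lengthFromStatus(child);
            }
            return total;
        }
    }

    /** Builds /data containing a.txt (10), b.txt (20) and sub/c.txt (5). */
    static ToyFs sampleTree() {
        ToyFs fs = new ToyFs();
        fs.add(null, new Status("/data", true, 0));
        fs.add("/data", new Status("/data/a.txt", false, 10));
        fs.add("/data", new Status("/data/b.txt", false, 20));
        fs.add("/data", new Status("/data/sub", true, 0));
        fs.add("/data/sub", new Status("/data/sub/c.txt", false, 5));
        return fs;
    }
}
```

On the sample tree, lengthNaive("/data") makes five getFileStatus calls and lengthDirsOnly("/data") makes two; both return 35. lengthFromStatus makes none beyond whatever lookup produced the status passed in.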
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "David Phillips (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Phillips updated HADOOP-4339:
-----------------------------------
Status: Patch Available (was: Open)
[jira] Commented: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649897#action_12649897 ]
Chris Douglas commented on HADOOP-4339:
---------------------------------------
In FsShell, it makes more sense to save the length instead of the ContentSummary. The FileSystem change looks good.
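One way to read this suggestion, sketched under assumptions: if du makes one pass to size the output column and a second pass to print (the thread does not say so explicitly), it only needs to keep a long per item rather than a whole ContentSummary. Item, the dirSize callback, and the column padding below are hypothetical and not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Illustration only: Item and the dirSize callback are hypothetical stand-ins.
class DuLengthSketch {

    /** Stand-in for a listed FileStatus entry. */
    static final class Item {
        final String path;
        final boolean dir;
        final long len;

        Item(String path, boolean dir, long len) {
            this.path = path;
            this.dir = dir;
            this.len = len;
        }
    }

    /**
     * Formats du-style lines. dirSize stands for the expensive
     * getContentSummary(...).getLength() lookup and runs once per directory.
     */
    static List<String> du(List<Item> items, ToLongFunction<Item> dirSize) {
        long[] lengths = new long[items.size()];
        int width = 1;
        // Pass 1: resolve each length once and track the widest number.
        for (int i = 0; i < items.size(); i++) {
            Item item = items.get(i);
            lengths[i] = item.dir ? dirSize.applyAsLong(item) : item.len;
            width = Math.max(width, String.valueOf(lengths[i]).length());
        }
        // Pass 2: format from the saved lengths; no further lookups.
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            lines.add(String.format("%-" + (width + 2) + "d%s", lengths[i], items.get(i).path));
        }
        return lines;
    }
}
```

Keeping only the lengths means files never trigger the callback at all, and each directory triggers it exactly once regardless of how many passes the formatting needs.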
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "David Phillips (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Phillips updated HADOOP-4339:
-----------------------------------
Attachment: hadoop-fsshell-du-simple.patch
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas updated HADOOP-4339:
----------------------------------
Hadoop Flags: [Reviewed]
Status: Patch Available (was: Open)
+1 Looks good
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas updated HADOOP-4339:
----------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
I just committed this. Thanks, David.
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "David Phillips (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Phillips updated HADOOP-4339:
-----------------------------------
Attachment: simple.patch
Patch for the simple solution. This reduces calls to getFileStatus to one per directory for -du and -dus [1], as opposed to the previous one (-dus) or two (-du) per file and directory.
[1] -dus still has an extra call for the base directory due to the initial call to globStatus.
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "David Phillips (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Phillips updated HADOOP-4339:
-----------------------------------
Attachment: (was: simple.patch)
[jira] Commented: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650676#action_12650676 ]
Hudson commented on HADOOP-4339:
--------------------------------
Integrated in Hadoop-trunk #670 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/670/])
Remove redundant calls from FileSystem/FsShell when generating/processing ContentSummary. Contributed by David Phillips.
[jira] Commented: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645827#action_12645827 ]
Hadoop QA commented on HADOOP-4339:
-----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12392279/hadoop-fsshell-du-simple.patch
against trunk revision 712102.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 Eclipse classpath. The patch retains Eclipse classpath integrity.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3549/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3549/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3549/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3549/console
This message is automatically generated.
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas updated HADOOP-4339:
----------------------------------
Issue Type: Improvement (was: Bug)
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas updated HADOOP-4339:
----------------------------------
Fix Version/s: 0.20.0
Assignee: David Phillips
Status: Open (was: Patch Available)
[jira] Updated: (HADOOP-4339) Improve FsShell -du/-dus and
FileSystem.getContentSummary efficiency
Posted by "David Phillips (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Phillips updated HADOOP-4339:
-----------------------------------
Attachment: hadoop-fsshell-du-simple.patch
Good point, Chris. Patch updated.