Posted to common-dev@hadoop.apache.org by "Konstantin Shvachko (JIRA)" <ji...@apache.org> on 2006/03/14 00:54:03 UTC

[jira] Created: (HADOOP-79) listFiles optimization

listFiles optimization
----------------------

         Key: HADOOP-79
         URL: http://issues.apache.org/jira/browse/HADOOP-79
     Project: Hadoop
        Type: Improvement
  Components: dfs  
    Reporter: Konstantin Shvachko


In FSDirectory.getListing() looking at line
listing[i] = new DFSFileInfo(curName, cur.computeFileLength(), cur.computeContentsLength(), isDir(curName));

1. computeContentsLength() itself calls computeFileLength(), so computeFileLength() runs twice
and the file length is calculated twice.
2. isDir() searches for the INode (starting from the rootDir) that was already obtained
just two lines above; note that the tree is locked by that time.

I propose a simple optimization for this, see attachment.

3. A related question: why does DFSFileInfo need two separate fields, len for the file length
and contentsLen for the directory contents size? These fields look mutually exclusive,
so we could use just one and interpret it one way or the other depending on the value of isDir.
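To make the redundancy concrete, here is a hypothetical, much-simplified sketch. The INode and DFSFileInfo shapes below are illustrative, not the real Hadoop 0.1 classes; the point is the proposed pattern of resolving the node once and deriving length, contents size, and isDir from the node in hand instead of re-walking the tree:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the namespace tree.
class INode {
    final String name;
    final long length;            // file length in bytes; 0 for a directory
    final List<INode> children;   // null for a file, non-null for a directory

    INode(String name, long length, boolean isDir) {
        this.name = name;
        this.length = length;
        this.children = isDir ? new ArrayList<INode>() : null;
    }

    boolean isDir() { return children != null; }

    long computeFileLength() { return isDir() ? 0 : length; }

    // For a directory: total bytes of everything underneath it.
    long computeContentsLength() {
        if (!isDir()) return computeFileLength();
        long sum = 0;
        for (INode child : children) sum += child.computeContentsLength();
        return sum;
    }
}

// Illustrative stand-in for the listing record.
class DFSFileInfo {
    final String path;
    final long len;          // file length
    final long contentsLen;  // directory contents size
    final boolean isDir;

    DFSFileInfo(String path, long len, long contentsLen, boolean isDir) {
        this.path = path; this.len = len;
        this.contentsLen = contentsLen; this.isDir = isDir;
    }
}

public class ListingSketch {
    // The proposed shape: one tree lookup per entry, each length computed once.
    static DFSFileInfo toInfo(INode cur, String curName) {
        boolean dir = cur.isDir();                  // reuse the node already in hand
        long fileLen = dir ? 0 : cur.computeFileLength();
        long contentsLen = dir ? cur.computeContentsLength() : fileLen;
        return new DFSFileInfo(curName, fileLen, contentsLen, dir);
    }

    public static void main(String[] args) {
        INode dir = new INode("/data", 0, true);
        INode f = new INode("/data/a.txt", 1024, false);
        dir.children.add(f);
        DFSFileInfo di = toInfo(dir, "/data");
        DFSFileInfo fi = toInfo(f, "/data/a.txt");
        System.out.println(di.contentsLen + " " + fi.len);  // prints "1024 1024"
    }
}
```

The original line instead passed curName back into isDir(), forcing a second walk from rootDir while the tree lock was already held.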

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Resolved: (HADOOP-79) listFiles optimization

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-79?page=all ]
     
Doug Cutting resolved HADOOP-79:
--------------------------------

    Fix Version: 0.1
     Resolution: Fixed
      Assign To: Konstantin Shvachko

This looks fine to me.  I simplified FSDirectory.isDir() a bit more & committed this.

Did you find this to be a bottleneck in benchmarks?  BTW, I have had some success profiling Hadoop daemons using Sun's built-in sampling profiler.  I simply set HADOOP_OPTS to  '-agentlib:hprof=cpu=samples,interval=20' before starting a daemon.  Then, when I stop that daemon, it dumps profile data to a text file.
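For reference, the workflow described above might look like this; the daemon script name is an assumption (illustrative only), and the hprof agent has since been removed from modern JDKs:

```shell
# Illustrative session based on the description above.
export HADOOP_OPTS='-agentlib:hprof=cpu=samples,interval=20'
bin/hadoop-daemon.sh start namenode
# ... run a workload against the daemon ...
bin/hadoop-daemon.sh stop namenode
# On JVM exit, hprof dumps its sample counts to java.hprof.txt
# in the daemon's working directory.
```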

And, finally, yes, DFSFileInfo could re-use the length field for both purposes.  But this class is only used for interchange, right?  So making it smaller will only make RPCs a bit faster and won't save much memory.



[jira] Updated: (HADOOP-79) listFiles optimization

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-79?page=all ]

Konstantin Shvachko updated HADOOP-79:
--------------------------------------

    Attachment: DFSFileInfo.patch




[jira] Commented: (HADOOP-79) listFiles optimization

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-79?page=comments#action_12370447 ] 

Konstantin Shvachko commented on HADOOP-79:
-------------------------------------------

No, it was not really a bottleneck.
Interesting about the profiling. Is it JMX or something else?
Yes, DFSFileInfo is for reporting only, so it does not save space.
It might save some code though, since with one field you
probably don't need two different functions to extract it.
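A hypothetical sketch of that merged-field idea (class and accessor names here are illustrative, not the actual Hadoop API): one length field, interpreted according to isDir, read through a single accessor instead of separate getLen()/getContentsLen() methods.

```java
// Illustrative single-length-field variant of the listing record.
public class SlimFileInfo {
    private final String path;
    private final boolean isDir;
    // File length if !isDir, total contents size if isDir.
    private final long length;

    public SlimFileInfo(String path, boolean isDir, long length) {
        this.path = path;
        this.isDir = isDir;
        this.length = length;
    }

    public String getPath() { return path; }

    public boolean isDir() { return isDir; }

    // Single accessor: callers read "the size of this entry" either way.
    public long getLen() { return length; }

    public static void main(String[] args) {
        SlimFileInfo file = new SlimFileInfo("/a.txt", false, 2048);
        SlimFileInfo dir  = new SlimFileInfo("/data", true, 4096);
        System.out.println(file.getLen() + " " + dir.getLen());  // prints "2048 4096"
    }
}
```

Besides shrinking the serialized record slightly, this removes one field and one accessor from every caller's mental model, at the cost of making the field's meaning depend on isDir.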

