You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2010/04/03 02:14:27 UTC

[jira] Created: (HADOOP-6678) Propose some changes to FileContext

Propose some changes to FileContext
-----------------------------------

                 Key: HADOOP-6678
                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs
            Reporter: Hairong Kuang
             Fix For: 0.21.0, 0.22.0


# Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
# Remove methods isFile(Path), isDirectory(Path), and exists.
All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
{code}
  FileContext fc = ..;
  if (fc.exists(path)) {
    if (fc.isFile(path)) {
     ...
    } else {
    ...
    }
  }
{code}
The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
{code}
  FileContext fc = ...;
  FileStatus fstatus = fc.getFileStatus(path);
  if (fstatus.isFile() {
    ...
  } else {
    ...
  }
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HADOOP-6678:
--------------------------------

    Summary: Remove FileContext#isFile, isDirectory and exists  (was: Propose some changes to FileContext)

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-6678:
----------------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]  (was: [Incompatible change, Reviewed])
      Resolution: Fixed

I've just committed this. Thanks, Eli!

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch, hadoop-6678-2.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854795#action_12854795 ] 

Eli Collins commented on HADOOP-6678:
-------------------------------------

Right, the patch requires HADOOP-6689 and HADOOP-6563, sorry for the noise.

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861645#action_12861645 ] 

Eli Collins commented on HADOOP-6678:
-------------------------------------

Hey Hairong,  no worries - looking at the implementation of FileContext#exists I think it can be simplified:

{code}
  public boolean exists(final Path f) throws AccessControlException,  IOException {
    try {
      return getFileStatus(f) != null;            
    } catch (FileNotFoundException e) {
      return false;
    } catch (UnsupportedFileSystemException e) {
      return false;
    } catch (UnresolvedLinkException e) {
      return false;
    }
  }
{code}

# AFS#getFileStatus should never return null, eg see Hdfs#getFileStatus. Per the spec if should throw a FileNotFoundException if f does not exist.
# Seems like the client should have to deal with UnsupportedFileSystemException rather than have FC#exists swallow it
# UnresolvedLinkException will never be thrown (see HADOOP-6727)

Therefore I think we can write FC#exists as

{code}
  public boolean exists(final Path f) throws AccessControlException, UnsupportedFileSystemException,  IOException {
    try {
      FileStatus fs =  getFileStatus(f);
      assert fs != null;
      return true;
    } catch (FileNotFoundException e) {
      return false;
    }
  }
{code}

Since this is still not a one-liner putting it in FileContext#util seems reasonable to me.  I'll send out new patches.



> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861989#action_12861989 ] 

Hadoop QA commented on HADOOP-6678:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12443114/hadoop-6678-2.patch
  against trunk revision 939026.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 24 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/console

This message is automatically generated.

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch, hadoop-6678-2.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-6678) Propose some changes to FileContext

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang reassigned HADOOP-6678:
-------------------------------------

    Assignee: Eli Collins

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856225#action_12856225 ] 

Eli Collins commented on HADOOP-6678:
-------------------------------------

Thanks for the review Hairong. Could you check out HDFS-1089 as well?  It's simple, just removes uses of FileContext#isFile, isDirectory and exists from HDFS, which needs to be checked in before this patch. 

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853022#action_12853022 ] 

Eli Collins commented on HADOOP-6678:
-------------------------------------

Accidentally submitted the last comment..

+1 

I cold do part 2 in lieu of HADOOP-6585 (Add FileStatus#isDirectory and isFile) and HDFS-995, MR-1535 (Replace usage of FileStatus#isDir()) which I just started working on today.

Thanks,
Eli




> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853558#action_12853558 ] 

Hairong Kuang commented on HADOOP-6678:
---------------------------------------

> I cold do part 2 in lieu of HADOOP-6585 (Add FileStatus#isDirectory and isFile) and HDFS-995, MR-1535.
Eli, thanks for volunteering to do part 2. Your help would be great!

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861942#action_12861942 ] 

Hairong Kuang commented on HADOOP-6678:
---------------------------------------

+1. The patch looks good to me.

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch, hadoop-6678-2.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6678) Propose some changes to FileContext

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HADOOP-6678:
--------------------------------

    Attachment: hadoop-6678-1.patch

Here's a patch for part 2, removes FileContext#isFile, isDirectory, and exists. 

Notes:
* We no longer call fixRelativePart on the path passed to getFileStatus since that method already calls fixRelativePart on the argument (so this redundant call is now gone)
* deleteOnExit now throws the FileNotFoundException like delete rather than return false 
* Removed UnresolvedLinkException from the throws clause of the public FileContext APIs since they'll never throw this exception (FileContext resolves links). 
* Patch applies to trunk plus HADOOP-6689 and HADOOP-6563

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6678) Propose some changes to FileContext

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HADOOP-6678:
--------------------------------

    Status: Patch Available  (was: Open)

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853465#action_12853465 ] 

Hairong Kuang commented on HADOOP-6678:
---------------------------------------

We thought about client side stat cache too. Actually right before I filed the jira, we went through the two options one more time. Although I am excited about the cache solution, we feel that the proposed solution for #2 is less controversy, straightforward, and also exposes more information to for users to write more efficient code. 

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855211#action_12855211 ] 

Hairong Kuang commented on HADOOP-6678:
---------------------------------------

+1 on Eli's patch.

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-6678:
----------------------------------

          Status: Patch Available  (was: Open)
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Incompatible change])

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch, hadoop-6678-2.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854789#action_12854789 ] 

Eli Collins commented on HADOOP-6678:
-------------------------------------

Forgot to mention, this patch also adds FileStatus#isFile, currently it just returns !isDir so that it's clear this change is mostly just changing syntax. Further FileStatus changes will be made in HADOOP-6585.

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853468#action_12853468 ] 

Todd Lipcon commented on HADOOP-6678:
-------------------------------------

OK, sounds good. Thanks for the info, Hairong.

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-6678:
----------------------------------

    Status: Open  (was: Patch Available)

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch, hadoop-6678-2.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854792#action_12854792 ] 

Hadoop QA commented on HADOOP-6678:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12441111/hadoop-6678-1.patch
  against trunk revision 931226.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 21 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/448/console

This message is automatically generated.

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HADOOP-6678:
--------------------------------

    Attachment: hadoop-6678-2.patch

Thanks Hairong. Patch attached.
* Removes FC#isFile and FC#isDirectory, and uses thereof.
* Refactors FC#exists per previous comment and moves to Util
* Refactors FC#checkDest. Note that it is now passed overwrite instead of false. It (and it's callers) needs unit tests. Filed HADOOP-6730 for that.
* Adds Path#getName tests to TestPath to sanity check my understanding of FC#checkDest.
* Modifies test classes to use the new helper methods

Still needs to be applied with HDFS-1089 (easy 4 line patch). ant test-core passes.

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch, hadoop-6678-2.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853024#action_12853024 ] 

Todd Lipcon commented on HADOOP-6678:
-------------------------------------

The other option for #2 is a stat cache, no? Of course we're trading off one set of problems for another, but I think for a lot of use cases a 3 second stat cache (a la linux nfs default) is totally reasonable, so long as it's easily disabled.

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861169#action_12861169 ] 

Hairong Kuang commented on HADOOP-6678:
---------------------------------------

Hi Eli, sorry for the delay for getting back to this issue. I was on vacation & worked on a cluster emergency after I came back.

But I got some time to think more about this proposed FileContext change. I am quite concerned about removing FileContext#exists. This is a useful often-used method that's not easy for a user to write. Do you think if it is a good idea that we still keep this method but put it in FileContext#Util with a comment saying that "be cautious about using this method." and list a few scenarios that could avoid calling this method? 

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Remove FileContext#isFile, isDirectory and exists

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861863#action_12861863 ] 

Hairong Kuang commented on HADOOP-6678:
---------------------------------------

> I think we can write FC#exists as..
+1. Thank you, Eli!

> Remove FileContext#isFile, isDirectory and exists
> -------------------------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>            Assignee: Eli Collins
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855178#action_12855178 ] 

Hairong Kuang commented on HADOOP-6678:
---------------------------------------

Thanks Eli for working on part 2. I will review it.

Since Eli submitted the patch to part 2 to this jira, I created HADOOP-6692 for the listStatus change.

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hadoop-6678-1.patch
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6678) Propose some changes to FileContext

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853019#action_12853019 ] 

Eli Collins commented on HADOOP-6678:
-------------------------------------

+1

HADOOP-6585 (Add FileStatus#isDirectory and isFile)

(Replace usage of FileStatus#isDir())

> Propose some changes to FileContext
> -----------------------------------
>
>                 Key: HADOOP-6678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6678
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Hairong Kuang
>             Fix For: 0.21.0, 0.22.0
>
>
> # Add a method  Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
> # Remove methods isFile(Path), isDirectory(Path), and exists.
> All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below: 
> {code}
>   FileContext fc = ..;
>   if (fc.exists(path)) {
>     if (fc.isFile(path)) {
>      ...
>     } else {
>     ...
>     }
>   }
> {code}
> The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:
> {code}
>   FileContext fc = ...;
>   FileStatus fstatus = fc.getFileStatus(path);
>   if (fstatus.isFile() {
>     ...
>   } else {
>     ...
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.