Posted to hdfs-dev@hadoop.apache.org by Todd Lipcon <to...@cloudera.com> on 2013/05/02 18:28:05 UTC

Re: DistributedFileSystem.listStatus() - Why does it do partial listings then assemble?

Hi Brad,

The reasoning is that the NameNode locking is somewhat coarse-grained. In
older versions of Hadoop, before it worked this way, we found that listing
large directories (e.g. with 100k+ files) could end up holding the NameNode's
lock for quite a long time and starving other clients.

Additionally, I believe there is a second API that does the "on-demand"
fetching of the next set of files from the listing as well, no?
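If memory serves, in the newer (2.x) FileSystem API that is
listStatusIterator(), which hands back a RemoteIterator instead of an array;
whether the batches really are fetched on demand depends on the filesystem
implementation underneath. A minimal caller-side sketch, assuming that method
is available in your version:

---------------------------------

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class IncrementalListing {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // The iterator lets the client consume the listing batch by batch;
    // hasNext()/next() may trigger another RPC to the NameNode, which is
    // why both are declared to throw IOException.
    RemoteIterator<FileStatus> it = fs.listStatusIterator(new Path(args[0]));
    while (it.hasNext()) {
      System.out.println(it.next().getPath());
    }
  }
}

---------------------------------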

As for the consistency argument, you're correct that you may have a
non-atomic view of the directory contents, but I can't think of any
applications where this would be problematic.

-Todd

On Thu, May 2, 2013 at 9:18 AM, Brad Childs <bd...@redhat.com> wrote:

> Could someone explain why the DistributedFileSystem's listStatus() method
> does a piecemeal assembly of a directory listing within the method?
>
> Is there a locking issue? What if an element is added to the directory
> during the operation? What if elements are removed?
>
> It would make sense to me that the FileSystem class listStatus() method
> returned an Iterator allowing only partial fetching/chatter as needed.  But
> I don't understand why you'd want to assemble a giant array of the listing
> chunk by chunk.
>
>
> Here's the source of the listStatus() method, and I've linked the entire
> class below.
>
>
> ---------------------------------
>
>   public FileStatus[] listStatus(Path p) throws IOException {
>     String src = getPathName(p);
>
>     // fetch the first batch of entries in the directory
>     DirectoryListing thisListing = dfs.listPaths(
>         src, HdfsFileStatus.EMPTY_NAME);
>
>     if (thisListing == null) { // the directory does not exist
>       return null;
>     }
>
>     HdfsFileStatus[] partialListing = thisListing.getPartialListing();
>     if (!thisListing.hasMore()) { // got all entries of the directory
>       FileStatus[] stats = new FileStatus[partialListing.length];
>       for (int i = 0; i < partialListing.length; i++) {
>         stats[i] = makeQualified(partialListing[i], p);
>       }
>       statistics.incrementReadOps(1);
>       return stats;
>     }
>
>     // The directory is too large to return in one batch; fetch the rest.
>     // Estimate the total number of entries in the directory.
>     int totalNumEntries =
>       partialListing.length + thisListing.getRemainingEntries();
>     ArrayList<FileStatus> listing =
>       new ArrayList<FileStatus>(totalNumEntries);
>     // add the first batch of entries to the array list
>     for (HdfsFileStatus fileStatus : partialListing) {
>       listing.add(makeQualified(fileStatus, p));
>     }
>     statistics.incrementLargeReadOps(1);
>
>     // now fetch more entries
>     do {
>       thisListing = dfs.listPaths(src, thisListing.getLastName());
>
>       if (thisListing == null) {
>         return null; // the directory is deleted
>       }
>
>       partialListing = thisListing.getPartialListing();
>       for (HdfsFileStatus fileStatus : partialListing) {
>         listing.add(makeQualified(fileStatus, p));
>       }
>       statistics.incrementLargeReadOps(1);
>     } while (thisListing.hasMore());
>
>     return listing.toArray(new FileStatus[listing.size()]);
>   }
>
> --------------------------------------------
>
>
>
>
>
> Ref:
>
> https://svn.apache.org/repos/asf/hadoop/common/tags/release-1.0.4/src/hdfs/org/apache/hadoop/hdfs/DistributedFileSystem.java
> http://docs.oracle.com/javase/6/docs/api/java/util/Iterator.html
>
>
> thanks!
>
> -bc
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: DistributedFileSystem.listStatus() - Why does it do partial listings then assemble?

Posted by Suresh Srinivas <su...@hortonworks.com>.
An additional reason: HDFS does not have a limit on the number of files in a
directory. Some clusters had millions of files in a single directory. Listing
such a directory resulted in very large responses, requiring large contiguous
memory allocation in the JVM (for the array) and unpredictable GC failures.
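
To put rough numbers on it (the figures below are illustrative guesses, not
measurements): at a few hundred bytes per entry, a single directory with ten
million files means a multi-gigabyte response and a matching contiguous array
on the client, while the batched protocol keeps each response down to the
NameNode-side listing limit (dfs.ls.limit, 1000 entries per batch by default,
if I recall the key correctly).

---------------------------------

public class ListingSizeEstimate {
  public static void main(String[] args) {
    // Hypothetical per-entry cost; the real figure depends on path lengths
    // and the wire format of the status objects.
    long bytesPerEntry = 300L;
    long entries = 10000000L;   // ten million files in one directory
    long batchSize = 1000L;     // entries returned per listing RPC

    System.out.printf("one-shot response/array: ~%d MB%n",
        entries * bytesPerEntry / (1024 * 1024));
    System.out.printf("batched response:        ~%d KB per RPC%n",
        batchSize * bytesPerEntry / 1024);
  }
}

---------------------------------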


On Thu, May 2, 2013 at 9:28 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Brad,
>
> The reasoning is that the NameNode locking is somewhat coarse-grained. In
> older versions of Hadoop, before it worked this way, we found that listing
> large directories (e.g. with 100k+ files) could end up holding the NameNode's
> lock for quite a long time and starving other clients.
>
> Additionally, I believe there is a second API that does the "on-demand"
> fetching of the next set of files from the listing as well, no?
>
> As for the consistency argument, you're correct that you may have a
> non-atomic view of the directory contents, but I can't think of any
> applications where this would be problematic.
>
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
http://hortonworks.com/download/

Re: DistributedFileSystem.listStatus() - Why does it do partial listings then assemble?

Posted by Steve Loughran <st...@hortonworks.com>.
On 2 May 2013 09:28, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Brad,
>
> The reasoning is that the NameNode locking is somewhat coarse-grained. In
> older versions of Hadoop, before it worked this way, we found that listing
> large directories (e.g. with 100k+ files) could end up holding the NameNode's
> lock for quite a long time and starving other clients.
>
> Additionally, I believe there is a second API that does the "on-demand"
> fetching of the next set of files from the listing as well, no?
>

That's in HDFS v2; it is the only incompatible change between the v1 and v2 FileSystem class.

It's chatty over the long haul and hangs against Amazon S3://, an issue for
which there's a patch to replicate, but not fix, the problem:
https://issues.apache.org/jira/browse/HADOOP-9410

It's good locally, but I think it needs test coverage for all the other
filesystem clients that ship with Hadoop.



FWIW, blobstores do tend to only support paged lists of their blobs, so the
same build-up-as-you-go-along process works there. We should spell out in
the documentation "changes that occur to the filesystem during the
generation of this list MAY not be reflected in the result, and so MAY
result in a partially incomplete or inconsistent view".
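
To show what that contract means for callers, here is a minimal sketch (the
class name and the follow-up probe call are mine, purely for illustration):
treat the returned listing as a hint rather than a snapshot, because entries
may have been created or deleted while the batches were being fetched.

---------------------------------

import java.io.FileNotFoundException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TolerantListing {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // The listing was assembled batch by batch, so it is not atomic:
    // entries created mid-listing may be missing, and entries returned
    // may already have been deleted by the time we act on them.
    for (FileStatus entry : fs.listStatus(new Path(args[0]))) {
      try {
        // Any follow-up metadata call can find that the entry vanished.
        FileStatus current = fs.getFileStatus(entry.getPath());
        System.out.println(current.getPath() + " " + current.getLen());
      } catch (FileNotFoundException gone) {
        // Deleted between the listing RPC and this call; skip it.
      }
    }
  }
}

---------------------------------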

-Steve