Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/12/04 10:24:00 UTC

[jira] [Commented] (HDFS-13616) Batch listing of multiple directories

    [ https://issues.apache.org/jira/browse/HDFS-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642970#comment-17642970 ] 

ASF GitHub Bot commented on HDFS-13616:
---------------------------------------

fanlinqian commented on PR #1725:
URL: https://github.com/apache/hadoop/pull/1725#issuecomment-1336370086

   Hello, I encountered a bug when using the batch listing method: when I list a directory containing more than 1000 files, each file's data block having 2 replicas, only the first 500 files of the directory are returned and then iteration stops. I think the fix belongs in the getBatchedListing() method of hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java, as follows.
    for (; srcsIndex < srcs.length; srcsIndex++) {
        String src = srcs[srcsIndex];
        HdfsPartialListing listing;
        try {
            DirectoryListing dirListing =
                getListingInt(dir, pc, src, indexStartAfter, needLocation);
            if (dirListing == null) {
                throw new FileNotFoundException("Path " + src + " does not exist");
            }
            listing = new HdfsPartialListing(srcsIndex,
                Lists.newArrayList(dirListing.getPartialListing()));
            numEntries += listing.getPartialListing().size();
            lastListing = dirListing;
        } catch (Exception e) {
            if (e instanceof AccessControlException) {
                logAuditEvent(false, operationName, src);
            }
            listing = new HdfsPartialListing(srcsIndex,
                new RemoteException(e.getClass().getCanonicalName(), e.getMessage()));
            lastListing = null;
            LOG.info("Exception listing src {}", src, e);
        }
        listings.put(srcsIndex, listing);

        // My modification: stop as soon as this directory's listing was
        // truncated, so the client can resume it on the next batched call
        if (lastListing != null && lastListing.getRemainingEntries() != 0) {
            break;
        }

        if (indexStartAfter.length != 0) {
            indexStartAfter = new byte[0];
        }
        // Terminate if we've reached the maximum listing size
        if (numEntries >= dir.getListLimit()) {
            break;
        }
    }
   The root cause of this bug is that getListingInt(dir, pc, src, indexStartAfter, needLocation) limits its result by both the number of files in the directory and the number of data blocks and replicas of those files, so a single call can return fewer entries than the directory holds. The loop in getBatchedListing(), however, only breaks out early when the accumulated number of entries reaches the listing limit (1000), so a per-directory listing truncated by the block/replica budget is silently treated as complete.
   Looking forward to your reply.
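   To make the failure mode concrete, below is a small self-contained Java model (not the HDFS source; BatchedListingSketch, LIST_LIMIT, PER_CALL_CAP, listOnce and collectAll are made-up stand-ins) of a loop that, like getBatchedListing(), accumulates per-directory pages but only breaks when the overall entry count reaches the limit. With a 1200-file directory whose pages are truncated at 500 entries, the loop without the remaining-entries check drops 700 files; breaking on truncation lets the client resume and recover all of them.

    import java.util.Arrays;

    // Hypothetical stand-alone model (not HDFS code): LIST_LIMIT plays the
    // role of dir.getListLimit(); PER_CALL_CAP stands in for the
    // block/location budget that truncates one getListingInt() result early.
    public class BatchedListingSketch {
        static final int LIST_LIMIT = 1000;
        static final int PER_CALL_CAP = 500;

        // Models one per-directory listing call: returns {entriesReturned,
        // entriesRemaining} for a directory of 'total' files, resuming
        // after the first 'startAfter' entries.
        static int[] listOnce(int total, int startAfter) {
            int returned = Math.min(PER_CALL_CAP, total - startAfter);
            return new int[] { returned, total - startAfter - returned };
        }

        // Issues "batched RPCs" until every directory is consumed and
        // returns the total number of entries the client ever sees.
        public static int collectAll(int[] dirSizes, boolean breakOnRemaining) {
            int total = 0;
            int srcsIndex = 0;
            int startAfter = 0;  // resume cookie within the current directory
            while (srcsIndex < dirSizes.length) {
                int numEntries = 0;  // entries in this one batched RPC
                for (; srcsIndex < dirSizes.length; srcsIndex++) {
                    int[] page = listOnce(dirSizes[srcsIndex], startAfter);
                    numEntries += page[0];
                    total += page[0];
                    startAfter += page[0];
                    if (breakOnRemaining && page[1] != 0) {
                        break;  // the fix: resume this directory next RPC
                    }
                    startAfter = 0;  // directory treated as fully listed
                    if (numEntries >= LIST_LIMIT) {
                        srcsIndex++;  // this directory is done; advance
                        break;
                    }
                }
            }
            return total;
        }

        public static void main(String[] args) {
            // One directory with 1200 files, pages truncated at 500 entries:
            System.out.println(collectAll(new int[] { 1200 }, false));  // prints 500
            System.out.println(collectAll(new int[] { 1200 }, true));   // prints 1200
        }
    }

   In the buggy variant the inner loop advances past the truncated directory because 500 < LIST_LIMIT, exactly as described above; with the early break the resume cookie is preserved and two further calls fetch the remaining 700 entries.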




> Batch listing of multiple directories
> -------------------------------------
>
>                 Key: HDFS-13616
>                 URL: https://issues.apache.org/jira/browse/HDFS-13616
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 3.2.0
>            Reporter: Andrew Wang
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: BenchmarkListFiles.java, HDFS-13616.001.patch, HDFS-13616.002.patch
>
>
> One of the dominant workloads for external metadata services is listing of partition directories. This can end up being bottlenecked on RTT time when partition directories contain a small number of files. This is fairly common, since fine-grained partitioning is used for partition pruning by the query engines.
> A batched listing API that takes multiple paths amortizes the RTT cost. Initial benchmarks show a 10-20x improvement in metadata loading performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
