Posted to issues@impala.apache.org by "bharath v (JIRA)" <ji...@apache.org> on 2017/08/28 20:59:00 UTC

[jira] [Resolved] (IMPALA-4847) Simplify the code for file/block metadata loading by manually calling listLocatedStatus() for each partition.

     [ https://issues.apache.org/jira/browse/IMPALA-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

bharath v resolved IMPALA-4847.
-------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0

https://gerrit.cloudera.org/#/c/7652/

> Simplify the code for file/block metadata loading by manually calling listLocatedStatus() for each partition.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-4847
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4847
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 2.8.0
>            Reporter: Alexander Behm
>            Assignee: bharath v
>            Priority: Critical
>             Fix For: Impala 2.11.0
>
>
> The fix for IMPALA-4172/IMPALA-3653 uses Hadoop's Filesystem.listFiles() API to recursively list all files under an HDFS table's parent directory. We then map each file to its corresponding partition. However, the use of listFiles() and the associated code for doing the file-to-partition mapping does not really make sense because listFiles() is just a recursive wrapper around listLocatedStatus(). So for a table with 10k partitions there will be 10k RPCs doing listLocatedStatus().
> We should simplify our code to just loop over all partitions and call listLocatedStatus(). This has the following benefits:
> * Simpler code. It would have avoided bugs like IMPALA-4789.
> * Faster code. No need to map files to partitions.
> * Easier to parallelize in the future.
> * Easier to decouple table and partition loading in the future.
> Keep in mind that for S3 tables we do want to use the listFiles() API to avoid being throttled by S3.
> Relevant links:
> https://github.com/apache/hadoop/blob/branch-2.6.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1720
> https://github.com/apache/hadoop/blob/branch-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L766
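The contrast between the two strategies can be sketched in plain Java. This is an illustrative toy, not Impala's actual catalog code: plain collections stand in for Hadoop's FileSystem API, and all names (PartitionListingSketch, fsTree, etc.) are hypothetical. The first method mirrors the proposed per-partition loop (one listLocatedStatus() per partition directory, so each file is already tied to its partition); the second mirrors the old approach (a recursive listing of the table root, analogous to listFiles(root, true), followed by the file-to-partition mapping step this JIRA removes).

```java
import java.util.*;

// Hypothetical sketch of the two listing strategies; plain collections
// stand in for Hadoop's FileSystem API.
public class PartitionListingSketch {

    // Proposed strategy: loop over partition directories and list each one
    // directly (one listLocatedStatus() call per partition). Every file is
    // already associated with its partition, so no mapping step is needed.
    static Map<String, List<String>> listPerPartition(
            Map<String, List<String>> fsTree, List<String> partitionDirs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String dir : partitionDirs) {
            result.put(dir, fsTree.getOrDefault(dir, List.of()));
        }
        return result;
    }

    // Old strategy: recursively list everything under the table root
    // (analogous to listFiles(root, true)), then map each file back to its
    // partition by prefix matching -- the extra code this change removes.
    static Map<String, List<String>> listRecursivelyThenMap(
            Map<String, List<String>> fsTree, List<String> partitionDirs) {
        List<String> allFiles = new ArrayList<>();
        for (List<String> files : fsTree.values()) allFiles.addAll(files);
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String dir : partitionDirs) result.put(dir, new ArrayList<>());
        for (String file : allFiles) {
            for (String dir : partitionDirs) {
                if (file.startsWith(dir + "/")) {
                    result.get(dir).add(file);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fsTree = new LinkedHashMap<>();
        fsTree.put("/tbl/p=1", List.of("/tbl/p=1/a.parq"));
        fsTree.put("/tbl/p=2", List.of("/tbl/p=2/b.parq", "/tbl/p=2/c.parq"));
        List<String> parts = List.of("/tbl/p=1", "/tbl/p=2");
        // Both strategies yield the same per-partition file sets; the loop
        // simply skips the recursive walk and the mapping pass.
        System.out.println(listPerPartition(fsTree, parts)
                .equals(listRecursivelyThenMap(fsTree, parts)));
    }
}
```

Note that both approaches issue the same number of listLocatedStatus() RPCs (one per directory), which is exactly why the explicit loop costs nothing extra while dropping the mapping code.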



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)