Posted to issues@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/05/14 19:12:00 UTC

[jira] [Resolved] (IMPALA-7320) Loading HDFS tables calls getFileStatus on each partition serially

     [ https://issues.apache.org/jira/browse/IMPALA-7320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-7320.
-----------------------------------
    Resolution: Fixed

> Loading HDFS tables calls getFileStatus on each partition serially
> ------------------------------------------------------------------
>
>                 Key: IMPALA-7320
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7320
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 3.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>
> The catalog caches the access level (permissions) of each partition in an HDFS table. These are all loaded when the table is first loaded, by making serial calls to getFileStatus() on each partition. In most cases, all of the partitions are in a single directory, so we could get all of the information through a single call to listFileStatus() on the parent. In my testing, a typical getFileStatus call took 1-2 milliseconds, so on a large table with tens of thousands of partitions this can shave many seconds off the table load time as well as reduce load on the NameNode (NN).
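
To make the difference in the description concrete, here is a minimal sketch in Python. It is not Impala's catalog code and does not use the Hadoop client. FakeNameNode, load_serial and load_batched are hypothetical names, and the fake counts one RPC per call, which simplifies how a real directory listing behaves. It only illustrates replacing one status call per partition with one listing per parent directory.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class FakeNameNode:
    """Hypothetical in-memory stand-in for the HDFS NameNode; counts RPCs."""
    perms: Dict[str, str]  # directory path -> permission string
    rpc_count: int = 0

    def get_file_status(self, path: str) -> str:
        """One RPC per path, analogous to a per-partition status call."""
        self.rpc_count += 1
        return self.perms[path]

    def list_status(self, parent: str) -> Dict[str, str]:
        """One RPC returning the status of every direct child of parent."""
        self.rpc_count += 1
        prefix = parent.rstrip("/") + "/"
        return {
            path: perm
            for path, perm in self.perms.items()
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        }


def load_serial(nn: FakeNameNode, partitions: Iterable[str]) -> Dict[str, str]:
    """Serial approach: one status call for each partition directory."""
    return {p: nn.get_file_status(p) for p in partitions}


def load_batched(nn: FakeNameNode, partitions: Iterable[str]) -> Dict[str, str]:
    """Batched approach: group partitions by parent and list each parent once."""
    by_parent: Dict[str, list] = {}
    for p in partitions:
        parent = p.rsplit("/", 1)[0]
        by_parent.setdefault(parent, []).append(p)

    result: Dict[str, str] = {}
    for parent, children in by_parent.items():
        listing = nn.list_status(parent)  # one call covers all siblings
        for child in children:
            result[child] = listing[child]
    return result


if __name__ == "__main__":
    parts = [f"/warehouse/t/day={i}" for i in range(20000)]
    perms = {p: "rwxr-xr-x" for p in parts}

    serial_nn = FakeNameNode(dict(perms))
    load_serial(serial_nn, parts)
    batched_nn = FakeNameNode(dict(perms))
    load_batched(batched_nn, parts)

    print("serial RPCs:", serial_nn.rpc_count)
    print("batched RPCs:", batched_nn.rpc_count)
```

With the 1-2 millisecond per-call figure quoted above, 20,000 serial calls would add up to roughly 20 to 40 seconds of round trips. The partition count here is only an illustrative value for "tens of thousands". Partitions that live outside the common parent directory simply form their own groups and cost one extra listing each.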



--
This message was sent by Atlassian Jira
(v8.3.4#803005)