Posted to issues@hive.apache.org by "Aditya Shah (Jira)" <ji...@apache.org> on 2020/02/13 11:31:00 UTC

[jira] [Commented] (HIVE-21225) ACID: getAcidState() should cache a recursive dir listing locally

    [ https://issues.apache.org/jira/browse/HIVE-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036142#comment-17036142 ] 

Aditya Shah commented on HIVE-21225:
------------------------------------

[~vgumashta]


I have doubts similar to the ones [~vgarg] raised earlier. Introducing a cache that stores the whole status object of each directory is quite expensive on S3. Before this change, getAcidState() only called listStatus, which was very fast. Since the cached statuses are used just once per delta directory (after HIVE-21177) to determine RawFormat, the overhead seems very high compared to the benefit.


I evaluated two example tables: one non-partitioned table with 3 deltas and around 900 files in each delta directory, and another with 100 partitions, 40 deltas, and 45 files each. The split computation times were as follows:

 
||Table||Hive version 3.1.1||With HIVE-21177||With HIVE-21225, HIVE-22537, and HIVE-21177||
|3 deltas, 900 files|798s|1s|367s|
|100 partitions, 40 deltas, 45 files|12952s|70s|942s|

 

Am I missing something here?

> ACID: getAcidState() should cache a recursive dir listing locally
> -----------------------------------------------------------------
>
>                 Key: HIVE-21225
>                 URL: https://issues.apache.org/jira/browse/HIVE-21225
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>            Reporter: Gopal Vijayaraghavan
>            Assignee: Vaibhav Gumashta
>            Priority: Major
>             Fix For: 4.0.0
>
>         Attachments: HIVE-21225.1.patch, HIVE-21225.10.patch, HIVE-21225.11.patch, HIVE-21225.12.patch, HIVE-21225.13.patch, HIVE-21225.14.patch, HIVE-21225.15.patch, HIVE-21225.15.patch, HIVE-21225.16.patch, HIVE-21225.17.patch, HIVE-21225.2.patch, HIVE-21225.3.patch, HIVE-21225.4.patch, HIVE-21225.4.patch, HIVE-21225.5.patch, HIVE-21225.6.patch, HIVE-21225.7.patch, HIVE-21225.7.patch, HIVE-21225.8.patch, HIVE-21225.9.patch, async-pid-44-2.svg
>
>
> Currently getAcidState() makes 3 calls into the FS API, which could be answered by making a single recursive listDir call and reusing the same data to check for isRawFormat() and isValidBase().
> All delta operations for a single partition can go against a single listed directory snapshot instead of interacting with the NameNode or ObjectStore within the inner loop.
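
The idea in the issue description (list a partition directory once, then answer every later question from that snapshot) can be sketched roughly as below. This is only an illustration under stated assumptions, not the HIVE-21225 patch: it uses java.nio.file on a local directory instead of the Hadoop FileSystem API, and DirSnapshot, list, subdirsWithPrefix, filesIn, and containsFile are hypothetical names. The real isRawFormat() and isValidBase() logic is not reproduced; containsFile only stands in for "a check answered from cached data".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: one recursive listing, reused for every later check. */
final class DirSnapshot {
    private final Path root;
    // Directory -> files directly inside it, captured by the single walk.
    private final Map<Path, List<Path>> filesByDir = new TreeMap<>();

    private DirSnapshot(Path root) {
        this.root = root;
    }

    /** Walks the tree under root exactly once and records what it finds. */
    static DirSnapshot list(Path root) {
        DirSnapshot snap = new DirSnapshot(root);
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                    // Register the directory even if it turns out to be empty.
                    snap.filesByDir.computeIfAbsent(dir, d -> new ArrayList<>());
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    snap.filesByDir.computeIfAbsent(file.getParent(), d -> new ArrayList<>()).add(file);
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return snap;
    }

    /** Direct children of root whose names start with prefix, e.g. "delta_". */
    List<Path> subdirsWithPrefix(String prefix) {
        List<Path> result = new ArrayList<>();
        for (Path dir : filesByDir.keySet()) {
            if (root.equals(dir.getParent()) && dir.getFileName().toString().startsWith(prefix)) {
                result.add(dir);
            }
        }
        return result;
    }

    /** Files recorded for dir at listing time; the filesystem is not touched. */
    List<Path> filesIn(Path dir) {
        return Collections.unmodifiableList(filesByDir.getOrDefault(dir, Collections.emptyList()));
    }

    /** Stand-in for a per-directory check answered purely from the snapshot. */
    boolean containsFile(Path dir, String fileName) {
        for (Path f : filesIn(dir)) {
            if (f.getFileName().toString().equals(fileName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java DirSnapshot <partition-dir>");
            return;
        }
        DirSnapshot snap = DirSnapshot.list(Paths.get(args[0]));
        for (Path delta : snap.subdirsWithPrefix("delta_")) {
            System.out.println(delta + ": " + snap.filesIn(delta).size() + " files");
        }
    }
}
```

On HDFS or an object store, the closest real call is probably FileSystem.listFiles(path, true), which returns a RemoteIterator<LocatedFileStatus>. Whether holding all of those status objects in memory pays off is exactly the trade-off questioned in the comment above; the sketch says nothing about that cost.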



--
This message was sent by Atlassian Jira
(v8.3.4#803005)