Posted to issues@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2020/06/09 05:27:00 UTC
[jira] [Commented] (HIVE-23597)
VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete
delta directories multiple times
[ https://issues.apache.org/jira/browse/HIVE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128877#comment-17128877 ]
Rajesh Balamohan commented on HIVE-23597:
-----------------------------------------
Initial version of PR: https://github.com/apache/hive/pull/1081
The patch reduces the number of times delete delta files need to be scanned. For delta_x, it need not look at delete deltas whose write IDs are lower than its own. Also, caching the orcTail of each delete delta reduces the lookup and scan cost in cloud storage.
In a small cluster, a simple select query took *35-40 seconds* without the patch. With the patch, it takes *6-7 seconds*.
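The pruning rule above can be illustrated with a small standalone sketch. This is not the code from the PR: the class and helper names are hypothetical, directories are plain strings rather than Hadoop Path objects, and it assumes a delete delta is skipped only when its max write ID is lower than the delta's min write ID. It also counts one split per delta directory and ignores the base directories and the orcTail cache.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models directory names as strings, not Hadoop Path objects.
class DeleteDeltaPruningSketch {

    // Parses a write ID from names such as "delta_0000003_0000003_0000" or
    // "delete_delta_0000003_0000003_0000" (position 0 = min, 1 = max write ID).
    static long writeId(String dirName, String prefix, int position) {
        String[] parts = dirName.substring(prefix.length()).split("_");
        return Long.parseLong(parts[position]);
    }

    // Hypothetical helper: keeps only the delete deltas whose max write ID is
    // not lower than the min write ID of the given delta directory.
    static List<String> relevantDeleteDeltas(String deltaDir, List<String> deleteDeltaDirs) {
        long deltaMin = writeId(deltaDir, "delta_", 0);
        List<String> kept = new ArrayList<>();
        for (String deleteDeltaDir : deleteDeltaDirs) {
            if (writeId(deleteDeltaDir, "delete_delta_", 1) >= deltaMin) {
                kept.add(deleteDeltaDir);
            }
        }
        return kept;
    }

    // Returns {reads without pruning, reads with pruning} for write IDs first..last,
    // assuming one delta and one delete delta directory per write ID.
    static int[] countReads(int first, int last) {
        List<String> deltas = new ArrayList<>();
        List<String> deleteDeltas = new ArrayList<>();
        for (int id = first; id <= last; id++) {
            deltas.add(String.format("delta_%07d_%07d_0000", id, id));
            deleteDeltas.add(String.format("delete_delta_%07d_%07d_0000", id, id));
        }
        int unpruned = 0;
        int pruned = 0;
        for (String delta : deltas) {
            unpruned += deleteDeltas.size();                       // every split reads every delete delta
            pruned += relevantDeleteDeltas(delta, deleteDeltas).size();
        }
        return new int[] {unpruned, pruned};
    }

    public static void main(String[] args) {
        int[] reads = countReads(3, 13);                           // layout from the issue description
        System.out.println("reads without pruning: " + reads[0]); // 121
        System.out.println("reads with pruning: " + reads[1]);    // 66
    }
}
```

Under these assumptions, the layout in the issue description (write IDs 3 through 13) goes from 11 x 11 = 121 delete delta directory reads to 66.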
> VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete delta directories multiple times
> ------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-23597
> URL: https://issues.apache.org/jira/browse/HIVE-23597
> Project: Hive
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Priority: Major
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1562]
> {code:java}
> try {
>   final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
>   if (deleteDeltaDirs.length > 0) {
>     int totalDeleteEventCount = 0;
>     for (Path deleteDeltaDir : deleteDeltaDirs) {
> {code}
>
> Consider a directory layout like the following. This was created by running a simple set of "insert --> update --> select" queries.
>
> {noformat}
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000001
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000002
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000003_0000003_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000004_0000004_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000005_0000005_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000006_0000006_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000007_0000007_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000008_0000008_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000009_0000009_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000010_0000010_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000011_0000011_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000012_0000012_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000013_0000013_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000003_0000003_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000004_0000004_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000005_0000005_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000006_0000006_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000007_0000007_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000008_0000008_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000009_0000009_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000010_0000010_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000011_0000011_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000012_0000012_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000013_0000013_0000
> {noformat}
>
> Each OrcSplit contains all the delete delta folder information. For a directory layout like this, it would create {{~12 splits}}. For every split, it constructs a "ColumnizedDeleteEventRegistry" in VectorizedOrcAcidRowBatchReader and ends up reading all these delete delta folders multiple times.
> In this case, it would read them approximately {{121 times!}}.
> This causes a huge delay in running simple queries like "{{select * from tab_x}}" on cloud storage.
>