Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/07 06:43:24 UTC

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867313427


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in a HUDI table. However, most of them may not
+match the query predicate even after using the column statistics in the metadata table, or the row-group-level
+and page-level statistics in parquet files.
+
+The total size of the touched blocks determines the query speed, so saving IO has become
+the key to improving query performance.
+
+## <a id='background'>Background</a>
+Much work has been done to optimize reading the parquet files of a HUDI table.
+
+Since Spark 3.2.0, with the power of the parquet column index, page-level statistics can be used
+to filter data, and the reading process can be described as follows (<a id='process-a'>Process A</a>):
+- Step 1: Compare the middle position of each row group with the task split boundaries to decide which
+   row groups the current task should handle. If the middle position of a row group falls inside the
+   task split, that row group is handled by this task
+- Step 2: Use the pushed-down predicates and row-group-level column statistics to pick out the matching
+   row groups
+- Step 3: For each column predicate, filter pages by page-level statistics, producing a matched row id set
+for every column independently
+- Step 4: Combine the matched rows of all columns into the final matched row id ranges, then derive the final
+matched pages for every column
+- Step 5: Load and decompress the matched pages of every requested column
+- Step 6: Read the data within the matched row id ranges
+
+![](filter-by-page-statistics.jpg)
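The page-level pruning in Steps 3 and 4 above can be sketched as follows. This is a hand-rolled illustration, not Hudi or Parquet API: the `PageStats` class, the method names, and all min/max values and row layouts are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLevelPruning {
    /** Min/max statistics for one page, plus the row-id range the page covers. */
    static final class PageStats {
        final long min, max;         // column value bounds for this page
        final int firstRow, lastRow; // inclusive row-id range of the page
        PageStats(long min, long max, int firstRow, int lastRow) {
            this.min = min; this.max = max;
            this.firstRow = firstRow; this.lastRow = lastRow;
        }
    }

    /** Step 3: keep only pages whose [min, max] may satisfy "value == target". */
    static List<int[]> matchedRowRanges(List<PageStats> pages, long target) {
        List<int[]> ranges = new ArrayList<>();
        for (PageStats p : pages) {
            if (target >= p.min && target <= p.max) {
                ranges.add(new int[]{p.firstRow, p.lastRow});
            }
        }
        return ranges;
    }

    /** Step 4: intersect the matched row-id ranges of two column predicates. */
    static List<int[]> intersect(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        for (int[] ra : a) {
            for (int[] rb : b) {
                int lo = Math.max(ra[0], rb[0]), hi = Math.min(ra[1], rb[1]);
                if (lo <= hi) out.add(new int[]{lo, hi});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Column "a": two pages covering rows 0-99 and 100-199.
        List<PageStats> colA = List.of(
            new PageStats(1, 50, 0, 99), new PageStats(51, 120, 100, 199));
        // Column "b": same row layout, different value bounds.
        List<PageStats> colB = List.of(
            new PageStats(200, 400, 0, 99), new PageStats(401, 800, 100, 199));

        // Predicate: a == 60 AND b == 300.
        List<int[]> matchA = matchedRowRanges(colA, 60);     // pages of rows 100-199
        List<int[]> matchB = matchedRowRanges(colB, 300);    // pages of rows 0-99
        List<int[]> finalRanges = intersect(matchA, matchB); // empty: nothing to read
        System.out.println("ranges to read: " + finalRanges.size());
    }
}
```

Note how the AND of the two predicates prunes every page even though each predicate alone matched one page per column; only the intersection of the per-column row-id ranges has to be loaded in Step 5.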
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page-level statistics can greatly reduce IO cost, some irrelevant data is still read.
+
+We need a way to read exactly the rows a query requires, minimizing the number of blocks read.
+Thus, we propose a **Secondary Index** structure that reads only the rows we care about, to
+speed up query performance.
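As a toy illustration of the idea (not the proposed on-disk format: the `SecondaryIndexSketch` class, its field names, and the sample data are all invented here), a secondary index maps each column value to the exact row ids holding it, so a point lookup touches only those rows:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SecondaryIndexSketch {
    // Column value -> row ids holding that value (an inverted mapping).
    private final Map<String, List<Integer>> valueToRows = new HashMap<>();

    /** Called while writing data: record which row each value lands in. */
    public void add(String value, int rowId) {
        valueToRows.computeIfAbsent(value, v -> new java.util.ArrayList<>()).add(rowId);
    }

    /** Point lookup: return exactly the rows matching the predicate value. */
    public List<Integer> lookup(String value) {
        return valueToRows.getOrDefault(value, List.of());
    }

    public static void main(String[] args) {
        SecondaryIndexSketch idx = new SecondaryIndexSketch();
        // Index the "city" column of a 4-row file.
        idx.add("beijing", 0);
        idx.add("shanghai", 1);
        idx.add("beijing", 2);
        idx.add("shenzhen", 3);
        // WHERE city = 'beijing' reads rows 0 and 2 only, skipping the rest.
        System.out.println(idx.lookup("beijing")); // prints [0, 2]
    }
}
```

Unlike min/max pruning, which only rules pages out, this kind of lookup identifies the matching rows directly, so no irrelevant rows need to be decompressed at all.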
+
+## <a id='architecture'>Architecture</a>
+The main structure of the secondary index consists of 4 layers:
+1. SQL Parser layer: SQL commands for users to create/drop/alter/show/... secondary indexes
+2. Optimizer layer: Pick the best physical/logical plan for a query using RBO/CBO/HBO, etc.

Review Comment:
   Yes.
   We may implement the main parts of them in the first stage.


