You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/27 11:15:45 UTC

[GitHub] [hudi] huberylee opened a new pull request, #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

huberylee opened a new pull request, #5370:
URL: https://github.com/apache/hudi/pull/5370

   ## What is the purpose of the pull request
   
   RFC for Introduce Secondary Index to Improve HUDI Query Performance
   
   ## Brief change log
   
     - Modify rfc/README.md
     - Add rfc/rfc-52 dir
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001340777


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)
+
+// Drop index
+boolean dropIndex(HoodieTable table, List<Column> columns)
+
+...
+```
+
+### <a id='imple-index-layer'>Index Implementation Layer</a>
+The role of the secondary index is to provide a mapping from a column or column combination value to 
+specified rows, so that it is convenient to find the result row that meets the requirements according to 
+this index during query, so as to obtain the final data rows.
+
+#### <a id='impl-index-layer-kv-mapping'>KV Mapping</a>
+In mapping 'column value->row', we can use rowId or primary key(RecordKey) to identify one unique row.
+Considering the memory saving and the efficiency of row set merging, we choose rowId. 
+Cause row id of all columns is aligned in row group, we can get row data by row id directly. 
+
+#### <a id='impl-index-layer-build-index'>Build Index</a>
+**trigger time**
+
+When one column's secondary index enabled, we need to build index for it automatically. Index building may
+consume a lot of CPU and IO resources. So, build index while compaction/clustering executing is a good solution, 
+after table service is serviced, writing and index construction can be better decouple to avoid impact on 
+write performance.
+
+Because we decouple the index definition and the index building process, users may not be able to benefit from it
+immediately when they create the index util the next compaction/clustering is triggered and completed.
+
+Also, we need to support a manual way to trigger and monitor index building, SQL CMD needs to be developed,
+such as 'optimize table t1', 'show indexing t1', etc.
+
+**index file**
+- A: build index only for base file
+- B: build index only for log file
+- C: build index for both base file and log file
+
+We prefer plan A right now, the main purpose of this proposal is to save base file IO cost based on the 
+assumption that base file has lots of records.
+
+One index file will be generated for each base file, containing one or more columns of index data.
+The index structure of each column is the mapping of column values to specific rows.
+
+Considering that there are too many index files, we prefer to store multi-column index data in one file instead of 
+one index file per column

Review Comment:
   > Should we consider storing secondary index in metadata table, to improve the index reading? Then you can also leverage the [Async Indexer](https://hudi.apache.org/docs/metadata_indexing) to build an index asynchronously, without implementing the index building again.
   
   Some types of secondary indexes have their own file formats, and there may be random reads in the process of use, so storing secondary index in metadata table is not suitable.
   
   Building secondary index in compaction/async indexer is under consideration.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1112001594

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341",
       "triggerID" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f8863fd21c46ff0cc5e422ac323d36a252125895",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8360",
       "triggerID" : "f8863fd21c46ff0cc5e422ac323d36a252125895",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * 4989c96be4d484c5db17e442a3f2f87d43a7be69 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341) 
   * f8863fd21c46ff0cc5e422ac323d36a252125895 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8360) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867298053


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc

Review Comment:
   This is also the part of the work of RFC?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867314464


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,

Review Comment:
   It will be many differences between Record Level Index and Secondary Index.
   Record Level Index is mainly a statistic info to reduce file level IO cost, such as column statistics info in read path.
   Secondary Index is a precise index to get accurate rows for query, it also can be used in `tagLocation` in write path.
   
   It is more appropriate to use different indexes for different scenarios.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1110509369

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 07f2f122f1fe0fe8c0097af16e3c1772fa84dabc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197) 
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * 0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339) 
   * 4989c96be4d484c5db17e442a3f2f87d43a7be69 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1103530681

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * df32ef8792075fff6f08820820f5f68c62f8415d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867313517


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index

Review Comment:
   We can support Spark SQL first, and then expand to Flink, etc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] prasannarajaperumal commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

prasannarajaperumal commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r972656384


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to

Review Comment:
   I am not sure why index is dependent on read or write path
   Record level index is a primary index and what you propose here is secondary index. 
   We should use the primary index in the read path when the filter is a pointed query for a uuid. 
   `select c1,c2 from t where _row_key='uuid1'`
   We should use the secondary index in the write path when the filter is on the index.
   `update c1=5 from t where c2 = 20`



##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  

Review Comment:
   We can introduce a separate section in the RFC to talk about the implementation is going to be in phases. But here I would want us to think generically how this would plugin into the Optimization layer of Spark/Flink. 
   
   My thoughts around this. 
   RBO for index can be very misleading for query performance - especially when combined with column level stats IO skipping. We can do a simple hint based approach to always use specific index when available else implement it using CBO. 
   
   The way I am thinking for [cascades](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.98.9460) based optimizer is to generate a memo of equivalent query fragments (with direct scans, using all the indexes possible) with cost and run it by the cost based optimizer to pick the right plan. 
   
   What do you think?



##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index

Review Comment:
   Best to call this out in the scope of the RFC



##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)

Review Comment:
   All usages of index are scan pruning. We should define a standard API for scan pruning that is generic enough for all forms of pruning (min/max, index) can work. I am thinking more like 
   `HoodieTableScan pruneScan(HoodieTable table, HoodieTableScan scan, List<Predicate> columnPredicates`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1378870663

   > What is the state of this issue? Any plans to merge RFC to mark it under active development or there are significant changes expected?
   
   Most of the development work is over, some PRs have been merged and another are under review. 
   
   Subtasks:
   - [x] https://github.com/apache/hudi/pull/5761
   - [x] https://github.com/apache/hudi/pull/5894
   - [x] https://github.com/apache/hudi/pull/5933
   - [ ] https://github.com/apache/hudi/pull/6677
   - [ ] https://github.com/apache/hudi/pull/6712


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1103934925

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51b4f79677fb36f66698edccb5205270faaf2696 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173) 
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 4bde7abd738d375a17b7d1df92248227127db21b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1103872990

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * df32ef8792075fff6f08820820f5f68c62f8415d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166) 
   * 51b4f79677fb36f66698edccb5205270faaf2696 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1110480236

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 07f2f122f1fe0fe8c0097af16e3c1772fa84dabc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197) 
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867314870


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)

Review Comment:
   It depends on the low-level index implementation.
   
   Now， Lucene supports building index incrementally.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1110490291

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 07f2f122f1fe0fe8c0097af16e3c1772fa84dabc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197) 
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * 0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1286370430

   > Great start. Overall I also think we need to think about the abstraction API more carefully here.
   
   The current implementation provides an abstract framework upon which we can easily extend other types of secondary indexes, and this document is a little out of date, I will update it later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867299186


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)
+
+// Drop index
+boolean dropIndex(HoodieTable table, List<Column> columns)
+
+...
+```
+
+### <a id='imple-index-layer'>Index Implementation Layer</a>
+The role of the secondary index is to provide a mapping from a column or column combination value to 
+specified rows, so that it is convenient to find the result row that meets the requirements according to 
+this index during query, so as to obtain the final data rows.
+
+#### <a id='impl-index-layer-kv-mapping'>KV Mapping</a>
+In mapping 'column value->row', we can use rowId or primary key(RecordKey) to identify one unique row.
+Considering the memory saving and the efficiency of row set merging, we choose rowId. 
+Cause row id of all columns is aligned in row group, we can get row data by row id directly. 
+
+#### <a id='impl-index-layer-build-index'>Build Index</a>
+**trigger time**
+
+When one column's secondary index enabled, we need to build index for it automatically. Index building may
+consume a lot of CPU and IO resources. So, build index while compaction/clustering executing is a good solution, 

Review Comment:
   if writing to hudi table without updating index, can it guarantee the correctness of query?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1139509280

   It was accidentally turned off，new PR: https://github.com/apache/hudi/pull/5704


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1111942199

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341",
       "triggerID" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f8863fd21c46ff0cc5e422ac323d36a252125895",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "f8863fd21c46ff0cc5e422ac323d36a252125895",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * 4989c96be4d484c5db17e442a3f2f87d43a7be69 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341) 
   * f8863fd21c46ff0cc5e422ac323d36a252125895 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1104911902

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 4bde7abd738d375a17b7d1df92248227127db21b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175) 
   * 07f2f122f1fe0fe8c0097af16e3c1772fa84dabc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] yihua commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

yihua commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001154957


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)
+
+// Drop index
+boolean dropIndex(HoodieTable table, List<Column> columns)
+
+...
+```
+
+### <a id='imple-index-layer'>Index Implementation Layer</a>
+The role of the secondary index is to provide a mapping from a column or column combination value to 
+specified rows, so that it is convenient to find the result row that meets the requirements according to 
+this index during query, so as to obtain the final data rows.
+
+#### <a id='impl-index-layer-kv-mapping'>KV Mapping</a>
+In mapping 'column value->row', we can use rowId or primary key(RecordKey) to identify one unique row.
+Considering the memory saving and the efficiency of row set merging, we choose rowId. 
+Cause row id of all columns is aligned in row group, we can get row data by row id directly. 
+
+#### <a id='impl-index-layer-build-index'>Build Index</a>
+**trigger time**
+
+When one column's secondary index enabled, we need to build index for it automatically. Index building may
+consume a lot of CPU and IO resources. So, build index while compaction/clustering executing is a good solution, 
+after table service is serviced, writing and index construction can be better decouple to avoid impact on 
+write performance.
+
+Because we decouple the index definition and the index building process, users may not be able to benefit from it
+immediately when they create the index util the next compaction/clustering is triggered and completed.
+
+Also, we need to support a manual way to trigger and monitor index building, SQL CMD needs to be developed,
+such as 'optimize table t1', 'show indexing t1', etc.
+
+**index file**
+- A: build index only for base file
+- B: build index only for log file
+- C: build index for both base file and log file
+
+We prefer plan A right now, the main purpose of this proposal is to save base file IO cost based on the 
+assumption that base file has lots of records.
+
+One index file will be generated for each base file, containing one or more columns of index data.
+The index structure of each column is the mapping of column values to specific rows.
+
+Considering that there are too many index files, we prefer to store multi-column index data in one file instead of 
+one index file per column

Review Comment:
   Should we consider storing secondary index in metadata table, to improve the index reading?  Then you can also leverage the [Async Indexer](https://hudi.apache.org/docs/metadata_indexing) to build an index asynchronously, without implementing the index building again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867298794


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)

Review Comment:
   here can only build index incrementally?



##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)

Review Comment:
   here can build index incrementally?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867315947


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)
+
+// Drop index
+boolean dropIndex(HoodieTable table, List<Column> columns)
+
+...
+```
+
+### <a id='imple-index-layer'>Index Implementation Layer</a>
+The role of the secondary index is to provide a mapping from a column or column combination value to 
+specified rows, so that it is convenient to find the result row that meets the requirements according to 
+this index during query, so as to obtain the final data rows.
+
+#### <a id='impl-index-layer-kv-mapping'>KV Mapping</a>
+In mapping 'column value->row', we can use rowId or primary key(RecordKey) to identify one unique row.
+Considering the memory saving and the efficiency of row set merging, we choose rowId. 
+Cause row id of all columns is aligned in row group, we can get row data by row id directly. 
+
+#### <a id='impl-index-layer-build-index'>Build Index</a>
+**trigger time**
+
+When one column's secondary index enabled, we need to build index for it automatically. Index building may
+consume a lot of CPU and IO resources. So, build index while compaction/clustering executing is a good solution, 

Review Comment:
   The main logic for query remains the same as now. 
   
   The difference is  in the base file, we may read out 10 rows before(filter them later and get 3 rows finally), but now we may read out 3 rows exactly needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867299711


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)
+
+// Drop index
+boolean dropIndex(HoodieTable table, List<Column> columns)
+
+...
+```
+
+### <a id='imple-index-layer'>Index Implementation Layer</a>
+The role of the secondary index is to provide a mapping from a column or column combination value to 
+specified rows, so that it is convenient to find the result row that meets the requirements according to 
+this index during query, so as to obtain the final data rows.
+
+#### <a id='impl-index-layer-kv-mapping'>KV Mapping</a>
+In mapping 'column value->row', we can use rowId or primary key(RecordKey) to identify one unique row.
+Considering the memory saving and the efficiency of row set merging, we choose rowId. 
+Cause row id of all columns is aligned in row group, we can get row data by row id directly. 
+
+#### <a id='impl-index-layer-build-index'>Build Index</a>
+**trigger time**
+
+When one column's secondary index enabled, we need to build index for it automatically. Index building may
+consume a lot of CPU and IO resources. So, build index while compaction/clustering executing is a good solution, 
+after table service is serviced, writing and index construction can be better decouple to avoid impact on 
+write performance.
+
+Because we decouple the index definition and the index building process, users may not be able to benefit from it
+immediately when they create the index util the next compaction/clustering is triggered and completed.
+
+Also, we need to support a manual way to trigger and monitor index building, SQL CMD needs to be developed,
+such as 'optimize table t1', 'show indexing t1', etc.
+
+**index file**
+- A: build index only for base file
+- B: build index only for log file
+- C: build index for both base file and log file
+
+We prefer plan A right now, the main purpose of this proposal is to save base file IO cost based on the 
+assumption that base file has lots of records.
+
+One index file will be generated for each base file, containing one or more columns of index data.

Review Comment:
   if there are large number of base file, the number of index file would be also very large, the cost of reading/opening index file would be large.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867313696


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index

Review Comment:
   > Here we would only use Spark SQL to manage secondary index?
   
   We can support Spark SQL first, and then expand to Flink, etc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867314652


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 

Review Comment:
   Yes, but here is just an example.
   We will do more experiments to get more accurate values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867313655


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index

Review Comment:
   They can be merged, we divided them into 2 layers for better introduction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1185210561

   > @huberylee are you still actively driving this one
   
   Yes, it's already under development


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001325640


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to

Review Comment:
   Read or write path here means record level index and secondary index are used in different scenarios. Record level index mainly be used with ``tagLocation`` when writing data into a hudi table, but secondary index for filter data when querying with predicates, including normal queries and update/delete with predicates.
   In hoodie, primary index is a logical constraint to ensure the uniqueness of records, and no pk index data exists, so it is not suitable for pointed query. Besides, when the query condition is not a primary key column, secondary index can also be used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1103532673

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * df32ef8792075fff6f08820820f5f68c62f8415d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] kazdy commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

kazdy commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1380683733

   Hi @huberylee there's "Explore other execution engines/runtimes (Ray, native Rust, Python)" on hudi roadmap, it seems like secondary index will be using Lucene (at least one implementation), so when using non-JVM client it will not be possible to maintain this index? I would like to understand how it will look like in this case, I guess the simplest would be not to use this this type of index?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1110540685

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341",
       "triggerID" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * 0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339) 
   * 4989c96be4d484c5db17e442a3f2f87d43a7be69 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r860623314


##########
rfc/README.md:
##########
@@ -86,3 +86,4 @@ The list of all RFCs can be found here.
 | 48 | [LogCompaction for MOR tables](./rfc-48/rfc-48.md) | `UNDER REVIEW` | 
 | 49 | [Support sync with DataHub](./rfc-49/rfc-49.md)    | `ONGOING` |
 | 50 | [Improve Timeline Server](./rfc-50/rfc-50.md) | `UNDER REVIEW` | 
+| 52 | [Support Lucene Based Record Level Index](./rfc-52/rfc-52.md) | `UNDER REVIEW` |

Review Comment:
   PR for claim RFC 52 has been merged, I will rebase master later. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001336571


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  

Review Comment:
   > We can introduce a separate section in the RFC to talk about the implementation is going to be in phases. But here I would want us to think generically how this would plugin into the Optimization layer of Spark/Flink.
   > 
   > My thoughts around this. RBO for index can be very misleading for query performance - especially when combined with column level stats IO skipping. We can do a simple hint based approach to always use specific index when available else implement it using CBO.
   > 
   > The way I am thinking for [cascades](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.98.9460) based optimizer is to generate a memo of equivalent query fragments (with direct scans, using all the indexes possible) with cost and run it by the cost based optimizer to pick the right plan.
   > 
   > What do you think?
   
   OK, we will introduce a separate section to talk about the implementation.
   
   Column level statistics are currently separate from the implementation of secondary indexes, and both of their implementations intrude on spark, maybe we should unify the entrance of hudi index.
   
   Introduce a hint to control the use of indexes may work if the query pattern is known in advance, it is desirable to be able to automatically optimize the query plan based on the index.
   
   CBO is a good place to explore, and we need to do a lot more before we get there, including providing NDV value, cost model, and so on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001338212


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)

Review Comment:
   > All usages of index are scan pruning. We should define a standard API for scan pruning that is generic enough for all forms of pruning (min/max, index) can work. I am thinking more like `HoodieTableScan pruneScan(HoodieTable table, HoodieTableScan scan, List<Predicate> columnPredicates`
   
   Great idea! Maybe we should provide a specific format called ``HoodieFormat`` and responding ``Reader/Writer`` to hide the read, write and filter logic of the underlying format.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867298654


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 

Review Comment:
   here 10 is an experience nuber?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867298456


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,

Review Comment:
   So it is totally different from Record Level Index? such as the data layout and data structure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] colagy commented on pull request #5370: [HUDI-3907][RFC-52] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by "colagy (via GitHub)" <gi...@apache.org>.

colagy commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1715079514

   Can I search from hudi using some keywords just like elasticsearch, beaucse of the lucene secondary index?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] yandooo commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

yandooo commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1378664743

   What is the state of this issue? Any plans to merge RFC to mark it under active development or there are significant changes expected?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] yihua commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

yihua commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r860511812


##########
rfc/README.md:
##########
@@ -86,3 +86,4 @@ The list of all RFCs can be found here.
 | 48 | [LogCompaction for MOR tables](./rfc-48/rfc-48.md) | `UNDER REVIEW` | 
 | 49 | [Support sync with DataHub](./rfc-49/rfc-49.md)    | `ONGOING` |
 | 50 | [Improve Timeline Server](./rfc-50/rfc-50.md) | `UNDER REVIEW` | 
+| 52 | [Support Lucene Based Record Level Index](./rfc-52/rfc-52.md) | `UNDER REVIEW` |

Review Comment:
   Could you create an individual PR to pick the RFC number, to follow the [Hudi RFC process](https://hudi.apache.org/contribute/rfc-process)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867298255


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index

Review Comment:
   5 and 4 can be merged?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867313427


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc

Review Comment:
   Yes
   We may implement the main part of them at the first stage



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1104852749

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 4bde7abd738d375a17b7d1df92248227127db21b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175) 
   * 07f2f122f1fe0fe8c0097af16e3c1772fa84dabc UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1103869820

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * df32ef8792075fff6f08820820f5f68c62f8415d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166) 
   * 51b4f79677fb36f66698edccb5205270faaf2696 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1103882248

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51b4f79677fb36f66698edccb5205270faaf2696 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173) 
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 4bde7abd738d375a17b7d1df92248227127db21b UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1104026649

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 4bde7abd738d375a17b7d1df92248227127db21b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867298187


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index

Review Comment:
   Here we would only use Spark SQL to manage secondary index?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] leesf commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

leesf commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867298090


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers

Review Comment:
   typo 4 -> 5 layers?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hujincalrin commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hujincalrin commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r867315606


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in HUDI table. However, most of them may not
+match the query predicate even after using column statistic info in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to save IO has become
+the key point to improving query performance.
+
+## <a id='background'>Background</a>
+Many works have been carried out to optimize reading HUDI table parquet file.
+
+Since Spark 3.2.0, with the power of parquet column index, page level statistics info can be used
+to filter data, and the process of reading data can be described as follows(<a id='process-a'>Process A</a>):
+- Step1: Comparing the inclusion relation of row group data's middle position and task split info
+   to decided which row groups should be handled by current task. If the row group data's middle
+   position is contained by task split, the row group should be handled by this task
+- Step2: Using pushed down predicates and row group level column statistics info to pick out matched
+   row groups
+- Step 3: Filtering page by page level statistics for each column predicates, then get matched row id set
+for every column independently
+- Step 4: Getting final matched row id ranges by combining all column matched rows, then get final matched
+pages for every column
+- Step 5: Loading and uncompressing matched pages for every requested columns
+- Step 6: Reading data by matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page level statistics can greatly save IO cost, there is still some irrelevant data be read out.
+
+We may need a way to get exactly row data we need to minimize the amount of reading blocks.
+Thus, we propose a **Secondary Index** structure to only read the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of secondary index contains 4 layers
+1. SQL Parser layer: SQL command for user to create/drop/alter/show/..., for managing secondary index
+2. Optimizer layer: Pick up the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for upper-layer to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary Index implementations for users to choice, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer:  provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing secondary index, let's take a look at Record Level Index. Both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in write path.
+Secondary index structure will be used for query acceleration in read path, but not in write path.
+
+If Record Level Index is applied in read path for query with RecordKey predicate, it can only filter at file group level,
+while secondary index could provide the exact matched set of rows.
+
+For more details about current implementation of record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index related SQL(Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For the convenience of implementation, we can implement the first phase based on RBO(rule-based optimizer),  
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO in several ways, for example, SQL with more than 10 predicates does not push down 
+to use secondary index, but uses the existing scanning logic. It may be a cost way to use too many
+predicates indexes to get row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows, and subsequent index types(e.g., Hbase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)
+
+// Create index
+boolean createIndex(HoodieTable table, List<Column> columns, List<IndexType> indexTypes)
+
+// Build index for the specified table
+boolean buildIndex(HoodieTable table, InstantTime instant)
+
+// Drop index
+boolean dropIndex(HoodieTable table, List<Column> columns)
+
+...
+```
+
+### <a id='imple-index-layer'>Index Implementation Layer</a>
+The role of the secondary index is to provide a mapping from a column or column combination value to 
+specified rows, so that it is convenient to find the result row that meets the requirements according to 
+this index during query, so as to obtain the final data rows.
+
+#### <a id='impl-index-layer-kv-mapping'>KV Mapping</a>
+In mapping 'column value->row', we can use rowId or primary key(RecordKey) to identify one unique row.
+Considering the memory saving and the efficiency of row set merging, we choose rowId. 
+Cause row id of all columns is aligned in row group, we can get row data by row id directly. 
+
+#### <a id='impl-index-layer-build-index'>Build Index</a>
+**trigger time**
+
+When one column's secondary index enabled, we need to build index for it automatically. Index building may
+consume a lot of CPU and IO resources. So, build index while compaction/clustering executing is a good solution, 
+after table service is serviced, writing and index construction can be better decouple to avoid impact on 
+write performance.
+
+Because we decouple the index definition and the index building process, users may not be able to benefit from it
+immediately when they create the index util the next compaction/clustering is triggered and completed.
+
+Also, we need to support a manual way to trigger and monitor index building, SQL CMD needs to be developed,
+such as 'optimize table t1', 'show indexing t1', etc.
+
+**index file**
+- A: build index only for base file
+- B: build index only for log file
+- C: build index for both base file and log file
+
+We prefer plan A right now, the main purpose of this proposal is to save base file IO cost based on the 
+assumption that base file has lots of records.
+
+One index file will be generated for each base file, containing one or more columns of index data.

Review Comment:
   This is a good suggestion，using an index itself has an overhead.
   
   So we need to treat it carefully whether we need to use the index, whether we need to use them all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1103771989

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * df32ef8792075fff6f08820820f5f68c62f8415d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1103879229

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * df32ef8792075fff6f08820820f5f68c62f8415d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166) 
   * 51b4f79677fb36f66698edccb5205270faaf2696 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173) 
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [WIP][RFC][HUDI-3907] RFC for Lucene Based Record Level Index

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1104925906

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 07f2f122f1fe0fe8c0097af16e3c1772fa84dabc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1112117453

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341",
       "triggerID" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f8863fd21c46ff0cc5e422ac323d36a252125895",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8360",
       "triggerID" : "f8863fd21c46ff0cc5e422ac323d36a252125895",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * f8863fd21c46ff0cc5e422ac323d36a252125895 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8360) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1110593036

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341",
       "triggerID" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * 4989c96be4d484c5db17e442a3f2f87d43a7be69 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8341) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1110482063

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * 07f2f122f1fe0fe8c0097af16e3c1772fa84dabc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197) 
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * 0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1110512852

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8166",
       "triggerID" : "df32ef8792075fff6f08820820f5f68c62f8415d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8173",
       "triggerID" : "51b4f79677fb36f66698edccb5205270faaf2696",
       "triggerType" : "PUSH"
     }, {
       "hash" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "95ca56ff2e76f43017333195df1c40b4cfa3aa0a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8175",
       "triggerID" : "4bde7abd738d375a17b7d1df92248227127db21b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8197",
       "triggerID" : "07f2f122f1fe0fe8c0097af16e3c1772fa84dabc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb0c0c4323ffbf605def455ded35130b6ed39500",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339",
       "triggerID" : "0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4989c96be4d484c5db17e442a3f2f87d43a7be69",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 95ca56ff2e76f43017333195df1c40b4cfa3aa0a UNKNOWN
   * bb0c0c4323ffbf605def455ded35130b6ed39500 UNKNOWN
   * 0376db30f4a84a9344a3d1d6b7c0bc9e8824bc2e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8339) 
   * 4989c96be4d484c5db17e442a3f2f87d43a7be69 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] vinothchandar commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1185198919

   @huberylee are you still actively driving this one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org