
Posted to dev@hudi.apache.org by 孟涛 <me...@qq.com.INVALID> on 2022/04/18 03:41:12 UTC

Re: [DISCUSS] hudi index improve

+1, it will be a great feature for Hudi.
Indexes are very important for speeding up queries, and we are also trying to add index support for Trino on Hudi; maybe we can work together. Looking forward to the design documents.
Some minor questions:
1. Do we need to consider concurrent operations?
2. Do we want to use metaTable to store index information?






------------------ Original Message ------------------
From: "dev" <forwardxu315@gmail.com>;
Sent: Monday, April 18, 2022, 11:18 AM
To: "dev" <dev@hudi.apache.org>;

Subject: [DISCUSS] hudi index improve



Hi All,

I want to improve Hudi's index. There are four main steps to achieve this:

1. Implement index syntax
    a. Implement index syntax for Spark SQL [1]; I have submitted the
first PR.
    b. Implement index syntax for PrestoDB SQL
    c. Implement index syntax for Trino SQL

2. Read/write index decoupling
Index reads and writes are decoupled from the compute engine, so the SQL
index syntax from step 1 can be executed independently and called through
an API.

3. Build an index service

Advance the implementation of the Hudi service framework, including the
index service, metastore service [2], compaction/clustering service [3], etc.

4. Index management
There are two kinds of management semantics for an index:

   - Automatic Refresh
   - Manual Refresh


   1. Automatic Refresh

When a user creates an index on the main table without the WITH DEFERRED
REFRESH syntax, the index is managed by the system automatically. Every
data load to the main table immediately triggers a load to the index.
The two loads (to the main table and to the index) are executed
transactionally: either both succeed or neither does.

The data loading to index is incremental, avoiding an expensive total
refresh.
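
To make the intended semantics concrete, here is a minimal in-memory sketch in Python. It is only a toy model, not Hudi code: the class IndexedTable and its load method are hypothetical names, and a real implementation would rely on Hudi's own commit mechanism rather than the snapshot-and-restore used here.

```python
import copy


class IndexedTable:
    """Toy main table with one automatically refreshed index (hypothetical)."""

    def __init__(self, index_column):
        self.index_column = index_column
        self.rows = []    # main table data
        self.index = {}   # indexed value -> list of row positions

    def load(self, batch):
        """Append a batch to the table and to the index, all or nothing."""
        # Remember the state before the load so a failure can be rolled back.
        rows_before = len(self.rows)
        index_before = copy.deepcopy(self.index)
        try:
            for row in batch:
                position = len(self.rows)
                self.rows.append(row)
                # Incremental: only the newly loaded rows touch the index.
                key = row[self.index_column]  # KeyError if the column is missing
                self.index.setdefault(key, []).append(position)
        except Exception:
            # Neither the table nor the index keeps any part of the batch.
            del self.rows[rows_before:]
            self.index = index_before
            raise
```

A batch that fails halfway (for example, a row without the indexed column) leaves both the table and the index exactly as they were before the load.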

If a user runs any of the following commands on the main table, the system
rejects the operation and returns a failure:


   - Data management commands: UPDATE/DELETE.
   - Schema management commands: ALTER TABLE DROP COLUMN, ALTER TABLE CHANGE
   DATATYPE, ALTER TABLE RENAME. Adding a new column is supported. For
   DROP COLUMN and CHANGE DATATYPE, Hudi checks whether the change affects
   the index table; if it does not, the operation is allowed, otherwise it
   is rejected by throwing an exception.
   - Partition management commands: ALTER TABLE ADD/DROP PARTITION.

If a user does want to perform one of the above operations on the main
table, the user can first drop the index, perform the operation, and then
re-create the index.
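
The rejection rules above could be sketched as a single check, shown here in Python. This is only an illustration under my own assumptions: the command labels, IndexConflictError, and check_command are hypothetical and are not part of any Hudi API.

```python
# Hypothetical command labels used only for this sketch.
ALWAYS_REJECTED = {
    "UPDATE", "DELETE",                # data management
    "RENAME TABLE",                    # schema management
    "ADD PARTITION", "DROP PARTITION"  # partition management
}
COLUMN_CHECKED = {"DROP COLUMN", "CHANGE DATATYPE"}


class IndexConflictError(Exception):
    """Raised when a command is rejected because an index exists."""


def check_command(command, indexed_columns, column=None):
    """Raise IndexConflictError if the command must be rejected.

    indexed_columns is the set of columns covered by an automatically
    refreshed index; an empty set means the index has been dropped.
    """
    if not indexed_columns:
        return  # no index on the table: nothing to protect
    if command in ALWAYS_REJECTED:
        raise IndexConflictError(f"{command} is not allowed on an indexed table")
    if command in COLUMN_CHECKED and column in indexed_columns:
        raise IndexConflictError(f"{command} on '{column}' would affect the index")
    # ADD COLUMN, and column changes that do not touch the index, pass through.
```

Calling check_command with an empty set of indexed columns models the workaround of dropping the index first: every command is then allowed.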

If a user drops the main table, the index will be dropped immediately too.

We recommend this management mode for indexing.

   2. Manual Refresh

When a user creates an index on the main table with the WITH DEFERRED
REFRESH syntax, the index is created in the disabled state, and queries
will NOT use it until the user issues a REFRESH INDEX command to build the
index. Every REFRESH INDEX command triggers a full refresh of the index.
Once the refresh finishes, the system changes the index status to enabled
so that it can be used in query rewrite.

Every new data load, update, or delete disables the related index, so
subsequent queries will not benefit from the index until it is enabled
again.
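
The manual-refresh lifecycle amounts to a small state machine. The Python sketch below is a toy model under my own assumptions (DeferredIndex, on_table_change, and lookup are hypothetical names, not Hudi APIs). It shows the index starting disabled, being fully rebuilt and enabled by a refresh, and being disabled again by any change to the table.

```python
class DeferredIndex:
    """Toy model of an index created WITH DEFERRED REFRESH (hypothetical)."""

    def __init__(self, column):
        self.column = column
        self.enabled = False  # created disabled; queries must not use it yet
        self.entries = {}     # indexed value -> list of row positions

    def refresh(self, rows):
        """Model of REFRESH INDEX: full rebuild, then mark as enabled."""
        rebuilt = {}
        for position, row in enumerate(rows):
            rebuilt.setdefault(row[self.column], []).append(position)
        self.entries = rebuilt
        self.enabled = True

    def on_table_change(self):
        """Any load, update, or delete on the main table disables the index."""
        self.enabled = False


def lookup(index, rows, value):
    """Use the index only while it is enabled; otherwise scan the table."""
    if index.enabled:
        return [rows[p] for p in index.entries.get(value, [])]
    return [row for row in rows if row[index.column] == value]
```

Because lookup falls back to a full scan while the index is disabled, query results stay correct between a data change and the next refresh; only the speed-up is lost.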

If the main table is dropped by the user, the related index will be dropped
immediately.



Any feedback is welcome!

Thank you.

Regards,
Forward Xu

Related Links:
[1] Implement index syntax for spark sql
<https://issues.apache.org/jira/browse/HUDI-3881>
[2] Metastore service <https://github.com/apache/hudi/pull/5064>
[3] Compaction/clustering job in Service
<https://github.com/apache/hudi/pull/4872>

Re: [DISCUSS] hudi index improve

Posted by Shiyan Xu <xu...@gmail.com>.

+1 great initiative.

Please also support Trino. Todd Gao is working on Trino/Presto native
connectors. We should align the plan going from there. Looking forward to
the RFC.

On Mon, Apr 18, 2022 at 11:41 AM 孟涛 <me...@qq.com.invalid> wrote:

> +1, it will be a great feature for Hudi.
> Indexes are very important for speeding up queries, and we are also trying
> to add index support for Trino on Hudi; maybe we can work together. Looking
> forward to the design documents.
> Some minor questions:
> 1. Do we need to consider concurrent operations?
> 2. Do we want to use metaTable to store index information?

-- 
Best,
Shiyan