Posted to dev@hudi.apache.org by Pratyaksh Sharma <pr...@gmail.com> on 2022/08/10 16:25:52 UTC

[DISCUSS]: Integrate column stats index with all query engines

Hello community,

With the introduction of the multi-modal index in Hudi, there is a lot of scope
for improvement on the querying side. There are two major ways of reducing
the amount of data scanned at query time: partition pruning and file pruning.
With the latest developments in the community, partition pruning is
supported for commonly used query engines like Spark, Presto and Hive, but file
pruning using the column stats index is only supported for Spark and Flink.
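
To make the idea concrete, here is a minimal sketch of file pruning driven by
per-file min/max column statistics. It is only an illustration under stated
assumptions: ColumnStats and prune_files are hypothetical names, not Hudi APIs,
and the filter is assumed to be a simple closed range on one integer column.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnStats:
    """Min/max of one column in one data file (hypothetical shape, not Hudi's schema)."""
    file_name: str
    min_value: int
    max_value: int


def prune_files(stats: List[ColumnStats], lower: int, upper: int) -> List[str]:
    """Return the files whose [min, max] range can overlap lower <= col <= upper.

    Files whose range lies entirely outside the filter cannot contain matching
    rows, so the query engine can skip reading them.
    """
    return [
        s.file_name
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Made-up statistics for three files of the same table.
stats = [
    ColumnStats("f1.parquet", 0, 99),
    ColumnStats("f2.parquet", 100, 199),
    ColumnStats("f3.parquet", 200, 299),
]

# Filter: col BETWEEN 150 AND 220, so f1.parquet can be skipped.
candidates = prune_files(stats, 150, 220)
print(candidates)
```

Partition pruning works at a coarser level by discarding whole partitions
first; a check like the one above would then apply to the files that remain.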

We intend to support data skipping for the rest of the engines as well,
which include Hive, Presto and Trino. I have written a draft RFC here:
https://github.com/apache/hudi/pull/6345.

Please take a look and let me know what you think. Once we have some
feedback from the community, we can decide on the next steps.

Re: [DISCUSS]: Integrate column stats index with all query engines

Posted by Pratyaksh Sharma <pr...@gmail.com>.

Sure, we can work together once we get some feedback on the RFC, Meng!

On Thu, Aug 11, 2022 at 9:32 AM 1037817390 <me...@qq.com.invalid>
wrote:

> +1 for this.
> It would be better to provide some filter converters to facilitate the
> integration of each engine,
> e.g. a converter from the Presto domain to the Hudi domain.
>
> I have already finished the first version of data skipping/partition
> pruning/filter pushdown for Presto:
>
> https://github.com/xiarixiaoyao/presto/commit/800646608d4b88799de0addcddd97d03592954ce
>
> Maybe we can work together.

Re: [DISCUSS]: Integrate column stats index with all query engines

Posted by 1037817390 <me...@qq.com.INVALID>.

+1 for this.
It would be better to provide some filter converters to facilitate the integration of each engine,
e.g. a converter from the Presto domain to the Hudi domain.

I have already finished the first version of data skipping/partition pruning/filter pushdown for Presto:
https://github.com/xiarixiaoyao/presto/commit/800646608d4b88799de0addcddd97d03592954ce

Maybe we can work together.
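
As a rough sketch of the filter converter idea mentioned above: each engine
would translate its own predicate representation into one engine-neutral form
that a shared index lookup could consume. Every name below (EngineDomain,
NeutralRange, to_neutral) is hypothetical and chosen only for illustration;
these are not the actual Presto or Hudi classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineDomain:
    """Stand-in for an engine-side range predicate on one column (hypothetical)."""
    column: str
    low: Optional[int]   # None means unbounded below
    high: Optional[int]  # None means unbounded above


@dataclass(frozen=True)
class NeutralRange:
    """Stand-in for an engine-neutral filter with explicit bounds (hypothetical)."""
    column: str
    low: float
    high: float


def to_neutral(domain: EngineDomain) -> NeutralRange:
    """Convert the engine-specific predicate into the neutral form.

    Missing bounds are mapped to -inf / +inf so downstream code can compare
    against file-level min/max values without special cases.
    """
    low = float("-inf") if domain.low is None else float(domain.low)
    high = float("inf") if domain.high is None else float(domain.high)
    return NeutralRange(domain.column, low, high)


# Example: the engine pushed down "price >= 10" with no upper bound.
neutral = to_neutral(EngineDomain("price", 10, None))
print(neutral)
```

With one such converter per engine, the pruning logic itself would only need
to be written once against the neutral form.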







孟涛
mengtao0326@qq.com

------------------ Original Message ------------------
From: "dev" <vinoth@apache.org>;
Sent: Thursday, August 11, 2022, 12:11 PM
To: "dev" <dev@hudi.apache.org>;

Subject: Re: [DISCUSS]: Integrate column stats index with all query engines

+1 for this.

Suggested new reviewers on the RFC.
https://github.com/apache/hudi/pull/6345/files#r943073339


Re: [DISCUSS]: Integrate column stats index with all query engines

Posted by Vinoth Chandar <vi...@apache.org>.

+1 for this.

Suggested new reviewers on the RFC.
https://github.com/apache/hudi/pull/6345/files#r943073339

On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma <pr...@gmail.com>
wrote:

> Hello community,
>
> With the introduction of the multi-modal index in Hudi, there is a lot of scope
> for improvement on the querying side. There are two major ways of reducing
> the amount of data scanned at query time: partition pruning and file pruning.
> With the latest developments in the community, partition pruning is
> supported for commonly used query engines like Spark, Presto and Hive, but file
> pruning using the column stats index is only supported for Spark and Flink.
>
> We intend to support data skipping for the rest of the engines as well,
> which include Hive, Presto and Trino. I have written a draft RFC here:
> https://github.com/apache/hudi/pull/6345.
>
> Please take a look and let me know what you think. Once we have some
> feedback from the community, we can decide on the next steps.
>