Posted to commits@hudi.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/06/14 00:57:00 UTC

[jira] [Updated] (HUDI-4250) Optimize Data Skipping to enable in-memory Column Stats Index

     [ https://issues.apache.org/jira/browse/HUDI-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4250:
---------------------------------
    Labels: pull-request-available  (was: )

> Optimize Data Skipping to enable in-memory Column Stats Index 
> --------------------------------------------------------------
>
>                 Key: HUDI-4250
>                 URL: https://issues.apache.org/jira/browse/HUDI-4250
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.12.0
>
>
> Executing on Spark carries a non-trivial amount of overhead, so it is only worthwhile when parallelizing the execution offers the potential for a considerable speed-up.
> In the case of Data Skipping reading the Column Stats Index, that overhead can only be justified for *very large* tables (100s of 1000s of files, 100s of columns).
> As such, we have to provide an alternative way of fetching the Column Stats Index within the reading process, to avoid the penalty of scheduling a more heavy-weight execution through the Spark engine.
> This, along with HUDI-4202, will make it possible to considerably speed up Data Skipping, which currently has an overhead of *at least* 1-2s even for tables with a handful of files.
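
The trade-off described above can be illustrated with a minimal sketch. It is not Hudi code: the names `TableShape`, `choose_read_mode`, and `IN_MEMORY_MAX_CELLS`, as well as the threshold value, are hypothetical and chosen only to show the idea of reading a small index in-process and falling back to the engine for very large tables.

```python
from dataclasses import dataclass

# Hypothetical cut-off (assumption, not a Hudi setting): below this many
# (file, column) entries the index is assumed small enough to read in-process.
IN_MEMORY_MAX_CELLS = 100_000 * 100


@dataclass(frozen=True)
class TableShape:
    """Rough size of a Column Stats Index: one entry per (file, column)."""
    num_files: int
    num_indexed_columns: int


def choose_read_mode(shape: TableShape, max_cells: int = IN_MEMORY_MAX_CELLS) -> str:
    """Return "in-memory" for small indexes and "engine" for very large ones.

    The rule is illustrative: it only compares the number of index entries
    against a fixed threshold.
    """
    cells = shape.num_files * shape.num_indexed_columns
    return "in-memory" if cells <= max_cells else "engine"


if __name__ == "__main__":
    # A table with a handful of files avoids the engine scheduling overhead.
    print(choose_read_mode(TableShape(num_files=10, num_indexed_columns=20)))
    # A table with 100s of 1000s of files and 100s of columns uses the engine.
    print(choose_read_mode(TableShape(num_files=500_000, num_indexed_columns=300)))
```

In this sketch, the decision depends only on the index size; a real implementation would presumably expose the threshold as configuration and might weigh other factors as well.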



--
This message was sent by Atlassian Jira
(v8.20.7#820007)