Posted to commits@hudi.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/06/14 00:57:00 UTC
[jira] [Updated] (HUDI-4250) Optimize Data Skipping to enable in-memory Column Stats Index
[ https://issues.apache.org/jira/browse/HUDI-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-4250:
---------------------------------
Labels: pull-request-available (was: )
> Optimize Data Skipping to enable in-memory Column Stats Index
> --------------------------------------------------------------
>
> Key: HUDI-4250
> URL: https://issues.apache.org/jira/browse/HUDI-4250
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Executing on Spark carries a non-trivial amount of overhead, so it has to offer the potential for a considerable speed-up from parallelizing the execution.
> In the case of the Data Skipping seq reading the Column Stats Index, this could only be justified for *very large* tables (100s of 1000s of files, 100s of columns).
> As such, we have to provide an alternative way of fetching the Column Stats Index within the reading process, to avoid the penalty of scheduling heavier-weight execution through the Spark engine.
> This, along with HUDI-4202, will considerably speed up Data Skipping, which currently has an overhead of *at least* 1-2s even for tables with a handful of files.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)