Posted to commits@hudi.apache.org by "Raymond Xu (Jira)" <ji...@apache.org> on 2022/09/19 16:02:00 UTC
[jira] [Updated] (HUDI-1157) Optimization whether to query Bootstrapped table using HoodieBootstrapRelation vs Sparks Parquet datasource
[ https://issues.apache.org/jira/browse/HUDI-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu updated HUDI-1157:
-----------------------------
Sprint: (was: 2022/09/19)
> Optimization whether to query Bootstrapped table using HoodieBootstrapRelation vs Sparks Parquet datasource
> -----------------------------------------------------------------------------------------------------------
>
> Key: HUDI-1157
> URL: https://issues.apache.org/jira/browse/HUDI-1157
> Project: Apache Hudi
> Issue Type: Task
> Components: bootstrap
> Reporter: Udit Mehrotra
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.13.0
>
>
> This has been discussed in [https://github.com/apache/hudi/pull/1702#discussion_r466317612]
> As of now, when querying using *DataSource*, we check whether the table has been bootstrapped by the presence of the *bootstrap base path* in the *hoodie.properties* file, and based on that we query the table using either *HoodieBootstrapRelation* or the *Spark Parquet Data Source*. However, there could be a scenario where all the files in the originally bootstrapped table have either been *upserted/deleted*, and thus have been fully bootstrapped, with their data moved over to the target hoodie table. For such tables, we can start querying them using the *Spark Parquet Data Source*, which will be faster with all of Spark's optimizations.
> So, basically, we need a way to check whether all of the files have been fully bootstrapped and moved over to the target location.
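The check described above can be sketched roughly as follows. This is only an illustration of the decision logic, not Hudi code: `FileGroupState`, `is_fully_bootstrapped`, `choose_reader`, and the returned labels are all hypothetical names, and the sketch assumes that for each file group we can tell whether its latest base file still depends on a source file in the original table.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FileGroupState:
    """Hypothetical summary of the latest file slice in one file group."""
    file_id: str
    # Source file in the original table that the latest base file still
    # depends on; None once an upsert/delete has rewritten the data into
    # the target table.
    bootstrap_source_path: Optional[str]


def is_fully_bootstrapped(file_groups: Iterable[FileGroupState]) -> bool:
    """True when no latest file slice still points at a bootstrap source file."""
    return all(fg.bootstrap_source_path is None for fg in file_groups)


def choose_reader(bootstrap_base_path: Optional[str],
                  file_groups: Iterable[FileGroupState]) -> str:
    """Pick a read path. The returned strings are labels for illustration only."""
    if not bootstrap_base_path:
        # No bootstrap base path configured: table was never bootstrapped.
        return "parquet"
    if is_fully_bootstrapped(file_groups):
        # Every file has been rewritten into the target table.
        return "parquet"
    # At least one file still needs to be stitched with its source file.
    return "bootstrap_relation"
```

With this shape, a table whose properties still carry a bootstrap base path falls back to the plain Parquet read path as soon as no file group references a source file any more.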
--
This message was sent by Atlassian Jira
(v8.20.10#820010)