Posted to commits@hudi.apache.org by "Ethan Guo (Jira)" <ji...@apache.org> on 2022/09/15 15:42:00 UTC

[jira] [Updated] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

     [ https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-915:
---------------------------
    Status: In Progress  (was: Open)

> Partition Columns missing in files upserted after Metadata Bootstrap
> --------------------------------------------------------------------
>
>                 Key: HUDI-915
>                 URL: https://issues.apache.org/jira/browse/HUDI-915
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Common Core
>    Affects Versions: 0.9.0
>            Reporter: Udit Mehrotra
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.13.0
>
>
> This issue happens when the source data is partitioned using _*hive-style partitioning*_, which is also the default behavior of Spark when it writes data. With this partitioning, the partition column/schema is never stored in the files but is instead retrieved on the fly from the file paths, which have partition folders of the form *_partition_key=partition_value_*.
> During metadata bootstrap, we store only the metadata columns in the Hudi table folder. In addition, the *bootstrap schema* we compute is read directly from the source data file, which does not contain the *partition column schema*. The bootstrap schema is therefore incomplete.
> All this manifests as issues when we ultimately do *upserts* on these bootstrapped files and they become fully bootstrapped. At upsert time the schema evolves, because the upsert dataframe needs to have the partition column in it to perform upserts. As a result, the *upserted rows* have the correct partition column value stored, while the other records, which are simply copied over from the metadata bootstrap file, are missing the partition column. Thus, we observe different behavior with *bootstrapped* vs *non-bootstrapped* tables.
> This is not creating issues with *Hive* at the moment, because Hive is able to determine the partition columns from the metadata it stores. However, it creates a problem with other engines like *Spark*, where the partition columns show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in the *bootstrap schema* in the commit metadata file. This would provide the following benefits:
>  ** From a completeness perspective this is good, so that there are no behavioral changes between bootstrapped vs non-bootstrapped tables.
>  ** In the Spark bootstrap relation and incremental query relation, where we need to figure out the latest schema, one can simply get the accurate schema from the commit metadata file, instead of having to determine whether or not the partition column is present in the schema obtained from the metadata file and, if not, figure out the partition schema every time and merge it (which can be expensive).
>  * When doing upsert on files that are metadata bootstrapped, the partition column values should be correctly determined and copied to the upserted file to avoid missing and null values.
>  ** Again, this is consistent with the behavior of non-bootstrapped tables, and even though Hive seems to somehow handle this, we should consider other engines like *Spark* where it cannot be handled automatically.
>  ** Without this, it will be significantly more complicated to provide the partition value on the read side in Spark, which would have to determine every time whether the partition value is null and then fill it in.
>  ** Once the table is fully bootstrapped at some point in the future, and the bootstrap commit has, say, been cleaned up and Spark querying happens through the *parquet* datasource instead of the *new bootstrapped datasource*, the *parquet datasource* will return null values wherever it finds missing partition values. In that case, we have no control over the *parquet* datasource, as it is simply reading from the file.
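
For illustration only, the second proposal (filling in partition values for records copied from a metadata-bootstrapped file) can be sketched in plain Python. This is not Hudi code: the function names parse_hive_partition_path and fill_partition_columns are hypothetical, records are modeled as simple dicts, and all partition values are kept as strings (no type inference and no handling of URL-escaped values).

```python
from typing import Dict, Optional


def parse_hive_partition_path(partition_path: str) -> Dict[str, str]:
    """Extract key/value pairs from a hive-style partition path.

    `partition_path` is assumed to be the partition folder relative to the
    table base path, e.g. "year=2020/month=01" (hypothetical example).
    """
    values: Dict[str, str] = {}
    for segment in partition_path.split("/"):
        # Only segments of the form key=value carry partition data.
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def fill_partition_columns(
    record: Dict[str, Optional[str]], partition_path: str
) -> Dict[str, Optional[str]]:
    """Return a copy of `record` with missing or null partition columns
    filled in from the partition path. Existing non-null values are kept."""
    filled = dict(record)
    for key, value in parse_hive_partition_path(partition_path).items():
        if filled.get(key) is None:
            filled[key] = value
    return filled


# A record copied over from a bootstrapped file has no partition columns,
# while an upserted record already carries them.
copied = fill_partition_columns({"id": "1"}, "year=2020/month=01")
upserted = fill_partition_columns(
    {"id": "2", "year": "2020", "month": "01"}, "year=2020/month=01"
)
```

With this kind of fill-in applied at write time, both kinds of records end up with the same set of columns, so a reader that takes values directly from the file would not see nulls for the partition columns.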



--
This message was sent by Atlassian Jira
(v8.20.10#820010)