Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2021/01/21 05:59:01 UTC

[jira] [Closed] (HUDI-1357) Add a check to ensure there is no data loss when writing to HUDI dataset

     [ https://issues.apache.org/jira/browse/HUDI-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar closed HUDI-1357.
--------------------------------
    Resolution: Fixed

> Add a check to ensure there is no data loss when writing to HUDI dataset
> ------------------------------------------------------------------------
>
>                 Key: HUDI-1357
>                 URL: https://issues.apache.org/jira/browse/HUDI-1357
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> When a HUDI dataset is updated with updates and deletes, records from the existing base files are read, merged with the incoming updates and deletes, and finally written out to newer base files.
> Assuming no new records are routed into the file group during the merge, it should hold that:
> count(records_in_new_base_file) + num_deletes = count(records_in_older_base_file)
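> For example (numbers purely illustrative): if the older base file held 100 records and the merge applied 5 deletes, the new base file should hold 95 records, and 95 + 5 = 100 satisfies the check.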
> In our internal production deployment, we had an issue wherein, due to a Parquet bug in schema handling, reading existing records returned null data. This led to many records not being written from the older Parquet file into the newer Parquet file.
> This check ensures that such issues do not lead to silent data loss by raising an exception when the expected record counts do not match. It is off by default and is controlled through a HoodieWriteConfig parameter.
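>
> A minimal sketch of the intended invariant check follows. The record counts, the configuration key "hoodie.merge.data.validation.enabled", and the use of a plain IllegalStateException are illustrative assumptions for this sketch, not the exact Hudi API; consult HoodieWriteConfig in the 0.7.0 release for the actual parameter and exception type.
> {code:java}
> import java.util.Properties;
>
> public class MergeCountCheckSketch {
>   public static void main(String[] args) {
>     // Hypothetical write-config property enabling the validation (assumed name).
>     Properties writeConfig = new Properties();
>     writeConfig.setProperty("hoodie.merge.data.validation.enabled", "true");
>
>     // Counts observed while merging one file group (hard-coded for illustration).
>     long oldRecordCount = 100;  // records read from the older base file
>     long numDeletes = 5;        // deletes applied during the merge
>     long newRecordCount = 95;   // records written to the new base file
>
>     // If a reader bug silently returns null rows, newRecordCount shrinks and the
>     // inequality below detects the shortfall instead of committing lost data.
>     boolean checkEnabled = Boolean.parseBoolean(
>         writeConfig.getProperty("hoodie.merge.data.validation.enabled"));
>     if (checkEnabled && newRecordCount + numDeletes < oldRecordCount) {
>       throw new IllegalStateException("Possible data loss: merged record count decreased");
>     }
>     System.out.println("Merge record counts are consistent");
>   }
> }
> {code}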



--
This message was sent by Atlassian Jira
(v8.3.4#803005)