Posted to commits@hudi.apache.org by "satish (Jira)" <ji...@apache.org> on 2020/03/09 23:21:00 UTC
[jira] [Updated] (HUDI-687) incremental reads on MOR tables can lead to data loss

     [ https://issues.apache.org/jira/browse/HUDI-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-687:
------------------------
      Component/s:     (was: CLI)
    Fix Version/s:     (was: 0.5.2)
      Description: 
example timeline:

t0 -> create bucket1.parquet
t1 -> create bucket1.log and append updates
t2 -> request compaction 
t3 -> create bucket2.parquet

If the compaction requested at t2 takes a long time, incremental reads using HoodieParquetInputFormat can skip the data ingested at t1, leading to 'data loss'. (The data is still on disk, but incremental readers won't see it because it is in a log file and the readers have already moved on to t3.)

To work around this problem, we want to stop returning data belonging to commits > t1. After the compaction completes, the incremental reader would see the updates in t2, t3, and so on.
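
The proposed rule can be illustrated with a small, self-contained sketch. This is not Hudi code: the `Instant` class and the `earliest_pending_compaction` and `visible_commits` helpers are hypothetical names introduced only for illustration, and the sketch assumes instant times sort lexicographically (true for the t0..t3 labels here). It models the example timeline and shows that, while the compaction at t2 is pending, an incremental reader is only handed commits up to t1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instant:
    """Hypothetical stand-in for a timeline entry (not a Hudi class)."""
    time: str        # sortable instant label, e.g. "t1"
    action: str      # "commit", "deltacommit", or "compaction"
    completed: bool  # False while the action is only requested or in flight


def earliest_pending_compaction(timeline: List[Instant]) -> Optional[str]:
    """Return the time of the oldest compaction that has not completed, if any."""
    pending = [i.time for i in timeline
               if i.action == "compaction" and not i.completed]
    return min(pending) if pending else None


def visible_commits(timeline: List[Instant], after: str) -> List[str]:
    """Completed instants newer than `after` that are safe to return.

    Anything at or beyond the earliest pending compaction is held back,
    so the reader cannot advance past data that still sits in a log file.
    """
    barrier = earliest_pending_compaction(timeline)
    return [i.time for i in sorted(timeline, key=lambda i: i.time)
            if i.completed
            and i.time > after
            and (barrier is None or i.time < barrier)]


# The example timeline, with the compaction at t2 still pending.
pending_timeline = [
    Instant("t0", "commit", True),
    Instant("t1", "deltacommit", True),
    Instant("t2", "compaction", False),
    Instant("t3", "commit", True),
]
print(visible_commits(pending_timeline, after="t0"))

# The same timeline once the compaction at t2 has completed.
compacted_timeline = [
    Instant("t0", "commit", True),
    Instant("t1", "deltacommit", True),
    Instant("t2", "compaction", True),
    Instant("t3", "commit", True),
]
print(visible_commits(compacted_timeline, after="t0"))
```

With the compaction pending, a reader that last consumed t0 is given only t1 and stays there. Once the compaction completes, the same call also returns t2 and t3, which matches the behaviour described above.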


  was:
Hudi CLI has 'show archived commits' command which is not very helpful

 
{code:java}
->show archived commits
===============> Showing only 10 archived commits <===============
    ____________________________
    | CommitTime    | CommitType|
    |===========================|
    | 20190322223304| commit    |
    | 20190323220154| commit    |
    | 20190323220154| commit    |
    | 20190323224004| commit    |
    | 20190323224013| commit    |
    | 20190323224229| commit    |
    | 20190323224229| commit    |
    | 20190323232849| commit    |
    | 20190323233109| commit    |
    | 20190323233109| commit    |
 {code}
Modify or introduce new command to make it easy to debug

 

           Labels:   (was: pull-request-available)
         Priority: Critical  (was: Minor)

> incremental reads on MOR tables can lead to data loss
> -----------------------------------------------------
>
>                 Key: HUDI-687
>                 URL: https://issues.apache.org/jira/browse/HUDI-687
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>            Reporter: satish
>            Assignee: satish
>            Priority: Critical
>
> example timeline:
> t0 -> create bucket1.parquet
> t1 -> create bucket1.log and append updates
> t2 -> request compaction 
> t3 -> create bucket2.parquet
> If the compaction requested at t2 takes a long time, incremental reads using HoodieParquetInputFormat can skip the data ingested at t1, leading to 'data loss'. (The data is still on disk, but incremental readers won't see it because it is in a log file and the readers have already moved on to t3.)
> To work around this problem, we want to stop returning data belonging to commits > t1. After the compaction completes, the incremental reader would see the updates in t2, t3, and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)