Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/04 00:42:02 UTC

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403397635
 
 

 ##########
 File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##########
 @@ -118,6 +119,34 @@
     return returns.toArray(new FileStatus[returns.size()]);
   }
 
+  /**
+   * Filter any specific instants that we do not want to process.
+   * example timeline:
+   *
+   * t0 -> create bucket1.parquet
+   * t1 -> create and append updates bucket1.log
+   * t2 -> request compaction
+   * t3 -> create bucket2.parquet
+   *
+   * if compaction at t2 takes a long time, incremental readers on RO tables can move to t3 and would skip updates in t1
+   *
+   * To workaround this problem, we want to stop returning data belonging to commits > t2.
+   * After compaction is complete, incremental reader would see updates in t2, t3, so on.
+   */
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) {
+    Option<HoodieInstant> pendingCompactionInstant = timeline.filterPendingCompactionTimeline().firstInstant();
+    if (pendingCompactionInstant.isPresent()) {
 
 Review comment:
   This seems like the crux of the change, with most of the other code improving tests, etc.? If so, this looks like a reasonable interim solution to me, although we should really encourage users to do incremental pulls through the RTInputFormat instead.
   
   The core problem of "data loss" raised in this issue feels like a mismatched expectation, really :)
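
   For anyone following along, here is a minimal sketch of the filtering idea described in the Javadoc above, applied to its own t0..t3 example. It does not use Hudi's classes: `Instant`, `filterInstants`, and the action strings are hypothetical stand-ins, and the strict "before the pending compaction" cutoff is my reading of the comment, not necessarily what the PR does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration only; these are not Hudi classes.
class PendingCompactionFilterSketch {

  static final class Instant {
    final String timestamp;   // sortable instant time, e.g. "t2"
    final String action;      // "commit", "deltacommit", or "compaction"
    final boolean completed;  // false while the action is still pending

    Instant(String timestamp, String action, boolean completed) {
      this.timestamp = timestamp;
      this.action = action;
      this.completed = completed;
    }
  }

  // Keep only the instants strictly before the earliest pending compaction.
  // If no compaction is pending, the timeline is returned unchanged.
  static List<Instant> filterInstants(List<Instant> timeline) {
    Optional<String> firstPendingCompaction = timeline.stream()
        .filter(i -> "compaction".equals(i.action) && !i.completed)
        .map(i -> i.timestamp)
        .min(String::compareTo);
    if (!firstPendingCompaction.isPresent()) {
      return new ArrayList<>(timeline);
    }
    String cutoff = firstPendingCompaction.get();
    return timeline.stream()
        .filter(i -> i.timestamp.compareTo(cutoff) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Mirrors the example timeline in the Javadoc.
    List<Instant> timeline = new ArrayList<>();
    timeline.add(new Instant("t0", "commit", true));       // bucket1.parquet
    timeline.add(new Instant("t1", "deltacommit", true));  // bucket1.log
    timeline.add(new Instant("t2", "compaction", false));  // requested, still pending
    timeline.add(new Instant("t3", "commit", true));       // bucket2.parquet

    List<String> visible = filterInstants(timeline).stream()
        .map(i -> i.timestamp)
        .collect(Collectors.toList());
    System.out.println(visible); // prints [t0, t1]
  }
}
```

   With the compaction at t2 still pending, an incremental reader sees nothing at or after t2, so it cannot advance past t3 and miss the log updates from t1. Once t2 completes, the filter is a no-op and the reader picks up the later instants.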
