Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/05 18:40:15 UTC

[GitHub] [hudi] bhasudha commented on a change in pull request #1817: [HUDI-651] Fix incremental queries in MOR tables

bhasudha commented on a change in pull request #1817:
URL: https://github.com/apache/hudi/pull/1817#discussion_r465927669



##########
File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
##########
@@ -165,11 +261,15 @@ private static void cleanProjectionColumnIds(Configuration conf) {
     LOG.info("Creating record reader with readCols :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR)
         + ", Ids :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR));
     // sanity check
-    ValidationUtils.checkArgument(split instanceof HoodieRealtimeFileSplit,
+    ValidationUtils.checkArgument(split instanceof HoodieRealtimeFileSplit || split instanceof HoodieMORIncrementalFileSplit,

Review comment:
       @satishkotha  There are a few requirements we need to satisfy in order to support this in HoodieRealtimeFileSplit:
   
   - The start and end times should be honored by the incremental query. If the end time is not specified, it can be assumed to be the min commit from (maxNumberOfCommits, mostRecentCommit). Currently this is not happening as intended.  
   - The base file and log files can be optional. This can happen when the boundaries of the incremental query filter are such that the start commit time matches a log file and/or the end commit time matches only the base file across file slices, or when the incremental query touches a FileSlice that is not compacted yet.
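
   To make the two requirements concrete, here is a minimal sketch. It is not the code in this PR: `IncrementalSplitSketch`, `Split`, and `resolveEndCommit` are hypothetical names. It assumes commit times are fixed-width timestamp strings (so string order is chronological), that the start commit is exclusive, and that "min commit from (maxNumberOfCommits, mostRecentCommit)" means "advance at most maxNumberOfCommits commits past the start, or stop at the most recent commit, whichever comes first".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch only; none of these names exist in Hudi. */
class IncrementalSplitSketch {

  /** A split in which the base file and the log files are each optional. */
  static final class Split {
    final Optional<String> baseFilePath;
    final List<String> logFilePaths;

    Split(Optional<String> baseFilePath, List<String> logFilePaths) {
      // Either side may be missing, but a split with neither has nothing to read.
      if (!baseFilePath.isPresent() && logFilePaths.isEmpty()) {
        throw new IllegalArgumentException("split needs a base file or at least one log file");
      }
      this.baseFilePath = baseFilePath;
      this.logFilePaths = Collections.unmodifiableList(new ArrayList<>(logFilePaths));
    }
  }

  /**
   * Resolves the end commit of an incremental query.
   * Assumptions: sortedCommits is ascending, and the start commit is exclusive.
   * Returns empty when no commit lies after startCommit.
   */
  static Optional<String> resolveEndCommit(List<String> sortedCommits, String startCommit,
                                           Optional<String> requestedEnd, int maxNumberOfCommits) {
    if (maxNumberOfCommits < 1) {
      throw new IllegalArgumentException("maxNumberOfCommits must be at least 1");
    }
    if (requestedEnd.isPresent()) {
      return requestedEnd; // an explicit end time is always honored
    }
    String last = null;
    int taken = 0;
    for (String commit : sortedCommits) {
      if (commit.compareTo(startCommit) > 0) {
        last = commit;
        taken++;
        if (taken == maxNumberOfCommits) {
          break; // cap reached before the most recent commit
        }
      }
    }
    return Optional.ofNullable(last);
  }
}
```

   With commits `001` to `004`, a start of `001`, and no end time, a cap of 2 resolves to `003`, while a cap of 10 falls back to the most recent commit, `004`.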
   
   When I initially started, I was not sure how big the refactoring and testing effort would be to meet both of the above requirements in the same HoodieRealtimeFileSplit. It would also require regression testing of snapshot queries in all query engines, plus testing of the new incremental query path in all query engines. So instead of touching the snapshot query code path, which is running fine, I conservatively branched out so that these changes apply only to the incremental query path, intending to consolidate the two in the long term after stabilizing and gaining more confidence.



