Posted to commits@hudi.apache.org by "voonhous (via GitHub)" <gi...@apache.org> on 2023/02/09 08:23:54 UTC

[GitHub] [hudi] voonhous commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

voonhous commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101115877


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##########
@@ -380,7 +380,8 @@ private List<MergeOnReadInputSplit> buildFileIndex() {
             .path(FilePathUtils.toFlinkPath(path))
             .rowType(this.tableRowType)
             .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
-            .requiredPartitions(getRequiredPartitionPaths()).build();
+            .requiredPartitions(getRequiredPartitionPaths())
+            .skipClustering(false).build();

Review Comment:
   Question: is it safer to hardcode `false` here under `BatchMode`?
   
   IIUC, when reading in `batchMode`, `skip_clustering` should always be `false` and should not be configurable.
   
   This is because the reader starts with an empty "state", i.e. no prior reads have been performed, so there are no "previous" records that have already been read. Reading of `replacecommits` must therefore always be performed to ensure correctness.
   
   I suggest adding a comment to remind anyone modifying this part of the code in the future of **WHY** this `skip_clustering` must be **false** here. Something like the sketch below.
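   
   A minimal sketch (the comment wording is just a suggestion; the builder chain is copied from the diff above):
   
   ```java
               .path(FilePathUtils.toFlinkPath(path))
               .rowType(this.tableRowType)
               .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
               .requiredPartitions(getRequiredPartitionPaths())
               // Batch reads start from an empty state (no records have been
               // consumed yet), so replacecommits produced by clustering must
               // always be read to avoid data loss. Do NOT make this configurable.
               .skipClustering(false)
               .build();
   ```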


