Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/08 07:07:37 UTC

[GitHub] [hudi] scxwhite commented on a diff in pull request #5030: [HUDI-3617] MOR compact improve

scxwhite commented on code in PR #5030:
URL: https://github.com/apache/hudi/pull/5030#discussion_r965582169


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##########
@@ -123,25 +133,24 @@ public long getNumMergedRecordsInLog() {
     return numMergedRecordsInLog;
   }
 
-  /**
-   * Returns the builder for {@code HoodieMergedLogRecordScanner}.
-   */
-  public static HoodieMergedLogRecordScanner.Builder newBuilder() {
-    return new Builder();
-  }
-
   @Override
   protected void processNextRecord(HoodieRecord<? extends HoodieRecordPayload> hoodieRecord) throws IOException {
     String key = hoodieRecord.getRecordKey();
     if (records.containsKey(key)) {
       // Merge and store the merged record. The HoodieRecordPayload implementation is free to decide what should be
       // done when a delete (empty payload) is encountered before or after an insert/update.
-
-      HoodieRecord<? extends HoodieRecordPayload> oldRecord = records.get(key);
-      HoodieRecordPayload oldValue = oldRecord.getData();
-      HoodieRecordPayload combinedValue = hoodieRecord.getData().preCombine(oldValue);
-      // If combinedValue is oldValue, no need rePut oldRecord
-      if (combinedValue != oldValue) {
+      HoodieRecord<? extends HoodieRecordPayload> storeRecord = records.get(key);
+      HoodieRecordPayload storeValue = storeRecord.getData();
+      HoodieRecordPayload combinedValue;
+      // If revertLogFile = false, storeRecord is the old record.
+      // If revertLogFile = true, incoming data (hoodieRecord) is the old record.
+      if (!revertLogFile) {

Review Comment:
   > I see we are reversing the order of the log files, not the order of the log blocks, so it's not strictly a reversal, because in the case of HDFS the same log file could be updated by different delta commits.
   > 
   > logFile1: log block1 (LB1), LB2, LB3
   > logFile2: LB4, LB5
   > logFile3: LB6
   > 
   > As per master, the order in which the blocks are processed is LB1, LB2, LB3, LB4, LB5, LB6.
   > 
   > As per this patch (correct me if I am mistaken), the order in which we will process the blocks is LB6, LB4, LB5, LB1, LB2, LB3.
   > 
   > Thinking about whether this will result in any complications or corner-case issues.
   Yes, it is not strictly in reverse order. Therefore, we need to ensure that each record key appears at most once in each log file, for example by setting "hoodie.combine.before.upsert=true".
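   
   For readers following the truncated hunk above, here is a minimal sketch of how the `if (!revertLogFile)` branch presumably continues, based on the two comments in the diff; it is an assumption, not the actual patch. The idea is that `preCombine` is always invoked as newer.preCombine(older), so the argument order flips when the log files are scanned in reverse. `MergeSketch`, `Payload`, and `merge` are illustrative stand-ins rather than Hudi APIs.
   
   ```java
   // Illustrative sketch only; Payload stands in for HoodieRecordPayload.
   class MergeSketch {
     interface Payload {
       Payload preCombine(Payload older);
     }
   
     static Payload merge(Payload incoming, Payload stored, boolean revertLogFile) {
       if (!revertLogFile) {
         // Forward scan: the stored record is the old one, the incoming one is new.
         return incoming.preCombine(stored);
       }
       // Reverse scan: the incoming record is the old one, the stored one is new.
       return stored.preCombine(incoming);
     }
   }
   ```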
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org