You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "voonhous (via GitHub)" <gi...@apache.org> on 2023/03/07 05:44:20 UTC

[GitHub] [hudi] voonhous commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

voonhous commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1457575576

   Apologies for disturbing everyone here. I've tried reaching out to the author on Slack, but failed to get an explanation, so i will post my questions here.
   
   # 1. Rational behind the fix?
   From what @SteNicholas shared, this bug fixes a data-duplication issue after clustering? Since this is not documented anywhere in the PR or the JR ticket, I think it is best to mention it in the PR itself for future references.
   
   # 2. Issues with integration test?
   I apologies again as I don't quite understand the reason behind this fix. Hence, while trying to reproduce this issue, I copied the integration test (`testHoodieFlinkClusteringScheduleAfterArchive`) onto a Hudi version's `ITTestHoodieFlinkClustering` that doesn't have this fix yet expecting the IT to fail. But it still succeeds without the said fix. Hence, the test might not be written properly, or the fix might not be valid to begin with as it passes under ANY conditions. 
   
   I am copying the fix onto 6b178ce978ce324361cf708ce711cd0c46452dd6, this is the commit prior to the fix being merged into master.
   
   I repeated the test 100 times and it's passing for all 100 times on a branch that's without the fix.
   
   ![image](https://user-images.githubusercontent.com/6312314/223329883-05e6a499-72f7-4950-b723-db6468a5e68e.png)
   
   
   # 3. Questions regarding the fix
   
   
   Can someone please help to address the concerns above, mainly (2) and (3).
   > i.e, timeline: deltacommit1, deltacommit2, replacecommit1.inflight, deltacommit3, deltacommit4', at this time, if deltacommit1, deltacommit2 are archived, the containsOrBeforeTimelineStarts returns true for replacecommit1.inflight because isBeforeTimelineStarts returns true. Then, the file slices of the pending clustering instant would be included in another clustering plan.
   
   IIUC, the (input) file slices of the pending clustering instant will be excluded in the second clustering plan as clustering plan generator excludes all file slices in pending:
   
   1. logCompaction
   2. compaction
   3. clustering
   
   Code handling this in master (org.apache.hudi.table.action.cluster.strategy.ClusteringPlanStrategy#getFileSlicesEligibleForClustering) :
   
   ```java
   /**
   * Return file slices eligible for clustering. FileIds in pending clustering/compaction are not eligible for clustering.
   */
   protected Stream<FileSlice> getFileSlicesEligibleForClustering(String partition) {
   SyncableFileSystemView fileSystemView = (SyncableFileSystemView) getHoodieTable().getSliceView();
   Set<HoodieFileGroupId> fgIdsInPendingCompactionLogCompactionAndClustering =
       Stream.concat(fileSystemView.getPendingCompactionOperations(), fileSystemView.getPendingLogCompactionOperations())
           .map(instantTimeOpPair -> instantTimeOpPair.getValue().getFileGroupId())
           .collect(Collectors.toSet());
   fgIdsInPendingCompactionLogCompactionAndClustering.addAll(fileSystemView.getFileGroupsInPendingClustering().map(Pair::getKey).collect(Collectors.toSet()));
   
   return hoodieTable.getSliceView().getLatestFileSlices(partition)
       // file ids already in clustering are not eligible
       .filter(slice -> !fgIdsInPendingCompactionLogCompactionAndClustering.contains(slice.getFileGroupId()));
   }
   ```
   
   > When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline to check whether the file slice generated in pending clustering after archive isn't committed via HoodieFileGroup#isFileSliceCommitted(slice).
   
   IIUC, for a `fileSlice` to be a candidate for clustering inputFileGroup, it needs to be committed no? So, even though the the `replaceCommit` is the first instant on the timeline, the `inputFileGroups` should be tagged as HoodieFileGroup#isFileSliceCommitted(slice) = true.  
   
   # 4. My understanding of the fix
   
   From what i understand, (please correct me if i am wrong) this fix is trying to address an issue where the same fileslice is included in 2 separate clusteringPlan/replaceCommit's `inputFileGroup`. After clustering, the same row will exist in 2 different fileGroups as clustering will create new fileGroups for clustered data.
   
   Reference: A replaceCommit does the following:
   ```
   replaceCommit1 = input{FG_0, FG_1, FG_2} -> output{FG_3}
   replaceCommit2 = input{FG_0, FG_1, FG_2} -> output{FG_4}
   ```
   
   If this is occurring, this seems like an issue with how Hudi is fetching `pendingClusteringPlans` no?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org