You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/15 22:46:24 UTC

[GitHub] [hudi] nsivabalan commented on a change in pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

nsivabalan commented on a change in pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#discussion_r807373205



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
##########
@@ -73,7 +73,8 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) {
       CleanPlanner<T, I, K, O> planner = new CleanPlanner<>(context, table, config);
       Option<HoodieInstant> earliestInstant = planner.getEarliestCommitToRetain();
       context.setJobStatus(this.getClass().getSimpleName(), "Obtaining list of partitions to be cleaned");
-      List<String> partitionsToClean = planner.getPartitionPathsToClean(earliestInstant);
+      boolean isDropPartition = planner.isDropPartition(earliestInstant);

Review comment:
       guess we have to rethink this a bit. We have diff clean policies and I am afraid this may not work for other policies. for eg, default is KEEP_LATEST_COMMITS and it might work, but for KEEP_LATEST_FILE_VERSIONS, earliestInstant is empty. 

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -400,6 +416,16 @@ private String getLatestVersionBeforeCommit(List<FileSlice> fileSliceList, Hoodi
     return deletePaths;
   }
 
+  public List<CleanFileInfo> getDeletePartitionPaths(String partitionPath) {

Review comment:
       so, in case of DeletePartition, we just include the partition path (directory) to CleanFileInfo and do not add explicit file groups/files within partition is it? 

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
##########
@@ -85,10 +86,18 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) {
 
       context.setJobStatus(this.getClass().getSimpleName(), "Generating list of file slices to be cleaned");
 
-      Map<String, List<HoodieCleanFileInfo>> cleanOps = context
-          .map(partitionsToClean, partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)), cleanerParallelism)
-          .stream()
-          .collect(Collectors.toMap(Pair::getKey, y -> CleanerUtils.convertToHoodieCleanFileInfoList(y.getValue())));
+      Map<String, List<HoodieCleanFileInfo>> cleanOps;
+      if (isDropPartition) {
+        cleanOps = context
+            .map(partitionsToClean, partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePartitionPaths(partitionPathToClean)), cleanerParallelism)

Review comment:
       shouldn't we atleast fix CleanFileInfo to have another variable called isPartitionPath or something. 
   

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -113,21 +117,29 @@ public CleanPlanner(HoodieEngineContext context, HoodieTable<T, I, K, O> hoodieT
     return metadata.getPartitionMetadata().values().stream().flatMap(s -> s.getSavepointDataFile().stream());
   }
 
+  public List<String> getPartitionPathsToClean(Option<HoodieInstant> earliestRetainedInstant) throws IOException {

Review comment:
       is this used anywhere? 

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
##########
@@ -85,10 +86,18 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) {
 
       context.setJobStatus(this.getClass().getSimpleName(), "Generating list of file slices to be cleaned");
 
-      Map<String, List<HoodieCleanFileInfo>> cleanOps = context
-          .map(partitionsToClean, partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)), cleanerParallelism)
-          .stream()
-          .collect(Collectors.toMap(Pair::getKey, y -> CleanerUtils.convertToHoodieCleanFileInfoList(y.getValue())));
+      Map<String, List<HoodieCleanFileInfo>> cleanOps;
+      if (isDropPartition) {
+        cleanOps = context
+            .map(partitionsToClean, partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePartitionPaths(partitionPathToClean)), cleanerParallelism)

Review comment:
       from what I infer, we are just returning the partition path directory as part of List<HoodieCleanFileInfo> here and don't explicitly add every file group within the partition. since we store entries in metadata table keyed on partition path, we will just directly delete the entry and so add every file group info is not required? can you confirm my understanding. 
   

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -432,4 +458,47 @@ private boolean isFileSliceNeededForPendingCompaction(FileSlice fileSlice) {
   private boolean isFileGroupInPendingCompaction(HoodieFileGroup fg) {
     return fgIdToPendingCompactionOperations.containsKey(fg.getFileGroupId());
   }
+
+  public boolean isDropPartition(Option<HoodieInstant> instantToRetain) {

Review comment:
       isDeletePartitionOperation

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -432,4 +458,47 @@ private boolean isFileSliceNeededForPendingCompaction(FileSlice fileSlice) {
   private boolean isFileGroupInPendingCompaction(HoodieFileGroup fg) {
     return fgIdToPendingCompactionOperations.containsKey(fg.getFileGroupId());
   }
+
+  public boolean isDropPartition(Option<HoodieInstant> instantToRetain) {
+    try {
+      if (instantToRetain.isPresent() && HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instantToRetain.get().getAction())) {
+        HoodieReplaceCommitMetadata replaceCommitMetadata = HoodieReplaceCommitMetadata.fromBytes(
+            hoodieTable.getActiveTimeline().getInstantDetails(instantToRetain.get()).get(), HoodieReplaceCommitMetadata.class);
+
+        if (replaceCommitMetadata != null
+            && WriteOperationType.DELETE_PARTITION.equals(replaceCommitMetadata.getOperationType())) {
+          return true;
+        }
+      }
+    } catch (Exception e) {
+      throw new HoodieException("Failed to get commit metadata", e);
+    }
+    return false;
+  }
+
+  public List<String> getDropPartitions(Option<HoodieInstant> instantToRetain) {
+    try {
+      if (!instantToRetain.isPresent()) {
+        LOG.info("No earliest commit to retain. No need to scan partitions !!");
+        return Collections.emptyList();
+      }
+
+      if (HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instantToRetain.get().getAction())) {

Review comment:
       I am not sure if we can map earliest instant to retain to clean up partitions to delete. 
   
   for eg:
   C1, C2(delete partition), C3, C4. 
   
   lets say after C4, earliest commit to retain is C2: this essentially means that cleaner has to clean all data files for all commits < C2. and not touch anything pertaining to C2. but here wrt getDropPartition, we are breaking that and triggering deletion of partitions from C2. 
   
   or am I missing something? 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org