You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/24 05:23:24 UTC

[GitHub] [hudi] zhedoubushishi opened a new pull request #1870: [HUDI-808] Support cleaning bootstrap source data

zhedoubushishi opened a new pull request #1870:
URL: https://github.com/apache/hudi/pull/1870


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   JIRA https://issues.apache.org/jira/browse/HUDI-808
   
   This is an important requirement from GDPR perspective. When performing deletion on a metadata only bootstrapped partition, users should have the ability to tell to clean up the original data from the source location because as per this new bootstrapping mechanism the original data serves as the data in original commit for Hudi.
   
   Create an option named ```hoodie.cleaner.bootstrap.source.file```, when set it to true, users have the ability to clean the original source data for the bootstrap table. By default, it is false.
   
   ## Brief change log
   
   Create an option named ```hoodie.cleaner.bootstrap.source.file```, when set it to true, users have the ability to clean the original source data for the bootstrap table. By default, it is false.
   
   Also added corresponding unit tests for this option in TestCleaner.java.
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
     - *Added two test case in ```TestCleaner.java``` to verify the change.*
   
   ## Committer checklist
   
    - [x] Has a corresponding JIRA in PR title & commit
    
    - [x] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-670362764


   @zhedoubushishi this can happen if the job is redeployed between creating a cleaner plan i.e `.clean.requested` and actually completing the cleaning. so the new version would expect full paths to be written inside `.clean.requested` and fail. Makes sense ? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r468383431



##########
File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
##########
@@ -52,6 +52,8 @@
   public static final String MAX_COMMITS_TO_KEEP_PROP = "hoodie.keep.max.commits";
   public static final String MIN_COMMITS_TO_KEEP_PROP = "hoodie.keep.min.commits";
   public static final String COMMITS_ARCHIVAL_BATCH_SIZE_PROP = "hoodie.commits.archival.batch";
+  // Set true to clean bootstrap source files when necessary
+  public static final String CLEANER_BOOTSTRAP_BASE_FILE_ENABLED = "hoodie.cleaner.bootstrap.base.file";

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] zhedoubushishi commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r468069699



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/HoodieCleanStat.java
##########
@@ -39,17 +40,34 @@
   private final List<String> successDeleteFiles;
   // Files that could not be deleted
   private final List<String> failedDeleteFiles;
+  // Boostrap Base Path patterns that were generated for the delete operation

Review comment:
       [typo] boostrap -> bootstrap

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/HoodieCleanStat.java
##########
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.common;
 
+import java.util.ArrayList;

Review comment:
       [nit] wrong line

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/versioning/clean/CleanPlanV2MigrationHandler.java
##########
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline.versioning.clean;
+
+import java.util.HashMap;

Review comment:
       [nit] wrong line

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java
##########
@@ -26,39 +26,49 @@
 import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
 import org.apache.hudi.common.table.timeline.versioning.clean.CleanMetadataMigrator;
-import org.apache.hudi.common.table.timeline.versioning.clean.CleanV1MigrationHandler;
-import org.apache.hudi.common.table.timeline.versioning.clean.CleanV2MigrationHandler;
+import org.apache.hudi.common.table.timeline.versioning.clean.CleanMetadataV1MigrationHandler;
+import org.apache.hudi.common.table.timeline.versioning.clean.CleanMetadataV2MigrationHandler;
 
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
+import org.apache.hudi.common.table.timeline.versioning.clean.CleanPlanMigrator;

Review comment:
       [nit] wrong line




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467526154



##########
File path: hudi-common/src/main/avro/HoodieCleanMetadata.avsc
##########
@@ -24,23 +24,22 @@
      {"name": "totalFilesDeleted", "type": "int"},
      {"name": "earliestCommitToRetain", "type": "string"},
      {"name": "partitionMetadata", "type": {
-     "type" : "map", "values" : {

Review comment:
       this is not deletion. I simply moved this to a separate file for it to be referenced in another place. The idea is to keep these information for bootstrap base files in a separate Avro field for it to be skipped by incremental timeline syncing.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar merged pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar merged pull request #1870:
URL: https://github.com/apache/hudi/pull/1870


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r463291857



##########
File path: hudi-client/src/test/java/org/apache/hudi/table/TestCleaner.java
##########
@@ -885,6 +888,109 @@ private void testKeepLatestCommits(boolean simulateFailureRetry, boolean enableI
         file2P0C1));
   }
 
+  @Test
+  public void testBootstrapSourceFileCleanWithKeepLatestFileVersions() throws IOException {
+    testBootstrapSourceFileClean(HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS);
+  }
+
+  @Test
+  public void testBootstrapSourceFileCleanWithKeepLatestCommits() throws IOException {
+    testBootstrapSourceFileClean(HoodieCleaningPolicy.KEEP_LATEST_COMMITS);
+  }
+
+  /**
+   * Test HoodieTable.clean() with Bootstrap source file clean enable.
+   */
+  @Test
+  private void testBootstrapSourceFileClean(HoodieCleaningPolicy cleaningPolicy) throws IOException {
+    HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).withAssumeDatePartitioning(true)
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder()
+            .withCleanBootstrapSourceFileEnabled(true)
+            .withCleanerPolicy(cleaningPolicy).retainCommits(1).retainFileVersions(2).build())
+        .build();

Review comment:
       I see some private functions in this class to test cleaning based on commits and file versions, that are re-used by other tests. Is it possible to re-use some of these already created ones. May be introduce another boolean flag for bootstrap which can do additional creation of source files and later check if they are cleaned up.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] zhedoubushishi commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-670256477


   > @zhedoubushishi there is one issue here. we are changing what goes into the cleaner plan i.e its writing full path as opposed to just the file names.
   > 
   > This means that we need to bump up the cleaner metadata and add some inter-op code here. since there might be pending cleaner plans on the timeline, that have just the file names. . To avoid all this can we try and implement this without changing the cleaner plan?
   
   Can I know in which condition would there be pending cleaner plans? afaik, Hudi will run clean right after create a clean request. So does the pending cleaner only exist when there's a clean request but the clean commit failed?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] yanghua commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-664711400


   @zhedoubushishi There is a conflict file. Can you please fix it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467526271



##########
File path: hudi-common/src/main/avro/HoodieCleanerPlan.avsc
##########
@@ -47,6 +47,7 @@
       "type": "string"
     },
     {
+      /** This is deprecated and replaced by the field filePathsToBeDeletedPerPartition **/

Review comment:
       I am adding a new field filePathsToBeDeletedPerPartition which deprecates filesToBeDeletedPerPartition field. filePathsToBeDeletedPerPartition retains information to distinguish cleaning of regular hudi and bootstrap base files.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] zhedoubushishi commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-671730790


   LGTM to me. Thanks for the implementation of versioning part @bvaradar !
   Only left some minor comments. I noticed that there's a conflict with another commit but it seems you just reverted that commit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-669803111


   cc @bvaradar as well


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467401872



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -116,6 +119,19 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
         PartitionCleanStat partitionCleanStat = partitionCleanStatMap.get(partitionPath);
         partitionCleanStat.addDeleteFilePatterns(deletePath.getName());
         partitionCleanStat.addDeletedFileResult(deletePath.getName(), deletedFileResult);
+
+        // If CleanBootstrapSourceFileEnabled and it is a metadata bootstrap commit, also delete the corresponding source file
+        if (cleanBootstrapSourceFileEnabled && !FSUtils.isLogFile(deletePath)
+            && FSUtils.getCommitTime(delFileName).equals(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS)) {
+          Option<HoodieBaseFile> baseFile = fileSystemView.getBaseFileOn(partitionPath,

Review comment:
       @zhedoubushishi : I think this is an unnecessary call we will be making per file deletion. With embedded timeline service, this request will be routed to driver. Instead, I think we can handle it using versioning in clean plan.  Thoughts ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467524509



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -116,6 +119,19 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
         PartitionCleanStat partitionCleanStat = partitionCleanStatMap.get(partitionPath);
         partitionCleanStat.addDeleteFilePatterns(deletePath.getName());
         partitionCleanStat.addDeletedFileResult(deletePath.getName(), deletedFileResult);
+
+        // If CleanBootstrapSourceFileEnabled and it is a metadata bootstrap commit, also delete the corresponding source file
+        if (cleanBootstrapSourceFileEnabled && !FSUtils.isLogFile(deletePath)
+            && FSUtils.getCommitTime(delFileName).equals(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS)) {
+          Option<HoodieBaseFile> baseFile = fileSystemView.getBaseFileOn(partitionPath,

Review comment:
       @zhedoubushishi : Tried reaching you on slack :) to discuss an approach. I went ahead and implemented it in the interest of time. 
   
   The basic idea is to ensure that Cleaner plan stores necessary information related to files to be deleted including bootstrap base files. The cleaner executor will simply read the cleaner plan and be able to distinguish normal vs bootstrap base files. It goes ahead and deletes those files. For bootstrap base files, it records the complete path of the file it deleted in a separate (new) avro field. This is very important in order to ensure incremental timeline syncing (which reads these metadata) to work properly. 
   
   Please take a look at this code changes if possible. 
   ( @vinothchandar  @umehrot2  : FYI )




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] zhedoubushishi commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-665192084


   > @zhedoubushishi There is a conflict file. Can you please fix it?
   
   Sure. Done.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r468375136



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -82,40 +83,45 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
       LOG.info("Using cleanerParallelism: " + cleanerParallelism);
 
       jsc.setJobGroup(this.getClass().getSimpleName(), "Generates list of file slices to be cleaned");
-      Map<String, List<String>> cleanOps = jsc
+      Map<String, List<HoodieCleanFileInfo>> cleanOps = jsc
           .parallelize(partitionsToClean, cleanerParallelism)
           .map(partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)))
           .collect().stream()
-          .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
+          .collect(Collectors.toMap(Pair::getKey,
+            (y) -> y.getValue().stream().map(CleanFileInfo::toHoodieFileCleanInfo).collect(Collectors.toList())));
 
       return new HoodieCleanerPlan(earliestInstant
           .map(x -> new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name())).orElse(null),
-          config.getCleanerPolicy().name(), cleanOps, 1);
+          config.getCleanerPolicy().name(), null, CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps);
     } catch (IOException e) {
       throw new HoodieIOException("Failed to schedule clean operation", e);
     }
   }
 
-  private static PairFlatMapFunction<Iterator<Tuple2<String, String>>, String, PartitionCleanStat> deleteFilesFunc(
-      HoodieTable table) {
-    return (PairFlatMapFunction<Iterator<Tuple2<String, String>>, String, PartitionCleanStat>) iter -> {
+  private static PairFlatMapFunction<Iterator<Tuple2<String, CleanFileInfo>>, String, PartitionCleanStat>
+        deleteFilesFunc(HoodieTable table) {
+    return (PairFlatMapFunction<Iterator<Tuple2<String, CleanFileInfo>>, String, PartitionCleanStat>) iter -> {
       Map<String, PartitionCleanStat> partitionCleanStatMap = new HashMap<>();
-
       FileSystem fs = table.getMetaClient().getFs();
-      Path basePath = new Path(table.getMetaClient().getBasePath());
       while (iter.hasNext()) {
-        Tuple2<String, String> partitionDelFileTuple = iter.next();
+        Tuple2<String, CleanFileInfo> partitionDelFileTuple = iter.next();
         String partitionPath = partitionDelFileTuple._1();
-        String delFileName = partitionDelFileTuple._2();
-        Path deletePath = FSUtils.getPartitionPath(FSUtils.getPartitionPath(basePath, partitionPath), delFileName);
+        Path deletePath = new Path(partitionDelFileTuple._2().getFilePath());
         String deletePathStr = deletePath.toString();
         Boolean deletedFileResult = deleteFileAndGetResult(fs, deletePathStr);
         if (!partitionCleanStatMap.containsKey(partitionPath)) {
           partitionCleanStatMap.put(partitionPath, new PartitionCleanStat(partitionPath));
         }
+        boolean isBootstrapBasePathFile = partitionDelFileTuple._2().isBootstrapBaseFile();
         PartitionCleanStat partitionCleanStat = partitionCleanStatMap.get(partitionPath);
-        partitionCleanStat.addDeleteFilePatterns(deletePath.getName());
-        partitionCleanStat.addDeletedFileResult(deletePath.getName(), deletedFileResult);
+        if (isBootstrapBasePathFile) {

Review comment:
       deletePath.toString() returns a full path whereas deletePath.getName() returns only the file name.  That was the reason why it was in if-else block. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r466243056



##########
File path: hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java
##########
@@ -513,4 +522,41 @@ public static void writeRecordsToLogFiles(FileSystem fs, String basePath, Schema
     }
     return writeStatList;
   }
+
+  public static Map<String, List<BootstrapSourceFileMapping>> generateBootstrapIndex(HoodieTableMetaClient metaClient,

Review comment:
       can we move these to the test class itself? these are very specific to bootstrap. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467944196



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -82,40 +83,45 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
       LOG.info("Using cleanerParallelism: " + cleanerParallelism);
 
       jsc.setJobGroup(this.getClass().getSimpleName(), "Generates list of file slices to be cleaned");
-      Map<String, List<String>> cleanOps = jsc
+      Map<String, List<HoodieCleanFileInfo>> cleanOps = jsc
           .parallelize(partitionsToClean, cleanerParallelism)
           .map(partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)))
           .collect().stream()
-          .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
+          .collect(Collectors.toMap(Pair::getKey,

Review comment:
       stylistic: in general, a stream within stream is a bit hard to read. `flatMap()` first? but guess this is a map. probably using a named lambda function may help 

##########
File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
##########
@@ -52,6 +52,8 @@
   public static final String MAX_COMMITS_TO_KEEP_PROP = "hoodie.keep.max.commits";
   public static final String MIN_COMMITS_TO_KEEP_PROP = "hoodie.keep.min.commits";
   public static final String COMMITS_ARCHIVAL_BATCH_SIZE_PROP = "hoodie.commits.archival.batch";
+  // Set true to clean bootstrap source files when necessary
+  public static final String CLEANER_BOOTSTRAP_BASE_FILE_ENABLED = "hoodie.cleaner.bootstrap.base.file";

Review comment:
       rename : `hoodie.cleaner.delete.bootstrap.base.file` ?

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/HoodieCleanStat.java
##########
@@ -39,17 +40,34 @@
   private final List<String> successDeleteFiles;
   // Files that could not be deleted
   private final List<String> failedDeleteFiles;
+  // Boostrap Base Path patterns that were generated for the delete operation
+  private final List<String> deleteBootstrapBasePathPatterns;
+  private final List<String> successDeleteBootstrapBaseFiles;
+  // Files that could not be deleted
+  private final List<String> failedDeleteBootstrapBaseFiles;
   // Earliest commit that was retained in this clean
   private final String earliestCommitToRetain;
 
   public HoodieCleanStat(HoodieCleaningPolicy policy, String partitionPath, List<String> deletePathPatterns,
       List<String> successDeleteFiles, List<String> failedDeleteFiles, String earliestCommitToRetain) {
+    this(policy, partitionPath, deletePathPatterns, successDeleteFiles, failedDeleteFiles, earliestCommitToRetain,
+        new ArrayList<>(), new ArrayList<>(), new ArrayList<>());

Review comment:
       CollectionUtils.emptyList or something? 

##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -82,40 +83,45 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
       LOG.info("Using cleanerParallelism: " + cleanerParallelism);
 
       jsc.setJobGroup(this.getClass().getSimpleName(), "Generates list of file slices to be cleaned");
-      Map<String, List<String>> cleanOps = jsc
+      Map<String, List<HoodieCleanFileInfo>> cleanOps = jsc
           .parallelize(partitionsToClean, cleanerParallelism)
           .map(partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)))
           .collect().stream()
-          .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
+          .collect(Collectors.toMap(Pair::getKey,
+            (y) -> y.getValue().stream().map(CleanFileInfo::toHoodieFileCleanInfo).collect(Collectors.toList())));
 
       return new HoodieCleanerPlan(earliestInstant
           .map(x -> new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name())).orElse(null),
-          config.getCleanerPolicy().name(), cleanOps, 1);
+          config.getCleanerPolicy().name(), null, CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps);
     } catch (IOException e) {
       throw new HoodieIOException("Failed to schedule clean operation", e);
     }
   }
 
-  private static PairFlatMapFunction<Iterator<Tuple2<String, String>>, String, PartitionCleanStat> deleteFilesFunc(
-      HoodieTable table) {
-    return (PairFlatMapFunction<Iterator<Tuple2<String, String>>, String, PartitionCleanStat>) iter -> {
+  private static PairFlatMapFunction<Iterator<Tuple2<String, CleanFileInfo>>, String, PartitionCleanStat>
+        deleteFilesFunc(HoodieTable table) {
+    return (PairFlatMapFunction<Iterator<Tuple2<String, CleanFileInfo>>, String, PartitionCleanStat>) iter -> {
       Map<String, PartitionCleanStat> partitionCleanStatMap = new HashMap<>();
-
       FileSystem fs = table.getMetaClient().getFs();
-      Path basePath = new Path(table.getMetaClient().getBasePath());
       while (iter.hasNext()) {
-        Tuple2<String, String> partitionDelFileTuple = iter.next();
+        Tuple2<String, CleanFileInfo> partitionDelFileTuple = iter.next();
         String partitionPath = partitionDelFileTuple._1();
-        String delFileName = partitionDelFileTuple._2();
-        Path deletePath = FSUtils.getPartitionPath(FSUtils.getPartitionPath(basePath, partitionPath), delFileName);
+        Path deletePath = new Path(partitionDelFileTuple._2().getFilePath());
         String deletePathStr = deletePath.toString();
         Boolean deletedFileResult = deleteFileAndGetResult(fs, deletePathStr);
         if (!partitionCleanStatMap.containsKey(partitionPath)) {
           partitionCleanStatMap.put(partitionPath, new PartitionCleanStat(partitionPath));
         }
+        boolean isBootstrapBasePathFile = partitionDelFileTuple._2().isBootstrapBaseFile();
         PartitionCleanStat partitionCleanStat = partitionCleanStatMap.get(partitionPath);
-        partitionCleanStat.addDeleteFilePatterns(deletePath.getName());
-        partitionCleanStat.addDeletedFileResult(deletePath.getName(), deletedFileResult);
+        if (isBootstrapBasePathFile) {

Review comment:
       this is same as 
   
   ```
   partitionCleanStat.addDeleteFilePatterns(deletePath.toString(), isBootstrapBasePathFile);
   partitionCleanStat.addDeletedFileResult(deletePath.toString(), deletedFileResult, isBootstrapBasePathFile);
   ```
   right

##########
File path: hudi-common/src/main/avro/HoodieCleanMetadata.avsc
##########
@@ -24,23 +24,22 @@
      {"name": "totalFilesDeleted", "type": "int"},
      {"name": "earliestCommitToRetain", "type": "string"},
      {"name": "partitionMetadata", "type": {
-     "type" : "map", "values" : {

Review comment:
       do we know that would be backwards compatible?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] zhedoubushishi commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r466742409



##########
File path: hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java
##########
@@ -513,4 +522,41 @@ public static void writeRecordsToLogFiles(FileSystem fs, String basePath, Schema
     }
     return writeStatList;
   }
+
+  public static Map<String, List<BootstrapSourceFileMapping>> generateBootstrapIndex(HoodieTableMetaClient metaClient,

Review comment:
       Sure. Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] zhedoubushishi commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467496945



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -116,6 +119,19 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
         PartitionCleanStat partitionCleanStat = partitionCleanStatMap.get(partitionPath);
         partitionCleanStat.addDeleteFilePatterns(deletePath.getName());
         partitionCleanStat.addDeletedFileResult(deletePath.getName(), deletedFileResult);
+
+        // If CleanBootstrapSourceFileEnabled and it is a metadata bootstrap commit, also delete the corresponding source file
+        if (cleanBootstrapSourceFileEnabled && !FSUtils.isLogFile(deletePath)
+            && FSUtils.getCommitTime(delFileName).equals(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS)) {
+          Option<HoodieBaseFile> baseFile = fileSystemView.getBaseFileOn(partitionPath,

Review comment:
       I see what you mean. Yea we should avoid making unnecessary calls. So regarding to versioning, do you mean create a class like [TimelineLayoutVersion](https://github.com/apache/hudi/blob/5ee676e34fa2429d02ecd1a3cf7f586518ed2081/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/versioning/TimelineLayoutVersion.java#L29) or just simply handle this logic in the ```CleanActionExecutor```. For example, if the parameter is a full file path then delete it. Else if the parameter is a file name, generate the file path and then delete it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-671401411


   @zhedoubushishi @umehrot2 please review this PR carefully ! We plan to land today 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] zhedoubushishi commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467373477



##########
File path: hudi-client/src/test/java/org/apache/hudi/table/TestCleaner.java
##########
@@ -885,6 +888,109 @@ private void testKeepLatestCommits(boolean simulateFailureRetry, boolean enableI
         file2P0C1));
   }
 
+  @Test
+  public void testBootstrapSourceFileCleanWithKeepLatestFileVersions() throws IOException {
+    testBootstrapSourceFileClean(HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS);
+  }
+
+  @Test
+  public void testBootstrapSourceFileCleanWithKeepLatestCommits() throws IOException {
+    testBootstrapSourceFileClean(HoodieCleaningPolicy.KEEP_LATEST_COMMITS);
+  }
+
+  /**
+   * Test HoodieTable.clean() with Bootstrap source file clean enable.
+   */
+  @Test
+  private void testBootstrapSourceFileClean(HoodieCleaningPolicy cleaningPolicy) throws IOException {
+    HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).withAssumeDatePartitioning(true)
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder()
+            .withCleanBootstrapSourceFileEnabled(true)
+            .withCleanerPolicy(cleaningPolicy).retainCommits(1).retainFileVersions(2).build())
+        .build();

Review comment:
       Rewrote the test part to make use of the original test code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] zhedoubushishi commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-670836876


   > @zhedoubushishi this can happen if the job is redeployed between creating a cleaner plan i.e `.clean.requested` and actually completing the cleaning. so the new version would expect full paths to be written inside `.clean.requested` and fail. Makes sense ?
   
   Thanks for the explanation! Make sense to me. I handled the delete logic in the actual delete part instead.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r468382842



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -82,40 +83,45 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
       LOG.info("Using cleanerParallelism: " + cleanerParallelism);
 
       jsc.setJobGroup(this.getClass().getSimpleName(), "Generates list of file slices to be cleaned");
-      Map<String, List<String>> cleanOps = jsc
+      Map<String, List<HoodieCleanFileInfo>> cleanOps = jsc
           .parallelize(partitionsToClean, cleanerParallelism)
           .map(partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)))
           .collect().stream()
-          .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
+          .collect(Collectors.toMap(Pair::getKey,

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on a change in pull request #1870: [HUDI-808] Support cleaning bootstrap source data

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467524645



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -116,6 +119,19 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
         PartitionCleanStat partitionCleanStat = partitionCleanStatMap.get(partitionPath);
         partitionCleanStat.addDeleteFilePatterns(deletePath.getName());
         partitionCleanStat.addDeletedFileResult(deletePath.getName(), deletedFileResult);
+
+        // If CleanBootstrapSourceFileEnabled and it is a metadata bootstrap commit, also delete the corresponding source file
+        if (cleanBootstrapSourceFileEnabled && !FSUtils.isLogFile(deletePath)
+            && FSUtils.getCommitTime(delFileName).equals(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS)) {
+          Option<HoodieBaseFile> baseFile = fileSystemView.getBaseFileOn(partitionPath,

Review comment:
       @zhedoubushishi : Please also note that for ensuring backwards compatibility, I add a migrator class (CleanPlanMigrator.java) to handle pre and post 0.6  .clean.requested files.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org