You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/22 00:53:10 UTC

[GitHub] [hudi] nsivabalan commented on a change in pull request #3646: [HUDI-349]: Added new cleaning policy based on number of hours

nsivabalan commented on a change in pull request #3646:
URL: https://github.com/apache/hudi/pull/3646#discussion_r713508760



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
##########
@@ -512,6 +517,11 @@ public Builder retainCommits(int commitsRetained) {
       return this;
     }
 
+    public Builder retainNumberOfHours(int numberOfHours) {

Review comment:
       can we name the arg same as its usage. cleanerHoursRetained

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCleaningPolicy.java
##########
@@ -22,5 +22,5 @@
  * Hoodie cleaning policies.
  */
 public enum HoodieCleaningPolicy {
-  KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_COMMITS;
+  KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_COMMITS, KEEP_LAST_X_HOURS;

Review comment:
       KEEP_LATEST_BY_HOURS

##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
##########
@@ -1240,6 +1244,154 @@ public void testKeepLatestCommits(boolean simulateFailureRetry, boolean enableIn
     assertTrue(testTable.baseFileExists(p0, "00000000000005", file3P0C2));
   }
 
+  /**
+   * Test cleaning policy based on number of hours retained policy. This test case covers the case when files will not be cleaned.
+   */
+  @ParameterizedTest
+  @MethodSource("argumentsForTestKeepLatestCommits")
+  public void testKeepXHoursNoCleaning(boolean simulateFailureRetry, boolean enableIncrementalClean, boolean enableBootstrapSourceClean) throws Exception {

Review comment:
       again, is it possible to reuse existing tests for keepLatestcommits rather than rewriting entire tests.

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -402,9 +478,16 @@ private String getLatestVersionBeforeCommit(List<FileSlice> fileSliceList, Hoodi
   public Option<HoodieInstant> getEarliestCommitToRetain() {
     Option<HoodieInstant> earliestCommitToRetain = Option.empty();
     int commitsRetained = config.getCleanerCommitsRetained();
+    int hoursRetained = config.getCleanerHoursRetained();
     if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_COMMITS
         && commitTimeline.countInstants() > commitsRetained) {
-      earliestCommitToRetain = commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained);
+      earliestCommitToRetain = commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained); //15 instants total, 10 commits to retain, this gives 6th instant in the list
+    } else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LAST_X_HOURS) {
+      Instant instant = Instant.now();
+      ZonedDateTime commitDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault());

Review comment:
       Do we have any precedence in hudi code base for doing time based calculations. Can you explore and let me know. Wanna maintain some uniformity if we have any. 

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
##########
@@ -69,6 +69,11 @@
       .withDocumentation("Number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits "
           + "(scheduled). This also directly translates into how much data retention the table supports for incremental queries.");
 
+  public static final ConfigProperty<String> CLEANER_HOURS_RETAINED = ConfigProperty.key("hoodie.cleaner.hours.retained")
+          .defaultValue("5")

Review comment:
       lets have 24 may be. 5 hours is very aggressive. 1 day seems reasonable. 

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -330,6 +336,74 @@ public CleanPlanner(HoodieEngineContext context, HoodieTable<T, I, K, O> hoodieT
     }
     return deletePaths;
   }
+
+  /**
+   * This method finds the files to be cleaned based on the number of hours. If {@code config.getCleanerHoursRetained()} is set to 5,
+   * all the files with commit time earlier than 5 hours will be removed. Also the latest file for any file group is retained.
+   * This policy gives much more flexibility to users for retaining data for running incremental queries as compared to
+   * KEEP_LATEST_COMMITS cleaning policy. The default number of hours is 5.
+   * @param partitionPath partition path to check
+   * @return list of files to clean
+   */
+  private List<CleanFileInfo> getFilesToCleanKeepingLatestHours(String partitionPath) {

Review comment:
       can't we re-use getFilesToCleanKeepingLatestCommits(). all we need to do is to move most of these to a private method and reuse across both. 
   for getFilesToCleanKeepingLatestCommits(), you can pass in the config value for N commits to retain. where as for getFilesToCleanKeepingLatestHours, we can determine how many commits can be retained and then pass in the N value.
   lets try to re-use code as much as possible. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org