Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/04 01:15:20 UTC

[GitHub] [hudi] nsivabalan opened a new pull request, #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

nsivabalan opened a new pull request, #6581:
URL: https://github.com/apache/hudi/pull/6581

   ### Change Logs
   
   Hudi already offers a partition-aware clustering strategy and a recent-partitions-based strategy. These work well when the table is partitioned by date, but not when it is partitioned on some other arbitrary field.
   
   So, this patch introduces a clustering filter mode to filter based on recently altered files. 
   
   For example, if a user configures clustering to run every 5 commits, each clustering run considers only the file groups touched in the last 5 commits. This avoids repeatedly clustering file groups that are already clustered, and it keeps clustering fast because only the delta file groups are considered.
   
   This adds a new config, `hoodie.clustering.plan.filter.mode`, whose possible values are NONE, RECENTLY_UPDATED_FILES and RECENTLY_INSERTED_FILES.
   
   RECENTLY_INSERTED_FILES also benefits users who use clustering only to sort records by some column. It may not make sense to re-cluster (or re-sort) a file group that is already clustered or sorted. With this filter, each time clustering is triggered it can select only the file groups that received inserts in the last N commits.
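
The filtering idea described above can be sketched as follows. This is a minimal illustration, not the code in this PR: it assumes each commit can be reduced to the sets of file group IDs it inserted into or updated. Only the three mode names come from the description; `RecentFileGroupFilterSketch`, `Commit`, and `filter` are hypothetical names, and treating RECENTLY_UPDATED_FILES as "any write, inserts included" is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration; none of these types exist in Hudi.
class RecentFileGroupFilterSketch {

  // Mode names taken from the PR description.
  enum FilterMode { NONE, RECENTLY_UPDATED_FILES, RECENTLY_INSERTED_FILES }

  // One commit, reduced to the file group IDs it inserted into or updated.
  static final class Commit {
    final Set<String> inserted;
    final Set<String> updated;

    Commit(Set<String> inserted, Set<String> updated) {
      this.inserted = inserted;
      this.updated = updated;
    }
  }

  // Keeps the candidates written by the last `lastN` commits, preserving candidate order.
  static List<String> filter(List<String> candidates, List<Commit> commitsOldestFirst,
                             int lastN, FilterMode mode) {
    if (mode == FilterMode.NONE) {
      return new ArrayList<>(candidates);
    }
    // Clamp the window so it never exceeds the number of known commits.
    int window = Math.max(0, Math.min(lastN, commitsOldestFirst.size()));
    int from = commitsOldestFirst.size() - window;
    Set<String> recent = new HashSet<>();
    for (Commit c : commitsOldestFirst.subList(from, commitsOldestFirst.size())) {
      recent.addAll(c.inserted);
      if (mode == FilterMode.RECENTLY_UPDATED_FILES) {
        // Assumption: "updated" means any write, so inserts count too.
        recent.addAll(c.updated);
      }
    }
    List<String> kept = new ArrayList<>();
    for (String id : candidates) {
      if (recent.contains(id)) {
        kept.add(id);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Commit> commits = List.of(
        new Commit(Set.of("fg-1"), Set.of()),
        new Commit(Set.of("fg-2"), Set.of("fg-3")));
    List<String> candidates = List.of("fg-1", "fg-2", "fg-3", "fg-4");
    System.out.println(filter(candidates, commits, 1, FilterMode.RECENTLY_UPDATED_FILES));
    System.out.println(filter(candidates, commits, 1, FilterMode.RECENTLY_INSERTED_FILES));
  }
}
```

In this sketch a file group that no recent commit wrote to (`fg-4`) is never selected, which is the property that keeps already clustered file groups out of the next plan.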
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   **Risk level: low/medium**
   
   This is an enhancement to clustering that could benefit some users, depending on their needs.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] HEPBO3AH commented on pull request #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

Posted by GitBox <gi...@apache.org>.
HEPBO3AH commented on PR #6581:
URL: https://github.com/apache/hudi/pull/6581#issuecomment-1328350889

   A general question:
   **Will this cluster other small files that might be in the partition that was touched that are not part of the current group?**
   
   Think of the following example:
   Timestamped event data arrives from devices once every hour, up to 24 times per day.
   Data is ingested in a single batch every time the processing is run.
   1% of the devices are offline for up to 100 days.
   The storage has daily partitions.
   
   Over the course of 100 days, the 1% of devices create up to `100 * 24 = 2400` file groups in the partition that is 100 days before `today`.
   
   With this PR merged, will all of those files be clustered?




[GitHub] [hudi] hudi-bot commented on pull request #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6581:
URL: https://github.com/apache/hudi/pull/6581#issuecomment-1287683415

   ## CI report:
   
   * 31e79a66025ee3abea51e39b45411e14f15885c5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11138) 
   * d3ba36e084b0270252e19932816f6a6acb50fd4e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6581:
URL: https://github.com/apache/hudi/pull/6581#issuecomment-1236232457

   ## CI report:
   
   * 31e79a66025ee3abea51e39b45411e14f15885c5 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6581:
URL: https://github.com/apache/hudi/pull/6581#issuecomment-1287685115

   ## CI report:
   
   * 31e79a66025ee3abea51e39b45411e14f15885c5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11138) 
   * d3ba36e084b0270252e19932816f6a6acb50fd4e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12463) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6581:
URL: https://github.com/apache/hudi/pull/6581#issuecomment-1287798385

   ## CI report:
   
   * d3ba36e084b0270252e19932816f6a6acb50fd4e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12463) 
   




[GitHub] [hudi] codope commented on a diff in pull request #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6581:
URL: https://github.com/apache/hudi/pull/6581#discussion_r1002393516


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -148,13 +151,26 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .key(PLAN_PARTITION_FILTER_MODE)
       .defaultValue(ClusteringPlanPartitionFilterMode.NONE)
       .sinceVersion("0.11.0")
-      .withDocumentation("Partition filter mode used in the creation of clustering plan. Available values are - "
-          + "NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate."
-          + "RECENT_DAYS: keep a continuous range of partitions, worked together with configs '" + DAYBASED_LOOKBACK_PARTITIONS.key() + "' and '"
+      .withDocumentation("Partition Filter mode used in the creation of clustering plan. Available values are - "
+          + "NONE: do not filter anything and thus the clustering plan will include all file slices that have clustering candidate."
+          + "RECENT: keep a continuous range of partitions, worked together with configs '" + DAYBASED_LOOKBACK_PARTITIONS.key() + "' and '"

Review Comment:
   this is still `RECENT_DAYS` right?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/PartitionAwareClusteringPlanStrategy.java:
##########
@@ -83,12 +100,37 @@ public Option<HoodieClusteringPlan> generateClusteringPlan() {
       return Option.empty();
     }
 
+    Option<HoodieDefaultTimeline> toFilterTimeline = Option.empty();
+    Option<HoodieInstant> latestCompletedInstant = Option.empty();
+    // if filtering based on file slices are enabled, we need to pass the toFilter timeline below to assist in filtering.
+    if (config.getClusteringPlanFilterMode() != ClusteringPlanFilterMode.NONE) {
+      Option<HoodieInstant> latestScheduledReplaceCommit = metaClient.getActiveTimeline()
+          .filter(instant -> instant.getAction().equals(HoodieTimeline.REPLACE_COMMIT_ACTION))
+          .filter(instant -> instant.isRequested()).lastInstant();
+      if (latestScheduledReplaceCommit.isPresent()) {
+        HoodieClusteringPlan clusteringPlan = ClusteringUtils.getClusteringPlan(
+                metaClient, HoodieTimeline.getReplaceCommitRequestedInstant(latestScheduledReplaceCommit.get().getTimestamp()))
+            .map(Pair::getRight).orElseThrow(() -> new HoodieClusteringException(
+                "Unable to read clustering plan for instant: " + latestScheduledReplaceCommit.get().getTimestamp()));
+        if (!StringUtils.isNullOrEmpty(clusteringPlan.getLatestCompletedInstant())) {
+          toFilterTimeline = Option.of((HoodieDefaultTimeline)
+              metaClient.getActiveTimeline().filter(instant -> HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.GREATER_THAN, clusteringPlan.getLatestCompletedInstant())));
+        }
+      } else {
+        // if last clustering was archived
+        toFilterTimeline = Option.of(metaClient.getActiveTimeline());
+      }
+      latestCompletedInstant = metaClient.getActiveTimeline().filterCompletedInstants().lastInstant();
+    }
+
+    Option<HoodieDefaultTimeline> finalToFilterTimeline = toFilterTimeline;
     List<HoodieClusteringGroup> clusteringGroups = getEngineContext()
         .flatMap(
             partitionPaths,
             partitionPath -> {
               List<FileSlice> fileSlicesEligible = getFileSlicesEligibleForClustering(partitionPath).collect(Collectors.toList());
-              return buildClusteringGroupsForPartition(partitionPath, fileSlicesEligible).limit(getWriteConfig().getClusteringMaxNumGroups());
+              List<FileSlice> filteredFileSlices = filterFileSlices(fileSlicesEligible, finalToFilterTimeline, config.getClusteringPlanFilterMode());

Review Comment:
   This follows partition filtering. Should we disable one when the other is enabled? Otherwise it creates complicated combinations, e.g. when users want certain partitions to be clustered using the SELECTED_PARTITIONS strategy regardless of whether they were part of recent commits.
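
For readers following the hunk above, the cutoff it computes can be sketched independently of Hudi's classes. This is a rough model under stated assumptions, not the PR's implementation: instants are represented as fixed-width timestamp strings compared lexicographically, and `TimelineCutoffSketch`, `instantsToFilter`, `pendingPlanFound`, and `recordedInstant` are hypothetical names standing in for the timeline, the pending clustering plan lookup, and the plan's recorded latest completed instant.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical stand-in for the timeline logic in the diff; not Hudi's API.
class TimelineCutoffSketch {

  // Returns the instants whose writes should be considered when filtering file slices.
  static Optional<List<String>> instantsToFilter(List<String> activeInstants,
                                                 boolean pendingPlanFound,
                                                 String recordedInstant) {
    if (!pendingPlanFound) {
      // No requested clustering instant on the active timeline (e.g. it was archived):
      // fall back to the whole active timeline.
      return Optional.of(activeInstants);
    }
    if (recordedInstant == null || recordedInstant.isEmpty()) {
      // A plan exists but recorded no cutoff: nothing to filter against.
      return Optional.empty();
    }
    // Keep only instants strictly newer than the cutoff recorded in the plan.
    return Optional.of(activeInstants.stream()
        .filter(ts -> ts.compareTo(recordedInstant) > 0)
        .collect(Collectors.toList()));
  }
}
```

The three branches mirror the three outcomes in the hunk: a filtered timeline, the full active timeline, or no timeline at all.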



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -61,6 +62,8 @@ public class HoodieClusteringConfig extends HoodieConfig {
       "org.apache.hudi.client.clustering.run.strategy.JavaSortAndSizeExecutionStrategy";
   public static final String PLAN_PARTITION_FILTER_MODE =
       "hoodie.clustering.plan.partition.filter.mode";
+  public static final String PLAN_FILTER_MODE =

Review Comment:
   Is an additional config really necessary? Can we fold this into the existing config? Though I like the name of the new config, both configs serve a common purpose of filtering (in different ways), and the fewer configs the better. If we have to define a new config, then let's rename it to `hoodie.clustering.plan.filegroup.filter.mode` to be more explicit and to differentiate it from the existing partition filter config.
   





[GitHub] [hudi] hudi-bot commented on pull request #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6581:
URL: https://github.com/apache/hudi/pull/6581#issuecomment-1236232961

   ## CI report:
   
   * 31e79a66025ee3abea51e39b45411e14f15885c5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11138) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6581: [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6581:
URL: https://github.com/apache/hudi/pull/6581#issuecomment-1236256782

   ## CI report:
   
   * 31e79a66025ee3abea51e39b45411e14f15885c5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11138) 
   




Re: [PR] [HUDI-4773] Adding New fitler mode to clustering to filter for recently touched files [hudi]

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #6581:
URL: https://github.com/apache/hudi/pull/6581#discussion_r1435369701


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -153,13 +156,26 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .key(PLAN_PARTITION_FILTER_MODE)
       .defaultValue(ClusteringPlanPartitionFilterMode.NONE)
       .sinceVersion("0.11.0")
-      .withDocumentation("Partition filter mode used in the creation of clustering plan. Available values are - "
-          + "NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate."
-          + "RECENT_DAYS: keep a continuous range of partitions, worked together with configs '" + DAYBASED_LOOKBACK_PARTITIONS.key() + "' and '"
+      .withDocumentation("Partition Filter mode used in the creation of clustering plan. Available values are - "
+          + "NONE: do not filter anything and thus the clustering plan will include all file slices that have clustering candidate."
+          + "RECENT: keep a continuous range of partitions, worked together with configs '" + DAYBASED_LOOKBACK_PARTITIONS.key() + "' and '"
           + PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST.key() + "."
           + "SELECTED_PARTITIONS: keep partitions that are in the specified range ['" + PARTITION_FILTER_BEGIN_PARTITION.key() + "', '"
           + PARTITION_FILTER_END_PARTITION.key() + "'].");
 
+  public static final ConfigProperty<ClusteringPlanFilterMode> PLAN_FILTER_MODE_NAME = ConfigProperty
+      .key(PLAN_FILTER_MODE)
+      .defaultValue(ClusteringPlanFilterMode.NONE)
+      .sinceVersion("0.13.0")
+      .withDocumentation("Filter mode used in the creation of clustering plan. Available values are - "
+      + "NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate."

Review Comment:
   Note to self: might need fixing. 


