Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/29 14:48:10 UTC

[GitHub] [hudi] minihippo opened a new pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

minihippo opened a new pull request #4152:
URL: https://github.com/apache/hudi/pull/4152


   …ite amplification by using LogFileSizeThresholdBasedCompactionStrategy.
   
   ## What is the purpose of the pull request
   A huge table has many file groups, and each file group can be larger than a GB. To compact more file groups, we have to increase the compaction target IO size. However, this parameter is difficult to tune. If the target IO size is large, file groups with small log files are also compacted, which causes write amplification. If the target IO size is small, more file groups with large log files wait for compaction, which degrades read performance.
   Based on the `LogFileSizeBasedCompactionStrategy`, the new compaction strategy keeps only the file groups whose log file size is greater than the threshold (500M) when planning the compaction.
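
As a rough illustration of the planning rule described above, here is a minimal, self-contained Java sketch. It is not the code from this pull request. `FileGroupSketch`, `ThresholdPlanSketch`, and the single byte budget standing in for the target IO size are assumptions made for the example; Hudi's own strategies operate on `HoodieCompactionOperation` and may account for IO differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a file group and its pending log data; not a Hudi class. */
final class FileGroupSketch {
  final String id;
  final long logBytes;

  FileGroupSketch(String id, long logBytes) {
    this.id = id;
    this.logBytes = logBytes;
  }
}

final class ThresholdPlanSketch {
  /**
   * Keeps file groups whose total log size is at least thresholdBytes, orders them
   * largest first, and stops once the next group would exceed the byte budget.
   * The budget is a simplified stand-in for the target IO size.
   */
  static List<String> plan(List<FileGroupSketch> groups, long thresholdBytes, long ioBudgetBytes) {
    List<FileGroupSketch> candidates = groups.stream()
        .filter(g -> g.logBytes >= thresholdBytes)
        .sorted(Comparator.comparingLong((FileGroupSketch g) -> g.logBytes).reversed())
        .collect(Collectors.toList());

    List<String> selected = new ArrayList<>();
    long used = 0L;
    for (FileGroupSketch g : candidates) {
      if (used + g.logBytes > ioBudgetBytes) {
        break;
      }
      selected.add(g.id);
      used += g.logBytes;
    }
    return selected;
  }
}
```

With the threshold in place, raising the budget only admits more large-log file groups; small-log groups stay out of the plan regardless of how much budget remains.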
   
   
   ## Brief change log
   
   - add `LogFileSizeThresholdBasedCompactionStrategy`
   
   ## Verify this pull request
   
   - add test case to `TestHoodieCompactionStrategy`
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-981708840


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-981706570


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c UNKNOWN
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-981756626


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   



[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983779435


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921",
       "triggerID" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3922",
       "triggerID" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921) 
   * e380bba0e6397ba02dafbbbbdbfcd920a79348ba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3922) 
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983735256


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921",
       "triggerID" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921) 
   * e380bba0e6397ba02dafbbbbdbfcd920a79348ba UNKNOWN
   



[GitHub] [hudi] leesf commented on a change in pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#discussion_r759330065



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
##########
@@ -166,6 +166,11 @@
       .withDocumentation("Amount of MBs to spend during compaction run for the LogFileSizeBasedCompactionStrategy. "
           + "This value helps bound ingestion latency while compaction is run inline mode.");
 
+  public static final ConfigProperty<Long> COMPACTION_LOG_FILE_SIZE_THRESHOLD = ConfigProperty
+      .key("hoodie.compaction.logfile.size.threshold")
+      .defaultValue(1024 * 1024 * 1024L)
+      .withDocumentation("Only if the log file size is greater than the threshold, the file group will be compacted.");

Review comment:
       nit: `than the threshold` -> `than the threshold in bytes`?
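
To make the unit question concrete, the sketch below writes the threshold as a byte count under the key shown in the diff. It assumes the value is interpreted in bytes, as the `1024 * 1024 * 1024L` default suggests; `ThresholdConfigSketch` and its helper methods are hypothetical, and how the resulting `Properties` would be handed to a Hudi writer is not shown.

```java
import java.util.Properties;

/** Hypothetical helper; only the property key comes from the diff above. */
final class ThresholdConfigSketch {
  static final String KEY = "hoodie.compaction.logfile.size.threshold";

  /** Converts a size in MiB to a byte count. */
  static long mebibytesToBytes(long mib) {
    return mib * 1024L * 1024L;
  }

  /** Builds a Properties object holding the threshold as a byte count (assumed unit). */
  static Properties withThresholdMiB(long mib) {
    Properties props = new Properties();
    props.setProperty(KEY, Long.toString(mebibytesToBytes(mib)));
    return props;
  }
}
```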







[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-981706570


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c UNKNOWN
   



[GitHub] [hudi] minihippo commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
minihippo commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-981704859


   @leesf could you help me review this PR?





[GitHub] [hudi] hudi-bot removed a comment on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983711087


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 UNKNOWN
   



[GitHub] [hudi] minihippo commented on a change in pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
minihippo commented on a change in pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#discussion_r760272304



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/LogFileSizeBasedCompactionStrategy.java
##########
@@ -39,8 +40,12 @@
   @Override
   public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig writeConfig,
       List<HoodieCompactionOperation> operations, List<HoodieCompactionPlan> pendingCompactionPlans) {
+    // Filter the file group which log files size is greater than the threshold in bytes.
     // Order the operations based on the reverse size of the logs and limit them by the IO
-    return super.orderAndFilter(writeConfig, operations.stream().sorted(this).collect(Collectors.toList()),
+    long threshold = writeConfig.getCompactionLogFileSizeThreshold();

Review comment:
       Already fixed.







[GitHub] [hudi] minihippo commented on a change in pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
minihippo commented on a change in pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#discussion_r760248613



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/LogFileSizeThresholdBasedCompactionStrategy.java
##########
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.compact.strategy;
+
+import org.apache.hudi.avro.model.HoodieCompactionOperation;
+import org.apache.hudi.avro.model.HoodieCompactionPlan;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * LogFileSizeThresholdBasedCompactionStrategy orders the compactions based on the total log files size,
+ * filters the file group which log files size is less than the threshold and limits the
+ * compactions within a configured IO bound.
+ *
+ * @see LogFileSizeBasedCompactionStrategy
+ * @see BoundedIOCompactionStrategy
+ * @see CompactionStrategy
+ */
+public class LogFileSizeThresholdBasedCompactionStrategy extends LogFileSizeBasedCompactionStrategy {
+  @Override
+  public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig config,
+                                                        List<HoodieCompactionOperation> operations,
+                                                        List<HoodieCompactionPlan> pendingCompactionWorkloads) {
+    long threshold = config.getCompactionLogFileSizeThreshold();
+    return super.orderAndFilter(config, operations.stream()
+        .filter(e -> e.getMetrics().getOrDefault(TOTAL_LOG_FILE_SIZE, 0d) >= threshold)

Review comment:
       Fixed.
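
The filter line in the hunk above compares a metric looked up with a default of `0d` against the threshold using `>=`. The sketch below isolates that comparison so its edge cases are easy to see. It is only an illustration: `LogSizeMetricSketch` is hypothetical, the metrics map is assumed to be a `Map<String, Double>`, and the key string is a placeholder because the real constant's value does not appear in the diff.

```java
import java.util.Map;

/** Hypothetical illustration of the threshold comparison; not a Hudi class. */
final class LogSizeMetricSketch {
  // Placeholder key; the value of the real TOTAL_LOG_FILE_SIZE constant is not shown in the diff.
  static final String TOTAL_LOG_FILE_SIZE = "total_log_file_size_placeholder";

  /**
   * Returns true when the recorded log size is at least thresholdBytes.
   * A missing metric counts as zero bytes, and the long threshold is widened
   * to double for the comparison.
   */
  static boolean meetsThreshold(Map<String, Double> metrics, long thresholdBytes) {
    return metrics.getOrDefault(TOTAL_LOG_FILE_SIZE, 0d) >= thresholdBytes;
  }
}
```

Under these assumptions, a file group whose log size equals the threshold is kept, and a file group with no recorded log size is kept only when the threshold is 0.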







[GitHub] [hudi] minihippo commented on a change in pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
minihippo commented on a change in pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#discussion_r760248760



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
##########
@@ -166,6 +166,11 @@
       .withDocumentation("Amount of MBs to spend during compaction run for the LogFileSizeBasedCompactionStrategy. "
           + "This value helps bound ingestion latency while compaction is run inline mode.");
 
+  public static final ConfigProperty<Long> COMPACTION_LOG_FILE_SIZE_THRESHOLD = ConfigProperty
+      .key("hoodie.compaction.logfile.size.threshold")
+      .defaultValue(1024 * 1024 * 1024L)
+      .withDocumentation("Only if the log file size is greater than the threshold, the file group will be compacted.");

Review comment:
       Already fixed.







[GitHub] [hudi] leesf merged pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
leesf merged pull request #4152:
URL: https://github.com/apache/hudi/pull/4152


   





[GitHub] [hudi] leesf commented on a change in pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#discussion_r760260198



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/LogFileSizeBasedCompactionStrategy.java
##########
@@ -39,8 +40,12 @@
   @Override
   public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig writeConfig,
       List<HoodieCompactionOperation> operations, List<HoodieCompactionPlan> pendingCompactionPlans) {
+    // Filter the file group which log files size is greater than the threshold in bytes.
     // Order the operations based on the reverse size of the logs and limit them by the IO
-    return super.orderAndFilter(writeConfig, operations.stream().sorted(this).collect(Collectors.toList()),
+    long threshold = writeConfig.getCompactionLogFileSizeThreshold();

Review comment:
       I think the default value of `"hoodie.compaction.logfile.size.threshold"` should be 0 here to keep compatibility, so that the threshold only takes effect once it is set to another value.
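
The compatibility argument can be checked with a small sketch: since log sizes are never negative, a threshold of 0 lets every file group through, so the ordering and IO limit that follow see the same input as before the change. `ZeroThresholdSketch` is a hypothetical stand-in that works on plain byte counts rather than Hudi's operations, and the thread does not show which default was finally merged.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration of a zero threshold acting as a no-op filter. */
final class ZeroThresholdSketch {
  /** Keeps the sizes that are at least thresholdBytes, preserving input order. */
  static List<Long> keepAtOrAbove(List<Long> logSizesBytes, long thresholdBytes) {
    return logSizesBytes.stream()
        .filter(size -> size >= thresholdBytes)
        .collect(Collectors.toList());
  }
}
```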







[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983713579


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921",
       "triggerID" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921) 
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983779435


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921",
       "triggerID" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3922",
       "triggerID" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921) 
   * e380bba0e6397ba02dafbbbbdbfcd920a79348ba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3922) 
   



[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983711087


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 UNKNOWN
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983732406


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921",
       "triggerID" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921) 
   * e380bba0e6397ba02dafbbbbdbfcd920a79348ba UNKNOWN
   



[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983831487


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886",
       "triggerID" : "51e4a9e874a5b1c578ba6508fbf2652714c9626c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921",
       "triggerID" : "360acc1684b502ca6209efa3f6daa55a6ab6ace4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3922",
       "triggerID" : "e380bba0e6397ba02dafbbbbdbfcd920a79348ba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e380bba0e6397ba02dafbbbbdbfcd920a79348ba Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3922) 
   



[GitHub] [hudi] leesf commented on a change in pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#discussion_r759787128



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/LogFileSizeThresholdBasedCompactionStrategy.java
##########
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.compact.strategy;
+
+import org.apache.hudi.avro.model.HoodieCompactionOperation;
+import org.apache.hudi.avro.model.HoodieCompactionPlan;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * LogFileSizeThresholdBasedCompactionStrategy orders compactions by total log file size,
+ * filters out file groups whose total log file size is below the threshold, and limits the
+ * compactions to a configured IO bound.
+ *
+ * @see LogFileSizeBasedCompactionStrategy
+ * @see BoundedIOCompactionStrategy
+ * @see CompactionStrategy
+ */
+public class LogFileSizeThresholdBasedCompactionStrategy extends LogFileSizeBasedCompactionStrategy {
+  @Override
+  public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig config,
+                                                        List<HoodieCompactionOperation> operations,
+                                                        List<HoodieCompactionPlan> pendingCompactionWorkloads) {
+    long threshold = config.getCompactionLogFileSizeThreshold();
+    return super.orderAndFilter(config, operations.stream()
+        .filter(e -> e.getMetrics().getOrDefault(TOTAL_LOG_FILE_SIZE, 0d) >= threshold)

Review comment:
       Can we move this filter into `LogFileSizeBasedCompactionStrategy` to avoid introducing a new strategy?
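
For illustration only, the filtering step in the diff above can be sketched outside Hudi as follows. `FileGroupOp`, `ThresholdFilterSketch`, and the metric key string are hypothetical stand-ins, not Hudi classes. The sketch assumes the parent strategy orders file groups by total log file size, largest first, and it omits the IO bound that `super.orderAndFilter` applies.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for a compaction operation (not HoodieCompactionOperation).
final class FileGroupOp {
  final String fileId;
  final Map<String, Double> metrics;

  FileGroupOp(String fileId, Map<String, Double> metrics) {
    this.fileId = fileId;
    this.metrics = metrics;
  }
}

final class ThresholdFilterSketch {
  // Metric key assumed for this sketch only.
  static final String TOTAL_LOG_FILE_SIZE = "TOTAL_LOG_FILE_SIZE";

  // Total log file size of a file group; a missing metric counts as 0.
  static double logSize(FileGroupOp op) {
    return op.metrics.getOrDefault(TOTAL_LOG_FILE_SIZE, 0d);
  }

  // Keep file groups whose log size is at least the threshold,
  // then order the survivors from largest to smallest log size.
  static List<FileGroupOp> orderAndFilter(List<FileGroupOp> ops, long thresholdBytes) {
    return ops.stream()
        .filter(op -> logSize(op) >= thresholdBytes)
        .sorted((a, b) -> Double.compare(logSize(b), logSize(a)))
        .collect(Collectors.toList());
  }
}
```

Whether this step lives in a new subclass or inside the existing strategy, the filter itself is the same; a threshold of 0 keeps every file group, so a single strategy could keep its current behavior by default.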







[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-981756626


   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   



[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983735256


   ## CI report:
   
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921) 
   * e380bba0e6397ba02dafbbbbdbfcd920a79348ba UNKNOWN
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-981708840


   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   



[GitHub] [hudi] hudi-bot commented on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983732406


   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921) 
   * e380bba0e6397ba02dafbbbbdbfcd920a79348ba UNKNOWN
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #4152: [HUDI-2881] Compact the file group with larger log files to reduce wr…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4152:
URL: https://github.com/apache/hudi/pull/4152#issuecomment-983713579


   ## CI report:
   
   * 51e4a9e874a5b1c578ba6508fbf2652714c9626c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3886) 
   * 360acc1684b502ca6209efa3f6daa55a6ab6ace4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3921) 
   