Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/05 09:58:48 UTC

[GitHub] [hudi] boneanxs opened a new pull request, #6046: [HUDI-4363] Support Clustering row writer to improve performance

boneanxs opened a new pull request, #6046:
URL: https://github.com/apache/hudi/pull/6046

   ## What is the purpose of the pull request
   Enable the row writer for clustering to improve performance.
   
   ## Brief change log
   1. Integrate clustering with the datasource read and write APIs. This way:
      - clustering can use the Dataset API
      - read and write operations are unified, so clustering also benefits from any improvement to the read/write logic
   2. Use `hoodie.datasource.read.paths` to pass the paths for each clusteringOperation
   3. Introduce `HoodieInternalWriteStatusCoordinator` to persist the `InternalWriteStatus` of a clustering action, since it cannot be retrieved when writing through the Spark datasource
   4. Add new configs to control this behavior
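
A minimal sketch of the coordinator idea in item 3, assuming the write path receives an identifier through its options (a later diff in this thread passes a UUID this way). The class and method names below are hypothetical, and plain strings stand in for `InternalWriteStatus`; this is not the code from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a write-status coordinator; not the class from this PR. */
final class WriteStatusCoordinatorSketch {
  // Keyed by an identifier the caller chooses before the datasource write starts.
  private static final Map<String, List<String>> STATUSES = new ConcurrentHashMap<>();

  private WriteStatusCoordinatorSketch() {}

  /** Called on the write path once the statuses for this identifier are known. */
  static void put(String identifyId, List<String> statuses) {
    STATUSES.put(identifyId, new ArrayList<>(statuses));
  }

  /** Called by the clustering caller after the write returns; removes the entry so it cannot leak. */
  static List<String> take(String identifyId) {
    List<String> statuses = STATUSES.remove(identifyId);
    return statuses == null ? new ArrayList<>() : statuses;
  }
}
```

Removing the entry on read keeps the shared map from growing across clustering runs; whether the real coordinator does this is not stated in the description.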
   
   ## Verify this pull request
   Manual test:
   The test table has 21 columns and 710716 rows; raw data size is 929g (in Spark memory), 38.3g after compression.
   Executor memory: 50g, 20 instances, with global_sort enabled.
   
   Without clustering as row: 32 min 12 sec
   Using clustering as row: 9 min 51 sec
   
   Also changed the existing tests (`TestHoodieSparkMergeOnReadTableClustering` and `testLayoutOptimizationFunctional`) to cover this feature.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r972820986


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -160,9 +160,6 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
         fileFormat = fileFormat,
         optParams)(sparkSession)
     } else {
-      val readPathsStr = optParams.get(DataSourceReadOptions.READ_PATHS.key)

Review Comment:
   Using `READ_PATHS` alone may miss glob paths when the user calls `spark.read.format("hudi").load("glob.path")`, so this changes the code to use `globPaths` directly, which covers both cases.
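
To illustrate why relying on `READ_PATHS` alone can drop input, here is a small sketch that resolves paths from both sources. It assumes `hoodie.datasource.read.paths` holds a comma-separated list; `ReadPathsSketch` and `resolvePaths` are hypothetical names, not Hudi APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; shows one way a single path list could cover both inputs. */
final class ReadPathsSketch {
  static final String READ_PATHS_KEY = "hoodie.datasource.read.paths";

  private ReadPathsSketch() {}

  /** Union of the option value and the paths given to load(), in order, without duplicates. */
  static List<String> resolvePaths(Map<String, String> options, List<String> loadPaths) {
    Set<String> resolved = new LinkedHashSet<>();
    String readPaths = options.get(READ_PATHS_KEY);
    if (readPaths != null) {
      for (String path : readPaths.split(",")) {
        String trimmed = path.trim();
        if (!trimmed.isEmpty()) {
          resolved.add(trimmed);
        }
      }
    }
    // Paths passed directly to load() are lost if only the option is consulted.
    resolved.addAll(loadPaths);
    return new ArrayList<>(resolved);
  }
}
```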





[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r972929585


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -160,9 +160,6 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
         fileFormat = fileFormat,
         optParams)(sparkSession)
     } else {
-      val readPathsStr = optParams.get(DataSourceReadOptions.READ_PATHS.key)

Review Comment:
   Should this also take `globPaths` into account? For example, `spark.read.format("hudi").load("basePath/*/*/file")`.
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1250218478

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r972551309


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -98,10 +106,18 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final Hood
     // execute clustering for each group async and collect WriteStatus
     Stream<HoodieData<WriteStatus>> writeStatusesStream = FutureUtils.allOf(
         clusteringPlan.getInputGroups().stream()
-        .map(inputGroup -> runClusteringForGroupAsync(inputGroup,
-            clusteringPlan.getStrategy().getStrategyParams(),
-            Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false),
-            instantTime))
+            .map(inputGroup -> {
+              if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) {

Review Comment:
   `HoodieWriteConfig` holds common Hudi configs, while `hoodie.datasource.write.row.writer.enable` is specific to Spark, so moving this config from `DataSourceOptions` to `HoodieWriteConfig` may not be a good fit.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -1084,7 +1084,7 @@ public boolean isEmbeddedTimelineServerEnabled() {
   }
 
   public boolean isEmbeddedTimelineServerReuseEnabled() {
-    return Boolean.parseBoolean(getStringOrDefault(EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED));
+    return getBoolean(EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED);

Review Comment:
   `EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED` is a `ConfigProperty`, so `getBoolean` handles this:
   
   ```java
   public <T> Boolean getBoolean(ConfigProperty<T> configProperty) {
     if (configProperty.hasDefaultValue()) {
       return getBooleanOrDefault(configProperty);
     }
     Option<Object> rawValue = getRawValue(configProperty);
     return rawValue.map(v -> Boolean.parseBoolean(v.toString())).orElse(null);
   }
   ```
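
A simplified, self-contained sketch of the semantics quoted above, comparing the old call-site style with the new one. `BoolProperty` and `ConfigSketch` are hypothetical stand-ins for Hudi's classes, and the sketch assumes `getBooleanOrDefault` returns the explicitly set value when present and the default otherwise.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified property: a key plus an optional default. */
final class BoolProperty {
  final String key;
  final String defaultValue; // null means "no default"

  BoolProperty(String key, String defaultValue) {
    this.key = key;
    this.defaultValue = defaultValue;
  }
}

/** Hypothetical, simplified config holder backed by a string map. */
final class ConfigSketch {
  private final Map<String, String> props = new HashMap<>();

  void set(String key, String value) {
    props.put(key, value);
  }

  /** Old call-site style: fetch the string (or default) and parse it by hand. */
  boolean parseByHand(BoolProperty property) {
    return Boolean.parseBoolean(props.getOrDefault(property.key, property.defaultValue));
  }

  /** New call-site style: null only when the key is unset and no default exists. */
  Boolean getBoolean(BoolProperty property) {
    String raw = props.containsKey(property.key) ? props.get(property.key) : property.defaultValue;
    return raw == null ? null : Boolean.parseBoolean(raw);
  }
}
```

For a property that has a default, both styles agree, which is why the one-line replacement in the diff is behavior-preserving under these assumptions. A caller that unboxes the result for a property without a default would need to handle `null`.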





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1201062678

   ## CI report:
   
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486) 
   


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r971287163


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkSortAndSizeExecutionStrategy.java:
##########
@@ -52,6 +55,27 @@ public SparkSortAndSizeExecutionStrategy(HoodieTable table,
     super(table, engineContext, writeConfig);
   }
 
+  @Override
+  public HoodieData<WriteStatus> performClusteringWithRecordsRow(Dataset<Row> inputRecords, int numOutputGroups,
+                                                                 String instantTime, Map<String, String> strategyParams, Schema schema,
+                                                                 List<HoodieFileGroupId> fileGroupIdList, boolean preserveHoodieMetadata) {
+    LOG.info("Starting clustering for a group, parallelism:" + numOutputGroups + " commit:" + instantTime);
+    HoodieWriteConfig newConfig = HoodieWriteConfig.newBuilder()
+        .withBulkInsertParallelism(numOutputGroups)
+        .withProps(getWriteConfig().getProps()).build();
+
+    boolean shouldPreserveHoodieMetadata = preserveHoodieMetadata;

Review Comment:
   That should be decided wherever it is used, but we shouldn't be overriding one with the other here.





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1201056933

   ## CI report:
   
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1207764328

   ## CI report:
   
   * 5a6ac9622379715e890f1ec1cd7be9422febeb5c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597) 
   * e216664929bd2e01bc1eafa564ec4ebc745b1c34 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1180149751

   ## CI report:
   
   * 58cf2096e648ccc8c7e7c563003753ce89a90261 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730) 
   * dd79426d234315d90b2deffcd54dc0e9ab43e38e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843) 
   


[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r955765271


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -131,6 +161,53 @@ public abstract HoodieData<WriteStatus> performClusteringWithRecordsRDD(final Ho
                                                                        final Map<String, String> strategyParams, final Schema schema,
                                                                        final List<HoodieFileGroupId> fileGroupIdList, final boolean preserveHoodieMetadata);
 
+  protected HoodieData<WriteStatus> performRowWrite(Dataset<Row> inputRecords, Map<String, String> parameters) {
+    String uuid = UUID.randomUUID().toString();
+    parameters.put(HoodieWriteConfig.BULKINSERT_ROW_IDENTIFY_ID.key(), uuid);
+    try {
+      inputRecords.write()
+          .format("hudi")
+          .options(JavaConverters.mapAsScalaMapConverter(parameters).asScala())
+          .mode(SaveMode.Append)
+          .save(getWriteConfig().getBasePath());

Review Comment:
   I see, yeah, this is a good improvement, will change it





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249996254

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 1587f472f18d7b524971637abe64d171c9799818 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r945555810


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +398,62 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> pairs = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            pairs.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            pairs.add(op.getDataFilePath());
+          }
+          return pairs;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      String[] paths = new String[baseFilePaths.length + deltaPaths.length];
+      System.arraycopy(baseFilePaths, 0, paths, 0, baseFilePaths.length);
+      System.arraycopy(deltaPaths, 0, paths, baseFilePaths.length, deltaPaths.length);
+      inputRecords = sqlContext.read()
+          .format("org.apache.hudi")
+          .option("hoodie.datasource.query.type", "snapshot")
+          .option("compaction.memory.fraction", compactionFractor)
+          .option("as.of.instant", instantTime)
+          .option("hoodie.datasource.read.paths", String.join(",", paths))
+          .load();
+    } else {
+      inputRecords = sqlContext.read()
+          .format("org.apache.hudi")
+          .option("as.of.instant", instantTime)

Review Comment:
   We already collect all the parquet files that need to be clustered; do we still need to set "as.of.instant"?
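
For context, the hunk above passes the collected files to the reader as one comma-separated `hoodie.datasource.read.paths` value. The following is a minimal, standalone sketch of that joining step only; the class and method names (`ReadPathsSketch`, `joinPaths`) are hypothetical and not part of the PR, and the sketch filters empty entries from both arrays, which is an assumption rather than what the diff does.

```java
import java.util.Arrays;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReadPathsSketch {
    // Hypothetical helper: concatenate base-file and delta-file paths and
    // join them with commas, skipping null or empty entries.
    static String joinPaths(String[] baseFilePaths, String[] deltaPaths) {
        return Stream.concat(Arrays.stream(baseFilePaths), Arrays.stream(deltaPaths))
            .filter(p -> p != null && !p.isEmpty())
            .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        String[] base = {"part/a.parquet", "part/b.parquet"};
        String[] delta = {"part/.a.log.1"};
        System.out.println(joinPaths(base, delta));
    }
}
```

Using `Stream.concat` avoids the manual `System.arraycopy` bookkeeping, at the cost of an extra stream pipeline; either approach yields the same joined string for non-empty inputs.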





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1253140705

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 46004b031d07d220812c5cdb19e9cd66552ceacc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11522) 
   * af01ca4e55cc478dc252fb443c525c15780eefa9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11540) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1259873129

   @boneanxs can you please create a Jira for your investigation and link it here, so that it's easier to discover?




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249230524

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423) 
   * 988e4874af3065d6879f9adc40c7483a84467f72 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r974817505


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -105,6 +108,52 @@ object HoodieDatasetBulkInsertHelper extends Logging {
     partitioner.repartitionRecords(trimmedDF, config.getBulkInsertShuffleParallelism)
   }
 
+  /**
+   * Perform bulk insert for [[Dataset<Row>]], will not change timeline/index, return
+   * information about write files.
+   */
+  def bulkInsert(dataset: Dataset[Row],
+                 instantTime: String,
+                 table: HoodieTable[_ <: HoodieRecordPayload[_ <: HoodieRecordPayload[_ <: AnyRef]], _, _, _],
+                 writeConfig: HoodieWriteConfig,
+                 partitioner: BulkInsertPartitioner[Dataset[Row]],
+                 parallelism: Int,
+                 shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = {
+    val repartitionedDataset = partitioner.repartitionRecords(dataset, parallelism)
+    val arePartitionRecordsSorted = partitioner.arePartitionRecordsSorted
+    val schema = dataset.schema
+    val writeStatuses = repartitionedDataset.queryExecution.toRdd.mapPartitions(iter => {
+      val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier
+      val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
+      val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
+      val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get
+      val writer = new BulkInsertDataInternalWriterHelper(
+        table,
+        writeConfig,
+        instantTime,
+        taskPartitionId,
+        taskId,
+        taskEpochId,
+        schema,
+        writeConfig.populateMetaFields,
+        arePartitionRecordsSorted,
+        shouldPreserveHoodieMetadata)
+
+      try {
+        iter.foreach(writer.write)
+      } catch {
+        case t: Throwable =>
+          writer.abort()
+          throw t
+      } finally {
+        writer.close()
+      }
+
+      writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
+    }).collect()
+    table.getContext.parallelize(writeStatuses.toList.asJava)

Review Comment:
   `writeStatuses` is an `Array`, but the `parallelize` function needs a `List`, hence the `toList.asJava` conversion.
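
The same constraint can be shown in plain Java as a minimal sketch: a method that accepts only a `List` cannot take an array directly, so the collected array is wrapped first. The `parallelize` method below is a hypothetical stand-in used for illustration, not Hudi's actual API, and `ArrayToListSketch` and `fromCollected` are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArrayToListSketch {
    // Hypothetical stand-in for an API that only accepts a List.
    static <T> List<T> parallelize(List<T> data) {
        return new ArrayList<>(data);
    }

    // Wrap the collected array in a List view before handing it over.
    static <T> List<T> fromCollected(T[] collected) {
        return parallelize(Arrays.asList(collected));
    }

    public static void main(String[] args) {
        String[] collected = {"status-1", "status-2"};
        System.out.println(fromCollected(collected));
    }
}
```

`Arrays.asList` returns a fixed-size view backed by the array, so the stand-in copies it into an `ArrayList` to avoid sharing state with the caller.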





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251834066

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470) 
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) 
   * e75f6d0031490025107040c1b0093c3c5720a67d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r972822536


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -784,22 +785,6 @@ object DataSourceOptionsHelper {
     ) ++ translateConfigurations(paramsWithGlobalProps)
   }
 
-  def inferKeyGenClazz(props: TypedProperties): String = {

Review Comment:
   Moved this logic to `HoodieSparkKeyGeneratorFactory`, since `HoodieDatasetBulkInsertHelper` needs to determine the `KeyGenClass`.





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1250198558

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 1587f472f18d7b524971637abe64d171c9799818 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432) 
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r946538664


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -211,6 +211,23 @@ public class HoodieWriteConfig extends HoodieConfig {
           + " optimally for common query patterns. For now we support a build-in user defined bulkinsert partitioner org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner"
           + " which can does sorting based on specified column values set by " + BULKINSERT_USER_DEFINED_PARTITIONER_SORT_COLUMNS.key());
 
+  public static final ConfigProperty<String> BULKINSERT_ROW_IDENTIFY_ID = ConfigProperty
+      .key("hoodie.bulkinsert.row.writestatus.id")
+      .noDefaultValue()
+      .withDocumentation("The unique id for each write operation, HoodieInternalWriteStatusCoordinator will use "

Review Comment:
   This is an internal config used by `HoodieDataSourceInternalBatchWrite` and `HoodieDataSourceInternalWriter` to pass `writeStatuses` to the clustering job.
   At first I considered defining it in `DataSourceInternalWriterHelper`, like `INSTANT_TIME_OPT_KEY`, but `ClusteringExecutionStrategy` could not access it there because the `hudi-client` module cannot depend on `hudi-spark-datasource`. So I kept this config in `HoodieWriteConfig`.
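
   To illustrate the idea of handing write statuses back through an id carried in the write config, here is a minimal sketch in plain Java. It is not the PR's `HoodieInternalWriteStatusCoordinator`: the class `WriteStatusRegistrySketch`, its methods, and the use of plain strings as statuses are all hypothetical stand-ins.

   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;

   // Hypothetical sketch: a driver-side registry that lets a caller collect
   // write statuses produced behind an API that returns nothing to the caller.
   final class WriteStatusRegistrySketch {
     private static final Map<String, List<String>> STATUSES = new ConcurrentHashMap<>();

     private WriteStatusRegistrySketch() {}

     // Called by the writer on commit, keyed by the id passed through the write config.
     static void publish(String writeId, List<String> statuses) {
       STATUSES.put(writeId, List.copyOf(statuses));
     }

     // Called by the clustering side after the write returns. The entry is removed
     // so the registry does not grow across writes.
     static List<String> collect(String writeId) {
       List<String> statuses = STATUSES.remove(writeId);
       return statuses == null ? List.of() : statuses;
     }

     public static void main(String[] args) {
       String writeId = "clustering-group-0"; // stand-in for the configured id
       publish(writeId, List.of("file-1: 100 rows", "file-2: 80 rows"));
       System.out.println(collect(writeId)); // the two published statuses
       System.out.println(collect(writeId)); // empty, the entry was already drained
     }
   }
   ```

   Keying by a unique id per write keeps concurrent clustering groups from reading each other's statuses, which is presumably why the id has to be visible to both modules.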





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1236608382

   ## CI report:
   
   * 1c913457d2dd531fd1ecae6b0d60e600f59e261b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760) 
   * b8e848d0f8b32ff3c75762951e3af4c911419927 UNKNOWN
   




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r972697125


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkSortAndSizeExecutionStrategy.java:
##########
@@ -52,6 +55,27 @@ public SparkSortAndSizeExecutionStrategy(HoodieTable table,
     super(table, engineContext, writeConfig);
   }
 
+  @Override
+  public HoodieData<WriteStatus> performClusteringWithRecordsRow(Dataset<Row> inputRecords, int numOutputGroups,

Review Comment:
   Created: https://issues.apache.org/jira/browse/HUDI-4857
   Will implement it.





[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254275527

   Did you re-run your benchmark after the changes we've made? If so, can you please paste the results here?




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r972820986


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -160,9 +160,6 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
         fileFormat = fileFormat,
         optParams)(sparkSession)
     } else {
-      val readPathsStr = optParams.get(DataSourceReadOptions.READ_PATHS.key)

Review Comment:
   Using `READ_PATHS` alone may miss glob paths when a user calls `spark.read.format("hudi").load("glob.path")`, so this now uses `globPaths` directly, which covers both cases.
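
   A rough sketch of what "covering both" could look like, in plain Java rather than the relation's Scala. The option key `hoodie.datasource.read.paths` comes from the PR description; the `path` key for the `load(...)` argument and the class `ReadPathsSketch` are assumptions made for illustration only.

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;
   import java.util.Map;

   // Hypothetical sketch: gather the paths to read from both the load(...) argument
   // and the comma-separated read-paths option, instead of from the option alone.
   final class ReadPathsSketch {
     static final String READ_PATHS_KEY = "hoodie.datasource.read.paths";
     static final String LOAD_PATH_KEY = "path"; // assumed key for the load(...) argument

     private ReadPathsSketch() {}

     static List<String> effectivePaths(Map<String, String> options) {
       List<String> paths = new ArrayList<>();
       String loadPath = options.get(LOAD_PATH_KEY);
       if (loadPath != null && !loadPath.trim().isEmpty()) {
         paths.add(loadPath.trim());
       }
       String readPaths = options.get(READ_PATHS_KEY);
       if (readPaths != null) {
         Arrays.stream(readPaths.split(","))
             .map(String::trim)
             .filter(p -> !p.isEmpty())
             .forEach(paths::add);
       }
       return paths;
     }

     public static void main(String[] args) {
       // A glob passed through load(...) is picked up even when the option is unset.
       System.out.println(effectivePaths(Map.of("path", "/tbl/dt=2022-09-22/*")));
       // Explicit read paths are split on commas and trimmed.
       System.out.println(effectivePaths(Map.of(READ_PATHS_KEY, "/tbl/a.parquet, /tbl/b.parquet")));
     }
   }
   ```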





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251788382

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470) 
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c UNKNOWN
   




[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1258619711

   @boneanxs Thanks for following up on this! Is the number of files in your tables from before or after clustering?
   
   I think the second table's numbers look worrisome and warrant an investigation.
   
   




[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1257404461

   ### Test1
   4 flat columns
   ```bash
   # With the row writer enabled (rowEnable), --executor-memory and --spark-memory were 10g instead of 20g.
   spark-submit \
       --num-executors 64 \
       --driver-memory 20g \
       --driver-cores 1 \
       --executor-memory 20g \
       --executor-cores 1 \
       --class org.apache.hudi.utilities.HoodieClusteringJob \
       $PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
       --mode scheduleAndExecute \
       --base-path $TABLEPATH \
       --table-name $TABLENAME \
       --spark-memory 20g \
       --parallelism 64 \
       --hoodie-conf hoodie.clustering.async.enabled=true \
       --hoodie-conf hoodie.clustering.async.max.commits=0 \
       --hoodie-conf hoodie.clustering.plan.strategy.max.bytes.per.group=5368709120 \
       --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=6442450944 \
       --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
       --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=10000000
   ```
   
   Row enabled | Partition hour | total size | file num | runtime
   -- | -- | -- | -- | --
   true | dt=2022-09-22/hh=14 | 233.9 G | 1.3 K | 753s
   false | dt=2022-09-22/hh=22 | 209.6 G | 1.3 K| 1008s
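
   The byte-valued settings in the Test1 command are easier to read in binary units. The helper below is only an illustrative sketch (the class `ByteSizeSketch` is hypothetical and not part of Hudi); it converts the three values used above.

   ```java
   import java.util.Locale;

   // Hypothetical helper: render a byte count in binary units (KiB, MiB, GiB, ...).
   final class ByteSizeSketch {
     private ByteSizeSketch() {}

     static String toBinaryUnits(long bytes) {
       String[] units = {"B", "KiB", "MiB", "GiB", "TiB"};
       double value = bytes;
       int unit = 0;
       // Divide by 1024 until the value fits the current unit or units run out.
       while (value >= 1024 && unit < units.length - 1) {
         value /= 1024;
         unit++;
       }
       return String.format(Locale.ROOT, "%.1f %s", value, units[unit]);
     }

     public static void main(String[] args) {
       System.out.println(toBinaryUnits(5368709120L)); // max.bytes.per.group: 5.0 GiB
       System.out.println(toBinaryUnits(6442450944L)); // target.file.max.bytes: 6.0 GiB
       System.out.println(toBinaryUnits(1073741824L)); // small.file.limit: 1.0 GiB
     }
   }
   ```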
   
   ### Test2
   23 columns, 9 nested columns, using z-order
   ```bash
   spark-submit \
       --conf 'spark.sql.parquet.columnarReaderBatchSize=2048' \
       --conf 'spark.yarn.maxAppAttempts=1' \
       --num-executors 32 \
       --driver-memory 20g \
       --driver-cores 1 \
       --executor-memory 30g \
       --executor-cores 1 \
       --class org.apache.hudi.utilities.HoodieClusteringJob \
       $PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
       --mode scheduleAndExecute \
       --base-path $TABLEPATH \
       --table-name $TABLENAME \
       --spark-memory 30g \
       --parallelism 32 \
       --hoodie-conf hoodie.clustering.async.enabled=true \
       --hoodie-conf hoodie.clustering.async.max.commits=0 \
       --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=209715200 \
       --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
       --hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
       --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
       --hoodie-conf hoodie.layout.optimize.enable=true \
       --hoodie-conf hoodie.layout.optimize.strategy=z-order \
       --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=applicationId,sparkUser
   ```
   
   Row enabled | Partition | total size | file num | runtime
   -- | -- | -- | -- | --
   true | 2022-09-19 | 70.9 G | 7.5 K | 11h 7min
   false | 2022-09-20 | 69.7 G | 7.3 K | 11h 33min
   
   Compute performance improved by 20% to 30%. The bottleneck of this job (Test2) is writing data: both runs spend approximately 10 hours in the write stage.
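
   For reference, the end-to-end runtime reduction implied by the two tables can be worked out as below. This is only a rough comparison: the row-writer and baseline runs processed different partitions of slightly different sizes, and the class `RuntimeComparisonSketch` is a hypothetical helper, not part of the benchmark.

   ```java
   import java.util.Locale;

   // Hypothetical helper: relative runtime reduction of a candidate run vs. a baseline.
   final class RuntimeComparisonSketch {
     private RuntimeComparisonSketch() {}

     static long toSeconds(long hours, long minutes) {
       return hours * 3600 + minutes * 60;
     }

     static double reductionPercent(long baselineSeconds, long candidateSeconds) {
       return 100.0 * (baselineSeconds - candidateSeconds) / baselineSeconds;
     }

     public static void main(String[] args) {
       // Test1: 1008s without the row writer, 753s with it.
       System.out.println(String.format(Locale.ROOT, "Test1: %.1f%%",
           reductionPercent(1008, 753))); // Test1: 25.3%
       // Test2: 11h 33min without the row writer, 11h 7min with it.
       System.out.println(String.format(Locale.ROOT, "Test2: %.1f%%",
           reductionPercent(toSeconds(11, 33), toSeconds(11, 7)))); // Test2: 3.8%
     }
   }
   ```

   The small end-to-end gain in Test2 is consistent with the write stage dominating the total runtime in both runs.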




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1200986824

   ## CI report:
   
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1181275610

   ## CI report:
   
   * dd79426d234315d90b2deffcd54dc0e9ab43e38e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214556523

   ## CI report:
   
   * e216664929bd2e01bc1eafa564ec4ebc745b1c34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665) 
   * 1c913457d2dd531fd1ecae6b0d60e600f59e261b UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214558392

   ## CI report:
   
   * e216664929bd2e01bc1eafa564ec4ebc745b1c34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665) 
   * 1c913457d2dd531fd1ecae6b0d60e600f59e261b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760) 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1181254713

   @hudi-bot run azure




[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1185103626

   > @boneanxs my WeChat is 1037817390; let's discuss this PR on WeChat first. I think we can launch this PR in 0.12.
   
   Sure




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1174904310

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 58cf2096e648ccc8c7e7c563003753ce89a90261 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1236834653

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b8e848d0f8b32ff3c75762951e3af4c911419927 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151) 
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249235392

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425",
       "triggerID" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423) 
   * 988e4874af3065d6879f9adc40c7483a84467f72 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249292500

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425",
       "triggerID" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423) 
   * 988e4874af3065d6879f9adc40c7483a84467f72 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425) 
   * f2bb9e61707199197f30eef79e80db3e1241b3a0 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1237091461

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r946553109


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +398,62 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> pairs = new ArrayList<>();

Review Comment:
   will change
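
A minimal sketch of the path-gathering step in the hunk above, for readers following the thread. The hunk is cut off at the review point, so the body of the `map` lambda is not known; this sketch assumes each operation contributes its base file plus any log files. `ClusteringReadSketch` and `SliceOp` are hypothetical stand-ins, not Hudi classes. Only the `hoodie.datasource.read.paths` key comes from the PR description, and joining the paths with commas for that option is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ClusteringReadSketch {

  // Hypothetical stand-in for one clustering operation:
  // a single base file plus zero or more log (delta) files.
  static final class SliceOp {
    private final String baseFilePath;
    private final List<String> deltaFilePaths;

    SliceOp(String baseFilePath, List<String> deltaFilePaths) {
      this.baseFilePath = baseFilePath;
      this.deltaFilePaths = Collections.unmodifiableList(new ArrayList<>(deltaFilePaths));
    }

    String getBaseFilePath() {
      return baseFilePath;
    }

    List<String> getDeltaFilePaths() {
      return deltaFilePaths;
    }
  }

  // Mirrors the hasLogFiles check in the hunk: true if any operation has log files.
  static boolean hasLogFiles(List<SliceOp> ops) {
    return ops.stream().anyMatch(op -> !op.getDeltaFilePaths().isEmpty());
  }

  // Assumption: every base file and log file path is passed to the reader
  // as one comma-separated string (e.g. as the hoodie.datasource.read.paths value).
  static String joinedReadPaths(List<SliceOp> ops) {
    return ops.stream()
        .flatMap(op -> Stream.concat(
            Stream.of(op.getBaseFilePath()),
            op.getDeltaFilePaths().stream()))
        .collect(Collectors.joining(","));
  }

  public static void main(String[] args) {
    List<SliceOp> ops = Arrays.asList(
        new SliceOp("part=1/a.parquet", Collections.<String>emptyList()),
        new SliceOp("part=1/b.parquet", Arrays.asList("part=1/.b.log.1")));
    System.out.println("hasLogFiles = " + hasLogFiles(ops));
    System.out.println("read paths  = " + joinedReadPaths(ops));
  }
}
```

Using an `isEmpty()` check instead of `size() > 0` is only a readability choice; the two are equivalent here.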





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1253137877

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425",
       "triggerID" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429",
       "triggerID" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1587f472f18d7b524971637abe64d171c9799818",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432",
       "triggerID" : "1587f472f18d7b524971637abe64d171c9799818",
       "triggerType" : "PUSH"
     }, {
       "hash" : "20f64af242ac3e6df5d1555edf0766e7dcdd698a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470",
       "triggerID" : "20f64af242ac3e6df5d1555edf0766e7dcdd698a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2ff0b70e69fcff7cd061a2512dc983ac92a3c87c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512",
       "triggerID" : "2ff0b70e69fcff7cd061a2512dc983ac92a3c87c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e75f6d0031490025107040c1b0093c3c5720a67d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11519",
       "triggerID" : "e75f6d0031490025107040c1b0093c3c5720a67d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "46004b031d07d220812c5cdb19e9cd66552ceacc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11522",
       "triggerID" : "46004b031d07d220812c5cdb19e9cd66552ceacc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "af01ca4e55cc478dc252fb443c525c15780eefa9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "af01ca4e55cc478dc252fb443c525c15780eefa9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 46004b031d07d220812c5cdb19e9cd66552ceacc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11522) 
   * af01ca4e55cc478dc252fb443c525c15780eefa9 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214586604

   ## CI report:
   
   * 1c913457d2dd531fd1ecae6b0d60e600f59e261b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r945544123


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -211,6 +211,23 @@ public class HoodieWriteConfig extends HoodieConfig {
           + " optimally for common query patterns. For now we support a build-in user defined bulkinsert partitioner org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner"
           + " which can does sorting based on specified column values set by " + BULKINSERT_USER_DEFINED_PARTITIONER_SORT_COLUMNS.key());
 
+  public static final ConfigProperty<String> BULKINSERT_ROW_IDENTIFY_ID = ConfigProperty
+      .key("hoodie.bulkinsert.row.writestatus.id")
+      .noDefaultValue()
+      .withDocumentation("The unique id for each write operation, HoodieInternalWriteStatusCoordinator will use "

Review Comment:
   Why do we need to set this config? How about using a UUID directly?
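   One way to read this suggestion, as a minimal sketch only: generate the identifier with `java.util.UUID` instead of reading it from a user-facing key. The class and method names below are hypothetical and not part of the PR; whether a generated id can reach `HoodieInternalWriteStatusCoordinator` without going through a config is not settled in this thread.

```java
import java.util.UUID;

// Hypothetical helper, for illustration only; not a class from the PR.
final class WriteStatusIdSketch {

  // Returns a random (version 4) UUID rendered as a 36-character string.
  static String newWriteOperationId() {
    return UUID.randomUUID().toString();
  }

  public static void main(String[] args) {
    // Each call yields a fresh identifier, so two write operations do not share one.
    System.out.println(newWriteOperationId());
    System.out.println(newWriteOperationId());
  }
}
```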





[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r962540143


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowSpatialCurveSortPartitioner.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.sort.SpaceCurveSortingHelper;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class RowSpatialCurveSortPartitioner extends RowCustomColumnsSortPartitioner {
+
+  private final String[] orderByColumns;
+  private final HoodieClusteringConfig.LayoutOptimizationStrategy layoutOptStrategy;
+  private final HoodieClusteringConfig.SpatialCurveCompositionStrategyType curveCompositionStrategyType;
+
+  public RowSpatialCurveSortPartitioner(HoodieWriteConfig config) {
+    super(config);
+    this.layoutOptStrategy = config.getLayoutOptimizationStrategy();
+    if (config.getClusteringSortColumns() != null) {
+      this.orderByColumns = Arrays.stream(config.getClusteringSortColumns().split(","))
+          .map(String::trim).toArray(String[]::new);
+    } else {
+      this.orderByColumns = getSortColumnNames();
+    }
+    this.curveCompositionStrategyType = config.getLayoutOptimizationCurveBuildMethod();
+  }
+
+  @Override
+  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputPartitions) {
+    return reorder(records, outputPartitions);

Review Comment:
   It looks like we already account for this when building the clustering plan: only files from the same partition are combined into one `clusteringGroup`, so maybe we don't need to handle it here?





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r969043938


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -113,6 +129,15 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final Hood
     return writeMetadata;
   }
 
+  /**
+   * Execute clustering to write inputRecords into new files based on strategyParams.
+   * Different from {@link performClusteringWithRecordsRDD}, this method take {@link Dataset<Row>}
+   * as inputs.
+   */
+  public abstract HoodieData<WriteStatus> performClusteringWithRecordsRow(final Dataset<Row> inputRecords, final int numOutputGroups, final String instantTime,

Review Comment:
   This style of wrapping (while acceptable under the recognized code style guide) makes it quite hard to read on a laptop screen:
   
   <img width="822" alt="Screen Shot 2022-09-12 at 5 48 53 PM" src="https://user-images.githubusercontent.com/428277/189783507-51e4138e-e9fa-48ba-8b54-a798a31c200c.png">
   



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowSpatialCurveSortPartitioner.java:
##########
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.sort.SpaceCurveSortingHelper;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class RowSpatialCurveSortPartitioner implements BulkInsertPartitioner<Dataset<Row>> {
+
+  private final String[] orderByColumns;

Review Comment:
   Let's extract a common base class `SpatialCurveSortPartitionerBase` so that we can reuse as much code as possible and avoid duplication
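   A rough sketch of what such a base class could look like, under stated assumptions: everything below is hypothetical and uses a plain `List<String>` as a stand-in for `Dataset<Row>` and the RDD type, so it runs without Spark. Only the column parsing mirrors the diff; the ordering step is a placeholder.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; names and structure are assumptions, not the PR's code.
final class SpatialCurveSketch {

  // Shared state and parsing live here; T is the engine-specific record container.
  abstract static class Base<T> {
    private final String[] orderByColumns;

    protected Base(String commaSeparatedColumns) {
      if (commaSeparatedColumns == null) {
        throw new IllegalArgumentException("Sort columns must be provided");
      }
      // Same parsing as in the diff: split on commas and trim each column name.
      this.orderByColumns = Arrays.stream(commaSeparatedColumns.split(","))
          .map(String::trim).toArray(String[]::new);
    }

    String[] getOrderByColumns() {
      return orderByColumns;
    }

    // Only the engine-specific step is left to subclasses.
    abstract T repartitionRecords(T records, int outputPartitions);
  }

  // Toy subclass standing in for the Row and RDD variants.
  static final class ListPartitioner extends Base<List<String>> {
    ListPartitioner(String columns) {
      super(columns);
    }

    @Override
    List<String> repartitionRecords(List<String> records, int outputPartitions) {
      // Placeholder: natural ordering instead of a space-filling curve.
      return records.stream().sorted().collect(Collectors.toList());
    }
  }

  public static void main(String[] args) {
    ListPartitioner partitioner = new ListPartitioner(" a, b ,c");
    System.out.println(Arrays.toString(partitioner.getOrderByColumns()));
    System.out.println(partitioner.repartitionRecords(Arrays.asList("b", "a"), 1));
  }
}
```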



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -98,10 +106,18 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final Hood
     // execute clustering for each group async and collect WriteStatus
     Stream<HoodieData<WriteStatus>> writeStatusesStream = FutureUtils.allOf(
         clusteringPlan.getInputGroups().stream()
-        .map(inputGroup -> runClusteringForGroupAsync(inputGroup,
-            clusteringPlan.getStrategy().getStrategyParams(),
-            Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false),
-            instantTime))
+            .map(inputGroup -> {
+              if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) {
+                return runClusteringForGroupAsyncWithRow(inputGroup,
+                    clusteringPlan.getStrategy().getStrategyParams(),
+                    Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false),

Review Comment:
   Please extract the common expression (to `shouldPreserveMetadata`)
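   The extraction could look roughly like this. It is a sketch only: `java.util.Optional` replaces Hudi's `Option`, the nullable `preserveFlag` parameter stands in for `clusteringPlan.getPreserveHoodieMetadata()`, and the string results stand in for the two `runClusteringForGroupAsync...` calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical stand-in for the dispatch loop in performClustering.
final class PreserveMetadataSketch {

  static List<String> dispatch(List<String> inputGroups, Boolean preserveFlag, boolean rowWriterEnabled) {
    // Evaluate the shared expression once, outside the per-group lambda,
    // so both branches read the same local instead of repeating it.
    final boolean shouldPreserveMetadata = Optional.ofNullable(preserveFlag).orElse(false);

    return inputGroups.stream()
        .map(group -> (rowWriterEnabled ? "row:" : "rdd:") + group + ":" + shouldPreserveMetadata)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // A null flag falls back to false, matching orElse(false) in the diff.
    System.out.println(dispatch(Arrays.asList("g1", "g2"), null, true));
  }
}
```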



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -1084,7 +1084,7 @@ public boolean isEmbeddedTimelineServerEnabled() {
   }
 
   public boolean isEmbeddedTimelineServerReuseEnabled() {
-    return Boolean.parseBoolean(getStringOrDefault(EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED));
+    return getBoolean(EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED);

Review Comment:
   We need to do `getBooleanOrDefault`, otherwise it might NPE (due to unboxing)
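   To illustrate the failure mode in isolation: a lookup that returns a boxed `Boolean` yields `null` for an unset key, and returning that from a method declared `boolean` unboxes it and throws. The helpers below are stand-ins written for this example over `java.util.Properties`; they are not Hudi's `HoodieConfig` implementation.

```java
import java.util.Properties;

// Stand-in helpers for illustration; not the Hudi config API.
final class BooleanUnboxingSketch {

  // Returns null when the key is unset, like a boxed lookup without a default.
  static Boolean getBoolean(Properties props, String key) {
    String raw = props.getProperty(key);
    if (raw == null) {
      return null;
    }
    return Boolean.valueOf(raw);
  }

  // Never returns null: falls back to the supplied default when the key is unset.
  static boolean getBooleanOrDefault(Properties props, String key, boolean defaultValue) {
    String raw = props.getProperty(key);
    return raw == null ? defaultValue : Boolean.parseBoolean(raw);
  }

  // Declared as primitive boolean, so a null Boolean is unboxed here
  // and a NullPointerException is thrown for an unset key.
  static boolean unsafeRead(Properties props, String key) {
    return getBoolean(props, key);
  }

  public static void main(String[] args) {
    Properties props = new Properties(); // key intentionally left unset
    System.out.println(getBooleanOrDefault(props, "some.flag", false));
    try {
      unsafeRead(props, "some.flag");
    } catch (NullPointerException e) {
      System.out.println("NPE from unboxing a null Boolean");
    }
  }
}
```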



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkBulkInsertRowWriter.java:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.data.HoodieData;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Iterator;
+import java.util.List;
+
+public class SparkBulkInsertRowWriter {
+
+  /**
+   * Perform bulk insert for {@link Dataset<Row>}, will not change timeline/index, return
+   * information about write files.
+   */
+  public static HoodieData<WriteStatus> bulkInsert(Dataset<Row> dataset,
+                                            String instantTime,
+                                            HoodieTable table,
+                                            HoodieWriteConfig writeConfig,
+                                            BulkInsertPartitioner<Dataset<Row>> partitioner,
+                                            int parallelism,
+                                            boolean preserveHoodieMetadata) {
+    Dataset<Row> repartitionedDataset = partitioner.repartitionRecords(dataset, parallelism);
+
+    boolean arePartitionRecordsSorted = partitioner.arePartitionRecordsSorted();
+    StructType schema = dataset.schema();
+    List<WriteStatus> writeStatuses = repartitionedDataset.queryExecution().toRdd().toJavaRDD().mapPartitions(
+        (FlatMapFunction<Iterator<InternalRow>, WriteStatus>) rowIterator -> {
+          TaskContextSupplier taskContextSupplier = table.getTaskContextSupplier();
+          int taskPartitionId = taskContextSupplier.getPartitionIdSupplier().get();
+          long taskId = taskContextSupplier.getStageIdSupplier().get();
+          long taskEpochId = taskContextSupplier.getAttemptIdSupplier().get();
+
+          final BulkInsertDataInternalWriterHelper writer =
+              new BulkInsertDataInternalWriterHelper(table, writeConfig, instantTime, taskPartitionId, taskId, taskEpochId,
+                  schema, writeConfig.populateMetaFields(), arePartitionRecordsSorted, preserveHoodieMetadata);
+          while (rowIterator.hasNext()) {
+            writer.write(rowIterator.next());
+          }
+          return writer.getWriteStatuses()
+              .stream()
+              .map(internalWriteStatus -> {
+                WriteStatus status = new WriteStatus(
+                    internalWriteStatus.isTrackSuccessRecords(), internalWriteStatus.getFailureFraction());
+                status.setFileId(internalWriteStatus.getFileId());
+                status.setTotalRecords(internalWriteStatus.getTotalRecords());
+                status.setPartitionPath(internalWriteStatus.getPartitionPath());
+                status.setStat(internalWriteStatus.getStat());
+                return status;
+              }).iterator();
+        }).collect();

Review Comment:
   @nsivabalan FYI: this is what we talked about last week -- we dereference the RDD/DataFrame at a well-defined point in the workflow and then convert to a list of `WriteStatus`es



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkSortAndSizeExecutionStrategy.java:
##########
@@ -52,6 +55,27 @@ public SparkSortAndSizeExecutionStrategy(HoodieTable table,
     super(table, engineContext, writeConfig);
   }
 
+  @Override
+  public HoodieData<WriteStatus> performClusteringWithRecordsRow(Dataset<Row> inputRecords, int numOutputGroups,
+                                                                 String instantTime, Map<String, String> strategyParams, Schema schema,
+                                                                 List<HoodieFileGroupId> fileGroupIdList, boolean preserveHoodieMetadata) {
+    LOG.info("Starting clustering for a group, parallelism:" + numOutputGroups + " commit:" + instantTime);
+    HoodieWriteConfig newConfig = HoodieWriteConfig.newBuilder()
+        .withBulkInsertParallelism(numOutputGroups)
+        .withProps(getWriteConfig().getProps()).build();
+
+    boolean shouldPreserveHoodieMetadata = preserveHoodieMetadata;

Review Comment:
   Why do we need to override this here?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkSingleFileSortExecutionStrategy.java:
##########
@@ -54,6 +57,35 @@ public SparkSingleFileSortExecutionStrategy(HoodieTable table,
     super(table, engineContext, writeConfig);
   }
 
+  @Override
+  public HoodieData<WriteStatus> performClusteringWithRecordsRow(Dataset<Row> inputRecords,
+                                                                 int numOutputGroups,
+                                                                 String instantTime,
+                                                                 Map<String, String> strategyParams,
+                                                                 Schema schema,
+                                                                 List<HoodieFileGroupId> fileGroupIdList,
+                                                                 boolean preserveHoodieMetadata) {
+    if (numOutputGroups != 1 || fileGroupIdList.size() != 1) {
+      throw new HoodieClusteringException("Expect only one file group for strategy: " + getClass().getName());
+    }
+    LOG.info("Starting clustering for a group, parallelism:" + numOutputGroups + " commit:" + instantTime);
+
+    HoodieWriteConfig newConfig = HoodieWriteConfig.newBuilder()
+        .withBulkInsertParallelism(numOutputGroups)
+        .withProps(getWriteConfig().getProps()).build();
+
+    boolean shouldPreserveHoodieMetadata = preserveHoodieMetadata;
+    if (!newConfig.populateMetaFields() && preserveHoodieMetadata) {
+      LOG.warn("Will setting preserveHoodieMetadata to false as populateMetaFields is false");
+      shouldPreserveHoodieMetadata = false;
+    }
+
+    newConfig.setValue(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE, String.valueOf(Long.MAX_VALUE));

Review Comment:
   Let's duplicate the comment from the original method as well



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -98,10 +106,18 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final Hood
     // execute clustering for each group async and collect WriteStatus
     Stream<HoodieData<WriteStatus>> writeStatusesStream = FutureUtils.allOf(
         clusteringPlan.getInputGroups().stream()
-        .map(inputGroup -> runClusteringForGroupAsync(inputGroup,
-            clusteringPlan.getStrategy().getStrategyParams(),
-            Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false),
-            instantTime))
+            .map(inputGroup -> {
+              if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) {

Review Comment:
   Let's abstract this as a method in WriteConfig
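   As a sketch of that abstraction: the raw key string moves behind a named accessor, so call sites no longer repeat it. The key is copied from the diff; the class `WriteConfigSketch` and the accessor name `isRowWriterEnabled` are assumptions made for this example, backed by plain `java.util.Properties` rather than `HoodieWriteConfig`.

```java
import java.util.Properties;

// Hypothetical miniature of a write config; not HoodieWriteConfig itself.
final class WriteConfigSketch {

  // Key as it appears in the diff.
  static final String ROW_WRITER_ENABLE_KEY = "hoodie.datasource.write.row.writer.enable";

  private final Properties props;

  WriteConfigSketch(Properties props) {
    this.props = props;
  }

  // Assumed accessor name; defaults to false when the key is unset,
  // mirroring getBooleanOrDefault(..., false) in the diff.
  boolean isRowWriterEnabled() {
    return Boolean.parseBoolean(props.getProperty(ROW_WRITER_ENABLE_KEY, "false"));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(new WriteConfigSketch(props).isRowWriterEnabled());
    props.setProperty(ROW_WRITER_ENABLE_KEY, "true");
    System.out.println(new WriteConfigSketch(props).isRowWriterEnabled());
  }
}
```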



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkBulkInsertRowWriter.java:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.data.HoodieData;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Iterator;
+import java.util.List;
+
+public class SparkBulkInsertRowWriter {

Review Comment:
   Let's consolidate this one w/ `HoodieDatasetBulkInsertHelper`



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BulkInsertDataInternalWriterHelper.java:
##########
@@ -118,7 +126,8 @@ public BulkInsertDataInternalWriterHelper(HoodieTable hoodieTable, HoodieWriteCo
   private Option<BuiltinKeyGenerator> getKeyGenerator(Properties properties) {
     TypedProperties typedProperties = new TypedProperties();
     typedProperties.putAll(properties);
-    if (properties.get(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME().key()).equals(NonpartitionedKeyGenerator.class.getName())) {
+    if (Option.ofNullable(properties.get(HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key()))

Review Comment:
   Why are we not instantiating the non-partitioned key generator?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##########
@@ -70,6 +70,8 @@ public class HoodieRowCreateHandle implements Serializable {
   private final UTF8String commitTime;
   private final Function<Long, String> seqIdGenerator;
 
+  private final boolean preserveHoodieMetadata;

Review Comment:
   nit: `shouldPreserveHoodieMetadata`



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -148,29 +184,34 @@ protected BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> getPartitioner(Map<Str
       switch (layoutOptStrategy) {
         case ZORDER:
         case HILBERT:
-          return new RDDSpatialCurveSortPartitioner(
+          return isRowPartitioner
+              ? new RowSpatialCurveSortPartitioner(getWriteConfig())
+              : new RDDSpatialCurveSortPartitioner(
               (HoodieSparkEngineContext) getEngineContext(),
               orderByColumns,
               layoutOptStrategy,
               getWriteConfig().getLayoutOptimizationCurveBuildMethod(),
               HoodieAvroUtils.addMetadataFields(schema));
         case LINEAR:
-          return new RDDCustomColumnsSortPartitioner(orderByColumns, HoodieAvroUtils.addMetadataFields(schema),
+          return isRowPartitioner
+              ? new RowCustomColumnsSortPartitioner(orderByColumns)
+              : new RDDCustomColumnsSortPartitioner(orderByColumns, HoodieAvroUtils.addMetadataFields(schema),
               getWriteConfig().isConsistentLogicalTimestampEnabled());
         default:
           throw new UnsupportedOperationException(String.format("Layout optimization strategy '%s' is not supported", layoutOptStrategy));
       }
-    }).orElse(BulkInsertInternalPartitionerFactory.get(getWriteConfig().getBulkInsertSortMode()));
+    }).orElse(isRowPartitioner ? BulkInsertInternalPartitionerWithRowsFactory.get(getWriteConfig().getBulkInsertSortMode()) :
+        BulkInsertInternalPartitionerFactory.get(getWriteConfig().getBulkInsertSortMode()));
   }
 
   /**
-   * Submit job to execute clustering for the group.
+   * Submit job to execute clustering for the group with RDD APIs.
    */
-  private CompletableFuture<HoodieData<WriteStatus>> runClusteringForGroupAsync(HoodieClusteringGroup clusteringGroup, Map<String, String> strategyParams,
-                                                                             boolean preserveHoodieMetadata, String instantTime) {
+  private CompletableFuture<HoodieData<WriteStatus>> runClusteringForGroupAsyncWithRDD(HoodieClusteringGroup clusteringGroup, Map<String, String> strategyParams,

Review Comment:
   Let's stay consistent with the rest of the codebase in how we identify Row or Avro workflows:
    - Avro/HoodieRecord keep their existing names
    - Row counterparts get the `AsRow` suffix
   
   In this case I suggest going with `runClusteringForGroupAsync` for the existing one and `runClusteringForGroupAsyncAsRow` for the new one.
   
   WDYT?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowSpatialCurveSortPartitioner.java:
##########
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.sort.SpaceCurveSortingHelper;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class RowSpatialCurveSortPartitioner implements BulkInsertPartitioner<Dataset<Row>> {
+
+  private final String[] orderByColumns;
+  private final HoodieClusteringConfig.LayoutOptimizationStrategy layoutOptStrategy;
+  private final HoodieClusteringConfig.SpatialCurveCompositionStrategyType curveCompositionStrategyType;
+
+  public RowSpatialCurveSortPartitioner(HoodieWriteConfig config) {
+    this.layoutOptStrategy = config.getLayoutOptimizationStrategy();
+    if (config.getClusteringSortColumns() != null) {
+      this.orderByColumns = Arrays.stream(config.getClusteringSortColumns().split(","))
+          .map(String::trim).toArray(String[]::new);
+    } else {
+      throw new IllegalArgumentException("The config "
+          + HoodieClusteringConfig.PLAN_STRATEGY_SORT_COLUMNS.key() + " must be provided");
+    }
+    this.curveCompositionStrategyType = config.getLayoutOptimizationCurveBuildMethod();
+  }
+
+  @Override
+  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputPartitions) {
+    return reorder(records, outputPartitions);
+  }
+
+  private Dataset<Row> reorder(Dataset<Row> dataset, int numOutputGroups) {

Review Comment:
   This method, for example, is the same for both Avro/Row impls. We should certainly reuse it (instead of duplicating it)



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -148,29 +184,34 @@ protected BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> getPartitioner(Map<Str
       switch (layoutOptStrategy) {
         case ZORDER:
         case HILBERT:
-          return new RDDSpatialCurveSortPartitioner(
+          return isRowPartitioner
+              ? new RowSpatialCurveSortPartitioner(getWriteConfig())
+              : new RDDSpatialCurveSortPartitioner(
               (HoodieSparkEngineContext) getEngineContext(),
               orderByColumns,
               layoutOptStrategy,
               getWriteConfig().getLayoutOptimizationCurveBuildMethod(),
               HoodieAvroUtils.addMetadataFields(schema));
         case LINEAR:
-          return new RDDCustomColumnsSortPartitioner(orderByColumns, HoodieAvroUtils.addMetadataFields(schema),
+          return isRowPartitioner
+              ? new RowCustomColumnsSortPartitioner(orderByColumns)
+              : new RDDCustomColumnsSortPartitioner(orderByColumns, HoodieAvroUtils.addMetadataFields(schema),
               getWriteConfig().isConsistentLogicalTimestampEnabled());

Review Comment:
   Same as above



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -148,29 +184,34 @@ protected BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> getPartitioner(Map<Str
       switch (layoutOptStrategy) {
         case ZORDER:
         case HILBERT:
-          return new RDDSpatialCurveSortPartitioner(
+          return isRowPartitioner
+              ? new RowSpatialCurveSortPartitioner(getWriteConfig())
+              : new RDDSpatialCurveSortPartitioner(
               (HoodieSparkEngineContext) getEngineContext(),

Review Comment:
   Please fix alignment



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkSortAndSizeExecutionStrategy.java:
##########
@@ -52,6 +55,27 @@ public SparkSortAndSizeExecutionStrategy(HoodieTable table,
     super(table, engineContext, writeConfig);
   }
 
+  @Override
+  public HoodieData<WriteStatus> performClusteringWithRecordsRow(Dataset<Row> inputRecords, int numOutputGroups,

Review Comment:
   Let's create a Jira to implement `HoodieData` wrapping around `DataFrame` (you can assign it to me)
   
   In that case we won't need to duplicate these methods and will simply be able to parameterize them.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +330,60 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      String[] paths = CollectionUtils.combine(baseFilePaths, deltaPaths);
+      inputRecords = sqlContext.read()

Review Comment:
   Same feedback as with the write-path: we can't use Spark DataSource in here for mostly the same reasons -- it violates layering and could lead to subtle bugs. 
   
   Instead, let's extract the following portion of the `createRelation` method and reuse it directly here:
   https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala#L125
   
   Like the following:
   ```
   val relation = DefaultSource.createRelation(...)
   val df = sparkSession.baseRelationToDataFrame(relation)
   ```



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +330,60 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))

Review Comment:
   The only part that differs b/w these branches is the options composition. Let's extract the common part and keep only the options composition under the conditional.
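
One way to read the suggestion above, as a minimal sketch in plain Java: only the options map is built under the conditional, and whatever performs the read consumes that map once. The class name, the `join` helper, and the `hoodie.datasource.query.type` values chosen per branch are assumptions for illustration; `hoodie.datasource.read.paths` and `compaction.memory.fraction` are the keys named in this PR.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the PR: options composition stays under
// the conditional while the read step itself is shared by both branches.
final class ReadOptionsSketch {

  static final String READ_PATHS = "hoodie.datasource.read.paths";
  static final String MEMORY_FRACTION = "compaction.memory.fraction";
  // Assumed key and values; the hunk shown here does not spell them out.
  static final String QUERY_TYPE = "hoodie.datasource.query.type";

  static Map<String, String> buildReadOptions(boolean hasLogFiles,
                                              String[] baseFilePaths,
                                              String[] deltaPaths,
                                              String memoryFraction) {
    Map<String, String> options = new LinkedHashMap<>();
    if (hasLogFiles) {
      // Base files have to be merged with their log files.
      options.put(QUERY_TYPE, "snapshot");
      options.put(MEMORY_FRACTION, memoryFraction);
      options.put(READ_PATHS, join(baseFilePaths, deltaPaths));
    } else {
      // Base files only.
      options.put(QUERY_TYPE, "read_optimized");
      options.put(READ_PATHS, join(baseFilePaths));
    }
    return options;
  }

  // Flattens any number of path arrays into one comma-separated string.
  static String join(String[]... groups) {
    return Arrays.stream(groups)
        .flatMap(group -> Arrays.stream(group))
        .collect(Collectors.joining(","));
  }

  public static void main(String[] args) {
    Map<String, String> options = buildReadOptions(
        true, new String[] {"f1.parquet"}, new String[] {".f1.log.1"}, "0.75");
    // The shared read step would consume `options` exactly once at this point.
    options.forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```

With this shape, a new read option only touches `buildReadOptions`, and the read call itself is written once.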



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowSpatialCurveSortPartitioner.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.sort.SpaceCurveSortingHelper;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class RowSpatialCurveSortPartitioner extends RowCustomColumnsSortPartitioner {
+
+  private final String[] orderByColumns;
+  private final HoodieClusteringConfig.LayoutOptimizationStrategy layoutOptStrategy;
+  private final HoodieClusteringConfig.SpatialCurveCompositionStrategyType curveCompositionStrategyType;
+
+  public RowSpatialCurveSortPartitioner(HoodieWriteConfig config) {
+    super(config);
+    this.layoutOptStrategy = config.getLayoutOptimizationStrategy();
+    if (config.getClusteringSortColumns() != null) {
+      this.orderByColumns = Arrays.stream(config.getClusteringSortColumns().split(","))
+          .map(String::trim).toArray(String[]::new);
+    } else {
+      this.orderByColumns = getSortColumnNames();
+    }
+    this.curveCompositionStrategyType = config.getLayoutOptimizationCurveBuildMethod();
+  }
+
+  @Override
+  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputPartitions) {
+    return reorder(records, outputPartitions);

Review Comment:
   SG



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -113,6 +129,15 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final Hood
     return writeMetadata;
   }
 
+  /**
+   * Execute clustering to write inputRecords into new files based on strategyParams.
+   * Different from {@link performClusteringWithRecordsRDD}, this method take {@link Dataset<Row>}
+   * as inputs.
+   */
+  public abstract HoodieData<WriteStatus> performClusteringWithRecordsRow(final Dataset<Row> inputRecords, final int numOutputGroups, final String instantTime,

Review Comment:
   Stacked one for comparison:
   
   <img width="710" alt="Screen Shot 2022-09-12 at 5 50 34 PM" src="https://user-images.githubusercontent.com/428277/189783700-35472b54-224c-4bd1-8f39-42952d1b5bf4.png">
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1258921709

   > Is number of files in your tables before or after?
   
   It's before; I've added the "after" numbers as well.
   
   ### Test1
   
   Row enabled | Partition hour | total size (before) | file num (before) | total size (after) | file num (after) | runtime
   -- | -- | -- | -- | -- | -- | --
   true | dt=2022-09-22/hh=14 | 233.9 G | 1.3 K | 138.1 G | 47 | 753s
   false | dt=2022-09-22/hh=22 | 209.6 G | 1.3 K | 123.6 G | 43  | 1008s
   
   ### Test2
   
   Row enabled | Partition | total size (before) | file num (before) | total size (after) | file num (after) | runtime
   -- | -- | -- | -- | -- | -- | --
   true | 2022-09-19 | 70.9 G | 7.5 K | 55.8 G | 409 | 11h 7min
   false | 2022-09-20 | 69.7 G | 7.3 K | 54.5 G | 397 | 11h 33min
   
   Yeah, the second test has many small files, but it still confuses me why it is so slow to write files whose average size is 250M. I'm still investigating why this happens.
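
Because the row-enabled and row-disabled runs cover different partitions of slightly different sizes, one rough way to compare them is input size divided by runtime. The sketch below does that arithmetic with the numbers from the two tables. The class and method names are made up for illustration, and the result is only indicative since the partitions differ.

```java
// Hypothetical helper for comparing the benchmark rows above.
final class ClusteringThroughputSketch {

  // Input gigabytes processed per second of runtime.
  static double throughput(double inputGb, long runtimeSeconds) {
    return inputGb / runtimeSeconds;
  }

  // Ratio of row-writer throughput to non-row throughput (> 1 means row is faster).
  static double speedup(double rowGb, long rowSeconds, double nonRowGb, long nonRowSeconds) {
    return throughput(rowGb, rowSeconds) / throughput(nonRowGb, nonRowSeconds);
  }

  // Converts an "Hh Mmin" runtime to seconds.
  static long seconds(long hours, long minutes) {
    return hours * 3600 + minutes * 60;
  }

  public static void main(String[] args) {
    // Test1: 233.9 G in 753s (row) vs 209.6 G in 1008s (non-row).
    System.out.printf("Test1 speedup: %.2fx%n", speedup(233.9, 753, 209.6, 1008));
    // Test2: 70.9 G in 11h 7min (row) vs 69.7 G in 11h 33min (non-row).
    System.out.printf("Test2 speedup: %.2fx%n",
        speedup(70.9, seconds(11, 7), 69.7, seconds(11, 33)));
  }
}
```

On these inputs the ratio is roughly 1.5x for Test1 and about 1.06x for Test2, which matches the observation that the second test barely benefits.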
   
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1180145219

   ## CI report:
   
   * 58cf2096e648ccc8c7e7c563003753ce89a90261 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730) 
   * dd79426d234315d90b2deffcd54dc0e9ab43e38e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r970248200


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkSortAndSizeExecutionStrategy.java:
##########
@@ -52,6 +55,27 @@ public SparkSortAndSizeExecutionStrategy(HoodieTable table,
     super(table, engineContext, writeConfig);
   }
 
+  @Override
+  public HoodieData<WriteStatus> performClusteringWithRecordsRow(Dataset<Row> inputRecords, int numOutputGroups,
+                                                                 String instantTime, Map<String, String> strategyParams, Schema schema,
+                                                                 List<HoodieFileGroupId> fileGroupIdList, boolean preserveHoodieMetadata) {
+    LOG.info("Starting clustering for a group, parallelism:" + numOutputGroups + " commit:" + instantTime);
+    HoodieWriteConfig newConfig = HoodieWriteConfig.newBuilder()
+        .withBulkInsertParallelism(numOutputGroups)
+        .withProps(getWriteConfig().getProps()).build();
+
+    boolean shouldPreserveHoodieMetadata = preserveHoodieMetadata;

Review Comment:
   If `populateMetaFields` is false, will `preserveHoodieMetadata` still make sense?





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251927655

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) 
   * e75f6d0031490025107040c1b0093c3c5720a67d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11519) 
   * 46004b031d07d220812c5cdb19e9cd66552ceacc UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r974679847


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -105,6 +108,52 @@ object HoodieDatasetBulkInsertHelper extends Logging {
     partitioner.repartitionRecords(trimmedDF, config.getBulkInsertShuffleParallelism)
   }
 
+  /**
+   * Perform bulk insert for [[Dataset<Row>]], will not change timeline/index, return
+   * information about write files.
+   */
+  def bulkInsert(dataset: Dataset[Row],
+                 instantTime: String,
+                 table: HoodieTable[_ <: HoodieRecordPayload[_ <: HoodieRecordPayload[_ <: AnyRef]], _, _, _],
+                 writeConfig: HoodieWriteConfig,
+                 partitioner: BulkInsertPartitioner[Dataset[Row]],
+                 parallelism: Int,
+                 shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = {
+    val repartitionedDataset = partitioner.repartitionRecords(dataset, parallelism)
+    val arePartitionRecordsSorted = partitioner.arePartitionRecordsSorted
+    val schema = dataset.schema
+    val writeStatuses = repartitionedDataset.queryExecution.toRdd.mapPartitions(iter => {
+      val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier
+      val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
+      val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
+      val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get
+      val writer = new BulkInsertDataInternalWriterHelper(
+        table,
+        writeConfig,
+        instantTime,
+        taskPartitionId,
+        taskId,
+        taskEpochId,
+        schema,
+        writeConfig.populateMetaFields,
+        arePartitionRecordsSorted,
+        shouldPreserveHoodieMetadata)
+
+      try {
+        iter.foreach(writer.write)
+      } catch {
+        case t: Throwable =>
+          writer.abort()
+          throw t
+      } finally {
+        writer.close()
+      }
+
+      writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
+    }).collect()
+    table.getContext.parallelize(writeStatuses.toList.asJava)

Review Comment:
   nit: no need for `toList`



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkAdapterSupport.scala:
##########
@@ -26,17 +26,6 @@ import org.apache.spark.sql.hudi.SparkAdapter
  */
 trait SparkAdapterSupport {
 
-  lazy val sparkAdapter: SparkAdapter = {

Review Comment:
   Instead of moving this to Java, let's do the following:
   
   - Create companion object `ScalaAdapterSupport` 
   - Move this conditional there
   - Keep this var (for compatibility) referencing static one from the object



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java:
##########
@@ -183,8 +183,17 @@ public boolean accept(Path path) {
             metaClientCache.put(baseDir.toString(), metaClient);
           }
 
-          fsView = FileSystemViewManager.createInMemoryFileSystemView(engineContext,
-              metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf()));
+          if (getConf().get("as.of.instant") != null) {

Review Comment:
   Good catch!



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSpatialCurveSortPartitioner.java:
##########
@@ -91,27 +81,4 @@ public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> reco
           return hoodieRecord;
         });
   }
-
-  private Dataset<Row> reorder(Dataset<Row> dataset, int numOutputGroups) {

Review Comment:
   Thanks for cleaning that up!



##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieSparkMergeOnReadTableClustering.java:
##########
@@ -60,20 +61,30 @@ class TestHoodieSparkMergeOnReadTableClustering extends SparkClientFunctionalTes
 
   private static Stream<Arguments> testClustering() {
     return Stream.of(
-        Arguments.of(true, true, true),
-        Arguments.of(true, true, false),
-        Arguments.of(true, false, true),
-        Arguments.of(true, false, false),
-        Arguments.of(false, true, true),
-        Arguments.of(false, true, false),
-        Arguments.of(false, false, true),
-        Arguments.of(false, false, false)
-    );
+        Arrays.asList(true, true, true),
+        Arrays.asList(true, true, false),
+        Arrays.asList(true, false, true),
+        Arrays.asList(true, false, false),
+        Arrays.asList(false, true, true),
+        Arrays.asList(false, true, false),
+        Arrays.asList(false, false, true),
+        Arrays.asList(false, false, false))
+        .flatMap(arguments -> {
+          ArrayList<Boolean> enableRowClusteringArgs = new ArrayList<>();
+          enableRowClusteringArgs.add(true);
+          enableRowClusteringArgs.addAll(arguments);
+          ArrayList<Boolean> disableRowClusteringArgs = new ArrayList<>();
+          disableRowClusteringArgs.add(false);

Review Comment:
   Appreciate your intent, but if we chained every parameter like that, this code would become unreadable. Let's just add one more column for the parameter (and add a comment at the top enumerating all the params in this matrix).
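
A minimal sketch of the flat matrix suggested above, written in plain Java without JUnit so that it stands alone. The column names in the comment are placeholders, since the actual parameter names of `testClustering` are not shown in this hunk; in the real test each row would presumably be an `Arguments.of(...)` call as in the removed lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical stand-in for the parameter source of the clustering test.
final class ClusteringTestMatrixSketch {

  // Columns (placeholder names):
  //   rowClusteringEnabled, paramA, paramB, paramC
  static Stream<List<Boolean>> testClustering() {
    return Stream.of(
        Arrays.asList(true, true, true, true),
        Arrays.asList(true, true, true, false),
        Arrays.asList(true, true, false, true),
        Arrays.asList(true, true, false, false),
        Arrays.asList(true, false, true, true),
        Arrays.asList(true, false, true, false),
        Arrays.asList(true, false, false, true),
        Arrays.asList(true, false, false, false),
        Arrays.asList(false, true, true, true),
        Arrays.asList(false, true, true, false),
        Arrays.asList(false, true, false, true),
        Arrays.asList(false, true, false, false),
        Arrays.asList(false, false, true, true),
        Arrays.asList(false, false, true, false),
        Arrays.asList(false, false, false, true),
        Arrays.asList(false, false, false, false));
  }

  public static void main(String[] args) {
    // Print every row of the matrix, one combination per line.
    testClustering().forEach(System.out::println);
  }
}
```

The matrix is longer than the `flatMap` version, but every combination is visible at a glance and the header comment documents what each column means.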



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -110,44 +110,9 @@ class DefaultSource extends RelationProvider
     val isBootstrappedTable = metaClient.getTableConfig.getBootstrapBasePath.isPresent

Review Comment:
   Let's move these statements (up to line #109) into `DefaultSource.createRelation` to make a cleaner cut (all these vars are used in that method rather than here)



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##########
@@ -147,6 +150,15 @@ trait SparkAdapter extends Serializable {
    */
   def createInterpretedPredicate(e: Expression): InterpretedPredicate
 
+  /**
+   * Create Hoodie relation based on globPaths, otherwise use tablePath if it's empty
+   */
+  def createRelation(metaClient: HoodieTableMetaClient,
+                     sqlContext: SQLContext,

Review Comment:
   nit: Rule of thumb is to put anything context-like as first param (`SQLContext`)



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/JavaSparkAdaptorSupport.java:
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.sql.hudi.SparkAdapter;
+
+/**
+ * Java implementation to provide SparkAdapter when we need to adapt
+ * the difference between spark2 and spark3.
+ */
+public class JavaSparkAdaptorSupport {
+
+  private JavaSparkAdaptorSupport() {}
+
+  private static class AdapterSupport {
+
+    private static final SparkAdapter ADAPTER = new AdapterSupport().sparkAdapter();
+
+    private SparkAdapter sparkAdapter() {
+      String adapterClass;
+      if (HoodieSparkUtils.isSpark3_3()) {

Review Comment:
   We should keep this one in Scala though -- see my comment below
   
   



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -160,9 +160,6 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
         fileFormat = fileFormat,
         optParams)(sparkSession)
     } else {
-      val readPathsStr = optParams.get(DataSourceReadOptions.READ_PATHS.key)

Review Comment:
   This will be globbed w/in DataSource:
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L569



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251790938

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425",
       "triggerID" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429",
       "triggerID" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1587f472f18d7b524971637abe64d171c9799818",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432",
       "triggerID" : "1587f472f18d7b524971637abe64d171c9799818",
       "triggerType" : "PUSH"
     }, {
       "hash" : "20f64af242ac3e6df5d1555edf0766e7dcdd698a",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470",
       "triggerID" : "20f64af242ac3e6df5d1555edf0766e7dcdd698a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2ff0b70e69fcff7cd061a2512dc983ac92a3c87c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512",
       "triggerID" : "2ff0b70e69fcff7cd061a2512dc983ac92a3c87c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470) 
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251870554

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) 
   * e75f6d0031490025107040c1b0093c3c5720a67d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11519) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249889804

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423) 
   * 988e4874af3065d6879f9adc40c7483a84467f72 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425) 
   * f2bb9e61707199197f30eef79e80db3e1241b3a0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429) 
   * 1587f472f18d7b524971637abe64d171c9799818 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1200843311

   ## CI report:
   
   * dd79426d234315d90b2deffcd54dc0e9ab43e38e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843) 
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249928256

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 988e4874af3065d6879f9adc40c7483a84467f72 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425) 
   * f2bb9e61707199197f30eef79e80db3e1241b3a0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429) 
   * 1587f472f18d7b524971637abe64d171c9799818 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1236612392

   ## CI report:
   
   * 1c913457d2dd531fd1ecae6b0d60e600f59e261b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760) 
   * b8e848d0f8b32ff3c75762951e3af4c911419927 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1250199140

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425",
       "triggerID" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429",
       "triggerID" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1587f472f18d7b524971637abe64d171c9799818",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432",
       "triggerID" : "1587f472f18d7b524971637abe64d171c9799818",
       "triggerType" : "PUSH"
     }, {
       "hash" : "20f64af242ac3e6df5d1555edf0766e7dcdd698a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470",
       "triggerID" : "20f64af242ac3e6df5d1555edf0766e7dcdd698a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 1587f472f18d7b524971637abe64d171c9799818 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432) 
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1252245568

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 46004b031d07d220812c5cdb19e9cd66552ceacc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11522) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1252177928

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * e75f6d0031490025107040c1b0093c3c5720a67d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11519) 
   * 46004b031d07d220812c5cdb19e9cd66552ceacc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11522) 
   


[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r945546375


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -131,6 +161,53 @@ public abstract HoodieData<WriteStatus> performClusteringWithRecordsRDD(final Ho
                                                                        final Map<String, String> strategyParams, final Schema schema,
                                                                        final List<HoodieFileGroupId> fileGroupIdList, final boolean preserveHoodieMetadata);
 
+  protected HoodieData<WriteStatus> performRowWrite(Dataset<Row> inputRecords, Map<String, String> parameters) {
+    String uuid = UUID.randomUUID().toString();
+    parameters.put(HoodieWriteConfig.BULKINSERT_ROW_IDENTIFY_ID.key(), uuid);
+    try {
+      inputRecords.write()
+          .format("hudi")
+          .options(JavaConverters.mapAsScalaMapConverter(parameters).asScala())
+          .mode(SaveMode.Append)
+          .save(getWriteConfig().getBasePath());
+      List<WriteStatus> writeStatusList = HoodieInternalWriteStatusCoordinator.get().getWriteStatuses(uuid)
+          .stream()
+          .map(internalWriteStatus -> {
+            WriteStatus status = new WriteStatus(
+                internalWriteStatus.isTrackSuccessRecords(), internalWriteStatus.getFailureFraction());
+            status.setFileId(internalWriteStatus.getFileId());
+            status.setTotalRecords(internalWriteStatus.getTotalRecords());
+            status.setPartitionPath(internalWriteStatus.getPartitionPath());
+            status.setStat(internalWriteStatus.getStat());
+            return status;
+          }).collect(Collectors.toList());
+      return getEngineContext().parallelize(writeStatusList);
+    } finally {
+      HoodieInternalWriteStatusCoordinator.get().removeStatuses(uuid);
+    }
+  }
+
+  protected Map<String, String> buildHoodieRowParameters(int numOutputGroups, String instantTime, Map<String, String> strategyParams, boolean preserveHoodieMetadata) {
+    HashMap<String, String> params = new HashMap<>();
+    HoodieWriteConfig writeConfig = getWriteConfig();
+    params.put(HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key(), String.valueOf(numOutputGroups));
+    params.put(HoodieWriteConfig.BULKINSERT_ROW_AUTO_COMMIT.key(), String.valueOf(false));

Review Comment:
   Why false? The default value of this config is true.
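
   For readers following the diff above: `performRowWrite` tags the write with a random identifier, reads the resulting statuses back from a shared coordinator, and always clears them in a `finally` block. The sketch below illustrates only that hand-off pattern. It is not Hudi's implementation: the class `StatusRegistrySketch`, the method `assignStatuses`, and the use of plain strings in place of write-status objects are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a write-status coordinator; not Hudi's actual class. */
public class StatusRegistrySketch {
  private static final StatusRegistrySketch INSTANCE = new StatusRegistrySketch();

  // Statuses are keyed by the identifier attached to one write; strings stand in for status objects.
  private final ConcurrentHashMap<String, List<String>> statuses = new ConcurrentHashMap<>();

  private StatusRegistrySketch() {
  }

  public static StatusRegistrySketch get() {
    return INSTANCE;
  }

  /** Assumed writer-side hook: publish the statuses produced under the given identifier. */
  public void assignStatuses(String id, List<String> writeStatuses) {
    statuses.put(id, new ArrayList<>(writeStatuses));
  }

  /** Caller-side lookup; returns an empty list when nothing was published for the identifier. */
  public List<String> getWriteStatuses(String id) {
    return statuses.getOrDefault(id, Collections.emptyList());
  }

  public void removeStatuses(String id) {
    statuses.remove(id);
  }

  public int size() {
    return statuses.size();
  }

  /** Mirrors the shape of the diff: tag the write, read the statuses back, always clean up. */
  public static List<String> runWrite(List<String> produced) {
    String id = UUID.randomUUID().toString();
    try {
      // Stand-in for the datasource write: the writer publishes its statuses under the identifier.
      get().assignStatuses(id, produced);
      return new ArrayList<>(get().getWriteStatuses(id));
    } finally {
      // Without this cleanup, entries would accumulate in the shared map across writes.
      get().removeStatuses(id);
    }
  }

  public static void main(String[] args) {
    List<String> result = runWrite(Arrays.asList("file-1", "file-2"));
    System.out.println(result);
    System.out.println(get().size());
  }
}
```

   The caller copies the list before the `finally` block runs, so the returned statuses stay valid after the registry entry is removed.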





[GitHub] [hudi] xiarixiaoyao commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214811767

   @alexeykudinkin could you please help review this PR? Thanks very much.




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r963194835


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieInternalWriteStatusCoordinator.java:
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client;
+
+import java.util.List;
+import java.util.concurrent.ConcurrentHashMap;
+
+public class HoodieInternalWriteStatusCoordinator {

Review Comment:
   done





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1238883968

   Hey @alexeykudinkin, I've addressed all comments. Could you please review again?




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1174908752

   ## CI report:
   
   * 58cf2096e648ccc8c7e7c563003753ce89a90261 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730) 
   


[GitHub] [hudi] xiarixiaoyao commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1179289240

   nice work!
   




[GitHub] [hudi] xiarixiaoyao commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1184600898

   @boneanxs my WeChat is 1037817390; let's discuss this PR on WeChat first. I think we can launch this PR in 0.12.




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1206031953

   ## CI report:
   
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486) 
   * 5a6ac9622379715e890f1ec1cd7be9422febeb5c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1236671429

   ## CI report:
   
   * 1c913457d2dd531fd1ecae6b0d60e600f59e261b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760) 
   * b8e848d0f8b32ff3c75762951e3af4c911419927 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151) 
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1236665766

   ## CI report:
   
   * 1c913457d2dd531fd1ecae6b0d60e600f59e261b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760) 
   * b8e848d0f8b32ff3c75762951e3af4c911419927 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151) 
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1207888573

   ## CI report:
   
   * e216664929bd2e01bc1eafa564ec4ebc745b1c34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665) 
   


[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r946549917


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -131,6 +161,53 @@ public abstract HoodieData<WriteStatus> performClusteringWithRecordsRDD(final Ho
                                                                        final Map<String, String> strategyParams, final Schema schema,
                                                                        final List<HoodieFileGroupId> fileGroupIdList, final boolean preserveHoodieMetadata);
 
+  protected HoodieData<WriteStatus> performRowWrite(Dataset<Row> inputRecords, Map<String, String> parameters) {
+    String uuid = UUID.randomUUID().toString();
+    parameters.put(HoodieWriteConfig.BULKINSERT_ROW_IDENTIFY_ID.key(), uuid);
+    try {
+      inputRecords.write()
+          .format("hudi")
+          .options(JavaConverters.mapAsScalaMapConverter(parameters).asScala())
+          .mode(SaveMode.Append)
+          .save(getWriteConfig().getBasePath());
+      List<WriteStatus> writeStatusList = HoodieInternalWriteStatusCoordinator.get().getWriteStatuses(uuid)
+          .stream()
+          .map(internalWriteStatus -> {
+            WriteStatus status = new WriteStatus(
+                internalWriteStatus.isTrackSuccessRecords(), internalWriteStatus.getFailureFraction());
+            status.setFileId(internalWriteStatus.getFileId());
+            status.setTotalRecords(internalWriteStatus.getTotalRecords());
+            status.setPartitionPath(internalWriteStatus.getPartitionPath());
+            status.setStat(internalWriteStatus.getStat());
+            return status;
+          }).collect(Collectors.toList());
+      return getEngineContext().parallelize(writeStatusList);
+    } finally {
+      HoodieInternalWriteStatusCoordinator.get().removeStatuses(uuid);
+    }
+  }
+
+  protected Map<String, String> buildHoodieRowParameters(int numOutputGroups, String instantTime, Map<String, String> strategyParams, boolean preserveHoodieMetadata) {
+    HashMap<String, String> params = new HashMap<>();
+    HoodieWriteConfig writeConfig = getWriteConfig();
+    params.put(HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key(), String.valueOf(numOutputGroups));
+    params.put(HoodieWriteConfig.BULKINSERT_ROW_AUTO_COMMIT.key(), String.valueOf(false));

Review Comment:
   This is a new `AUTO_COMMIT` config specifically for `BULKINSERT`. I wanted to use `hoodie.auto.commit` at first, but I found that it is set to false when building HoodieConfig (not sure why; see https://github.com/apache/hudi/pull/1834#discussion_r907976925). Using `hoodie.auto.commit` here could therefore make a lot of tests fail, because bulk_insert would then not auto commit by default. So I introduced a new config to bypass this.
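
For readers following the quoted hunk: with auto commit turned off for the row write, `performRowWrite` has to fetch the write statuses itself, and it does so through a registry keyed by a random UUID that is cleaned up in a `finally` block. Below is a minimal sketch of that hand-off pattern only. `StatusRegistry`, `RowWriteSketch`, and the use of plain strings as "statuses" are hypothetical names made up for illustration; this is not Hudi's `HoodieInternalWriteStatusCoordinator` and makes no claim about its real API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a write-status coordinator (illustration only).
final class StatusRegistry {
  private static final StatusRegistry INSTANCE = new StatusRegistry();
  private final ConcurrentHashMap<String, List<String>> statuses = new ConcurrentHashMap<>();

  static StatusRegistry get() {
    return INSTANCE;
  }

  // Writer side: publish the statuses produced for one write, under its identifier.
  void assignStatuses(String id, List<String> written) {
    statuses.put(id, new ArrayList<>(written));
  }

  // Caller side: look up the statuses published under the identifier.
  List<String> getStatuses(String id) {
    return statuses.getOrDefault(id, Collections.emptyList());
  }

  // Drop the entry so the process-wide map does not grow without bound.
  void removeStatuses(String id) {
    statuses.remove(id);
  }

  int size() {
    return statuses.size();
  }
}

final class RowWriteSketch {
  // Simulates a write API that returns nothing to the caller,
  // so results have to travel through the registry instead.
  static void write(List<String> files, String id) {
    StatusRegistry.get().assignStatuses(id, files);
  }

  static List<String> performRowWrite(List<String> files) {
    String id = UUID.randomUUID().toString();
    try {
      write(files, id);
      // Copy before cleanup so the caller owns the returned list.
      return new ArrayList<>(StatusRegistry.get().getStatuses(id));
    } finally {
      // Runs on success and on failure alike.
      StatusRegistry.get().removeStatuses(id);
    }
  }
}
```

The random identifier keeps concurrent writes from reading each other's statuses, and the `finally` block keeps a failed write from leaving a stale entry behind.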





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249533624

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423) 
   * 988e4874af3065d6879f9adc40c7483a84467f72 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425) 
   * f2bb9e61707199197f30eef79e80db3e1241b3a0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429) 
   * 1587f472f18d7b524971637abe64d171c9799818 UNKNOWN
   


[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r972814389


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +330,60 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))

Review Comment:
   done





[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r974818835


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -160,9 +160,6 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
         fileFormat = fileFormat,
         optParams)(sparkSession)
     } else {
-      val readPathsStr = optParams.get(DataSourceReadOptions.READ_PATHS.key)

Review Comment:
   I see, thanks for clarifying!





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251867212

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) 
   * e75f6d0031490025107040c1b0093c3c5720a67d UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1253213672

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425",
       "triggerID" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429",
       "triggerID" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1587f472f18d7b524971637abe64d171c9799818",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432",
       "triggerID" : "1587f472f18d7b524971637abe64d171c9799818",
       "triggerType" : "PUSH"
     }, {
       "hash" : "20f64af242ac3e6df5d1555edf0766e7dcdd698a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470",
       "triggerID" : "20f64af242ac3e6df5d1555edf0766e7dcdd698a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2ff0b70e69fcff7cd061a2512dc983ac92a3c87c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512",
       "triggerID" : "2ff0b70e69fcff7cd061a2512dc983ac92a3c87c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e75f6d0031490025107040c1b0093c3c5720a67d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11519",
       "triggerID" : "e75f6d0031490025107040c1b0093c3c5720a67d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "46004b031d07d220812c5cdb19e9cd66552ceacc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11522",
       "triggerID" : "46004b031d07d220812c5cdb19e9cd66552ceacc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "af01ca4e55cc478dc252fb443c525c15780eefa9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11540",
       "triggerID" : "af01ca4e55cc478dc252fb443c525c15780eefa9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * af01ca4e55cc478dc252fb443c525c15780eefa9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11540) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r971286665


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -148,29 +184,34 @@ protected BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> getPartitioner(Map<Str
       switch (layoutOptStrategy) {
         case ZORDER:
         case HILBERT:
-          return new RDDSpatialCurveSortPartitioner(
+          return isRowPartitioner
+              ? new RowSpatialCurveSortPartitioner(getWriteConfig())
+              : new RDDSpatialCurveSortPartitioner(
               (HoodieSparkEngineContext) getEngineContext(),
               orderByColumns,
               layoutOptStrategy,
               getWriteConfig().getLayoutOptimizationCurveBuildMethod(),
               HoodieAvroUtils.addMetadataFields(schema));
         case LINEAR:
-          return new RDDCustomColumnsSortPartitioner(orderByColumns, HoodieAvroUtils.addMetadataFields(schema),
+          return isRowPartitioner
+              ? new RowCustomColumnsSortPartitioner(orderByColumns)
+              : new RDDCustomColumnsSortPartitioner(orderByColumns, HoodieAvroUtils.addMetadataFields(schema),
               getWriteConfig().isConsistentLogicalTimestampEnabled());
         default:
           throw new UnsupportedOperationException(String.format("Layout optimization strategy '%s' is not supported", layoutOptStrategy));
       }
-    }).orElse(BulkInsertInternalPartitionerFactory.get(getWriteConfig().getBulkInsertSortMode()));
+    }).orElse(isRowPartitioner ? BulkInsertInternalPartitionerWithRowsFactory.get(getWriteConfig().getBulkInsertSortMode()) :
+        BulkInsertInternalPartitionerFactory.get(getWriteConfig().getBulkInsertSortMode()));
   }
 
   /**
-   * Submit job to execute clustering for the group.
+   * Submit job to execute clustering for the group with RDD APIs.
    */
-  private CompletableFuture<HoodieData<WriteStatus>> runClusteringForGroupAsync(HoodieClusteringGroup clusteringGroup, Map<String, String> strategyParams,
-                                                                             boolean preserveHoodieMetadata, String instantTime) {
+  private CompletableFuture<HoodieData<WriteStatus>> runClusteringForGroupAsyncWithRDD(HoodieClusteringGroup clusteringGroup, Map<String, String> strategyParams,

Review Comment:
   The `RDD` suffix is misleading, though: the Row-based variants also rely on RDDs internally.





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1181277681

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * dd79426d234315d90b2deffcd54dc0e9ab43e38e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1200848098

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * dd79426d234315d90b2deffcd54dc0e9ab43e38e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843) 
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1206029376

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486) 
   * 5a6ac9622379715e890f1ec1cd7be9422febeb5c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r945558006


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +398,62 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> pairs = new ArrayList<>();

Review Comment:
   The name `pairs` is ambiguous.





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214672107

   The CI failure seems unrelated to this PR.
   
   Thanks to @voonhous, who tested two cases that cluster individual parquet files of ~500MB into groups of up to 10GB.
   
   After enabling `hoodie.clustering.as.row`, clustering runtime dropped by roughly 30% or more in these runs.
   
   ### Test 1
   | clustering as row enabled |Partition hour| total size | runtime(min) |
   | :----:| :----:|:----:|:----:|
   |true|dt=2022-07-28/hh=23|2.0T|76|
   |false|dt=2022-07-28/hh=00|2.0T|123|
   
   ### Test 2
   | clustering as row enabled |Partition hour| total size | File Count | runtime(min) |
   | :----:| :----:|:----:|:----:|:----:|
   |true|dt=2022-07-28/hh=14|2.5T|7792|92|
   |false|dt=2022-07-28/hh=15|2.5T|7771|128|
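   
   As a rough check of the figures above, the relative runtime reduction can be computed from the two tables. This is only a sketch: the class and method names are illustrative, and the enabled and disabled runs were taken on different partitions of equal size, so the comparison is indicative rather than exact.
   
   ```java
   public class RuntimeReduction {
     // Percentage of runtime saved when going from `before` to `after` minutes.
     static double reductionPercent(double before, double after) {
       return (before - after) / before * 100.0;
     }
   
     public static void main(String[] args) {
       // Runtimes (in minutes) taken from the Test 1 and Test 2 tables.
       System.out.println("Test 1 reduction (%): " + reductionPercent(123, 76));
       System.out.println("Test 2 reduction (%): " + reductionPercent(128, 92));
     }
   }
   ```
   
   With the table values, this works out to about 38% for Test 1 and about 28% for Test 2.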
   
   The Spark configuration used:
   
   ```bash
       spark-submit \
       --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
       --conf 'spark.rpc.askTimeout=600s' \
       --conf 'spark.driver.extraJavaOptions=-Djava.util.concurrent.ForkJoinPool.common.parallelism=250' \
       --conf 'spark.sql.parquet.columnarReaderBatchSize=1024' \
       --conf 'spark.yarn.maxAppAttempts=1' \
       --num-executors 64 \
       --driver-memory 20g \
       --driver-cores 1 \
       --executor-memory 15g \
       --executor-cores 2 \
       --class org.apache.hudi.utilities.HoodieClusteringJob \
       hudi-utilities-bundle_2.12-0.12.0-SNAPSHOT.jar \
       --props hdfs://test/2022-07-24_clustering/clusteringjob_optimized.properties \
       --mode scheduleAndExecute \
       --base-path hdfs://test/test/hudi/voon_kafka_test__test_hudi_011_04/ \
       --table-name rank_server_log_hudi_test_1h \
       --spark-memory 15g \
       --parallelism 32
   ```
   
   clusteringjob.properties
   
   ```bash
   hoodie.clustering.async.enabled=true
   hoodie.clustering.async.max.commits=2
   hoodie.clustering.plan.strategy.max.bytes.per.group=10737418240
   hoodie.clustering.plan.strategy.target.file.max.bytes=11811160064
   hoodie.clustering.plan.strategy.small.file.limit=6442450944
   hoodie.clustering.plan.strategy.max.num.groups=10000
   hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
   hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
   hoodie.clustering.plan.strategy.cluster.begin.partition=dt=2022-07-28/hh=15
   hoodie.clustering.plan.strategy.cluster.end.partition=dt=2022-07-28/hh=15
   hoodie.clustering.plan.strategy.sort.columns=partition,offset
   ```
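   
   The byte-valued settings above are easier to read as GiB. A small sketch of the conversion (the class and helper names are illustrative, not part of Hudi):
   
   ```java
   public class ClusteringSizes {
     static final long GIB = 1024L * 1024L * 1024L;
   
     // Converts a byte count to whole GiB (integer division).
     static long toGiB(long bytes) {
       return bytes / GIB;
     }
   
     public static void main(String[] args) {
       // hoodie.clustering.plan.strategy.max.bytes.per.group
       System.out.println(toGiB(10737418240L)); // prints 10
       // hoodie.clustering.plan.strategy.target.file.max.bytes
       System.out.println(toGiB(11811160064L)); // prints 11
       // hoodie.clustering.plan.strategy.small.file.limit
       System.out.println(toGiB(6442450944L));  // prints 6
     }
   }
   ```
   
   So each clustering group is capped at 10 GiB, target files at 11 GiB, and files under 6 GiB count as small.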
   
   Gentle ping @xiarixiaoyao @XuQianJin-Stars @codope: could you help review this when you have time?




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r972822536


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -784,22 +785,6 @@ object DataSourceOptionsHelper {
     ) ++ translateConfigurations(paramsWithGlobalProps)
   }
 
-  def inferKeyGenClazz(props: TypedProperties): String = {

Review Comment:
   Moved this logic to `HoodieSparkKeyGeneratorFactory`, since `HoodieDatasetBulkInsertHelper` needs it to determine the key generator class.





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249163148

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249538850

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425",
       "triggerID" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429",
       "triggerID" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1587f472f18d7b524971637abe64d171c9799818",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432",
       "triggerID" : "1587f472f18d7b524971637abe64d171c9799818",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423) 
   * 988e4874af3065d6879f9adc40c7483a84467f72 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425) 
   * f2bb9e61707199197f30eef79e80db3e1241b3a0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429) 
   * 1587f472f18d7b524971637abe64d171c9799818 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1240240348

   @boneanxs will do




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r970249926


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BulkInsertDataInternalWriterHelper.java:
##########
@@ -118,7 +126,8 @@ public BulkInsertDataInternalWriterHelper(HoodieTable hoodieTable, HoodieWriteCo
   private Option<BuiltinKeyGenerator> getKeyGenerator(Properties properties) {
     TypedProperties typedProperties = new TypedProperties();
     typedProperties.putAll(properties);
-    if (properties.get(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME().key()).equals(NonpartitionedKeyGenerator.class.getName())) {
+    if (Option.ofNullable(properties.get(HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key()))

Review Comment:
   This follows the old logic but fixes a potential NPE: if `properties` doesn't contain `KEYGENERATOR_CLASS_NAME`, `properties.get(...)` returns null and the `.equals(...)` call would throw.
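
To make the failure mode concrete, here is a minimal sketch of the null-safe lookup. It is an illustration under assumptions, not the PR's code: `java.util.Optional` stands in for Hudi's `Option`, and the class name, method names, and the key `example.keygenerator.class` are placeholders invented for this example.

```java
import java.util.Optional;
import java.util.Properties;

class KeyGeneratorCheck {
  // Placeholder key for illustration; not necessarily the key Hudi uses.
  static final String KEYGEN_KEY = "example.keygenerator.class";

  // Unsafe: get(...) returns null when the key is absent,
  // so calling equals on the result throws NullPointerException.
  static boolean unsafeMatches(Properties props, String expectedClass) {
    return props.get(KEYGEN_KEY).equals(expectedClass);
  }

  // Safe: wrap the possibly-null value first; an absent key yields false.
  static boolean safeMatches(Properties props, String expectedClass) {
    return Optional.ofNullable(props.get(KEYGEN_KEY))
        .map(value -> value.equals(expectedClass))
        .orElse(false);
  }
}
```

With an empty `Properties`, `safeMatches` returns `false`, while `unsafeMatches` throws.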





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249958059

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * f2bb9e61707199197f30eef79e80db3e1241b3a0 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429) 
   * 1587f472f18d7b524971637abe64d171c9799818 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11432) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r971405570


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +330,60 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      String[] paths = CollectionUtils.combine(baseFilePaths, deltaPaths);
+      inputRecords = sqlContext.read()

Review Comment:
   Good idea! I'll give it a try.





[GitHub] [hudi] nsivabalan merged pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
nsivabalan merged PR #6046:
URL: https://github.com/apache/hudi/pull/6046




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r977019700


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -275,6 +345,66 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                HoodieClusteringGroup clusteringGroup,
+                                                String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    Path[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .map(Path::new)
+        .toArray(Path[]::new);
+
+    HashMap<String, String> params = new HashMap<>();
+    params.put("hoodie.datasource.query.type", "snapshot");
+    params.put("as.of.instant", instantTime);
+
+    Path[] paths;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      params.put("compaction.memory.fraction", compactionFractor);
+
+      Path[] deltaPaths = clusteringOps
+          .stream()
+          .filter(op -> !op.getDeltaFilePaths().isEmpty())
+          .flatMap(op -> op.getDeltaFilePaths().stream())
+          .map(Path::new)
+          .toArray(Path[]::new);
+      paths = CollectionUtils.combine(baseFilePaths, deltaPaths);
+    } else {
+      paths = baseFilePaths;
+    }
+
+    String readPathString = String.join(",", Arrays.stream(paths).map(Path::toString).toArray(String[]::new));
+    params.put("hoodie.datasource.read.paths", readPathString);
+    // Building HoodieFileIndex needs this param to decide query path
+    params.put("glob.paths", readPathString);
+
+    // Let Hudi relations to fetch the schema from the table itself
+    BaseRelation relation = SparkAdapterSupport$.MODULE$.sparkAdapter()

Review Comment:
   :+1:





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1260281685

   > can you please create a Jira corresponding to your investigation and link it in here? So that it's easier to discover it
   
   Yea, sure thing




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1207759021

   ## CI report:
   
   * 5a6ac9622379715e890f1ec1cd7be9422febeb5c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597) 
   * e216664929bd2e01bc1eafa564ec4ebc745b1c34 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r946552880


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +398,62 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> pairs = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            pairs.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            pairs.add(op.getDataFilePath());
+          }
+          return pairs;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      String[] paths = new String[baseFilePaths.length + deltaPaths.length];
+      System.arraycopy(baseFilePaths, 0, paths, 0, baseFilePaths.length);
+      System.arraycopy(deltaPaths, 0, paths, baseFilePaths.length, deltaPaths.length);
+      inputRecords = sqlContext.read()
+          .format("org.apache.hudi")
+          .option("hoodie.datasource.query.type", "snapshot")
+          .option("compaction.memory.fraction", compactionFractor)
+          .option("as.of.instant", instantTime)
+          .option("hoodie.datasource.read.paths", String.join(",", paths))
+          .load();
+    } else {
+      inputRecords = sqlContext.read()
+          .format("org.apache.hudi")
+          .option("as.of.instant", instantTime)

Review Comment:
   I'm setting this only to keep the commits aligned with the parquet files. Removing it should do no harm, but I think it's better to keep them aligned. What do you think?





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r951930671


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -98,10 +110,18 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final Hood
     // execute clustering for each group async and collect WriteStatus
     Stream<HoodieData<WriteStatus>> writeStatusesStream = FutureUtils.allOf(
         clusteringPlan.getInputGroups().stream()
-        .map(inputGroup -> runClusteringForGroupAsync(inputGroup,
-            clusteringPlan.getStrategy().getStrategyParams(),
-            Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false),
-            instantTime))
+            .map(inputGroup -> {
+              if (Boolean.parseBoolean(getWriteConfig().getString(HoodieClusteringConfig.CLUSTERING_AS_ROW))) {

Review Comment:
   Let's abstract this as a method in `WriteConfig`
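
A rough sketch of what such an accessor could look like. This is a hypothetical stand-in, not Hudi's `HoodieWriteConfig`: the class name, the method name `isClusteringAsRowEnabled`, and the key `example.clustering.as.row` are all assumptions made for illustration.

```java
import java.util.Properties;

class ExampleWriteConfig {
  // Hypothetical key, used only for this sketch.
  static final String CLUSTERING_AS_ROW_KEY = "example.clustering.as.row";

  private final Properties props;

  ExampleWriteConfig(Properties props) {
    this.props = props;
  }

  // Callers ask a typed question instead of parsing the raw string themselves.
  boolean isClusteringAsRowEnabled() {
    return Boolean.parseBoolean(props.getProperty(CLUSTERING_AS_ROW_KEY, "false"));
  }
}
```

The call site then reduces to `if (config.isClusteringAsRowEnabled()) { ... }`, and the parsing and default live in one place.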



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowSpatialCurveSortPartitioner.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.sort.SpaceCurveSortingHelper;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class RowSpatialCurveSortPartitioner extends RowCustomColumnsSortPartitioner {

Review Comment:
   Why do we inherit from `RowCustomColumnsSortPartitioner`?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +398,62 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> pairs = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            pairs.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            pairs.add(op.getDataFilePath());
+          }
+          return pairs;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      String[] paths = new String[baseFilePaths.length + deltaPaths.length];
+      System.arraycopy(baseFilePaths, 0, paths, 0, baseFilePaths.length);
+      System.arraycopy(deltaPaths, 0, paths, baseFilePaths.length, deltaPaths.length);

Review Comment:
   You can use `CollectionUtils.combine`
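
For reference, the two `System.arraycopy` calls collapse into a single helper call. The sketch below is a self-contained stand-in written for this example (the class `ArrayCombine` is hypothetical); it is not a copy of Hudi's `CollectionUtils.combine`, only an illustration of the same idea.

```java
import java.util.Arrays;

class ArrayCombine {
  // Concatenates two arrays into a new array with the runtime type of the first.
  static <T> T[] combine(T[] first, T[] second) {
    // copyOf allocates the larger array and copies the first array into it.
    T[] combined = Arrays.copyOf(first, first.length + second.length);
    // Append the second array right after the first one's elements.
    System.arraycopy(second, 0, combined, first.length, second.length);
    return combined;
  }
}
```

Usage would then read `String[] paths = ArrayCombine.combine(baseFilePaths, deltaPaths);`.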



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieInternalWriteStatusCoordinator.java:
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client;
+
+import java.util.List;
+import java.util.concurrent.ConcurrentHashMap;
+
+public class HoodieInternalWriteStatusCoordinator {

Review Comment:
   I appreciate the intent, but this component doesn't really make sense (it's essentially a global buffer allowing us to facilitate data flow we can't organize otherwise)
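
For readers unfamiliar with the pattern, a global buffer of this kind might look roughly like the sketch below. It is not the PR's class: the name `WriteStatusBuffer`, its methods, and the use of `String` in place of a write-status type are assumptions for illustration only.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

class WriteStatusBuffer {
  // Process-wide map: one entry per write, keyed by a caller-chosen identifier.
  private static final ConcurrentHashMap<String, List<String>> STATUSES =
      new ConcurrentHashMap<>();

  // Called on the write side once the statuses are known.
  static void put(String writeId, List<String> statuses) {
    STATUSES.put(writeId, statuses);
  }

  // Called by the initiator afterwards; removing the entry keeps the map from growing.
  static List<String> take(String writeId) {
    return STATUSES.remove(writeId);
  }
}
```

A static map like this is only visible inside one JVM, and the two sides are coupled through a shared identifier rather than through a return value, which is the data-flow concern raised here.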



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowSpatialCurveSortPartitioner.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.sort.SpaceCurveSortingHelper;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class RowSpatialCurveSortPartitioner extends RowCustomColumnsSortPartitioner {
+
+  private final String[] orderByColumns;
+  private final HoodieClusteringConfig.LayoutOptimizationStrategy layoutOptStrategy;
+  private final HoodieClusteringConfig.SpatialCurveCompositionStrategyType curveCompositionStrategyType;
+
+  public RowSpatialCurveSortPartitioner(HoodieWriteConfig config) {
+    super(config);
+    this.layoutOptStrategy = config.getLayoutOptimizationStrategy();
+    if (config.getClusteringSortColumns() != null) {
+      this.orderByColumns = Arrays.stream(config.getClusteringSortColumns().split(","))
+          .map(String::trim).toArray(String[]::new);
+    } else {
+      this.orderByColumns = getSortColumnNames();
+    }
+    this.curveCompositionStrategyType = config.getLayoutOptimizationCurveBuildMethod();
+  }
+
+  @Override
+  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputPartitions) {
+    return reorder(records, outputPartitions);

Review Comment:
   We need to separate out handling of partitioned table (for partitioned tables there's no point of ordering the records _across_ partitions, we should be ordering only w/in respective partitions; take a look at `RowCustomColumnsSortPartitioner` for an example)
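
The difference can be illustrated without Spark. The sketch below uses plain Java collections and a hypothetical `Rec` type; it shows one common way to keep ordering local to each partition (sort by partition path first, then by the sort column) and is not the logic of `RowCustomColumnsSortPartitioner` itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PartitionLocalSort {
  // Minimal stand-in for a record: partition path plus one sort column.
  static final class Rec {
    final String partition;
    final int sortKey;

    Rec(String partition, int sortKey) {
      this.partition = partition;
      this.sortKey = sortKey;
    }
  }

  // Orders by partition first so each partition's records stay contiguous,
  // then by the sort column inside each partition.
  static List<Rec> sortWithinPartitions(List<Rec> records) {
    List<Rec> sorted = new ArrayList<>(records);
    sorted.sort(Comparator.comparing((Rec r) -> r.partition)
        .thenComparingInt(r -> r.sortKey));
    return sorted;
  }
}
```

A sort on the sort column alone would interleave records from different partitions, so each output file could end up touching many partitions.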



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##########
@@ -153,13 +157,22 @@ private void writeRow(InternalRow row) {
       //          over again)
       UTF8String recordKey = row.getUTF8String(HoodieRecord.RECORD_KEY_META_FIELD_ORD);
       UTF8String partitionPath = row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD);
-      // This is the only meta-field that is generated dynamically, hence conversion b/w
-      // [[String]] and [[UTF8String]] is unavoidable
-      UTF8String seqId = UTF8String.fromString(seqIdGenerator.apply(GLOBAL_SEQ_NO.getAndIncrement()));
-
-      InternalRow updatedRow = new HoodieInternalRow(commitTime, seqId, recordKey,
-          partitionPath, fileName, row, true);
 
+      InternalRow updatedRow;
+      if (preserveMetadata) {
+        updatedRow = new HoodieInternalRow(row.getUTF8String(HoodieRecord.COMMIT_TIME_METADATA_FIELD_ORD),

Review Comment:
   You can reduce conditional to only the portion that differs (seqNo, commitTime)
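
As an illustration of the suggestion, the sketch below narrows the branch to the two values that actually differ. Plain `String`s stand in for `UTF8String`, the returned array stands in for `HoodieInternalRow`, and the class and method names are hypothetical.

```java
class MetaFieldSelection {
  // Returns {commitTime, seqId, recordKey, partitionPath}.
  static String[] metaFields(boolean preserveMetadata,
                             String existingCommitTime, String existingSeqId,
                             String newCommitTime, String newSeqId,
                             String recordKey, String partitionPath) {
    // Only these two values depend on the flag.
    String commitTime = preserveMetadata ? existingCommitTime : newCommitTime;
    String seqId = preserveMetadata ? existingSeqId : newSeqId;
    // Everything else is assembled once, outside the conditional.
    return new String[] {commitTime, seqId, recordKey, partitionPath};
  }
}
```

In the real handle the new sequence id is generated on demand, so there the non-preserving branch would compute it lazily rather than receive it as an argument.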



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -131,6 +161,53 @@ public abstract HoodieData<WriteStatus> performClusteringWithRecordsRDD(final Ho
                                                                        final Map<String, String> strategyParams, final Schema schema,
                                                                        final List<HoodieFileGroupId> fileGroupIdList, final boolean preserveHoodieMetadata);
 
+  protected HoodieData<WriteStatus> performRowWrite(Dataset<Row> inputRecords, Map<String, String> parameters) {
+    String uuid = UUID.randomUUID().toString();
+    parameters.put(HoodieWriteConfig.BULKINSERT_ROW_IDENTIFY_ID.key(), uuid);
+    try {
+      inputRecords.write()
+          .format("hudi")
+          .options(JavaConverters.mapAsScalaMapConverter(parameters).asScala())
+          .mode(SaveMode.Append)
+          .save(getWriteConfig().getBasePath());

Review Comment:
   We shouldn't be using the DataSource for writing in this case.
   
   First of all, it violates the layering of the integration: the Strategy is an internal component of the Spark DataSource integration and, as such, should not reference the component that encompasses it (the DS). The usual rule of thumb is that a component may interface only with other components within the same or adjacent layers.
   
   On top of that, it isn't strictly necessary: since we're only bulk-inserting back data that was already persisted (while reshaping its layout), and NOT new data, we can bypass all of the handling that occurs in the Spark DS and write the data directly using `HoodieBulkInsertDataInternalWriter`. If you take a look at the task that actually does the writing on the Spark side ([WriteToDataSourceV2Exec](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L418)), it's very simple; most of the complications stem from the need to commit the results of the operations, which aren't relevant to us in this case.
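
To illustrate how small such a per-task write loop can be, here is a rough plain-Java sketch. The `RowWriter` interface and `CountingWriter` are hypothetical and only loosely shaped like a write/commit/abort contract; they are not the Hudi or Spark interfaces, and the real `HoodieBulkInsertDataInternalWriter` API is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class DirectWriteLoopSketch {

  // Hypothetical minimal writer contract: write each row, then commit or abort.
  interface RowWriter<T, R> {
    void write(T row);
    R commit();
    void abort();
  }

  // Feed every row of one task's partition to the writer, then commit;
  // abort and rethrow if anything fails along the way.
  static <T, R> R writePartition(Iterator<T> rows, RowWriter<T, R> writer) {
    try {
      while (rows.hasNext()) {
        writer.write(rows.next());
      }
      return writer.commit();
    } catch (RuntimeException e) {
      writer.abort();
      throw e;
    }
  }

  // Toy writer that buffers rows in memory and reports how many it accepted.
  static final class CountingWriter implements RowWriter<String, Integer> {
    final List<String> buffered = new ArrayList<>();
    boolean aborted = false;

    @Override
    public void write(String row) {
      if (row == null) {
        throw new IllegalArgumentException("null row");
      }
      buffered.add(row);
    }

    @Override
    public Integer commit() {
      return buffered.size();
    }

    @Override
    public void abort() {
      aborted = true;
      buffered.clear();
    }
  }
}
```

In this sketch the value returned by `commit()` stands in for whatever per-task write status the caller collects; how those statuses are gathered and committed for a clustering instant is outside its scope.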



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -138,6 +138,13 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .sinceVersion("0.9.0")
       .withDocumentation("Config to control frequency of async clustering");
 
+  public static final ConfigProperty<Boolean> CLUSTERING_AS_ROW = ConfigProperty

Review Comment:
   I think we should reuse the existing config `hoodie.datasource.write.row.writer.enable` to control whether we go down the row-writing path, to avoid confusing users (the original config is generic).
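
As a small sketch of gating the clustering path on that existing key, assuming the options arrive as `java.util.Properties`: the helper name `useRowWriter` is hypothetical, and the fallback is passed in by the caller because this thread does not state the config's default.

```java
import java.util.Properties;

final class RowWriterFlagSketch {

  // Key quoted in the review comment above.
  static final String ROW_WRITER_ENABLE_KEY = "hoodie.datasource.write.row.writer.enable";

  // Returns the configured value, or the caller-supplied fallback when the key is absent.
  static boolean useRowWriter(Properties props, boolean fallback) {
    String raw = props.getProperty(ROW_WRITER_ENABLE_KEY);
    return raw == null ? fallback : Boolean.parseBoolean(raw.trim());
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(ROW_WRITER_ENABLE_KEY, "true");
    System.out.println(useRowWriter(props, false)); // prints true
  }
}
```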



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1236616369

   ## CI report:
   
   * 1c913457d2dd531fd1ecae6b0d60e600f59e261b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760) 
   * b8e848d0f8b32ff3c75762951e3af4c911419927 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151) 
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254932365

   > Did you try to re-run your benchmark after the changes we've made? If so, can you please paste the results in here
   
   Sure, will rerun the benchmark




[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254275167

   @boneanxs thank you very much for iterating on this one! Truly monumental effort!




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251932916

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) 
   * e75f6d0031490025107040c1b0093c3c5720a67d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11519) 
   * 46004b031d07d220812c5cdb19e9cd66552ceacc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11522) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1175153416

   ## CI report:
   
   * 58cf2096e648ccc8c7e7c563003753ce89a90261 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1180574406

   ## CI report:
   
   * dd79426d234315d90b2deffcd54dc0e9ab43e38e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843) 
   




[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r970247306


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +330,60 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      String[] paths = CollectionUtils.combine(baseFilePaths, deltaPaths);
+      inputRecords = sqlContext.read()

Review Comment:
   Yeah, I was also thinking we could change this as well when fixing the write path, but it turned out to be quite difficult: to access `DefaultSource` here, we might need to move a lot of dependent code (possibly the whole `hudi-spark-common` module, e.g. `BaseFileOnlyRelation`, `HoodieFileIndex`, etc.) into `hudi-spark-client`, and I'm not sure it would work even if we moved it; we might run into other problems.
   
   





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r971302485


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -273,6 +330,60 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContex
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                   HoodieClusteringGroup clusteringGroup,
+                                                   String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    String[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .toArray(String[]::new);
+    String[] deltaPaths = clusteringOps
+        .stream()
+        .filter(op -> !op.getDeltaFilePaths().isEmpty())
+        .flatMap(op -> op.getDeltaFilePaths().stream())
+        .toArray(String[]::new);
+
+    Dataset<Row> inputRecords;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      String[] paths = CollectionUtils.combine(baseFilePaths, deltaPaths);
+      inputRecords = sqlContext.read()

Review Comment:
   Good call. We shouldn't move any of these classes; we can use `SparkAdapter` to provide us with an interface for accessing them.
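
The general pattern behind this suggestion can be sketched in plain Java: the lower-level module declares an interface and resolves an implementation by class name at runtime, so it never needs a compile-time dependency on the higher-level module. All names below (`RelationReader`, `DefaultRelationReader`, `load`) are hypothetical; how `SparkAdapter` itself is wired up is not shown in this thread.

```java
import java.util.Arrays;
import java.util.List;

final class AdapterLookupSketch {

  // Declared in the lower-level module (hypothetical): callers see only this interface.
  interface RelationReader {
    List<String> read(String[] paths);
  }

  // Would live in the higher-level module (hypothetical); the loader below
  // never refers to it by name at compile time.
  static final class DefaultRelationReader implements RelationReader {
    public DefaultRelationReader() {
    }

    @Override
    public List<String> read(String[] paths) {
      // Toy behaviour: echo the requested paths instead of reading any data.
      return Arrays.asList(paths);
    }
  }

  // Resolve the implementation reflectively from its class name.
  static RelationReader load(String implClassName) {
    try {
      return (RelationReader) Class.forName(implClassName)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot load adapter " + implClassName, e);
    }
  }
}
```

With this shape the classes stay where they are, and only the small interface has to be visible from the module that hosts the clustering strategy.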





[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r970257203


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -148,29 +184,34 @@ protected BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> getPartitioner(Map<Str
       switch (layoutOptStrategy) {
         case ZORDER:
         case HILBERT:
-          return new RDDSpatialCurveSortPartitioner(
+          return isRowPartitioner
+              ? new RowSpatialCurveSortPartitioner(getWriteConfig())
+              : new RDDSpatialCurveSortPartitioner(
               (HoodieSparkEngineContext) getEngineContext(),
               orderByColumns,
               layoutOptStrategy,
               getWriteConfig().getLayoutOptimizationCurveBuildMethod(),
               HoodieAvroUtils.addMetadataFields(schema));
         case LINEAR:
-          return new RDDCustomColumnsSortPartitioner(orderByColumns, HoodieAvroUtils.addMetadataFields(schema),
+          return isRowPartitioner
+              ? new RowCustomColumnsSortPartitioner(orderByColumns)
+              : new RDDCustomColumnsSortPartitioner(orderByColumns, HoodieAvroUtils.addMetadataFields(schema),
               getWriteConfig().isConsistentLogicalTimestampEnabled());
         default:
           throw new UnsupportedOperationException(String.format("Layout optimization strategy '%s' is not supported", layoutOptStrategy));
       }
-    }).orElse(BulkInsertInternalPartitionerFactory.get(getWriteConfig().getBulkInsertSortMode()));
+    }).orElse(isRowPartitioner ? BulkInsertInternalPartitionerWithRowsFactory.get(getWriteConfig().getBulkInsertSortMode()) :
+        BulkInsertInternalPartitionerFactory.get(getWriteConfig().getBulkInsertSortMode()));
   }
 
   /**
-   * Submit job to execute clustering for the group.
+   * Submit job to execute clustering for the group with RDD APIs.
    */
-  private CompletableFuture<HoodieData<WriteStatus>> runClusteringForGroupAsync(HoodieClusteringGroup clusteringGroup, Map<String, String> strategyParams,
-                                                                             boolean preserveHoodieMetadata, String instantTime) {
+  private CompletableFuture<HoodieData<WriteStatus>> runClusteringForGroupAsyncWithRDD(HoodieClusteringGroup clusteringGroup, Map<String, String> strategyParams,

Review Comment:
   Yeah, will change it to stay consistent with the rest of the code (though I think adding the `RDD` suffix looks clearer, because it takes the same params and returns the same `HoodieData` as the `AsRow` method).





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249167991

   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1249297603

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10665",
       "triggerID" : "e216664929bd2e01bc1eafa564ec4ebc745b1c34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10760",
       "triggerID" : "1c913457d2dd531fd1ecae6b0d60e600f59e261b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11151",
       "triggerID" : "b8e848d0f8b32ff3c75762951e3af4c911419927",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a16d35ec42bf86e5759ebb155cad40e83aba9f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153",
       "triggerID" : "60ef51484364c4a7e8a0aa64817e96b4f5a277cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423",
       "triggerID" : "7300c9eb17c30e11aaeb9cd768b15585536ab5f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425",
       "triggerID" : "988e4874af3065d6879f9adc40c7483a84467f72",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429",
       "triggerID" : "f2bb9e61707199197f30eef79e80db3e1241b3a0",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 60ef51484364c4a7e8a0aa64817e96b4f5a277cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11153) 
   * 7300c9eb17c30e11aaeb9cd768b15585536ab5f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11423) 
   * 988e4874af3065d6879f9adc40c7483a84467f72 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11425) 
   * f2bb9e61707199197f30eef79e80db3e1241b3a0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11429) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1201038722

   @hudi-bot run azure


[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1206102844

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9730",
       "triggerID" : "58cf2096e648ccc8c7e7c563003753ce89a90261",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dd79426d234315d90b2deffcd54dc0e9ab43e38e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9843",
       "triggerID" : "1181254713",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dfd50cd0007c4ff48b3e0e27c368d573e47560a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486",
       "triggerID" : "1201038722",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597",
       "triggerID" : "5a6ac9622379715e890f1ec1cd7be9422febeb5c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a6ac9622379715e890f1ec1cd7be9422febeb5c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] xiarixiaoyao commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1183311635

   Thanks for your work. I will take a look tomorrow.

