Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/05/06 23:54:44 UTC

[GitHub] [hudi] satishkotha opened a new pull request #2918: [HUDI-1877] Add support in clustering to not change record location

satishkotha opened a new pull request #2918:
URL: https://github.com/apache/hudi/pull/2918


   ## What is the purpose of the pull request
   
   Add support for reusing the fileId in a clustering execution strategy. This is strategy-specific; the default is still to create new files.
   
   ## Brief change log
   Some datasets rely on an external index. For those, clustering cannot change the record location (because the external index doesn't support updates). We can still take advantage of clustering by doing 'local' sorting within each file. This PR adds support for such strategies.
   
   Also made small changes to how metadata is generated after clustering completes. (Metadata was previously generated twice; the redundant generation was removed for simplicity.)
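
The 'local' sorting idea from the change log can be sketched roughly as below. This is a minimal illustration rather than the PR's implementation: `LocalSortSketch`, `Row`, and `sortWithinEachFile` are hypothetical names, and records are reduced to a key plus one sort column. The point it shows is that every record stays under the same file ID, so an external index that maps keys to file IDs would not need updating.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: sort records inside each file group without moving them to another file.
class LocalSortSketch {

  // Minimal stand-in for a record: a key plus the column we sort on.
  static final class Row {
    final String key;
    final long sortColumn;

    Row(String key, long sortColumn) {
      this.key = key;
      this.sortColumn = sortColumn;
    }
  }

  // Returns a new map with the same file IDs; only the order of rows within each file changes.
  static Map<String, List<Row>> sortWithinEachFile(Map<String, List<Row>> rowsByFileId) {
    Map<String, List<Row>> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<Row>> entry : rowsByFileId.entrySet()) {
      List<Row> sorted = new ArrayList<>(entry.getValue());
      sorted.sort(Comparator.comparingLong((Row r) -> r.sortColumn));
      // Same file ID as the input, so the record location is unchanged.
      result.put(entry.getKey(), sorted);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<Row>> input = new LinkedHashMap<>();
    input.put("file-1", List.of(new Row("k3", 30L), new Row("k1", 10L)));
    input.put("file-2", List.of(new Row("k2", 20L)));
    Map<String, List<Row>> output = sortWithinEachFile(input);
    System.out.println(output.keySet());
    System.out.println(output.get("file-1").get(0).key);
  }
}
```

No records cross file boundaries here, which is also why such a strategy cannot merge small files.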
   
   ## Verify this pull request
   This change added tests.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#discussion_r638404203



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/CreateFixedFileHandleFactory.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.concurrent.atomic.AtomicBoolean;
+
+/**
+ * A HoodieCreateHandleFactory is used to write all data in the spark partition into a single data file.
+ *
+ * Please use this with caution. This can end up creating very large files if not used correctly.
+ */
+public class CreateFixedFileHandleFactory<T extends HoodieRecordPayload, I, K, O> extends WriteHandleFactory<T, I, K, O> {

Review comment:
       can we subclass this from CreateHandleFactory? or call this `SingleFileCreateHandleFactory`?

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/cluster/SparkExecuteClusteringCommitActionExecutor.java
##########
@@ -148,9 +151,12 @@ private void validateWriteResult(HoodieWriteMetadata<JavaRDD<WriteStatus>> write
       JavaSparkContext jsc = HoodieSparkEngineContext.getSparkContext(context);
       JavaRDD<HoodieRecord<? extends HoodieRecordPayload>> inputRecords = readRecordsForGroup(jsc, clusteringGroup);
       Schema readerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getSchema()));
+      List<HoodieFileGroupId> inputFileIds = clusteringGroup.getSlices().stream()

Review comment:
       so the input file ids are already in the serialized plan? This PR just passes this around additionally?

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/cluster/SparkExecuteClusteringCommitActionExecutor.java
##########
@@ -163,8 +169,10 @@ protected String getCommitActionType() {
 
   @Override
   protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
-    return ClusteringUtils.getFileGroupsFromClusteringPlan(clusteringPlan).collect(
-        Collectors.groupingBy(fg -> fg.getPartitionPath(), Collectors.mapping(fg -> fg.getFileId(), Collectors.toList())));
+    Set<HoodieFileGroupId> newFilesWritten = new HashSet(writeStatuses.map(s -> s.getFileId()).collect());

Review comment:
       rename: `newFileIds`

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/ClusteringExecutionStrategy.java
##########
@@ -51,7 +53,7 @@ public ClusteringExecutionStrategy(HoodieTable table, HoodieEngineContext engine
    * Note that commit is not done as part of strategy. commit is callers responsibility.
    */
   public abstract O performClustering(final I inputRecords, final int numOutputGroups, final String instantTime,
-                                      final Map<String, String> strategyParams, final Schema schema);
+                                      final Map<String, String> strategyParams, final Schema schema, final List<HoodieFileGroupId> inputFileIds);

Review comment:
       Can you please add Javadocs for this method explaining what each param is?
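
As a rough idea of what such Javadocs could look like, here is a simplified, self-contained sketch. It is not the Hudi class: `PerformClusteringJavadocSketch`, `Strategy`, and `IdentityStrategy` are hypothetical, Hudi and Avro types are replaced by JDK types, and the parameter descriptions are inferred from the parameter names in the diff, so they are assumptions rather than confirmed semantics.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the strategy contract; Hudi types are replaced by JDK types.
class PerformClusteringJavadocSketch {

  abstract static class Strategy<I, O> {
    /**
     * Rewrites the records of one clustering group. Committing is the caller's responsibility.
     *
     * @param inputRecords    records read from the file slices of the group
     * @param numOutputGroups number of output file groups to produce (assumed meaning)
     * @param instantTime     instant time of the clustering action
     * @param strategyParams  strategy-specific key/value parameters
     * @param schema          schema of the records (a plain string in this sketch)
     * @param inputFileIds    IDs of the file groups that supplied the input records
     * @return the output of the strategy, for the caller to commit
     */
    abstract O performClustering(I inputRecords, int numOutputGroups, String instantTime,
                                 Map<String, String> strategyParams, String schema,
                                 List<String> inputFileIds);
  }

  // Identity implementation, present only so the sketch can be compiled and called.
  static final class IdentityStrategy extends Strategy<List<String>, List<String>> {
    @Override
    List<String> performClustering(List<String> inputRecords, int numOutputGroups, String instantTime,
                                   Map<String, String> strategyParams, String schema,
                                   List<String> inputFileIds) {
      return inputRecords;
    }
  }

  public static void main(String[] args) {
    List<String> out = new IdentityStrategy()
        .performClustering(List.of("r1", "r2"), 1, "001", Map.of(), "{}", List.of("file-1"));
    System.out.println(out);
  }
}
```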

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateFixedHandle.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.Map;
+
+/**
+ * A HoodieCreateHandle which writes all data into a single file.

Review comment:
       This is a bit of a misnomer. Even HoodieCreateHandle only writes to a single file.
   
   Rename: HoodieUnboundedCreateHandle, or something that captures the intent that this does not respect the sizing aspects.

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateFixedHandle.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.Map;
+
+/**
+ * A HoodieCreateHandle which writes all data into a single file.
+ *
+ * Please use this with caution. This can end up creating very large files if not used correctly.
+ */
+public class HoodieCreateFixedHandle<T extends HoodieRecordPayload, I, K, O> extends HoodieCreateHandle<T, I, K, O> {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieCreateFixedHandle.class);
+
+  public HoodieCreateFixedHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable,
+                                 String partitionPath, String fileId, TaskContextSupplier taskContextSupplier) {
+    super(config, instantTime, hoodieTable, partitionPath, fileId, getWriterSchemaIncludingAndExcludingMetadataPair(config),
+        taskContextSupplier);
+  }
+
+  public HoodieCreateFixedHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable,
+                                 String partitionPath, String fileId, Pair<Schema, Schema> writerSchemaIncludingAndExcludingMetadataPair,
+                                 TaskContextSupplier taskContextSupplier) {
+    super(config, instantTime, hoodieTable, partitionPath, fileId, writerSchemaIncludingAndExcludingMetadataPair,
+        taskContextSupplier);
+  }
+
+  /**
+   * Called by the compactor code path.
+   */
+  public HoodieCreateFixedHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable,
+                                 String partitionPath, String fileId, Map<String, HoodieRecord<T>> recordMap,
+                                 TaskContextSupplier taskContextSupplier) {
+    this(config, instantTime, hoodieTable, partitionPath, fileId, taskContextSupplier);
+  }
+
+  @Override
+  public boolean canWrite(HoodieRecord record) {

Review comment:
       Can we just reuse CreateHandle with a large target file size, if we are doing all this for just one specific clustering strategy?

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/cluster/SparkExecuteClusteringCommitActionExecutor.java
##########
@@ -163,8 +169,10 @@ protected String getCommitActionType() {
 
   @Override
   protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
-    return ClusteringUtils.getFileGroupsFromClusteringPlan(clusteringPlan).collect(
-        Collectors.groupingBy(fg -> fg.getPartitionPath(), Collectors.mapping(fg -> fg.getFileId(), Collectors.toList())));
+    Set<HoodieFileGroupId> newFilesWritten = new HashSet(writeStatuses.map(s -> s.getFileId()).collect());
+    return ClusteringUtils.getFileGroupsFromClusteringPlan(clusteringPlan)
+        .filter(fg -> !newFilesWritten.contains(fg))

Review comment:
       sorry. not following. why do we need this filter?
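
One possible reading of the filter, which the thread itself does not confirm: when a strategy reuses the input file ID, that file group is rewritten in place and so may not belong in the list of replaced file groups. The sketch below illustrates that reading with hypothetical types (`ReplacedFileIdsSketch`, `FileGroupId`); it does not use the Hudi classes. The lookup only works when the set and the stream hold the same type with value equality.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

class ReplacedFileIdsSketch {

  // Hypothetical stand-in for a file group identifier with value equality.
  static final class FileGroupId {
    final String partitionPath;
    final String fileId;

    FileGroupId(String partitionPath, String fileId) {
      this.partitionPath = partitionPath;
      this.fileId = fileId;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof FileGroupId)) {
        return false;
      }
      FileGroupId other = (FileGroupId) o;
      return partitionPath.equals(other.partitionPath) && fileId.equals(other.fileId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(partitionPath, fileId);
    }
  }

  // Groups planned file groups by partition, skipping those that were written again under the same ID.
  static Map<String, List<String>> partitionToReplacedFileIds(List<FileGroupId> planned,
                                                              Set<FileGroupId> written) {
    return planned.stream()
        .filter(fg -> !written.contains(fg))
        .collect(Collectors.groupingBy(fg -> fg.partitionPath,
            Collectors.mapping(fg -> fg.fileId, Collectors.toList())));
  }

  public static void main(String[] args) {
    List<FileGroupId> planned = List.of(new FileGroupId("p1", "f1"), new FileGroupId("p1", "f2"));
    Set<FileGroupId> written = new HashSet<>(List.of(new FileGroupId("p1", "f1")));
    System.out.println(partitionToReplacedFileIds(planned, written));
  }
}
```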

##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java
##########
@@ -1167,7 +1177,7 @@ public void testPendingClusteringRollback() throws Exception {
     fileIdIntersection.retainAll(fileIds2);
     assertEquals(0, fileIdIntersection.size());
 
-    config = getConfigBuilder(HoodieFailedWritesCleaningPolicy.LAZY).withAutoCommit(completeClustering)
+    config = getConfigBuilder(HoodieFailedWritesCleaningPolicy.LAZY).withAutoCommit(false)

Review comment:
       so we don't honor `completeClustering` anymore? Not following why this change was needed

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/CreateFixedFileHandleFactory.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.concurrent.atomic.AtomicBoolean;
+
+/**
+ * A HoodieCreateHandleFactory is used to write all data in the spark partition into a single data file.
+ *
+ * Please use this with caution. This can end up creating very large files if not used correctly.
+ */
+public class CreateFixedFileHandleFactory<T extends HoodieRecordPayload, I, K, O> extends WriteHandleFactory<T, I, K, O> {
+
+  private AtomicBoolean isHandleCreated = new AtomicBoolean(false);
+  private String fileId;
+  
+  public CreateFixedFileHandleFactory(String fileId) {
+    super();
+    this.fileId = fileId;
+  }
+
+  @Override
+  public HoodieWriteHandle<T, I, K, O> create(final HoodieWriteConfig hoodieConfig, final String commitTime,

Review comment:
       Wondering why we need this, actually. Wouldn't just passing `Long.MAX_VALUE` as the target file size get the create handle to do this?
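
The suggestion can be illustrated with a toy model of a size-bounded handle. This is an assumption about the general mechanism, not a reproduction of the create handle in Hudi: `SizeBoundedHandleSketch`, `canWrite`, `write`, and `filesNeeded` are hypothetical. It shows that a handle which rolls over once a target size is reached never rolls over when the target is `Long.MAX_VALUE`.

```java
// Hypothetical model of a write handle that stops accepting records at a target size.
class SizeBoundedHandleSketch {

  private final long targetFileSizeBytes;
  private long bytesWritten = 0L;

  SizeBoundedHandleSketch(long targetFileSizeBytes) {
    this.targetFileSizeBytes = targetFileSizeBytes;
  }

  // True while the bytes written so far are below the target size.
  boolean canWrite() {
    return bytesWritten < targetFileSizeBytes;
  }

  void write(long recordSizeBytes) {
    bytesWritten += recordSizeBytes;
  }

  // Counts how many handles (files) are opened to write all records under the given target size.
  static int filesNeeded(long[] recordSizes, long targetFileSizeBytes) {
    int files = 0;
    SizeBoundedHandleSketch handle = null;
    for (long size : recordSizes) {
      if (handle == null || !handle.canWrite()) {
        handle = new SizeBoundedHandleSketch(targetFileSizeBytes);
        files++;
      }
      handle.write(size);
    }
    return files;
  }

  public static void main(String[] args) {
    long[] sizes = {40L, 40L, 40L, 40L};
    System.out.println(filesNeeded(sizes, 100L));
    System.out.println(filesNeeded(sizes, Long.MAX_VALUE));
  }
}
```

With a very large target, all records of a Spark partition land in one file, which is the behavior the separate factory was introduced for.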

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/cluster/SparkExecuteClusteringCommitActionExecutor.java
##########
@@ -257,12 +265,4 @@ protected String getCommitActionType() {
     return hoodieRecord;
   }
 
-  private HoodieWriteMetadata<JavaRDD<WriteStatus>> buildWriteMetadata(JavaRDD<WriteStatus> writeStatusJavaRDD) {

Review comment:
       this was removed, because the constructor does the same job?







[GitHub] [hudi] satishkotha edited a comment on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
satishkotha edited a comment on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-840833995


   > Can we add another config
   
   Yes, that will actually be provided as a separate strategy.
   
   > Will sorting be supported in HoodieCreateFixedHandle?
   
   This part will be provided in the execution strategy. Right now I have only added a test strategy, which doesn't support sorting. I'm going to work on adding a real strategy that sorts.
   
   Both of the above strategies will be sent in another PR. Let me know if that works.





[GitHub] [hudi] codecov-commenter commented on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-833971983


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2918](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a5736cc) into [master](https://codecov.io/gh/apache/hudi/commit/0284cdecce3136c28cd8599f77f9a0b174145265?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0284cde) will **increase** coverage by `7.50%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2918/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2918      +/-   ##
   ============================================
   + Coverage     54.23%   61.73%   +7.50%     
   + Complexity     3810      336    -3474     
   ============================================
     Files           488       54     -434     
     Lines         23574     2002   -21572     
     Branches       2510      237    -2273     
   ============================================
   - Hits          12786     1236   -11550     
   + Misses         9636      645    -8991     
   + Partials       1152      121    -1031     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `61.73% <ø> (-7.85%)` | `336.00 <ø> (-39.00)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `5.17% <0.00%> (-83.63%)` | `0.00% <0.00%> (-28.00%)` | |
   | [...hudi/utilities/schema/JdbcbasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9KZGJjYmFzZWRTY2hlbWFQcm92aWRlci5qYXZh) | `0.00% <0.00%> (-72.23%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...he/hudi/utilities/transform/AWSDmsTransformer.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9BV1NEbXNUcmFuc2Zvcm1lci5qYXZh) | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `40.69% <0.00%> (-23.84%)` | `27.00% <0.00%> (-6.00%)` | |
   | [.../common/table/log/block/HoodieLogBlockVersion.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVMb2dCbG9ja1ZlcnNpb24uamF2YQ==) | | | |
   | [.../apache/hudi/hive/MultiPartKeysValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvTXVsdGlQYXJ0S2V5c1ZhbHVlRXh0cmFjdG9yLmphdmE=) | | | |
   | [...org/apache/hudi/HoodieDatasetBulkInsertHelper.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllRGF0YXNldEJ1bGtJbnNlcnRIZWxwZXIuamF2YQ==) | | | |
   | [...a/org/apache/hudi/cli/commands/RepairsCommand.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1JlcGFpcnNDb21tYW5kLmphdmE=) | | | |
   | [...va/org/apache/hudi/table/format/FilePathUtils.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9mb3JtYXQvRmlsZVBhdGhVdGlscy5qYXZh) | | | |
   | ... and [429 more](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   








[GitHub] [hudi] lw309637554 edited a comment on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
lw309637554 edited a comment on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-840609544


   > > @satishkotha hello, I have some questions:
   > > 
   > > 1. I only see a test strategy being added. Will a formal strategy be added later?
   > > 2. Which index is this PR meant to support?
   > > 3. If every file group is just transformed into a file group with the same name, does that mean small files cannot be merged?
   > 
   > @lw309637554
   > 
   > 1. Yes, the actual strategy can be added easily if we agree on the high-level change.
   > 2. This is to support HBaseIndex, which does not support updating the record location.
   > 3. Yes, you are right. A merging strategy cannot be applied to tables that use HBaseIndex. We can still do local 'file-level' sorting, i.e., sorting the records in each data file by a specified column so that only one block (row group) needs to be read for queries.
   > 
   > Let me know if you have any other questions/comments.
   @satishkotha 
   The high-level change is OK. I just have two other comments:
   1.  ".withClusteringMaxBytesInGroup(10) // set small number so each file is considered as separate clustering group": can we add another config?
   2. Will sorting be supported in HoodieCreateFixedHandle?





[GitHub] [hudi] lw309637554 commented on a change in pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#discussion_r631861752



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java
##########
@@ -1114,6 +1114,16 @@ public void testClusteringWithSortColumns() throws Exception {
         .withClusteringTargetPartitions(0).withInlineClusteringNumCommits(1).build();
     testClustering(clusteringConfig);
   }
+  
+  @Test
+  public void testClusteringWithOneFilePerGroup() throws Exception {
+    HoodieClusteringConfig clusteringConfig = HoodieClusteringConfig.newBuilder().withClusteringMaxNumGroups(10)
+        .withClusteringMaxBytesInGroup(10) // set small number so each file is considered as separate clustering group
+        .withClusteringExecutionStrategyClass("org.apache.hudi.ClusteringIdentityTestExecutionStrategy")

Review comment:
       Can we add a config?
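
To see why a tiny `withClusteringMaxBytesInGroup` value ends up with one file per group, consider a greedy packing of files into groups. The actual plan-building logic is not shown in this thread, so the sketch below is an assumption about the general approach; `ClusteringGroupSketch` and `groupBySize` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical greedy packing of files (by size in bytes) into clustering groups.
class ClusteringGroupSketch {

  // Starts a new group whenever adding the next file would exceed maxBytesInGroup.
  // A file larger than the limit still gets a group of its own.
  static List<List<Long>> groupBySize(List<Long> fileSizes, long maxBytesInGroup) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0L;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentBytes + size > maxBytesInGroup) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0L;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Long> sizes = List.of(100L, 200L, 300L);
    System.out.println(groupBySize(sizes, 10L).size());
    System.out.println(groupBySize(sizes, 1000L).size());
  }
}
```

Under this model, the byte limit only produces one file per group as a side effect, which may be why an explicit config was suggested instead.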










[GitHub] [hudi] lw309637554 commented on a change in pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#discussion_r631864273



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateFixedHandle.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.Map;
+
+/**
+ * A HoodieCreateHandle which writes all data into a single file.

Review comment:
       HoodieCreateFixedHandle







[GitHub] [hudi] satishkotha commented on a change in pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#discussion_r629658906



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/ClusteringIdentityTestExecutionStrategy.java
##########
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi;
+
+import org.apache.avro.Schema;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieClusteringException;
+import org.apache.hudi.execution.SparkLazyInsertIterable;
+import org.apache.hudi.io.CreateFixedFileHandleFactory;
+import org.apache.hudi.table.HoodieSparkCopyOnWriteTable;
+import org.apache.hudi.table.HoodieSparkMergeOnReadTable;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.cluster.strategy.ClusteringExecutionStrategy;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+/**
+ * Sample clustering strategy for testing. This doesn't actually transform data, but simply rewrites the same data
+ * in a new file.
+ */
+public class ClusteringIdentityTestExecutionStrategy<T extends HoodieRecordPayload<T>>
+    extends ClusteringExecutionStrategy<T, JavaRDD<HoodieRecord<T>>, JavaRDD<HoodieKey>, JavaRDD<WriteStatus>> {
+
+  private static final Logger LOG = LogManager.getLogger(ClusteringIdentityTestExecutionStrategy.class);
+
+  public ClusteringIdentityTestExecutionStrategy(HoodieSparkCopyOnWriteTable<T> table,
+                                                 HoodieSparkEngineContext engineContext,
+                                                 HoodieWriteConfig writeConfig) {
+    super(table, engineContext, writeConfig);
+  }
+
+  public ClusteringIdentityTestExecutionStrategy(HoodieSparkMergeOnReadTable<T> table,
+                                                 HoodieSparkEngineContext engineContext,
+                                                 HoodieWriteConfig writeConfig) {
+    super(table, engineContext, writeConfig);
+  }
+
+  @Override
+  public JavaRDD<WriteStatus> performClustering(
+      final JavaRDD<HoodieRecord<T>> inputRecords,
+      final int numOutputGroups,
+      final String instantTime,
+      final Map<String, String> strategyParams,
+      final Schema schema,
+      final List<HoodieFileGroupId> inputFileIds) {
+    if (inputRecords.getNumPartitions() != 1 || inputFileIds.size() != 1) {

Review comment:
       Yes, this is enforced by setting the group size limit to a small number. See the unit test added: `.withClusteringMaxBytesInGroup(10) // set small number so each file is considered as separate clustering group`
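
To illustrate why a very small group size limit leaves each file in its own clustering group, here is a minimal sketch. It is not Hudi's planner: `ClusteringGroupSketch`, `planGroups`, and the greedy size-based packing rule are assumptions made for illustration, and plain file sizes stand in for real file slices. Only `withClusteringMaxBytesInGroup(10)` comes from the unit test mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of size-based group planning; not Hudi code.
class ClusteringGroupSketch {

  // Greedily packs file sizes into groups. The current group is closed as soon
  // as adding the next file would push its total past maxBytesInGroup.
  // A file larger than the limit still gets a group of its own.
  static List<List<Long>> planGroups(List<Long> fileSizes, long maxBytesInGroup) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0L;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentBytes + size > maxBytesInGroup) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0L;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Long> sizes = Arrays.asList(1000L, 2000L, 1500L);
    // Tiny limit, as in the unit test: one group per file.
    System.out.println(planGroups(sizes, 10L));
    // Larger limit: the first two files share a group.
    System.out.println(planGroups(sizes, 4000L));
  }
}
```

With a limit of 10 bytes every file exceeds the limit on its own, so the sketch emits one group per file, which is the condition the `inputFileIds.size() != 1` check relies on. With a 4000-byte limit the first two files would land in the same group and that check would fail.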







[GitHub] [hudi] satishkotha commented on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
satishkotha commented on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-837283028


   > @satishkotha Hello, I have a few questions:
   > 
   > 1. I only see a test strategy added. Will a formal strategy be added later?
   > 2. Which index is this PR meant to support?
   > 3. If every file group is just transformed into a file group with the same name, does that mean small files cannot be merged?
   
   @lw309637554 
   1. Yes, the actual strategy can be added easily once we agree on the high-level change.
   2. This is to support HBaseIndex, which does not support updating record location.
   3. Yes, you are right. A merging strategy cannot be applied to tables that use HBaseIndex. We can still do local 'file-level' sorting, i.e., sort the records in each data file by a specified column so that only one block (row group) needs to be read for queries.
   
   Let me know if you have any other questions/comments.
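
The local 'file-level' sorting described in point 3 can be sketched as follows. This is an illustrative model only: `LocalSortSketch`, `sortWithinFiles`, and the use of plain integers as the sort column are hypothetical and not part of the PR. What it shows is that each file's records are reordered independently, so every record stays under the same file ID and an external index such as HBaseIndex would need no location update.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-file sorting; not Hudi code.
class LocalSortSketch {

  // Returns a new map with exactly the same file IDs as the input.
  // Each file's values are sorted on their own; no record moves between files.
  static Map<String, List<Integer>> sortWithinFiles(Map<String, List<Integer>> recordsByFileId) {
    Map<String, List<Integer>> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<Integer>> entry : recordsByFileId.entrySet()) {
      List<Integer> sorted = new ArrayList<>(entry.getValue());
      Collections.sort(sorted);
      result.put(entry.getKey(), sorted);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<Integer>> input = new LinkedHashMap<>();
    input.put("file-1", Arrays.asList(9, 1, 5));
    input.put("file-2", Arrays.asList(7, 3, 2));
    System.out.println(sortWithinFiles(input));
  }
}
```

Note that the output is sorted only within each file, not across files: a global sort would require moving records between file groups, which is what an index without location updates rules out.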





[GitHub] [hudi] satishkotha closed pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
satishkotha closed pull request #2918:
URL: https://github.com/apache/hudi/pull/2918


   





[GitHub] [hudi] codecov-commenter edited a comment on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-833971983


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2918](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (32dbe35) into [master](https://codecov.io/gh/apache/hudi/commit/0284cdecce3136c28cd8599f77f9a0b174145265?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0284cde) will **decrease** coverage by `44.89%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2918/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2918       +/-   ##
   ============================================
   - Coverage     54.23%   9.34%   -44.90%     
   + Complexity     3810      48     -3762     
   ============================================
     Files           488      54      -434     
     Lines         23574    2002    -21572     
     Branches       2510     237     -2273     
   ============================================
   - Hits          12786     187    -12599     
   + Misses         9636    1802     -7834     
   + Partials       1152      13     -1139     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.34% <ø> (-60.24%)` | `48.00 <ø> (-327.00)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | ... and [465 more](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   





[GitHub] [hudi] lw309637554 commented on a change in pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#discussion_r628763575



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/ClusteringIdentityTestExecutionStrategy.java
##########
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi;
+
+import org.apache.avro.Schema;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieClusteringException;
+import org.apache.hudi.execution.SparkLazyInsertIterable;
+import org.apache.hudi.io.CreateFixedFileHandleFactory;
+import org.apache.hudi.table.HoodieSparkCopyOnWriteTable;
+import org.apache.hudi.table.HoodieSparkMergeOnReadTable;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.cluster.strategy.ClusteringExecutionStrategy;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+/**
+ * Sample clustering strategy for testing. This doesn't actually transform data, but simply rewrites the same data
+ * in a new file.
+ */
+public class ClusteringIdentityTestExecutionStrategy<T extends HoodieRecordPayload<T>>
+    extends ClusteringExecutionStrategy<T, JavaRDD<HoodieRecord<T>>, JavaRDD<HoodieKey>, JavaRDD<WriteStatus>> {
+
+  private static final Logger LOG = LogManager.getLogger(ClusteringIdentityTestExecutionStrategy.class);
+
+  public ClusteringIdentityTestExecutionStrategy(HoodieSparkCopyOnWriteTable<T> table,
+                                                 HoodieSparkEngineContext engineContext,
+                                                 HoodieWriteConfig writeConfig) {
+    super(table, engineContext, writeConfig);
+  }
+
+  public ClusteringIdentityTestExecutionStrategy(HoodieSparkMergeOnReadTable<T> table,
+                                                 HoodieSparkEngineContext engineContext,
+                                                 HoodieWriteConfig writeConfig) {
+    super(table, engineContext, writeConfig);
+  }
+
+  @Override
+  public JavaRDD<WriteStatus> performClustering(
+      final JavaRDD<HoodieRecord<T>> inputRecords,
+      final int numOutputGroups,
+      final String instantTime,
+      final Map<String, String> strategyParams,
+      final Schema schema,
+      final List<HoodieFileGroupId> inputFileIds) {
+    if (inputRecords.getNumPartitions() != 1 || inputFileIds.size() != 1) {

Review comment:
       If there must be exactly one file ID, should each clustering group contain just one file group? I do not see that limit enforced in clustering scheduling.







[GitHub] [hudi] lw309637554 commented on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-835403599


   @satishkotha Hello, I have a few questions:
   1. I only see a test strategy added. Will a formal strategy be added later?
   2. Which index is this PR meant to support?
   3. If every file group is just transformed into a file group with the same name, does that mean small files cannot be merged?





[GitHub] [hudi] codecov-commenter edited a comment on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-833971983


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2918](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (32dbe35) into [master](https://codecov.io/gh/apache/hudi/commit/0284cdecce3136c28cd8599f77f9a0b174145265?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0284cde) will **increase** coverage by `15.29%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2918/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2918       +/-   ##
   =============================================
   + Coverage     54.23%   69.53%   +15.29%     
   + Complexity     3810      374     -3436     
   =============================================
     Files           488       54      -434     
     Lines         23574     2002    -21572     
     Branches       2510      237     -2273     
   =============================================
   - Hits          12786     1392    -11394     
   + Misses         9636      478     -9158     
   + Partials       1152      132     -1020     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.53% <ø> (-0.05%)` | `374.00 <ø> (-1.00)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2918?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.08% <0.00%> (-0.35%)` | `55.00% <0.00%> (-1.00%)` | |
   | [...n/java/org/apache/hudi/common/HoodieCleanStat.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL0hvb2RpZUNsZWFuU3RhdC5qYXZh) | | | |
   | [...n/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZU1lcmdlT25SZWFkUkRELnNjYWxh) | | | |
   | [...hadoop/realtime/RealtimeCompactedRecordReader.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lQ29tcGFjdGVkUmVjb3JkUmVhZGVyLmphdmE=) | | | |
   | [...che/hudi/exception/InvalidHoodiePathException.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0ludmFsaWRIb29kaWVQYXRoRXhjZXB0aW9uLmphdmE=) | | | |
   | [.../hudi/async/SparkStreamingAsyncCompactService.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXN5bmMvU3BhcmtTdHJlYW1pbmdBc3luY0NvbXBhY3RTZXJ2aWNlLmphdmE=) | | | |
   | [...org/apache/hudi/common/model/TableServiceType.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1RhYmxlU2VydmljZVR5cGUuamF2YQ==) | | | |
   | [.../common/util/queue/FunctionBasedQueueProducer.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvcXVldWUvRnVuY3Rpb25CYXNlZFF1ZXVlUHJvZHVjZXIuamF2YQ==) | | | |
   | [...ava/org/apache/hudi/cli/commands/UtilsCommand.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1V0aWxzQ29tbWFuZC5qYXZh) | | | |
   | [.../hudi/table/format/cow/ParquetSplitReaderUtil.java](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9mb3JtYXQvY293L1BhcnF1ZXRTcGxpdFJlYWRlclV0aWwuamF2YQ==) | | | |
   | ... and [425 more](https://codecov.io/gh/apache/hudi/pull/2918/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   





[GitHub] [hudi] satishkotha commented on a change in pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#discussion_r632102497



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateFixedHandle.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.Map;
+
+/**
+ * A HoodieCreateHandle which writes all data into a single file.

Review comment:
       Fixed







[GitHub] [hudi] satishkotha commented on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
satishkotha commented on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-949276670


   I think all changes in this PR have already been merged as part of #3419. Closing this.





[GitHub] [hudi] vinothchandar commented on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-847474915


   > We cannot change record location for clustering (because external index doesn't support update). We can still take advantage of clustering by doing 'local' sorting within each file. 
   
   This can be achieved by sorting during original write time, correct?
   
   





[GitHub] [hudi] lw309637554 commented on pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#issuecomment-841668156


   @satishkotha LGTM





[GitHub] [hudi] lw309637554 commented on a change in pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#discussion_r631861069



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/ClusteringIdentityTestExecutionStrategy.java
##########
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi;
+
+import org.apache.avro.Schema;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieClusteringException;
+import org.apache.hudi.execution.SparkLazyInsertIterable;
+import org.apache.hudi.io.CreateFixedFileHandleFactory;
+import org.apache.hudi.table.HoodieSparkCopyOnWriteTable;
+import org.apache.hudi.table.HoodieSparkMergeOnReadTable;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.cluster.strategy.ClusteringExecutionStrategy;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+/**
+ * Sample clustering strategy for testing. This doesn't actually transform data; it simply rewrites the same data
+ * in a new file.
+ */
+public class ClusteringIdentityTestExecutionStrategy<T extends HoodieRecordPayload<T>>
+    extends ClusteringExecutionStrategy<T, JavaRDD<HoodieRecord<T>>, JavaRDD<HoodieKey>, JavaRDD<WriteStatus>> {
+
+  private static final Logger LOG = LogManager.getLogger(ClusteringIdentityTestExecutionStrategy.class);
+
+  public ClusteringIdentityTestExecutionStrategy(HoodieSparkCopyOnWriteTable<T> table,
+                                                 HoodieSparkEngineContext engineContext,
+                                                 HoodieWriteConfig writeConfig) {
+    super(table, engineContext, writeConfig);
+  }
+
+  public ClusteringIdentityTestExecutionStrategy(HoodieSparkMergeOnReadTable<T> table,
+                                                 HoodieSparkEngineContext engineContext,
+                                                 HoodieWriteConfig writeConfig) {
+    super(table, engineContext, writeConfig);
+  }
+
+  @Override
+  public JavaRDD<WriteStatus> performClustering(
+      final JavaRDD<HoodieRecord<T>> inputRecords,
+      final int numOutputGroups,
+      final String instantTime,
+      final Map<String, String> strategyParams,
+      final Schema schema,
+      final List<HoodieFileGroupId> inputFileIds) {
+    if (inputRecords.getNumPartitions() != 1 || inputFileIds.size() != 1) {

Review comment:
        Can we support a separate config, such as filegroupLocalSort? Reusing withClusteringMaxBytesInGroup and setting it this small may confuse users.
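
For readers following this thread, here is a minimal sketch of why a very small max-bytes-per-group value can end up placing every file in its own clustering group. It is a simplified model, not Hudi's actual plan strategy: the class `SizeBasedGrouping`, the method `group`, and the file sizes are hypothetical, and the greedy packing rule is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of size-based grouping (not Hudi code).
public class SizeBasedGrouping {

  // Packs files greedily: a new group starts whenever adding the next file
  // would push the current group past maxBytesInGroup.
  static List<List<Long>> group(List<Long> fileSizes, long maxBytesInGroup) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentBytes + size > maxBytesInGroup) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      // A file larger than the limit still gets a group of its own.
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Long> sizes = Arrays.asList(1_000L, 2_000L, 500L); // made-up file sizes in bytes
    System.out.println(group(sizes, 10L).size());     // tiny limit: one group per file
    System.out.println(group(sizes, 10_000L).size()); // large limit: a single group
  }
}
```

Under this model the one-file-per-group outcome is a side effect of the size limit rather than something the configuration states directly, which is the readability concern raised above.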







[GitHub] [hudi] satishkotha commented on a change in pull request #2918: [HUDI-1877] Add support in clustering to not change record location

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2918:
URL: https://github.com/apache/hudi/pull/2918#discussion_r632102292



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java
##########
@@ -1114,6 +1114,16 @@ public void testClusteringWithSortColumns() throws Exception {
         .withClusteringTargetPartitions(0).withInlineClusteringNumCommits(1).build();
     testClustering(clusteringConfig);
   }
+  
+  @Test
+  public void testClusteringWithOneFilePerGroup() throws Exception {
+    HoodieClusteringConfig clusteringConfig = HoodieClusteringConfig.newBuilder().withClusteringMaxNumGroups(10)
+        .withClusteringMaxBytesInGroup(10) // set small number so each file is considered as separate clustering group
+        .withClusteringExecutionStrategyClass("org.apache.hudi.ClusteringIdentityTestExecutionStrategy")

Review comment:
       This is just a unit test. I will provide a separate strategy for scheduling clustering in another PR to limit the number of files per group.
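
As a rough illustration of what limiting the number of files per group could look like, the sketch below chunks a list of file IDs by count instead of by size. It is not the strategy promised above and not Hudi code: the class `FileCountGrouping`, the parameter `maxFilesPerGroup`, and the file IDs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of count-based grouping (not Hudi code).
public class FileCountGrouping {

  // Splits fileIds into consecutive groups holding at most maxFilesPerGroup entries.
  static List<List<String>> group(List<String> fileIds, int maxFilesPerGroup) {
    if (maxFilesPerGroup < 1) {
      throw new IllegalArgumentException("maxFilesPerGroup must be at least 1");
    }
    List<List<String>> groups = new ArrayList<>();
    for (int i = 0; i < fileIds.size(); i += maxFilesPerGroup) {
      int end = Math.min(i + maxFilesPerGroup, fileIds.size());
      groups.add(new ArrayList<>(fileIds.subList(i, end)));
    }
    return groups;
  }

  public static void main(String[] args) {
    List<String> fileIds = Arrays.asList("file-a", "file-b", "file-c"); // made-up IDs
    System.out.println(group(fileIds, 1)); // each file in its own group
    System.out.println(group(fileIds, 2)); // two files, then the remaining one
  }
}
```

With a limit of 1, every group holds exactly one file, which matches the single-file precondition checked in `performClustering` of the test strategy, without relying on an artificially small byte limit.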



