Posted to commits@hudi.apache.org by "wuwenchi (via GitHub)" <gi...@apache.org> on 2023/02/03 01:47:56 UTC

[GitHub] [hudi] wuwenchi opened a new pull request, #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

wuwenchi opened a new pull request, #7834:
URL: https://github.com/apache/hudi/pull/7834

   ### Change Logs
   
   Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low, medium or high below)
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096865034


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/HoodieSimpleBucketIndex.java:
##########
@@ -44,7 +44,7 @@ public HoodieSimpleBucketIndex(HoodieWriteConfig config) {
     super(config);
   }
 
-  private Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(
+  public Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(

Review Comment:
   HoodieSimpleBucketIndex and RDDSimpleBucketPartitioner are in different packages, so this method has to be public for the partitioner to call it.





[GitHub] [hudi] bvaradar commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1149970489


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -38,9 +38,12 @@ public static BulkInsertPartitioner get(HoodieTable table,
   public static BulkInsertPartitioner get(HoodieTable table,
                                           HoodieWriteConfig config,
                                           boolean enforceNumOutputPartitions) {
-    if (config.getIndexType().equals(HoodieIndex.IndexType.BUCKET)
-        && config.getBucketIndexEngineType().equals(HoodieIndex.BucketIndexEngineType.CONSISTENT_HASHING)) {
-      return new RDDConsistentBucketPartitioner(table);
+    if (config.getIndexType().equals(HoodieIndex.IndexType.BUCKET)) {
+      if (config.getBucketIndexEngineType().equals(HoodieIndex.BucketIndexEngineType.CONSISTENT_HASHING)) {
+        return new RDDConsistentBucketPartitioner(table);
+      } else if (config.getBucketIndexEngineType().equals(HoodieIndex.BucketIndexEngineType.SIMPLE)) {
+        return new RDDSimpleBucketPartitioner(table);
+      }

Review Comment:
   In the else case, should this throw an exception?
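
A minimal sketch of what that could look like, using self-contained stand-ins rather than the real Hudi classes. The enum `EngineType` (including its `OTHER` value), the class `PartitionerFactorySketch`, and the returned strings are hypothetical placeholders. The exception type is also an assumption; the codebase may prefer its own exception class.

```java
class PartitionerFactorySketch {
  // Hypothetical stand-in for the bucket index engine type enum.
  enum EngineType { SIMPLE, CONSISTENT_HASHING, OTHER }

  // Returns the name of the partitioner that would be built for a BUCKET index.
  // An unrecognized engine type fails fast instead of silently falling through.
  static String partitionerFor(EngineType engineType) {
    switch (engineType) {
      case CONSISTENT_HASHING:
        return "RDDConsistentBucketPartitioner";
      case SIMPLE:
        return "RDDSimpleBucketPartitioner";
      default:
        throw new IllegalArgumentException("Unknown bucket index engine type: " + engineType);
    }
  }

  public static void main(String[] args) {
    System.out.println(partitionerFor(EngineType.SIMPLE));
  }
}
```

Failing fast here would surface a misconfigured engine type at partitioner creation rather than later in the write path.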



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/HoodieSimpleBucketIndex.java:
##########
@@ -44,7 +44,7 @@ public HoodieSimpleBucketIndex(HoodieWriteConfig config) {
     super(config);
   }
 
-  private Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(
+  public Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(

Review Comment:
   Minor: Rename method to loadBucketIdToFileIdMappingForPartition



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BucketIndexPartitioner.java:
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.execution.bulkinsert.BulkInsertSortMode;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+/**
+ * Abstract of bucket index bulk_insert partitioner
+ */
+public abstract class BucketIndexPartitioner<T> implements BulkInsertPartitioner<T> {

Review Comment:
   Can you rename this to BucketIndexBulkInsertPartitioner? Please also rename the derived classes accordingly.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {
+
+  public RDDSimpleBucketPartitioner(HoodieTable table) {
+    super(table, null, false);
+    ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSimpleBucketIndex);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputPartitions) {
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    HashMap<String, Integer> fileIdToIdx = new HashMap<>();

Review Comment:
   minor: rename to fileIdToBucketIndex



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {

Review Comment:
   Similarly, rename this class to RDDSimpleBucketBulkInsertPartitioner.



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/execution/bulkinsert/TestRDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.data.HoodieJavaRDD;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.HoodieSparkTable;
+import org.apache.hudi.testutils.HoodieClientTestHarness;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+public class TestRDDSimpleBucketPartitioner extends HoodieClientTestHarness {
+
+  @BeforeEach
+  public void setUp() throws Exception {
+    initPath();
+    initSparkContexts("TestRDDSimpleBucketPartitioner");
+    initFileSystem();
+    initTimelineService();
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+    cleanupResources();
+  }
+
+  @ParameterizedTest
+  @MethodSource("configParams")
+  public void testSimpleBucketPartitioner(HoodieTableType type, boolean partitionSort) throws IOException {
+    HoodieTestUtils.init(HoodieTestUtils.getDefaultHadoopConf(), basePath, type);
+    int bucketNum = 2;
+    HoodieWriteConfig config = HoodieWriteConfig
+        .newBuilder()
+        .withPath(basePath)
+        .withSchema(TRIP_EXAMPLE_SCHEMA)
+        .build();
+    config.setValue(HoodieIndexConfig.INDEX_TYPE, HoodieIndex.IndexType.BUCKET.name());
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE, HoodieIndex.BucketIndexEngineType.SIMPLE.name());
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD, "_row_key");
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS, "" + bucketNum);
+    if (partitionSort) {
+      config.setValue(HoodieWriteConfig.BULK_INSERT_SORT_MODE, BulkInsertSortMode.PARTITION_SORT.name());
+    }
+
+    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+    List<HoodieRecord> records = dataGenerator.generateInserts("0", 100);
+    HoodieJavaRDD<HoodieRecord> javaRDD = HoodieJavaRDD.of(records, context, 1);
+    javaRDD.map(HoodieRecord::getPartitionPath).count();

Review Comment:
   This line can be removed.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {
+
+  public RDDSimpleBucketPartitioner(HoodieTable table) {
+    super(table, null, false);
+    ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSimpleBucketIndex);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputPartitions) {
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    HashMap<String, Integer> fileIdToIdx = new HashMap<>();
+
+    // Map <partition, <bucketNo, fileID>>
+    Map<String, HashMap<Integer, String>> partitionMapper = getPartitionMapper(records, fileIdToIdx);
+
+    return doPartition(records, new Partitioner() {
+      @Override
+      public int numPartitions() {
+        return index.getNumBuckets() * partitionMapper.size();
+      }
+
+      @Override
+      public int getPartition(Object key) {
+        HoodieKey hoodieKey = (HoodieKey) key;
+        String partitionPath = hoodieKey.getPartitionPath();
+        int bucketID = index.getBucketID(hoodieKey);
+        String fileID = partitionMapper.get(partitionPath).get(bucketID);
+        return fileIdToIdx.get(fileID);
+      }
+    });
+  }
+
+  Map<String, HashMap<Integer, String>> getPartitionMapper(JavaRDD<HoodieRecord<T>> records,
+                                                           HashMap<String, Integer> fileIdToIdx) {

Review Comment:
   Please use the interface type (e.g. Map<String, Integer>) instead of the concrete class wherever possible, in both variable and parameter declarations. The rest of the codebase generally follows that convention.
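
As an illustration of the convention, here is a small self-contained sketch. The class `InterfaceTypeSketch` and the method `invert` are hypothetical and not part of this PR; only the declaration style is the point.

```java
import java.util.HashMap;
import java.util.Map;

class InterfaceTypeSketch {
  // Parameter and return types use the Map interface, so callers may pass
  // any implementation (HashMap, TreeMap, ...).
  static Map<Integer, String> invert(Map<String, Integer> fileIdToBucketIndex) {
    // The concrete class appears only at the construction site.
    Map<Integer, String> bucketIndexToFileId = new HashMap<>();
    for (Map.Entry<String, Integer> entry : fileIdToBucketIndex.entrySet()) {
      bucketIndexToFileId.put(entry.getValue(), entry.getKey());
    }
    return bucketIndexToFileId;
  }

  public static void main(String[] args) {
    Map<String, Integer> fileIdToBucketIndex = new HashMap<>();
    fileIdToBucketIndex.put("file-a", 0);
    fileIdToBucketIndex.put("file-b", 1);
    System.out.println(invert(fileIdToBucketIndex));
  }
}
```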



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/execution/bulkinsert/TestRDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.data.HoodieJavaRDD;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.HoodieSparkTable;
+import org.apache.hudi.testutils.HoodieClientTestHarness;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+public class TestRDDSimpleBucketPartitioner extends HoodieClientTestHarness {
+
+  @BeforeEach
+  public void setUp() throws Exception {
+    initPath();
+    initSparkContexts("TestRDDSimpleBucketPartitioner");
+    initFileSystem();
+    initTimelineService();
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+    cleanupResources();
+  }
+
+  @ParameterizedTest
+  @MethodSource("configParams")
+  public void testSimpleBucketPartitioner(HoodieTableType type, boolean partitionSort) throws IOException {
+    HoodieTestUtils.init(HoodieTestUtils.getDefaultHadoopConf(), basePath, type);
+    int bucketNum = 2;
+    HoodieWriteConfig config = HoodieWriteConfig
+        .newBuilder()
+        .withPath(basePath)
+        .withSchema(TRIP_EXAMPLE_SCHEMA)
+        .build();
+    config.setValue(HoodieIndexConfig.INDEX_TYPE, HoodieIndex.IndexType.BUCKET.name());
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE, HoodieIndex.BucketIndexEngineType.SIMPLE.name());
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD, "_row_key");
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS, "" + bucketNum);
+    if (partitionSort) {
+      config.setValue(HoodieWriteConfig.BULK_INSERT_SORT_MODE, BulkInsertSortMode.PARTITION_SORT.name());
+    }
+
+    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+    List<HoodieRecord> records = dataGenerator.generateInserts("0", 100);
+    HoodieJavaRDD<HoodieRecord> javaRDD = HoodieJavaRDD.of(records, context, 1);
+    javaRDD.map(HoodieRecord::getPartitionPath).count();
+
+    final HoodieSparkTable table = HoodieSparkTable.create(config, context);
+    BulkInsertPartitioner partitioner = BulkInsertInternalPartitionerFactory.get(table, config);
+    JavaRDD<HoodieRecord> repartitionRecords =
+        (JavaRDD<HoodieRecord>) partitioner.repartitionRecords(HoodieJavaRDD.getJavaRDD(javaRDD), 1);
+
+    assertEquals(bucketNum * javaRDD.map(HoodieRecord::getPartitionPath).distinct().count(),

Review Comment:
   Can you extend this test to call repartitionRecords again after a first write, once the partitions and file IDs exist? The second write should use the same records as the first batch, and the test should check that they map to the same bucket IDs as expected.
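
The property being asked for can be sketched without Spark or Hudi: if bucket assignment is a deterministic function of the record key, replaying the same batch against the file IDs created by the first write should land each record in the same bucket and file. Everything below is a hypothetical stand-in. In particular, the hash-modulo `bucketId` function and the `file-N` naming are assumptions for illustration, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BucketStabilitySketch {
  // Assumed bucket function: non-negative hash of the key modulo the bucket count.
  static int bucketId(String recordKey, int numBuckets) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // Assigns each key to a file ID, creating a file ID the first time a bucket
  // is seen and reusing the existing one afterwards.
  static List<String> assign(List<String> keys, int numBuckets, Map<Integer, String> bucketIdToFileId) {
    List<String> fileIds = new ArrayList<>();
    for (String key : keys) {
      int bucket = bucketId(key, numBuckets);
      fileIds.add(bucketIdToFileId.computeIfAbsent(bucket, b -> "file-" + b));
    }
    return fileIds;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("k1", "k2", "k3", "k4");
    Map<Integer, String> bucketIdToFileId = new HashMap<>();
    // First write creates the file IDs; second write must reuse them.
    List<String> firstWrite = assign(keys, 2, bucketIdToFileId);
    List<String> secondWrite = assign(keys, 2, bucketIdToFileId);
    System.out.println(firstWrite.equals(secondWrite));
  }
}
```

In the real test, the second pass would go through the partitioner after the first batch has been committed, so that the bucket-to-file mapping is loaded from the table rather than from an in-memory map.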



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {
+
+  public RDDSimpleBucketPartitioner(HoodieTable table) {
+    super(table, null, false);
+    ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSimpleBucketIndex);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputPartitions) {
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    HashMap<String, Integer> fileIdToIdx = new HashMap<>();
+
+    // Map <partition, <bucketNo, fileID>>
+    Map<String, HashMap<Integer, String>> partitionMapper = getPartitionMapper(records, fileIdToIdx);
+
+    return doPartition(records, new Partitioner() {
+      @Override
+      public int numPartitions() {
+        return index.getNumBuckets() * partitionMapper.size();
+      }
+
+      @Override
+      public int getPartition(Object key) {
+        HoodieKey hoodieKey = (HoodieKey) key;
+        String partitionPath = hoodieKey.getPartitionPath();
+        int bucketID = index.getBucketID(hoodieKey);
+        String fileID = partitionMapper.get(partitionPath).get(bucketID);
+        return fileIdToIdx.get(fileID);
+      }
+    });
+  }
+
+  Map<String, HashMap<Integer, String>> getPartitionMapper(JavaRDD<HoodieRecord<T>> records,
+                                                           HashMap<String, Integer> fileIdToIdx) {

Review Comment:
   fileIdToIdx -> fileIdToBucketIndex



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {
+
+  public RDDSimpleBucketPartitioner(HoodieTable table) {
+    super(table, null, false);
+    ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSimpleBucketIndex);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputPartitions) {
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    HashMap<String, Integer> fileIdToIdx = new HashMap<>();
+
+    // Map <partition, <bucketNo, fileID>>
+    Map<String, HashMap<Integer, String>> partitionMapper = getPartitionMapper(records, fileIdToIdx);
+
+    return doPartition(records, new Partitioner() {
+      @Override
+      public int numPartitions() {
+        return index.getNumBuckets() * partitionMapper.size();
+      }
+
+      @Override
+      public int getPartition(Object key) {
+        HoodieKey hoodieKey = (HoodieKey) key;
+        String partitionPath = hoodieKey.getPartitionPath();
+        int bucketID = index.getBucketID(hoodieKey);
+        String fileID = partitionMapper.get(partitionPath).get(bucketID);
+        return fileIdToIdx.get(fileID);
+      }
+    });
+  }
+
+  Map<String, HashMap<Integer, String>> getPartitionMapper(JavaRDD<HoodieRecord<T>> records,
+                                                           HashMap<String, Integer> fileIdToIdx) {
+
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    int numBuckets = index.getNumBuckets();
+    return records
+        .map(HoodieRecord::getPartitionPath)
+        .distinct().collect().stream()
+        .collect(Collectors.toMap(p -> p, p -> {
+          Map<Integer, HoodieRecordLocation> locationMap = index.loadPartitionBucketIdFileIdMapping(table, p);
+          HashMap<Integer, String> fileIdMap = new HashMap<>();

Review Comment:
   rename to bucketIdToFileIdMap





[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1157482940


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BucketIndexPartitioner.java:
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.execution.bulkinsert.BulkInsertSortMode;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+/**
+ * Abstract of bucket index bulk_insert partitioner
+ */
+public abstract class BucketIndexPartitioner<T> implements BulkInsertPartitioner<T> {

Review Comment:
   done



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {

Review Comment:
   done



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {
+
+  public RDDSimpleBucketPartitioner(HoodieTable table) {
+    super(table, null, false);
+    ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSimpleBucketIndex);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputPartitions) {
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    HashMap<String, Integer> fileIdToIdx = new HashMap<>();

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1465875986

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 62bb4d2a2a67dabf6452d9af1df17d0fd964d070 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14918) 
   * f928ae1c51cdf405ab1b039b2654709d5de40b38 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1414810171

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * c598b3d64f3dbb5e0640a5adad05afbf13400afd Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14886) 
   * 7857a3839c59105bbb2d6eb8a210b7658dae02bd UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] YuweiXiao commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "YuweiXiao (via GitHub)" <gi...@apache.org>.
YuweiXiao commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096918547


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/HoodieSimpleBucketIndex.java:
##########
@@ -44,7 +44,7 @@ public HoodieSimpleBucketIndex(HoodieWriteConfig config) {
     super(config);
   }
 
-  private Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(
+  public Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(

Review Comment:
   Could it be protected?





[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1496536930

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16123) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] bvaradar merged pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar merged PR #7834:
URL: https://github.com/apache/hudi/pull/7834




[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096859917


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDConsistentBucketPartitioner.java:
##########
@@ -18,82 +18,45 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
-import org.apache.hudi.avro.HoodieAvroUtils;
-import org.apache.hudi.common.config.SerializableSchema;
-import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.ConsistentHashingNode;
 import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.common.model.HoodieTableType;
-import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ValidationUtils;
-import org.apache.hudi.common.util.collection.FlatLists;
-import org.apache.hudi.common.util.collection.FlatLists.ComparableList;
 import org.apache.hudi.index.bucket.ConsistentBucketIdentifier;
 import org.apache.hudi.index.bucket.HoodieSparkConsistentBucketIndex;
-import org.apache.hudi.io.AppendHandleFactory;
-import org.apache.hudi.io.SingleFileHandleCreateFactory;
-import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.HoodieTable;
-
-import org.apache.avro.Schema;
-import org.apache.log4j.LogManager;
-import org.apache.log4j.Logger;
 import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
 
-import java.io.Serializable;
-import java.util.ArrayList;
-import java.util.Arrays;
 import java.util.Collections;
-import java.util.Comparator;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
 
-import scala.Tuple2;
-
 import static org.apache.hudi.config.HoodieClusteringConfig.PLAN_STRATEGY_SORT_COLUMNS;
 
 /**
  * A partitioner for (consistent hashing) bucket index used in bulk_insert
  */
 public class RDDConsistentBucketPartitioner<T> extends RDDBucketIndexPartitioner<T> {
 
-  private static final Logger LOG = LogManager.getLogger(RDDConsistentBucketPartitioner.class);
-
-  private final HoodieTable table;
-  private final List<String> indexKeyFields;
   private final Map<String, List<ConsistentHashingNode>> hashingChildrenNodes;
-  private final String[] sortColumnNames;
-  private final boolean preserveHoodieMetadata;
-  private final boolean consistentLogicalTimestampEnabled;
 
-  private List<Boolean> doAppend;
-  private List<String> fileIdPfxList;
-
-  public RDDConsistentBucketPartitioner(HoodieTable table, Map<String, String> strategyParams, boolean preserveHoodieMetadata) {
-    this.table = table;
-    this.indexKeyFields = Arrays.asList(table.getConfig().getBucketIndexHashField().split(","));
+  public RDDConsistentBucketPartitioner(HoodieTable table,
+                                        Map<String, String> strategyParams,
+                                        boolean preserveHoodieMetadata) {
+    super(table,
+        strategyParams.getOrDefault(PLAN_STRATEGY_SORT_COLUMNS.key(), null),
+        preserveHoodieMetadata);
     this.hashingChildrenNodes = new HashMap<>();
-    this.consistentLogicalTimestampEnabled = table.getConfig().isConsistentLogicalTimestampEnabled();
-    this.preserveHoodieMetadata = preserveHoodieMetadata;
-
-    if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) {
-      sortColumnNames = strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(",");
-    } else {
-      sortColumnNames = null;
-    }
   }
 
   public RDDConsistentBucketPartitioner(HoodieTable table) {
     this(table, Collections.emptyMap(), false);
     ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSparkConsistentBucketIndex,
         "RDDConsistentBucketPartitioner can only be used together with consistent hashing bucket index");
-    ValidationUtils.checkArgument(table.getMetaClient().getTableType().equals(HoodieTableType.MERGE_ON_READ),

Review Comment:
   This validation was moved to RDDBucketIndexPartitioner.





[GitHub] [hudi] YuweiXiao commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "YuweiXiao (via GitHub)" <gi...@apache.org>.
YuweiXiao commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096918955


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDConsistentBucketPartitioner.java:
##########
@@ -18,82 +18,45 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
-import org.apache.hudi.avro.HoodieAvroUtils;
-import org.apache.hudi.common.config.SerializableSchema;
-import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.ConsistentHashingNode;
 import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.common.model.HoodieTableType;
-import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ValidationUtils;
-import org.apache.hudi.common.util.collection.FlatLists;
-import org.apache.hudi.common.util.collection.FlatLists.ComparableList;
 import org.apache.hudi.index.bucket.ConsistentBucketIdentifier;
 import org.apache.hudi.index.bucket.HoodieSparkConsistentBucketIndex;
-import org.apache.hudi.io.AppendHandleFactory;
-import org.apache.hudi.io.SingleFileHandleCreateFactory;
-import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.HoodieTable;
-
-import org.apache.avro.Schema;
-import org.apache.log4j.LogManager;
-import org.apache.log4j.Logger;
 import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
 
-import java.io.Serializable;
-import java.util.ArrayList;
-import java.util.Arrays;
 import java.util.Collections;
-import java.util.Comparator;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
 
-import scala.Tuple2;
-
 import static org.apache.hudi.config.HoodieClusteringConfig.PLAN_STRATEGY_SORT_COLUMNS;
 
 /**
  * A partitioner for (consistent hashing) bucket index used in bulk_insert
  */
 public class RDDConsistentBucketPartitioner<T> extends RDDBucketIndexPartitioner<T> {
 
-  private static final Logger LOG = LogManager.getLogger(RDDConsistentBucketPartitioner.class);
-
-  private final HoodieTable table;
-  private final List<String> indexKeyFields;
   private final Map<String, List<ConsistentHashingNode>> hashingChildrenNodes;
-  private final String[] sortColumnNames;
-  private final boolean preserveHoodieMetadata;
-  private final boolean consistentLogicalTimestampEnabled;
 
-  private List<Boolean> doAppend;
-  private List<String> fileIdPfxList;
-
-  public RDDConsistentBucketPartitioner(HoodieTable table, Map<String, String> strategyParams, boolean preserveHoodieMetadata) {
-    this.table = table;
-    this.indexKeyFields = Arrays.asList(table.getConfig().getBucketIndexHashField().split(","));
+  public RDDConsistentBucketPartitioner(HoodieTable table,
+                                        Map<String, String> strategyParams,
+                                        boolean preserveHoodieMetadata) {
+    super(table,
+        strategyParams.getOrDefault(PLAN_STRATEGY_SORT_COLUMNS.key(), null),
+        preserveHoodieMetadata);
     this.hashingChildrenNodes = new HashMap<>();
-    this.consistentLogicalTimestampEnabled = table.getConfig().isConsistentLogicalTimestampEnabled();
-    this.preserveHoodieMetadata = preserveHoodieMetadata;
-
-    if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) {
-      sortColumnNames = strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(",");
-    } else {
-      sortColumnNames = null;
-    }
   }
 
   public RDDConsistentBucketPartitioner(HoodieTable table) {
     this(table, Collections.emptyMap(), false);
     ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSparkConsistentBucketIndex,
         "RDDConsistentBucketPartitioner can only be used together with consistent hashing bucket index");
-    ValidationUtils.checkArgument(table.getMetaClient().getTableType().equals(HoodieTableType.MERGE_ON_READ),

Review Comment:
   I'm not sure whether the SIMPLE bucket index has this constraint as well.
   
   The reason we restrict CONSISTENT HASHING to MOR tables only is that some mechanisms (i.e., concurrency control) rely on the log feature of MOR tables.
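
One way to picture the concern is the following sketch, which keeps the MOR check in the consistent-hashing partitioner only. It assumes, purely for illustration, that the SIMPLE bucket index does not need the check (that is exactly the open question here). The class names `TableTypeCheckSketch`, `BucketPartitionerBase`, `ConsistentBucketPartitioner`, and `SimpleBucketPartitioner` are hypothetical stand-ins, not the classes in this PR.

```java
// Hypothetical sketch: where a table-type validation could live.
class TableTypeCheckSketch {

  // Simplified stand-in for HoodieTableType.
  enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

  // Shared base: holds common state but performs no table-type validation.
  abstract static class BucketPartitionerBase {
    protected final TableType tableType;

    BucketPartitionerBase(TableType tableType) {
      this.tableType = tableType;
    }
  }

  // Consistent hashing relies on MOR log files, so it validates the table type itself.
  static class ConsistentBucketPartitioner extends BucketPartitionerBase {
    ConsistentBucketPartitioner(TableType tableType) {
      super(tableType);
      if (tableType != TableType.MERGE_ON_READ) {
        throw new IllegalArgumentException(
            "Consistent hashing bucket index only supports MERGE_ON_READ tables");
      }
    }
  }

  // Assumption for illustration: the simple bucket partitioner accepts both table types.
  static class SimpleBucketPartitioner extends BucketPartitionerBase {
    SimpleBucketPartitioner(TableType tableType) {
      super(tableType);
    }
  }
}
```

If SIMPLE turns out to need the same constraint, the check would belong in the shared base class instead.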





[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096474138


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDBucketIndexPartitioner.java:
##########
@@ -18,15 +18,155 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.SerializableSchema;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.FlatLists;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.hudi.table.HoodieTable;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+
 
 /**
  * Abstract of bucket index bulk_insert partitioner
  * TODO implement partitioner for SIMPLE BUCKET INDEX

Review Comment:
   Done





[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1414660920

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1157482664


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -38,9 +38,12 @@ public static BulkInsertPartitioner get(HoodieTable table,
   public static BulkInsertPartitioner get(HoodieTable table,
                                           HoodieWriteConfig config,
                                           boolean enforceNumOutputPartitions) {
-    if (config.getIndexType().equals(HoodieIndex.IndexType.BUCKET)
-        && config.getBucketIndexEngineType().equals(HoodieIndex.BucketIndexEngineType.CONSISTENT_HASHING)) {
-      return new RDDConsistentBucketPartitioner(table);
+    if (config.getIndexType().equals(HoodieIndex.IndexType.BUCKET)) {
+      if (config.getBucketIndexEngineType().equals(HoodieIndex.BucketIndexEngineType.CONSISTENT_HASHING)) {
+        return new RDDConsistentBucketPartitioner(table);
+      } else if (config.getBucketIndexEngineType().equals(HoodieIndex.BucketIndexEngineType.SIMPLE)) {
+        return new RDDSimpleBucketPartitioner(table);
+      }

Review Comment:
   `getBucketIndexEngineType` obtains its value from `enum BucketIndexEngineType`. If the user sets an unsupported type, that function throws an exception directly, so there is probably no need to throw another exception here.
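
A small standalone sketch of this argument, assuming the configured string is resolved with `Enum.valueOf` (standard Java behavior: it throws `IllegalArgumentException` for a name that is not a constant). The class `EngineTypeDispatchSketch`, the nested enum `EngineType`, and both method names are hypothetical and only mirror the shape of the factory branch above; they are not Hudi's actual code.

```java
import java.util.Locale;

// Hypothetical sketch: an invalid engine type is rejected at parse time,
// before the factory-style dispatch is ever reached.
class EngineTypeDispatchSketch {

  // Simplified stand-in for HoodieIndex.BucketIndexEngineType.
  enum EngineType { SIMPLE, CONSISTENT_HASHING }

  // Assumption: the config getter resolves the raw value via Enum.valueOf,
  // which throws IllegalArgumentException for unknown names.
  static EngineType parseEngineType(String raw) {
    return EngineType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  // Returns the partitioner name chosen for each engine type.
  static String choosePartitioner(EngineType type) {
    switch (type) {
      case CONSISTENT_HASHING:
        return "RDDConsistentBucketPartitioner";
      case SIMPLE:
        return "RDDSimpleBucketPartitioner";
      default:
        // Only reachable if a new constant is added without a matching branch.
        throw new IllegalStateException("Unhandled engine type: " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(choosePartitioner(parseEngineType("simple")));
    try {
      parseEngineType("UNSUPPORTED");
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected before dispatch: " + e.getMessage());
    }
  }
}
```

The caveat is that a future enum constant without a matching branch would fall through silently in the factory unless something like the `default` case above is kept.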



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/HoodieSimpleBucketIndex.java:
##########
@@ -44,7 +44,7 @@ public HoodieSimpleBucketIndex(HoodieWriteConfig config) {
     super(config);
   }
 
-  private Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(
+  public Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(

Review Comment:
   done





[GitHub] [hudi] bvaradar commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1507984765

   Code changes look good. Will wait for tests to pass before merging.
   




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1415127017

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 7857a3839c59105bbb2d6eb8a210b7658dae02bd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] jonvex commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1416128488

   @danny0405 I did some work with BulkInsertPartitioner, so I can take a look.




[GitHub] [hudi] bvaradar commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1141157998


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDBucketIndexPartitioner.java:
##########
@@ -18,15 +18,155 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.SerializableSchema;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.FlatLists;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.hudi.table.HoodieTable;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+
 
 /**
  * Abstract of bucket index bulk_insert partitioner
- * TODO implement partitioner for SIMPLE BUCKET INDEX
  */
 public abstract class RDDBucketIndexPartitioner<T>
     implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+

Review Comment:
   @wuwenchi: I agree it was written like that before, but I wanted to take this opportunity to correct it, as we are bringing more engines to this specific use case. Can you elaborate on why it is convenient to use the RDD directly? One of the main reasons we introduced these abstractions is to allow proper reuse of core components across engines while remaining flexible enough to extend in engine-specific ways. Thanks.





[GitHub] [hudi] bvaradar commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1504414285

   @wuwenchi: Please ping me on this PR once you have addressed all comments and it is ready for review.




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1484637787

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * f928ae1c51cdf405ab1b039b2654709d5de40b38 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15688) 
   * 9a40108980f493961409d878dcba79b5642ba981 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1484812849

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 9a40108980f493961409d878dcba79b5642ba981 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15937) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] jonvex commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096055351


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDBucketIndexPartitioner.java:
##########
@@ -18,15 +18,155 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.SerializableSchema;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.FlatLists;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.hudi.table.HoodieTable;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+
 
 /**
  * Abstract of bucket index bulk_insert partitioner
  * TODO implement partitioner for SIMPLE BUCKET INDEX
  */
 public abstract class RDDBucketIndexPartitioner<T>
     implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+
+  public static final Logger LOG = LogManager.getLogger(RDDBucketIndexPartitioner.class);
+
+  public final HoodieTable table;
+  public final String[] sortColumnNames;
+  final List<String> indexKeyFields;
+  final boolean consistentLogicalTimestampEnabled;
+  public final List<Boolean> doAppend = new ArrayList<>();
+  public final List<String> fileIdPfxList = new ArrayList<>();

Review Comment:
   I think these fields can be private or protected.
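   The visibility suggestion above could look roughly like the sketch below. It is a minimal illustration only: `BucketPartitionerBaseSketch`, `SimpleBucketPartitionerSketch`, and `addFileIdPrefix` are hypothetical names rather than Hudi classes or methods, and the zero-padded prefix format is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical base class: state shared with subclasses is protected,
// and mutable lists stay private behind small accessor methods.
abstract class BucketPartitionerBaseSketch {
  // Subclasses may read the sort columns, but outside callers cannot.
  protected final String[] sortColumnNames;
  // Mutable state is private; subclasses append through addFileIdPrefix.
  private final List<String> fileIdPfxList = new ArrayList<>();

  protected BucketPartitionerBaseSketch(String sortColumns) {
    this.sortColumnNames = sortColumns == null ? null : sortColumns.split(",");
  }

  protected void addFileIdPrefix(String prefix) {
    fileIdPfxList.add(prefix);
  }

  // Read-only view for callers that need the prefixes.
  public List<String> getFileIdPfxList() {
    return Collections.unmodifiableList(fileIdPfxList);
  }
}

// Hypothetical subclass that registers one prefix per bucket.
final class SimpleBucketPartitionerSketch extends BucketPartitionerBaseSketch {
  SimpleBucketPartitionerSketch(String sortColumns, int numBuckets) {
    super(sortColumns);
    for (int i = 0; i < numBuckets; i++) {
      // Zero-padded bucket id; the format is assumed for illustration.
      addFileIdPrefix(String.format("%08d", i));
    }
  }

  int sortColumnCount() {
    return sortColumnNames == null ? 0 : sortColumnNames.length;
  }
}
```

   Keeping the lists private and exposing an unmodifiable view means a subclass can still populate them, while code outside the hierarchy cannot mutate partitioner state by accident.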





[GitHub] [hudi] YuweiXiao commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "YuweiXiao (via GitHub)" <gi...@apache.org>.
YuweiXiao commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096475783


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/HoodieSimpleBucketIndex.java:
##########
@@ -44,7 +44,7 @@ public HoodieSimpleBucketIndex(HoodieWriteConfig config) {
     super(config);
   }
 
-  private Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(
+  public Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(

Review Comment:
   Maybe keep it private.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDConsistentBucketPartitioner.java:
##########
@@ -18,82 +18,45 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
-import org.apache.hudi.avro.HoodieAvroUtils;
-import org.apache.hudi.common.config.SerializableSchema;
-import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.ConsistentHashingNode;
 import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.common.model.HoodieTableType;
-import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ValidationUtils;
-import org.apache.hudi.common.util.collection.FlatLists;
-import org.apache.hudi.common.util.collection.FlatLists.ComparableList;
 import org.apache.hudi.index.bucket.ConsistentBucketIdentifier;
 import org.apache.hudi.index.bucket.HoodieSparkConsistentBucketIndex;
-import org.apache.hudi.io.AppendHandleFactory;
-import org.apache.hudi.io.SingleFileHandleCreateFactory;
-import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.HoodieTable;
-
-import org.apache.avro.Schema;
-import org.apache.log4j.LogManager;
-import org.apache.log4j.Logger;
 import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
 
-import java.io.Serializable;
-import java.util.ArrayList;
-import java.util.Arrays;
 import java.util.Collections;
-import java.util.Comparator;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
 
-import scala.Tuple2;
-
 import static org.apache.hudi.config.HoodieClusteringConfig.PLAN_STRATEGY_SORT_COLUMNS;
 
 /**
  * A partitioner for (consistent hashing) bucket index used in bulk_insert
  */
 public class RDDConsistentBucketPartitioner<T> extends RDDBucketIndexPartitioner<T> {
 
-  private static final Logger LOG = LogManager.getLogger(RDDConsistentBucketPartitioner.class);
-
-  private final HoodieTable table;
-  private final List<String> indexKeyFields;
   private final Map<String, List<ConsistentHashingNode>> hashingChildrenNodes;
-  private final String[] sortColumnNames;
-  private final boolean preserveHoodieMetadata;
-  private final boolean consistentLogicalTimestampEnabled;
 
-  private List<Boolean> doAppend;
-  private List<String> fileIdPfxList;
-
-  public RDDConsistentBucketPartitioner(HoodieTable table, Map<String, String> strategyParams, boolean preserveHoodieMetadata) {
-    this.table = table;
-    this.indexKeyFields = Arrays.asList(table.getConfig().getBucketIndexHashField().split(","));
+  public RDDConsistentBucketPartitioner(HoodieTable table,
+                                        Map<String, String> strategyParams,
+                                        boolean preserveHoodieMetadata) {
+    super(table,
+        strategyParams.getOrDefault(PLAN_STRATEGY_SORT_COLUMNS.key(), null),
+        preserveHoodieMetadata);
     this.hashingChildrenNodes = new HashMap<>();
-    this.consistentLogicalTimestampEnabled = table.getConfig().isConsistentLogicalTimestampEnabled();
-    this.preserveHoodieMetadata = preserveHoodieMetadata;
-
-    if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) {
-      sortColumnNames = strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(",");
-    } else {
-      sortColumnNames = null;
-    }
   }
 
   public RDDConsistentBucketPartitioner(HoodieTable table) {
     this(table, Collections.emptyMap(), false);
     ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSparkConsistentBucketIndex,
         "RDDConsistentBucketPartitioner can only be used together with consistent hashing bucket index");
-    ValidationUtils.checkArgument(table.getMetaClient().getTableType().equals(HoodieTableType.MERGE_ON_READ),

Review Comment:
   We could keep this validation, as consistent hashing only supports MOR tables.
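   A minimal sketch of retaining that check in the subclass follows. The `TableType` enum, the `ConsistentBucketPartitionerSketch` class, and the local `checkArgument` helper are hypothetical stand-ins written for this example; they only mirror the shape of the `ValidationUtils.checkArgument` call shown in the diff and are not Hudi's actual code.

```java
// Hypothetical stand-in for the table type; not Hudi's HoodieTableType.
enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

// Hypothetical partitioner that rejects unsupported table types up front.
final class ConsistentBucketPartitionerSketch {
  private final TableType tableType;

  ConsistentBucketPartitionerSketch(TableType tableType) {
    // Fail fast at construction time rather than during the write.
    checkArgument(tableType == TableType.MERGE_ON_READ,
        "Consistent hashing bucket index only supports MERGE_ON_READ tables");
    this.tableType = tableType;
  }

  TableType tableType() {
    return tableType;
  }

  // Local helper with the same shape as a typical argument check.
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }
}
```

   Checking in the constructor means a copy-on-write table is rejected before any records are repartitioned.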





[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1416693235

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 62bb4d2a2a67dabf6452d9af1df17d0fd964d070 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14918) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1157483783


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/execution/bulkinsert/TestRDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.data.HoodieJavaRDD;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.HoodieSparkTable;
+import org.apache.hudi.testutils.HoodieClientTestHarness;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+public class TestRDDSimpleBucketPartitioner extends HoodieClientTestHarness {
+
+  @BeforeEach
+  public void setUp() throws Exception {
+    initPath();
+    initSparkContexts("TestRDDSimpleBucketPartitioner");
+    initFileSystem();
+    initTimelineService();
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+    cleanupResources();
+  }
+
+  @ParameterizedTest
+  @MethodSource("configParams")
+  public void testSimpleBucketPartitioner(HoodieTableType type, boolean partitionSort) throws IOException {
+    HoodieTestUtils.init(HoodieTestUtils.getDefaultHadoopConf(), basePath, type);
+    int bucketNum = 2;
+    HoodieWriteConfig config = HoodieWriteConfig
+        .newBuilder()
+        .withPath(basePath)
+        .withSchema(TRIP_EXAMPLE_SCHEMA)
+        .build();
+    config.setValue(HoodieIndexConfig.INDEX_TYPE, HoodieIndex.IndexType.BUCKET.name());
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE, HoodieIndex.BucketIndexEngineType.SIMPLE.name());
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD, "_row_key");
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS, "" + bucketNum);
+    if (partitionSort) {
+      config.setValue(HoodieWriteConfig.BULK_INSERT_SORT_MODE, BulkInsertSortMode.PARTITION_SORT.name());
+    }
+
+    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+    List<HoodieRecord> records = dataGenerator.generateInserts("0", 100);
+    HoodieJavaRDD<HoodieRecord> javaRDD = HoodieJavaRDD.of(records, context, 1);
+    javaRDD.map(HoodieRecord::getPartitionPath).count();
+
+    final HoodieSparkTable table = HoodieSparkTable.create(config, context);
+    BulkInsertPartitioner partitioner = BulkInsertInternalPartitionerFactory.get(table, config);
+    JavaRDD<HoodieRecord> repartitionRecords =
+        (JavaRDD<HoodieRecord>) partitioner.repartitionRecords(HoodieJavaRDD.getJavaRDD(javaRDD), 1);
+
+    assertEquals(bucketNum * javaRDD.map(HoodieRecord::getPartitionPath).distinct().count(),

Review Comment:
   done





[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1498954866

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * e343ba438b6f21746cd3aa85ad8f80c2fe469ce8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16163) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1416652140

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 7857a3839c59105bbb2d6eb8a210b7658dae02bd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887) 
   * 62bb4d2a2a67dabf6452d9af1df17d0fd964d070 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14918) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1416651170

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 7857a3839c59105bbb2d6eb8a210b7658dae02bd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887) 
   * 62bb4d2a2a67dabf6452d9af1df17d0fd964d070 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1415151395

   @jonvex @lokeshj1703 Would you be interested in reviewing this?




[GitHub] [hudi] bvaradar commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1495279011

   @wuwenchi : Can you look at the PR comments and address them when you get a chance?




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1495690555

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 9a40108980f493961409d878dcba79b5642ba981 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15937) 
   * ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4 UNKNOWN
   


[GitHub] [hudi] wuwenchi commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1504421161

   > @wuwenchi : Please ping me in this PR once you have addressed all comments and it is ready for review.
   
   @bvaradar All comments have now been addressed and the PR is ready for review. Thanks!




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1484646598

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * f928ae1c51cdf405ab1b039b2654709d5de40b38 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15688) 
   * 9a40108980f493961409d878dcba79b5642ba981 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15937) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1414693127

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * c598b3d64f3dbb5e0640a5adad05afbf13400afd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14886) 
   * 7857a3839c59105bbb2d6eb8a210b7658dae02bd UNKNOWN
   


[GitHub] [hudi] bvaradar commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1133165678


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/storage/HoodieSimpleBucketLayout.java:
##########
@@ -34,6 +34,7 @@ public class HoodieSimpleBucketLayout extends HoodieStorageLayout {
   public static final Set<WriteOperationType> SUPPORTED_OPERATIONS = CollectionUtils.createImmutableSet(
       WriteOperationType.INSERT,
       WriteOperationType.INSERT_PREPPED,
+      WriteOperationType.BULK_INSERT,

Review Comment:
   @wuwenchi @YuweiXiao : Should `HoodieBucketIndex.requiresTagging` also return true for `BULK_INSERT`?
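
   If the answer turns out to be yes, the change could be as small as adding one operation to the set that the tagging check consults. The sketch below is self-contained and only illustrates that idea: `OperationType` is a hypothetical stand-in for `WriteOperationType`, and the set of operations listed is an assumption, not the contents of the real `HoodieBucketIndex`.

```java
import java.util.EnumSet;
import java.util.Set;

public class TaggingSketch {

  // Hypothetical stand-in for WriteOperationType; only a few values are modelled.
  enum OperationType { INSERT, UPSERT, BULK_INSERT, DELETE, COMPACT }

  // Assumption for illustration: operations whose records need a bucket
  // location assigned before they are written.
  private static final Set<OperationType> NEEDS_TAGGING = EnumSet.of(
      OperationType.INSERT,
      OperationType.UPSERT,
      OperationType.BULK_INSERT, // the addition being asked about
      OperationType.DELETE);

  // Mirrors the shape of a requiresTagging(operationType) check.
  static boolean requiresTagging(OperationType operationType) {
    return NEEDS_TAGGING.contains(operationType);
  }

  public static void main(String[] args) {
    System.out.println(requiresTagging(OperationType.BULK_INSERT));
    System.out.println(requiresTagging(OperationType.COMPACT));
  }
}
```

   Keeping the supported-operation set and the tagging set next to each other would make it harder for the two to drift apart, although whether bulk insert should be tagged at all is the open question here.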



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDBucketIndexPartitioner.java:
##########
@@ -18,15 +18,155 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.SerializableSchema;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.FlatLists;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.hudi.table.HoodieTable;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+
 
 /**
  * Abstract of bucket index bulk_insert partitioner
- * TODO implement partitioner for SIMPLE BUCKET INDEX
  */
 public abstract class RDDBucketIndexPartitioner<T>
     implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+

Review Comment:
   `HoodieBucketIndex` is defined in an engine-agnostic way (it uses `HoodieData` and `HoodieEngineContext`). Can we also define the base partitioner class using these abstractions instead of using `JavaRDD` directly?
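
   To make the suggestion concrete, here is a minimal sketch of a base partitioner written against a neutral collection interface rather than an engine type. Everything in it is hypothetical: `RecordCollection` only plays the role that `HoodieData` would play, `ListCollection` is an in-memory stand-in for an engine-backed implementation, and none of the names are taken from the Hudi code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

public class EngineAgnosticPartitionerSketch {

  // Hypothetical engine-neutral collection (the role HoodieData would play).
  interface RecordCollection<R> {
    RecordCollection<R> repartitionBy(ToIntFunction<R> partitionOf, int numPartitions);

    List<List<R>> partitions();
  }

  // In-memory implementation; a Spark-backed one would wrap a JavaRDD instead.
  static final class ListCollection<R> implements RecordCollection<R> {
    private final List<List<R>> parts;

    ListCollection(List<List<R>> parts) {
      this.parts = parts;
    }

    static <R> ListCollection<R> of(List<R> records) {
      List<List<R>> single = new ArrayList<>();
      single.add(new ArrayList<>(records));
      return new ListCollection<>(single);
    }

    @Override
    public RecordCollection<R> repartitionBy(ToIntFunction<R> partitionOf, int numPartitions) {
      List<List<R>> out = new ArrayList<>();
      for (int i = 0; i < numPartitions; i++) {
        out.add(new ArrayList<>());
      }
      for (List<R> part : parts) {
        for (R record : part) {
          out.get(partitionOf.applyAsInt(record)).add(record);
        }
      }
      return new ListCollection<>(out);
    }

    @Override
    public List<List<R>> partitions() {
      return parts;
    }
  }

  // Base partitioner that depends only on the neutral interface.
  abstract static class BucketPartitioner<R> {
    protected abstract int bucketOf(R record, int numBuckets);

    RecordCollection<R> repartition(RecordCollection<R> records, int numBuckets) {
      return records.repartitionBy(r -> bucketOf(r, numBuckets), numBuckets);
    }
  }

  // Example subclass: buckets string keys by a non-negative hash.
  static final class HashBucketPartitioner extends BucketPartitioner<String> {
    @Override
    protected int bucketOf(String key, int numBuckets) {
      return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }
  }

  public static void main(String[] args) {
    RecordCollection<String> out = new HashBucketPartitioner()
        .repartition(ListCollection.of(Arrays.asList("a", "b", "c", "d")), 2);
    System.out.println(out.partitions());
  }
}
```

   With this split, the bucket-assignment logic lives in the base class and each engine only supplies its own collection implementation.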





[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1161192206


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -358,6 +358,13 @@ public static String createNewFileIdPfx() {
     return UUID.randomUUID().toString();
   }
 
+  /**
+   * Returns prefix for a file group from fileId.
+   */
+  public static String getFileIdPfxFromFileId(String fileId) {

Review Comment:
   done





[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1157483229


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {
+
+  public RDDSimpleBucketPartitioner(HoodieTable table) {
+    super(table, null, false);
+    ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSimpleBucketIndex);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputPartitions) {
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    HashMap<String, Integer> fileIdToIdx = new HashMap<>();
+
+    // Map <partition, <bucketNo, fileID>>
+    Map<String, HashMap<Integer, String>> partitionMapper = getPartitionMapper(records, fileIdToIdx);
+
+    return doPartition(records, new Partitioner() {
+      @Override
+      public int numPartitions() {
+        return index.getNumBuckets() * partitionMapper.size();
+      }
+
+      @Override
+      public int getPartition(Object key) {
+        HoodieKey hoodieKey = (HoodieKey) key;
+        String partitionPath = hoodieKey.getPartitionPath();
+        int bucketID = index.getBucketID(hoodieKey);
+        String fileID = partitionMapper.get(partitionPath).get(bucketID);
+        return fileIdToIdx.get(fileID);
+      }
+    });
+  }
+
+  Map<String, HashMap<Integer, String>> getPartitionMapper(JavaRDD<HoodieRecord<T>> records,
+                                                           HashMap<String, Integer> fileIdToIdx) {
+
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    int numBuckets = index.getNumBuckets();
+    return records
+        .map(HoodieRecord::getPartitionPath)
+        .distinct().collect().stream()
+        .collect(Collectors.toMap(p -> p, p -> {
+          Map<Integer, HoodieRecordLocation> locationMap = index.loadPartitionBucketIdFileIdMapping(table, p);
+          HashMap<Integer, String> fileIdMap = new HashMap<>();

Review Comment:
   done
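
   For readers following the hunk above: the partitioner reports `index.getNumBuckets() * partitionMapper.size()` output partitions and routes each key by its partition path and bucket ID. The standalone sketch below shows one way such a layout can be computed. It is an illustration under stated assumptions, not the PR's code: the class and method names are hypothetical, the composite `path#bucket` key replaces the file-ID lookup, and the hash-modulo bucket function is assumed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketPartitionIndexSketch {

  // Assumption: a bucket ID is a non-negative hash of the record key modulo the bucket count.
  static int bucketId(String recordKey, int numBuckets) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // Give every (table partition, bucket) pair its own output partition index,
  // so the total is numBuckets * partitionPaths.size().
  static Map<String, Integer> buildIndex(List<String> partitionPaths, int numBuckets) {
    Map<String, Integer> index = new HashMap<>();
    int next = 0;
    for (String path : partitionPaths) {
      for (int bucket = 0; bucket < numBuckets; bucket++) {
        index.put(path + "#" + bucket, next++);
      }
    }
    return index;
  }

  // Equivalent of getPartition(key): look up the slot for this key's partition path and bucket.
  static int getPartition(Map<String, Integer> index, String partitionPath,
                          String recordKey, int numBuckets) {
    return index.get(partitionPath + "#" + bucketId(recordKey, numBuckets));
  }

  public static void main(String[] args) {
    int numBuckets = 4;
    Map<String, Integer> index = buildIndex(Arrays.asList("2023/01", "2023/02"), numBuckets);
    System.out.println(index.size());
    System.out.println(getPartition(index, "2023/02", "a", numBuckets));
  }
}
```

   Because every bucket of every table partition gets a distinct slot, all records for one file group end up in a single output partition, which is what lets one writer handle one bucket file.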



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RDDSimpleBucketPartitioner<T extends HoodieRecordPayload> extends RDDBucketIndexPartitioner<T> {
+
+  public RDDSimpleBucketPartitioner(HoodieTable table) {
+    super(table, null, false);
+    ValidationUtils.checkArgument(table.getIndex() instanceof HoodieSimpleBucketIndex);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputPartitions) {
+    HoodieSimpleBucketIndex index = (HoodieSimpleBucketIndex) table.getIndex();
+    HashMap<String, Integer> fileIdToIdx = new HashMap<>();
+
+    // Map <partition, <bucketNo, fileID>>
+    Map<String, HashMap<Integer, String>> partitionMapper = getPartitionMapper(records, fileIdToIdx);
+
+    return doPartition(records, new Partitioner() {
+      @Override
+      public int numPartitions() {
+        return index.getNumBuckets() * partitionMapper.size();
+      }
+
+      @Override
+      public int getPartition(Object key) {
+        HoodieKey hoodieKey = (HoodieKey) key;
+        String partitionPath = hoodieKey.getPartitionPath();
+        int bucketID = index.getBucketID(hoodieKey);
+        String fileID = partitionMapper.get(partitionPath).get(bucketID);
+        return fileIdToIdx.get(fileID);
+      }
+    });
+  }
+
+  Map<String, HashMap<Integer, String>> getPartitionMapper(JavaRDD<HoodieRecord<T>> records,
+                                                           HashMap<String, Integer> fileIdToIdx) {

Review Comment:
   done





[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1498420953

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16123) 
   * 7c0eb29f0f231e39acd22dfcb131ebe7c780eeb6 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1498426530

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16123) 
   * 7c0eb29f0f231e39acd22dfcb131ebe7c780eeb6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16154) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1498549061

   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 7c0eb29f0f231e39acd22dfcb131ebe7c780eeb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16154) 
   * e343ba438b6f21746cd3aa85ad8f80c2fe469ce8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1498585859

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14886",
       "triggerID" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887",
       "triggerID" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "62bb4d2a2a67dabf6452d9af1df17d0fd964d070",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14918",
       "triggerID" : "62bb4d2a2a67dabf6452d9af1df17d0fd964d070",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f928ae1c51cdf405ab1b039b2654709d5de40b38",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15688",
       "triggerID" : "f928ae1c51cdf405ab1b039b2654709d5de40b38",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9a40108980f493961409d878dcba79b5642ba981",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15937",
       "triggerID" : "9a40108980f493961409d878dcba79b5642ba981",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16123",
       "triggerID" : "ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7c0eb29f0f231e39acd22dfcb131ebe7c780eeb6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16154",
       "triggerID" : "7c0eb29f0f231e39acd22dfcb131ebe7c780eeb6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e343ba438b6f21746cd3aa85ad8f80c2fe469ce8",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16163",
       "triggerID" : "e343ba438b6f21746cd3aa85ad8f80c2fe469ce8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 7c0eb29f0f231e39acd22dfcb131ebe7c780eeb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16154) 
   * e343ba438b6f21746cd3aa85ad8f80c2fe469ce8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16163) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] bvaradar commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1159244886


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -358,6 +358,13 @@ public static String createNewFileIdPfx() {
     return UUID.randomUUID().toString();
   }
 
+  /**
+   * Returns prefix for a file group from fileId.
+   */
+  public static String getFileIdPfxFromFileId(String fileId) {

Review Comment:
   This looks dangerous. Instead, can you split by delimiters and take the right part?
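
   A rough sketch of the delimiter-based approach suggested above. It assumes a file ID has the form `<fileIdPfx>-<index>`, where the prefix may itself contain hyphens (for example a UUID), so the cut is made at the last delimiter rather than at a fixed offset. The class name, method name, and delimiter are hypothetical and are not the code in this PR:

```java
public class FileIdPrefixSketch {
  // Assumed delimiter between the file ID prefix and the trailing index.
  private static final char DELIMITER = '-';

  /**
   * Returns everything before the last delimiter of fileId.
   * Fails loudly when the delimiter is missing instead of returning a wrong prefix.
   */
  public static String prefixOf(String fileId) {
    int idx = fileId.lastIndexOf(DELIMITER);
    if (idx <= 0) {
      throw new IllegalArgumentException("Unexpected file ID format: " + fileId);
    }
    return fileId.substring(0, idx);
  }

  public static void main(String[] args) {
    // Illustrative input only: a UUID-like prefix followed by an index.
    System.out.println(prefixOf("6f3c2b1a-0d4e-4c8b-9a77-1f2e3d4c5b6a-0"));
  }
}
```

   Rejecting input without a delimiter keeps a malformed file ID from being silently truncated.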





[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1495701914

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14886",
       "triggerID" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887",
       "triggerID" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "62bb4d2a2a67dabf6452d9af1df17d0fd964d070",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14918",
       "triggerID" : "62bb4d2a2a67dabf6452d9af1df17d0fd964d070",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f928ae1c51cdf405ab1b039b2654709d5de40b38",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15688",
       "triggerID" : "f928ae1c51cdf405ab1b039b2654709d5de40b38",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9a40108980f493961409d878dcba79b5642ba981",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15937",
       "triggerID" : "9a40108980f493961409d878dcba79b5642ba981",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16123",
       "triggerID" : "ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 9a40108980f493961409d878dcba79b5642ba981 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15937) 
   * ea2b9b0f25b87cbf7212ce87bb9b6e9ba63ddab4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16123) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1414677006

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * c598b3d64f3dbb5e0640a5adad05afbf13400afd UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1414822566

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14886",
       "triggerID" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887",
       "triggerID" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * c598b3d64f3dbb5e0640a5adad05afbf13400afd Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14886) 
   * 7857a3839c59105bbb2d6eb8a210b7658dae02bd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096474189


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDBucketIndexPartitioner.java:
##########
@@ -18,15 +18,155 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.SerializableSchema;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.FlatLists;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.hudi.table.HoodieTable;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+
 
 /**
  * Abstract of bucket index bulk_insert partitioner
  * TODO implement partitioner for SIMPLE BUCKET INDEX
  */
 public abstract class RDDBucketIndexPartitioner<T>
     implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+
+  public static final Logger LOG = LogManager.getLogger(RDDBucketIndexPartitioner.class);
+
+  public final HoodieTable table;
+  public final String[] sortColumnNames;
+  final List<String> indexKeyFields;
+  final boolean consistentLogicalTimestampEnabled;
+  public final List<Boolean> doAppend = new ArrayList<>();
+  public final List<String> fileIdPfxList = new ArrayList<>();

Review Comment:
   Done





[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1133679532


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDBucketIndexPartitioner.java:
##########
@@ -18,15 +18,155 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.SerializableSchema;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.FlatLists;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.hudi.table.HoodieTable;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+
 
 /**
  * Abstract of bucket index bulk_insert partitioner
- * TODO implement partitioner for SIMPLE BUCKET INDEX
  */
 public abstract class RDDBucketIndexPartitioner<T>
     implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+

Review Comment:
   This is how it was originally written. I guess it is because Flink already has its own bucket implementation, and this one involves re-partitioning operations, so it is more convenient to use the RDD directly.





[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1466061413

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14886",
       "triggerID" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887",
       "triggerID" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "62bb4d2a2a67dabf6452d9af1df17d0fd964d070",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14918",
       "triggerID" : "62bb4d2a2a67dabf6452d9af1df17d0fd964d070",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f928ae1c51cdf405ab1b039b2654709d5de40b38",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15688",
       "triggerID" : "f928ae1c51cdf405ab1b039b2654709d5de40b38",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * f928ae1c51cdf405ab1b039b2654709d5de40b38 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15688) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wuwenchi commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1414613251

   In the code, part of the abstract class `RDDBucketIndexPartitioner` comes from `RDDConsistentBucketPartitioner`. `RDDSimpleBucketPartitioner` adds its own `repartitionRecords` method.
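
   The split described above (shared logic in an abstract base, with each subclass supplying its own `repartitionRecords`) can be sketched roughly as follows. This is a simplified illustration on plain Java lists, not the Spark-based classes in this PR. All names are hypothetical, and the hash-modulo bucketing rule is an assumption made for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BucketPartitionerSketch {

  /** Shared behaviour lives in the abstract base; subclasses decide how keys map to buckets. */
  public abstract static class BucketIndexPartitioner {
    /** Groups record keys so that index i of the result holds the keys for bucket i. */
    public abstract List<List<String>> repartitionRecords(List<String> recordKeys, int numBuckets);

    /** Helper shared by all subclasses: creates numBuckets empty groups. */
    protected List<List<String>> emptyBuckets(int numBuckets) {
      List<List<String>> buckets = new ArrayList<>();
      for (int i = 0; i < numBuckets; i++) {
        buckets.add(new ArrayList<>());
      }
      return buckets;
    }
  }

  /** Simple variant: a fixed number of buckets, chosen by hashing the key. */
  public static class SimpleBucketPartitioner extends BucketIndexPartitioner {
    @Override
    public List<List<String>> repartitionRecords(List<String> recordKeys, int numBuckets) {
      List<List<String>> buckets = emptyBuckets(numBuckets);
      for (String key : recordKeys) {
        // Assumed rule: non-negative remainder of the key hash by the bucket count.
        int bucketId = Math.floorMod(key.hashCode(), numBuckets);
        buckets.get(bucketId).add(key);
      }
      return buckets;
    }
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("a", "b", "a", "c");
    System.out.println(new SimpleBucketPartitioner().repartitionRecords(keys, 4));
  }
}
```

   Records with the same key always land in the same bucket, which is the property a bucket index relies on.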




[GitHub] [hudi] wuwenchi commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1414731824

   @YuweiXiao @leesf @codope can you help review it? Thanks! 




[GitHub] [hudi] jonvex commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1096053251


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDBucketIndexPartitioner.java:
##########
@@ -18,15 +18,155 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.SerializableSchema;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.FlatLists;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.hudi.table.HoodieTable;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+
 
 /**
  * Abstract of bucket index bulk_insert partitioner
  * TODO implement partitioner for SIMPLE BUCKET INDEX

Review Comment:
   Remove the TODO line.





[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1133675527


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/storage/HoodieSimpleBucketLayout.java:
##########
@@ -34,6 +34,7 @@ public class HoodieSimpleBucketLayout extends HoodieStorageLayout {
   public static final Set<WriteOperationType> SUPPORTED_OPERATIONS = CollectionUtils.createImmutableSet(
       WriteOperationType.INSERT,
       WriteOperationType.INSERT_PREPPED,
+      WriteOperationType.BULK_INSERT,

Review Comment:
   Yes, `BULK_INSERT` here should return true. Fixed.
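
   For illustration, a membership check over an immutable set of supported operations might look like the sketch below. The enum is a hypothetical stand-in for `WriteOperationType` (only the three constants visible in the diff above are taken from the source; `COMPACT` is added here as an example of an unsupported value), and `isSupported` is an assumed method name:

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

public class SupportedOperationsSketch {

  /** Hypothetical stand-in for the real write operation enum. */
  public enum WriteOp { INSERT, INSERT_PREPPED, BULK_INSERT, COMPACT }

  // Immutable set of operations the layout accepts; BULK_INSERT is now included.
  public static final Set<WriteOp> SUPPORTED_OPERATIONS = Collections.unmodifiableSet(
      EnumSet.of(WriteOp.INSERT, WriteOp.INSERT_PREPPED, WriteOp.BULK_INSERT));

  /** Returns true when the given operation is in the supported set. */
  public static boolean isSupported(WriteOp op) {
    return SUPPORTED_OPERATIONS.contains(op);
  }

  public static void main(String[] args) {
    System.out.println(isSupported(WriteOp.BULK_INSERT));
    System.out.println(isSupported(WriteOp.COMPACT));
  }
}
```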





[GitHub] [hudi] hudi-bot commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1465886571

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14886",
       "triggerID" : "c598b3d64f3dbb5e0640a5adad05afbf13400afd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14887",
       "triggerID" : "7857a3839c59105bbb2d6eb8a210b7658dae02bd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "62bb4d2a2a67dabf6452d9af1df17d0fd964d070",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14918",
       "triggerID" : "62bb4d2a2a67dabf6452d9af1df17d0fd964d070",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f928ae1c51cdf405ab1b039b2654709d5de40b38",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15688",
       "triggerID" : "f928ae1c51cdf405ab1b039b2654709d5de40b38",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b5a005f667dbcbb03c3c36297e6ba9fd4bad5d1c UNKNOWN
   * 62bb4d2a2a67dabf6452d9af1df17d0fd964d070 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14918) 
   * f928ae1c51cdf405ab1b039b2654709d5de40b38 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15688) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1149056263


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDBucketIndexPartitioner.java:
##########
@@ -18,15 +18,155 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.SerializableSchema;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.FlatLists;
+import org.apache.hudi.io.AppendHandleFactory;
+import org.apache.hudi.io.SingleFileHandleCreateFactory;
+import org.apache.hudi.io.WriteHandleFactory;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.hudi.table.HoodieTable;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+
 
 /**
  * Abstract of bucket index bulk_insert partitioner
- * TODO implement partitioner for SIMPLE BUCKET INDEX
  */
 public abstract class RDDBucketIndexPartitioner<T>
     implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+

Review Comment:
   It's a good idea. I tried to abstract some methods into hudi-client.





[GitHub] [hudi] wuwenchi commented on a diff in pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on code in PR #7834:
URL: https://github.com/apache/hudi/pull/7834#discussion_r1157483471


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/execution/bulkinsert/TestRDDSimpleBucketPartitioner.java:
##########
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.data.HoodieJavaRDD;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.HoodieSparkTable;
+import org.apache.hudi.testutils.HoodieClientTestHarness;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+public class TestRDDSimpleBucketPartitioner extends HoodieClientTestHarness {
+
+  @BeforeEach
+  public void setUp() throws Exception {
+    initPath();
+    initSparkContexts("TestRDDSimpleBucketPartitioner");
+    initFileSystem();
+    initTimelineService();
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+    cleanupResources();
+  }
+
+  @ParameterizedTest
+  @MethodSource("configParams")
+  public void testSimpleBucketPartitioner(HoodieTableType type, boolean partitionSort) throws IOException {
+    HoodieTestUtils.init(HoodieTestUtils.getDefaultHadoopConf(), basePath, type);
+    int bucketNum = 2;
+    HoodieWriteConfig config = HoodieWriteConfig
+        .newBuilder()
+        .withPath(basePath)
+        .withSchema(TRIP_EXAMPLE_SCHEMA)
+        .build();
+    config.setValue(HoodieIndexConfig.INDEX_TYPE, HoodieIndex.IndexType.BUCKET.name());
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE, HoodieIndex.BucketIndexEngineType.SIMPLE.name());
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD, "_row_key");
+    config.setValue(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS, "" + bucketNum);
+    if (partitionSort) {
+      config.setValue(HoodieWriteConfig.BULK_INSERT_SORT_MODE, BulkInsertSortMode.PARTITION_SORT.name());
+    }
+
+    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+    List<HoodieRecord> records = dataGenerator.generateInserts("0", 100);
+    HoodieJavaRDD<HoodieRecord> javaRDD = HoodieJavaRDD.of(records, context, 1);
+    javaRDD.map(HoodieRecord::getPartitionPath).count();

Review Comment:
   done
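
For readers following the test setup above (bucket index, hash field `_row_key`, 2 buckets), here is a minimal sketch of the general idea behind simple bucketing: hash the record key, then take it modulo the bucket count. It is an illustration only and not Hudi's actual implementation. The class `SimpleBucketSketch` and the methods `bucketIdFor` and `groupByBucket` are hypothetical names, and the use of `String.hashCode()` is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; none of these names come from the PR or from Hudi.
final class SimpleBucketSketch {

  // Maps a record key to a bucket id in the range [0, numBuckets).
  static int bucketIdFor(String recordKey, int numBuckets) {
    if (numBuckets <= 0) {
      throw new IllegalArgumentException("numBuckets must be positive");
    }
    // Clear the sign bit so the modulo result is never negative.
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // Groups keys by bucket id, the way a partitioner would route records
  // so that each bucket's records end up together.
  static Map<Integer, List<String>> groupByBucket(List<String> keys, int numBuckets) {
    Map<Integer, List<String>> buckets = new TreeMap<>();
    for (String key : keys) {
      buckets.computeIfAbsent(bucketIdFor(key, numBuckets), id -> new ArrayList<>()).add(key);
    }
    return buckets;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("a", "b", "ab");
    // With 2 buckets, every key lands in bucket 0 or bucket 1.
    groupByBucket(keys, 2).forEach((bucket, members) ->
        System.out.println("bucket " + bucket + " -> " + members));
  }
}
```

Because the mapping depends only on the key and the bucket count, the same key always goes to the same bucket, which is what lets a bulk insert place each record in its bucket without an index lookup.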





[GitHub] [hudi] wuwenchi commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert

Posted by "wuwenchi (via GitHub)" <gi...@apache.org>.
wuwenchi commented on PR #7834:
URL: https://github.com/apache/hudi/pull/7834#issuecomment-1496245214

   > @wuwenchi : Can you look at the PR comments and address them when you get a chance.
   
   @bvaradar  Sorry for the delay. I have made some of the changes, but there seems to be a merge conflict; I will resolve it tomorrow.

