Posted to commits@hudi.apache.org by "xushiyan (via GitHub)" <gi...@apache.org> on 2023/03/31 21:42:56 UTC

[GitHub] [hudi] xushiyan opened a new pull request, #8344: [HUDI-5968] Fix global index duplicate when update partition

xushiyan opened a new pull request, #8344:
URL: https://github.com/apache/hudi/pull/8344

   ### Change Logs
   
   When a global index (bloom or simple) is used with partition path updates enabled, duplicates can arise as follows: a record starts in p1 and is later updated to p2; when it is then updated to p3 before compaction has happened, the global index joins the incoming record against both old versions (in p1 and p2) and tags 2 records to insert into p3. These duplicates stay in the dataset and are not reconciled unless the table is deduplicated manually.
   
   This patch ensures that dedup happens within the indexing (tagging) phase.
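
The p1, p2, p3 scenario above can be sketched in plain Java. This is a minimal illustration, not Hudi code: `Tagged`, `exampleTagging`, `dedup`, and `countInserts` are hypothetical names, and the model assumes that each match against an old version yields one delete for the old partition and one insert for the new partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionUpdateDedupSketch {

  /** Hypothetical stand-in for a tagged record; not a Hudi class. */
  static final class Tagged {
    final String recordKey;
    final String partition;
    final boolean partitionUpdateInsert; // true for an insert caused by a partition update

    Tagged(String recordKey, String partition, boolean partitionUpdateInsert) {
      this.recordKey = recordKey;
      this.partition = partition;
      this.partitionUpdateInsert = partitionUpdateInsert;
    }
  }

  /** Record X moves to p3 while old versions still exist in both p1 and p2. */
  static List<Tagged> exampleTagging() {
    return Arrays.asList(
        new Tagged("X", "p1", false), // delete of the old version in p1
        new Tagged("X", "p3", true),  // insert to p3, from the p1 match
        new Tagged("X", "p2", false), // delete of the old version in p2
        new Tagged("X", "p3", true)); // insert to p3 again, from the p2 match
  }

  /** Keeps one partition-update insert per (key, partition); other records pass through. */
  static List<Tagged> dedup(List<Tagged> tagged) {
    Map<String, Tagged> inserts = new LinkedHashMap<>();
    List<Tagged> result = new ArrayList<>();
    for (Tagged t : tagged) {
      if (t.partitionUpdateInsert) {
        inserts.putIfAbsent(t.recordKey + "|" + t.partition, t);
      } else {
        result.add(t);
      }
    }
    result.addAll(inserts.values());
    return result;
  }

  static long countInserts(List<Tagged> tagged) {
    return tagged.stream().filter(t -> t.partitionUpdateInsert).count();
  }

  public static void main(String[] args) {
    List<Tagged> before = exampleTagging();
    List<Tagged> after = dedup(before);
    System.out.println("inserts before dedup: " + countInserts(before)); // 2
    System.out.println("inserts after dedup:  " + countInserts(after));  // 1
  }
}
```

Without the dedup step both inserts would be written to p3, which is the duplicate described above.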
   
   ### Impact
   
   Global index has an extra dedup step for some records, which may slow down the whole process if a lot of partition updates happen. In most scenarios, partition updates are rare and the perf impact is negligible.
   
   ### Risk level (write none, low medium or high below)
   
   Medium
   
   ### Documentation Update
   
   - [ ] New config `hoodie.global.index.dedup.parallelism`
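
As a rough illustration of where the new key would sit, the sketch below collects writer options in a `java.util.Properties` object. The class and method names are hypothetical, the parallelism value `100` is an arbitrary placeholder (the PR text here does not state a default), and the key itself may change given the review discussion in this thread. `hoodie.index.type` and `hoodie.simple.index.update.partition.path` are existing Hudi index options to my knowledge.

```java
import java.util.Properties;

class GlobalIndexWriteConfigSketch {

  /** Builds an illustrative set of writer options for a global simple index. */
  static Properties globalIndexProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.index.type", "GLOBAL_SIMPLE");
    // Allow a record to move to a new partition when its partition path changes.
    props.setProperty("hoodie.simple.index.update.partition.path", "true");
    // Key proposed in this PR; the value is a placeholder, not a recommended setting.
    props.setProperty("hoodie.global.index.dedup.parallelism", "100");
    return props;
  }

  public static void main(String[] args) {
    globalIndexProps().forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```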
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1493149491

   ## CI report:
   
   * 51b9969e85c03f3f5c8274782d6ac5810c760ab1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029) 
   * fa1b1525a163af85271f0dc9e0d5765ea2075044 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495611890

   ## CI report:
   
   * 7624300eb0d7205a4924783606226bbdfd49ad5a Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109) 
   * f021bc3227eea58d049420227564b9c98589534e UNKNOWN
   


[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156783371


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -168,4 +171,36 @@ public static List<String> filterKeysFromFile(Path filePath, List<String> candid
     }
     return foundRecordKeys;
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> dedupForPartitionUpdates(HoodieData<Pair<HoodieRecord<R>, Boolean>> taggedHoodieRecords, int dedupParallelism) {
+    /*
+     * In case a record is updated from p1 to p2 and then to p3, 2 existing records
+     * will be tagged for the incoming record to insert to p3. So we dedup them here. (Set A)
+     */
+    HoodiePairData<String, HoodieRecord<R>> deduped = taggedHoodieRecords.filter(Pair::getRight)
+        .map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getKey, dedupParallelism)
+        .mapToPair(r -> Pair.of(r.getRecordKey(), r));
+
+    /*
+     * This includes
+     *  - tagged existing records whose partition paths are not to be updated (Set B)
+     *  - completely new records (Set C)
+     */
+    HoodieData<HoodieRecord<R>> undeduped = taggedHoodieRecords.filter(p -> !p.getRight()).map(Pair::getLeft);
+
+    /*
+     * There can be intersection between Set A and Set B mentioned above.
+     *
+     * Example: record X is updated from p1 to p2 and then back to p1.
+     * Set A will contain an insert to p1 and Set B will contain an update to p1.
+     *
+     * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
+     */
+    return deduped.leftOuterJoin(undeduped
+            .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))

Review Comment:
   Synced up directly. Let's add Javadocs to call this out, i.e. why we should strictly favor the update record over the insert, so that anyone looking to change this code block is aware of all the nuances.
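
The reasoning behind favoring the update can be sketched with plain collections. This is an illustration under assumptions, not the PR's implementation: the quoted hunk is truncated after the filter, so the sketch assumes the join keeps only Set A inserts whose key has no non-delete match among the undeduped records, then unions the result with all undeduped records. `Rec`, `reconcile`, `backToP1`, and `onwardToP3` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FavorUpdateOverInsertSketch {

  /** Hypothetical tagged record: an operation on a record key in a partition. */
  static final class Rec {
    final String op; // "INSERT", "UPDATE" or "DELETE"
    final String key;
    final String partition;

    Rec(String op, String key, String partition) {
      this.op = op;
      this.key = key;
      this.partition = partition;
    }

    @Override
    public String toString() {
      return op + " " + key + "@" + partition;
    }
  }

  /** Drops Set A inserts whose key already has a non-delete record; keeps everything else. */
  static List<String> reconcile(List<Rec> setA, List<Rec> undeduped) {
    Set<String> keysWithNonDelete = new HashSet<>();
    for (Rec r : undeduped) {
      // Deletes are ignored here, mirroring the EmptyHoodieRecordPayload filter in the hunk.
      if (!"DELETE".equals(r.op)) {
        keysWithNonDelete.add(r.key);
      }
    }
    List<String> out = new ArrayList<>();
    for (Rec r : setA) {
      if (!keysWithNonDelete.contains(r.key)) {
        out.add(r.toString());
      }
    }
    for (Rec r : undeduped) {
      out.add(r.toString());
    }
    return out;
  }

  /** X went p1 -> p2 and now comes back to p1: the update to p1 wins over the insert. */
  static List<String> backToP1() {
    return reconcile(
        Arrays.asList(new Rec("INSERT", "X", "p1")),
        Arrays.asList(new Rec("DELETE", "X", "p2"), new Rec("UPDATE", "X", "p1")));
  }

  /** X went p1 -> p2 and now moves on to p3: only deletes match, so the insert is kept. */
  static List<String> onwardToP3() {
    return reconcile(
        Arrays.asList(new Rec("INSERT", "X", "p3")),
        Arrays.asList(new Rec("DELETE", "X", "p1"), new Rec("DELETE", "X", "p2")));
  }

  public static void main(String[] args) {
    System.out.println(backToP1());   // [DELETE X@p2, UPDATE X@p1]
    System.out.println(onwardToP3()); // [INSERT X@p3, DELETE X@p1, DELETE X@p2]
  }
}
```

In the first case, keeping the insert alongside the update would leave two copies of X in p1. In the second case, excluding deletes from the join is what lets the single insert to p3 survive.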





[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1492758844

   ## CI report:
   
   * 51b9969e85c03f3f5c8274782d6ac5810c760ab1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029) 
   


[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495307023

   ## CI report:
   
   * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104) 
   


[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1155553523


##########
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieSimpleDataGenerator.java:
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.testutils;
+
+import org.apache.hudi.common.model.DefaultHoodieRecordPayload;
+import org.apache.hudi.common.model.HoodieAvroRecord;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class HoodieSimpleDataGenerator {

Review Comment:
   Is it not possible to use HoodieTestDataGenerator or one of the other existing generators? We should try to standardize on these test data generators. Ensuring there is no flakiness or bugs in new ones is hard. Let's try to stick to the ones we already have.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1155054255


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/simple/HoodieGlobalSimpleIndex.java:
##########
@@ -135,8 +135,8 @@ private <R> HoodieData<HoodieRecord<R>> getTaggedRecords(
               HoodieRecord<R> deleteRecord = new HoodieAvroRecord(new HoodieKey(inputRecord.getRecordKey(), partitionPath), new EmptyHoodieRecordPayload());
               deleteRecord.setCurrentLocation(location);
               deleteRecord.seal();
-              // Tag the incoming record for inserting to the new partition
-              HoodieRecord<R> insertRecord = (HoodieRecord<R>) HoodieIndexUtils.getTaggedRecord(inputRecord, Option.empty());
+              // Tag the incoming record for inserting to the new partition; left unsealed for marking as dedup later
+              HoodieRecord<R> insertRecord = (HoodieRecord<R>) HoodieIndexUtils.getUnsealedTaggedRecord(inputRecord, Option.empty());

Review Comment:
   I feel we are retrofitting the sealing property to meet our goals. We should just map the record to a pair (record, isUpdate (boolean)) within the flatMap and then use that property instead of the seal. I don't want the sealing property to be used for external filtering purposes.
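
The suggestion can be sketched with Java streams: the tagging `flatMap` emits each record together with an explicit boolean, and later stages filter on that boolean instead of inspecting whether the record is sealed. This is a simplified illustration, not Hudi code: `Match`, `tag`, `example`, and `flaggedForDedup` are hypothetical, records are modeled as strings, and the flag here marks inserts produced by a partition update.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TagWithFlagSketch {

  /** Hypothetical join result: an incoming record matched against one existing location. */
  static final class Match {
    final String recordKey;
    final String incomingPartition;
    final String existingPartition;

    Match(String recordKey, String incomingPartition, String existingPartition) {
      this.recordKey = recordKey;
      this.incomingPartition = incomingPartition;
      this.existingPartition = existingPartition;
    }
  }

  static Map.Entry<String, Boolean> entry(String record, boolean flag) {
    return new SimpleImmutableEntry<>(record, flag);
  }

  /** Emits (record, flag) pairs; the flag is true only for partition-update inserts. */
  static Stream<Map.Entry<String, Boolean>> tag(Match m) {
    if (m.existingPartition.equals(m.incomingPartition)) {
      return Stream.of(entry("UPDATE " + m.recordKey + "@" + m.existingPartition, false));
    }
    return Stream.of(
        entry("DELETE " + m.recordKey + "@" + m.existingPartition, false),
        entry("INSERT " + m.recordKey + "@" + m.incomingPartition, true));
  }

  /** Selects the records that a later dedup step would need to look at. */
  static List<String> flaggedForDedup(List<Match> matches) {
    return matches.stream()
        .flatMap(TagWithFlagSketch::tag)
        .filter(e -> e.getValue())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  /** X is updated to p3 while old versions exist in p1 and p2. */
  static List<Match> example() {
    return Arrays.asList(new Match("X", "p3", "p1"), new Match("X", "p3", "p2"));
  }

  public static void main(String[] args) {
    System.out.println(flaggedForDedup(example())); // [INSERT X@p3, INSERT X@p3]
  }
}
```

Carrying the flag next to the record keeps sealing as a purely internal mutability guard, which is the separation the comment asks for.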





[GitHub] [hudi] vinothchandar commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "vinothchandar (via GitHub)" <gi...@apache.org>.
vinothchandar commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1159331827


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
       .defaultValue("true")
       .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");
 
+  public static final ConfigProperty<String> GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment:
   Let's make sure this is tagged as an advanced config, or not exposed to the user by default. Users shouldn't have to tune this.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
       .defaultValue("true")
       .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");
 
+  public static final ConfigProperty<String> GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment:
   Also, calling this `deduping` overloads the meaning a bit: we are not removing the duplicates per se, right? We only ensure the tagging routes it to the right record? Suggest `"hoodie.global.index.reconcile.parallelism"`.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -168,4 +171,36 @@ public static List<String> filterKeysFromFile(Path filePath, List<String> candid
     }
     return foundRecordKeys;
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> dedupForPartitionUpdates(HoodieData<Pair<HoodieRecord<R>, Boolean>> taggedHoodieRecords, int dedupParallelism) {
+    /*
+     * In case a record is updated from p1 to p2 and then to p3, 2 existing records
+     * will be tagged for the incoming record to insert to p3. So we dedup them here. (Set A)
+     */
+    HoodiePairData<String, HoodieRecord<R>> deduped = taggedHoodieRecords.filter(Pair::getRight)
+        .map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getKey, dedupParallelism)
+        .mapToPair(r -> Pair.of(r.getRecordKey(), r));
+
+    /*
+     * This includes
+     *  - tagged existing records whose partition paths are not to be updated (Set B)
+     *  - completely new records (Set C)
+     */
+    HoodieData<HoodieRecord<R>> undeduped = taggedHoodieRecords.filter(p -> !p.getRight()).map(Pair::getLeft);
+
+    /*
+     * There can be intersection between Set A and Set B mentioned above.
+     *
+     * Example: record X is updated from p1 to p2 and then back to p1.
+     * Set A will contain an insert to p1 and Set B will contain an update to p1.
+     *
+     * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
+     */
+    return deduped.leftOuterJoin(undeduped
+            .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))

Review Comment:
   should we instead be applying the payload to the old and new record?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -168,4 +171,36 @@ public static List<String> filterKeysFromFile(Path filePath, List<String> candid
     }
     return foundRecordKeys;
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> dedupForPartitionUpdates(HoodieData<Pair<HoodieRecord<R>, Boolean>> taggedHoodieRecords, int dedupParallelism) {
+    /*
+     * In case a record is updated from p1 to p2 and then to p3, 2 existing records
+     * will be tagged for the incoming record to insert to p3. So we dedup them here. (Set A)
+     */
+    HoodiePairData<String, HoodieRecord<R>> deduped = taggedHoodieRecords.filter(Pair::getRight)
+        .map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getKey, dedupParallelism)
+        .mapToPair(r -> Pair.of(r.getRecordKey(), r));
+
+    /*
+     * This includes
+     *  - tagged existing records whose partition paths are not to be updated (Set B)
+     *  - completely new records (Set C)
+     */
+    HoodieData<HoodieRecord<R>> undeduped = taggedHoodieRecords.filter(p -> !p.getRight()).map(Pair::getLeft);
+
+    /*
+     * There can be intersection between Set A and Set B mentioned above.
+     *
+     * Example: record X is updated from p1 to p2 and then back to p1.
+     * Set A will contain an insert to p1 and Set B will contain an update to p1.
+     *
+     * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
+     */
+    return deduped.leftOuterJoin(undeduped
+            .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))

Review Comment:
   which is kind of the semantics we should be going for?
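
One possible reading of "applying the payload to the old and new record" is that the winner between an existing version and the incoming one is decided by merge semantics rather than by always favoring one side. The sketch below assumes the simplest such rule, where the larger ordering value wins and ties go to the incoming record. `Versioned` and `merge` are hypothetical names, and this is not taken from the PR.

```java
class PayloadMergeSketch {

  /** Hypothetical record version carrying an ordering (precombine-like) value. */
  static final class Versioned {
    final String key;
    final String partition;
    final long orderingValue;

    Versioned(String key, String partition, long orderingValue) {
      this.key = key;
      this.partition = partition;
      this.orderingValue = orderingValue;
    }
  }

  /** Returns the version with the larger ordering value; ties go to the incoming one. */
  static Versioned merge(Versioned existing, Versioned incoming) {
    return existing.orderingValue > incoming.orderingValue ? existing : incoming;
  }

  public static void main(String[] args) {
    Versioned existing = new Versioned("X", "p1", 5L);
    // A newer incoming record wins and moves X to p3.
    System.out.println(merge(existing, new Versioned("X", "p3", 7L)).partition); // p3
    // A late-arriving older record loses, so X stays in p1.
    System.out.println(merge(existing, new Versioned("X", "p3", 3L)).partition); // p1
  }
}
```

Under such a rule, a late-arriving older record would not move the key to a new partition, which the always-insert tagging cannot express.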





[GitHub] [hudi] codope commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156725280


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
       .defaultValue("true")
       .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");
 
+  public static final ConfigProperty<String> GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment:
   ok, let's keep it this way. we can revisit later if necessary.





[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495339170

   ## CI report:
   
   * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104) 
   * 7624300eb0d7205a4924783606226bbdfd49ad5a UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495232941

   ## CI report:
   
   * fa1b1525a163af85271f0dc9e0d5765ea2075044 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058) 
   * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104) 
   


[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495670779

   ## CI report:
   
   * 7624300eb0d7205a4924783606226bbdfd49ad5a Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109) 
   * f021bc3227eea58d049420227564b9c98589534e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16118) 
   


[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1493194480

   ## CI report:
   
   * fa1b1525a163af85271f0dc9e0d5765ea2075044 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058) 
   


[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495227355

   ## CI report:
   
   * fa1b1525a163af85271f0dc9e0d5765ea2075044 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058) 
   * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1493151752

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029",
       "triggerID" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa1b1525a163af85271f0dc9e0d5765ea2075044",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058",
       "triggerID" : "fa1b1525a163af85271f0dc9e0d5765ea2075044",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51b9969e85c03f3f5c8274782d6ac5810c760ab1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029) 
   * fa1b1525a163af85271f0dc9e0d5765ea2075044 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1492687235

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029",
       "triggerID" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51b9969e85c03f3f5c8274782d6ac5810c760ab1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1496243696

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029",
       "triggerID" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa1b1525a163af85271f0dc9e0d5765ea2075044",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058",
       "triggerID" : "fa1b1525a163af85271f0dc9e0d5765ea2075044",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3c004c60160b06b0f4a7a00980c2013cf21af3c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104",
       "triggerID" : "3c004c60160b06b0f4a7a00980c2013cf21af3c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7624300eb0d7205a4924783606226bbdfd49ad5a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109",
       "triggerID" : "7624300eb0d7205a4924783606226bbdfd49ad5a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f021bc3227eea58d049420227564b9c98589534e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16118",
       "triggerID" : "f021bc3227eea58d049420227564b9c98589534e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f021bc3227eea58d049420227564b9c98589534e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16118) 
   




[GitHub] [hudi] nsivabalan commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1496078923

   The issue is that we would have to read the entire logs (including data files), since deletes are realized in different ways, e.g. via the `_hoodie_is_deleted` field. Considering that cost (especially for a global index, where every file group is involved), we decided to go with this approach.
   




[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495343022

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029",
       "triggerID" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa1b1525a163af85271f0dc9e0d5765ea2075044",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058",
       "triggerID" : "fa1b1525a163af85271f0dc9e0d5765ea2075044",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3c004c60160b06b0f4a7a00980c2013cf21af3c3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104",
       "triggerID" : "3c004c60160b06b0f4a7a00980c2013cf21af3c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7624300eb0d7205a4924783606226bbdfd49ad5a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109",
       "triggerID" : "7624300eb0d7205a4924783606226bbdfd49ad5a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104) 
   * 7624300eb0d7205a4924783606226bbdfd49ad5a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495599913

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16029",
       "triggerID" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa1b1525a163af85271f0dc9e0d5765ea2075044",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058",
       "triggerID" : "fa1b1525a163af85271f0dc9e0d5765ea2075044",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3c004c60160b06b0f4a7a00980c2013cf21af3c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104",
       "triggerID" : "3c004c60160b06b0f4a7a00980c2013cf21af3c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7624300eb0d7205a4924783606226bbdfd49ad5a",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109",
       "triggerID" : "7624300eb0d7205a4924783606226bbdfd49ad5a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7624300eb0d7205a4924783606226bbdfd49ad5a Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109) 
   




[GitHub] [hudi] codope commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156167443


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
       .defaultValue("true")
       .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");
 
+  public static final ConfigProperty<String> GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment:
   This and other parallelism configs seem like good candidates for `HoodieInternalConfig`. Users will rarely set this one; their expectation would be that we dedup as fast as we can. It doesn't have to be done in this patch, but I'd like to know your thoughts.



##########
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieSimpleDataGenerator.java:
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.testutils;
+
+import org.apache.hudi.common.model.DefaultHoodieRecordPayload;
+import org.apache.hudi.common.model.HoodieAvroRecord;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class HoodieSimpleDataGenerator {

Review Comment:
   +1





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156745651


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -168,4 +171,36 @@ public static List<String> filterKeysFromFile(Path filePath, List<String> candid
     }
     return foundRecordKeys;
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> dedupForPartitionUpdates(HoodieData<Pair<HoodieRecord<R>, Boolean>> taggedHoodieRecords, int dedupParallelism) {
+    /*
+     * In case a record is updated from p1 to p2 and then to p3, 2 existing records
+     * will be tagged for the incoming record to insert to p3. So we dedup them here. (Set A)
+     */
+    HoodiePairData<String, HoodieRecord<R>> deduped = taggedHoodieRecords.filter(Pair::getRight)
+        .map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getKey, dedupParallelism)
+        .mapToPair(r -> Pair.of(r.getRecordKey(), r));
+
+    /*
+     * This includes
+     *  - tagged existing records whose partition paths are not to be updated (Set B)
+     *  - completely new records (Set C)
+     */
+    HoodieData<HoodieRecord<R>> undeduped = taggedHoodieRecords.filter(p -> !p.getRight()).map(Pair::getLeft);
+
+    /*
+     * There can be intersection between Set A and Set B mentioned above.
+     *
+     * Example: record X is updated from p1 to p2 and then back to p1.
+     * Set A will contain an insert to p1 and Set B will contain an update to p1.
+     *
+     * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
+     */
+    return deduped.leftOuterJoin(undeduped
+            .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))

Review Comment:
   Does it matter whether we favor the insert or the update here?
   If yes, I feel it's better to favor the insert and drop the update, so that we maintain the same behavior across the board: whenever a record migrates from one partition to another, we ignore whatever is in storage and do an insert to the incoming partition. To keep similar semantics, I'm thinking we should favor the insert record over the update.
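
For readers following the Set A / Set B discussion, here is a minimal sketch of the dedup-then-anti-join idea using plain Java collections. It is not Hudi code: `TaggedRecord`, its fields, and `dedup` are hypothetical stand-ins, and the sketch follows the behavior described in the diff comments (keep the update, drop the insert), which is exactly the choice being questioned above. As in the diff, delete records are excluded from the join side so they do not suppress the insert to the new partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PartitionUpdateDedupSketch {

  // Hypothetical stand-in for a tagged HoodieRecord; not a Hudi class.
  static final class TaggedRecord {
    final String recordKey;
    final String partitionPath;
    final boolean migrating; // true: insert created by a partition update (Set A)
    final boolean delete;    // true: delete of the old copy (empty payload in the diff)

    TaggedRecord(String recordKey, String partitionPath, boolean migrating, boolean delete) {
      this.recordKey = recordKey;
      this.partitionPath = partitionPath;
      this.migrating = migrating;
      this.delete = delete;
    }

    @Override
    public String toString() {
      String op = delete ? "DELETE " : (migrating ? "INSERT " : "WRITE ");
      return op + recordKey + "@" + partitionPath;
    }
  }

  static List<TaggedRecord> dedup(List<TaggedRecord> tagged) {
    // Set A: migrating inserts, kept once per (record key, partition path).
    Map<String, TaggedRecord> setA = new LinkedHashMap<>();
    // Sets B and C: every record that is not a migrating insert.
    List<TaggedRecord> rest = new ArrayList<>();
    // Record keys of non-delete records in B and C; used for the anti join.
    Set<String> liveKeys = new HashSet<>();
    for (TaggedRecord r : tagged) {
      if (r.migrating) {
        setA.putIfAbsent(r.recordKey + "@" + r.partitionPath, r);
      } else {
        rest.add(r);
        if (!r.delete) {
          liveKeys.add(r.recordKey);
        }
      }
    }
    List<TaggedRecord> result = new ArrayList<>(rest);
    // Anti join on record key: drop a Set A insert when B already holds an update for it.
    for (TaggedRecord a : setA.values()) {
      if (!liveKeys.contains(a.recordKey)) {
        result.add(a);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // k1 lived in p1, then p2, and now arrives for p3: two deletes and two duplicate inserts.
    List<TaggedRecord> tagged = Arrays.asList(
        new TaggedRecord("k1", "p1", false, true),
        new TaggedRecord("k1", "p3", true, false),
        new TaggedRecord("k1", "p2", false, true),
        new TaggedRecord("k1", "p3", true, false));
    System.out.println(dedup(tagged));
  }
}
```

Favoring the insert instead, as suggested in this comment, would mean flipping the last loop: keep every Set A insert and drop the non-delete records in the rest that share its record key.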





[GitHub] [hudi] xushiyan closed pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan closed pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
URL: https://github.com/apache/hudi/pull/8344




[GitHub] [hudi] xushiyan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156417851


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
       .defaultValue("true")
       .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");
 
+  public static final ConfigProperty<String> GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment:
   Not very clear at the moment, given that this is still tunable depending on the data's update ratio. It may stay as an infrequently used one, like `hoodie.markers.delete.parallelism`.





[GitHub] [hudi] xushiyan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156432021


##########
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieSimpleDataGenerator.java:
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.testutils;
+
+import org.apache.hudi.common.model.DefaultHoodieRecordPayload;
+import org.apache.hudi.common.model.HoodieAvroRecord;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class HoodieSimpleDataGenerator {

Review Comment:
   `HoodieTestDataGenerator` actually needs an overhaul, as its APIs have become disorganized over the years and hard to use. More importantly, randomness is a major cause of flakiness, and we need a deterministic data gen more than a random one for UT/FT scenarios. I can revert this to use the existing data gen class and let the future overhaul work cover adoption of the new class.
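
To illustrate what "deterministic data gen" could mean here, below is a small sketch in which every field is derived from the record's index and the caller's arguments, so repeated runs produce identical data. The class name, the method, and the comma-separated row format are all assumptions made for illustration; this is not the `HoodieSimpleDataGenerator` from the PR.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class DeterministicDataGenSketch {

  // Produces n rows of the form "key,partition,timestamp" with no randomness involved.
  static List<String> generate(int n, String partitionPath, long ts) {
    return IntStream.range(0, n)
        .mapToObj(i -> String.format(Locale.ROOT, "key-%03d,%s,%d", i, partitionPath, ts))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // The same arguments always yield the same rows, which keeps test assertions stable.
    System.out.println(generate(3, "p1", 1L));
  }
}
```

Because keys are predictable, a test can update a specific record (say `key-001`) to a new partition and assert on it directly, instead of sampling from randomly generated keys.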





[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1492652440

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "51b9969e85c03f3f5c8274782d6ac5810c760ab1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 51b9969e85c03f3f5c8274782d6ac5810c760ab1 UNKNOWN
   




[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1155054255


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/simple/HoodieGlobalSimpleIndex.java:
##########
@@ -135,8 +135,8 @@ private <R> HoodieData<HoodieRecord<R>> getTaggedRecords(
               HoodieRecord<R> deleteRecord = new HoodieAvroRecord(new HoodieKey(inputRecord.getRecordKey(), partitionPath), new EmptyHoodieRecordPayload());
               deleteRecord.setCurrentLocation(location);
               deleteRecord.seal();
-              // Tag the incoming record for inserting to the new partition
-              HoodieRecord<R> insertRecord = (HoodieRecord<R>) HoodieIndexUtils.getTaggedRecord(inputRecord, Option.empty());
+              // Tag the incoming record for inserting to the new partition; left unsealed for marking as dedup later
+              HoodieRecord<R> insertRecord = (HoodieRecord<R>) HoodieIndexUtils.getUnsealedTaggedRecord(inputRecord, Option.empty());

Review Comment:
   I feel we are retrofitting the sealing property to meet our goals. We should instead map each record to a pair (record, isRecordMigrating (boolean)) within flatMap and use that flag rather than the seal. I don't want the sealing property to be used for external filtering purposes.
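
A rough sketch of the suggested shape, using `java.util.stream` rather than Hudi's `HoodieData`: the tagging step emits (operation, isMigrating) pairs from a flatMap, and later stages filter on the flag instead of inspecting whether a record is sealed. Operations are plain strings here, and `tag` and `flagged` are hypothetical helpers, not Hudi APIs.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class MigrationFlagSketch {

  static Map.Entry<String, Boolean> flagged(String operation, boolean migrating) {
    return new SimpleImmutableEntry<>(operation, migrating);
  }

  // Tags one incoming record against its existing partition (null when the key is new).
  static Stream<Map.Entry<String, Boolean>> tag(String key, String incomingPartition, String existingPartition) {
    if (existingPartition == null || existingPartition.equals(incomingPartition)) {
      // New record or same-partition update: a single operation, not migrating.
      return Stream.of(flagged("WRITE " + key + "@" + incomingPartition, false));
    }
    // Partition changed: delete the old copy and flag the insert as migrating.
    return Stream.of(
        flagged("DELETE " + key + "@" + existingPartition, false),
        flagged("INSERT " + key + "@" + incomingPartition, true));
  }

  public static void main(String[] args) {
    // Each row is {record key, incoming partition, existing partition or null}.
    List<Map.Entry<String, Boolean>> tagged = Stream.of(
            new String[] {"k1", "p2", "p1"},
            new String[] {"k2", "p1", "p1"},
            new String[] {"k3", "p1", null})
        .flatMap(t -> tag(t[0], t[1], t[2]))
        .collect(Collectors.toList());

    // Downstream stages select on the explicit flag.
    List<String> migrating = tagged.stream()
        .filter(Map.Entry::getValue)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
    System.out.println(migrating);
  }
}
```

Carrying the flag alongside the record keeps sealing as a purely internal mutability guard, which is the separation of concerns this comment asks for.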





[GitHub] [hudi] xushiyan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1155185037


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/simple/HoodieGlobalSimpleIndex.java:
##########
@@ -135,8 +135,8 @@ private <R> HoodieData<HoodieRecord<R>> getTaggedRecords(
               HoodieRecord<R> deleteRecord = new HoodieAvroRecord(new HoodieKey(inputRecord.getRecordKey(), partitionPath), new EmptyHoodieRecordPayload());
               deleteRecord.setCurrentLocation(location);
               deleteRecord.seal();
-              // Tag the incoming record for inserting to the new partition
-              HoodieRecord<R> insertRecord = (HoodieRecord<R>) HoodieIndexUtils.getTaggedRecord(inputRecord, Option.empty());
+              // Tag the incoming record for inserting to the new partition; left unsealed for marking as dedup later
+              HoodieRecord<R> insertRecord = (HoodieRecord<R>) HoodieIndexUtils.getUnsealedTaggedRecord(inputRecord, Option.empty());

Review Comment:
   Fair enough. Updated.


