Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/04 13:42:22 UTC

[GitHub] [hudi] codope opened a new pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

codope opened a new pull request #4118:
URL: https://github.com/apache/hudi/pull/4118


   ## What is the purpose of the pull request
   
   With fast ingestion through deltastreamer (no min sync interval) and frequent async clustering enabled, two clustering plans can end up creating the same replacecommit metadata. This PR avoids the duplication by merging them.
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   Manually verified by running deltastreamer for 30 commits multiple times. Could not reproduce with this patch.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979003740


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3746) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4118: [HUDI-2774] Handle duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-1004925986


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 9e8fcab139fb1fdd9dd2d109b0e73d3e004360c2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4883) 
   



[GitHub] [hudi] hudi-bot commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-978996343


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   



[GitHub] [hudi] codope commented on a change in pull request #4118: [HUDI-2774] Handle duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#discussion_r836790799



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java
##########
@@ -124,7 +125,16 @@
         // get all filegroups in the plan
         getFileGroupEntriesInClusteringPlan(clusteringPlan.getLeft(), clusteringPlan.getRight()));
 
-    Map<HoodieFileGroupId, HoodieInstant> resultMap = resultStream.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+    Map<HoodieFileGroupId, HoodieInstant> resultMap;
+    try {
+      resultMap = resultStream.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+    } catch (Exception e) {
+      if (e instanceof IllegalStateException && e.getMessage().contains("Duplicate key")) {
+        throw new HoodieException("Found duplicate file groups pending clustering. If you're running deltastreamer in continuous mode, consider adding delay using --min-sync-interval-seconds. "

Review comment:
       With OCC mode and the in-process lock provider, we should not hit this exception.
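
A minimal sketch of the writer settings this comment refers to, written as a plain `java.util.Properties` builder. The keys and the lock provider class name are assumptions taken from memory of the Hudi concurrency control documentation, not from this PR, and should be checked against the Hudi version in use. The class `OccWriterProps` and its method are hypothetical names used only for illustration.

```java
import java.util.Properties;

public class OccWriterProps {

  // Builds writer properties for optimistic concurrency control (OCC)
  // with an in-process lock provider.
  // ASSUMPTION: key names and the provider class are recalled from the
  // Hudi docs and are not confirmed by this thread.
  static Properties occWithInProcessLock() {
    Properties props = new Properties();
    props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
    props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
    props.setProperty("hoodie.write.lock.provider",
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider");
    return props;
  }

  public static void main(String[] args) {
    // Print the properties so they can be copied into a writer config file.
    occWithInProcessLock().forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

An in-process lock only coordinates writers and table services running inside the same JVM, which matches the single deltastreamer job with async clustering discussed here.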







[GitHub] [hudi] codope commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
codope commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979311580


   > possible to add UT for this.
   
   This is actually not reproducible deterministically. I could not think of a way to simulate this in a UT. Any suggestions?





[GitHub] [hudi] hudi-bot removed a comment on pull request #4118: [HUDI-2774] Handle duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-1004873002


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3746) 
   * 9e8fcab139fb1fdd9dd2d109b0e73d3e004360c2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4883) 
   



[GitHub] [hudi] nsivabalan merged pull request #4118: [HUDI-2774] Handle duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #4118:
URL: https://github.com/apache/hudi/pull/4118


   





[GitHub] [hudi] hudi-bot removed a comment on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-978996343


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   



[GitHub] [hudi] nsivabalan commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979275486


   possible to add UT for this.





[GitHub] [hudi] codope closed pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
codope closed pull request #4118:
URL: https://github.com/apache/hudi/pull/4118


   





[GitHub] [hudi] hudi-bot commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979003740


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3746) 
   



[GitHub] [hudi] hudi-bot commented on pull request #4118: [HUDI-2774] Handle duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-1004873002


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3746) 
   * 9e8fcab139fb1fdd9dd2d109b0e73d3e004360c2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4883) 
   



[GitHub] [hudi] hudi-bot commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-1004825337


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3746) 
   * 9e8fcab139fb1fdd9dd2d109b0e73d3e004360c2 UNKNOWN
   



[GitHub] [hudi] codope commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
codope commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979383743


   Will close this PR. After discussing with @nsivabalan offline and confirming from the code: this change avoids the duplicate key issue, but it creates duplicate data with different file ids, which adversely affects data correctness.
   
   This scenario happens only when there is no data, or so little data that deltastreamer finishes one round very quickly, even before clustering, and there is no min sync interval between rounds. I think it's okay to fail the clustering due to the duplicate key in this scenario. As a workaround, users could set OCC mode or add a delay between rounds of delta sync.





[GitHub] [hudi] hudi-bot commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979000291


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 UNKNOWN
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979052450


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3746) 
   



[GitHub] [hudi] codope commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
codope commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-1004824900


   @nsivabalan Reopened this PR and handled the duplicate instants as we discussed.





[GitHub] [hudi] codope commented on a change in pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#discussion_r756992163



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java
##########
@@ -124,7 +124,7 @@
         // get all filegroups in the plan
         getFileGroupEntriesInClusteringPlan(clusteringPlan.getLeft(), clusteringPlan.getRight()));
 
-    Map<HoodieFileGroupId, HoodieInstant> resultMap = resultStream.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+    Map<HoodieFileGroupId, HoodieInstant> resultMap = resultStream.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (i1, i2) -> i1));

Review comment:
       Yeah, that's right. I thought about repetitive scheduling, but such race scenarios are hard to avoid. One way, as you suggested in the ticket, is to add some default `--min-sync-interval`, but I felt that might give users the impression that the job itself is taking longer than usual.
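
To make the one-line change in the hunk above concrete, here is a small self-contained sketch of how `Collectors.toMap` behaves with and without a merge function. Plain strings stand in for `HoodieFileGroupId` and `HoodieInstant`, and the class and method names are hypothetical. It only illustrates the collector semantics and says nothing about how Hudi schedules clustering.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ToMapDuplicateKeyDemo {

  // Hypothetical input: the same file group appears under two pending instants.
  static List<Map.Entry<String, String>> pendingEntries() {
    List<Map.Entry<String, String>> entries = new ArrayList<>();
    entries.add(new SimpleEntry<>("fileGroup-1", "instant-001"));
    entries.add(new SimpleEntry<>("fileGroup-2", "instant-001"));
    entries.add(new SimpleEntry<>("fileGroup-1", "instant-002"));
    return entries;
  }

  // Two-argument toMap: throws IllegalStateException on a repeated key.
  static Map<String, String> collectStrict(List<Map.Entry<String, String>> entries) {
    return entries.stream()
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  // Three-argument toMap: the merge function keeps the first value seen.
  static Map<String, String> collectKeepFirst(List<Map.Entry<String, String>> entries) {
    return entries.stream()
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (i1, i2) -> i1));
  }

  public static void main(String[] args) {
    try {
      collectStrict(pendingEntries());
    } catch (IllegalStateException e) {
      System.out.println("strict collect failed: " + e.getMessage());
    }
    System.out.println("keep-first result: " + collectKeepFirst(pendingEntries()));
  }
}
```

The merge function silences the exception but silently drops the second instant for the shared file group, which is why the thread treats it as hiding the scheduling race rather than fixing it.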







[GitHub] [hudi] vinothchandar commented on a change in pull request #4118: [HUDI-2774] Handle duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#discussion_r831038701



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java
##########
@@ -124,7 +125,16 @@
         // get all filegroups in the plan
         getFileGroupEntriesInClusteringPlan(clusteringPlan.getLeft(), clusteringPlan.getRight()));
 
-    Map<HoodieFileGroupId, HoodieInstant> resultMap = resultStream.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+    Map<HoodieFileGroupId, HoodieInstant> resultMap;
+    try {
+      resultMap = resultStream.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+    } catch (Exception e) {
+      if (e instanceof IllegalStateException && e.getMessage().contains("Duplicate key")) {
+        throw new HoodieException("Found duplicate file groups pending clustering. If you're running deltastreamer in continuous mode, consider adding delay using --min-sync-interval-seconds. "

Review comment:
       Would the in-process lock provider help here, since we would then allow only one clustering to be scheduled at a time?
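
The fail-with-a-hint approach in the hunk above can be sketched in isolation as follows. This is an illustration under stated assumptions, not the merged code: `PendingClusteringException` is a hypothetical stand-in for `HoodieException`, strings stand in for the Hudi types, and the error text reuses only the opening words visible in the truncated diff line.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateFileGroupCheck {

  // Hypothetical stand-in for HoodieException.
  static class PendingClusteringException extends RuntimeException {
    PendingClusteringException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Collects entries into a map and converts the collector's duplicate-key
  // failure into an error that tells the operator what went wrong.
  static Map<String, String> collectOrExplain(List<Map.Entry<String, String>> entries) {
    try {
      return entries.stream()
          .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    } catch (IllegalStateException e) {
      String msg = e.getMessage();
      if (msg != null && msg.contains("Duplicate key")) {
        throw new PendingClusteringException(
            "Found duplicate file groups pending clustering. " + msg, e);
      }
      throw e; // unrelated IllegalStateException: rethrow unchanged
    }
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> entries = new ArrayList<>();
    entries.add(new SimpleEntry<>("fileGroup-1", "instant-001"));
    entries.add(new SimpleEntry<>("fileGroup-1", "instant-002"));
    try {
      collectOrExplain(entries);
    } catch (PendingClusteringException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Catching `IllegalStateException` directly, rather than `Exception` followed by an `instanceof` check as in the diff, narrows the handler to the one failure of interest; the message check is kept because it mirrors the diff.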







[GitHub] [hudi] hudi-bot removed a comment on pull request #4118: [HUDI-2774] Handle duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-1004825337


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3746) 
   * 9e8fcab139fb1fdd9dd2d109b0e73d3e004360c2 UNKNOWN
   



[GitHub] [hudi] hudi-bot commented on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979052450


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3746) 
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979000291


   ## CI report:
   
   * 3348904bd9d9a0c4469d3bf3c2e2cc47e995ea06 UNKNOWN
   * 971e52c6cc761bb9b68da6cfa3ba0935db195266 UNKNOWN
   



[GitHub] [hudi] nsivabalan commented on a change in pull request #4118: [HUDI-2774] Merge duplicate instants while fetching pending clustering plans

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#discussion_r756951070



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java
##########
@@ -124,7 +124,7 @@
         // get all filegroups in the plan
         getFileGroupEntriesInClusteringPlan(clusteringPlan.getLeft(), clusteringPlan.getRight()));
 
-    Map<HoodieFileGroupId, HoodieInstant> resultMap = resultStream.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+    Map<HoodieFileGroupId, HoodieInstant> resultMap = resultStream.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (i1, i2) -> i1));

Review comment:
       So if I understand correctly, this will fix the duplicate key exception, but the repetitive scheduling is not yet fixed in this patch, right?



