You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "SteNicholas (via GitHub)" <gi...@apache.org> on 2023/02/08 06:21:29 UTC

[GitHub] [hudi] SteNicholas opened a new pull request, #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

SteNicholas opened a new pull request, #7891:
URL: https://github.com/apache/hudi/pull/7891

   When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline to check whether the file slice generated in pending clustering after archive isn't committed via `HoodieFileGroup#isFileSliceCommitted(slice)`. Therefore `HoodieTimelineArchiver` archive the latest instant before inflight replacecommit.
   
   ### Change Logs
   
   `HoodieTimelineArchiver` archives the latest instant before inflight replacecommit.
   
   ### Impact
   
   The archive behavior of `HoodieTimelineArchiver` when there is the inflight replacecommit.
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1449498190

   @bvaradar, I have addressed above comments and tests the `testPendingClusteringAfterArchiveCommit` successfully. PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "zhuanshenbsj1 (via GitHub)" <gi...@apache.org>.
zhuanshenbsj1 commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1103965694


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(
+            instantsToArchive.stream()
+                .filter(s -> HoodieTimeline.compareTimestamps(
+                    s.getTimestamp(),
+                    LESSER_THAN,
+                    oldestPendingCompactionAndReplaceInstant.get().getTimestamp()))
+                .reduce((i1, i2) -> i2));
+        if (latestInstantRetainForReplace.isPresent()) {
+          LOG.info(String.format(
+              "Retaining the archived instant %s before the inflight replacecommit %s.",
+              latestInstantRetainForReplace.get().getTimestamp(),
+              oldestPendingCompactionAndReplaceInstant.get().getTimestamp()));
+        }
+        instantToArchiveStream = instantsToArchive.stream()
+            .filter(s -> latestInstantRetainForReplace.map(instant -> s.compareTo(instant) != 0)
+                .orElse(true));
+      }
+

Review Comment:
   getOldestInstantToRetainForClustering(){
    1.get the first unclean clustering instant
    2.get the previous commit of last inflight clustering instant
    3.compare 1&2, return the earliest
   }



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1423362343

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b6cf690564944cfeacf6d2e29e029f86fddec51 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1530796823

   I also did not quite understand the what the PR actually solves. 
   from what I glean, its 
   ```
   This fix is to prevent a the fileGroup that may have been created by a failed clustering from being included in the second replaceCommit. 
   
   Actual fix:
   If there is a pending clustering, archival will stop just 1 completed commit just before the earliest clustering instant.
   ```
   
   Is this a good summary of the fix ?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] bvaradar commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1450455766

   @SteNicholas : Code change looks good. Can you point me to the code where latest commit before inflight replace commit is used during clustering/write operation ? I am trying to understand if this is Flink specific issue or a general one. 
   
   Thanks,
   Balaji.V


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1448383371

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483",
       "triggerID" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b6cf690564944cfeacf6d2e29e029f86fddec51 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032) 
   * 17a200739d9723ef35617d8ae9dd704abb6cc916 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1422078367

   @zhangyue19921010, @yihua,  PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1121281585


##########
hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java:
##########
@@ -250,15 +253,24 @@ public static Option<HoodieInstant> getOldestInstantToRetainForClustering(
                         ? cleanInstant
                         : HoodieTimeline.getCleanRequestedInstant(cleanInstant.getTimestamp()))
                 .getEarliestInstantToRetain().getTimestamp();
-        return StringUtils.isNullOrEmpty(earliestCommitToRetain)
-            ? Option.empty()
-            : replaceTimeline.filter(instant ->
-                HoodieTimeline.compareTimestamps(instant.getTimestamp(),
-                    HoodieTimeline.GREATER_THAN_OR_EQUALS,
-                    earliestCommitToRetain))
-            .firstInstant();
+        if (!StringUtils.isNullOrEmpty(earliestCommitToRetain)) {
+          oldestInstantToRetain = replaceTimeline.filterCompletedInstants()
+              .filter(instant -> HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestCommitToRetain))
+              .firstInstant();
+        }
+      }
+      Option<HoodieInstant> pendingInstantOpt = replaceTimeline.filterInflights().firstInstant();
+      if (pendingInstantOpt.isPresent()) {
+        // Get the previous commit before the first inflight clustering instant.
+        Option<HoodieInstant> beforePendingInstant = activeTimeline.filterCompletedInstants()

Review Comment:
   @bvaradar, right. I will update to `activeTImeline.getCommitsTimeline().filterCompletedInstants()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1449522142

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483",
       "triggerID" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e1140b89f92cfa73ad9a11ae0c83364a5018902b",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15510",
       "triggerID" : "e1140b89f92cfa73ad9a11ae0c83364a5018902b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17a200739d9723ef35617d8ae9dd704abb6cc916 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483) 
   * e1140b89f92cfa73ad9a11ae0c83364a5018902b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15510) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "zhangyue19921010 (via GitHub)" <gi...@apache.org>.
zhangyue19921010 commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1102465811


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(

Review Comment:
   Could we just use ` commitTimeline.findInstantsBefore()` to get the successful commit just before oldest pending clustering commit time.
   
   After we get this commit we can filter it out directly in `instantToArchiveStream`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1422082107

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b6cf690564944cfeacf6d2e29e029f86fddec51 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "zhuanshenbsj1 (via GitHub)" <gi...@apache.org>.
zhuanshenbsj1 commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1103965694


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(
+            instantsToArchive.stream()
+                .filter(s -> HoodieTimeline.compareTimestamps(
+                    s.getTimestamp(),
+                    LESSER_THAN,
+                    oldestPendingCompactionAndReplaceInstant.get().getTimestamp()))
+                .reduce((i1, i2) -> i2));
+        if (latestInstantRetainForReplace.isPresent()) {
+          LOG.info(String.format(
+              "Retaining the archived instant %s before the inflight replacecommit %s.",
+              latestInstantRetainForReplace.get().getTimestamp(),
+              oldestPendingCompactionAndReplaceInstant.get().getTimestamp()));
+        }
+        instantToArchiveStream = instantsToArchive.stream()
+            .filter(s -> latestInstantRetainForReplace.map(instant -> s.compareTo(instant) != 0)
+                .orElse(true));
+      }
+

Review Comment:
   getOldestInstantToRetainForClustering(){
    1.get the first unclean clustering instant
    2.get the previous commit of oldest inflight clustering instant
    3.compare 1&2, return the earliest
   }



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1449505605

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483",
       "triggerID" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e1140b89f92cfa73ad9a11ae0c83364a5018902b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e1140b89f92cfa73ad9a11ae0c83364a5018902b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17a200739d9723ef35617d8ae9dd704abb6cc916 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483) 
   * e1140b89f92cfa73ad9a11ae0c83364a5018902b UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1451270751

   @bvaradar, you could follow the `ClusteringPlanStrategy#getFileSlicesEligibleForClustering` that invokes the `TableFileSystemView.SliceView#getLatestFileSlices`  to get the latest file slices in the given partition. `getLatestFileSlices` method finally invokes `HoodieFileGroup#getAllFileSlices` that has the filter condition `HoodieFileGroup#isFileSliceCommitted`, which causes the problem that the description of PR mentioned.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1530799867

   @nsivabalan If i understand the PR correctly, your summary accurately describes the fix and the issue! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] bvaradar commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1452976927

   @SteNicholas  Makes sense. Thanks for the patience. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1422110761

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b6cf690564944cfeacf6d2e29e029f86fddec51 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1450456075

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483",
       "triggerID" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e1140b89f92cfa73ad9a11ae0c83364a5018902b",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15510",
       "triggerID" : "e1140b89f92cfa73ad9a11ae0c83364a5018902b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e1140b89f92cfa73ad9a11ae0c83364a5018902b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15510) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1457575576

   Apologies for disturbing everyone here. I've tried reaching out to the author on Slack, but failed to get an explanation, so i will post my questions here.
   
   # 1. Rational behind the fix?
   From what @SteNicholas shared, this bug fixes a data-duplication issue after clustering? Since this is not documented anywhere in the PR or the JR ticket, I think it is best to mention it in the PR itself for future references.
   
   # 2. Issues with integration test?
   I apologies again as I don't quite understand the reason behind this fix. Hence, while trying to reproduce this issue, I copied the integration test (`testHoodieFlinkClusteringScheduleAfterArchive`) onto a Hudi version's `ITTestHoodieFlinkClustering` that doesn't have this fix yet expecting the IT to fail. But it still succeeds without the said fix. Hence, the test might not be written properly, or the fix might not be valid to begin with as it passes under ANY conditions. 
   
   I am copying the fix onto 6b178ce978ce324361cf708ce711cd0c46452dd6, this is the commit prior to the fix being merged into master.
   
   I repeated the test 100 times and it's passing for all 100 times on a branch that's without the fix.
   
   ![image](https://user-images.githubusercontent.com/6312314/223329883-05e6a499-72f7-4950-b723-db6468a5e68e.png)
   
   
   # 3. Questions regarding the fix
   
   
   Can someone please help to address the concerns above, mainly (2) and (3).
   > i.e, timeline: deltacommit1, deltacommit2, replacecommit1.inflight, deltacommit3, deltacommit4', at this time, if deltacommit1, deltacommit2 are archived, the containsOrBeforeTimelineStarts returns true for replacecommit1.inflight because isBeforeTimelineStarts returns true. Then, the file slices of the pending clustering instant would be included in another clustering plan.
   
   IIUC, the (input) file slices of the pending clustering instant will be excluded in the second clustering plan as clustering plan generator excludes all file slices in pending:
   
   1. logCompaction
   2. compaction
   3. clustering
   
   Code handling this in master (org.apache.hudi.table.action.cluster.strategy.ClusteringPlanStrategy#getFileSlicesEligibleForClustering) :
   
   ```java
   /**
   * Return file slices eligible for clustering. FileIds in pending clustering/compaction are not eligible for clustering.
   */
   protected Stream<FileSlice> getFileSlicesEligibleForClustering(String partition) {
   SyncableFileSystemView fileSystemView = (SyncableFileSystemView) getHoodieTable().getSliceView();
   Set<HoodieFileGroupId> fgIdsInPendingCompactionLogCompactionAndClustering =
       Stream.concat(fileSystemView.getPendingCompactionOperations(), fileSystemView.getPendingLogCompactionOperations())
           .map(instantTimeOpPair -> instantTimeOpPair.getValue().getFileGroupId())
           .collect(Collectors.toSet());
   fgIdsInPendingCompactionLogCompactionAndClustering.addAll(fileSystemView.getFileGroupsInPendingClustering().map(Pair::getKey).collect(Collectors.toSet()));
   
   return hoodieTable.getSliceView().getLatestFileSlices(partition)
       // file ids already in clustering are not eligible
       .filter(slice -> !fgIdsInPendingCompactionLogCompactionAndClustering.contains(slice.getFileGroupId()));
   }
   ```
   
   > When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline to check whether the file slice generated in pending clustering after archive isn't committed via HoodieFileGroup#isFileSliceCommitted(slice).
   
   IIUC, for a `fileSlice` to be a candidate for clustering inputFileGroup, it needs to be committed no? So, even though the the `replaceCommit` is the first instant on the timeline, the `inputFileGroups` should be tagged as HoodieFileGroup#isFileSliceCommitted(slice) = true.  
   
   # 4. My understanding of the fix
   
   From what i understand, (please correct me if i am wrong) this fix is trying to address an issue where the same fileslice is included in 2 separate clusteringPlan/replaceCommit's `inputFileGroup`. After clustering, the same row will exist in 2 different fileGroups as clustering will create new fileGroups for clustered data.
   
   Reference: A replaceCommit does the following:
   ```
   replaceCommit1 = input{FG_0, FG_1, FG_2} -> output{FG_3}
   replaceCommit2 = input{FG_0, FG_1, FG_2} -> output{FG_4}
   ```
   
   If this is occurring, this seems like an issue with how Hudi is fetching `pendingClusteringPlans` no?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1457829506

   Thank you @zhangyue19921010 for assisting. Figured the rational behind the fix.
   
   This fix is to prevent a the fileGroup that may have been created by a failed clustering from being included in the second replaceCommit.
   
   #Example:
   
   ```
   c0 {write FG_0, FG_1}
   r1 {cluster FG_0, FG_1}
   c2 {write FG_2, FG_3} 
   c3 {write FG_3, FG_4} 
   c4 {write FG_5, FG_6} 
   ```
   
   After archive:
   ```
   r1 {cluster FG_0, FG_1}
   c2 {write FG_2, FG_3} 
   c3 {write FG_3, FG_4} 
   c4 {write FG_5, FG_6} 
   ```
   
   r1 fails after writing some files:
   ```
   r1 {cluster FG_0, FG_1 -> FG_01_r1}
   c2 {write FG_2, FG_3} 
   c3 {write FG_3, FG_4} 
   c4 {write FG_5, FG_6} 
   ```
   
   r5 is created:
   
   ```
   r1 {cluster FG_0, FG_1 -> FG_01_r1}
   c2 {write FG_2, FG_3} 
   c3 {write FG_3, FG_4} 
   c4 {write FG_5, FG_6} 
   r5 {FG_2..FG_6, FG_01_r1}
   ```
   
   `FG_01_r1` should not be included in into r5. However, `FG_01_r1` instant time is r1, which is before c2, it will be deemed as a committed file slice. 
   
   On top of that, i realised that the the function `FSUtils.getCommitTime` is returning the writeToken instead of the instant time, I will raise another PR to fix the test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "zhangyue19921010 (via GitHub)" <gi...@apache.org>.
zhangyue19921010 commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1102465811


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(

Review Comment:
   Could we just use ` commitTimeline.findInstantsBefore()` API simply to get the successful commit just before oldest pending clustering commit time.
   
   After we get this commit we can filter it out directly in `instantToArchiveStream`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] bvaradar commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1121217637


##########
hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java:
##########
@@ -250,15 +253,24 @@ public static Option<HoodieInstant> getOldestInstantToRetainForClustering(
                         ? cleanInstant
                         : HoodieTimeline.getCleanRequestedInstant(cleanInstant.getTimestamp()))
                 .getEarliestInstantToRetain().getTimestamp();
-        return StringUtils.isNullOrEmpty(earliestCommitToRetain)
-            ? Option.empty()
-            : replaceTimeline.filter(instant ->
-                HoodieTimeline.compareTimestamps(instant.getTimestamp(),
-                    HoodieTimeline.GREATER_THAN_OR_EQUALS,
-                    earliestCommitToRetain))
-            .firstInstant();
+        if (!StringUtils.isNullOrEmpty(earliestCommitToRetain)) {
+          oldestInstantToRetain = replaceTimeline.filterCompletedInstants()
+              .filter(instant -> HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestCommitToRetain))
+              .firstInstant();
+        }
+      }
+      Option<HoodieInstant> pendingInstantOpt = replaceTimeline.filterInflights().firstInstant();
+      if (pendingInstantOpt.isPresent()) {
+        // Get the previous commit before the first inflight clustering instant.
+        Option<HoodieInstant> beforePendingInstant = activeTimeline.filterCompletedInstants()

Review Comment:
   Shouldn't this be activeTImeline.getCommitsTimeline().filterCompletedInstants() to only include DeltaCommit, Commit and Replace Commit ?



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java:
##########
@@ -1400,6 +1400,28 @@ public void testArchivalAndCompactionInMetadataTable() throws Exception {
     }
   }
 
+  @ParameterizedTest
+  @ValueSource(booleans = {true, false})
+  public void testPendingClusteringAfterArchiveCommit(boolean enableMetadata) throws Exception {
+    HoodieWriteConfig writeConfig = initTestTableAndGetWriteConfig(enableMetadata, 2, 5, 2);
+    // timeline:0000000(completed)->00000001(completed)->00000002(replace&inflight)->00000003(completed)->...->00000007(completed)
+    HoodieTestDataGenerator.createPendingReplaceFile(basePath, "00000002", wrapperFs.getConf());
+    for (int i = 1; i < 8; i++) {
+      if (i != 2) {
+        testTable.doWriteOperation("0000000" + i, WriteOperationType.UPSERT, Arrays.asList("p1", "p2"), Arrays.asList("p1", "p2"), 2);

Review Comment:
   Minor Comment: Should the WriteOperationType be one of cluster, insert_overwrite_table, insert_overwrite or delete_partition to make sense ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "zhangyue19921010 (via GitHub)" <gi...@apache.org>.
zhangyue19921010 commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1102465811


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(

Review Comment:
   Could we just use ` commitTimeline.findInstantsBefore()` to get the commit just before oldest pending clustering commit time.
   
   After we get this commit we can filter it out directly in `instantToArchiveStream`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "zhuanshenbsj1 (via GitHub)" <gi...@apache.org>.
zhuanshenbsj1 commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1102382018


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(
+            instantsToArchive.stream()
+                .filter(s -> HoodieTimeline.compareTimestamps(
+                    s.getTimestamp(),
+                    LESSER_THAN,
+                    oldestPendingCompactionAndReplaceInstant.get().getTimestamp()))
+                .reduce((i1, i2) -> i2));
+        if (latestInstantRetainForReplace.isPresent()) {
+          LOG.info(String.format(
+              "Retaining the archived instant %s before the inflight replacecommit %s.",
+              latestInstantRetainForReplace.get().getTimestamp(),
+              oldestPendingCompactionAndReplaceInstant.get().getTimestamp()));
+        }
+        instantToArchiveStream = instantsToArchive.stream()
+            .filter(s -> latestInstantRetainForReplace.map(instant -> s.compareTo(instant) != 0)
+                .orElse(true));
+      }
+

Review Comment:
   How about move these changes to function getOldestInstantToRetainForClustering, It looks a little messy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1119934910


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(
+            instantsToArchive.stream()
+                .filter(s -> HoodieTimeline.compareTimestamps(
+                    s.getTimestamp(),
+                    LESSER_THAN,
+                    oldestPendingCompactionAndReplaceInstant.get().getTimestamp()))
+                .reduce((i1, i2) -> i2));
+        if (latestInstantRetainForReplace.isPresent()) {
+          LOG.info(String.format(
+              "Retaining the archived instant %s before the inflight replacecommit %s.",
+              latestInstantRetainForReplace.get().getTimestamp(),
+              oldestPendingCompactionAndReplaceInstant.get().getTimestamp()));
+        }
+        instantToArchiveStream = instantsToArchive.stream()
+            .filter(s -> latestInstantRetainForReplace.map(instant -> s.compareTo(instant) != 0)
+                .orElse(true));
+      }
+

Review Comment:
   @zhuanshenbsj1, I don't think that the changes could move into `ClusteringUtils#getOldestInstantToRetainForClustering`. Because the `instantsToArchive` only retains one commit before the oldest clustering instant, and other commits could be achived instead of retained.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1119934910


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(
+            instantsToArchive.stream()
+                .filter(s -> HoodieTimeline.compareTimestamps(
+                    s.getTimestamp(),
+                    LESSER_THAN,
+                    oldestPendingCompactionAndReplaceInstant.get().getTimestamp()))
+                .reduce((i1, i2) -> i2));
+        if (latestInstantRetainForReplace.isPresent()) {
+          LOG.info(String.format(
+              "Retaining the archived instant %s before the inflight replacecommit %s.",
+              latestInstantRetainForReplace.get().getTimestamp(),
+              oldestPendingCompactionAndReplaceInstant.get().getTimestamp()));
+        }
+        instantToArchiveStream = instantsToArchive.stream()
+            .filter(s -> latestInstantRetainForReplace.map(instant -> s.compareTo(instant) != 0)
+                .orElse(true));
+      }
+

Review Comment:
   @zhuanshenbsj1, I don't think that the changes could move into `ClusteringUtils#getOldestInstantToRetainForClustering`. Because the `instantsToArchive` only retains one commit before the oldest clustering instant, and other commits could be achived instead of retained.
   
   Gives an timeline example to explain above idea: ie. timeline: commit1, commit2, replacecommit1.inflight, commit3, commit4. In this timeline, only commit2 should not archive to verify whether replacecommit1 is completed, other commits including commit1, commit3, commit4.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1448317420

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b6cf690564944cfeacf6d2e29e029f86fddec51 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032) 
   * 17a200739d9723ef35617d8ae9dd704abb6cc916 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] bvaradar merged pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar merged PR #7891:
URL: https://github.com/apache/hudi/pull/7891


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "zhangyue19921010 (via GitHub)" <gi...@apache.org>.
zhangyue19921010 commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1425452035

   > @zhangyue19921010, I have took the pending compaction into consideration. `HoodieFileGroup#isFileSliceCommitted(slice)` has no any problem for the check of pending compaction because the compaction only works for single filegroup.
   
   That make sense!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] bvaradar commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1451441554

   @SteNicholas : timeline.containsOrBeforeTimelineStarts() returns true if it finds the fileId with instant which was archived.  So, Clustering would be fine without this change too. right ?  Sorry, I am not able to see the problem with the description. Can you kindly elaborate with some scenarios
   
   ```  private boolean isFileSliceCommitted(FileSlice slice) {
       if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
         return false;
       }
   
       return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
     }```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1449282821

   cc @bvaradar for a final view~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1449029690

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15032",
       "triggerID" : "7b6cf690564944cfeacf6d2e29e029f86fddec51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483",
       "triggerID" : "17a200739d9723ef35617d8ae9dd704abb6cc916",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17a200739d9723ef35617d8ae9dd704abb6cc916 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15483) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1451591769

   @bvaradar, the ` timeline.containsOrBeforeTimelineStarts()` returns true for the file slice of the pending clustering instant when the previous commits before the pending clustering instant are archived. The `isBeforeTimelineStarts` returns true because there is no completed instant before the pending clustering instant. PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on a diff in pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7891:
URL: https://github.com/apache/hudi/pull/7891#discussion_r1119934910


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -473,6 +473,33 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
                       HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantToRetain.getTimestamp()))
                   .orElse(true)
           );
+
+      // When inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
+      // to check whether the file slice generated in pending clustering after archive isn't committed
+      // via {@code HoodieFileGroup#isFileSliceCommitted(slice)}
+      boolean isOldestPendingReplaceInstant =
+          oldestPendingCompactionAndReplaceInstant.map(instant ->
+              HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())).orElse(false);
+      if (isOldestPendingReplaceInstant) {
+        List<HoodieInstant> instantsToArchive = instantToArchiveStream.collect(Collectors.toList());
+        Option<HoodieInstant> latestInstantRetainForReplace = Option.fromJavaOptional(
+            instantsToArchive.stream()
+                .filter(s -> HoodieTimeline.compareTimestamps(
+                    s.getTimestamp(),
+                    LESSER_THAN,
+                    oldestPendingCompactionAndReplaceInstant.get().getTimestamp()))
+                .reduce((i1, i2) -> i2));
+        if (latestInstantRetainForReplace.isPresent()) {
+          LOG.info(String.format(
+              "Retaining the archived instant %s before the inflight replacecommit %s.",
+              latestInstantRetainForReplace.get().getTimestamp(),
+              oldestPendingCompactionAndReplaceInstant.get().getTimestamp()));
+        }
+        instantToArchiveStream = instantsToArchive.stream()
+            .filter(s -> latestInstantRetainForReplace.map(instant -> s.compareTo(instant) != 0)
+                .orElse(true));
+      }
+

Review Comment:
   @zhuanshenbsj1, I don't think that the changes could move into `ClusteringUtils#getOldestInstantToRetainForClustering`. Because the `instantsToArchive` only retains one commit before the oldest clustering instant, and other commits could be achived instead of retained.
   
   Gives an timeline example to explain above idea: ie. timeline: commit1, commit2, replacecommit1.inflight, commit3, commit4. In this timeline, only commit2 should not archive to verify whether replacecommit1 is completed, other commits including commit1, commit3, commit4.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on pull request #7891: [HUDI-5728] HoodieTimelineArchiver archives the latest instant before inflight replacecommit

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1425309699

   @zhangyue19921010, I have took the pending compaction into consideration. `HoodieFileGroup#isFileSliceCommitted(slice)` has no any problem for the check of pending compaction because the compaction only works for single filegroup.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org