Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/10 14:27:53 UTC

[GitHub] [hudi] codope opened a new pull request, #5837: [HUDI-3884] Support archival beyond savepoint commits

codope opened a new pull request, #5837:
URL: https://github.com/apache/hudi/pull/5837

   ## What is the purpose of the pull request
   
   This PR builds on top of #5350 
   
   - Until now, archival stops at the first savepoint commit and does not proceed further. Users who do not need incremental queries and only run "as of instant" queries can let archival continue by skipping just the savepointed commits. This opens up new options for Hudi users. For example, one can retain data for years by adding one savepoint per day for older commits (say, more than 30 days old) and still query Hudi using "as of instant" for very old data. Otherwise, one has to retain every commit and let archival stop at the first savepoint commit, which may not be a good experience for users.
   - Had to fix one of the core methods in HoodieTimeline with this change: `boolean isBeforeTimelineStarts(String instant)`. Please check the changes here. Prior to this patch, we did not allow any holes in the timeline: an instant is considered committed if it is part of the active timeline or if it is earlier than the first entry in the active timeline. Since archival can now go beyond savepoints, there can be holes in the active timeline.
   For example: C1, C2, C3, Savepoint_C3, C4, C5, Savepoint_C5, C6, C7, C8, C9.
   Let's say C1, C2, C4, and C6 are archived (with the fix in this patch; otherwise, archival would not proceed past C2).
   
   So, the active timeline is C3, Savepoint_C3, C5, Savepoint_C5, C7, C8, C9.
   If a file group committed with C4 is checked with `isBeforeTimelineStarts(String instant)`, we might return false. The fix is to find the first non-savepoint commit in the active timeline and treat that as the first entry in the active timeline. Any instant earlier than this first non-savepoint commit is considered a valid instant time.
   In the above case, that commit is C7, so `isBeforeTimelineStarts(C4)` returns true.
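
The C1..C9 example above can be modelled with a small standalone sketch. It does not use Hudi's classes: `TimelineHolesSketch`, its `Instant` type, and the three-digit timestamps are hypothetical stand-ins, and instant times are assumed to be fixed-width strings that compare lexicographically.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified model of an active timeline; not Hudi's actual API.
final class TimelineHolesSketch {

  /** One timeline entry: an action ("commit" or "savepoint") and its instant time. */
  static final class Instant {
    final String action;
    final String timestamp;

    Instant(String action, String timestamp) {
      this.action = action;
      this.timestamp = timestamp;
    }
  }

  /** First entry in the active timeline whose instant time is not savepointed. */
  static Optional<String> firstNonSavepointCommit(List<Instant> activeTimeline) {
    Set<String> savepointed = activeTimeline.stream()
        .filter(i -> i.action.equals("savepoint"))
        .map(i -> i.timestamp)
        .collect(Collectors.toSet());
    return activeTimeline.stream()
        .filter(i -> !savepointed.contains(i.timestamp))
        .map(i -> i.timestamp)
        .findFirst();
  }

  /** True if the instant is earlier than the first non-savepoint commit. */
  static boolean isBeforeTimelineStarts(List<Instant> activeTimeline, String instantTime) {
    // Assumption: if every active entry is savepointed, nothing counts as "before".
    return firstNonSavepointCommit(activeTimeline)
        .map(first -> instantTime.compareTo(first) < 0)
        .orElse(false);
  }

  public static void main(String[] args) {
    // Active timeline from the example: C3, Savepoint_C3, C5, Savepoint_C5, C7, C8, C9.
    List<Instant> active = Arrays.asList(
        new Instant("commit", "003"), new Instant("savepoint", "003"),
        new Instant("commit", "005"), new Instant("savepoint", "005"),
        new Instant("commit", "007"), new Instant("commit", "008"),
        new Instant("commit", "009"));
    System.out.println(firstNonSavepointCommit(active));
    System.out.println(isBeforeTimelineStarts(active, "004"));
  }
}
```

In this model the archived C4 falls before C7, so it is treated as a valid completed instant even though it sits after C3 in the active timeline.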
   
   ## Brief change log
   
   - Added a new config, `hoodie.archive.beyond.savepoint`, which guards this behaviour. The default value is false to retain the old behaviour.
   - Changes in `HoodieTimelineArchiver` to honour the config. 
   - Fixed implementation of `isBeforeTimelineStarts(String instant)` to cater to holes in the active timeline to deduce valid completed commits.
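
As a rough illustration of the flag described above, the sketch below sets and reads the key through plain `java.util.Properties`. The key name is taken from this description (the diff under review may use a different key, so check the release you run), and the class and method names are hypothetical, not Hudi's config API.

```java
import java.util.Properties;

// Hypothetical helper; only the key name comes from the PR description.
final class ArchiveBeyondSavepointConfigSketch {
  static final String KEY = "hoodie.archive.beyond.savepoint";

  /** Builds writer properties with the flag set explicitly. */
  static Properties writerProps(boolean archiveBeyondSavepoint) {
    Properties props = new Properties();
    props.setProperty(KEY, Boolean.toString(archiveBeyondSavepoint));
    return props;
  }

  /** Reads the flag, falling back to false (the old behaviour) when unset. */
  static boolean shouldArchiveBeyondSavepoint(Properties props) {
    return Boolean.parseBoolean(props.getProperty(KEY, "false"));
  }

  public static void main(String[] args) {
    System.out.println(shouldArchiveBeyondSavepoint(new Properties()));
    System.out.println(shouldArchiveBeyondSavepoint(writerProps(true)));
  }
}
```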
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
   - TestHoodieTimelineArchiver#testSavepointsWithArchival
   - TestHoodieActiveTimeline#testTimelineWithSavepointAndHoles
   - TestHoodieFileGroup#testCommittedFileSlicesWithSavepointAndHoles
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1194002215

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10308",
       "triggerID" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1f2c919b0bd26e1e81c547ed10d6688782b6a7d6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10308) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r924020087


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -498,7 +505,7 @@ private Stream<HoodieInstant> getInstantsToArchive() {
       Option<String> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant().map(HoodieInstant::getTimestamp);

Review Comment:
   Should the metadata table archive the commits till the first non-savepoint commit in the data table's active timeline?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -314,6 +314,12 @@ public class HoodieCompactionConfig extends HoodieConfig {
       .withDocumentation("When enable, hoodie will auto merge several small archive files into larger one. It's"
           + " useful when storage scheme doesn't support append operation.");
 
+  public static final ConfigProperty<Boolean> ARCHIVE_BEYOND_SAVEPOINT = ConfigProperty
+      .key("hoodie.archive.proceed.savepoint")
+      .defaultValue(true)

Review Comment:
   Don't forget to switch this to false before landing.





[GitHub] [hudi] codope commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r928104105


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -498,7 +505,7 @@ private Stream<HoodieInstant> getInstantsToArchive() {
       Option<String> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant().map(HoodieInstant::getTimestamp);

Review Comment:
   Done. Now, metadata table will also be archived until the first non-savepoint commit in data table. 





[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193227930

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e8dd75a8b499cfcf5f11e6491cb907e7dc87e314 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r928144340


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -491,14 +503,33 @@ private Stream<HoodieInstant> getInstantsToArchive() {
     // active timeline. This is required by metadata table,
     // see HoodieTableMetadataUtil#processRollbackMetadata for details.
     if (HoodieTableMetadata.isMetadataTable(config.getBasePath())) {
-      HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder()
-          .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath()))
+      HoodieTableMetaClient metadataTableMetaClient = HoodieTableMetaClient.builder()

Review Comment:
   This is the meta client for the data table, not the metadata table.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -491,14 +503,33 @@ private Stream<HoodieInstant> getInstantsToArchive() {
     // active timeline. This is required by metadata table,
     // see HoodieTableMetadataUtil#processRollbackMetadata for details.
     if (HoodieTableMetadata.isMetadataTable(config.getBasePath())) {
-      HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder()
-          .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath()))
+      HoodieTableMetaClient metadataTableMetaClient = HoodieTableMetaClient.builder()
+          .setBasePath(getDatasetBasePath(config.getBasePath()))
           .setConf(metaClient.getHadoopConf())
           .build();
-      Option<String> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant().map(HoodieInstant::getTimestamp);
-      if (earliestActiveDatasetCommit.isPresent()) {
+      Option<HoodieInstant> earliestActiveDatasetCommit = metadataTableMetaClient.getActiveTimeline().firstInstant();
+
+      // There are chances that there could be holes in the timeline due to archival and savepoint interplay.
+      // So, the first non-savepoint commit in the data timeline is considered as beginning of the active timeline.
+      HoodieTableMetaClient dataTableMetaClient = HoodieTableMetaClient.builder()
+          .setBasePath(getDataTableBasePathFromMetadataTable(config.getBasePath()))
+          .setConf(metaClient.getHadoopConf())
+          .build();
+      Set<String> savepointTimestamps = dataTableMetaClient.getActiveTimeline().getInstants()
+          .filter(entry -> entry.getAction().equals(HoodieTimeline.SAVEPOINT_ACTION))
+          .map(HoodieInstant::getTimestamp)
+          .collect(Collectors.toSet());
+      Option<HoodieInstant> firstNonSavepointCommit = earliestActiveDatasetCommit;
+      if (!savepointTimestamps.isEmpty()) {
+        firstNonSavepointCommit = Option.fromJavaOptional(dataTableMetaClient.getActiveTimeline().getInstants()
+            .filter(entry -> !savepointTimestamps.contains(entry.getTimestamp()))
+            .findFirst());
+      }

Review Comment:
   Should the new logic be guarded by `config.shouldArchiveBeyondSavepoint()` as well?  So that, if the feature flag is off and there are savepoints in the data table, there is no behavior change, i.e., archival does not go beyond saved commits in the metadata table either.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -491,14 +503,33 @@ private Stream<HoodieInstant> getInstantsToArchive() {
     // active timeline. This is required by metadata table,
     // see HoodieTableMetadataUtil#processRollbackMetadata for details.
     if (HoodieTableMetadata.isMetadataTable(config.getBasePath())) {
-      HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder()
-          .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath()))
+      HoodieTableMetaClient metadataTableMetaClient = HoodieTableMetaClient.builder()
+          .setBasePath(getDatasetBasePath(config.getBasePath()))
           .setConf(metaClient.getHadoopConf())
           .build();
-      Option<String> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant().map(HoodieInstant::getTimestamp);
-      if (earliestActiveDatasetCommit.isPresent()) {
+      Option<HoodieInstant> earliestActiveDatasetCommit = metadataTableMetaClient.getActiveTimeline().firstInstant();
+
+      // There are chances that there could be holes in the timeline due to archival and savepoint interplay.
+      // So, the first non-savepoint commit in the data timeline is considered as beginning of the active timeline.
+      HoodieTableMetaClient dataTableMetaClient = HoodieTableMetaClient.builder()
+          .setBasePath(getDataTableBasePathFromMetadataTable(config.getBasePath()))
+          .setConf(metaClient.getHadoopConf())
+          .build();

Review Comment:
   You can reuse the data table meta client above.





[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193100361

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     }, {
       "hash" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251",
       "triggerID" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed7e7d224b13f2021fb79b4c36c79d2926b8f779 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216) 
   * 65ea08df8cf4bc8899db212de02763ae0a07e0aa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1198874544

   But don't we at least need to fail fast? That is, when someone tries to do a restore while this config is enabled, throw an exception saying "restore is not supported when this config is enabled"?
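
One possible shape for such a fail-fast check is sketched below. It is only an illustration of the suggestion: `RestoreGuardSketch` and `validateRestoreAllowed` are hypothetical names, and the exception type is an assumption rather than what Hudi throws.

```java
// Hypothetical guard; not part of the PR or of Hudi's API.
final class RestoreGuardSketch {

  /** Rejects a restore up front when archival beyond savepoints is enabled. */
  static void validateRestoreAllowed(boolean archiveBeyondSavepoint) {
    if (archiveBeyondSavepoint) {
      throw new UnsupportedOperationException(
          "restore is not supported when hoodie.archive.beyond.savepoint is enabled");
    }
  }

  public static void main(String[] args) {
    validateRestoreAllowed(false); // passes silently
    try {
      validateRestoreAllowed(true);
    } catch (UnsupportedOperationException e) {
      System.out.println(e.getMessage());
    }
  }
}
```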




[GitHub] [hudi] nsivabalan commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r925204360


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -498,7 +505,7 @@ private Stream<HoodieInstant> getInstantsToArchive() {
       Option<String> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant().map(HoodieInstant::getTimestamp);

Review Comment:
   I am not sure if it's strictly required. Let me know if you have compelling reasons.





[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193181850

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     }, {
       "hash" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251",
       "triggerID" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e8dd75a8b499cfcf5f11e6491cb907e7dc87e314 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193227296

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e8dd75a8b499cfcf5f11e6491cb907e7dc87e314 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1194161820

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10308",
       "triggerID" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1f2c919b0bd26e1e81c547ed10d6688782b6a7d6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10308) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1152489410

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed7e7d224b13f2021fb79b4c36c79d2926b8f779 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1188738457

   For example,
   
   C3, Savepoint_C3, C5, Savepoint_C5, C7, C8, C9.
   
   Restore to C3:
   Ideally, we need to roll back C9, C8, C7, C6, C5, and C4.
   But the active timeline may not have C4 and C6, so we may need to fetch them from the archived timeline and then trigger the rollback.
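
A minimal sketch of that idea follows, assuming commits are identified by fixed-width instant times that compare lexicographically. `RestorePlanSketch` and `instantsToRollback` are hypothetical names; the real restore plan in Hudi carries more than a list of timestamps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of stitching active and archived commits for a restore.
final class RestorePlanSketch {

  /** Commits later than the savepoint, from both timelines, newest first. */
  static List<String> instantsToRollback(Collection<String> activeCommits,
                                         Collection<String> archivedCommits,
                                         String savepointTime) {
    TreeSet<String> newestFirst = new TreeSet<>(Comparator.reverseOrder());
    for (String ts : activeCommits) {
      if (ts.compareTo(savepointTime) > 0) {
        newestFirst.add(ts);
      }
    }
    // Archived commits fill the holes left by archival beyond savepoints.
    for (String ts : archivedCommits) {
      if (ts.compareTo(savepointTime) > 0) {
        newestFirst.add(ts);
      }
    }
    return new ArrayList<>(newestFirst);
  }

  public static void main(String[] args) {
    List<String> active = Arrays.asList("003", "005", "007", "008", "009");
    List<String> archived = Arrays.asList("001", "002", "004", "006");
    System.out.println(instantsToRollback(active, archived, "003"));
  }
}
```

With the timelines from the example, restoring to C3 yields C9, C8, C7, C6, C5, C4 in that order, with C4 and C6 coming from the archived side.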
   




[GitHub] [hudi] xushiyan commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r928191163


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -92,6 +93,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
       .withDocumentation("When enable, hoodie will auto merge several small archive files into larger one. It's"
           + " useful when storage scheme doesn't support append operation.");
 
+  public static final ConfigProperty<Boolean> ARCHIVE_BEYOND_SAVEPOINT = ConfigProperty
+      .key("hoodie.archive.proceed.savepoint")

Review Comment:
   Not sure why you don't call it `hoodie.archive.beyond.savepoint`.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -409,9 +412,11 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() {
             .getTimelineOfActions(CollectionUtils.createSet(HoodieTimeline.COMMIT_ACTION, HoodieTimeline.DELTA_COMMIT_ACTION))
             .filterInflights().firstInstant();
 
-    // We cannot have any holes in the commit timeline. We cannot archive any commits which are
-    // made after the first savepoint present.
+    // NOTE: We cannot have any holes in the commit timeline.
+    // We cannot archive any commits which are made after the first savepoint present,
+    // unless HoodieArchivalConfig#ARCHIVE_BEYOND_SAVEPOINT is enabled.
     Option<HoodieInstant> firstSavepoint = table.getCompletedSavepointTimeline().firstInstant();
+    List<String> savepointTimestamps = table.getSavepointTimestamps();

Review Comment:
   This is mainly used for `contains()` checks, so it should be a `Set`.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -479,26 +489,48 @@ private Stream<HoodieInstant> getInstantsToArchive() {
           instants = Stream.empty();
         } else {
           LOG.info("Limiting archiving of instants to latest compaction on metadata table at " + latestCompactionTime.get());
-          instants = instants.filter(instant -> HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.LESSER_THAN,
+          instants = instants.filter(instant -> compareTimestamps(instant.getTimestamp(), LESSER_THAN,
               latestCompactionTime.get()));
         }
       } catch (Exception e) {
         throw new HoodieException("Error limiting instant archival based on metadata table", e);
       }
     }
 
-    // If this is a metadata table, do not archive the commits that live in data set
-    // active timeline. This is required by metadata table,
-    // see HoodieTableMetadataUtil#processRollbackMetadata for details.
     if (HoodieTableMetadata.isMetadataTable(config.getBasePath())) {
       HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder()
           .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath()))
           .setConf(metaClient.getHadoopConf())
           .build();
-      Option<String> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant().map(HoodieInstant::getTimestamp);
-      if (earliestActiveDatasetCommit.isPresent()) {
-        instants = instants.filter(instant ->
-            HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.LESSER_THAN, earliestActiveDatasetCommit.get()));
+      Option<HoodieInstant> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant();
+
+      if (config.shouldArchiveBeyondSavepoint()) {
+        // There are chances that there could be holes in the timeline due to archival and savepoint interplay.
+        // So, the first non-savepoint commit in the data timeline is considered as beginning of the active timeline.
+        Set<String> savepointTimestamps = dataMetaClient.getActiveTimeline().getInstants()
+            .filter(entry -> entry.getAction().equals(HoodieTimeline.SAVEPOINT_ACTION))
+            .map(HoodieInstant::getTimestamp)
+            .collect(Collectors.toSet());
+        Option<HoodieInstant> firstNonSavepointCommit = earliestActiveDatasetCommit;
+
+        if (!savepointTimestamps.isEmpty()) {
+          firstNonSavepointCommit = Option.fromJavaOptional(dataMetaClient.getActiveTimeline().getInstants()
+              .filter(entry -> !savepointTimestamps.contains(entry.getTimestamp()))
+              .findFirst());
+        }

Review Comment:
   This logic to find the first non-savepoint commit is also used in the hoodie default timeline; it can be extracted out.





[GitHub] [hudi] codope commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
codope commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1198849123

   > @codope @yihua : what's the consensus here on supporting restore when this config is enabled?
   
   We should support it. It was punted due to time constraints for the release, but I can take it up now.
   As discussed, we need to stitch together the active and archive timelines (since the given restore instant) while generating the restore plan. With the plan having both active and archive instants, we need to change the restore and rollback executors to consider archive instants as well.




[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1152430256

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed7e7d224b13f2021fb79b4c36c79d2926b8f779 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r925204360


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -498,7 +505,7 @@ private Stream<HoodieInstant> getInstantsToArchive() {
       Option<String> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant().map(HoodieInstant::getTimestamp);

Review Comment:
   Good catch. We need that fix, in fact, to ensure metadata table archival also makes progress.





[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193728215

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10308",
       "triggerID" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e8dd75a8b499cfcf5f11e6491cb907e7dc87e314 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263) 
   * 1f2c919b0bd26e1e81c547ed10d6688782b6a7d6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10308) 
   


[GitHub] [hudi] xushiyan commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r929188666


##########
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java:
##########
@@ -363,6 +363,12 @@ public Stream<HoodieInstant> getReverseOrderedInstants() {
 
   @Override
   public boolean isBeforeTimelineStarts(String instant) {
+    Option<HoodieInstant> firstNonSavepointCommit = getFirstNonSavepointCommit();
+    return firstNonSavepointCommit.isPresent()
+        && compareTimestamps(instant, LESSER_THAN, firstNonSavepointCommit.get().getTimestamp());
+  }
+
+  public Option<HoodieInstant> getFirstNonSavepointCommit() {

Review Comment:
   We should always annotate with `@Override` where applicable.
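
   For readers following the diff above, here is a rough, self-contained sketch of the idea behind the new `isBeforeTimelineStarts` check. It does not use the Hudi classes: `TimelineStartSketch` and `SketchInstant` are hypothetical stand-ins, the `"commit"` and `"savepoint"` action strings are assumptions, and the timeline is the example from the PR description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the "first non-savepoint commit" rule, not the Hudi code.
class TimelineStartSketch {

  // Stand-in for a timeline instant: an action name plus a timestamp.
  static final class SketchInstant {
    final String action;
    final String timestamp;

    SketchInstant(String action, String timestamp) {
      this.action = action;
      this.timestamp = timestamp;
    }
  }

  // Active timeline from the PR description, after C1, C2, C4 and C6 were archived.
  static List<SketchInstant> exampleTimeline() {
    return Arrays.asList(
        new SketchInstant("commit", "C3"), new SketchInstant("savepoint", "C3"),
        new SketchInstant("commit", "C5"), new SketchInstant("savepoint", "C5"),
        new SketchInstant("commit", "C7"), new SketchInstant("commit", "C8"),
        new SketchInstant("commit", "C9"));
  }

  // First instant (in timeline order) whose timestamp has no savepoint on it.
  static Optional<String> firstNonSavepointCommit(List<SketchInstant> active) {
    Set<String> savepointed = active.stream()
        .filter(i -> i.action.equals("savepoint"))
        .map(i -> i.timestamp)
        .collect(Collectors.toSet());
    return active.stream()
        .filter(i -> !savepointed.contains(i.timestamp))
        .map(i -> i.timestamp)
        .findFirst();
  }

  // An instant counts as "before the timeline starts" if it sorts before that commit.
  static boolean isBeforeTimelineStarts(List<SketchInstant> active, String timestamp) {
    return firstNonSavepointCommit(active)
        .map(first -> timestamp.compareTo(first) < 0)
        .orElse(false);
  }

  public static void main(String[] args) {
    List<SketchInstant> active = exampleTimeline();
    System.out.println(firstNonSavepointCommit(active).get());  // prints C7
    System.out.println(isBeforeTimelineStarts(active, "C4"));   // prints true
    System.out.println(isBeforeTimelineStarts(active, "C8"));   // prints false
  }
}
```

   Comparing against the very first active instant (C3) instead would report the archived C4 as not before the timeline start, which is the hole the PR description calls out.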





[GitHub] [hudi] nsivabalan commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1198830203

   @codope @yihua: what's the consensus here on supporting restore when this config is enabled?




[GitHub] [hudi] nsivabalan commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1188719716

   Is there any additional change beyond what I had in my initial patch?




[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193115881

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     }, {
       "hash" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251",
       "triggerID" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 65ea08df8cf4bc8899db212de02763ae0a07e0aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1194358757

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10308",
       "triggerID" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1f2c919b0bd26e1e81c547ed10d6688782b6a7d6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10308) 
   


[GitHub] [hudi] yihua commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r928244854


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -92,6 +93,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
       .withDocumentation("When enable, hoodie will auto merge several small archive files into larger one. It's"
           + " useful when storage scheme doesn't support append operation.");
 
+  public static final ConfigProperty<Boolean> ARCHIVE_BEYOND_SAVEPOINT = ConfigProperty
+      .key("hoodie.archive.proceed.savepoint")

Review Comment:
   Should we make the name consistent with the variable naming, i.e., `hoodie.archive.beyond.savepoint`?





[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1152434489

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed7e7d224b13f2021fb79b4c36c79d2926b8f779 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216) 
   


[GitHub] [hudi] nsivabalan commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1188730545

   Let's sync up on this. Restore to older savepoints in the active timeline may not work; we can only support restore to the latest savepoint.




[GitHub] [hudi] codope commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r928149492


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -491,14 +503,33 @@ private Stream<HoodieInstant> getInstantsToArchive() {
     // active timeline. This is required by metadata table,
     // see HoodieTableMetadataUtil#processRollbackMetadata for details.
     if (HoodieTableMetadata.isMetadataTable(config.getBasePath())) {
-      HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder()
-          .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath()))
+      HoodieTableMetaClient metadataTableMetaClient = HoodieTableMetaClient.builder()

Review Comment:
   You're right. Not sure why `HoodieTableMetadata` has two methods for the same purpose: `getDatasetBasePath` and `getDataTableBasePathFromMetadataTable`.





[GitHub] [hudi] yihua commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
yihua commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193226711

   > For example:
   > 
   > C3, savepoint_C3, C5, Savepoint_C5, C7, C8, C9.
   > 
   > Restore to C3: ideally we need to roll back C9, C8, C7, C6, C5, and C4, but the active timeline may not have C4 and C6. So we may need to fetch them from the archived timeline and then trigger the rollback.
   
   Do we want to support such a use case?




[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193723340

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e8dd75a8b499cfcf5f11e6491cb907e7dc87e314 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263) 
   * 1f2c919b0bd26e1e81c547ed10d6688782b6a7d6 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1194154535

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1f2c919b0bd26e1e81c547ed10d6688782b6a7d6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1f2c919b0bd26e1e81c547ed10d6688782b6a7d6 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193166569

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     }, {
       "hash" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251",
       "triggerID" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 65ea08df8cf4bc8899db212de02763ae0a07e0aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251) 
   * e8dd75a8b499cfcf5f11e6491cb907e7dc87e314 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193236601

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e8dd75a8b499cfcf5f11e6491cb907e7dc87e314 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10263) 
   


[GitHub] [hudi] xushiyan merged pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
xushiyan merged PR #5837:
URL: https://github.com/apache/hudi/pull/5837




[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193099754

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     }, {
       "hash" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed7e7d224b13f2021fb79b4c36c79d2926b8f779 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216) 
   * 65ea08df8cf4bc8899db212de02763ae0a07e0aa UNKNOWN
   


[GitHub] [hudi] codope commented on a diff in pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #5837:
URL: https://github.com/apache/hudi/pull/5837#discussion_r928146386


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##########
@@ -491,14 +503,33 @@ private Stream<HoodieInstant> getInstantsToArchive() {
     // active timeline. This is required by metadata table,
     // see HoodieTableMetadataUtil#processRollbackMetadata for details.
     if (HoodieTableMetadata.isMetadataTable(config.getBasePath())) {
-      HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder()
-          .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath()))
+      HoodieTableMetaClient metadataTableMetaClient = HoodieTableMetaClient.builder()
+          .setBasePath(getDatasetBasePath(config.getBasePath()))
           .setConf(metaClient.getHadoopConf())
           .build();
-      Option<String> earliestActiveDatasetCommit = dataMetaClient.getActiveTimeline().firstInstant().map(HoodieInstant::getTimestamp);
-      if (earliestActiveDatasetCommit.isPresent()) {
+      Option<HoodieInstant> earliestActiveDatasetCommit = metadataTableMetaClient.getActiveTimeline().firstInstant();
+
+      // There are chances that there could be holes in the timeline due to archival and savepoint interplay.
+      // So, the first non-savepoint commit in the data timeline is considered as beginning of the active timeline.
+      HoodieTableMetaClient dataTableMetaClient = HoodieTableMetaClient.builder()
+          .setBasePath(getDataTableBasePathFromMetadataTable(config.getBasePath()))
+          .setConf(metaClient.getHadoopConf())
+          .build();
+      Set<String> savepointTimestamps = dataTableMetaClient.getActiveTimeline().getInstants()
+          .filter(entry -> entry.getAction().equals(HoodieTimeline.SAVEPOINT_ACTION))
+          .map(HoodieInstant::getTimestamp)
+          .collect(Collectors.toSet());
+      Option<HoodieInstant> firstNonSavepointCommit = earliestActiveDatasetCommit;
+      if (!savepointTimestamps.isEmpty()) {
+        firstNonSavepointCommit = Option.fromJavaOptional(dataTableMetaClient.getActiveTimeline().getInstants()
+            .filter(entry -> !savepointTimestamps.contains(entry.getTimestamp()))
+            .findFirst());
+      }

Review Comment:
   Yes, that's a good point. It should be guarded.





[GitHub] [hudi] hudi-bot commented on pull request #5837: [HUDI-3884] Support archival beyond savepoint commits

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5837:
URL: https://github.com/apache/hudi/pull/5837#issuecomment-1193165884

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9216",
       "triggerID" : "ed7e7d224b13f2021fb79b4c36c79d2926b8f779",
       "triggerType" : "PUSH"
     }, {
       "hash" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251",
       "triggerID" : "65ea08df8cf4bc8899db212de02763ae0a07e0aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e8dd75a8b499cfcf5f11e6491cb907e7dc87e314",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 65ea08df8cf4bc8899db212de02763ae0a07e0aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10251) 
   * e8dd75a8b499cfcf5f11e6491cb907e7dc87e314 UNKNOWN
   