Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/06 00:58:37 UTC

[GitHub] [hudi] nsivabalan opened a new pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

nsivabalan opened a new pull request #4519:
URL: https://github.com/apache/hudi/pull/4519


   ## What is the purpose of the pull request
   
   During metadata table bootstrap, only files from completed commits should be included, not every file returned by the file-system listing. This patch restricts the bootstrap to files that belong to completed commits.
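
As a rough illustration of the idea (a sketch, not the code in this patch), the filter keeps only files whose commit time is at or before the instant used for the bootstrap. The class `CompletedCommitFilter`, the helper `commitTimeOf`, and the file-name layout `<fileId>_<writeToken>_<instantTime>.<ext>` are assumptions made for this example, and instants are compared as equal-length strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

class CompletedCommitFilter {

  // Hypothetical stand-in for a commit-time lookup: assumes file names of the
  // form <fileId>_<writeToken>_<instantTime>.<ext>.
  static String commitTimeOf(String fileName) {
    return fileName.split("_")[2].split("\\.")[0];
  }

  // Keep only files written at or before the bootstrap instant.
  // Assumes instants are equal-length strings, so lexicographic order matches time order.
  static Map<String, Long> filterUpTo(Map<String, Long> fileNameToSize, String bootstrapInstant) {
    return fileNameToSize.entrySet().stream()
        .filter(e -> commitTimeOf(e.getKey()).compareTo(bootstrapInstant) <= 0)
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  public static void main(String[] args) {
    Map<String, Long> files = new HashMap<>();
    files.put("f1_1-0-1_0000005.parquet", 100L); // written by a completed commit
    files.put("f2_1-0-1_0000006.parquet", 100L); // extra file from a later, unfinished write
    System.out.println(filterUpTo(files, "0000005").keySet());
  }
}
```

With the bootstrap instant set to `0000005`, the file stamped `0000006` is dropped from the listing.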
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
   TestHoodieMetadataBootstrap.testMetadataBootstrapWithExtraFiles
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#issuecomment-1006200620


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a522d619ceddce3a0241b5363c4762d63a6f7354 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#issuecomment-1006201988


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922",
       "triggerID" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a522d619ceddce3a0241b5363c4762d63a6f7354 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#issuecomment-1007437335


   @manojpec: can you review the patch, please?





[GitHub] [hudi] hudi-bot commented on pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#issuecomment-1006220286


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922",
       "triggerID" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a522d619ceddce3a0241b5363c4762d63a6f7354 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#issuecomment-1006253880


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922",
       "triggerID" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fba1437312207e135f0f7aef489b5b16f9fbe495",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4927",
       "triggerID" : "fba1437312207e135f0f7aef489b5b16f9fbe495",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a522d619ceddce3a0241b5363c4762d63a6f7354 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922) 
   * fba1437312207e135f0f7aef489b5b16f9fbe495 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4927) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#issuecomment-1006252388


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922",
       "triggerID" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fba1437312207e135f0f7aef489b5b16f9fbe495",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "fba1437312207e135f0f7aef489b5b16f9fbe495",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a522d619ceddce3a0241b5363c4762d63a6f7354 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922) 
   * fba1437312207e135f0f7aef489b5b16f9fbe495 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] manojpec commented on pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
manojpec commented on pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#issuecomment-1009159790


   LGTM





[GitHub] [hudi] nsivabalan merged pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #4519:
URL: https://github.com/apache/hudi/pull/4519


   





[GitHub] [hudi] hudi-bot commented on pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#issuecomment-1006278905


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4922",
       "triggerID" : "a522d619ceddce3a0241b5363c4762d63a6f7354",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fba1437312207e135f0f7aef489b5b16f9fbe495",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4927",
       "triggerID" : "fba1437312207e135f0f7aef489b5b16f9fbe495",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fba1437312207e135f0f7aef489b5b16f9fbe495 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4927) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] manojpec commented on a change in pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
manojpec commented on a change in pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#discussion_r780847273



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -746,9 +746,16 @@ protected void bootstrapCommit(List<DirectoryInfo> partitionInfoList, String cre
     HoodieData<HoodieRecord> partitionRecords = engineContext.parallelize(Arrays.asList(allPartitionRecord), 1);
     if (!partitionInfoList.isEmpty()) {
       HoodieData<HoodieRecord> fileListRecords = engineContext.parallelize(partitionInfoList, partitionInfoList.size()).map(partitionInfo -> {
+        Map<String, Long> fileNameToSizeMap = partitionInfo.getFileNameToSizeMap();
+        // filter for files that are part of the completed commits
+        Map<String, Long> validFileNameToSizeMap = fileNameToSizeMap.entrySet().stream().filter(fileSizePair -> {
+          String commitTime = FSUtils.getCommitTime(fileSizePair.getKey());
+          return HoodieTimeline.compareTimestamps(commitTime, HoodieTimeline.LESSER_THAN_OR_EQUALS, createInstantTime);

Review comment:
       This does not filter out files from old failed commits, right?

##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieMetadataBootstrap.java
##########
@@ -76,6 +80,36 @@ public void testMetadataBootstrapInsertUpsertClean(HoodieTableType tableType) th
     bootstrapAndVerify();
   }
 
+  /**
+   * Validate that bootstrap considers only files part of completed commit and ignore any extra files.
+   */
+  @Test
+  public void testMetadataBootstrapWithExtraFiles() throws Exception {
+    HoodieTableType tableType = COPY_ON_WRITE;
+    init(tableType, false);
+    doPreBootstrapWriteOperation(testTable, INSERT, "0000001");
+    doPreBootstrapWriteOperation(testTable, "0000002");
+    doPreBootstrapClean(testTable, "0000003", Arrays.asList("0000001"));
+    doPreBootstrapWriteOperation(testTable, "0000005");
+    // add a few extra files to the table. bootstrap should not include those files.
+    String fileName = UUID.randomUUID().toString();
+    Path baseFilePath = FileCreateUtils.getBaseFilePath(basePath, "p1", "0000006", fileName);
+    FileCreateUtils.createBaseFile(basePath, "p1", "0000006", fileName, 100);

Review comment:
       Should we instead start the commit and leave it incomplete, so that it also appears in the timeline?







[GitHub] [hudi] nsivabalan commented on a change in pull request #4519: [HUDI-3180] Include files from completed commits while bootstrapping metadata table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #4519:
URL: https://github.com/apache/hudi/pull/4519#discussion_r781376321



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieMetadataBootstrap.java
##########
@@ -76,6 +80,36 @@ public void testMetadataBootstrapInsertUpsertClean(HoodieTableType tableType) th
     bootstrapAndVerify();
   }
 
+  /**
+   * Validate that bootstrap considers only files part of completed commit and ignore any extra files.
+   */
+  @Test
+  public void testMetadataBootstrapWithExtraFiles() throws Exception {
+    HoodieTableType tableType = COPY_ON_WRITE;
+    init(tableType, false);
+    doPreBootstrapWriteOperation(testTable, INSERT, "0000001");
+    doPreBootstrapWriteOperation(testTable, "0000002");
+    doPreBootstrapClean(testTable, "0000003", Arrays.asList("0000001"));
+    doPreBootstrapWriteOperation(testTable, "0000005");
+    // add a few extra files to the table. bootstrap should not include those files.
+    String fileName = UUID.randomUUID().toString();
+    Path baseFilePath = FileCreateUtils.getBaseFilePath(basePath, "p1", "0000006", fileName);
+    FileCreateUtils.createBaseFile(basePath, "p1", "0000006", fileName, 100);

Review comment:
       If it's part of the timeline, bootstrap may not kick in. Also, I'm not sure we would gain much from it. This test fails without the source-code fix in this patch, so we should be good. Let me know what you think.

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -746,9 +746,16 @@ protected void bootstrapCommit(List<DirectoryInfo> partitionInfoList, String cre
     HoodieData<HoodieRecord> partitionRecords = engineContext.parallelize(Arrays.asList(allPartitionRecord), 1);
     if (!partitionInfoList.isEmpty()) {
       HoodieData<HoodieRecord> fileListRecords = engineContext.parallelize(partitionInfoList, partitionInfoList.size()).map(partitionInfo -> {
+        Map<String, Long> fileNameToSizeMap = partitionInfo.getFileNameToSizeMap();
+        // filter for files that are part of the completed commits
+        Map<String, Long> validFileNameToSizeMap = fileNameToSizeMap.entrySet().stream().filter(fileSizePair -> {
+          String commitTime = FSUtils.getCommitTime(fileSizePair.getKey());
+          return HoodieTimeline.compareTimestamps(commitTime, HoodieTimeline.LESSER_THAN_OR_EQUALS, createInstantTime);

Review comment:
       Bootstrap itself is triggered only if all operations are complete. If there is a partially failed commit, bootstrap may not kick in until an explicit rollback happens.
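
The precondition described in this reply can be modelled roughly as below. This is a simplified sketch under stated assumptions, not Hudi's actual check: `BootstrapGate`, its nested `Instant` class, and `canBootstrap` are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.List;

class BootstrapGate {

  // Hypothetical, simplified stand-in for a timeline instant.
  static final class Instant {
    final String timestamp;
    final boolean completed;

    Instant(String timestamp, boolean completed) {
      this.timestamp = timestamp;
      this.completed = completed;
    }
  }

  // Bootstrap may proceed only when no instant on the timeline is still pending.
  static boolean canBootstrap(List<Instant> timeline) {
    return timeline.stream().allMatch(instant -> instant.completed);
  }

  public static void main(String[] args) {
    List<Instant> allComplete = Arrays.asList(
        new Instant("0000001", true), new Instant("0000002", true));
    List<Instant> withFailedCommit = Arrays.asList(
        new Instant("0000001", true), new Instant("0000002", false));
    System.out.println(canBootstrap(allComplete));
    System.out.println(canBootstrap(withFailedCommit));
  }
}
```

Under this model, a partially failed commit stays on the timeline as a pending instant and blocks the bootstrap until it is rolled back, which is why files from such commits would not normally reach the filter.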




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org