Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/07 19:09:00 UTC

[GitHub] [hudi] manojpec opened a new pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

manojpec opened a new pull request #4761:
URL: https://github.com/apache/hudi/pull/4761


   ## What is the purpose of the pull request
   
   Delta writes can change column value ranges, and the column stats index needs
   to be updated with the new ranges to stay consistent with the table's data.
   This fix adds column stats index update support for delta writes.
   
   ## Brief change log
   
   Index initialization and the metadata conversion now call the metadata table util
   to get the column range stats for the delta writes.
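   
   Conceptually, updating the index for a delta write means widening each indexed
   column's stored [min, max] range and adding the null counts computed from the new
   write. Below is a minimal, self-contained sketch of that merge idea, using
   illustrative stand-in classes rather than the actual Hudi types:
   
   ```java
   import java.util.Comparator;
   
   // Illustrative stand-in for a column stats entry (not the actual Hudi classes).
   final class ColumnRangeSketch {
     final String column;
     final String minValue;  // null when every value seen so far was null
     final String maxValue;
     final long nullCount;
   
     ColumnRangeSketch(String column, String minValue, String maxValue, long nullCount) {
       this.column = column;
       this.minValue = minValue;
       this.maxValue = maxValue;
       this.nullCount = nullCount;
     }
   
     // Merge the range already stored in the index with the range computed from a delta write.
     ColumnRangeSketch mergeWith(ColumnRangeSketch delta) {
       String mergedMin = pick(minValue, delta.minValue, Comparator.naturalOrder());
       String mergedMax = pick(maxValue, delta.maxValue, Comparator.<String>naturalOrder().reversed());
       return new ColumnRangeSketch(column, mergedMin, mergedMax, nullCount + delta.nullCount);
     }
   
     // Return the value the comparator ranks first, ignoring nulls.
     private static String pick(String a, String b, Comparator<String> order) {
       if (a == null) {
         return b;
       }
       if (b == null) {
         return a;
       }
       return order.compare(a, b) <= 0 ? a : b;
     }
   }
   ```
   
   For example, if the index currently records a range of [10, 50] for a column and a
   delta write appends values spanning [5, 60], the merged entry becomes [5, 60] and the
   null counts are summed.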
   
   ## Verify this pull request
   
   *(Please pick one of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
     - [ ] For large changes, please consider breaking them into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1033224445


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036369061


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b0350aa89d478eedddf483214ba42024880efb27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031825370


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1041375126


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     }, {
       "hash" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b0350aa89d478eedddf483214ba42024880efb27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916) 
   * 82c87f64c7c20e6bb718f7d491f0acccb003a71a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1041378118


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     }, {
       "hash" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6058",
       "triggerID" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b0350aa89d478eedddf483214ba42024880efb27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916) 
   * 82c87f64c7c20e6bb718f7d491f0acccb003a71a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6058) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1041375126


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     }, {
       "hash" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b0350aa89d478eedddf483214ba42024880efb27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916) 
   * 82c87f64c7c20e6bb718f7d491f0acccb003a71a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031825370


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036369061


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b0350aa89d478eedddf483214ba42024880efb27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] codope commented on a change in pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#discussion_r808202079



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -608,82 +583,126 @@ private static void processRollbackMetadata(HoodieActiveTimeline metadataTableTi
   }
 
   /**
-   * Convert rollback action metadata to bloom filter index records.
+   * Convert added and deleted files metadata to bloom filter index records.
    */
-  private static List<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
-                                                                     HoodieTableMetaClient dataMetaClient,
-                                                                     Map<String, List<String>> partitionToDeletedFiles,
-                                                                     Map<String, Map<String, Long>> partitionToAppendedFiles,
-                                                                     String instantTime) {
-    List<HoodieRecord> records = new LinkedList<>();
-    partitionToDeletedFiles.forEach((partitionName, deletedFileList) -> deletedFileList.forEach(deletedFile -> {
-      if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-        return;
-      }
+  public static HoodieData<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
+                                                                          Map<String, List<String>> partitionToDeletedFiles,
+                                                                          Map<String, Map<String, Long>> partitionToAppendedFiles,
+                                                                          MetadataRecordsGenerationParams recordsGenerationParams,
+                                                                          String instantTime) {
+    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
+
+    List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
+        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
+    HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList,
+        Math.max(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()));
+
+    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesEntry -> {
+      final String partitionName = partitionToDeletedFilesEntry.getLeft();
+      final List<String> deletedFileList = partitionToDeletedFilesEntry.getRight();
+      return deletedFileList.stream().flatMap(deletedFile -> {
+        if (!FSUtils.isBaseFile(new Path(deletedFile))) {
+          return Stream.empty();
+        }
 
-      final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-      records.add(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-          partition, deletedFile, instantTime, ByteBuffer.allocate(0), true));
-    }));
+        final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
+        return Stream.<HoodieRecord>of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
+            partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
+      }).iterator();
+    });
+    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
 
-    partitionToAppendedFiles.forEach((partitionName, appendedFileMap) -> {
+    List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet()
+        .stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList());
+    HoodieData<Pair<String, Map<String, Long>>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList,
+        Math.max(partitionToAppendedFiles.size(), recordsGenerationParams.getBloomIndexParallelism()));
+
+    HoodieData<HoodieRecord> appendedFilesRecordsRDD = partitionToAppendedFilesRDD.flatMap(partitionToAppendedFilesEntry -> {
+      final String partitionName = partitionToAppendedFilesEntry.getKey();
+      final Map<String, Long> appendedFileMap = partitionToAppendedFilesEntry.getValue();
       final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-      appendedFileMap.forEach((appendedFile, length) -> {
+      return appendedFileMap.entrySet().stream().flatMap(appendedFileLengthPairEntry -> {
+        final String appendedFile = appendedFileLengthPairEntry.getKey();
         if (!FSUtils.isBaseFile(new Path(appendedFile))) {
-          return;
+          return Stream.empty();
         }
         final String pathWithPartition = partitionName + "/" + appendedFile;
-        final Path appendedFilePath = new Path(dataMetaClient.getBasePath(), pathWithPartition);
-        try {
-          HoodieFileReader<IndexedRecord> fileReader =
-              HoodieFileReaderFactory.getFileReader(dataMetaClient.getHadoopConf(), appendedFilePath);
+        final Path appendedFilePath = new Path(recordsGenerationParams.getDataMetaClient().getBasePath(), pathWithPartition);
+        try (HoodieFileReader<IndexedRecord> fileReader =
+                 HoodieFileReaderFactory.getFileReader(recordsGenerationParams.getDataMetaClient().getHadoopConf(), appendedFilePath)) {
           final BloomFilter fileBloomFilter = fileReader.readBloomFilter();
           if (fileBloomFilter == null) {
             LOG.error("Failed to read bloom filter for " + appendedFilePath);
-            return;
+            return Stream.empty();
           }
           ByteBuffer bloomByteBuffer = ByteBuffer.wrap(fileBloomFilter.serializeToString().getBytes());
           HoodieRecord record = HoodieMetadataPayload.createBloomFilterMetadataRecord(
-              partition, appendedFile, instantTime, bloomByteBuffer, false);
-          records.add(record);
-          fileReader.close();
+              partition, appendedFile, instantTime, recordsGenerationParams.getBloomFilterType(), bloomByteBuffer, false);
+          return Stream.of(record);
         } catch (IOException e) {
           LOG.error("Failed to get bloom filter for file: " + appendedFilePath);
         }
-      });
+        return Stream.empty();
+      }).iterator();
     });
-    return records;
+    allRecordsRDD = allRecordsRDD.union(appendedFilesRecordsRDD);
+
+    return allRecordsRDD;
   }
 
   /**
-   * Convert rollback action metadata to column stats index records.
+   * Convert added and deleted action metadata to column stats index records.
    */
-  private static List<HoodieRecord> convertFilesToColumnStatsRecords(HoodieEngineContext engineContext,
-                                                                     HoodieTableMetaClient datasetMetaClient,
-                                                                     Map<String, List<String>> partitionToDeletedFiles,
-                                                                     Map<String, Map<String, Long>> partitionToAppendedFiles,
-                                                                     String instantTime) {
-    List<HoodieRecord> records = new LinkedList<>();
-    List<String> latestColumns = getLatestColumns(datasetMetaClient);
-    partitionToDeletedFiles.forEach((partitionName, deletedFileList) -> deletedFileList.forEach(deletedFile -> {
+  public static HoodieData<HoodieRecord> convertFilesToColumnStatsRecords(HoodieEngineContext engineContext,
+                                                                          Map<String, List<String>> partitionToDeletedFiles,
+                                                                          Map<String, Map<String, Long>> partitionToAppendedFiles,
+                                                                          MetadataRecordsGenerationParams recordsGenerationParams) {
+    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
+    final List<String> columnsToIndex = getColumnsToIndex(recordsGenerationParams.getDataMetaClient());
+
+    final List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
+        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
+    final HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList,
+        Math.max(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()));
+
+    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesEntry -> {
+      final String partitionName = partitionToDeletedFilesEntry.getLeft();
       final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-      if (deletedFile.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
+      final List<String> deletedFileList = partitionToDeletedFilesEntry.getRight();
+
+      return deletedFileList.stream().flatMap(deletedFile -> {
         final String filePathWithPartition = partitionName + "/" + deletedFile;
-        records.addAll(getColumnStats(partition, filePathWithPartition, datasetMetaClient,
-            latestColumns, true).collect(Collectors.toList()));
-      }
-    }));
-
-    partitionToAppendedFiles.forEach((partitionName, appendedFileMap) -> appendedFileMap.forEach(
-        (appendedFile, size) -> {
-          final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-          if (appendedFile.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
-            final String filePathWithPartition = partitionName + "/" + appendedFile;
-            records.addAll(getColumnStats(partition, filePathWithPartition, datasetMetaClient,
-                latestColumns, false).collect(Collectors.toList()));
-          }
-        }));
-    return records;
+        return getColumnStats(partition, filePathWithPartition, recordsGenerationParams.getDataMetaClient(),
+            columnsToIndex, Option.empty(), true);
+      }).iterator();
+    });
+    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
+
+    final List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet()
+        .stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList());
+    final HoodieData<Pair<String, Map<String, Long>>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList,
+        Math.max(partitionToAppendedFiles.size(), recordsGenerationParams.getBloomIndexParallelism()));

Review comment:
       Added a config for col stats parallelism.
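
       A minimal sketch of what such a parallelism knob could look like, assuming Hudi's
       ConfigProperty builder; the key name, default value, and documentation below are
       hypothetical, not necessarily what this PR adds:

       ```java
       import org.apache.hudi.common.config.ConfigProperty;

       // Sketch only: follows the same ConfigProperty pattern as the existing
       // bloom index parallelism setting referenced in the diff above.
       public class ColumnStatsParallelismConfigSketch {
         public static final ConfigProperty<Integer> COLUMN_STATS_INDEX_PARALLELISM = ConfigProperty
             .key("hoodie.metadata.index.column.stats.parallelism")
             .defaultValue(10)
             .withDocumentation("Parallelism to use when generating column stats index records "
                 + "from added and deleted files.");
       }
       ```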







[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031822368


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031836819


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774) 
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1033224445


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036206376


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776) 
   * 6d40a75e6d159f9c961f2e960760022af915beee Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031880072


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774) 
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031840135


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774) 
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036206376


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776) 
   * 6d40a75e6d159f9c961f2e960760022af915beee Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036297912


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6d40a75e6d159f9c961f2e960760022af915beee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913) 
   * b0350aa89d478eedddf483214ba42024880efb27 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036293797


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6d40a75e6d159f9c961f2e960760022af915beee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913) 
   * b0350aa89d478eedddf483214ba42024880efb27 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031880072


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774) 
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] codope commented on a change in pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#discussion_r803315237



##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -599,4 +600,34 @@ public static Object getRecordColumnValues(HoodieRecord<? extends HoodieRecordPa
                                              SerializableSchema schema, boolean consistentLogicalTimestampEnabled) {
     return getRecordColumnValues(record, columns, schema.get(), consistentLogicalTimestampEnabled);
   }
+
+  /**
+   * Accumulate column range statistics for the requested record.
+   *
+   * @param record   - Record to get the column range statistics for
+   * @param schema   - Schema for the record
+   * @param filePath - File that record belongs to
+   */
+  public static void accumulateColumnRanges(IndexedRecord record, Schema schema, String filePath,
+                                            Map<String, HoodieColumnRangeMetadata<Comparable>> columnRangeMap) {
+    if (!(record instanceof GenericRecord)) {
+      throw new HoodieIOException("Record is not a generic type to get column range metadata!");
+    }
+
+    schema.getFields().forEach(field -> {
+      final String fieldVal = getNestedFieldValAsString((GenericRecord) record, field.name(), true, true);
+      final int fieldSize = fieldVal == null ? 0 : fieldVal.length();
+      final HoodieColumnRangeMetadata<Comparable> fieldRange = new HoodieColumnRangeMetadata<>(
+          filePath,
+          field.name(),
+          fieldVal,
+          fieldVal,
+          fieldVal == null ? 1 : 0,

Review comment:
       nit: better to declare `1` and `0` as meaningful constants to help readers?
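
       A tiny sketch of that suggestion; the constant names here are hypothetical:

       ```java
       // Hypothetical names for the bare 1/0 literals used as the per-record null count.
       public final class ColumnRangeNullCounts {
         public static final long ONE_NULL_VALUE = 1L;  // the field value in this record was null
         public static final long NO_NULL_VALUES = 0L;  // the field value was present

         private ColumnRangeNullCounts() {
         }
       }
       ```

       The constructor argument above would then read `fieldVal == null ? ONE_NULL_VALUE : NO_NULL_VALUES`.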

##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -599,4 +600,34 @@ public static Object getRecordColumnValues(HoodieRecord<? extends HoodieRecordPa
                                              SerializableSchema schema, boolean consistentLogicalTimestampEnabled) {
     return getRecordColumnValues(record, columns, schema.get(), consistentLogicalTimestampEnabled);
   }
+
+  /**
+   * Accumulate column range statistics for the requested record.
+   *
+   * @param record   - Record to get the column range statistics for
+   * @param schema   - Schema for the record
+   * @param filePath - File that record belongs to
+   */
+  public static void accumulateColumnRanges(IndexedRecord record, Schema schema, String filePath,

Review comment:
       Shall we move this method out of HoodieAvroUtils? I don't think the Avro utils should be concerned with constructing column range metadata. Moreover, we can define this method where the write config is available, so that `consistentLogicalTimestampEnabled` is not hardcoded in the call to `HoodieAvroUtils#getNestedFieldValAsString`. I am okay with keeping this method private in `HoodieAppendHandle`, as that's the only place it's currently used.
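
       A hedged sketch of that suggestion (class and method names are illustrative, not the
       final code): keep the per-record accumulation near the write handle and thread
       `consistentLogicalTimestampEnabled` in from the write config instead of hardcoding `true`:

       ```java
       import org.apache.avro.Schema;
       import org.apache.avro.generic.GenericRecord;
       import org.apache.hudi.avro.HoodieAvroUtils;

       final class AppendHandleColumnRangeAccumulator {
         private final boolean consistentLogicalTimestampEnabled;

         AppendHandleColumnRangeAccumulator(boolean consistentLogicalTimestampEnabled) {
           // in HoodieAppendHandle this flag would come from the HoodieWriteConfig
           this.consistentLogicalTimestampEnabled = consistentLogicalTimestampEnabled;
         }

         // Same extraction as in the diff above, but with the flag passed in rather than hardcoded.
         String fieldValueAsString(GenericRecord record, Schema.Field field) {
           return HoodieAvroUtils.getNestedFieldValAsString(
               record, field.name(), true, consistentLogicalTimestampEnabled);
         }
       }
       ```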

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieColumnRangeMetadata.java
##########
@@ -33,6 +38,24 @@
   private final long totalSize;
   private final long totalUncompressedSize;
 
+  public static final BiFunction<HoodieColumnRangeMetadata<Comparable>, HoodieColumnRangeMetadata<Comparable>, HoodieColumnRangeMetadata<Comparable>> COLUMN_RANGE_MERGE_FUNCTION =
+      (oldColumnRange, newColumnRange) -> {
+        ValidationUtils.checkArgument(oldColumnRange.getColumnName().equals(newColumnRange.getColumnName()));
+        ValidationUtils.checkArgument(oldColumnRange.getFilePath().equals(newColumnRange.getFilePath()));
+        return new HoodieColumnRangeMetadata<>(
+            newColumnRange.getFilePath(),
+            newColumnRange.getColumnName(),
+            (Comparable) Arrays.asList(oldColumnRange.getMinValue(), newColumnRange.getMinValue())

Review comment:
       nit: remove the redundant cast to `Comparable`?

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -874,45 +869,53 @@ public static HoodieTableFileSystemView getFileSystemView(HoodieTableMetaClient
     }
   }
 
-  private static List<String> getLatestColumns(HoodieTableMetaClient datasetMetaClient) {
-    return getLatestColumns(datasetMetaClient, false);
+  private static List<String> getColumnsToIndex(HoodieTableMetaClient datasetMetaClient) {
+    return getColumnsToIndex(datasetMetaClient, false);
   }
 
   public static Stream<HoodieRecord> translateWriteStatToColumnStats(HoodieWriteStat writeStat,
                                                                      HoodieTableMetaClient datasetMetaClient,
-                                                                     List<String> latestColumns) {
-    return getColumnStats(writeStat.getPartitionPath(), writeStat.getPath(), datasetMetaClient, latestColumns, false);
+                                                                     List<String> columnsToIndex) {
+    Option<Map<String, HoodieColumnRangeMetadata<Comparable>>> columnRangeMap = Option.empty();
+    if (writeStat instanceof HoodieDeltaWriteStat && ((HoodieDeltaWriteStat) writeStat).getRecordsStats().isPresent()) {
+      columnRangeMap = Option.of(((HoodieDeltaWriteStat) writeStat).getRecordsStats().get().getStats());
+    }
+    return getColumnStats(writeStat.getPartitionPath(), writeStat.getPath(), datasetMetaClient, columnsToIndex,
+        columnRangeMap, false);
 
   }
 
   private static Stream<HoodieRecord> getColumnStats(final String partitionPath, final String filePathWithPartition,
                                                      HoodieTableMetaClient datasetMetaClient,
-                                                     List<String> columns, boolean isDeleted) {
+                                                     List<String> columnsToIndex,
+                                                     Option<Map<String, HoodieColumnRangeMetadata<Comparable>>> columnRangeMap,
+                                                     boolean isDeleted) {
     final String partition = partitionPath.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionPath;
     final int offset = partition.equals(NON_PARTITIONED_NAME) ? (filePathWithPartition.startsWith("/") ? 1 : 0)
         : partition.length() + 1;
     final String fileName = filePathWithPartition.substring(offset);
-    if (!FSUtils.isBaseFile(new Path(fileName))) {
-      return Stream.empty();
-    }
 
     if (filePathWithPartition.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
       List<HoodieColumnRangeMetadata<Comparable>> columnRangeMetadataList = new ArrayList<>();
       final Path fullFilePath = new Path(datasetMetaClient.getBasePath(), filePathWithPartition);
       if (!isDeleted) {
         try {
           columnRangeMetadataList = new ParquetUtils().readRangeFromParquetMetadata(
-              datasetMetaClient.getHadoopConf(), fullFilePath, columns);
+              datasetMetaClient.getHadoopConf(), fullFilePath, columnsToIndex);
         } catch (Exception e) {
           LOG.error("Failed to read column stats for " + fullFilePath, e);
         }
       } else {
         columnRangeMetadataList =
-            columns.stream().map(entry -> new HoodieColumnRangeMetadata<Comparable>(fileName,
+            columnsToIndex.stream().map(entry -> new HoodieColumnRangeMetadata<Comparable>(fileName,
                     entry, null, null, 0, 0, 0, 0))
                 .collect(Collectors.toList());
       }
       return HoodieMetadataPayload.createColumnStatsRecords(partitionPath, columnRangeMetadataList, isDeleted);
+    } else if (columnRangeMap.isPresent()) {

Review comment:
       Should we also check that the stat map is non-empty?
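
   For illustration, the guard the comment suggests might look like this (fragment only, following the diff above):
   ```
   } else if (columnRangeMap.isPresent() && !columnRangeMap.get().isEmpty()) {
   ```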

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
##########
@@ -118,6 +121,26 @@
   private HoodieMetadataBloomFilter bloomFilterMetadata = null;
   private HoodieMetadataColumnStats columnStatMetadata = null;
 
+  public static final BiFunction<HoodieMetadataColumnStats, HoodieMetadataColumnStats, HoodieMetadataColumnStats> COLUMN_STATS_MERGE_FUNCTION =
+      (oldColumnStats, newColumnStats) -> {
+        ValidationUtils.checkArgument(oldColumnStats.getFileName().equals(newColumnStats.getFileName()));
+        if (newColumnStats.getIsDeleted()) {
+          return newColumnStats;
+        }
+        return new HoodieMetadataColumnStats(
+            newColumnStats.getFileName(),
+            Arrays.asList(oldColumnStats.getMinValue(), newColumnStats.getMinValue())
+                .stream().filter(Objects::nonNull).min(Comparator.naturalOrder()).orElse(null),
+            Arrays.asList(oldColumnStats.getMaxValue(), newColumnStats.getMaxValue())
+                .stream().filter(Objects::nonNull).max(Comparator.naturalOrder()).orElse(null),
+            oldColumnStats.getNullCount() + newColumnStats.getNullCount(),

Review comment:
       Is my understanding correct that, since this is the append handle and we don't expect duplicates, we can simply add these stats?

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
##########
@@ -118,6 +121,26 @@
   private HoodieMetadataBloomFilter bloomFilterMetadata = null;
   private HoodieMetadataColumnStats columnStatMetadata = null;
 
+  public static final BiFunction<HoodieMetadataColumnStats, HoodieMetadataColumnStats, HoodieMetadataColumnStats> COLUMN_STATS_MERGE_FUNCTION =
+      (oldColumnStats, newColumnStats) -> {
+        ValidationUtils.checkArgument(oldColumnStats.getFileName().equals(newColumnStats.getFileName()));
+        if (newColumnStats.getIsDeleted()) {
+          return newColumnStats;
+        }
+        return new HoodieMetadataColumnStats(
+            newColumnStats.getFileName(),

Review comment:
       So this field is called `fileName` in HoodieMetadataColumnStats but `filePath` in HoodieColumnRangeMetadata. If possible, can we keep the names consistent?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036203793


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776) 
   * 6d40a75e6d159f9c961f2e960760022af915beee UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036293797


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6d40a75e6d159f9c961f2e960760022af915beee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913) 
   * b0350aa89d478eedddf483214ba42024880efb27 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
codope commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1044529765


   Closing it in favor of #4848 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1041378118


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     }, {
       "hash" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6058",
       "triggerID" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b0350aa89d478eedddf483214ba42024880efb27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916) 
   * 82c87f64c7c20e6bb718f7d491f0acccb003a71a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6058) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036203793


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776) 
   * 6d40a75e6d159f9c961f2e960760022af915beee UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#discussion_r804981688



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieColumnRangeMetadata.java
##########
@@ -33,6 +39,24 @@
   private final long totalSize;
   private final long totalUncompressedSize;
 
+  public static final BiFunction<HoodieColumnRangeMetadata<Comparable>, HoodieColumnRangeMetadata<Comparable>, HoodieColumnRangeMetadata<Comparable>> COLUMN_RANGE_MERGE_FUNCTION =
+      (oldColumnRange, newColumnRange) -> {
+        ValidationUtils.checkArgument(oldColumnRange.getColumnName().equals(newColumnRange.getColumnName()));

Review comment:
       Do we need to validate for every record? Maybe we can move the validations one level up and avoid this. Since this is called for every record in an iterative manner, I'm trying to see if this is overkill.
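
   Roughly, the idea would be something like the following (sketch only; `ranges` is a hypothetical variable for the batch being merged):
   ```
   // validate the invariant once for the whole batch, outside the per-record merge
   ValidationUtils.checkArgument(ranges.stream()
       .map(HoodieColumnRangeMetadata::getColumnName).distinct().count() <= 1,
       "all ranges being merged must belong to the same column");
   // ... then apply COLUMN_RANGE_MERGE_FUNCTION per record without re-checking
   ```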

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/MetadataRecordsGenerationParams.java
##########
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+
+import java.io.Serializable;
+import java.util.List;
+
+public class MetadataRecordsGenerationParams implements Serializable {

Review comment:
       Can we add javadocs for this class?

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -608,82 +583,126 @@ private static void processRollbackMetadata(HoodieActiveTimeline metadataTableTi
   }
 
   /**
-   * Convert rollback action metadata to bloom filter index records.
+   * Convert added and deleted files metadata to bloom filter index records.
    */
-  private static List<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
-                                                                     HoodieTableMetaClient dataMetaClient,
-                                                                     Map<String, List<String>> partitionToDeletedFiles,
-                                                                     Map<String, Map<String, Long>> partitionToAppendedFiles,
-                                                                     String instantTime) {
-    List<HoodieRecord> records = new LinkedList<>();
-    partitionToDeletedFiles.forEach((partitionName, deletedFileList) -> deletedFileList.forEach(deletedFile -> {
-      if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-        return;
-      }
+  public static HoodieData<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
+                                                                          Map<String, List<String>> partitionToDeletedFiles,
+                                                                          Map<String, Map<String, Long>> partitionToAppendedFiles,
+                                                                          MetadataRecordsGenerationParams recordsGenerationParams,
+                                                                          String instantTime) {
+    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
+
+    List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
+        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
+    HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList,
+        Math.max(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()));
+
+    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesEntry -> {
+      final String partitionName = partitionToDeletedFilesEntry.getLeft();
+      final List<String> deletedFileList = partitionToDeletedFilesEntry.getRight();
+      return deletedFileList.stream().flatMap(deletedFile -> {
+        if (!FSUtils.isBaseFile(new Path(deletedFile))) {
+          return Stream.empty();
+        }
 
-      final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-      records.add(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-          partition, deletedFile, instantTime, ByteBuffer.allocate(0), true));
-    }));
+        final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
+        return Stream.<HoodieRecord>of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
+            partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
+      }).iterator();
+    });
+    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
 
-    partitionToAppendedFiles.forEach((partitionName, appendedFileMap) -> {
+    List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet()
+        .stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList());
+    HoodieData<Pair<String, Map<String, Long>>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList,
+        Math.max(partitionToAppendedFiles.size(), recordsGenerationParams.getBloomIndexParallelism()));
+
+    HoodieData<HoodieRecord> appendedFilesRecordsRDD = partitionToAppendedFilesRDD.flatMap(partitionToAppendedFilesEntry -> {
+      final String partitionName = partitionToAppendedFilesEntry.getKey();
+      final Map<String, Long> appendedFileMap = partitionToAppendedFilesEntry.getValue();
       final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-      appendedFileMap.forEach((appendedFile, length) -> {
+      return appendedFileMap.entrySet().stream().flatMap(appendedFileLengthPairEntry -> {
+        final String appendedFile = appendedFileLengthPairEntry.getKey();
         if (!FSUtils.isBaseFile(new Path(appendedFile))) {
-          return;
+          return Stream.empty();
         }
         final String pathWithPartition = partitionName + "/" + appendedFile;
-        final Path appendedFilePath = new Path(dataMetaClient.getBasePath(), pathWithPartition);
-        try {
-          HoodieFileReader<IndexedRecord> fileReader =
-              HoodieFileReaderFactory.getFileReader(dataMetaClient.getHadoopConf(), appendedFilePath);
+        final Path appendedFilePath = new Path(recordsGenerationParams.getDataMetaClient().getBasePath(), pathWithPartition);
+        try (HoodieFileReader<IndexedRecord> fileReader =
+                 HoodieFileReaderFactory.getFileReader(recordsGenerationParams.getDataMetaClient().getHadoopConf(), appendedFilePath)) {
           final BloomFilter fileBloomFilter = fileReader.readBloomFilter();
           if (fileBloomFilter == null) {
             LOG.error("Failed to read bloom filter for " + appendedFilePath);
-            return;
+            return Stream.empty();

Review comment:
       Shouldn't we be throwing an exception here? How can a base file not have a bloom filter?
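
   If failing fast is the preferred behavior, a sketch of the alternative (assuming HoodieMetadataException is the appropriate exception type here):
   ```
   if (fileBloomFilter == null) {
     throw new HoodieMetadataException("Failed to read bloom filter for " + appendedFilePath);
   }
   ```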

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
##########
@@ -339,6 +383,11 @@ private void processAppendResult(AppendResult result) {
       updateWriteStatus(stat, result);
     }
 
+    Map<String, HoodieColumnRangeMetadata<Comparable>> columnRangeMap = stat.getRecordsStats().isPresent()

Review comment:
       We should populate this only if the metadata table and the metadata index are enabled, right? Or did we decide to serialize this info regardless? Computing these stats will definitely add to write latency, so I'm trying to see if we can avoid it when it's not required at all.
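
   For illustration, the gating could look roughly like this (the accessor names are hypothetical and not verified against HoodieWriteConfig):
   ```
   // hypothetical accessors; the real flags may be named differently
   boolean collectRanges = config.isMetadataTableEnabled() && config.isColumnStatsIndexEnabled();
   if (collectRanges) {
     // accumulate per-column HoodieColumnRangeMetadata while appending records
   }
   ```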
   

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -642,6 +633,13 @@ private void initializeFileGroups(HoodieTableMetaClient dataMetaClient, Metadata
     }
   }
 
+  private MetadataRecordsGenerationParams getRecordsGenerationParams() {
+    return new MetadataRecordsGenerationParams(
+        dataMetaClient, enabledPartitionTypes, dataWriteConfig.getBloomFilterType(),
+        dataWriteConfig.getBloomIndexParallelism(),

Review comment:
       I feel we can't use the data table's bloom index parallelism here. Maybe we can reuse the file listing parallelism or introduce a new config.
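
   For example, a dedicated knob could be introduced along these lines (key name and default value are made up for illustration):
   ```
   public static final ConfigProperty<Integer> METADATA_INDEX_PARALLELISM = ConfigProperty
       .key("hoodie.metadata.index.record.generation.parallelism")   // hypothetical key
       .defaultValue(200)                                            // hypothetical default
       .withDocumentation("Parallelism to use when generating bloom filter and column stats "
           + "index records for the metadata table.");
   ```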

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -329,48 +321,49 @@ public static void deleteMetadataTable(String basePath, HoodieEngineContext cont
       });
     });
 
-    return engineContext.map(deleteFileList, deleteFileInfo -> {
-      return HoodieMetadataPayload.createBloomFilterMetadataRecord(
-          deleteFileInfo.getLeft(), deleteFileInfo.getRight(), instantTime, ByteBuffer.allocate(0), true);
-    }, 1).stream().collect(Collectors.toList());
+    HoodieData<Pair<String, String>> deleteFileListRDD = engineContext.parallelize(deleteFileList,
+        Math.max(deleteFileList.size(), recordsGenerationParams.getBloomIndexParallelism()));
+    return deleteFileListRDD.map(deleteFileInfo -> HoodieMetadataPayload.createBloomFilterMetadataRecord(
+        deleteFileInfo.getLeft(), deleteFileInfo.getRight(), instantTime, StringUtils.EMPTY_STRING,
+        ByteBuffer.allocate(0), true));
   }
 
   /**
    * Convert clean metadata to column stats index records.
    *
-   * @param cleanMetadata     - Clean action metadata
-   * @param engineContext     - Engine context
-   * @param datasetMetaClient - data table meta client
+   * @param cleanMetadata           - Clean action metadata
+   * @param engineContext           - Engine context
+   * @param recordsGenerationParams - Parameters for bloom filter record generation
    * @return List of column stats index records for the clean metadata
    */
-  public static List<HoodieRecord> convertMetadataToColumnStatsRecords(HoodieCleanMetadata cleanMetadata,
-                                                                       HoodieEngineContext engineContext,
-                                                                       HoodieTableMetaClient datasetMetaClient) {
+  public static HoodieData<HoodieRecord> convertMetadataToColumnStatsRecords(HoodieCleanMetadata cleanMetadata,
+                                                                             HoodieEngineContext engineContext,
+                                                                             MetadataRecordsGenerationParams recordsGenerationParams) {
     List<Pair<String, String>> deleteFileList = new ArrayList<>();
     cleanMetadata.getPartitionMetadata().forEach((partition, partitionMetadata) -> {
       // Files deleted from a partition
       List<String> deletedFiles = partitionMetadata.getDeletePathPatterns();
       deletedFiles.forEach(entry -> deleteFileList.add(Pair.of(partition, entry)));
     });
 
-    List<String> latestColumns = getLatestColumns(datasetMetaClient);
-    return engineContext.flatMap(deleteFileList,
-        deleteFileInfo -> {
-          if (deleteFileInfo.getRight().endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
-            return getColumnStats(deleteFileInfo.getKey(), deleteFileInfo.getValue(), datasetMetaClient,
-                latestColumns, true);
-          }
-          return Stream.empty();
-        }, 1).stream().collect(Collectors.toList());
+    final List<String> columnsToIndex = getColumnsToIndex(recordsGenerationParams.getDataMetaClient());

Review comment:
       Sorry, with getLatestColumns(), I see the 2nd arg is isMetaIndexColumnStatsForAllColumns. Shouldn't we be setting it appropriately? Why let it be false here?
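
   i.e., roughly (the accessor name below is a guess for illustration, not a verified API):
   ```
   final List<String> columnsToIndex = getColumnsToIndex(
       recordsGenerationParams.getDataMetaClient(),
       recordsGenerationParams.isAllColumnStatsIndexEnabled()); // hypothetical accessor
   ```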

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -608,82 +583,126 @@ private static void processRollbackMetadata(HoodieActiveTimeline metadataTableTi
   }
 
   /**
-   * Convert rollback action metadata to bloom filter index records.
+   * Convert added and deleted files metadata to bloom filter index records.
    */
-  private static List<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
-                                                                     HoodieTableMetaClient dataMetaClient,
-                                                                     Map<String, List<String>> partitionToDeletedFiles,
-                                                                     Map<String, Map<String, Long>> partitionToAppendedFiles,
-                                                                     String instantTime) {
-    List<HoodieRecord> records = new LinkedList<>();
-    partitionToDeletedFiles.forEach((partitionName, deletedFileList) -> deletedFileList.forEach(deletedFile -> {
-      if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-        return;
-      }
+  public static HoodieData<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
+                                                                          Map<String, List<String>> partitionToDeletedFiles,
+                                                                          Map<String, Map<String, Long>> partitionToAppendedFiles,
+                                                                          MetadataRecordsGenerationParams recordsGenerationParams,
+                                                                          String instantTime) {
+    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
+
+    List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
+        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
+    HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList,
+        Math.max(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()));
+
+    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesEntry -> {
+      final String partitionName = partitionToDeletedFilesEntry.getLeft();
+      final List<String> deletedFileList = partitionToDeletedFilesEntry.getRight();
+      return deletedFileList.stream().flatMap(deletedFile -> {
+        if (!FSUtils.isBaseFile(new Path(deletedFile))) {
+          return Stream.empty();
+        }
 
-      final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-      records.add(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-          partition, deletedFile, instantTime, ByteBuffer.allocate(0), true));
-    }));
+        final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
+        return Stream.<HoodieRecord>of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
+            partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
+      }).iterator();
+    });
+    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
 
-    partitionToAppendedFiles.forEach((partitionName, appendedFileMap) -> {
+    List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet()
+        .stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList());
+    HoodieData<Pair<String, Map<String, Long>>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList,
+        Math.max(partitionToAppendedFiles.size(), recordsGenerationParams.getBloomIndexParallelism()));
+
+    HoodieData<HoodieRecord> appendedFilesRecordsRDD = partitionToAppendedFilesRDD.flatMap(partitionToAppendedFilesEntry -> {
+      final String partitionName = partitionToAppendedFilesEntry.getKey();
+      final Map<String, Long> appendedFileMap = partitionToAppendedFilesEntry.getValue();
       final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-      appendedFileMap.forEach((appendedFile, length) -> {
+      return appendedFileMap.entrySet().stream().flatMap(appendedFileLengthPairEntry -> {
+        final String appendedFile = appendedFileLengthPairEntry.getKey();
         if (!FSUtils.isBaseFile(new Path(appendedFile))) {
-          return;
+          return Stream.empty();
         }
         final String pathWithPartition = partitionName + "/" + appendedFile;
-        final Path appendedFilePath = new Path(dataMetaClient.getBasePath(), pathWithPartition);
-        try {
-          HoodieFileReader<IndexedRecord> fileReader =
-              HoodieFileReaderFactory.getFileReader(dataMetaClient.getHadoopConf(), appendedFilePath);
+        final Path appendedFilePath = new Path(recordsGenerationParams.getDataMetaClient().getBasePath(), pathWithPartition);
+        try (HoodieFileReader<IndexedRecord> fileReader =
+                 HoodieFileReaderFactory.getFileReader(recordsGenerationParams.getDataMetaClient().getHadoopConf(), appendedFilePath)) {
           final BloomFilter fileBloomFilter = fileReader.readBloomFilter();
           if (fileBloomFilter == null) {
             LOG.error("Failed to read bloom filter for " + appendedFilePath);
-            return;
+            return Stream.empty();
           }
           ByteBuffer bloomByteBuffer = ByteBuffer.wrap(fileBloomFilter.serializeToString().getBytes());
           HoodieRecord record = HoodieMetadataPayload.createBloomFilterMetadataRecord(
-              partition, appendedFile, instantTime, bloomByteBuffer, false);
-          records.add(record);
-          fileReader.close();
+              partition, appendedFile, instantTime, recordsGenerationParams.getBloomFilterType(), bloomByteBuffer, false);
+          return Stream.of(record);
         } catch (IOException e) {
           LOG.error("Failed to get bloom filter for file: " + appendedFilePath);
         }
-      });
+        return Stream.empty();
+      }).iterator();
     });
-    return records;
+    allRecordsRDD = allRecordsRDD.union(appendedFilesRecordsRDD);
+
+    return allRecordsRDD;
   }
 
   /**
-   * Convert rollback action metadata to column stats index records.
+   * Convert added and deleted action metadata to column stats index records.
    */
-  private static List<HoodieRecord> convertFilesToColumnStatsRecords(HoodieEngineContext engineContext,
-                                                                     HoodieTableMetaClient datasetMetaClient,
-                                                                     Map<String, List<String>> partitionToDeletedFiles,
-                                                                     Map<String, Map<String, Long>> partitionToAppendedFiles,
-                                                                     String instantTime) {
-    List<HoodieRecord> records = new LinkedList<>();
-    List<String> latestColumns = getLatestColumns(datasetMetaClient);
-    partitionToDeletedFiles.forEach((partitionName, deletedFileList) -> deletedFileList.forEach(deletedFile -> {
+  public static HoodieData<HoodieRecord> convertFilesToColumnStatsRecords(HoodieEngineContext engineContext,
+                                                                          Map<String, List<String>> partitionToDeletedFiles,
+                                                                          Map<String, Map<String, Long>> partitionToAppendedFiles,
+                                                                          MetadataRecordsGenerationParams recordsGenerationParams) {
+    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
+    final List<String> columnsToIndex = getColumnsToIndex(recordsGenerationParams.getDataMetaClient());
+
+    final List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
+        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
+    final HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList,
+        Math.max(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()));
+
+    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesEntry -> {
+      final String partitionName = partitionToDeletedFilesEntry.getLeft();
       final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-      if (deletedFile.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
+      final List<String> deletedFileList = partitionToDeletedFilesEntry.getRight();
+
+      return deletedFileList.stream().flatMap(deletedFile -> {
         final String filePathWithPartition = partitionName + "/" + deletedFile;
-        records.addAll(getColumnStats(partition, filePathWithPartition, datasetMetaClient,
-            latestColumns, true).collect(Collectors.toList()));
-      }
-    }));
-
-    partitionToAppendedFiles.forEach((partitionName, appendedFileMap) -> appendedFileMap.forEach(
-        (appendedFile, size) -> {
-          final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
-          if (appendedFile.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
-            final String filePathWithPartition = partitionName + "/" + appendedFile;
-            records.addAll(getColumnStats(partition, filePathWithPartition, datasetMetaClient,
-                latestColumns, false).collect(Collectors.toList()));
-          }
-        }));
-    return records;
+        return getColumnStats(partition, filePathWithPartition, recordsGenerationParams.getDataMetaClient(),
+            columnsToIndex, Option.empty(), true);
+      }).iterator();
+    });
+    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
+
+    final List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet()
+        .stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList());
+    final HoodieData<Pair<String, Map<String, Long>>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList,
+        Math.max(partitionToAppendedFiles.size(), recordsGenerationParams.getBloomIndexParallelism()));

Review comment:
       Why are we using the bloom index parallelism in column stats generation?

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -416,47 +412,26 @@ private static void processRestoreMetadata(HoodieActiveTimeline metadataTableTim
                                              Map<String, Map<String, Long>> partitionToAppendedFiles,
                                              Map<String, List<String>> partitionToDeletedFiles,
                                              Option<String> lastSyncTs) {
-    restoreMetadata.getHoodieRestoreMetadata().values().forEach(rms -> {
-      rms.forEach(rm -> processRollbackMetadata(metadataTableTimeline, rm,
-          partitionToDeletedFiles, partitionToAppendedFiles, lastSyncTs));
-    });
+    restoreMetadata.getHoodieRestoreMetadata().values().forEach(rms -> rms.forEach(rm -> processRollbackMetadata(metadataTableTimeline, rm,
+        partitionToDeletedFiles, partitionToAppendedFiles, lastSyncTs)));
   }
 
   /**
    * Convert rollback action metadata to metadata table records.
    */
   public static Map<MetadataPartitionType, HoodieData<HoodieRecord>> convertMetadataToRecords(
-      HoodieEngineContext engineContext, List<MetadataPartitionType> enabledPartitionTypes,
-      HoodieActiveTimeline metadataTableTimeline, HoodieRollbackMetadata rollbackMetadata,
-      HoodieTableMetaClient dataMetaClient, String instantTime, Option<String> lastSyncTs, boolean wasSynced) {
+      HoodieEngineContext engineContext, HoodieActiveTimeline metadataTableTimeline,
+      HoodieRollbackMetadata rollbackMetadata, MetadataRecordsGenerationParams recordsGenerationParams,
+      String instantTime, Option<String> lastSyncTs, boolean wasSynced) {
     final Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
 
     Map<String, List<String>> partitionToDeletedFiles = new HashMap<>();
     Map<String, Map<String, Long>> partitionToAppendedFiles = new HashMap<>();
     List<HoodieRecord> filesPartitionRecords = convertMetadataToRollbackRecords(metadataTableTimeline, rollbackMetadata,
         partitionToDeletedFiles, partitionToAppendedFiles, instantTime, lastSyncTs, wasSynced);
     final HoodieData<HoodieRecord> rollbackRecordsRDD = engineContext.parallelize(filesPartitionRecords, 1);
-    partitionToRecordsMap.put(MetadataPartitionType.FILES, rollbackRecordsRDD);
-
-    if (enabledPartitionTypes.contains(MetadataPartitionType.BLOOM_FILTERS)) {
-      final List<HoodieRecord> metadataBloomFilterRecords = convertFilesToBloomFilterRecords(
-          engineContext, dataMetaClient, partitionToDeletedFiles, partitionToAppendedFiles, instantTime);
-      if (!metadataBloomFilterRecords.isEmpty()) {
-        final HoodieData<HoodieRecord> metadataBloomFilterRecordsRDD = engineContext.parallelize(metadataBloomFilterRecords, 1);
-        partitionToRecordsMap.put(MetadataPartitionType.BLOOM_FILTERS, metadataBloomFilterRecordsRDD);
-      }
-    }
-
-    if (enabledPartitionTypes.contains(MetadataPartitionType.COLUMN_STATS)) {
-      final List<HoodieRecord> metadataColumnStats = convertFilesToColumnStatsRecords(
-          engineContext, dataMetaClient, partitionToDeletedFiles, partitionToAppendedFiles, instantTime);
-      if (!metadataColumnStats.isEmpty()) {
-        final HoodieData<HoodieRecord> metadataColumnStatsRDD = engineContext.parallelize(metadataColumnStats, 1);
-        partitionToRecordsMap.put(MetadataPartitionType.COLUMN_STATS, metadataColumnStatsRDD);
-      }
-    }
-
-    return partitionToRecordsMap;
+    return getMetadataPartitionTypeHoodieDataMap(engineContext, recordsGenerationParams, instantTime,

Review comment:
       Maybe I see the reason why. We can have a callback and call into that so that it's consistent.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -381,26 +374,29 @@ public static void deleteMetadataTable(String basePath, HoodieEngineContext cont
     final HoodieData<HoodieRecord> filesPartitionRecordsRDD = engineContext.parallelize(

Review comment:
       Not sure why FILES is handled here while the bloom filter and column stats records are in getMetadataPartitionTypeHoodieDataMap. Can we also move the FILES record generation into getMetadataPartitionTypeHoodieDataMap?

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -847,45 +852,52 @@ public static HoodieTableFileSystemView getFileSystemView(HoodieTableMetaClient
     }
   }
 
-  private static List<String> getLatestColumns(HoodieTableMetaClient datasetMetaClient) {
-    return getLatestColumns(datasetMetaClient, false);
+  private static List<String> getColumnsToIndex(HoodieTableMetaClient datasetMetaClient) {
+    return getColumnsToIndex(datasetMetaClient, false);
   }
 
   public static Stream<HoodieRecord> translateWriteStatToColumnStats(HoodieWriteStat writeStat,
                                                                      HoodieTableMetaClient datasetMetaClient,
-                                                                     List<String> latestColumns) {
-    return getColumnStats(writeStat.getPartitionPath(), writeStat.getPath(), datasetMetaClient, latestColumns, false);
+                                                                     List<String> columnsToIndex) {
+    Option<Map<String, HoodieColumnRangeMetadata<Comparable>>> columnRangeMap = Option.empty();
+    if (writeStat instanceof HoodieDeltaWriteStat && ((HoodieDeltaWriteStat) writeStat).getRecordsStats().isPresent()) {
+      columnRangeMap = Option.of(((HoodieDeltaWriteStat) writeStat).getRecordsStats().get().getStats());
+    }
+    return getColumnStats(writeStat.getPartitionPath(), writeStat.getPath(), datasetMetaClient, columnsToIndex,

Review comment:
       I don't see much value in passing columnRangeMap to getColumnStats(). We might as well do
   ```
   List<HoodieColumnRangeMetadata<Comparable>> columnRangeMetadataList = new ArrayList<>(columnRangeMap.get().values());
         return HoodieMetadataPayload.createColumnStatsRecords(partitionPath, columnRangeMetadataList, isDeleted);
   ```
   here within the if block and call getColumnStats() only in the else block.
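
   Putting the suggestion together, the method body could look roughly like this (sketch only; signatures follow the diff above):
   ```
   if (columnRangeMap.isPresent()) {
     List<HoodieColumnRangeMetadata<Comparable>> columnRangeMetadataList =
         new ArrayList<>(columnRangeMap.get().values());
     return HoodieMetadataPayload.createColumnStatsRecords(
         writeStat.getPartitionPath(), columnRangeMetadataList, false);
   }
   return getColumnStats(writeStat.getPartitionPath(), writeStat.getPath(), datasetMetaClient,
       columnsToIndex, Option.empty(), false);
   ```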




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036260406


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6d40a75e6d159f9c961f2e960760022af915beee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031822368


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031836819


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774) 
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1031840135


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6f056e6515660c1d179387c2fd264f595d3d9192 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774) 
   * 974c5ecb0234bcf346666b0f0ef86057decb9812 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036297912


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6d40a75e6d159f9c961f2e960760022af915beee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913) 
   * b0350aa89d478eedddf483214ba42024880efb27 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1036260406


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6d40a75e6d159f9c961f2e960760022af915beee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4761:
URL: https://github.com/apache/hudi/pull/4761#issuecomment-1041465713


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5774",
       "triggerID" : "6f056e6515660c1d179387c2fd264f595d3d9192",
       "triggerType" : "PUSH"
     }, {
       "hash" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5776",
       "triggerID" : "974c5ecb0234bcf346666b0f0ef86057decb9812",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5913",
       "triggerID" : "6d40a75e6d159f9c961f2e960760022af915beee",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0350aa89d478eedddf483214ba42024880efb27",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5916",
       "triggerID" : "b0350aa89d478eedddf483214ba42024880efb27",
       "triggerType" : "PUSH"
     }, {
       "hash" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6058",
       "triggerID" : "82c87f64c7c20e6bb718f7d491f0acccb003a71a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 82c87f64c7c20e6bb718f7d491f0acccb003a71a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6058) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope closed pull request #4761: [HUDI-3356][HUDI-3142][HUDI-1492] Metadata column stats index - handling delta writes

Posted by GitBox <gi...@apache.org>.
codope closed pull request #4761:
URL: https://github.com/apache/hudi/pull/4761


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org