Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/30 00:16:06 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request, #5470: [HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

alexeykudinkin opened a new pull request, #5470:
URL: https://github.com/apache/hudi/pull/5470

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Replaces the UDF-based row transformation in the Bulk Insert (row-writer) path with a direct RDD transformation over `InternalRow`s.
   
   ## Brief change log
   
   TBD
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   This change added tests and can be verified as follows:
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185923207

   ## CI report:
   
   * a2ee79f4b4309f2707539971da055263e7ec6e74 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962) 
   * 0600a70a965d19e10a4bd5c46e26ac8ed6474cfb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965) 
   * 6dfc22bdc39748d0d1b52df90375f18f84c48c6b UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1188356913

   ## CI report:
   
   * 441a54af977c110c75890e729a539496952ca76d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978) 
   * d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1191025551

   ## CI report:
   
   * b4573ac05a7bc2bea1a367778a13632ba7aefc3e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10109) 
   * 0000 Unknown: [CANCELED](TBD) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1190955634

   ## CI report:
   
   * b4573ac05a7bc2bea1a367778a13632ba7aefc3e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10109) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925029410


##########
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTimer.java:
##########
@@ -30,7 +30,17 @@
 public class HoodieTimer {
 
   // Ordered stack of TimeInfo's to make sure stopping the timer returns the correct elapsed time
-  Deque<TimeInfo> timeInfoDeque = new ArrayDeque<>();
+  private final Deque<TimeInfo> timeInfoDeque = new ArrayDeque<>();
+
+  public HoodieTimer() {
+    this(false);
+  }
+
+  public HoodieTimer(boolean shouldStart) {
+    if (shouldStart) {
+      startTimer();
+    }
+  }

Review Comment:
   Fair enough
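   
   For context, a brief usage sketch of the new constructor (Scala calling the Java API; the variable names are illustrative):
   
   ```scala
   import org.apache.hudi.common.util.HoodieTimer
   
   // Passing `true` starts the timer at construction time, saving the
   // separate startTimer() call; endTimer() pops the most recent TimeInfo
   // off the deque and returns the elapsed time in milliseconds.
   val timer = new HoodieTimer(true)
   // ... timed work ...
   val elapsedMs: Long = timer.endTimer()
   ```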





[GitHub] [hudi] vinothchandar commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1194344129

   @alexeykudinkin I want to get my understanding straight, as well as make sure we have an explanation for how these factors play out with the new changes.
   
   
   1. The original row-writer impl's overhead originated from doing `df.queryExecution.toRdd` [here](https://github.com/apache/hudi/blob/622d27a099f5dec96f992fd423b666083da4b24a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala#L160), done before the Avro record conversion. We traced this to code in Spark that makes (almost) an additional pass to materialize the Rows with a schema for use by the iterator.
   
   2. I see that in 0.11.1 we were just [processing](https://github.com/apache/hudi/blob/622d27a099f5dec96f992fd423b666083da4b24a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L74) the DataFrame as `Dataset<Row>`, hence the use of UDFs for the other functionality. This is what's been fixed in 0.12 now.
   
   
   I want to understand how we are avoiding the RDD conversion costs in the current approach. This cost becomes obvious with records that have a large number of columns (due to the per-record overhead).
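   
   For illustration, a minimal sketch (not the PR's actual code; the `id` key column and the single meta column are assumptions) contrasting the two approaches being discussed:
   
   ```scala
   import org.apache.spark.rdd.RDD
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
   import org.apache.spark.sql.functions.udf
   import org.apache.spark.unsafe.types.UTF8String
   
   object BulkInsertSketch {
   
     // 0.11.x style: a UDF per derived column over Dataset[Row]; every UDF
     // invocation deserializes the row out of the Tungsten binary format.
     def udfStyle(df: DataFrame): DataFrame = {
       val recordKey = udf { (id: String) => id } // stand-in for the key generator
       df.withColumn("_hoodie_record_key", recordKey(df("id")))
     }
   
     // This PR's style: a single mapPartitions pass over RDD[InternalRow] that
     // prepends the meta column(s). Converting the result back to a DataFrame
     // needs an internal helper (the PR's HoodieUnsafeRDDUtils.createDataFrame),
     // omitted here.
     def rddStyle(df: DataFrame): RDD[InternalRow] = {
       val schema = df.schema // capture outside the closure
       df.queryExecution.toRdd.mapPartitions { iter =>
         iter.map { row =>
           val vals = new Array[Any](schema.fields.length + 1)
           vals(0) = UTF8String.fromString("key") // placeholder record key
           row.toSeq(schema).copyToArray(vals, 1)
           new GenericInternalRow(vals)
         }
       }
     }
   }
   ```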
   
   




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185815746

   ## CI report:
   
   * a2ee79f4b4309f2707539971da055263e7ec6e74 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113872284

   ## CI report:
   
   * 1ba983dc2b5c0ad53b99b771d82925cd0d55478c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on a diff in pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r864931185


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invokes the configured [[KeyGenerator]] to produce the record key, as well as the partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>

Review Comment:
   Does this `toRdd` incur any perf hit? If yes, can you run a benchmark of the UDF-based approach vs. this one and report what you see? Alternatively, you could benchmark a raw parquet write against bulk insert with the row writer (non-partitioned, no sort mode) and ensure we see comparable numbers with this patch.
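   
   Along those lines, a minimal benchmark sketch (an active `spark` session, the /tmp paths, the table name, and the data shape are all assumptions for illustration):
   
   ```scala
   def timed[T](label: String)(block: => T): T = {
     val start = System.nanoTime()
     val result = block
     println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
     result
   }
   
   val df = spark.range(10L * 1000 * 1000)
     .selectExpr("cast(id as string) as key", "id as ts", "rand() as value")
   
   // Baseline: raw parquet write of the same data.
   timed("raw parquet write") {
     df.write.mode("overwrite").parquet("/tmp/bench_parquet")
   }
   
   // Bulk insert through the row writer, non-partitioned, no sort mode.
   timed("hudi bulk_insert (row writer)") {
     df.write.format("hudi")
       .option("hoodie.table.name", "bench")
       .option("hoodie.datasource.write.operation", "bulk_insert")
       .option("hoodie.datasource.write.recordkey.field", "key")
       .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
       .option("hoodie.datasource.write.row.writer.enable", "true")
       .option("hoodie.bulkinsert.sort.mode", "NONE")
       .mode("overwrite")
       .save("/tmp/bench_hudi")
   }
   ```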



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invokes the configured [[KeyGenerator]] to produce the record key, as well as the partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        lazy val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+          // To minimize # of allocations, we're going to allocate a single array
+          // setting all column values in place for the updated row
+          val newColVals = new Array[Any](schema.fields.length + HoodieRecord.HOODIE_META_COLUMNS.size)
+          // NOTE: Order of the fields has to match that of `HoodieRecord.HOODIE_META_COLUMNS`
+          newColVals.update(0, commitTimestamp)
+          newColVals.update(1, commitSeqNo)
+          newColVals.update(2, recordKey)
+          newColVals.update(3, partitionPath)
+          newColVals.update(4, filename)
+          // Prepend existing row column values
+          row.toSeq(schema).copyToArray(newColVals, 5)
+          new GenericInternalRow(newColVals)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF

Review Comment:
   Probably it was a gap before, but we may not have to support `dropPartitionColumns` even with the virtual-key code path. Can we fix that, please?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##########
@@ -66,12 +66,26 @@ protected BuiltinKeyGenerator(TypedProperties config) {
   @Override
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public String getRecordKey(Row row) {
+    // TODO avoid conversion to avro
+    //      since converterFn is transient, it will be re-initialized over and over again
     if (null == converterFn) {
       converterFn = AvroConversionUtils.createConverterToAvro(row.schema(), STRUCT_NAME, NAMESPACE);
     }
     return getKey(converterFn.apply(row)).getRecordKey();
   }
 
+  @Override
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public String getRecordKey(InternalRow internalRow, StructType schema) {
+    try {

Review Comment:
   With the changes in my other patch, we don't need to deserialize to `Row` to fetch the value. Can you take a look?
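   
   A sketch of that direction, using the nested-row helpers this PR introduces (usage follows the PR's tests; resolving `keyField` per record is a simplification, in practice the path would be resolved once per schema):
   
   ```scala
   import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.types.StructType
   
   // Read the key field straight out of the Catalyst representation,
   // avoiding the InternalRow -> Row -> Avro round-trip.
   def recordKeyOf(row: InternalRow, schema: StructType, keyField: String): String = {
     val path = composeNestedFieldPath(schema, keyField)
     String.valueOf(getNestedInternalRowValue(row, path))
   }
   ```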



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java:
##########
@@ -51,6 +61,24 @@ public String getRecordKey(GenericRecord record) {
     return nonpartitionedAvroKeyGenerator.getRecordKey(record);
   }
 
+  @Override
+  public String getRecordKey(Row row) {

Review Comment:
   Shouldn't we migrate this fix to `SimpleKeyGenerator`, if you feel the existing impl there could be fixed? Why make changes to `NonpartitionedKeyGenerator` only?



##########
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java:
##########
@@ -153,7 +152,37 @@ public List<HoodieInternalWriteStatus> getWriteStatuses() throws IOException {
     return writeStatusList;
   }
 
-  public void abort() {
+  public void abort() {}
+
+  public void close() throws IOException {
+    for (HoodieRowCreateHandle rowCreateHandle : handles.values()) {
+      writeStatusList.add(rowCreateHandle.close());
+    }
+    handles.clear();
+    handle = null;
+  }
+
+  private String extractPartitionPath(InternalRow row) {
+    String partitionPath;
+    if (populateMetaFields) {
+      // In case meta-fields are materialized w/in the table itself, we can just simply extract
+      // partition path from there
+      //
+      // NOTE: Helper keeps track of [[lastKnownPartitionPath]] as [[UTF8String]] to avoid
+      //       conversion from Catalyst internal representation into a [[String]]
+      partitionPath = row.getString(
+          HoodieRecord.HOODIE_META_COLUMNS_NAME_TO_POS.get(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   We can directly use the constant ordinal 3 here instead of looking it up in the hash map.
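   
   i.e. something like the following (a sketch only; the constant mirrors the fixed order of `HoodieRecord.HOODIE_META_COLUMNS` used elsewhere in this PR: commit time, seqno, record key, partition path, filename):
   
   ```scala
   import org.apache.spark.sql.catalyst.InternalRow
   
   object MetaFieldOrdinals {
     // Partition path is the 4th meta column, hence ordinal 3.
     val PartitionPathOrd = 3
   
     def partitionPathOf(row: InternalRow): String =
       row.getString(PartitionPathOrd)
   }
   ```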



##########
hudi-client/hudi-spark-client/src/test/scala/org/apache/spark/sql/TestHoodieUnsafeRowUtils.scala:
##########
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+import org.junit.jupiter.api.Assertions.{assertEquals, fail}
+import org.junit.jupiter.api.Test
+
+class TestHoodieUnsafeRowUtils {
+
+  @Test
+  def testComposeNestedFieldPath(): Unit = {
+    val schema = StructType(Seq(
+      StructField("foo", StringType),
+      StructField(
+        name = "bar",
+        dataType = StructType(Seq(
+          StructField("baz", DateType),
+          StructField("bor", LongType)
+        ))
+      )
+    ))
+
+    assertEquals(
+      Seq((1, schema(1)), (0, schema(1).dataType.asInstanceOf[StructType](0))),
+      composeNestedFieldPath(schema, "bar.baz").toSeq)
+
+    assertThrows(classOf[IllegalArgumentException]) { () =>
+      composeNestedFieldPath(schema, "foo.baz")
+    }
+  }
+
+  @Test
+  def testGetNestedRowValue(): Unit = {
+    val schema = StructType(Seq(

Review Comment:
   Minor: if you intend to use the same schema across many tests, we can make it an instance variable instead of declaring it in every test. It's immutable, so we could even make it static final.
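   
   For instance, a sketch of the suggested refactor:
   
   ```scala
   import org.apache.spark.sql.types._
   
   class TestHoodieUnsafeRowUtils {
   
     // Shared, immutable schema hoisted out of the individual test methods;
     // a val on a companion object would be the Scala analogue of static final.
     private val schema: StructType = StructType(Seq(
       StructField("foo", StringType),
       StructField(
         name = "bar",
         dataType = StructType(Seq(
           StructField("baz", DateType),
           StructField("bor", LongType))))
     ))
   }
   ```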



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invokes the configured [[KeyGenerator]] to produce the record key, as well as the partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        lazy val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+          // To minimize # of allocations, we're going to allocate a single array
+          // setting all column values in place for the updated row
+          val newColVals = new Array[Any](schema.fields.length + HoodieRecord.HOODIE_META_COLUMNS.size)
+          // NOTE: Order of the fields has to match that of `HoodieRecord.HOODIE_META_COLUMNS`
+          newColVals.update(0, commitTimestamp)
+          newColVals.update(1, commitSeqNo)
+          newColVals.update(2, recordKey)
+          newColVals.update(3, partitionPath)
+          newColVals.update(4, filename)
+          // Prepend existing row column values
+          row.toSeq(schema).copyToArray(newColVals, 5)
+          new GenericInternalRow(newColVals)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)

Review Comment:
   This may also need to be fixed for the virtual-key path, or we can call out that it's not supported for now. Even prior to this patch, we did have support for de-duping in the virtual-key flow of the row writer.
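   
   One possible shape of the virtual-key dedupe, sketched under the assumption that the key generator is available at dedupe time (illustrative only):
   
   ```scala
   import org.apache.hudi.common.model.HoodieRecord
   import org.apache.hudi.keygen.BuiltinKeyGenerator
   import org.apache.spark.sql.Row
   import org.apache.spark.sql.types.StructType
   
   // With meta fields populated, dedupe on _hoodie_record_key as today;
   // otherwise (virtual keys) derive the key through the key generator.
   def dedupeKeyOf(row: Row,
                   schema: StructType,
                   keyGen: BuiltinKeyGenerator,
                   populateMetaFields: Boolean): String =
     if (populateMetaFields) {
       row.getString(schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD))
     } else {
       keyGen.getRecordKey(row)
     }
   ```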





[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1188352584

   ## CI report:
   
   * 441a54af977c110c75890e729a539496952ca76d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978) 
   * d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189683752

   ## CI report:
   
   * 505ee485234d9768ccbabe6c69a8b77219600789 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189680303

   @hudi-bot run azure




[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r926236234


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into a Hudi table, taking the following steps:
+   *
+   * <ol>
+   *   <li>Invokes the configured [[KeyGenerator]] to produce the record key, as well as the partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)
+    val partitionPathMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.PARTITION_PATH_METADATA_FIELD)
+    // NOTE: Pre-combine field could be a nested field
+    val preCombineFieldPath = composeNestedFieldPath(df.schema, preCombineFieldRef)
+
+    val dedupedRdd = df.queryExecution.toRdd
+      .map { row =>
+        val rowKey = if (isGlobalIndex) {
+          row.getString(recordKeyMetaFieldOrd)
+        } else {
+          val partitionPath = row.getString(partitionPathMetaFieldOrd)
+          val recordKey = row.getString(recordKeyMetaFieldOrd)
+          s"$partitionPath:$recordKey"
+        }
+        // NOTE: Whenever we retain a reference to the row, it's critical to make a copy,
+        //       since Spark may be handing us a mutable row instance that gets updated during the iteration
+        (rowKey, row.copy())

Review Comment:
   Got it.
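
   (For readers following along: a minimal, standalone sketch of the copy-on-retain pattern discussed in the NOTE above. It is not the PR's code; the `SparkSession` setup, column names, and key layout are illustrative assumptions. It relies on the fact that Spark may reuse a single mutable `InternalRow` across an iterator, which is exactly why the retained row is copied.)

   ```scala
   import org.apache.spark.sql.SparkSession

   object RowCopyDedupeSketch extends App {
     val spark = SparkSession.builder().master("local[*]").appName("dedupe-sketch").getOrCreate()
     import spark.implicits._

     // Hypothetical input: (partition, key, ts) rows containing one duplicate key
     val df = Seq(("p1", "k1", 1L), ("p1", "k1", 2L), ("p2", "k2", 1L))
       .toDF("partition", "key", "ts")

     val deduped = df.queryExecution.toRdd
       .map { row =>
         // Spark may hand the whole iterator the same mutable InternalRow, so a
         // row kept past the current iteration must be copied; without copy()
         // every buffered reference could end up pointing at the last row read.
         (s"${row.getUTF8String(0)}:${row.getUTF8String(1)}", row.copy())
       }
       .reduceByKey((a, b) => if (a.getLong(2) >= b.getLong(2)) a else b) // keep the larger "ts"
       .values

     println(deduped.count()) // prints 2: one row per (partition, key) pair
     spark.stop()
   }
   ```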





[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1191021045

   ## CI report:
   
   * b4573ac05a7bc2bea1a367778a13632ba7aefc3e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113884689

   ## CI report:
   
   * 1ba983dc2b5c0ad53b99b771d82925cd0d55478c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397) 
   * b14a25c115a1a208d1e0d10088802dba680e44c9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1186010291

   ## CI report:
   
   * 6dfc22bdc39748d0d1b52df90375f18f84c48c6b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185093460

   ## CI report:
   
   * dc912614b9f4217bac743897810774e172cf81ac Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564) 
   * 52a72e09bb9724d845218eb5c408523706af5a78 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185800981

   ## CI report:
   
   * 52a72e09bb9724d845218eb5c408523706af5a78 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936) 
   * a2ee79f4b4309f2707539971da055263e7ec6e74 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189681525

   ## CI report:
   
   * 505ee485234d9768ccbabe6c69a8b77219600789 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189674781

   ## CI report:
   
   * 505ee485234d9768ccbabe6c69a8b77219600789 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925007018


##########
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTimer.java:
##########
@@ -30,7 +30,17 @@
 public class HoodieTimer {
 
  // Ordered stack of TimeInfos to make sure stopping the timer returns the correct elapsed time
-  Deque<TimeInfo> timeInfoDeque = new ArrayDeque<>();
+  private final Deque<TimeInfo> timeInfoDeque = new ArrayDeque<>();
+
+  public HoodieTimer() {
+    this(false);
+  }
+
+  public HoodieTimer(boolean shouldStart) {
+    if (shouldStart) {
+      startTimer();
+    }
+  }

Review Comment:
   Understood.  I'm saying that `timer = new HoodieTimer().startTimer()` makes the intent of starting the timer obvious at the call site, instead of requiring the reader to look up what the boolean flag represents.
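
   (A small illustration of the trade-off, as a sketch rather than a prescription: it assumes, as the fluent usage above implies, that `startTimer()` returns the timer instance and that `endTimer()` returns the elapsed milliseconds; the timed workload is made up.)

   ```scala
   import org.apache.hudi.common.util.HoodieTimer

   object TimerStyleSketch extends App {
     // Fluent style argued for above: the call site states plainly that the
     // timer is started, since startTimer() returns the timer itself.
     val fluent = new HoodieTimer().startTimer()
     Thread.sleep(50) // hypothetical unit of work being measured
     println(s"fluent: ${fluent.endTimer()} ms")

     // Boolean-flag style from the diff: more compact, but `true` stays opaque
     // until the reader checks what the constructor parameter means.
     val flagged = new HoodieTimer(true)
     Thread.sleep(50)
     println(s"flagged: ${flagged.endTimer()} ms")
   }
   ```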





[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118129741

   ## CI report:
   
   * 9c7e7eacfd743264be7c0d2e6bc18165722358f9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1186069170

   ## CI report:
   
   * 6dfc22bdc39748d0d1b52df90375f18f84c48c6b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967) 
   * 441a54af977c110c75890e729a539496952ca76d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185091578

   ## CI report:
   
   * dc912614b9f4217bac743897810774e172cf81ac Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564) 
   * 52a72e09bb9724d845218eb5c408523706af5a78 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113919641

   ## CI report:
   
   * b14a25c115a1a208d1e0d10088802dba680e44c9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399) 
   * 1489d3759dfe86f9625ee533ea4ea710b32b18c3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113882890

   ## CI report:
   
   * 1ba983dc2b5c0ad53b99b771d82925cd0d55478c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397) 
   * b14a25c115a1a208d1e0d10088802dba680e44c9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118112640

   ## CI report:
   
   * dba2edf235b1bc51a170145bef25424abb2c80dd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426) 
   * 9c7e7eacfd743264be7c0d2e6bc18165722358f9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1188588883

   ## CI report:
   
   * d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042) 
   * 34b01276ff7755cf35665ee49cf957b0879cc1eb UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r926236081


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, as well as partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(

Review Comment:
   OK, for this one we can keep it.  Something to think about for consistency, to avoid bugs.





[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189387595

   ## CI report:
   
   * 34b01276ff7755cf35665ee49cf957b0879cc1eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10051) 
   * e803911e6e3e8524787d7dd8edaca1a179ae9da8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua merged pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua merged PR #5470:
URL: https://github.com/apache/hudi/pull/5470




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r862265122


##########
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java:
##########
@@ -63,16 +63,20 @@ public class BulkInsertDataInternalWriterHelper {
   private final StructType structType;
   private final Boolean arePartitionRecordsSorted;
   private final List<HoodieInternalWriteStatus> writeStatusList = new ArrayList<>();
-  private HoodieRowCreateHandle handle;
+  private final String fileIdPrefix;

Review Comment:
   These are just made final





[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113900382

   ## CI report:
   
   * b14a25c115a1a208d1e0d10088802dba680e44c9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1190746801

   ## CI report:
   
   * 505ee485234d9768ccbabe6c69a8b77219600789 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1191290020

   ## CI report:
   
   * b4573ac05a7bc2bea1a367778a13632ba7aefc3e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10109) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1186068116

   ## CI report:
   
   * 6dfc22bdc39748d0d1b52df90375f18f84c48c6b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967) 
   * 441a54af977c110c75890e729a539496952ca76d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r922484376


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java:
##########
@@ -51,6 +61,24 @@ public String getRecordKey(GenericRecord record) {
     return nonpartitionedAvroKeyGenerator.getRecordKey(record);
   }
 
+  @Override
+  public String getRecordKey(Row row) {

Review Comment:
   This is addressed in #5523





[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r923937210


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -57,187 +92,153 @@ public int numFields() {
   }
 
   @Override
-  public void setNullAt(int i) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = null;
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = null;
-          break;
-        }
-        case 2: {
-          this.recordKey = null;
-          break;
-        }
-        case 3: {
-          this.partitionPath = null;
-          break;
-        }
-        case 4: {
-          this.fileName = null;
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
-      }
+  public void setNullAt(int ordinal) {
+    if (ordinal < metaFields.length) {
+      metaFields[ordinal] = null;
     } else {
-      row.setNullAt(i);
+      row.setNullAt(rebaseOrdinal(ordinal));
     }
   }
 
   @Override
-  public void update(int i, Object value) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = value.toString();
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = value.toString();
-          break;
-        }
-        case 2: {
-          this.recordKey = value.toString();
-          break;
-        }
-        case 3: {
-          this.partitionPath = value.toString();
-          break;
-        }
-        case 4: {
-          this.fileName = value.toString();
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
+  public void update(int ordinal, Object value) {
+    if (ordinal < metaFields.length) {

Review Comment:
   Do we need to check `containsMetaFields` here?
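
   For illustration, a small self-contained sketch of the dispatch in this hunk (the constructor signature is taken from a later hunk in this review; values are placeholders):

       import org.apache.hudi.client.model.HoodieInternalRow
       import org.apache.spark.sql.catalyst.InternalRow
       import org.apache.spark.unsafe.types.UTF8String

       val e = UTF8String.EMPTY_UTF8
       val row = new HoodieInternalRow(e, e, e, e, e, InternalRow.empty, false)
       // Meta ordinals (0..4) always land in the metaFields overlay:
       row.update(2, UTF8String.fromString("record-key-001")) // stored as-is
       row.update(2, "record-key-001")                        // converted via UTF8String.fromString
       // Any other value type for a meta ordinal throws IllegalArgumentException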



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -24,31 +24,66 @@
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.types.StringType$;
 import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;
 
+import java.util.Arrays;
+
 /**
- * Internal Row implementation for Hoodie Row. It wraps an {@link InternalRow} and keeps meta columns locally. But the {@link InternalRow}
- * does include the meta columns as well just that {@link HoodieInternalRow} will intercept queries for meta columns and serve from its
- * copy rather than fetching from {@link InternalRow}.
+ * Hudi internal implementation of the {@link InternalRow} allowing to extend arbitrary
+ * {@link InternalRow} overlaying Hudi-internal meta-fields on top of it.
+ *
+ * Capable of overlaying meta-fields in both cases: whether original {@link #row} contains
+ * meta columns or not. This allows to handle following use-cases allowing to avoid any
+ * manipulation (reshuffling) of the source row, by simply creating new instance
+ * of {@link HoodieInternalRow} with all the meta-values provided
+ *
+ * <ul>
+ *   <li>When meta-fields need to be prepended to the source {@link InternalRow}</li>
+ *   <li>When meta-fields need to be updated w/in the source {@link InternalRow}
+ *   ({@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} currently does not
+ *   allow in-place updates due to its memory layout)</li>
+ * </ul>
  */
 public class HoodieInternalRow extends InternalRow {
 
-  private String commitTime;
-  private String commitSeqNumber;
-  private String recordKey;
-  private String partitionPath;
-  private String fileName;
-  private InternalRow row;
-
-  public HoodieInternalRow(String commitTime, String commitSeqNumber, String recordKey, String partitionPath,
-      String fileName, InternalRow row) {
-    this.commitTime = commitTime;
-    this.commitSeqNumber = commitSeqNumber;
-    this.recordKey = recordKey;
-    this.partitionPath = partitionPath;
-    this.fileName = fileName;
+  /**
+   * Collection of meta-fields as defined by {@link HoodieRecord#HOODIE_META_COLUMNS}
+   */
+  private final UTF8String[] metaFields;
+  private final InternalRow row;
+
+  /**
+   * Specifies whether source {@link #row} contains meta-fields
+   */
+  private final boolean containsMetaFields;
+
+  public HoodieInternalRow(UTF8String commitTime,
+                           UTF8String commitSeqNumber,
+                           UTF8String recordKey,
+                           UTF8String partitionPath,
+                           UTF8String fileName,
+                           InternalRow row,
+                           boolean containsMetaFields) {
+    this.metaFields = new UTF8String[] {
+        commitTime,
+        commitSeqNumber,
+        recordKey,
+        partitionPath,
+        fileName
+    };
+
     this.row = row;
+    this.containsMetaFields = containsMetaFields;

Review Comment:
   If `containsMetaFields` is false, should the length of `metaFields` be 0?
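
   For reference, a hedged illustration of the two use-cases the class Javadoc above names, using the constructor from this hunk (placeholder values):

       import org.apache.hudi.client.model.HoodieInternalRow
       import org.apache.spark.sql.catalyst.InternalRow
       import org.apache.spark.unsafe.types.UTF8String

       val ts = UTF8String.fromString("20220430001606")
       val e = UTF8String.EMPTY_UTF8
       // Source row has no meta columns yet: the five meta-fields are logically prepended
       val prepended = new HoodieInternalRow(ts, e, e, e, e, InternalRow.empty, false)
       // Source row already carries meta columns: the overlay shadows them in place
       val shadowed = new HoodieInternalRow(ts, e, e, e, e, InternalRow.empty, true)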



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -57,187 +92,153 @@ public int numFields() {
   }
 
   @Override
-  public void setNullAt(int i) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = null;
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = null;
-          break;
-        }
-        case 2: {
-          this.recordKey = null;
-          break;
-        }
-        case 3: {
-          this.partitionPath = null;
-          break;
-        }
-        case 4: {
-          this.fileName = null;
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
-      }
+  public void setNullAt(int ordinal) {
+    if (ordinal < metaFields.length) {
+      metaFields[ordinal] = null;
     } else {
-      row.setNullAt(i);
+      row.setNullAt(rebaseOrdinal(ordinal));
     }
   }
 
   @Override
-  public void update(int i, Object value) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = value.toString();
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = value.toString();
-          break;
-        }
-        case 2: {
-          this.recordKey = value.toString();
-          break;
-        }
-        case 3: {
-          this.partitionPath = value.toString();
-          break;
-        }
-        case 4: {
-          this.fileName = value.toString();
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
+  public void update(int ordinal, Object value) {
+    if (ordinal < metaFields.length) {
+      if (value instanceof UTF8String) {
+        metaFields[ordinal] = (UTF8String) value;
+      } else if (value instanceof String) {
+        metaFields[ordinal] = UTF8String.fromString((String) value);
+      } else {
+        throw new IllegalArgumentException(
+            String.format("Could not update the row at (%d) with value of type (%s), either UTF8String or String are expected", ordinal, value.getClass().getSimpleName()));
       }
     } else {
-      row.update(i, value);
+      row.update(rebaseOrdinal(ordinal), value);
     }
   }
 
-  private String getMetaColumnVal(int ordinal) {
-    switch (ordinal) {
-      case 0: {
-        return commitTime;
-      }
-      case 1: {
-        return commitSeqNumber;
-      }
-      case 2: {
-        return recordKey;
-      }
-      case 3: {
-        return partitionPath;
-      }
-      case 4: {
-        return fileName;
-      }
-      default: throw new IllegalArgumentException("Not expected");
+  @Override
+  public boolean isNullAt(int ordinal) {
+    if (ordinal < metaFields.length) {
+      return metaFields[ordinal] == null;
     }
+    return row.isNullAt(rebaseOrdinal(ordinal));
   }
 
   @Override
-  public boolean isNullAt(int ordinal) {
+  public UTF8String getUTF8String(int ordinal) {
+    if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
+      return metaFields[ordinal];
+    }
+    return row.getUTF8String(rebaseOrdinal(ordinal));
+  }
+
+  @Override
+  public Object get(int ordinal, DataType dataType) {
     if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      return null == getMetaColumnVal(ordinal);
+      validateMetaFieldDataType(dataType);
+      return metaFields[ordinal];
     }
-    return row.isNullAt(ordinal);
+    return row.get(rebaseOrdinal(ordinal), dataType);
   }
 
   @Override
   public boolean getBoolean(int ordinal) {
-    return row.getBoolean(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Boolean.class);
+    return row.getBoolean(rebaseOrdinal(ordinal));
   }
 
   @Override
   public byte getByte(int ordinal) {
-    return row.getByte(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Byte.class);
+    return row.getByte(rebaseOrdinal(ordinal));
   }
 
   @Override
   public short getShort(int ordinal) {
-    return row.getShort(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Short.class);
+    return row.getShort(rebaseOrdinal(ordinal));
   }
 
   @Override
   public int getInt(int ordinal) {
-    return row.getInt(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Integer.class);
+    return row.getInt(rebaseOrdinal(ordinal));
   }
 
   @Override
   public long getLong(int ordinal) {
-    return row.getLong(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Long.class);
+    return row.getLong(rebaseOrdinal(ordinal));
   }
 
   @Override
   public float getFloat(int ordinal) {
-    return row.getFloat(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Float.class);
+    return row.getFloat(rebaseOrdinal(ordinal));
   }
 
   @Override
   public double getDouble(int ordinal) {
-    return row.getDouble(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Double.class);
+    return row.getDouble(rebaseOrdinal(ordinal));
   }
 
   @Override
   public Decimal getDecimal(int ordinal, int precision, int scale) {
-    return row.getDecimal(ordinal, precision, scale);
-  }
-
-  @Override
-  public UTF8String getUTF8String(int ordinal) {
-    if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      return UTF8String.fromBytes(getMetaColumnVal(ordinal).getBytes());
-    }
-    return row.getUTF8String(ordinal);
-  }
-
-  @Override
-  public String getString(int ordinal) {
-    if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      return new String(getMetaColumnVal(ordinal).getBytes());
-    }
-    return row.getString(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Decimal.class);
+    return row.getDecimal(rebaseOrdinal(ordinal), precision, scale);
   }
 
   @Override
   public byte[] getBinary(int ordinal) {
-    return row.getBinary(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Byte[].class);
+    return row.getBinary(rebaseOrdinal(ordinal));
   }
 
   @Override
   public CalendarInterval getInterval(int ordinal) {
-    return row.getInterval(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, CalendarInterval.class);
+    return row.getInterval(rebaseOrdinal(ordinal));
   }
 
   @Override
   public InternalRow getStruct(int ordinal, int numFields) {
-    return row.getStruct(ordinal, numFields);
+    ruleOutMetaFieldsAccess(ordinal, InternalRow.class);
+    return row.getStruct(rebaseOrdinal(ordinal), numFields);
   }
 
   @Override
   public ArrayData getArray(int ordinal) {
-    return row.getArray(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, ArrayData.class);
+    return row.getArray(rebaseOrdinal(ordinal));
   }
 
   @Override
   public MapData getMap(int ordinal) {
-    return row.getMap(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, MapData.class);
+    return row.getMap(rebaseOrdinal(ordinal));
   }
 
   @Override
-  public Object get(int ordinal, DataType dataType) {
-    if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      return UTF8String.fromBytes(getMetaColumnVal(ordinal).getBytes());
+  public InternalRow copy() {
+    return new HoodieInternalRow(Arrays.copyOf(metaFields, metaFields.length), row.copy(), containsMetaFields);
+  }
+
+  private int rebaseOrdinal(int ordinal) {
+    // NOTE: In cases when source row does not contain meta fields, we will have to
+    //       rebase ordinal onto its indexes
+    return containsMetaFields ? ordinal : ordinal - metaFields.length;

Review Comment:
   If the source row does not contain meta fields (`containsMetaFields` is false), and assuming `metaFields` is empty, wouldn't the ordinal-adjusting logic here be unnecessary?
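
   For anyone following along, a minimal standalone sketch (not code from the PR) of the ordinal translation being discussed, assuming the five meta fields are logically prepended to the wrapped row:

       val numMetaFields = 5 // HoodieRecord.HOODIE_META_COLUMNS.size()

       def rebaseOrdinal(ordinal: Int, containsMetaFields: Boolean): Int =
         if (containsMetaFields) ordinal // source row already holds meta columns at 0..4
         else ordinal - numMetaFields    // source row's own fields start at logical ordinal 5

       assert(rebaseOrdinal(7, containsMetaFields = true) == 7)  // maps straight through
       assert(rebaseOrdinal(7, containsMetaFields = false) == 2) // shifted past the overlay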



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieInternalRowFileWriter.java:
##########
@@ -37,7 +38,7 @@ public interface HoodieInternalRowFileWriter {
    *
    * @throws IOException on any exception while writing.
    */
-  void writeRow(String key, InternalRow row) throws IOException;
+  void writeRow(UTF8String key, InternalRow row) throws IOException;

Review Comment:
   Is the usage of `UTF8String` type for performance?



##########
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTimer.java:
##########
@@ -30,7 +30,17 @@
 public class HoodieTimer {
 
   // Ordered stack of TimeInfo's to make sure stopping the timer returns the correct elapsed time
-  Deque<TimeInfo> timeInfoDeque = new ArrayDeque<>();
+  private final Deque<TimeInfo> timeInfoDeque = new ArrayDeque<>();
+
+  public HoodieTimer() {
+    this(false);
+  }
+
+  public HoodieTimer(boolean shouldStart) {
+    if (shouldStart) {
+      startTimer();
+    }
+  }

Review Comment:
   This is not obvious (`timer = new HoodieTimer(true)`) compared to the existing way (`timer = new HoodieTimer().startTimer()`).  Should we revert the change?
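
   Side by side, the two constructions being compared (assuming, as in the comment above, that `startTimer()` returns the timer):

       import org.apache.hudi.common.util.HoodieTimer

       val implicitStart = new HoodieTimer(true)          // new style: starts inside the constructor
       val explicitStart = new HoodieTimer().startTimer() // existing style: self-describing chain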



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());

Review Comment:
   So the byte content of the `String` and `UTF8String` instances should be identical here, right?  That way the bloom filter lookup based on the String key is not affected.
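
   A quick standalone check of that invariant (a sketch, not code from the PR): `UTF8String` holds UTF-8 encoded bytes, so `getBytes` should return the same content as the Java String's UTF-8 encoding.

       import java.nio.charset.StandardCharsets
       import org.apache.spark.unsafe.types.UTF8String

       val key = "2022/04/30/some-record-key"
       val utf8Bytes = UTF8String.fromString(key).getBytes
       val javaBytes = key.getBytes(StandardCharsets.UTF_8)
       assert(java.util.Arrays.equals(utf8Bytes, javaBytes)) // same bytes feed the bloom filter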



##########
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java:
##########
@@ -88,13 +91,21 @@ public BulkInsertDataInternalWriterHelper(HoodieTable hoodieTable, HoodieWriteCo
     this.populateMetaFields = populateMetaFields;
     this.arePartitionRecordsSorted = arePartitionRecordsSorted;
     this.fileIdPrefix = UUID.randomUUID().toString();
+
     if (!populateMetaFields) {
       this.keyGeneratorOpt = getKeyGenerator(writeConfig.getProps());
-      if (keyGeneratorOpt.isPresent() && keyGeneratorOpt.get() instanceof SimpleKeyGenerator) {
-        simpleKeyGen = true;
-        simplePartitionFieldIndex = (Integer) structType.getFieldIndex((keyGeneratorOpt.get()).getPartitionPathFields().get(0)).get();
-        simplePartitionFieldDataType = structType.fields()[simplePartitionFieldIndex].dataType();
-      }
+    } else {
+      this.keyGeneratorOpt = Option.empty();
+    }
+
+    if (keyGeneratorOpt.isPresent() && keyGeneratorOpt.get() instanceof SimpleKeyGenerator) {
+      this.simpleKeyGen = true;
+      this.simplePartitionFieldIndex = (Integer) structType.getFieldIndex(keyGeneratorOpt.get().getPartitionPathFields().get(0)).get();
+      this.simplePartitionFieldDataType = structType.fields()[simplePartitionFieldIndex].dataType();
+    } else {
+      this.simpleKeyGen = false;
+      this.simplePartitionFieldIndex = -1;
+      this.simplePartitionFieldDataType = null;

Review Comment:
   This question is out of scope for this PR, but why do we need special-case handling for the simple key generator here?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##########
@@ -102,20 +114,20 @@ public String getPartitionPath(InternalRow internalRow, StructType structType) {
       return RowKeyGeneratorHelper.getPartitionPathFromInternalRow(internalRow, getPartitionPathFields(),
           hiveStylePartitioning, partitionPathSchemaInfo);
     } catch (Exception e) {
-      throw new HoodieIOException("Conversion of InternalRow to Row failed with exception " + e);
+      throw new HoodieException("Conversion of InternalRow to Row failed with exception", e);
     }
   }
 
   void buildFieldSchemaInfoIfNeeded(StructType structType) {
     if (this.structType == null) {
+      this.structType = structType;

Review Comment:
   Does the change of order have any side effect?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());
+
+    if (minRecordKey == null || minRecordKey.compareTo(recordKey) < 0) {
+      minRecordKey =  recordKey.copy();
     }
 
-    if (maxRecordKey != null) {
-      maxRecordKey = maxRecordKey.compareTo(recordKey) >= 0 ? maxRecordKey : recordKey;
-    } else {
-      maxRecordKey = recordKey;
+    if (maxRecordKey == null || maxRecordKey.compareTo(recordKey) > 0) {
+      maxRecordKey = recordKey.copy();

Review Comment:
   Could the `copy()` here introduce overhead?  Does it have to be a deep copy?
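
   For context, a standalone sketch of why retaining a `UTF8String` beyond the current row can require a copy (illustrative, not code from the PR): `UTF8String.fromBytes` wraps the given array without copying, similar to a string backed by a reused `UnsafeRow` buffer.

       import org.apache.spark.unsafe.types.UTF8String

       val buffer = "key-A".getBytes("UTF-8")
       val view = UTF8String.fromBytes(buffer) // wraps the array, no copy
       val detached = view.copy()              // materializes its own backing bytes
       buffer(4) = 'B'.toByte                  // simulate the backing buffer being reused
       assert(view.toString == "key-B")        // the view observes the mutation
       assert(detached.toString == "key-A")    // the copy is stable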



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {

Review Comment:
   Is this refactored based on `hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java`?



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieUnsafeRowUtils.scala:
##########
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StructField, StructType}
+
+import scala.collection.mutable.ArrayBuffer
+
+object HoodieUnsafeRowUtils {

Review Comment:
   Is this code copied from Spark OSS?  Wondering if it works across Spark versions.



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, as well as partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(

Review Comment:
   nit: use `HoodieRecord.HOODIE_META_COLUMNS` and a transformation to form the meta fields?
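
   One way that nit could look (a sketch, assuming `HOODIE_META_COLUMNS` lists the five meta column names in the same order as the fields above):

       import scala.collection.JavaConverters.asScalaBufferConverter
       import org.apache.hudi.common.model.HoodieRecord
       import org.apache.spark.sql.types.{StringType, StructField}

       val metaFields = HoodieRecord.HOODIE_META_COLUMNS.asScala
         .map(col => StructField(col, StringType))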



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)
+    val partitionPathMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.PARTITION_PATH_METADATA_FIELD)
+    // NOTE: Pre-combine field could be a nested field
+    val preCombineFieldPath = composeNestedFieldPath(df.schema, preCombineFieldRef)
+
+    val dedupedRdd = df.queryExecution.toRdd
+      .map { row =>
+        val rowKey = if (isGlobalIndex) {
+          row.getString(recordKeyMetaFieldOrd)
+        } else {
+          val partitionPath = row.getString(partitionPathMetaFieldOrd)
+          val recordKey = row.getString(recordKeyMetaFieldOrd)
+          s"$partitionPath:$recordKey"
+        }
+        // NOTE: It's critical whenever we keep the reference to the row, to make a copy
+        //       since Spark might be providing us with a mutable copy (updated during the iteration)
+        (rowKey, row.copy())

Review Comment:
   is `row.copy()` needed here for `reduceByKey`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925021322


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)
+    val partitionPathMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.PARTITION_PATH_METADATA_FIELD)
+    // NOTE: Pre-combine field could be a nested field
+    val preCombineFieldPath = composeNestedFieldPath(df.schema, preCombineFieldRef)
+
+    val dedupedRdd = df.queryExecution.toRdd
+      .map { row =>
+        val rowKey = if (isGlobalIndex) {
+          row.getString(recordKeyMetaFieldOrd)
+        } else {
+          val partitionPath = row.getString(partitionPathMetaFieldOrd)
+          val recordKey = row.getString(recordKeyMetaFieldOrd)
+          s"$partitionPath:$recordKey"
+        }
+        // NOTE: It's critical whenever we keep the reference to the row, to make a copy
+        //       since Spark might be providing us with a mutable copy (updated during the iteration)
+        (rowKey, row.copy())

Review Comment:
   OK. Later on, I think we need to revisit this pattern of `copy()` in the DAG to make sure the copies are actually needed.  Could you create a ticket?



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(

Review Comment:
   That's fair.  But shouldn't all call sites use the same order, so that the ordering is maintained in one place (e.g., `HoodieRecord.HOODIE_META_COLUMNS`) to avoid discrepancies?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189391772

   ## CI report:
   
   * 34b01276ff7755cf35665ee49cf957b0879cc1eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10051) 
   * e803911e6e3e8524787d7dd8edaca1a179ae9da8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10066) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r922483301


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>

Review Comment:
   `toRdd` is how Datasets eventually get executed in Spark, so there's no perf hit from using it.
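   
   A hedged sketch of the distinction, for context: `df.rdd` deserializes every `InternalRow` into a `Row` through the encoder, while `df.queryExecution.toRdd` hands back the physical plan's `InternalRow`s as-is.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[1]").appName("toRddSketch").getOrCreate()
   val df = spark.range(10).toDF("id")
   
   // Goes through the encoder, converting each record: RDD[Row].
   val rowRdd = df.rdd
   
   // Raw physical output, no per-record conversion: RDD[InternalRow]. This is
   // the same execution path Spark itself takes when running the Dataset.
   val internalRowRdd = df.queryExecution.toRdd
   ```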



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925005472


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -57,187 +92,153 @@ public int numFields() {
   }
 
   @Override
-  public void setNullAt(int i) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = null;
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = null;
-          break;
-        }
-        case 2: {
-          this.recordKey = null;
-          break;
-        }
-        case 3: {
-          this.partitionPath = null;
-          break;
-        }
-        case 4: {
-          this.fileName = null;
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
-      }
+  public void setNullAt(int ordinal) {
+    if (ordinal < metaFields.length) {
+      metaFields[ordinal] = null;
     } else {
-      row.setNullAt(i);
+      row.setNullAt(rebaseOrdinal(ordinal));
     }
   }
 
   @Override
-  public void update(int i, Object value) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = value.toString();
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = value.toString();
-          break;
-        }
-        case 2: {
-          this.recordKey = value.toString();
-          break;
-        }
-        case 3: {
-          this.partitionPath = value.toString();
-          break;
-        }
-        case 4: {
-          this.fileName = value.toString();
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
+  public void update(int ordinal, Object value) {
+    if (ordinal < metaFields.length) {

Review Comment:
   Makes sense to me now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1190744358

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r924811553


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -24,31 +24,66 @@
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.types.StringType$;
 import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;
 
+import java.util.Arrays;
+
 /**
- * Internal Row implementation for Hoodie Row. It wraps an {@link InternalRow} and keeps meta columns locally. But the {@link InternalRow}
- * does include the meta columns as well just that {@link HoodieInternalRow} will intercept queries for meta columns and serve from its
- * copy rather than fetching from {@link InternalRow}.
+ * Hudi internal implementation of the {@link InternalRow} allowing to extend arbitrary
+ * {@link InternalRow} overlaying Hudi-internal meta-fields on top of it.
+ *
+ * Capable of overlaying meta-fields in both cases: whether original {@link #row} contains
+ * meta columns or not. This allows to handle following use-cases allowing to avoid any
+ * manipulation (reshuffling) of the source row, by simply creating new instance
+ * of {@link HoodieInternalRow} with all the meta-values provided
+ *
+ * <ul>
+ *   <li>When meta-fields need to be prepended to the source {@link InternalRow}</li>
+ *   <li>When meta-fields need to be updated w/in the source {@link InternalRow}
+ *   ({@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} currently does not
+ *   allow in-place updates due to its memory layout)</li>
+ * </ul>
  */
 public class HoodieInternalRow extends InternalRow {
 
-  private String commitTime;
-  private String commitSeqNumber;
-  private String recordKey;
-  private String partitionPath;
-  private String fileName;
-  private InternalRow row;
-
-  public HoodieInternalRow(String commitTime, String commitSeqNumber, String recordKey, String partitionPath,
-      String fileName, InternalRow row) {
-    this.commitTime = commitTime;
-    this.commitSeqNumber = commitSeqNumber;
-    this.recordKey = recordKey;
-    this.partitionPath = partitionPath;
-    this.fileName = fileName;
+  /**
+   * Collection of meta-fields as defined by {@link HoodieRecord#HOODIE_META_COLUMNS}
+   */
+  private final UTF8String[] metaFields;
+  private final InternalRow row;
+
+  /**
+   * Specifies whether source {@link #row} contains meta-fields
+   */
+  private final boolean containsMetaFields;
+
+  public HoodieInternalRow(UTF8String commitTime,
+                           UTF8String commitSeqNumber,
+                           UTF8String recordKey,
+                           UTF8String partitionPath,
+                           UTF8String fileName,
+                           InternalRow row,
+                           boolean containsMetaFields) {
+    this.metaFields = new UTF8String[] {
+        commitTime,
+        commitSeqNumber,
+        recordKey,
+        partitionPath,
+        fileName
+    };
+
     this.row = row;
+    this.containsMetaFields = containsMetaFields;

Review Comment:
   I'm going to update the docs to make this clearer.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -24,31 +24,66 @@
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.types.StringType$;
 import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;
 
+import java.util.Arrays;
+
 /**
- * Internal Row implementation for Hoodie Row. It wraps an {@link InternalRow} and keeps meta columns locally. But the {@link InternalRow}
- * does include the meta columns as well just that {@link HoodieInternalRow} will intercept queries for meta columns and serve from its
- * copy rather than fetching from {@link InternalRow}.
+ * Hudi internal implementation of the {@link InternalRow} allowing to extend arbitrary
+ * {@link InternalRow} overlaying Hudi-internal meta-fields on top of it.
+ *
+ * Capable of overlaying meta-fields in both cases: whether original {@link #row} contains
+ * meta columns or not. This allows to handle following use-cases allowing to avoid any
+ * manipulation (reshuffling) of the source row, by simply creating new instance
+ * of {@link HoodieInternalRow} with all the meta-values provided
+ *
+ * <ul>
+ *   <li>When meta-fields need to be prepended to the source {@link InternalRow}</li>
+ *   <li>When meta-fields need to be updated w/in the source {@link InternalRow}
+ *   ({@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} currently does not
+ *   allow in-place updates due to its memory layout)</li>
+ * </ul>
  */
 public class HoodieInternalRow extends InternalRow {
 
-  private String commitTime;
-  private String commitSeqNumber;
-  private String recordKey;
-  private String partitionPath;
-  private String fileName;
-  private InternalRow row;
-
-  public HoodieInternalRow(String commitTime, String commitSeqNumber, String recordKey, String partitionPath,
-      String fileName, InternalRow row) {
-    this.commitTime = commitTime;
-    this.commitSeqNumber = commitSeqNumber;
-    this.recordKey = recordKey;
-    this.partitionPath = partitionPath;
-    this.fileName = fileName;
+  /**
+   * Collection of meta-fields as defined by {@link HoodieRecord#HOODIE_META_COLUMNS}
+   */
+  private final UTF8String[] metaFields;
+  private final InternalRow row;
+
+  /**
+   * Specifies whether source {@link #row} contains meta-fields
+   */
+  private final boolean containsMetaFields;
+
+  public HoodieInternalRow(UTF8String commitTime,
+                           UTF8String commitSeqNumber,
+                           UTF8String recordKey,
+                           UTF8String partitionPath,
+                           UTF8String fileName,
+                           InternalRow row,
+                           boolean containsMetaFields) {
+    this.metaFields = new UTF8String[] {
+        commitTime,
+        commitSeqNumber,
+        recordKey,
+        partitionPath,
+        fileName
+    };
+
     this.row = row;
+    this.containsMetaFields = containsMetaFields;

Review Comment:
   There's some confusion: `containsMetaFields` refers to whether the inner row itself contains the meta-fields. However, HIR will always override the meta-fields by overlaying them on top of whatever the source row contains (this is necessary because `UnsafeRow` can't be updated in place).
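   
   To illustrate (a sketch only, using the constructor from the diff above; `sourceRow` is a hypothetical stand-in):
   
   ```scala
   import org.apache.hudi.client.model.HoodieInternalRow
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.unsafe.types.UTF8String
   
   // Works the same whether or not sourceRow already carries meta columns.
   def overlay(sourceRow: InternalRow, alreadyHasMetaColumns: Boolean): InternalRow =
     new HoodieInternalRow(
       UTF8String.fromString("20220715000000"), // commit time overlay
       UTF8String.EMPTY_UTF8,                   // commit seq no
       UTF8String.fromString("key-1"),          // record key
       UTF8String.fromString("2022/07/15"),     // partition path
       UTF8String.EMPTY_UTF8,                   // file name
       sourceRow,
       alreadyHasMetaColumns)
   // Reads of ordinals 0..4 on the result are served from the overlay,
   // never from sourceRow's own meta columns.
   ```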



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {

Review Comment:
   Correct. It's a simplified version converted into Scala (to handle RDDs)



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(

Review Comment:
   Good call. The reason I didn't do it in the first place is that order is critical here, and even though we're using a list, I didn't want this constraint to be obscured in another class (where order actually might not matter at all).



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)
+    val partitionPathMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.PARTITION_PATH_METADATA_FIELD)
+    // NOTE: Pre-combine field could be a nested field
+    val preCombineFieldPath = composeNestedFieldPath(df.schema, preCombineFieldRef)
+
+    val dedupedRdd = df.queryExecution.toRdd
+      .map { row =>
+        val rowKey = if (isGlobalIndex) {
+          row.getString(recordKeyMetaFieldOrd)
+        } else {
+          val partitionPath = row.getString(partitionPathMetaFieldOrd)
+          val recordKey = row.getString(recordKeyMetaFieldOrd)
+          s"$partitionPath:$recordKey"
+        }
+        // NOTE: It's critical whenever we keep the reference to the row, to make a copy
+        //       since Spark might be providing us with a mutable copy (updated during the iteration)
+        (rowKey, row.copy())

Review Comment:
   We can only get away without copying when we do one-pass (streaming-like) processing. If at any point we need to hold a reference to a row, we have to make a copy (it will fail otherwise).
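   
   A hedged illustration of the failure mode (assuming a DataFrame `df` is in scope): Spark reuses a single mutable `InternalRow` per partition, so buffered references can all end up aliasing the last value unless copied first.
   
   ```scala
   // Buffering the iterator retains references across next() calls: without
   // copy() they may all alias Spark's single reused mutable row.
   val broken = df.queryExecution.toRdd.mapPartitions(_.toArray.iterator)
   
   // Copying first gives each retained row its own backing buffer.
   val correct = df.queryExecution.toRdd.mapPartitions(_.map(_.copy()).toArray.iterator)
   ```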



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());

Review Comment:
   The bloom filter always ingests UTF-8 bytes (Java strings are encoded in UTF-16 by default).
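   
   A small standalone sketch of the point:
   
   ```scala
   import org.apache.spark.unsafe.types.UTF8String
   
   val key = UTF8String.fromString("record-key-1")
   val utf8Bytes: Array[Byte] = key.getBytes  // already UTF-8, no re-encoding
   
   // By contrast, a Java String is UTF-16 internally and needs an explicit
   // conversion before it can feed a UTF-8-based bloom filter:
   val converted: Array[Byte] = "record-key-1".getBytes("UTF-8")
   ```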



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieInternalRowFileWriter.java:
##########
@@ -37,7 +38,7 @@ public interface HoodieInternalRowFileWriter {
    *
    * @throws IOException on any exception while writing.
    */
-  void writeRow(String key, InternalRow row) throws IOException;
+  void writeRow(UTF8String key, InternalRow row) throws IOException;

Review Comment:
   Correct -- to avoid the conversion between `String` and `UTF8String`.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());
+
+    if (minRecordKey == null || minRecordKey.compareTo(recordKey) < 0) {
+      minRecordKey =  recordKey.copy();
     }
 
-    if (maxRecordKey != null) {
-      maxRecordKey = maxRecordKey.compareTo(recordKey) >= 0 ? maxRecordKey : recordKey;
-    } else {
-      maxRecordKey = recordKey;
+    if (maxRecordKey == null || maxRecordKey.compareTo(recordKey) > 0) {
+      maxRecordKey = recordKey.copy();

Review Comment:
   Good catch! It should have been `clone` instead.
   
   We need to `clone` here because `UTF8String` doesn't copy by default -- instead it points into the holding (record's) buffer, and since that buffer can be mutable, we have to make a copy of it in this case.
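   
   A hedged sketch of the hazard (hypothetical `row` with a string column at ordinal 0):
   
   ```scala
   // transientKey may point into the row's reused backing buffer...
   val transientKey = row.getUTF8String(0)
   // ...so before retaining it across rows, deep-copy the bytes it spans:
   val retained = transientKey.clone()
   ```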



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -57,187 +92,153 @@ public int numFields() {
   }
 
   @Override
-  public void setNullAt(int i) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = null;
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = null;
-          break;
-        }
-        case 2: {
-          this.recordKey = null;
-          break;
-        }
-        case 3: {
-          this.partitionPath = null;
-          break;
-        }
-        case 4: {
-          this.fileName = null;
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
-      }
+  public void setNullAt(int ordinal) {
+    if (ordinal < metaFields.length) {
+      metaFields[ordinal] = null;
     } else {
-      row.setNullAt(i);
+      row.setNullAt(rebaseOrdinal(ordinal));
     }
   }
 
   @Override
-  public void update(int i, Object value) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = value.toString();
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = value.toString();
-          break;
-        }
-        case 2: {
-          this.recordKey = value.toString();
-          break;
-        }
-        case 3: {
-          this.partitionPath = value.toString();
-          break;
-        }
-        case 4: {
-          this.fileName = value.toString();
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
+  public void update(int ordinal, Object value) {
+    if (ordinal < metaFields.length) {

Review Comment:
   Great eye! We don't need to check it here: we only use `containsMetaFields` to understand whether the source row *already* contains meta-fields, which affects how we index into the `HoodieInternalRow` (meta-fields are always prepended, and we always read meta-fields from HIR, never from the source row).
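   
   In other words, a sketch of what the rebasing plausibly looks like (the real `rebaseOrdinal` lives in the PR; this mirrors its intent, using the `containsMetaFields` and `metaFields` fields from the diff):
   
   ```scala
   // Meta-fields occupy ordinals [0, metaFields.length) of the wrapper, so
   // ordinals into the wrapped row pass through unchanged when it already
   // carries meta columns, and shift down by the meta-field count otherwise.
   def rebaseOrdinal(ordinal: Int): Int =
     if (containsMetaFields) ordinal else ordinal - metaFields.length
   ```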



##########
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTimer.java:
##########
@@ -30,7 +30,17 @@
 public class HoodieTimer {
 
   // Ordered stack of TimeInfo's to make sure stopping the timer returns the correct elapsed time
-  Deque<TimeInfo> timeInfoDeque = new ArrayDeque<>();
+  private final Deque<TimeInfo> timeInfoDeque = new ArrayDeque<>();
+
+  public HoodieTimer() {
+    this(false);
+  }
+
+  public HoodieTimer(boolean shouldStart) {
+    if (shouldStart) {
+      startTimer();
+    }
+  }

Review Comment:
   The old semantics are preserved: it works as it did before, and this just adds a new way to construct the timer when you don't want to invoke `startTimer` explicitly.
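   
   Usage under both forms (from the diff above):
   
   ```scala
   import org.apache.hudi.common.util.HoodieTimer
   
   val explicitTimer = new HoodieTimer()   // old semantics: start manually
   explicitTimer.startTimer()
   
   val autoTimer = new HoodieTimer(true)   // new convenience: started on construction
   ```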



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieUnsafeRowUtils.scala:
##########
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StructField, StructType}
+
+import scala.collection.mutable.ArrayBuffer
+
+object HoodieUnsafeRowUtils {

Review Comment:
   This is our code. We're now testing across all the major Spark versions we run against, so this is pretty well-tested. Also, the API we're using here is fairly stable and doesn't change much between Spark versions.



##########
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java:
##########
@@ -88,13 +91,21 @@ public BulkInsertDataInternalWriterHelper(HoodieTable hoodieTable, HoodieWriteCo
     this.populateMetaFields = populateMetaFields;
     this.arePartitionRecordsSorted = arePartitionRecordsSorted;
     this.fileIdPrefix = UUID.randomUUID().toString();
+
     if (!populateMetaFields) {
       this.keyGeneratorOpt = getKeyGenerator(writeConfig.getProps());
-      if (keyGeneratorOpt.isPresent() && keyGeneratorOpt.get() instanceof SimpleKeyGenerator) {
-        simpleKeyGen = true;
-        simplePartitionFieldIndex = (Integer) structType.getFieldIndex((keyGeneratorOpt.get()).getPartitionPathFields().get(0)).get();
-        simplePartitionFieldDataType = structType.fields()[simplePartitionFieldIndex].dataType();
-      }
+    } else {
+      this.keyGeneratorOpt = Option.empty();
+    }
+
+    if (keyGeneratorOpt.isPresent() && keyGeneratorOpt.get() instanceof SimpleKeyGenerator) {
+      this.simpleKeyGen = true;
+      this.simplePartitionFieldIndex = (Integer) structType.getFieldIndex(keyGeneratorOpt.get().getPartitionPathFields().get(0)).get();
+      this.simplePartitionFieldDataType = structType.fields()[simplePartitionFieldIndex].dataType();
+    } else {
+      this.simpleKeyGen = false;
+      this.simplePartitionFieldIndex = -1;
+      this.simplePartitionFieldDataType = null;

Review Comment:
   We actually do not (it was done for perf reasons before). This is addressed in #5523



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##########
@@ -102,20 +114,20 @@ public String getPartitionPath(InternalRow internalRow, StructType structType) {
       return RowKeyGeneratorHelper.getPartitionPathFromInternalRow(internalRow, getPartitionPathFields(),
           hiveStylePartitioning, partitionPathSchemaInfo);
     } catch (Exception e) {
-      throw new HoodieIOException("Conversion of InternalRow to Row failed with exception " + e);
+      throw new HoodieException("Conversion of InternalRow to Row failed with exception", e);
     }
   }
 
   void buildFieldSchemaInfoIfNeeded(StructType structType) {
     if (this.structType == null) {
+      this.structType = structType;

Review Comment:
   Nope



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -57,187 +92,153 @@ public int numFields() {
   }
 
   @Override
-  public void setNullAt(int i) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = null;
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = null;
-          break;
-        }
-        case 2: {
-          this.recordKey = null;
-          break;
-        }
-        case 3: {
-          this.partitionPath = null;
-          break;
-        }
-        case 4: {
-          this.fileName = null;
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
-      }
+  public void setNullAt(int ordinal) {
+    if (ordinal < metaFields.length) {
+      metaFields[ordinal] = null;
     } else {
-      row.setNullAt(i);
+      row.setNullAt(rebaseOrdinal(ordinal));
     }
   }
 
   @Override
-  public void update(int i, Object value) {
-    if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      switch (i) {
-        case 0: {
-          this.commitTime = value.toString();
-          break;
-        }
-        case 1: {
-          this.commitSeqNumber = value.toString();
-          break;
-        }
-        case 2: {
-          this.recordKey = value.toString();
-          break;
-        }
-        case 3: {
-          this.partitionPath = value.toString();
-          break;
-        }
-        case 4: {
-          this.fileName = value.toString();
-          break;
-        }
-        default: throw new IllegalArgumentException("Not expected");
+  public void update(int ordinal, Object value) {
+    if (ordinal < metaFields.length) {
+      if (value instanceof UTF8String) {
+        metaFields[ordinal] = (UTF8String) value;
+      } else if (value instanceof String) {
+        metaFields[ordinal] = UTF8String.fromString((String) value);
+      } else {
+        throw new IllegalArgumentException(
+            String.format("Could not update the row at (%d) with value of type (%s), either UTF8String or String are expected", ordinal, value.getClass().getSimpleName()));
       }
     } else {
-      row.update(i, value);
+      row.update(rebaseOrdinal(ordinal), value);
     }
   }
 
-  private String getMetaColumnVal(int ordinal) {
-    switch (ordinal) {
-      case 0: {
-        return commitTime;
-      }
-      case 1: {
-        return commitSeqNumber;
-      }
-      case 2: {
-        return recordKey;
-      }
-      case 3: {
-        return partitionPath;
-      }
-      case 4: {
-        return fileName;
-      }
-      default: throw new IllegalArgumentException("Not expected");
+  @Override
+  public boolean isNullAt(int ordinal) {
+    if (ordinal < metaFields.length) {
+      return metaFields[ordinal] == null;
     }
+    return row.isNullAt(rebaseOrdinal(ordinal));
   }
 
   @Override
-  public boolean isNullAt(int ordinal) {
+  public UTF8String getUTF8String(int ordinal) {
+    if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
+      return metaFields[ordinal];
+    }
+    return row.getUTF8String(rebaseOrdinal(ordinal));
+  }
+
+  @Override
+  public Object get(int ordinal, DataType dataType) {
     if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      return null == getMetaColumnVal(ordinal);
+      validateMetaFieldDataType(dataType);
+      return metaFields[ordinal];
     }
-    return row.isNullAt(ordinal);
+    return row.get(rebaseOrdinal(ordinal), dataType);
   }
 
   @Override
   public boolean getBoolean(int ordinal) {
-    return row.getBoolean(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Boolean.class);
+    return row.getBoolean(rebaseOrdinal(ordinal));
   }
 
   @Override
   public byte getByte(int ordinal) {
-    return row.getByte(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Byte.class);
+    return row.getByte(rebaseOrdinal(ordinal));
   }
 
   @Override
   public short getShort(int ordinal) {
-    return row.getShort(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Short.class);
+    return row.getShort(rebaseOrdinal(ordinal));
   }
 
   @Override
   public int getInt(int ordinal) {
-    return row.getInt(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Integer.class);
+    return row.getInt(rebaseOrdinal(ordinal));
   }
 
   @Override
   public long getLong(int ordinal) {
-    return row.getLong(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Long.class);
+    return row.getLong(rebaseOrdinal(ordinal));
   }
 
   @Override
   public float getFloat(int ordinal) {
-    return row.getFloat(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Float.class);
+    return row.getFloat(rebaseOrdinal(ordinal));
   }
 
   @Override
   public double getDouble(int ordinal) {
-    return row.getDouble(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Double.class);
+    return row.getDouble(rebaseOrdinal(ordinal));
   }
 
   @Override
   public Decimal getDecimal(int ordinal, int precision, int scale) {
-    return row.getDecimal(ordinal, precision, scale);
-  }
-
-  @Override
-  public UTF8String getUTF8String(int ordinal) {
-    if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      return UTF8String.fromBytes(getMetaColumnVal(ordinal).getBytes());
-    }
-    return row.getUTF8String(ordinal);
-  }
-
-  @Override
-  public String getString(int ordinal) {
-    if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      return new String(getMetaColumnVal(ordinal).getBytes());
-    }
-    return row.getString(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Decimal.class);
+    return row.getDecimal(rebaseOrdinal(ordinal), precision, scale);
   }
 
   @Override
   public byte[] getBinary(int ordinal) {
-    return row.getBinary(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, Byte[].class);
+    return row.getBinary(rebaseOrdinal(ordinal));
   }
 
   @Override
   public CalendarInterval getInterval(int ordinal) {
-    return row.getInterval(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, CalendarInterval.class);
+    return row.getInterval(rebaseOrdinal(ordinal));
   }
 
   @Override
   public InternalRow getStruct(int ordinal, int numFields) {
-    return row.getStruct(ordinal, numFields);
+    ruleOutMetaFieldsAccess(ordinal, InternalRow.class);
+    return row.getStruct(rebaseOrdinal(ordinal), numFields);
   }
 
   @Override
   public ArrayData getArray(int ordinal) {
-    return row.getArray(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, ArrayData.class);
+    return row.getArray(rebaseOrdinal(ordinal));
   }
 
   @Override
   public MapData getMap(int ordinal) {
-    return row.getMap(ordinal);
+    ruleOutMetaFieldsAccess(ordinal, MapData.class);
+    return row.getMap(rebaseOrdinal(ordinal));
   }
 
   @Override
-  public Object get(int ordinal, DataType dataType) {
-    if (ordinal < HoodieRecord.HOODIE_META_COLUMNS.size()) {
-      return UTF8String.fromBytes(getMetaColumnVal(ordinal).getBytes());
+  public InternalRow copy() {
+    return new HoodieInternalRow(Arrays.copyOf(metaFields, metaFields.length), row.copy(), containsMetaFields);
+  }
+
+  private int rebaseOrdinal(int ordinal) {
+    // NOTE: In cases when source row does not contain meta fields, we will have to
+    //       rebase ordinal onto its indexes
+    return containsMetaFields ? ordinal : ordinal - metaFields.length;

Review Comment:
   Please check my comments above -- we always overlay meta-fields, since we need them to be mutable (they're being updated dynamically in the writer).
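
   As a sketch of why mutability matters -- the writer stamps meta-fields in place after the row has been created (the helper and values below are made up; ordinals follow `HOODIE_META_COLUMNS`, and `update` accepts `UTF8String`/`String` per the implementation above):

     static void stampMetaFields(HoodieInternalRow row) {
       row.update(0, UTF8String.fromString("20220430001606"));    // _hoodie_commit_time
       row.update(4, UTF8String.fromString("file-0001.parquet")); // _hoodie_file_name
     }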



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1186077485

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     }, {
       "hash" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936",
       "triggerID" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962",
       "triggerID" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965",
       "triggerID" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967",
       "triggerID" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "441a54af977c110c75890e729a539496952ca76d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978",
       "triggerID" : "441a54af977c110c75890e729a539496952ca76d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 441a54af977c110c75890e729a539496952ca76d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925024764


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, as well as partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)
+    val partitionPathMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.PARTITION_PATH_METADATA_FIELD)
+    // NOTE: Pre-combine field could be a nested field
+    val preCombineFieldPath = composeNestedFieldPath(df.schema, preCombineFieldRef)
+
+    val dedupedRdd = df.queryExecution.toRdd
+      .map { row =>
+        val rowKey = if (isGlobalIndex) {
+          row.getString(recordKeyMetaFieldOrd)
+        } else {
+          val partitionPath = row.getString(partitionPathMetaFieldOrd)
+          val recordKey = row.getString(recordKeyMetaFieldOrd)
+          s"$partitionPath:$recordKey"
+        }
+        // NOTE: It's critical whenever we keep the reference to the row, to make a copy
+        //       since Spark might be providing us with a mutable copy (updated during the iteration)
+        (rowKey, row.copy())

Review Comment:
   Note that what I'm saying only applies to `InternalRow`s, which don't copy by default and instead point into a shared, mutable underlying buffer (actually holding what's been read).
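
   A small sketch of the failure mode, assuming an iterator of Spark-managed rows (e.g. `UnsafeRow`s backed by one reused buffer):

     import java.util.Iterator;
     import org.apache.spark.sql.catalyst.InternalRow;

     static InternalRow firstRow(Iterator<InternalRow> iter) {
       InternalRow first = null;
       while (iter.hasNext()) {
         InternalRow row = iter.next();
         if (first == null) {
           // Without copy(), "first" would silently change as iteration
           // advances: the iterator hands back the same mutable row object
           // pointing into a shared buffer.
           first = row.copy();
         }
       }
       return first;
     }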



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925009530


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());

Review Comment:
   Sounds good; I want to make sure there is no gap between Spark's UTF8String and UTF-8 encoding in Java, since this is going to affect the Bloom Index.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());
+
+    if (minRecordKey == null || minRecordKey.compareTo(recordKey) > 0) {
+      minRecordKey = recordKey.copy();
     }
 
-    if (maxRecordKey != null) {
-      maxRecordKey = maxRecordKey.compareTo(recordKey) >= 0 ? maxRecordKey : recordKey;
-    } else {
-      maxRecordKey = recordKey;
+    if (maxRecordKey == null || maxRecordKey.compareTo(recordKey) < 0) {
+      maxRecordKey = recordKey.copy();

Review Comment:
   Yeah, this makes more sense now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] vinothchandar commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r928336189


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##########
@@ -18,26 +18,24 @@
 
 package org.apache.hudi.keygen;
 
+import org.apache.avro.generic.GenericRecord;
 import org.apache.hudi.ApiMaturityLevel;
 import org.apache.hudi.AvroConversionUtils;
 import org.apache.hudi.PublicAPIMethod;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.collection.Pair;
-import org.apache.hudi.exception.HoodieIOException;
-
-import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.exception.HoodieException;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.StructType;
+import scala.Function1;

Review Comment:
   I don't really like Scala imports in Java (it becomes an issue for us one day when we want to shrink the Scala spread in the code). Any way we can avoid this?



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, as well as partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>

Review Comment:
   Can you confirm this issue has been resolved in the master branch?



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, as well as partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(

Review Comment:
   We should follow up and consolidate into one list in HoodieRecord. +1. Unless the other usages break with a different ordering, I don't see any reason why we wouldn't.
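
   A hedged sketch of what the consolidation could look like on the Java side, assuming `HoodieRecord.HOODIE_META_COLUMNS` remains the single source of the canonical ordering:

     import java.util.List;
     import java.util.stream.Collectors;
     import org.apache.spark.sql.types.DataTypes;
     import org.apache.spark.sql.types.Metadata;
     import org.apache.spark.sql.types.StructField;

     // Derive the meta-field StructFields from the one canonical list instead
     // of re-declaring them at every call-site.
     List<StructField> metaFields = HoodieRecord.HOODIE_META_COLUMNS.stream()
         .map(name -> new StructField(name, DataTypes.StringType, true, Metadata.empty()))
         .collect(Collectors.toList());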



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##########
@@ -66,12 +66,26 @@ protected BuiltinKeyGenerator(TypedProperties config) {
   @Override
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public String getRecordKey(Row row) {
+    // TODO avoid conversion to avro
+    //      since converterFn is transient this will be repeatedly initialized over and over again
     if (null == converterFn) {
       converterFn = AvroConversionUtils.createConverterToAvro(row.schema(), STRUCT_NAME, NAMESPACE);
     }
     return getKey(converterFn.apply(row)).getRecordKey();
   }
 
+  @Override
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public String getRecordKey(InternalRow internalRow, StructType schema) {
+    try {

Review Comment:
   Any resolution on this? Did you end up backing out the temporary changes?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());

Review Comment:
   BloomFilter's `add` takes the raw bytes directly, so we seem to be fine. It's good to trust-but-verify once, though, that `recordKey.getBytes()` is equal to `string.getBytes(StandardCharsets.UTF_8)`. @alexeykudinkin you probably checked that during development?
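
   A one-off verification sketch (the key value is made up):

     import java.nio.charset.StandardCharsets;
     import java.util.Arrays;
     import org.apache.spark.unsafe.types.UTF8String;

     String key = "2015/03/16|rider-213";
     // UTF8String.getBytes() should byte-for-byte match Java's UTF-8 encoding.
     boolean same = Arrays.equals(
         UTF8String.fromString(key).getBytes(),
         key.getBytes(StandardCharsets.UTF_8));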



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, as well as partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)
+    val partitionPathMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.PARTITION_PATH_METADATA_FIELD)
+    // NOTE: Pre-combine field could be a nested field
+    val preCombineFieldPath = composeNestedFieldPath(df.schema, preCombineFieldRef)
+
+    val dedupedRdd = df.queryExecution.toRdd
+      .map { row =>
+        val rowKey = if (isGlobalIndex) {
+          row.getString(recordKeyMetaFieldOrd)
+        } else {
+          val partitionPath = row.getString(partitionPathMetaFieldOrd)
+          val recordKey = row.getString(recordKeyMetaFieldOrd)
+          s"$partitionPath:$recordKey"
+        }
+        // NOTE: It's critical whenever we keep the reference to the row, to make a copy
+        //       since Spark might be providing us with a mutable copy (updated during the iteration)
+        (rowKey, row.copy())

Review Comment:
   What exact scenarios cause Spark to fail without the copy? Could you please expand on that?



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {

Review Comment:
   Trying to understand this better: why did this need to be in Scala?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] vinothchandar commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1194088131

   I need to convince myself of the RDD conversion in place, so this is marked "major concerns" until then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113919259

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b14a25c115a1a208d1e0d10088802dba680e44c9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399) 
   * 1489d3759dfe86f9625ee533ea4ea710b32b18c3 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118056892

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1489d3759dfe86f9625ee533ea4ea710b32b18c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403) 
   * dba2edf235b1bc51a170145bef25424abb2c80dd UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118113709

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * dba2edf235b1bc51a170145bef25424abb2c80dd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426) 
   * 9c7e7eacfd743264be7c0d2e6bc18165722358f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993][Stacked on 5497] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1123153507

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 9c7e7eacfd743264be7c0d2e6bc18165722358f9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429) 
   * dc912614b9f4217bac743897810774e172cf81ac UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on a diff in pull request #5470: [HUDI-3993][Stacked on 5497] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r869662469


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##########
@@ -39,49 +39,77 @@
 import org.apache.log4j.Logger;
 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
 
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Function;
 
 /**
  * Create handle with InternalRow for datasource implementation of bulk insert.
  */
 public class HoodieRowCreateHandle implements Serializable {
 
   private static final long serialVersionUID = 1L;
+
   private static final Logger LOG = LogManager.getLogger(HoodieRowCreateHandle.class);
-  private static final AtomicLong SEQGEN = new AtomicLong(1);
+  private static final AtomicLong GLOBAL_SEQ_NO = new AtomicLong(1);
+
+  private static final Integer RECORD_KEY_META_FIELD_ORD =
+      HoodieRecord.HOODIE_META_COLUMNS_NAME_TO_POS.get(HoodieRecord.RECORD_KEY_METADATA_FIELD);
+  private static final Integer PARTITION_PATH_META_FIELD_ORD =
+      HoodieRecord.HOODIE_META_COLUMNS_NAME_TO_POS.get(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
 
-  private final String instantTime;
-  private final int taskPartitionId;
-  private final long taskId;
-  private final long taskEpochId;
   private final HoodieTable table;
   private final HoodieWriteConfig writeConfig;
-  protected final HoodieInternalRowFileWriter fileWriter;
+
+  private final FileSystem fs;
+
   private final String partitionPath;
   private final Path path;
   private final String fileId;
-  private final FileSystem fs;
-  protected final HoodieInternalWriteStatus writeStatus;
+
+  private final boolean populateMetaFields;
+
+  private final UTF8String fileName;
+  private final UTF8String commitTime;
+  private final Function<Long, String> seqIdGenerator;
+
   private final HoodieTimer currTimer;
 
-  public HoodieRowCreateHandle(HoodieTable table, HoodieWriteConfig writeConfig, String partitionPath, String fileId,
-      String instantTime, int taskPartitionId, long taskId, long taskEpochId,
-      StructType structType) {
+  protected final HoodieInternalRowFileWriter fileWriter;
+  protected final HoodieInternalWriteStatus writeStatus;
+
+  public HoodieRowCreateHandle(HoodieTable table,
+                               HoodieWriteConfig writeConfig,
+                               String partitionPath,
+                               String fileId,
+                               String instantTime,
+                               int taskPartitionId,
+                               long taskId,
+                               long taskEpochId,
+                               StructType structType,
+                               boolean populateMetaFields) {
     this.partitionPath = partitionPath;
     this.table = table;
     this.writeConfig = writeConfig;
-    this.instantTime = instantTime;
-    this.taskPartitionId = taskPartitionId;
-    this.taskId = taskId;
-    this.taskEpochId = taskEpochId;
     this.fileId = fileId;
-    this.currTimer = new HoodieTimer();
-    this.currTimer.startTimer();
+
+    this.currTimer = new HoodieTimer(true);
+
     this.fs = table.getMetaClient().getFs();
-    this.path = makeNewPath(partitionPath);
+
+    String writeToken = getWriteToken(taskPartitionId, taskId, taskEpochId);
+    String fileName = FSUtils.makeDataFileName(instantTime, writeToken, fileId,
+        table.getMetaClient().getTableConfig().getBaseFileFormat().getFileExtension());

Review Comment:
   table.getBaseFileExtension()
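   
   For reference, a minimal sketch of the suggested shortcut, assuming `HoodieTable#getBaseFileExtension` resolves to the same value as `table.getMetaClient().getTableConfig().getBaseFileFormat().getFileExtension()`:
   
   ```scala
   import org.apache.hudi.common.fs.FSUtils
   import org.apache.hudi.table.HoodieTable
   
   // Sketch only, not the PR's final code: derive the data file name via the
   // shorter accessor the reviewer suggests.
   def makeFileName(table: HoodieTable[_, _, _, _],
                    instantTime: String,
                    writeToken: String,
                    fileId: String): String =
     FSUtils.makeDataFileName(instantTime, writeToken, fileId,
       table.getBaseFileExtension)
   ```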



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##########
@@ -39,49 +39,77 @@
 import org.apache.log4j.Logger;
 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
 
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Function;
 
 /**
  * Create handle with InternalRow for datasource implementation of bulk insert.
  */
 public class HoodieRowCreateHandle implements Serializable {
 
   private static final long serialVersionUID = 1L;
+
   private static final Logger LOG = LogManager.getLogger(HoodieRowCreateHandle.class);
-  private static final AtomicLong SEQGEN = new AtomicLong(1);
+  private static final AtomicLong GLOBAL_SEQ_NO = new AtomicLong(1);
+
+  private static final Integer RECORD_KEY_META_FIELD_ORD =
+      HoodieRecord.HOODIE_META_COLUMNS_NAME_TO_POS.get(HoodieRecord.RECORD_KEY_METADATA_FIELD);
+  private static final Integer PARTITION_PATH_META_FIELD_ORD =
+      HoodieRecord.HOODIE_META_COLUMNS_NAME_TO_POS.get(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
 
-  private final String instantTime;
-  private final int taskPartitionId;
-  private final long taskId;
-  private final long taskEpochId;
   private final HoodieTable table;
   private final HoodieWriteConfig writeConfig;
-  protected final HoodieInternalRowFileWriter fileWriter;
+
+  private final FileSystem fs;
+
   private final String partitionPath;
   private final Path path;
   private final String fileId;
-  private final FileSystem fs;
-  protected final HoodieInternalWriteStatus writeStatus;
+
+  private final boolean populateMetaFields;
+
+  private final UTF8String fileName;
+  private final UTF8String commitTime;
+  private final Function<Long, String> seqIdGenerator;
+
   private final HoodieTimer currTimer;
 
-  public HoodieRowCreateHandle(HoodieTable table, HoodieWriteConfig writeConfig, String partitionPath, String fileId,
-      String instantTime, int taskPartitionId, long taskId, long taskEpochId,
-      StructType structType) {
+  protected final HoodieInternalRowFileWriter fileWriter;
+  protected final HoodieInternalWriteStatus writeStatus;
+
+  public HoodieRowCreateHandle(HoodieTable table,
+                               HoodieWriteConfig writeConfig,
+                               String partitionPath,
+                               String fileId,
+                               String instantTime,
+                               int taskPartitionId,
+                               long taskId,
+                               long taskEpochId,
+                               StructType structType,
+                               boolean populateMetaFields) {
     this.partitionPath = partitionPath;
     this.table = table;
     this.writeConfig = writeConfig;
-    this.instantTime = instantTime;
-    this.taskPartitionId = taskPartitionId;
-    this.taskId = taskId;
-    this.taskEpochId = taskEpochId;
     this.fileId = fileId;
-    this.currTimer = new HoodieTimer();
-    this.currTimer.startTimer();
+
+    this.currTimer = new HoodieTimer(true);
+

Review Comment:
   startTimer ?
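   
   The question being raised: does `new HoodieTimer(true)` start the timer implicitly? A minimal sketch of the two forms under comparison, assuming the boolean-argument constructor auto-starts the timer (which is what the new code relies on):
   
   ```scala
   import org.apache.hudi.common.util.HoodieTimer
   
   // Old style: construct, then start explicitly.
   val explicitTimer = new HoodieTimer()
   explicitTimer.startTimer()
   
   // New style (assumed): the boolean flag starts the timer in the constructor.
   val autoStartedTimer = new HoodieTimer(true)
   ```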





[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185973331

   ## CI report:
   
   * 0600a70a965d19e10a4bd5c46e26ac8ed6474cfb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965) 
   * 6dfc22bdc39748d0d1b52df90375f18f84c48c6b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1188591364

   ## CI report:
   
   * d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042) 
   * 34b01276ff7755cf35665ee49cf957b0879cc1eb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10051) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185806581

   ## CI report:
   
   * 52a72e09bb9724d845218eb5c408523706af5a78 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936) 
   * a2ee79f4b4309f2707539971da055263e7ec6e74 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185930967

   ## CI report:
   
   * 0600a70a965d19e10a4bd5c46e26ac8ed6474cfb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965) 
   * 6dfc22bdc39748d0d1b52df90375f18f84c48c6b UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185915177

   ## CI report:
   
   * a2ee79f4b4309f2707539971da055263e7ec6e74 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962) 
   * 0600a70a965d19e10a4bd5c46e26ac8ed6474cfb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r922483854


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into a Hudi table, taking the following steps:
+   *
+   * <ol>
+   *   <li>Invokes the configured [[KeyGenerator]] to produce the record key, as well as the partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        lazy val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+          // To minimize # of allocations, we're going to allocate a single array
+          // setting all column values in place for the updated row
+          val newColVals = new Array[Any](schema.fields.length + HoodieRecord.HOODIE_META_COLUMNS.size)
+          // NOTE: The order of the fields has to match that of `HoodieRecord.HOODIE_META_COLUMNS`
+          newColVals.update(0, commitTimestamp)
+          newColVals.update(1, commitSeqNo)
+          newColVals.update(2, recordKey)
+          newColVals.update(3, partitionPath)
+          newColVals.update(4, filename)
+          // Prepend existing row column values
+          row.toSeq(schema).copyToArray(newColVals, 5)
+          new GenericInternalRow(newColVals)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Cannot drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)

Review Comment:
   Yes, virtual keys de-duping isn't supported currently
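   
   In other words, `dedupeRows` keys off the populated `_hoodie_record_key` meta column, which is why tables with virtual keys (meta fields disabled) cannot be deduped this way. A rough window-function sketch of the same effect (not the PR's RDD-based implementation) that makes the dependency on the meta column explicit:
   
   ```scala
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.{col, row_number}
   
   // Sketch: keep one row per record key, preferring the highest precombine
   // value. For a non-global index the partition-path meta column would also
   // join the partitioning key.
   def dedupeByMetaKey(df: DataFrame, preCombineField: String): DataFrame = {
     val window = Window.partitionBy(col("_hoodie_record_key"))
       .orderBy(col(preCombineField).desc)
     df.withColumn("_row_num", row_number().over(window))
       .where(col("_row_num") === 1)
       .drop("_row_num")
   }
   ```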





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r922483453


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into a Hudi table, taking the following steps:
+   *
+   * <ol>
+   *   <li>Invokes the configured [[KeyGenerator]] to produce the record key, as well as the partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>

Review Comment:
   This PR should only be applied in conjunction w/ this one #5523
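   
   Separately from the stacking note, the core of the UDF replacement is visible in the diff above: the key generator is constructed lazily, once per partition, inside `mapPartitions`, so the reflectively loaded generator is instantiated on the executors rather than serialized from the driver. A stripped-down sketch of that pattern, with a hypothetical `expensiveInit` standing in for the reflective `KeyGenerator` load:
   
   ```scala
   import org.apache.spark.rdd.RDD
   
   import scala.reflect.ClassTag
   
   // Sketch: build an expensive or non-serializable helper at most once per
   // partition. `expensiveInit` is a hypothetical stand-in for the
   // ReflectionUtils.loadClass(...) call in the code above.
   def mapWithPerPartitionHelper[T, U: ClassTag](rdd: RDD[T])
                                                (expensiveInit: () => T => U): RDD[U] =
     rdd.mapPartitions { iter =>
       // `lazy` defers construction to the first element, so empty partitions
       // never pay the initialization cost.
       lazy val mapFn = expensiveInit()
       iter.map(row => mapFn(row))
     }
   ```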





[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1190787872

   ## CI report:
   
   * 505ee485234d9768ccbabe6c69a8b77219600789 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072) 
   * b4573ac05a7bc2bea1a367778a13632ba7aefc3e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189609176

   ## CI report:
   
   * e803911e6e3e8524787d7dd8edaca1a179ae9da8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10066) 
   * 505ee485234d9768ccbabe6c69a8b77219600789 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1190794738

   ## CI report:
   
   * 505ee485234d9768ccbabe6c69a8b77219600789 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072) 
   * b4573ac05a7bc2bea1a367778a13632ba7aefc3e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10109) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113873223

   ## CI report:
   
   * 1ba983dc2b5c0ad53b99b771d82925cd0d55478c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993][Stacked on 5497] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1123155545

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 9c7e7eacfd743264be7c0d2e6bc18165722358f9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429) 
   * dc912614b9f4217bac743897810774e172cf81ac Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on a diff in pull request #5470: [HUDI-3993][Stacked on 5497] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r870235097


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invokes the configured [[KeyGenerator]] to produce the record key, as well as the partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions the dataset using the provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>

Review Comment:
   I am seeing some perf hit with this code change. Will wait to sync up with Alexey on this.
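   
   For context, a minimal sketch (not from this PR) of the `mapPartitions` pattern under discussion: heavyweight per-task setup, such as the reflective key-generator instantiation in the diff above, runs once per partition rather than once per row.
   
   ```
   import org.apache.spark.sql.SparkSession
   
   object MapPartitionsSetupSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[2]").appName("map-partitions-setup").getOrCreate()
       val rdd = spark.sparkContext.parallelize(1 to 10, numSlices = 2)
   
       val keyed = rdd.mapPartitions { iter =>
         // Stand-in for the reflective key-generator instantiation above:
         // this block executes once per partition (twice here), not once per row.
         val keyGenerator: Int => String = i => s"key-$i"
         iter.map(keyGenerator)
       }
   
       keyed.collect().foreach(println)
       spark.stop()
     }
   }
   ```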
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185118380

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     }, {
       "hash" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936",
       "triggerID" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 52a72e09bb9724d845218eb5c408523706af5a78 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1185907345

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     }, {
       "hash" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936",
       "triggerID" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962",
       "triggerID" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a2ee79f4b4309f2707539971da055263e7ec6e74 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962) 
   * 0600a70a965d19e10a4bd5c46e26ac8ed6474cfb UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1188783345

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     }, {
       "hash" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936",
       "triggerID" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962",
       "triggerID" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965",
       "triggerID" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967",
       "triggerID" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "441a54af977c110c75890e729a539496952ca76d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978",
       "triggerID" : "441a54af977c110c75890e729a539496952ca76d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042",
       "triggerID" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "34b01276ff7755cf35665ee49cf957b0879cc1eb",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10051",
       "triggerID" : "34b01276ff7755cf35665ee49cf957b0879cc1eb",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 34b01276ff7755cf35665ee49cf957b0879cc1eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10051) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925003741


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##########
@@ -24,31 +24,66 @@
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.types.StringType$;
 import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;
 
+import java.util.Arrays;
+
 /**
- * Internal Row implementation for Hoodie Row. It wraps an {@link InternalRow} and keeps meta columns locally. But the {@link InternalRow}
- * does include the meta columns as well just that {@link HoodieInternalRow} will intercept queries for meta columns and serve from its
- * copy rather than fetching from {@link InternalRow}.
+ * Hudi-internal implementation of {@link InternalRow} that wraps an arbitrary
+ * {@link InternalRow} and overlays Hudi's meta-fields on top of it.
+ *
+ * Capable of overlaying meta-fields whether or not the original {@link #row} already
+ * contains meta columns. This supports the following use-cases while avoiding any
+ * manipulation (reshuffling) of the source row, simply by creating a new instance
+ * of {@link HoodieInternalRow} with all the meta-values provided:
+ *
+ * <ul>
+ *   <li>When meta-fields need to be prepended to the source {@link InternalRow}</li>
+ *   <li>When meta-fields need to be updated w/in the source {@link InternalRow}
+ *   ({@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} currently does not
+ *   allow in-place updates due to its memory layout)</li>
+ * </ul>
  */
 public class HoodieInternalRow extends InternalRow {
 
-  private String commitTime;
-  private String commitSeqNumber;
-  private String recordKey;
-  private String partitionPath;
-  private String fileName;
-  private InternalRow row;
-
-  public HoodieInternalRow(String commitTime, String commitSeqNumber, String recordKey, String partitionPath,
-      String fileName, InternalRow row) {
-    this.commitTime = commitTime;
-    this.commitSeqNumber = commitSeqNumber;
-    this.recordKey = recordKey;
-    this.partitionPath = partitionPath;
-    this.fileName = fileName;
+  /**
+   * Collection of meta-fields as defined by {@link HoodieRecord#HOODIE_META_COLUMNS}
+   */
+  private final UTF8String[] metaFields;
+  private final InternalRow row;
+
+  /**
+   * Specifies whether source {@link #row} contains meta-fields
+   */
+  private final boolean containsMetaFields;
+
+  public HoodieInternalRow(UTF8String commitTime,
+                           UTF8String commitSeqNumber,
+                           UTF8String recordKey,
+                           UTF8String partitionPath,
+                           UTF8String fileName,
+                           InternalRow row,
+                           boolean containsMetaFields) {
+    this.metaFields = new UTF8String[] {
+        commitTime,
+        commitSeqNumber,
+        recordKey,
+        partitionPath,
+        fileName
+    };
+
     this.row = row;
+    this.containsMetaFields = containsMetaFields;

Review Comment:
   Sounds good.
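   
   For readers following along, a hypothetical usage sketch based on the constructor shown in the diff above (all values made up): the source row is wrapped as-is and the five meta-field values are overlaid without reshuffling it.
   
   ```
   import org.apache.hudi.client.model.HoodieInternalRow
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
   import org.apache.spark.unsafe.types.UTF8String
   
   val sourceRow: InternalRow = new GenericInternalRow(Array[Any](UTF8String.fromString("payload")))
   
   val overlaid: InternalRow = new HoodieInternalRow(
     UTF8String.fromString("20220430001606"),      // commit time
     UTF8String.fromString("20220430001606_0_1"),  // commit seq number
     UTF8String.fromString("record-key-1"),        // record key
     UTF8String.fromString("2022/04/30"),          // partition path
     UTF8String.fromString("file-1.parquet"),      // file name
     sourceRow,
     false)                                        // source row does not already contain meta columns
   ```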



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925017068


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());
+
+    if (minRecordKey == null || minRecordKey.compareTo(recordKey) < 0) {
+      minRecordKey =  recordKey.copy();
     }
 
-    if (maxRecordKey != null) {
-      maxRecordKey = maxRecordKey.compareTo(recordKey) >= 0 ? maxRecordKey : recordKey;
-    } else {
-      maxRecordKey = recordKey;
+    if (maxRecordKey == null || maxRecordKey.compareTo(recordKey) > 0) {
+      maxRecordKey = recordKey.copy();

Review Comment:
   So looking at the `UTF8String` implementation, should we actually use `copy()` instead:
   
   ```
   @Override
     public UTF8String clone() {
       return fromBytes(getBytes());
     }
   
     public UTF8String copy() {
       byte[] bytes = new byte[numBytes];
       copyMemory(base, offset, bytes, BYTE_ARRAY_OFFSET, numBytes);
       return fromBytes(bytes);
     }
   
   /**
      * Creates an UTF8String from byte array, which should be encoded in UTF-8.
      *
      * Note: `bytes` will be hold by returned UTF8String.
      */
     public static UTF8String fromBytes(byte[] bytes) {
       if (bytes != null) {
         return new UTF8String(bytes, BYTE_ARRAY_OFFSET, bytes.length);
       } else {
         return null;
       }
     }
   
   /**
      * Returns the underline bytes, will be a copy of it if it's part of another array.
      */
     public byte[] getBytes() {
       // avoid copy if `base` is `byte[]`
       if (offset == BYTE_ARRAY_OFFSET && base instanceof byte[]
         && ((byte[]) base).length == numBytes) {
         return (byte[]) base;
       } else {
         byte[] bytes = new byte[numBytes];
         copyMemory(base, offset, bytes, BYTE_ARRAY_OFFSET, numBytes);
         return bytes;
       }
     }
   ```
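   
   To make the distinction concrete, here is a small sketch (not from the PR): when a `UTF8String` fully covers its backing array, `clone()` ends up sharing that array, while `copy()` always allocates a fresh one:
   
   ```
   import java.nio.charset.StandardCharsets
   import org.apache.spark.unsafe.types.UTF8String
   
   val buffer = "abc".getBytes(StandardCharsets.UTF_8)
   val s = UTF8String.fromBytes(buffer)  // holds `buffer` directly, per the javadoc above
   
   val cloned = s.clone()  // getBytes() returns `buffer` itself here, so the clone shares it
   val copied = s.copy()   // always copies the bytes into a fresh array
   
   buffer(0) = 'z'.toByte  // mutate the shared buffer
   println(s"$s $cloned $copied")  // prints "zbc zbc abc"
   ```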



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1188452511

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     }, {
       "hash" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936",
       "triggerID" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962",
       "triggerID" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965",
       "triggerID" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967",
       "triggerID" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "441a54af977c110c75890e729a539496952ca76d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978",
       "triggerID" : "441a54af977c110c75890e729a539496952ca76d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042",
       "triggerID" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1190784185

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     }, {
       "hash" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936",
       "triggerID" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962",
       "triggerID" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965",
       "triggerID" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967",
       "triggerID" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "441a54af977c110c75890e729a539496952ca76d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978",
       "triggerID" : "441a54af977c110c75890e729a539496952ca76d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042",
       "triggerID" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "34b01276ff7755cf35665ee49cf957b0879cc1eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10051",
       "triggerID" : "34b01276ff7755cf35665ee49cf957b0879cc1eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e803911e6e3e8524787d7dd8edaca1a179ae9da8",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10066",
       "triggerID" : "e803911e6e3e8524787d7dd8edaca1a179ae9da8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "505ee485234d9768ccbabe6c69a8b77219600789",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072",
       "triggerID" : "505ee485234d9768ccbabe6c69a8b77219600789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "505ee485234d9768ccbabe6c69a8b77219600789",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072",
       "triggerID" : "1189680303",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "505ee485234d9768ccbabe6c69a8b77219600789",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072",
       "triggerID" : "1190744358",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 505ee485234d9768ccbabe6c69a8b77219600789 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10072) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1194362557

   TL;DR: the difference between `Row` and `InternalRow` is:
   
    - When you call `df.rdd`, you invoke a deserializer that converts the internal binary representation (`UnsafeRow`) into a `Row` holding Java native types (it also holds the schema)
   
    - `df.queryExecution.toRdd` is an internal API that returns an RDD of `InternalRow`s, avoiding that conversion (which is the primary reason many of the utilities in `HoodieUnsafeUtils` were introduced: to be able to access private Spark APIs); a short sketch follows below
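   
   A minimal sketch illustrating the two views (assumes a local `SparkSession`; not part of this PR):
   
   ```
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[1]").appName("row-vs-internal-row").getOrCreate()
   import spark.implicits._
   
   val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
   
   // `df.rdd` deserializes each internal `UnsafeRow` into an external `Row` of Java objects.
   val externalRows = df.rdd                   // RDD[org.apache.spark.sql.Row]
   externalRows.map(_.getString(1)).collect()  // Array("a", "b")
   
   // `df.queryExecution.toRdd` exposes the internal binary representation without that conversion.
   val internalRows = df.queryExecution.toRdd  // RDD[org.apache.spark.sql.catalyst.InternalRow]
   internalRows.map(_.getUTF8String(1).toString).collect()  // Array("a", "b")
   ```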
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1191023418

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b4573ac05a7bc2bea1a367778a13632ba7aefc3e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10109",
       "triggerID" : "b4573ac05a7bc2bea1a367778a13632ba7aefc3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b4573ac05a7bc2bea1a367778a13632ba7aefc3e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1189680303",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "b4573ac05a7bc2bea1a367778a13632ba7aefc3e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1190744358",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "0000",
       "status" : "CANCELED",
       "url" : "TBD",
       "triggerID" : "1189680303",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * b4573ac05a7bc2bea1a367778a13632ba7aefc3e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10109) 
   * 0000 Unknown: [CANCELED](TBD) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189497775

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     }, {
       "hash" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936",
       "triggerID" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962",
       "triggerID" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965",
       "triggerID" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967",
       "triggerID" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "441a54af977c110c75890e729a539496952ca76d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978",
       "triggerID" : "441a54af977c110c75890e729a539496952ca76d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042",
       "triggerID" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "34b01276ff7755cf35665ee49cf957b0879cc1eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10051",
       "triggerID" : "34b01276ff7755cf35665ee49cf957b0879cc1eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e803911e6e3e8524787d7dd8edaca1a179ae9da8",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10066",
       "triggerID" : "e803911e6e3e8524787d7dd8edaca1a179ae9da8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e803911e6e3e8524787d7dd8edaca1a179ae9da8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10066) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r922484121


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##########
@@ -66,12 +66,26 @@ protected BuiltinKeyGenerator(TypedProperties config) {
   @Override
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public String getRecordKey(Row row) {
+    // TODO avoid conversion to avro
+    //      since converterFn is transient this will be repeatedly initialized over and over again
     if (null == converterFn) {
       converterFn = AvroConversionUtils.createConverterToAvro(row.schema(), STRUCT_NAME, NAMESPACE);
     }
     return getKey(converterFn.apply(row)).getRecordKey();
   }
 
+  @Override
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public String getRecordKey(InternalRow internalRow, StructType schema) {
+    try {

Review Comment:
   These are temporary changes that are addressed in #5523.
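   
   For background on the TODO in the diff, a self-contained sketch (illustrative names only) of why a `@transient` field gets re-initialized: it is excluded from serialization, so it comes back `null` after the object is shipped to an executor and must be rebuilt lazily there.
   
   ```
   import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
   
   class LazyConverter(prefix: String) extends Serializable {
     @transient private var converterFn: String => String = _  // not serialized; null after deserialization
   
     def convert(value: String): String = {
       if (converterFn == null) {
         converterFn = v => s"$prefix:$v"  // stands in for AvroConversionUtils.createConverterToAvro
       }
       converterFn(value)
     }
   }
   
   val out = new ByteArrayOutputStream()
   new ObjectOutputStream(out).writeObject(new LazyConverter("avro"))
   val restored = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
     .readObject().asInstanceOf[LazyConverter]
   
   println(restored.convert("row-1"))  // rebuilds converterFn first, then prints "avro:row-1"
   ```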



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1189606260

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397",
       "triggerID" : "1ba983dc2b5c0ad53b99b771d82925cd0d55478c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8399",
       "triggerID" : "b14a25c115a1a208d1e0d10088802dba680e44c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403",
       "triggerID" : "1489d3759dfe86f9625ee533ea4ea710b32b18c3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426",
       "triggerID" : "dba2edf235b1bc51a170145bef25424abb2c80dd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429",
       "triggerID" : "9c7e7eacfd743264be7c0d2e6bc18165722358f9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "dc912614b9f4217bac743897810774e172cf81ac",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564",
       "triggerID" : "dc912614b9f4217bac743897810774e172cf81ac",
       "triggerType" : "PUSH"
     }, {
       "hash" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9936",
       "triggerID" : "52a72e09bb9724d845218eb5c408523706af5a78",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9962",
       "triggerID" : "a2ee79f4b4309f2707539971da055263e7ec6e74",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9965",
       "triggerID" : "0600a70a965d19e10a4bd5c46e26ac8ed6474cfb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9967",
       "triggerID" : "6dfc22bdc39748d0d1b52df90375f18f84c48c6b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "441a54af977c110c75890e729a539496952ca76d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9978",
       "triggerID" : "441a54af977c110c75890e729a539496952ca76d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10042",
       "triggerID" : "d02c06e0a1a95a6bd1e48b0d9b8c712c670c7bd5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "34b01276ff7755cf35665ee49cf957b0879cc1eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10051",
       "triggerID" : "34b01276ff7755cf35665ee49cf957b0879cc1eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e803911e6e3e8524787d7dd8edaca1a179ae9da8",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10066",
       "triggerID" : "e803911e6e3e8524787d7dd8edaca1a179ae9da8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "505ee485234d9768ccbabe6c69a8b77219600789",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "505ee485234d9768ccbabe6c69a8b77219600789",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e803911e6e3e8524787d7dd8edaca1a179ae9da8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10066) 
   * 505ee485234d9768ccbabe6c69a8b77219600789 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925017068


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());
+
+    if (minRecordKey == null || minRecordKey.compareTo(recordKey) < 0) {
+      minRecordKey =  recordKey.copy();
     }
 
-    if (maxRecordKey != null) {
-      maxRecordKey = maxRecordKey.compareTo(recordKey) >= 0 ? maxRecordKey : recordKey;
-    } else {
-      maxRecordKey = recordKey;
+    if (maxRecordKey == null || maxRecordKey.compareTo(recordKey) > 0) {
+      maxRecordKey = recordKey.copy();

Review Comment:
   So looking at the `UTF8String` implementation, should we actually use `copy()` instead, since `clone()` may reference the same byte array:
   
   ```
   @Override
     public UTF8String clone() {
       return fromBytes(getBytes());
     }
   
     public UTF8String copy() {
       byte[] bytes = new byte[numBytes];
       copyMemory(base, offset, bytes, BYTE_ARRAY_OFFSET, numBytes);
       return fromBytes(bytes);
     }
   
   /**
      * Creates an UTF8String from byte array, which should be encoded in UTF-8.
      *
      * Note: `bytes` will be hold by returned UTF8String.
      */
     public static UTF8String fromBytes(byte[] bytes) {
       if (bytes != null) {
         return new UTF8String(bytes, BYTE_ARRAY_OFFSET, bytes.length);
       } else {
         return null;
       }
     }
   
   /**
      * Returns the underline bytes, will be a copy of it if it's part of another array.
      */
     public byte[] getBytes() {
       // avoid copy if `base` is `byte[]`
       if (offset == BYTE_ARRAY_OFFSET && base instanceof byte[]
         && ((byte[]) base).length == numBytes) {
         return (byte[]) base;
       } else {
         byte[] bytes = new byte[numBytes];
         copyMemory(base, offset, bytes, BYTE_ARRAY_OFFSET, numBytes);
         return bytes;
       }
     }
   ```
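   
   The practical consequence for `add()` above: a `UTF8String` read out of a reused `UnsafeRow` points into that row's buffer, so a key retained across rows has to be detached first. A hedged sketch of the retention pattern (illustrative only):
   
   ```
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.unsafe.types.UTF8String
   
   def trackMinKey(rows: Iterator[InternalRow]): UTF8String = {
     var minRecordKey: UTF8String = null
     rows.foreach { row =>
       val key = row.getUTF8String(0)  // may point into a buffer Spark reuses for the next row
       if (minRecordKey == null || minRecordKey.compareTo(key) > 0) {
         minRecordKey = key.copy()     // detach from the shared buffer before it is overwritten
       }
     }
     minRecordKey
   }
   ```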



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r925026537


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(

Review Comment:
   Fair point. The problem is that the ordering only matters in a handful of contexts (compared to all usages of this list), and it's harder to justify why the ordering matters when you're looking at just the `HoodieRecord` class.
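
   As a sketch of the alternative being weighed here, the schema could be derived directly from `HoodieRecord.HOODIE_META_COLUMNS`, so the positional contract lives in one place (illustrative only, not the code that landed):

   ```
   import org.apache.hudi.common.model.HoodieRecord
   import org.apache.spark.sql.types.{StringType, StructField, StructType}

   import scala.collection.JavaConverters.asScalaBufferConverter

   // Ordinals 0..4 match HOODIE_META_COLUMNS by construction, giving code
   // that relies on positional access a single source of truth for ordering
   val metaFields: Seq[StructField] =
     HoodieRecord.HOODIE_META_COLUMNS.asScala.map(name => StructField(name, StringType))

   def withMetaFields(schema: StructType): StructType =
     StructType(metaFields ++ schema.fields)
   ```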





[GitHub] [hudi] hudi-bot commented on pull request #5470: [HUDI-3993][Stacked on 5497] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1123158511

   ## CI report:
   
   * dc912614b9f4217bac743897810774e172cf81ac Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8564) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on a diff in pull request #5470: [HUDI-3993][Stacked on 5497] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r864969575


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        lazy val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+          // To minimize # of allocations, we're going to allocate a single array
+          // setting all column values in place for the updated row
+          val newColVals = new Array[Any](schema.fields.length + HoodieRecord.HOODIE_META_COLUMNS.size)
+          // NOTE: Order of the fields have to match that one of `HoodieRecord.HOODIE_META_COLUMNS`
+          newColVals.update(0, commitTimestamp)
+          newColVals.update(1, commitSeqNo)
+          newColVals.update(2, recordKey)
+          newColVals.update(3, partitionPath)
+          newColVals.update(4, filename)
+          // Prepend existing row column values
+          row.toSeq(schema).copyToArray(newColVals, 5)
+          new GenericInternalRow(newColVals)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)

Review Comment:
   This may also need to be fixed for the virtual-key path, or we can call out that it's not supported for now. Even prior to this patch, we did not support de-duping in the virtual-key flow of the row writer.
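
   A hypothetical sketch of what the virtual-key fallback could look like -- recomputing the key via the key generator when meta fields are not populated (`dedupeKey` is an illustrative helper, not part of this PR):

   ```
   import org.apache.hudi.keygen.BuiltinKeyGenerator
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.types.StructType

   def dedupeKey(row: InternalRow,
                 schema: StructType,
                 recordKeyMetaFieldOrd: Int,
                 populateMetaFields: Boolean,
                 keyGenerator: BuiltinKeyGenerator): String =
     if (populateMetaFields) {
       // Meta fields are materialized -- read the record key directly
       row.getString(recordKeyMetaFieldOrd)
     } else {
       // Virtual-key path: recompute the key from the payload columns
       keyGenerator.getRecordKey(row, schema)
     }
   ```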





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5470: [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5470:
URL: https://github.com/apache/hudi/pull/5470#discussion_r929165914


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##########
@@ -66,12 +66,26 @@ protected BuiltinKeyGenerator(TypedProperties config) {
   @Override
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public String getRecordKey(Row row) {
+    // TODO avoid conversion to avro
+    //      since converterFn is transient this will be repeatedly initialized over and over again
     if (null == converterFn) {
       converterFn = AvroConversionUtils.createConverterToAvro(row.schema(), STRUCT_NAME, NAMESPACE);
     }
     return getKey(converterFn.apply(row)).getRecordKey();
   }
 
+  @Override
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public String getRecordKey(InternalRow internalRow, StructType schema) {
+    try {

Review Comment:
   Correct. These are revisited



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {

Review Comment:
   To avoid back-and-forth Java/Scala conversions.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -73,18 +75,15 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     return new WriteSupport.FinalizedWriteContext(extraMetaData);
   }
 
-  public void add(String recordKey) {
-    this.bloomFilter.add(recordKey);
-    if (minRecordKey != null) {
-      minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey;
-    } else {
-      minRecordKey = recordKey;
+  public void add(UTF8String recordKey) {
+    this.bloomFilter.add(recordKey.getBytes());

Review Comment:
   Correct



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>

Review Comment:
   I think the issue @nsivabalan is referring to is that this PR shouldn't be measured in isolation, but only together with #5523 (which has landed as well).



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.client.model.HoodieInternalRow
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.ReflectionUtils
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.keygen.BuiltinKeyGenerator
+import org.apache.hudi.table.BulkInsertPartitioner
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.HoodieUnsafeRDDUtils.createDataFrame
+import org.apache.spark.sql.HoodieUnsafeRowUtils.{composeNestedFieldPath, getNestedInternalRowValue}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeRDDUtils, Row}
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.JavaConverters.asScalaBufferConverter
+
+object HoodieDatasetBulkInsertHelper extends Logging {
+
+  /**
+   * Prepares [[DataFrame]] for bulk-insert into Hudi table, taking following steps:
+   *
+   * <ol>
+   *   <li>Invoking configured [[KeyGenerator]] to produce record key, alas partition-path value</li>
+   *   <li>Prepends Hudi meta-fields to every row in the dataset</li>
+   *   <li>Dedupes rows (if necessary)</li>
+   *   <li>Partitions dataset using provided [[partitioner]]</li>
+   * </ol>
+   */
+  def prepareForBulkInsert(df: DataFrame,
+                           config: HoodieWriteConfig,
+                           partitioner: BulkInsertPartitioner[Dataset[Row]],
+                           isGlobalIndex: Boolean,
+                           dropPartitionColumns: Boolean): Dataset[Row] = {
+    val populateMetaFields = config.populateMetaFields()
+    val schema = df.schema
+
+    val keyGeneratorClassName = config.getStringOrThrow(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME,
+      "Key-generator class name is required")
+
+    val prependedRdd: RDD[InternalRow] =
+      df.queryExecution.toRdd.mapPartitions { iter =>
+        val keyGenerator =
+          ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
+            .asInstanceOf[BuiltinKeyGenerator]
+
+        iter.map { row =>
+          val (recordKey, partitionPath) =
+            if (populateMetaFields) {
+              (UTF8String.fromString(keyGenerator.getRecordKey(row, schema)),
+                UTF8String.fromString(keyGenerator.getPartitionPath(row, schema)))
+            } else {
+              (UTF8String.EMPTY_UTF8, UTF8String.EMPTY_UTF8)
+            }
+          val commitTimestamp = UTF8String.EMPTY_UTF8
+          val commitSeqNo = UTF8String.EMPTY_UTF8
+          val filename = UTF8String.EMPTY_UTF8
+
+          // TODO use mutable row, avoid re-allocating
+          new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false)
+        }
+      }
+
+    val metaFields = Seq(
+      StructField(HoodieRecord.COMMIT_TIME_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.RECORD_KEY_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.PARTITION_PATH_METADATA_FIELD, StringType),
+      StructField(HoodieRecord.FILENAME_METADATA_FIELD, StringType))
+
+    val updatedSchema = StructType(metaFields ++ schema.fields)
+    val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)
+
+    if (!populateMetaFields) {
+      updatedDF
+    } else {
+      val trimmedDF = if (dropPartitionColumns) {
+        val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps)).asInstanceOf[BuiltinKeyGenerator]
+        val partitionPathFields = keyGenerator.getPartitionPathFields.asScala
+        val nestedPartitionPathFields = partitionPathFields.filter(f => f.contains('.'))
+        if (nestedPartitionPathFields.nonEmpty) {
+          logWarning(s"Can not drop nested partition path fields: $nestedPartitionPathFields")
+        }
+
+        val partitionPathCols = partitionPathFields -- nestedPartitionPathFields
+        updatedDF.drop(partitionPathCols: _*)
+      } else {
+        updatedDF
+      }
+
+      val dedupedDF = if (config.shouldCombineBeforeInsert) {
+        dedupeRows(trimmedDF, config.getPreCombineField, isGlobalIndex)
+      } else {
+        trimmedDF
+      }
+
+      partitioner.repartitionRecords(dedupedDF, config.getBulkInsertShuffleParallelism)
+    }
+  }
+
+  private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
+    val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)
+    val partitionPathMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.PARTITION_PATH_METADATA_FIELD)
+    // NOTE: Pre-combine field could be a nested field
+    val preCombineFieldPath = composeNestedFieldPath(df.schema, preCombineFieldRef)
+
+    val dedupedRdd = df.queryExecution.toRdd
+      .map { row =>
+        val rowKey = if (isGlobalIndex) {
+          row.getString(recordKeyMetaFieldOrd)
+        } else {
+          val partitionPath = row.getString(partitionPathMetaFieldOrd)
+          val recordKey = row.getString(recordKeyMetaFieldOrd)
+          s"$partitionPath:$recordKey"
+        }
+        // NOTE: It's critical whenever we keep the reference to the row, to make a copy
+        //       since Spark might be providing us with a mutable copy (updated during the iteration)
+        (rowKey, row.copy())

Review Comment:
   This exact code will fail if we remove the copy, because `InternalRow` is often a mutable instance that Spark reuses and updates during iteration. That is safe while we only access the row currently under the pointer, but the subsequent `reduceByKey` accesses two rows at the same time.
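
   To make the failure mode concrete, a minimal sketch (assumed setup; `pickLatest` and the ordinal parameter are stand-ins, not this PR's code):

   ```
   import org.apache.spark.rdd.RDD
   import org.apache.spark.sql.catalyst.InternalRow

   def dedupe(rows: RDD[InternalRow],
              recordKeyOrd: Int,
              pickLatest: (InternalRow, InternalRow) => InternalRow): RDD[InternalRow] =
     rows
       .map { row =>
         // Spark may hand this closure the same mutable row instance on every
         // call; copying detaches the emitted tuple from the reused buffer
         (row.getString(recordKeyOrd), row.copy())
       }
       // reduceByKey holds two rows at once: without the copy above, both
       // arguments to pickLatest could alias the same overwritten buffer
       .reduceByKey(pickLatest)
       .values
   ```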



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##########
@@ -18,26 +18,24 @@
 
 package org.apache.hudi.keygen;
 
+import org.apache.avro.generic.GenericRecord;
 import org.apache.hudi.ApiMaturityLevel;
 import org.apache.hudi.AvroConversionUtils;
 import org.apache.hudi.PublicAPIMethod;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.collection.Pair;
-import org.apache.hudi.exception.HoodieIOException;
-
-import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.exception.HoodieException;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.StructType;
+import scala.Function1;

Review Comment:
   This is removed in #5523





[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113882101

   ## CI report:
   
   * 1ba983dc2b5c0ad53b99b771d82925cd0d55478c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8397) 
   * b14a25c115a1a208d1e0d10088802dba680e44c9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1113936430

   ## CI report:
   
   * 1489d3759dfe86f9625ee533ea4ea710b32b18c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118058171

   ## CI report:
   
   * 1489d3759dfe86f9625ee533ea4ea710b32b18c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403) 
   * dba2edf235b1bc51a170145bef25424abb2c80dd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118061811

   ## CI report:
   
   * dba2edf235b1bc51a170145bef25424abb2c80dd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

