Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/15 03:10:21 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request, #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path

alexeykudinkin opened a new pull request, #5328:
URL: https://github.com/apache/hudi/pull/5328

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #5328: [WIP][HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1186990513

   @alexeykudinkin : please address the feedback on adding a new sort mode. 




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192484192

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079",
       "triggerID" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198",
       "triggerID" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1b0332d969b26cc5ddd7b53d4d4d9589e8a98107 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194659542

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1099810246

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 96b33942edf6a1d6d89361d2e056ed1c3a8d326b UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1099842852

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 96b33942edf6a1d6d89361d2e056ed1c3a8d326b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1193062213

   CI is green:
   https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10228&view=results




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194980961

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "82147298deaec87b776a284746826c4004bb3d73",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10341",
       "triggerID" : "82147298deaec87b776a284746826c4004bb3d73",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 82147298deaec87b776a284746826c4004bb3d73 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10341) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194293246

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079",
       "triggerID" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198",
       "triggerID" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220",
       "triggerID" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6a1cd5667097f06028d3391f2558c12292d2e3e8",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10228",
       "triggerID" : "6a1cd5667097f06028d3391f2558c12292d2e3e8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6a1cd5667097f06028d3391f2558c12292d2e3e8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10228) 
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192953917

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079",
       "triggerID" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198",
       "triggerID" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220",
       "triggerID" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b0a2a781b8e00b42a3670d24c3d2b5d443299c06 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192431058

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079",
       "triggerID" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0f0fae82a029d42fa9db7ea8d2df4ba1787fded6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129) 
   * 1b0332d969b26cc5ddd7b53d4d4d9589e8a98107 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1193014130

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079",
       "triggerID" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198",
       "triggerID" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220",
       "triggerID" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6a1cd5667097f06028d3391f2558c12292d2e3e8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6a1cd5667097f06028d3391f2558c12292d2e3e8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b0a2a781b8e00b42a3670d24c3d2b5d443299c06 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220) 
   * 6a1cd5667097f06028d3391f2558c12292d2e3e8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r928092517


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -27,16 +28,18 @@
  */
 public abstract class BulkInsertInternalPartitionerFactory {
 
-  public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
-    switch (sortMode) {
+  public static BulkInsertPartitioner get(BulkInsertSortMode bulkInsertMode, HoodieTableConfig tableConfig) {

Review Comment:
   There are modes that have nothing to do with sorting (re-partitioning, for example), so long-term we should strip the "sort" part from the name.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RepartitioningBulkInsertPartitionerBase.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.function.SerializableFunctionUnchecked;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.Partitioner;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Base class for any {@link BulkInsertPartitioner} implementation that does re-partitioning,
+ * to better align "logical" (query-engine's partitioning of the incoming dataset) w/ the table's
+ * "physical" partitioning
+ */
+public abstract class RepartitioningBulkInsertPartitionerBase<I> implements BulkInsertPartitioner<I> {
+
+  protected final boolean isPartitionedTable;
+
+  public RepartitioningBulkInsertPartitionerBase(HoodieTableConfig tableConfig) {

Review Comment:
   Good call. I thought about it initially, but then decided that it's better to abstract this handling w/in the partitioner rather than pushing it onto the caller. LMK what you think.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RepartitioningBulkInsertPartitionerBase.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.function.SerializableFunctionUnchecked;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.Partitioner;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Base class for any {@link BulkInsertPartitioner} implementation that does re-partitioning,
+ * to better align "logical" (query-engine's partitioning of the incoming dataset) w/ the table's
+ * "physical" partitioning
+ */
+public abstract class RepartitioningBulkInsertPartitionerBase<I> implements BulkInsertPartitioner<I> {
+
+  protected final boolean isPartitionedTable;
+
+  public RepartitioningBulkInsertPartitionerBase(HoodieTableConfig tableConfig) {
+    this.isPartitionedTable = tableConfig.getPartitionFields().map(pfs -> pfs.length > 0).orElse(false);
+  }
+
+  protected static class PartitionPathRDDPartitioner extends Partitioner implements Serializable {
+    private final SerializableFunctionUnchecked<Object, String> partitionPathExtractor;
+    private final int numPartitions;
+
+    PartitionPathRDDPartitioner(SerializableFunctionUnchecked<Object, String> partitionPathExtractor, int numPartitions) {
+      this.partitionPathExtractor = partitionPathExtractor;
+      this.numPartitions = numPartitions;
+    }
+
+    @Override
+    public int numPartitions() {
+      return numPartitions;
+    }
+
+    @SuppressWarnings("unchecked")
+    @Override
+    public int getPartition(Object o) {
+      return Math.abs(Objects.hash(partitionPathExtractor.apply(o))) % numPartitions;

Review Comment:
   Not sure I follow your train of thought: that's the whole idea of such partitioners (partition-sort and partition-no-sort), to be able to partition the data so it's better aligned with the physical partitioning, right?
   
   In case the data is heavily skewed into the most recent partition, it shouldn't be handled with this partitioner.
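   For illustration, the hash-modulo assignment that `PartitionPathRDDPartitioner.getPartition` performs in the diff above can be sketched in plain Java (the class name, `targetPartition` helper, and sample paths here are hypothetical, not part of the PR):
   
   ```java
   import java.util.Objects;
   
   public class PartitionPathHashDemo {
       // Mirrors the diff's logic: hash the partition path and take it
       // modulo the number of Spark partitions.
       // Note: Math.abs(Integer.MIN_VALUE) is itself negative, so
       // Math.floorMod would be safer in the general case.
       static int targetPartition(String partitionPath, int numPartitions) {
           return Math.abs(Objects.hash(partitionPath)) % numPartitions;
       }
   
       public static void main(String[] args) {
           int numPartitions = 4;
           // Every row sharing a partition path lands in the same Spark partition,
           // which is the point of the partitioner: aligning "logical" with
           // "physical" partitioning.
           System.out.println(targetPartition("2022/04/15", numPartitions)
               == targetPartition("2022/04/15", numPartitions));
           // The flip side, noted in the comment above: if the data is heavily
           // skewed into one path, that entire path funnels through a single
           // Spark partition, so skew is not this partitioner's job to fix.
       }
   }
   ```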



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##########
@@ -19,42 +19,69 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements BulkInsertPartitioner<Dataset<Row>> {
+public class RowCustomColumnsSortPartitioner extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  private final String[] orderByColumnNames;
 
-  private final String[] sortColumnNames;
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = getOrderByColumnNames(config);
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-    this.sortColumnNames = getSortColumnName(config);
+    checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-    this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = columnNames;
+
+    checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputSparkPartitions) {
-    final String[] sortColumns = this.sortColumnNames;
-    return records.coalesce(outputSparkPartitions)
-        .sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, sortColumns);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: In case of a partitioned table, even "global" ordering (across all RDD partitions)
+    //       cannot change the table's partitioning, and therefore there's no point in doing a
+    //       global sort across "physical" partitions; instead we can reduce the total amount
+    //       of data being shuffled by doing "local" sorting:
+    //          - First, re-partitioning dataset such that "logical" partitions are aligned w/
+    //          "physical" ones
+    //          - Sorting locally w/in RDD ("logical") partitions
+    //
+    //       Non-partitioned tables will be globally sorted.
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   It will if the dataset wasn't partitioned that way before, but it brings the benefit of properly sized files (b/c a whole physical partition will be written by a single executor)
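The file-count benefit being described can be sketched with a small plain-Python simulation (illustrative only, helper names invented): each task opens one new file per distinct physical partition it writes into, so aligning tasks with partition paths drops the file count from N x M to at most M.

```python
def files_written(buckets):
    # Each task (bucket) opens one file per distinct partition path it writes.
    return sum(len({path for path, _ in bucket}) for bucket in buckets if bucket)

def round_robin(records, n):
    # Unaligned layout: records are spread across tasks irrespective of path.
    buckets = [[] for _ in range(n)]
    for i, record in enumerate(records):
        buckets[i % n].append(record)
    return buckets

def by_partition_path(records, n):
    # Aligned layout: all records of one path are owned by a single task.
    buckets = [[] for _ in range(n)]
    for record in records:
        buckets[hash(record[0]) % n].append(record)
    return buckets

# N = 8 tasks, M = 4 physical partitions, 100 records each.
records = [("part=%d" % p, k) for p in range(4) for k in range(100)]

unaligned = files_written(round_robin(records, 8))
aligned = files_written(by_partition_path(records, 8))

assert unaligned == 8 * 4  # every task touches every partition: N x M files
assert aligned == 4        # each partition written by exactly one task: M files
```

The trade-off, as the skew discussion elsewhere in this thread notes, is that the aligned layout caps the parallelism per physical partition at one task.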



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1340240268

   ## CI report:
   
   * 76fea0d2cbac3928c2f9088629999207afbad053 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10437) 
   * 158b38c4c46eabee066862751cbb3461797d20e2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13491) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192917434

   ## CI report:
   
   * 1b0332d969b26cc5ddd7b53d4d4d9589e8a98107 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198) 
   * b0a2a781b8e00b42a3670d24c3d2b5d443299c06 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r928155745


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -27,16 +28,18 @@
  */
 public abstract class BulkInsertInternalPartitionerFactory {
 
-  public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
-    switch (sortMode) {
+  public static BulkInsertPartitioner get(BulkInsertSortMode bulkInsertMode, HoodieTableConfig tableConfig) {
+    switch (bulkInsertMode) {
       case NONE:
-        return new NonSortPartitioner();
+        return new NonSortPartitioner<>();
       case GLOBAL_SORT:
-        return new GlobalSortPartitioner();
+        return new GlobalSortPartitioner<>();
       case PARTITION_SORT:
-        return new RDDPartitionSortPartitioner();
+        return new RDDPartitionSortPartitioner<>(tableConfig);

Review Comment:
   Will take it up along with `BulkInsertMode` rename



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/PartitionNoSortPartitioner.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+/**
+ * A built-in partitioner that only does re-partitioning to better align "logical" partitioning

Review Comment:
   Correct. It's def not a silver-bullet.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/GlobalSortPartitioner.java:
##########
@@ -20,34 +20,46 @@
 
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
 import org.apache.spark.api.java.JavaRDD;
 
 /**
- * A built-in partitioner that does global sorting for the input records across partitions
- * after repartition for bulk insert operation, corresponding to the
- * {@code BulkInsertSortMode.GLOBAL_SORT} mode.
+ * A built-in partitioner that does global sorting of the input records across all Spark partitions,
+ * corresponding to the {@link BulkInsertSortMode#GLOBAL_SORT} mode.
  *
- * @param <T> HoodieRecordPayload type
+ * NOTE: Records are sorted by (partitionPath, key) tuple to make sure that physical
+ *       partitioning on disk is aligned with logical partitioning of the dataset (by Spark)
+ *       as much as possible.
+ *       Consider following scenario: dataset is inserted w/ parallelism of N (meaning that Spark
+ *       will partition it into N _logical_ partitions while writing), and has M physical partitions

Review Comment:
   Will address in a follow-up (to avoid re-triggering CI again)



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/PartitionSortPartitionerWithRows.java:
##########
@@ -19,19 +19,39 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 
 /**
- * A built-in partitioner that does local sorting for each spark partitions after coalesce for bulk insert operation, corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ * A built-in partitioner that does local sorting w/in the Spark partition,
+ * corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
  */
-public class PartitionSortPartitionerWithRows implements BulkInsertPartitioner<Dataset<Row>> {
+public class PartitionSortPartitionerWithRows extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  public PartitionSortPartitionerWithRows(HoodieTableConfig tableConfig) {
+    super(tableConfig);
+  }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> rows, int outputSparkPartitions) {
-    return rows.coalesce(outputSparkPartitions).sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, HoodieRecord.RECORD_KEY_METADATA_FIELD);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: Datasets being ingested into partitioned tables are additionally re-partitioned to better
+    //       align dataset's logical partitioning with expected table's physical partitioning to
+    //       provide for appropriate file-sizing and better control of the number of files created.
+    //
+    //       Please check out {@code GlobalSortPartitioner} java-doc for more details
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   `sortWithinPartitions` does not shuffle (it sorts w/in partitions only)
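A tiny plain-Python model of the distinction (partitions represented as lists of rows): a within-partition sort reorders rows locally and never moves a row across a partition boundary, which is why it incurs no shuffle.

```python
partitions = [[("b", 2), ("a", 1)], [("d", 4), ("c", 3)]]

# Model of sortWithinPartitions: sort each partition independently.
locally_sorted = [sorted(p) for p in partitions]

# Partition membership is unchanged -- only the order within each one differs.
for before, after in zip(partitions, locally_sorted):
    assert set(before) == set(after)

assert locally_sorted == [[("a", 1), ("b", 2)], [("c", 3), ("d", 4)]]
```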



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/GlobalSortPartitioner.java:
##########
@@ -20,34 +20,46 @@
 
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
 import org.apache.spark.api.java.JavaRDD;
 
 /**
- * A built-in partitioner that does global sorting for the input records across partitions
- * after repartition for bulk insert operation, corresponding to the
- * {@code BulkInsertSortMode.GLOBAL_SORT} mode.
+ * A built-in partitioner that does global sorting of the input records across all Spark partitions,
+ * corresponding to the {@link BulkInsertSortMode#GLOBAL_SORT} mode.
  *
- * @param <T> HoodieRecordPayload type
+ * NOTE: Records are sorted by (partitionPath, key) tuple to make sure that physical
+ *       partitioning on disk is aligned with logical partitioning of the dataset (by Spark)
+ *       as much as possible.
+ *       Consider following scenario: dataset is inserted w/ parallelism of N (meaning that Spark
+ *       will partition it into N _logical_ partitions while writing), and has M physical partitions
+ *       on disk. Without aligning "physical" and "logical" partitions (assuming
+ *       here that records are inserted uniformly across partitions), every logical partition,
+ *       which might be handled by separate executor, will be inserting into every physical
+ *       partition, creating a new file for the records it's writing, so that N x M new
+ *       files will be added to the table.
+ *
+ *       Instead, we want no more than N + M files to be created, and therefore sort by
+ *       a tuple of (partitionPath, key), which provides the following invariant: every
+ *       Spark partition will either
+ *          - Hold _all_ records from a particular physical partition, or
+ *          - Hold _only_ records from that particular physical partition
+ *
+ *       In other words, a single Spark partition will either hold the full set of records
+ *       for a few smaller partitions, or hold just the records of a single larger one.
+ *       This allows us to guarantee that no more than N + M files will be created.
+ *
+ * @param <T> {@code HoodieRecordPayload} type
  */
 public class GlobalSortPartitioner<T extends HoodieRecordPayload>
     implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
 
   @Override
   public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records,
                                                      int outputSparkPartitions) {
-    // Now, sort the records and line them up nicely for loading.
-    return records.sortBy(record -> {
-      // Let's use "partitionPath + key" as the sort key. Spark, will ensure
-      // the records split evenly across RDD partitions, such that small partitions fit
-      // into 1 RDD partition, while big ones spread evenly across multiple RDD partitions
-      return new StringBuilder()
-          .append(record.getPartitionPath())
-          .append("+")
-          .append(record.getRecordKey())
-          .toString();
-    }, true, outputSparkPartitions);
+    return records.sortBy(record ->
+        Pair.of(record.getPartitionPath(), record.getRecordKey()), true, outputSparkPartitions);

Review Comment:
   We have tests for these
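The N + M file bound argued in the javadoc above can be checked with a small plain-Python simulation (not Hudi code): globally sort records by (partitionPath, key), split the sorted sequence into N contiguous range partitions the way Spark's range partitioner would, and count the resulting (task, physical partition) pairs, i.e. files.

```python
def range_partition(sorted_records, n):
    """Split a globally sorted list into n contiguous chunks, mirroring
    how Spark's range partitioner lays out a sorted dataset."""
    size = -(-len(sorted_records) // n)  # ceiling division
    return [sorted_records[i * size:(i + 1) * size] for i in range(n)]

N, M = 7, 5  # N Spark tasks, M physical partitions
records = sorted(("part=%02d" % p, k) for p in range(M) for k in range(100))

chunks = range_partition(records, N)
files = sum(len({path for path, _ in chunk}) for chunk in chunks if chunk)

# Each of the N - 1 chunk boundaries can split at most one physical
# partition in two, so at most N + M - 1 files are created overall.
assert files <= N + M - 1
```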



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -27,16 +28,18 @@
  */
 public abstract class BulkInsertInternalPartitionerFactory {
 
-  public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
-    switch (sortMode) {
+  public static BulkInsertPartitioner get(BulkInsertSortMode bulkInsertMode, HoodieTableConfig tableConfig) {

Review Comment:
   Will follow up with renames



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDCustomColumnsSortPartitioner.java:
##########
@@ -18,69 +18,120 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
 import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.common.config.SerializableSchema;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
-import org.apache.avro.Schema;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
 
+import java.io.Serializable;
 import java.util.Arrays;
+import java.util.Comparator;
+import java.util.function.Function;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
 
 /**
- * A partitioner that does sorting based on specified column values for each RDD partition.
+ * A partitioner that does local sorting for each RDD partition based on the tuple of
+ * values of the columns configured for ordering.
  *
  * @param <T> HoodieRecordPayload type
  */
 public class RDDCustomColumnsSortPartitioner<T extends HoodieRecordPayload>
-    implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+    extends RepartitioningBulkInsertPartitionerBase<JavaRDD<HoodieRecord<T>>> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;

Review Comment:
   That's been a while ago, frankly, can't recollect the context



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDCustomColumnsSortPartitioner.java:
##########
@@ -18,69 +18,120 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
 import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.common.config.SerializableSchema;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
-import org.apache.avro.Schema;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;

Review Comment:
   We can't since we're using JavaRDD API directly here



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##########
@@ -19,42 +19,69 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements BulkInsertPartitioner<Dataset<Row>> {
+public class RowCustomColumnsSortPartitioner extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  private final String[] orderByColumnNames;
 
-  private final String[] sortColumnNames;
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = getOrderByColumnNames(config);
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-    this.sortColumnNames = getSortColumnName(config);
+    checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-    this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = columnNames;
+
+    checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputSparkPartitions) {
-    final String[] sortColumns = this.sortColumnNames;
-    return records.coalesce(outputSparkPartitions)
-        .sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, sortColumns);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: In case of a partitioned table, even "global" ordering (across all RDD partitions)
+    //       cannot change the table's partitioning, and therefore there's no point in doing a
+    //       global sort across "physical" partitions; instead we can reduce the total amount
+    //       of data being shuffled by doing "local" sorting:
+    //          - First, re-partitioning dataset such that "logical" partitions are aligned w/
+    //          "physical" ones
+    //          - Sorting locally w/in RDD ("logical") partitions
+    //
+    //       Non-partitioned tables will be globally sorted.
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
+    } else {
+      repartitionedDataset = dataset.coalesce(outputSparkPartitions);
+    }
+
+    return repartitionedDataset.sortWithinPartitions(

Review Comment:
   Correct



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##########
@@ -138,4 +140,16 @@ trait SparkAdapter extends Serializable {
    * TODO move to HoodieCatalystExpressionUtils
    */
   def createInterpretedPredicate(e: Expression): InterpretedPredicate
+
+  /**
+   * Insert all records, updates related task metrics, and return a completion iterator
+   * over all the data written to this [[ExternalSorter]], aggregated by our aggregator.
+   *
+   * On task completion (success, failure, or cancellation), it releases resources by
+   * calling `stop()`.
+   *
+   * NOTE: This method is an [[ExternalSorter#insertAllAndUpdateMetrics]] back-ported to Spark 2.4
+   */
+  def insertInto[K, V, C](ctx: TaskContext, records: Iterator[Product2[K, V]], sorter: ExternalSorter[K, V, C]): Iterator[Product2[K, C]]

Review Comment:
   API incompatibility b/w Spark 3.2 and prior versions





[GitHub] [hudi] codope commented on a diff in pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r927363643


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##########
@@ -19,42 +19,69 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements BulkInsertPartitioner<Dataset<Row>> {
+public class RowCustomColumnsSortPartitioner extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  private final String[] orderByColumnNames;
 
-  private final String[] sortColumnNames;
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = getOrderByColumnNames(config);
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-    this.sortColumnNames = getSortColumnName(config);
+    checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-    this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = columnNames;
+
+    checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputSparkPartitions) {
-    final String[] sortColumns = this.sortColumnNames;
-    return records.coalesce(outputSparkPartitions)
-        .sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, sortColumns);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: In case of a partitioned table, even "global" ordering (across all RDD partitions)
+    //       cannot change the table's partitioning, and therefore there's no point in doing a
+    //       global sort across "physical" partitions; instead we can reduce the total amount
+    //       of data being shuffled by doing "local" sorting:
+    //          - First, re-partitioning dataset such that "logical" partitions are aligned w/
+    //          "physical" ones
+    //          - Sorting locally w/in RDD ("logical") partitions
+    //
+    //       Non-partitioned tables will be globally sorted.
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   Most tables will be partitioned. Isn't repartitioning just gonna increase the amount of shuffle?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RepartitioningBulkInsertPartitionerBase.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.function.SerializableFunctionUnchecked;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.Partitioner;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Base class for any {@link BulkInsertPartitioner} implementation that does re-partitioning,
+ * to better align "logical" (query-engine's partitioning of the incoming dataset) w/ the table's
+ * "physical" partitioning
+ */
+public abstract class RepartitioningBulkInsertPartitionerBase<I> implements BulkInsertPartitioner<I> {
+
+  protected final boolean isPartitionedTable;
+
+  public RepartitioningBulkInsertPartitionerBase(HoodieTableConfig tableConfig) {

Review Comment:
   instead of having tableConfig across executors, should we just define needed attributes here?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -27,16 +28,18 @@
  */
 public abstract class BulkInsertInternalPartitionerFactory {
 
-  public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
-    switch (sortMode) {
+  public static BulkInsertPartitioner get(BulkInsertSortMode bulkInsertMode, HoodieTableConfig tableConfig) {

Review Comment:
   I would avoid cosmetic changes in this PR. But, if you prefer to change the name `bulkInsertSortMode` might be better.





[GitHub] [hudi] alexeykudinkin closed pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by "alexeykudinkin (via GitHub)" <gi...@apache.org>.
alexeykudinkin closed pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting
URL: https://github.com/apache/hudi/pull/5328




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1100192710

   ## CI report:
   
   * 96b33942edf6a1d6d89361d2e056ed1c3a8d326b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077) 
   * 6812e0065e1411107d7d53ad2997d02e7ce34d06 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1340335362

   ## CI report:
   
   * 158b38c4c46eabee066862751cbb3461797d20e2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13491) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194983907

   ## CI report:
   
   * 82147298deaec87b776a284746826c4004bb3d73 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10341) 
   * f1c00f46279d3d79c4cf438af1e5a398718c426a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5328: [WIP][HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r923686755


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/PartitionSortPartitionerWithRows.java:
##########
@@ -19,19 +19,39 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 
 /**
- * A built-in partitioner that does local sorting for each spark partitions after coalesce for bulk insert operation, corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ * A built-in partitioner that does local sorting w/in the Spark partition,
+ * corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
  */
-public class PartitionSortPartitionerWithRows implements BulkInsertPartitioner<Dataset<Row>> {
+public class PartitionSortPartitionerWithRows extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  public PartitionSortPartitionerWithRows(HoodieTableConfig tableConfig) {
+    super(tableConfig);
+  }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> rows, int outputSparkPartitions) {
-    return rows.coalesce(outputSparkPartitions).sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, HoodieRecord.RECORD_KEY_METADATA_FIELD);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: Datasets being ingested into partitioned tables are additionally re-partitioned to better
+    //       align dataset's logical partitioning with expected table's physical partitioning to
+    //       provide for appropriate file-sizing and better control of the number of files created.
+    //
+    //       Please check out {@code GlobalSortPartitioner} java-doc for more details
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   Correct, we will be adding a new insertion mode. I've made the changes locally and am currently working on fixing the tests. I will update the PR once this is done.
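
   The repartitioning under discussion can be illustrated with a stand-alone sketch (plain Java, no Spark; Spark's actual hash function differs, so the exact partition assignments are illustrative). The point of `dataset.repartition(n, col("_hoodie_partition_path"))` is that all rows sharing a partition path land in the same Spark partition, which is what gives the writer control over the number and size of files created per physical table partition:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Main {
    // Mimics Spark-style hash repartitioning: target = nonNegativeMod(hash(key), n).
    // Spark uses a different hash internally; String.hashCode is a stand-in here.
    public static int targetPartition(String partitionPath, int numPartitions) {
        int mod = partitionPath.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Buckets rows (represented by their partition paths) into output partitions.
    public static Map<Integer, List<String>> repartition(List<String> partitionPaths, int numPartitions) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String path : partitionPaths) {
            buckets.computeIfAbsent(targetPartition(path, numPartitions), k -> new ArrayList<>())
                   .add(path);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("2022/04/15", "2022/04/16", "2022/04/15", "2022/04/17");
        Map<Integer, List<String>> parts = repartition(rows, 4);
        // Rows with the same partition path always end up in the same bucket,
        // so each write task handles as few physical table partitions as possible.
        System.out.println(parts.size() + " non-empty Spark partitions for " + rows.size() + " rows");
    }
}
```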





[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192920465

   ## CI report:
   
   * 1b0332d969b26cc5ddd7b53d4d4d9589e8a98107 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198) 
   * b0a2a781b8e00b42a3670d24c3d2b5d443299c06 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1198618309

   ## CI report:
   
   * f1c00f46279d3d79c4cf438af1e5a398718c426a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343) 
   * 7409db4ca5e362170ce99f6479bdeeceb3402a8e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10435) 
   * 76fea0d2cbac3928c2f9088629999207afbad053 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10437) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP][HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1102194382

   ## CI report:
   
   * 0f0fae82a029d42fa9db7ea8d2df4ba1787fded6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1100190821

   ## CI report:
   
   * 96b33942edf6a1d6d89361d2e056ed1c3a8d326b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077) 
   * 6812e0065e1411107d7d53ad2997d02e7ce34d06 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1100196310

   High-level comment: I would prefer to introduce a new sort mode instead of fixing NONE, and to add documentation around when to use which sort mode so that users are aware of the different sort modes and their implications.




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP][HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1101963690

   ## CI report:
   
   * 6812e0065e1411107d7d53ad2997d02e7ce34d06 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079) 
   * 0f0fae82a029d42fa9db7ea8d2df4ba1787fded6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192189718

   ## CI report:
   
   * 0f0fae82a029d42fa9db7ea8d2df4ba1787fded6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on a diff in pull request #5328: [WIP][HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r923159642


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/PartitionSortPartitionerWithRows.java:
##########
@@ -19,19 +19,39 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 
 /**
- * A built-in partitioner that does local sorting for each spark partitions after coalesce for bulk insert operation, corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ * A built-in partitioner that does local sorting w/in the Spark partition,
+ * corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
  */
-public class PartitionSortPartitionerWithRows implements BulkInsertPartitioner<Dataset<Row>> {
+public class PartitionSortPartitionerWithRows extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  public PartitionSortPartitionerWithRows(HoodieTableConfig tableConfig) {
+    super(tableConfig);
+  }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> rows, int outputSparkPartitions) {
-    return rows.coalesce(outputSparkPartitions).sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, HoodieRecord.RECORD_KEY_METADATA_FIELD);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: Datasets being ingested into partitioned tables are additionally re-partitioned to better
+    //       align dataset's logical partitioning with expected table's physical partitioning to
+    //       provide for appropriate file-sizing and better control of the number of files created.
+    //
+    //       Please check out {@code GlobalSortPartitioner} java-doc for more details
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   Did you have any discussion w/ @vinothchandar around this? If I am not wrong, the decision was to not touch the existing sort modes and to introduce new ones instead. If not, let me know whether you have already brainstormed this w/ anyone else.





[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194664776

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1099811861

   ## CI report:
   
   * 96b33942edf6a1d6d89361d2e056ed1c3a8d326b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1193030885

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079",
       "triggerID" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198",
       "triggerID" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220",
       "triggerID" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6a1cd5667097f06028d3391f2558c12292d2e3e8",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10228",
       "triggerID" : "6a1cd5667097f06028d3391f2558c12292d2e3e8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6a1cd5667097f06028d3391f2558c12292d2e3e8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10228) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194669269

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on a diff in pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r930430983


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/PartitionSortPartitionerWithRows.java:
##########
@@ -19,19 +19,42 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 
 /**
- * A built-in partitioner that does local sorting for each spark partitions after coalesce for bulk insert operation, corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ * A built-in partitioner that does local sorting w/in the Spark partition,
+ * corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
  */
-public class PartitionSortPartitionerWithRows implements BulkInsertPartitioner<Dataset<Row>> {
+public class PartitionSortPartitionerWithRows extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  public PartitionSortPartitionerWithRows(HoodieTableConfig tableConfig) {
+    super(tableConfig);
+  }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> rows, int outputSparkPartitions) {
-    return rows.coalesce(outputSparkPartitions).sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, HoodieRecord.RECORD_KEY_METADATA_FIELD);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: Datasets being ingested into partitioned tables are additionally re-partitioned to better
+    //       align dataset's logical partitioning with expected table's physical partitioning to
+    //       provide for appropriate file-sizing and better control of the number of files created.
+    //
+    //       Please check out {@code GlobalSortPartitioner} java-doc for more details
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   This indeed changes the sorting behavior, right?
   The old DAG: `spark partitions -> coalesce (combining existing partitions to avoid a full shuffle) -> sort within each spark partition`
   The new DAG: `spark partitions -> repartition based on partition path (full shuffle) -> spark partitions corresponding to table partitions -> sort within each spark/table partition`
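   To make the DAG difference concrete, here is a plain-Java toy model of the two strategies (no Spark involved; `ShuffleStrategyDemo`, the `"<partitionPath>#<recordKey>"` row encoding, and the sample dates are purely illustrative, not Hudi code). Coalesce merges adjacent input partitions and leaves table partitions interleaved, while a hash repartition on the partition path collects each table partition into one output partition:

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   public class ShuffleStrategyDemo {

       // coalesce-style: merge adjacent input partitions without a shuffle,
       // so rows from different table partitions stay mixed together
       static List<List<String>> coalescePartitions(List<List<String>> input, int target) {
           List<List<String>> out = new ArrayList<>();
           for (int i = 0; i < target; i++) {
               out.add(new ArrayList<>());
           }
           for (int i = 0; i < input.size(); i++) {
               out.get(i * target / input.size()).addAll(input.get(i));
           }
           return out;
       }

       // repartition-style: full shuffle keyed by the table partition path,
       // so every row of one table partition lands in one output partition
       static List<List<String>> repartitionByPath(List<List<String>> input, int target) {
           List<List<String>> out = new ArrayList<>();
           for (int i = 0; i < target; i++) {
               out.add(new ArrayList<>());
           }
           for (List<String> part : input) {
               for (String row : part) {
                   String partitionPath = row.substring(0, row.indexOf('#'));
                   out.get(Math.floorMod(partitionPath.hashCode(), target)).add(row);
               }
           }
           return out;
       }

       public static void main(String[] args) {
           // rows encoded as "<partitionPath>#<recordKey>"
           List<List<String>> input = Arrays.asList(
               Arrays.asList("2022-01-01#a", "2022-01-02#b"),
               Arrays.asList("2022-01-01#c", "2022-01-02#d"),
               Arrays.asList("2022-01-01#e", "2022-01-02#f"),
               Arrays.asList("2022-01-01#g", "2022-01-02#h"));
           System.out.println("coalesce(2):    " + coalescePartitions(input, 2));
           System.out.println("repartition(2): " + repartitionByPath(input, 2));
       }
   }
   ```

   Running the sketch shows both dates mixed in each coalesced bucket, while each repartitioned bucket holds rows of a single date, which is what lets the writer produce well-sized files per table partition.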



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##########
@@ -19,42 +19,69 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements BulkInsertPartitioner<Dataset<Row>> {
+public class RowCustomColumnsSortPartitioner extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  private final String[] orderByColumnNames;
 
-  private final String[] sortColumnNames;
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = getOrderByColumnNames(config);
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-    this.sortColumnNames = getSortColumnName(config);
+    checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-    this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = columnNames;
+
+    checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputSparkPartitions) {
-    final String[] sortColumns = this.sortColumnNames;
-    return records.coalesce(outputSparkPartitions)
-        .sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, sortColumns);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: In case of partitioned table even "global" ordering (across all RDD partitions) could
+    //       not change table's partitioning and therefore there's no point in doing global sorting
+    //       across "physical" partitions, and instead we can reduce total amount of data being
+    //       shuffled by doing "local" sorting:
+    //          - First, re-partitioning dataset such that "logical" partitions are aligned w/
+    //          "physical" ones
+    //          - Sorting locally w/in RDD ("logical") partitions
+    //
+    //       Non-partitioned tables will be globally sorted.
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   Same here on repartition vs coalesce.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDPartitionSortPartitioner.java:
##########
@@ -20,46 +20,62 @@
 
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
-
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.List;
-
+import org.apache.spark.sql.HoodieJavaRDDUtils;
 import scala.Tuple2;
 
+import java.util.Comparator;
+
 /**
  * A built-in partitioner that does local sorting for each RDD partition
- * after coalesce for bulk insert operation, corresponding to the
- * {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ * after coalescing it to specified number of partitions.
+ * Corresponds to the {@link BulkInsertSortMode#PARTITION_SORT} mode.
  *
  * @param <T> HoodieRecordPayload type
  */
 public class RDDPartitionSortPartitioner<T extends HoodieRecordPayload>
-    implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+    extends RepartitioningBulkInsertPartitionerBase<JavaRDD<HoodieRecord<T>>> {
 
+  public RDDPartitionSortPartitioner(HoodieTableConfig tableConfig) {
+    super(tableConfig);
+  }
+
+  @SuppressWarnings("unchecked")
   @Override
   public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records,
                                                      int outputSparkPartitions) {
-    return records.coalesce(outputSparkPartitions)
-        .mapToPair(record ->
-            new Tuple2<>(
-                new StringBuilder()
-                    .append(record.getPartitionPath())
-                    .append("+")
-                    .append(record.getRecordKey())
-                    .toString(), record))
-        .mapPartitions(partition -> {
-          // Sort locally in partition
-          List<Tuple2<String, HoodieRecord<T>>> recordList = new ArrayList<>();
-          for (; partition.hasNext(); ) {
-            recordList.add(partition.next());
-          }
-          Collections.sort(recordList, (o1, o2) -> o1._1.compareTo(o2._1));
-          return recordList.stream().map(e -> e._2).iterator();
-        });
+
+    // NOTE: Datasets being ingested into partitioned tables are additionally re-partitioned to better
+    //       align dataset's logical partitioning with expected table's physical partitioning to
+    //       provide for appropriate file-sizing and better control of the number of files created.
+    //
+    //       Please check out {@code GlobalSortPartitioner} java-doc for more details
+    if (isPartitionedTable) {
+      PartitionPathRDDPartitioner partitioner =
+          new PartitionPathRDDPartitioner((pair) -> ((Pair<String, String>) pair).getKey(), outputSparkPartitions);
+
+      // Both partition-path and record-key are extracted, since
+      //    - Partition-path will be used for re-partitioning (as called out above)
+      //    - Record-key will be used for sorting the records w/in individual partitions
+      return records.mapToPair(record -> new Tuple2<>(Pair.of(record.getPartitionPath(), record.getRecordKey()), record))
+          // NOTE: We're sorting by (partition-path, record-key) pair to make sure that in case
+          //       when there are fewer Spark partitions (requested) than there are physical partitions
+          //       (in which case multiple physical partitions will be handled w/in a single Spark
+          //       partition) records w/in a single Spark partition are still ordered first by
+          //       partition-path, then record's key
+          .repartitionAndSortWithinPartitions(partitioner, Comparator.naturalOrder())

Review Comment:
   Again, the original sort behavior does not repartition the records based on the table partition path.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -27,16 +28,18 @@
  */
 public abstract class BulkInsertInternalPartitionerFactory {
 
-  public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
-    switch (sortMode) {
+  public static BulkInsertPartitioner get(BulkInsertSortMode bulkInsertMode, HoodieTableConfig tableConfig) {
+    switch (bulkInsertMode) {
       case NONE:
-        return new NonSortPartitioner();
+        return new NonSortPartitioner<>();
       case GLOBAL_SORT:
-        return new GlobalSortPartitioner();
+        return new GlobalSortPartitioner<>();
       case PARTITION_SORT:
-        return new RDDPartitionSortPartitioner();
+        return new RDDPartitionSortPartitioner<>(tableConfig);

Review Comment:
   `RDD` is added for clarification that the sorting happens within one RDD partition, not the table/physical partition.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RepartitioningBulkInsertPartitionerBase.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.function.SerializableFunctionUnchecked;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.Partitioner;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Base class for any {@link BulkInsertPartitioner} implementation that does re-partitioning,
+ * to better align "logical" (query-engine's partitioning of the incoming dataset) w/ the table's
+ * "physical" partitioning
+ */
+public abstract class RepartitioningBulkInsertPartitionerBase<I> implements BulkInsertPartitioner<I> {
+
+  protected final boolean isPartitionedTable;
+
+  public RepartitioningBulkInsertPartitionerBase(HoodieTableConfig tableConfig) {
+    this.isPartitionedTable = tableConfig.getPartitionFields().map(pfs -> pfs.length > 0).orElse(false);
+  }
+
+  protected static class PartitionPathRDDPartitioner extends Partitioner implements Serializable {
+    private final SerializableFunctionUnchecked<Object, String> partitionPathExtractor;
+    private final int numPartitions;
+
+    PartitionPathRDDPartitioner(SerializableFunctionUnchecked<Object, String> partitionPathExtractor, int numPartitions) {
+      this.partitionPathExtractor = partitionPathExtractor;
+      this.numPartitions = numPartitions;
+    }
+
+    @Override
+    public int numPartitions() {
+      return numPartitions;
+    }
+
+    @SuppressWarnings("unchecked")
+    @Override
+    public int getPartition(Object o) {
+      return Math.abs(Objects.hash(partitionPathExtractor.apply(o))) % numPartitions;

Review Comment:
   Understood.  This should be documented in the docs.  Even without sorting, a better bucketing strategy can be developed to avoid skews, e.g., a larger table partition can be split into multiple Spark partitions.
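   One way to sketch the suggested bucketing strategy (purely illustrative; `SaltedBucketingSketch`, `saltedBucket`, and the fixed salt factor are hypothetical, not Hudi code) is to salt the partition path so that a hot table partition fans out over several Spark partitions instead of exactly one:

   ```java
   import java.util.Objects;
   import java.util.Set;
   import java.util.TreeSet;

   public class SaltedBucketingSketch {

       // Spreads records of one table partition over up to `salt` Spark
       // partitions instead of exactly one; in a real implementation the
       // salt could be chosen per partition from its estimated size.
       static int saltedBucket(String partitionPath, String recordKey, int salt, int numPartitions) {
           // sub-bucket derived from the record key keeps the assignment deterministic
           int subBucket = Math.floorMod(recordKey.hashCode(), salt);
           // floorMod sidesteps the Math.abs(Integer.MIN_VALUE) pitfall of abs+%
           return Math.floorMod(Objects.hash(partitionPath, subBucket), numPartitions);
       }

       public static void main(String[] args) {
           int numPartitions = 8;
           Set<Integer> buckets = new TreeSet<>();
           for (int i = 0; i < 1000; i++) {
               buckets.add(saltedBucket("2022/07/27", "key-" + i, 4, numPartitions));
           }
           // the hot partition now occupies up to 4 distinct Spark partitions
           System.out.println("buckets used: " + buckets);
       }
   }
   ```

   The trade-off is that one table partition is then written by several tasks, so the salt factor should stay small relative to the target file count per partition.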



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##########
@@ -138,4 +140,16 @@ trait SparkAdapter extends Serializable {
    * TODO move to HoodieCatalystExpressionUtils
    */
   def createInterpretedPredicate(e: Expression): InterpretedPredicate
+
+  /**
+   * Insert all records, updates related task metrics, and return a completion iterator
+   * over all the data written to this [[ExternalSorter]], aggregated by our aggregator.
+   *
+   * On task completion (success, failure, or cancellation), it releases resources by
+   * calling `stop()`.
+   *
+   * NOTE: This method is an [[ExternalSorter#insertAllAndUpdateMetrics]] back-ported to Spark 2.4
+   */
+  def insertInto[K, V, C](ctx: TaskContext, records: Iterator[Product2[K, V]], sorter: ExternalSorter[K, V, C]): Iterator[Product2[K, C]]

Review Comment:
   Why is this called `insertInto`, which gets me confused with `INSERT INTO` SQL?  Should it be renamed as `sortExternally`?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDPartitionSortPartitioner.java:
##########
@@ -20,46 +20,62 @@
 
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
-
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.List;
-
+import org.apache.spark.sql.HoodieJavaRDDUtils;
 import scala.Tuple2;
 
+import java.util.Comparator;
+
 /**
  * A built-in partitioner that does local sorting for each RDD partition
- * after coalesce for bulk insert operation, corresponding to the
- * {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ * after coalescing it to specified number of partitions.
+ * Corresponds to the {@link BulkInsertSortMode#PARTITION_SORT} mode.
  *
  * @param <T> HoodieRecordPayload type
  */
 public class RDDPartitionSortPartitioner<T extends HoodieRecordPayload>
-    implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+    extends RepartitioningBulkInsertPartitionerBase<JavaRDD<HoodieRecord<T>>> {
 
+  public RDDPartitionSortPartitioner(HoodieTableConfig tableConfig) {
+    super(tableConfig);
+  }
+
+  @SuppressWarnings("unchecked")
   @Override
   public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records,
                                                      int outputSparkPartitions) {
-    return records.coalesce(outputSparkPartitions)
-        .mapToPair(record ->
-            new Tuple2<>(
-                new StringBuilder()
-                    .append(record.getPartitionPath())
-                    .append("+")
-                    .append(record.getRecordKey())
-                    .toString(), record))
-        .mapPartitions(partition -> {
-          // Sort locally in partition
-          List<Tuple2<String, HoodieRecord<T>>> recordList = new ArrayList<>();
-          for (; partition.hasNext(); ) {
-            recordList.add(partition.next());
-          }
-          Collections.sort(recordList, (o1, o2) -> o1._1.compareTo(o2._1));
-          return recordList.stream().map(e -> e._2).iterator();
-        });
+
+    // NOTE: Datasets being ingested into partitioned tables are additionally re-partitioned to better
+    //       align dataset's logical partitioning with expected table's physical partitioning to
+    //       provide for appropriate file-sizing and better control of the number of files created.
+    //
+    //       Please check out {@code GlobalSortPartitioner} java-doc for more details
+    if (isPartitionedTable) {
+      PartitionPathRDDPartitioner partitioner =
+          new PartitionPathRDDPartitioner((pair) -> ((Pair<String, String>) pair).getKey(), outputSparkPartitions);
+
+      // Both partition-path and record-key are extracted, since
+      //    - Partition-path will be used for re-partitioning (as called out above)
+      //    - Record-key will be used for sorting the records w/in individual partitions
+      return records.mapToPair(record -> new Tuple2<>(Pair.of(record.getPartitionPath(), record.getRecordKey()), record))
+          // NOTE: We're sorting by (partition-path, record-key) pair to make sure that in case
+          //       when there are fewer Spark partitions (requested) than there are physical partitions
+          //       (in which case multiple physical partitions will be handled w/in a single Spark
+          //       partition) records w/in a single Spark partition are still ordered first by
+          //       partition-path, then record's key
+          .repartitionAndSortWithinPartitions(partitioner, Comparator.naturalOrder())
+          .values();
+    } else {
+      JavaPairRDD<String, HoodieRecord<T>> kvPairsRDD =
+          records.coalesce(outputSparkPartitions).mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));
+
+      // NOTE: [[JavaRDD]] doesn't expose an API to do the sorting w/o (re-)shuffling, as such
+      //       we're relying on our own sequence to achieve that
+      return HoodieJavaRDDUtils.sortWithinPartitions(kvPairsRDD, Comparator.naturalOrder()).values();

Review Comment:
   Can we do `mapPartitions` and then sort instead?
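   The `mapPartitions`-then-sort alternative amounts to buffering each partition's records and sorting them in memory, which is what the pre-change code did. A plain-Java sketch of that shape (no Spark types; `InMemoryPartitionSort` is illustrative only), with the trade-off the NOTE above is hinting at spelled out in the comments:

   ```java
   import java.util.AbstractMap;
   import java.util.ArrayList;
   import java.util.Iterator;
   import java.util.List;
   import java.util.Map;

   public class InMemoryPartitionSort {

       // mapPartitions-style local sort: drain the partition's iterator
       // into a list and sort it by key. Unlike sortWithinPartitions,
       // which goes through Spark's spillable ExternalSorter, this holds
       // the whole partition in memory, so a very large partition risks OOM.
       static <V> Iterator<Map.Entry<String, V>> sortPartition(Iterator<Map.Entry<String, V>> partition) {
           List<Map.Entry<String, V>> buffer = new ArrayList<>();
           partition.forEachRemaining(buffer::add);
           buffer.sort(Map.Entry.comparingByKey());
           return buffer.iterator();
       }

       public static void main(String[] args) {
           List<Map.Entry<String, String>> records = new ArrayList<>();
           records.add(new AbstractMap.SimpleEntry<>("key-3", "c"));
           records.add(new AbstractMap.SimpleEntry<>("key-1", "a"));
           records.add(new AbstractMap.SimpleEntry<>("key-2", "b"));
           sortPartition(records.iterator())
               .forEachRemaining(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
       }
   }
   ```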





[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1198562338

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "82147298deaec87b776a284746826c4004bb3d73",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10341",
       "triggerID" : "82147298deaec87b776a284746826c4004bb3d73",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1c00f46279d3d79c4cf438af1e5a398718c426a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343",
       "triggerID" : "f1c00f46279d3d79c4cf438af1e5a398718c426a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7409db4ca5e362170ce99f6479bdeeceb3402a8e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10435",
       "triggerID" : "7409db4ca5e362170ce99f6479bdeeceb3402a8e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f1c00f46279d3d79c4cf438af1e5a398718c426a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343) 
   * 7409db4ca5e362170ce99f6479bdeeceb3402a8e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10435) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
yihua commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1250008970

   @alexeykudinkin any chance you have revised the PR based on the discussion to have new bulk insert modes for the repartitioning behavior?




[GitHub] [hudi] yihua commented on a diff in pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r928084505


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RepartitioningBulkInsertPartitionerBase.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.function.SerializableFunctionUnchecked;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.Partitioner;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Base class for any {@link BulkInsertPartitioner} implementation that does re-partitioning,
+ * to better align "logical" (query-engine's partitioning of the incoming dataset) w/ the table's
+ * "physical" partitioning
+ */
+public abstract class RepartitioningBulkInsertPartitionerBase<I> implements BulkInsertPartitioner<I> {
+
+  protected final boolean isPartitionedTable;
+
+  public RepartitioningBulkInsertPartitionerBase(HoodieTableConfig tableConfig) {
+    this.isPartitionedTable = tableConfig.getPartitionFields().map(pfs -> pfs.length > 0).orElse(false);
+  }
+
+  protected static class PartitionPathRDDPartitioner extends Partitioner implements Serializable {
+    private final SerializableFunctionUnchecked<Object, String> partitionPathExtractor;
+    private final int numPartitions;
+
+    PartitionPathRDDPartitioner(SerializableFunctionUnchecked<Object, String> partitionPathExtractor, int numPartitions) {
+      this.partitionPathExtractor = partitionPathExtractor;
+      this.numPartitions = numPartitions;
+    }
+
+    @Override
+    public int numPartitions() {
+      return numPartitions;
+    }
+
+    @SuppressWarnings("unchecked")
+    @Override
+    public int getPartition(Object o) {
+      return Math.abs(Objects.hash(partitionPathExtractor.apply(o))) % numPartitions;

Review Comment:
   This can introduce data skew if most data are in the latest date partition.
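   To see the skew concretely, here is a self-contained sketch (synthetic data; the partition-path strings and counts are made up) applying the same `Math.abs(Objects.hash(...)) % numPartitions` rule as `PartitionPathRDDPartitioner#getPartition`:

   ```java
   import java.util.Arrays;
   import java.util.Objects;

   public class HashPartitionSkewDemo {

       // Same assignment rule as PartitionPathRDDPartitioner#getPartition.
       // Note: Math.abs(Integer.MIN_VALUE) is still negative, so
       // Math.floorMod would be the safer idiom here.
       static int getPartition(String partitionPath, int numPartitions) {
           return Math.abs(Objects.hash(partitionPath)) % numPartitions;
       }

       public static void main(String[] args) {
           int numPartitions = 8;
           int[] counts = new int[numPartitions];
           for (int i = 0; i < 1000; i++) {
               // 90% of the records fall into the latest date partition
               String path = (i < 900) ? "2022/07/27" : "2022/06/" + (i % 30 + 1);
               counts[getPartition(path, numPartitions)]++;
           }
           // one Spark partition receives at least 900 of the 1000 records
           System.out.println(Arrays.toString(counts));
       }
   }
   ```

   All records sharing one partition path hash to a single Spark partition, so the task handling the latest date processes 90% of the data regardless of `outputSparkPartitions`.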





[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194949516

   ## CI report:
   
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320) 
   * 82147298deaec87b776a284746826c4004bb3d73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10341) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192187988

   @hudi-bot run azure




[GitHub] [hudi] alexeykudinkin commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1195863390

   CI is green:
   
   <img width="503" alt="Screen Shot 2022-07-26 at 11 55 51 AM" src="https://user-images.githubusercontent.com/428277/181089085-7330eb8b-2476-4ca8-8a5c-9fa5c9053151.png">
   
   https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10343&view=results




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194385555

   ## CI report:
   
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP][HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1101962392

   ## CI report:
   
   * 6812e0065e1411107d7d53ad2997d02e7ce34d06 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079) 
   * 0f0fae82a029d42fa9db7ea8d2df4ba1787fded6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1195023946

   ## CI report:
   
   * f1c00f46279d3d79c4cf438af1e5a398718c426a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194946760

   ## CI report:
   
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320) 
   * 82147298deaec87b776a284746826c4004bb3d73 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192434973

   ## CI report:
   
   * 0f0fae82a029d42fa9db7ea8d2df4ba1787fded6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129) 
   * 1b0332d969b26cc5ddd7b53d4d4d9589e8a98107 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1193015416

   ## CI report:
   
   * b0a2a781b8e00b42a3670d24c3d2b5d443299c06 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220) 
   * 6a1cd5667097f06028d3391f2558c12292d2e3e8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10228) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1198558865

   ## CI report:
   
   * f1c00f46279d3d79c4cf438af1e5a398718c426a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343) 
   * 7409db4ca5e362170ce99f6479bdeeceb3402a8e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1198671070

   ## CI report:
   
   * 76fea0d2cbac3928c2f9088629999207afbad053 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10437) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1340235056

   ## CI report:
   
   * 76fea0d2cbac3928c2f9088629999207afbad053 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10437) 
   * 158b38c4c46eabee066862751cbb3461797d20e2 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1100234074

   ## CI report:
   
   * 6812e0065e1411107d7d53ad2997d02e7ce34d06 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1192214134

   ## CI report:
   
   * 0f0fae82a029d42fa9db7ea8d2df4ba1787fded6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194285380

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077",
       "triggerID" : "96b33942edf6a1d6d89361d2e056ed1c3a8d326b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079",
       "triggerID" : "6812e0065e1411107d7d53ad2997d02e7ce34d06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f0fae82a029d42fa9db7ea8d2df4ba1787fded6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8129",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10198",
       "triggerID" : "1b0332d969b26cc5ddd7b53d4d4d9589e8a98107",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10220",
       "triggerID" : "b0a2a781b8e00b42a3670d24c3d2b5d443299c06",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6a1cd5667097f06028d3391f2558c12292d2e3e8",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10228",
       "triggerID" : "6a1cd5667097f06028d3391f2558c12292d2e3e8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6a1cd5667097f06028d3391f2558c12292d2e3e8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10228) 
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1193082041

   @yihua can you please elaborate on the repartitioner you're referring to?




[GitHub] [hudi] vinothchandar commented on a diff in pull request #5328: [HUDI-3883] Fix Bulk Insert to repartition the dataset based on Partition Path

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r928149228


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/GlobalSortPartitioner.java:
##########
@@ -20,34 +20,46 @@
 
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
 import org.apache.spark.api.java.JavaRDD;
 
 /**
- * A built-in partitioner that does global sorting for the input records across partitions
- * after repartition for bulk insert operation, corresponding to the
- * {@code BulkInsertSortMode.GLOBAL_SORT} mode.
+ * A built-in partitioner that does global sorting of the input records across all Spark partitions,
+ * corresponding to the {@link BulkInsertSortMode#GLOBAL_SORT} mode.
  *
- * @param <T> HoodieRecordPayload type
+ * NOTE: Records are sorted by (partitionPath, key) tuple to make sure that physical
+ *       partitioning on disk is aligned with logical partitioning of the dataset (by Spark)
+ *       as much as possible.
+ *       Consider following scenario: dataset is inserted w/ parallelism of N (meaning that Spark
+ *       will partition it into N _logical_ partitions while writing), and has M physical partitions

Review Comment:
   "Table partitions" may be a better term than "physical partitions".



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -27,16 +28,18 @@
  */
 public abstract class BulkInsertInternalPartitionerFactory {
 
-  public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
-    switch (sortMode) {
+  public static BulkInsertPartitioner get(BulkInsertSortMode bulkInsertMode, HoodieTableConfig tableConfig) {
+    switch (bulkInsertMode) {
       case NONE:
-        return new NonSortPartitioner();
+        return new NonSortPartitioner<>();
       case GLOBAL_SORT:
-        return new GlobalSortPartitioner();
+        return new GlobalSortPartitioner<>();
       case PARTITION_SORT:
-        return new RDDPartitionSortPartitioner();
+        return new RDDPartitionSortPartitioner<>(tableConfig);

Review Comment:
   At some point, we should also rename the RDDxxxx class to be consistent with the others.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/GlobalSortPartitioner.java:
##########
@@ -20,34 +20,46 @@
 
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
 import org.apache.spark.api.java.JavaRDD;
 
 /**
- * A built-in partitioner that does global sorting for the input records across partitions
- * after repartition for bulk insert operation, corresponding to the
- * {@code BulkInsertSortMode.GLOBAL_SORT} mode.
+ * A built-in partitioner that does global sorting of the input records across all Spark partitions,
+ * corresponding to the {@link BulkInsertSortMode#GLOBAL_SORT} mode.
  *
- * @param <T> HoodieRecordPayload type
+ * NOTE: Records are sorted by (partitionPath, key) tuple to make sure that physical
+ *       partitioning on disk is aligned with logical partitioning of the dataset (by Spark)
+ *       as much as possible.
+ *       Consider following scenario: dataset is inserted w/ parallelism of N (meaning that Spark
+ *       will partition it into N _logical_ partitions while writing), and has M physical partitions
+ *       on disk. Without aligning the "physical" and "logical" partitions (assuming
+ *       here that records are inserted uniformly across partitions), every logical partition,
+ *       which might be handled by a separate executor, will be inserting into every physical
+ *       partition, creating a new file for the records it's writing, entailing that N x M
+ *       new files will be added to the table.
+ *
+ *       Instead, we want no more than N + M files to be created, and therefore sort by
+ *       a tuple of (partitionPath, key), which provides the following invariant: every
+ *       Spark partition will either
+ *          - Hold _all_ records from a particular physical partition, or
+ *          - Hold _only_ records from that particular physical partition
+ *
+ *       In other words, a single Spark partition will either hold the full set of records
+ *       for a few smaller partitions, or it will hold just the records of a larger one. This
+ *       allows us to guarantee that no more than N + M files will be created.
+ *
+ * @param <T> {@code HoodieRecordPayload} type
  */
 public class GlobalSortPartitioner<T extends HoodieRecordPayload>
     implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
 
   @Override
   public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records,
                                                      int outputSparkPartitions) {
-    // Now, sort the records and line them up nicely for loading.
-    return records.sortBy(record -> {
-      // Let's use "partitionPath + key" as the sort key. Spark, will ensure
-      // the records split evenly across RDD partitions, such that small partitions fit
-      // into 1 RDD partition, while big ones spread evenly across multiple RDD partitions
-      return new StringBuilder()
-          .append(record.getPartitionPath())
-          .append("+")
-          .append(record.getRecordKey())
-          .toString();
-    }, true, outputSparkPartitions);
+    return records.sortBy(record ->
+        Pair.of(record.getPartitionPath(), record.getRecordKey()), true, outputSparkPartitions);

Review Comment:
   Change looks good. Let's ensure, via one of the UTs or some local testing, that sorting based on the Pair comparator results in the same behavior.
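   A quick way to check this outside Spark: the old concatenated string key and the new tuple key are not always order-equivalent when a partition path itself contains the `+` separator. A minimal Python sketch (the partition paths and keys below are made up for illustration):

   ```python
   # Old sort key: the concatenated string "partitionPath+recordKey".
   # New sort key: the tuple (partitionPath, recordKey).
   records = [("2021/01", "z"), ("2021/01+eu", "a")]

   by_string = sorted(records, key=lambda r: r[0] + "+" + r[1])
   by_tuple = sorted(records, key=lambda r: (r[0], r[1]))

   # Tuple ordering puts "2021/01" before "2021/01+eu" (a strict prefix
   # sorts first), while string ordering compares "2021/01+z" against
   # "2021/01+eu+a" and puts the latter first ('e' < 'z'), so the two
   # orderings disagree.
   print(by_string)  # [('2021/01+eu', 'a'), ('2021/01', 'z')]
   print(by_tuple)   # [('2021/01', 'z'), ('2021/01+eu', 'a')]
   ```

   In practice partition paths rarely contain `+`, so this is an edge case, but it is worth covering in the UT.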



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDCustomColumnsSortPartitioner.java:
##########
@@ -18,69 +18,120 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
 import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.common.config.SerializableSchema;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
-import org.apache.avro.Schema;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;

Review Comment:
   Can we avoid depending on this here in Java code? There are other places in the code, but I'd love to not proliferate it if possible.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDCustomColumnsSortPartitioner.java:
##########
@@ -18,69 +18,120 @@
 
 package org.apache.hudi.execution.bulkinsert;
 
+import org.apache.avro.Schema;
 import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.common.config.SerializableSchema;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
-import org.apache.avro.Schema;
 import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
 
+import java.io.Serializable;
 import java.util.Arrays;
+import java.util.Comparator;
+import java.util.function.Function;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
 
 /**
- * A partitioner that does sorting based on specified column values for each RDD partition.
+ * A partitioner that does local sorting for each RDD partition based on the tuple of
+ * values of the columns configured for ordering.
  *
  * @param <T> HoodieRecordPayload type
  */
 public class RDDCustomColumnsSortPartitioner<T extends HoodieRecordPayload>
-    implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+    extends RepartitioningBulkInsertPartitionerBase<JavaRDD<HoodieRecord<T>>> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;

Review Comment:
   Why the rename?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/PartitionNoSortPartitioner.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.spark.api.java.JavaRDD;
+import scala.Tuple2;
+
+/**
+ * A built-in partitioner that only does re-partitioning to better align "logical" partitioning

Review Comment:
   The side effect of this is skew. For example, if users trim down the total number of Spark partitions, or have only a few partition paths (table partitions), then N*M is actually okay. But if one of the M table partitions is very large, then that Spark partition is going to take a long time to finish writing.
   
   We should add this to the comments here and also to the site docs.  Global sort handles this too, since sorting will evenly distribute data amongst executors.
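   The N x M vs N + M file-count effect discussed here can be modeled outside Spark. In the sketch below (Python, with illustrative numbers only), each (Spark partition, table partition) pair that receives at least one record produces one file:

   ```python
   # N Spark partitions, M table partitions; each record carries the id of
   # the table partition it belongs to.
   N, M = 8, 4
   table_partition = [(i // N) % M for i in range(10_000)]

   # Unaligned (e.g. round-robin) distribution: every table partition's
   # records end up in every Spark partition -> N * M files.
   unaligned_files = {(i % N, p) for i, p in enumerate(table_partition)}

   # Repartitioned by partition path: all records of a table partition land
   # in a single Spark partition -> at most M files here (and at most
   # N + M in general, once large table partitions span several Spark
   # partitions).
   aligned_files = {(hash(p) % N, p) for p in table_partition}

   print(len(unaligned_files), len(aligned_files))  # 32 4
   ```

   As the comment notes, the aligned layout trades file count for potential skew: one huge table partition becomes one huge Spark task.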



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -27,16 +28,18 @@
  */
 public abstract class BulkInsertInternalPartitionerFactory {
 
-  public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
-    switch (sortMode) {
+  public static BulkInsertPartitioner get(BulkInsertSortMode bulkInsertMode, HoodieTableConfig tableConfig) {

Review Comment:
   +1 on avoiding cosmetic changes. 



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -78,12 +78,12 @@ object HoodieSparkSqlWriter {
     SparkRDDWriteClient[HoodieRecordPayload[Nothing]], HoodieTableConfig) = {
 
     assert(optParams.get("path").exists(!StringUtils.isNullOrEmpty(_)), "'path' must be set")
-    val path = optParams("path")
-    val basePath = new Path(path)
+    val basePathStr = optParams("path")

Review Comment:
   Let's avoid these changes, however small.



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##########
@@ -138,4 +140,16 @@ trait SparkAdapter extends Serializable {
    * TODO move to HoodieCatalystExpressionUtils
    */
   def createInterpretedPredicate(e: Expression): InterpretedPredicate
+
+  /**
+   * Insert all records, updates related task metrics, and return a completion iterator
+   * over all the data written to this [[ExternalSorter]], aggregated by our aggregator.
+   *
+   * On task completion (success, failure, or cancellation), it releases resources by
+   * calling `stop()`.
+   *
+   * NOTE: This method is an [[ExternalSorter#insertAllAndUpdateMetrics]] back-ported to Spark 2.4
+   */
+  def insertInto[K, V, C](ctx: TaskContext, records: Iterator[Product2[K, V]], sorter: ExternalSorter[K, V, C]): Iterator[Product2[K, C]]

Review Comment:
   Can you explain why this change is needed?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/PartitionSortPartitionerWithRows.java:
##########
@@ -19,19 +19,39 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.table.BulkInsertPartitioner;
 
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 
 /**
- * A built-in partitioner that does local sorting for each spark partitions after coalesce for bulk insert operation, corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ * A built-in partitioner that does local sorting w/in the Spark partition,
+ * corresponding to the {@code BulkInsertSortMode.PARTITION_SORT} mode.
  */
-public class PartitionSortPartitionerWithRows implements BulkInsertPartitioner<Dataset<Row>> {
+public class PartitionSortPartitionerWithRows extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  public PartitionSortPartitionerWithRows(HoodieTableConfig tableConfig) {
+    super(tableConfig);
+  }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> rows, int outputSparkPartitions) {
-    return rows.coalesce(outputSparkPartitions).sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, HoodieRecord.RECORD_KEY_METADATA_FIELD);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: Datasets being ingested into partitioned tables are additionally re-partitioned to better
+    //       align dataset's logical partitioning with expected table's physical partitioning to
+    //       provide for appropriate file-sizing and better control of the number of files created.
+    //
+    //       Please check out {@code GlobalSortPartitioner} java-doc for more details
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));

Review Comment:
   We are calling `repartition` here, followed by `sortWithinPartitions`. Won't this shuffle two times? Let's get on the same page w.r.t. this.
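   For reference, `repartition` should be the only shuffle here; `sortWithinPartitions` then sorts each partition locally without moving data across partitions. A rough Python model of the two steps (the hash routing and partition count are arbitrary stand-ins, not Spark's actual partitioner):

   ```python
   # Step 1: one shuffle -- route each record to a partition using its
   # partition-path hash only.
   N = 4
   records = [("p%d" % (i % 3), "k%03d" % i) for i in range(20)]

   partitions = [[] for _ in range(N)]
   for path, key in records:
       partitions[hash(path) % N].append((path, key))

   # Step 2: local sort -- runs independently inside each partition;
   # no record crosses a partition boundary.
   for part in partitions:
       part.sort()

   # Each table partition's records are now contiguous and ordered within
   # exactly one Spark partition.
   ```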



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##########
@@ -19,42 +19,69 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements BulkInsertPartitioner<Dataset<Row>> {
+public class RowCustomColumnsSortPartitioner extends RepartitioningBulkInsertPartitionerBase<Dataset<Row>> {
+
+  private final String[] orderByColumnNames;
 
-  private final String[] sortColumnNames;
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = getOrderByColumnNames(config);
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-    this.sortColumnNames = getSortColumnName(config);
+    checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-    this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, HoodieTableConfig tableConfig) {
+    super(tableConfig);
+    this.orderByColumnNames = columnNames;
+
+    checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset<Row> repartitionRecords(Dataset<Row> records, int outputSparkPartitions) {
-    final String[] sortColumns = this.sortColumnNames;
-    return records.coalesce(outputSparkPartitions)
-        .sortWithinPartitions(HoodieRecord.PARTITION_PATH_METADATA_FIELD, sortColumns);
+  public Dataset<Row> repartitionRecords(Dataset<Row> dataset, int outputSparkPartitions) {
+    Dataset<Row> repartitionedDataset;
+
+    // NOTE: In the case of a partitioned table, even "global" ordering (across all RDD partitions)
+    //       cannot change the table's partitioning, and therefore there's no point in doing global
+    //       sorting across "physical" partitions; instead we can reduce the total amount of data
+    //       being shuffled by doing "local" sorting:
+    //          - First, re-partitioning dataset such that "logical" partitions are aligned w/
+    //          "physical" ones
+    //          - Sorting locally w/in RDD ("logical") partitions
+    //
+    //       Non-partitioned tables will be globally sorted.
+    if (isPartitionedTable) {
+      repartitionedDataset = dataset.repartition(outputSparkPartitions, new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
+    } else {
+      repartitionedDataset = dataset.coalesce(outputSparkPartitions);
+    }
+
+    return repartitionedDataset.sortWithinPartitions(

Review Comment:
   Same question: `sortWithinPartitions` does not shuffle, but just does the external spill sorting?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java:
##########
@@ -27,16 +28,18 @@
  */
 public abstract class BulkInsertInternalPartitionerFactory {
 
-  public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
-    switch (sortMode) {
+  public static BulkInsertPartitioner get(BulkInsertSortMode bulkInsertMode, HoodieTableConfig tableConfig) {

Review Comment:
   I think this PR actually adds the first mode that does not sort (except NONE, which is easy to understand anyway). So if anything, we should fix this in this PR.





[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194673575

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 87e6b69600ba3f17f1fe098d3585773a56d6d933 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1194986544

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "82147298deaec87b776a284746826c4004bb3d73",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10341",
       "triggerID" : "82147298deaec87b776a284746826c4004bb3d73",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1c00f46279d3d79c4cf438af1e5a398718c426a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343",
       "triggerID" : "f1c00f46279d3d79c4cf438af1e5a398718c426a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 82147298deaec87b776a284746826c4004bb3d73 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10341) 
   * f1c00f46279d3d79c4cf438af1e5a398718c426a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1198565910

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "triggerType" : "PUSH"
     }, {
       "hash" : "87e6b69600ba3f17f1fe098d3585773a56d6d933",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10320",
       "triggerID" : "1192187988",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "82147298deaec87b776a284746826c4004bb3d73",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10341",
       "triggerID" : "82147298deaec87b776a284746826c4004bb3d73",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1c00f46279d3d79c4cf438af1e5a398718c426a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343",
       "triggerID" : "f1c00f46279d3d79c4cf438af1e5a398718c426a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7409db4ca5e362170ce99f6479bdeeceb3402a8e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10435",
       "triggerID" : "7409db4ca5e362170ce99f6479bdeeceb3402a8e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "76fea0d2cbac3928c2f9088629999207afbad053",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "76fea0d2cbac3928c2f9088629999207afbad053",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f1c00f46279d3d79c4cf438af1e5a398718c426a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10343) 
   * 7409db4ca5e362170ce99f6479bdeeceb3402a8e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10435) 
   * 76fea0d2cbac3928c2f9088629999207afbad053 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5328:
URL: https://github.com/apache/hudi/pull/5328#discussion_r930466594


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDPartitionSortPartitioner.java:
##########
@@ -20,46 +20,62 @@
 
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
-
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.List;
-
+import org.apache.spark.sql.HoodieJavaRDDUtils;
 import scala.Tuple2;
 
+import java.util.Comparator;
+
 /**
  * A built-in partitioner that does local sorting for each RDD partition
- * after coalesce for bulk insert operation, corresponding to the
- * {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ * after coalescing it to specified number of partitions.
+ * Corresponds to the {@link BulkInsertSortMode#PARTITION_SORT} mode.
  *
  * @param <T> HoodieRecordPayload type
  */
 public class RDDPartitionSortPartitioner<T extends HoodieRecordPayload>
-    implements BulkInsertPartitioner<JavaRDD<HoodieRecord<T>>> {
+    extends RepartitioningBulkInsertPartitionerBase<JavaRDD<HoodieRecord<T>>> {
 
+  public RDDPartitionSortPartitioner(HoodieTableConfig tableConfig) {
+    super(tableConfig);
+  }
+
+  @SuppressWarnings("unchecked")
   @Override
   public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records,
                                                      int outputSparkPartitions) {
-    return records.coalesce(outputSparkPartitions)
-        .mapToPair(record ->
-            new Tuple2<>(
-                new StringBuilder()
-                    .append(record.getPartitionPath())
-                    .append("+")
-                    .append(record.getRecordKey())
-                    .toString(), record))
-        .mapPartitions(partition -> {
-          // Sort locally in partition
-          List<Tuple2<String, HoodieRecord<T>>> recordList = new ArrayList<>();
-          for (; partition.hasNext(); ) {
-            recordList.add(partition.next());
-          }
-          Collections.sort(recordList, (o1, o2) -> o1._1.compareTo(o2._1));
-          return recordList.stream().map(e -> e._2).iterator();
-        });
+
+    // NOTE: Datasets being ingested into partitioned tables are additionally re-partitioned to better
+    //       align dataset's logical partitioning with expected table's physical partitioning to
+    //       provide for appropriate file-sizing and better control of the number of files created.
+    //
+    //       Please check out {@code GlobalSortPartitioner} java-doc for more details
+    if (isPartitionedTable) {
+      PartitionPathRDDPartitioner partitioner =
+          new PartitionPathRDDPartitioner((pair) -> ((Pair<String, String>) pair).getKey(), outputSparkPartitions);
+
+      // Both partition-path and record-key are extracted, since
+      //    - Partition-path will be used for re-partitioning (as called out above)
+      //    - Record-key will be used for sorting the records w/in individual partitions
+      return records.mapToPair(record -> new Tuple2<>(Pair.of(record.getPartitionPath(), record.getRecordKey()), record))
+          // NOTE: We're sorting by (partition-path, record-key) pair to make sure that in case
+          //       when there are less Spark partitions (requested) than there are physical partitions
+          //       (in which case multiple physical partitions, will be handled w/in single Spark
+          //       partition) records w/in a single Spark partition are still ordered first by
+          //       partition-path, then record's key
+          .repartitionAndSortWithinPartitions(partitioner, Comparator.naturalOrder())
+          .values();
+    } else {
+      JavaPairRDD<String, HoodieRecord<T>> kvPairsRDD =
+          records.coalesce(outputSparkPartitions).mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));
+
+      // NOTE: [[JavaRDD]] doesn't expose an API to do the sorting w/o (re-)shuffling, as such
+      //       we're relying on our own sequence to achieve that
+      return HoodieJavaRDDUtils.sortWithinPartitions(kvPairsRDD, Comparator.naturalOrder()).values();

Review Comment:
   We can't as it could spill
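To make the repartition-and-sort idea in the hunk above concrete, here is a minimal plain-Java sketch (no Spark; the class and names `PartitionSortSketch`, `Rec`, `bucketFor`, and `repartitionAndSort` are invented for illustration and are not part of the Hudi codebase). It mirrors the two steps the diff performs: route every record to a bucket by hashing only the partition path, then sort each bucket by the (partition-path, record-key) pair so records stay ordered even when several physical partitions share one bucket.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PartitionSortSketch {
  // Hypothetical stand-in for the routing info of a HoodieRecord.
  static final class Rec {
    final String partitionPath;
    final String recordKey;
    Rec(String partitionPath, String recordKey) {
      this.partitionPath = partitionPath;
      this.recordKey = recordKey;
    }
  }

  // Analogous to a partition-path-based partitioner: bucket by hash of the
  // partition path only, so all records of one physical partition land in
  // the same output bucket.
  static int bucketFor(String partitionPath, int numBuckets) {
    return Math.floorMod(partitionPath.hashCode(), numBuckets);
  }

  // Analogous to repartitionAndSortWithinPartitions: group records into
  // buckets, then sort each bucket by (partition-path, record-key).
  static List<List<Rec>> repartitionAndSort(List<Rec> records, int numBuckets) {
    List<List<Rec>> buckets = new ArrayList<>();
    for (int i = 0; i < numBuckets; i++) {
      buckets.add(new ArrayList<>());
    }
    for (Rec r : records) {
      buckets.get(bucketFor(r.partitionPath, numBuckets)).add(r);
    }
    Comparator<Rec> byPathThenKey = Comparator
        .comparing((Rec r) -> r.partitionPath)
        .thenComparing(r -> r.recordKey);
    for (List<Rec> bucket : buckets) {
      bucket.sort(byPathThenKey);
    }
    return buckets;
  }
}
```

In Spark itself, the bucketing and the per-bucket sort happen inside one shuffle via `repartitionAndSortWithinPartitions`, which is what lets the sort spill to disk instead of buffering in memory.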



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDPartitionSortPartitioner.java:
##########
+      return records.mapToPair(record -> new Tuple2<>(Pair.of(record.getPartitionPath(), record.getRecordKey()), record))
+          // NOTE: We're sorting by (partition-path, record-key) pair to make sure that in case
+          //       when there are less Spark partitions (requested) than there are physical partitions
+          //       (in which case multiple physical partitions, will be handled w/in single Spark
+          //       partition) records w/in a single Spark partition are still ordered first by
+          //       partition-path, then record's key
+          .repartitionAndSortWithinPartitions(partitioner, Comparator.naturalOrder())

Review Comment:
   Fair call out. What's particularly troubling from your perspective?
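For contrast with the spill concern raised above: the pre-change `mapPartitions`-based implementation buffers an entire partition into an in-heap list before sorting, which is exactly what cannot spill to disk. A minimal plain-Java sketch of that pattern (no Spark; `LocalSortSketch` and `sortPartitionInMemory` are invented names for illustration):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class LocalSortSketch {
  // Sorts one partition's (sortKey, payload) pairs entirely in memory,
  // mirroring the removed mapPartitions-based implementation. Draining the
  // whole iterator into a list is what risks OOM on large partitions,
  // whereas Spark's shuffle-based sort can spill to disk.
  static <V> Iterator<V> sortPartitionInMemory(Iterator<Map.Entry<String, V>> partition) {
    List<Map.Entry<String, V>> buffer = new ArrayList<>();
    while (partition.hasNext()) {
      buffer.add(partition.next());
    }
    buffer.sort(Map.Entry.comparingByKey());
    List<V> values = new ArrayList<>(buffer.size());
    for (Map.Entry<String, V> e : buffer) {
      values.add(e.getValue());
    }
    return values.iterator();
  }
}
```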





[GitHub] [hudi] alexeykudinkin commented on pull request #5328: [HUDI-3883] Add new Bulk Insert mode to repartition the dataset based on Partition Path without sorting

Posted by "alexeykudinkin (via GitHub)" <gi...@apache.org>.
alexeykudinkin commented on PR #5328:
URL: https://github.com/apache/hudi/pull/5328#issuecomment-1419959573

   Superseded by #7872

