Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/15 03:12:41 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

alexeykudinkin opened a new pull request #4606:
URL: https://github.com/apache/hudi/pull/4606


   
   ## What is the purpose of the pull request
   
   Refactoring layout optimization (clustering) flow to
    - Enable support for linear (lexicographic) ordering as one of the ordering strategies (alongside Z-order and Hilbert)
    - Reconcile Layout Optimization and Clustering configuration to be more congruent
   
   ## Brief change log
   
    - Refactored layout optimization flow to enable support for linear (lexicographic) ordering in column-stats indexes
    - Reconciled Layout Optimization and Clustering configuration to be more congruent
    - Refactored tests to validate the full matrix of optimization strategies and spatial curve composition strategies
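
For illustration only (this is not code from the pull request), the difference between linear (lexicographic) ordering and Z-order can be sketched for rows with two small non-negative integer columns. The class `OrderingSketch`, its helpers, and the sample rows are hypothetical; Hudi's own implementation works on arbitrary column types and is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OrderingSketch {

    // Interleaves the low 16 bits of x and y into a single Z-order (Morton) value.
    // Bits of x land on odd positions, bits of y on even positions.
    public static long zValue(int x, int y) {
        long z = 0L;
        for (int i = 0; i < 16; i++) {
            z |= ((long) ((x >> i) & 1)) << (2 * i + 1);
            z |= ((long) ((y >> i) & 1)) << (2 * i);
        }
        return z;
    }

    // Linear (lexicographic) ordering: compare the first column, then the second.
    public static final Comparator<int[]> LINEAR =
        Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]);

    // Z-order: compare rows by their interleaved-bit value.
    public static final Comparator<int[]> Z_ORDER =
        Comparator.comparingLong((int[] r) -> zValue(r[0], r[1]));

    // Returns a sorted copy so the input list is left untouched.
    public static List<int[]> sorted(List<int[]> rows, Comparator<int[]> order) {
        List<int[]> copy = new ArrayList<>(rows);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
            new int[] {1, 0}, new int[] {0, 1}, new int[] {0, 2}, new int[] {1, 1});

        // Linear:  [0, 1] [0, 2] [1, 0] [1, 1]
        for (int[] r : sorted(rows, LINEAR)) {
            System.out.print(Arrays.toString(r) + " ");
        }
        System.out.println();

        // Z-order: [0, 1] [1, 0] [1, 1] [0, 2]
        for (int[] r : sorted(rows, Z_ORDER)) {
            System.out.print(Arrays.toString(r) + " ");
        }
        System.out.println();
    }
}
```

Linear ordering keeps rows with equal first-column values adjacent, while Z-order trades some of that locality for locality across both columns.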
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017069976


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) 
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on a change in pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#discussion_r789257773



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
##########
@@ -190,15 +190,7 @@
       .withDocumentation("When rewriting data, preserves existing hoodie_commit_time");
 
   /**
-   * Using space-filling curves to optimize the layout of table to boost query performance.
-   * The table data which sorted by space-filling curve has better aggregation;
-   * combine with min-max filtering, it can achieve good performance improvement.
-   *
-   * Notice:
-   * when we use this feature, we need specify the sort columns.
-   * The more columns involved in sorting, the worse the aggregation, and the smaller the query performance improvement.
-   * Choose the filter columns which commonly used in query sql as sort columns.
-   * It is recommend that 2 ~ 4 columns participate in sorting.
+   * @deprecated this setting has no effect

Review comment:
       Can you add documentation stating which other config(s) the user should look at instead of this deprecated one?







[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1019658824


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452",
       "triggerID" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452) 
   * 2151a3b80f2a05975b518d788e54c503051f8a7e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1019658824


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452",
       "triggerID" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452) 
   * 2151a3b80f2a05975b518d788e54c503051f8a7e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1018842306


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) 
   * 2151a3b80f2a05975b518d788e54c503051f8a7e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1018886679


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2151a3b80f2a05975b518d788e54c503051f8a7e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1019642254


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452",
       "triggerID" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1019659815


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452",
       "triggerID" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2151a3b80f2a05975b518d788e54c503051f8a7e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017071243


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) 
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1019642254


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452",
       "triggerID" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5452) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan merged pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #4606:
URL: https://github.com/apache/hudi/pull/4606


   





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017109276


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#discussion_r788291798



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
##########
@@ -207,41 +199,71 @@
       .withDocumentation("Enable use z-ordering/space-filling curves to optimize the layout of table to boost query performance. "
           + "This parameter takes precedence over clustering strategy set using " + EXECUTION_STRATEGY_CLASS_NAME.key());
 
-  public static final ConfigProperty LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty
+  /**
+   * Determines the ordering strategy for records layout optimization.
+   * Currently, the following strategies are supported:
+   * <ul>
+   *   <li>Linear: simply orders records lexicographically</li>
+   *   <li>Z-order: orders records along Z-order spatial-curve</li>
+   *   <li>Hilbert: orders records along Hilbert's spatial-curve</li>
+   * </ul>
+   *
+   * NOTE: "z-order", "hilbert" strategies may consume considerably more compute, than "linear".
+   *       Make sure to perform small-scale local testing for your dataset before applying globally.
+   */
+  public static final ConfigProperty<String> LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty
       .key(LAYOUT_OPTIMIZE_PARAM_PREFIX + "strategy")
       .defaultValue("z-order")
       .sinceVersion("0.10.0")
-      .withDocumentation("Type of layout optimization to be applied, current only supports `z-order` and `hilbert` curves.");
+      .withDocumentation("Determines ordering strategy used in records layout optimization. "
+          + "Currently, \"linear\", \"z-order\" and \"hilbert\" strategies are supported.");
 
   /**
-   * There exists two method to build z-curve.
-   * one is directly mapping sort cols to z-value to build z-curve;
-   * we can find this method in Amazon DynamoDB https://aws.amazon.com/cn/blogs/database/tag/z-order/
-   * the other one is Boundary-based Interleaved Index method which we proposed. simply call it sample method.
-   * Refer to rfc-28 for specific algorithm flow.
-   * Boundary-based Interleaved Index method has better generalization, but the build speed is slower than direct method.
+   * NOTE: This setting only has effect if {@link #LAYOUT_OPTIMIZE_STRATEGY} value is set to
+   *       either "z-order" or "hilbert" (ie leveraging space-filling curves)
+   *
+   * Currently, two methods to order records along the curve are supported "build" and "sample":

Review comment:
       Good catch!

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -134,16 +134,28 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
    * @return {@link RDDCustomColumnsSortPartitioner} if sort columns are provided, otherwise empty.
    */
   protected Option<BulkInsertPartitioner<T>> getPartitioner(Map<String, String> strategyParams, Schema schema) {
-    if (getWriteConfig().isLayoutOptimizationEnabled()) {
-      // sort input records by z-order/hilbert
-      return Option.of(new RDDSpatialCurveOptimizationSortPartitioner((HoodieSparkEngineContext) getEngineContext(),
-          getWriteConfig(), HoodieAvroUtils.addMetadataFields(schema)));
-    } else if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) {
-      return Option.of(new RDDCustomColumnsSortPartitioner(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(","),
-          HoodieAvroUtils.addMetadataFields(schema), getWriteConfig().isConsistentLogicalTimestampEnabled()));
-    } else {
-      return Option.empty();
-    }
+    Option<String[]> orderByColumnsOpt =
+        Option.ofNullable(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()))
+            .map(listStr -> listStr.split(","));
+
+    return orderByColumnsOpt.map(orderByColumns -> {

Review comment:
       It will fallback to no-op in that case
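
As an illustration of the fallback described above, here is a minimal sketch that assumes the no-op case is the one where no sort columns are configured. It uses `java.util.Optional` in place of Hudi's `Option`, and the class `PartitionerSelectionSketch`, the key `sort.columns`, and the string stand-in for a partitioner are all hypothetical. The hunk above is cut off, so this is not the actual method body.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class PartitionerSelectionSketch {

    // Hypothetical parameter key standing in for PLAN_STRATEGY_SORT_COLUMNS.key().
    public static final String SORT_COLUMNS_KEY = "sort.columns";

    // Returns a description of the chosen partitioner, or empty (no-op)
    // when no sort columns are present in the strategy parameters.
    public static Optional<String> choosePartitioner(Map<String, String> strategyParams) {
        return Optional.ofNullable(strategyParams.get(SORT_COLUMNS_KEY))
            .map(listStr -> listStr.split(","))
            .map(columns -> "sort-by:" + String.join("+", columns));
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        // No sort columns configured: the chain short-circuits to empty.
        System.out.println(choosePartitioner(params).isPresent());

        params.put(SORT_COLUMNS_KEY, "a,b");
        // Sort columns configured: a partitioner description is produced.
        System.out.println(choosePartitioner(params).get());
    }
}
```

Because every step is a `map` on the optional value, a missing key needs no explicit `else` branch: the result is simply empty.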







[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017109276


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1018844411


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) 
   * 2151a3b80f2a05975b518d788e54c503051f8a7e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1014930959


   BTW, looks like there are some CI failures. Can you please check them?





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017071243


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) 
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1018842306


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) 
   * 2151a3b80f2a05975b518d788e54c503051f8a7e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1019641355


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1018844411


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) 
   * 2151a3b80f2a05975b518d788e54c503051f8a7e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1018886679


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421",
       "triggerID" : "2151a3b80f2a05975b518d788e54c503051f8a7e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2151a3b80f2a05975b518d788e54c503051f8a7e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5421) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013597581


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013607661


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017069976


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a0b32bbf0d5d23b8facbe2581ad086433afc2de6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) 
   * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1014930782


   @alexeykudinkin: is there anyone you know who will review this patch, or do you want me to review it?
   





[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013598133


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#discussion_r789977849



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
##########
@@ -190,15 +190,7 @@
       .withDocumentation("When rewriting data, preserves existing hoodie_commit_time");
 
   /**
-   * Using space-filling curves to optimize the layout of table to boost query performance.
-   * The table data which sorted by space-filling curve has better aggregation;
-   * combine with min-max filtering, it can achieve good performance improvement.
-   *
-   * Notice:
-   * when we use this feature, we need specify the sort columns.
-   * The more columns involved in sorting, the worse the aggregation, and the smaller the query performance improvement.
-   * Choose the filter columns which commonly used in query sql as sort columns.
-   * It is recommend that 2 ~ 4 columns participate in sorting.
+   * @deprecated this setting has no effect

Review comment:
       Updated







[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013607661


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1019641355


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d2e3cd6dc903a7012a031b6bc5ea30dbbe25d68c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1019638953


   @alexeykudinkin : I pushed a minor update to fix the build failure. 





[GitHub] [hudi] nsivabalan commented on a change in pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#discussion_r788247581



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
##########
@@ -207,41 +199,71 @@
       .withDocumentation("Enable use z-ordering/space-filling curves to optimize the layout of table to boost query performance. "
           + "This parameter takes precedence over clustering strategy set using " + EXECUTION_STRATEGY_CLASS_NAME.key());
 
-  public static final ConfigProperty LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty
+  /**
+   * Determines ordering strategy in for records layout optimization.
+   * Currently, following strategies are supported
+   * <ul>
+   *   <li>Linear: simply orders records lexicographically</li>
+   *   <li>Z-order: orders records along Z-order spatial-curve</li>
+   *   <li>Hilbert: orders records along Hilbert's spatial-curve</li>
+   * </ul>
+   *
+   * NOTE: "z-order", "hilbert" strategies may consume considerably more compute, than "linear".
+   *       Make sure to perform small-scale local testing for your dataset before applying globally.
+   */
+  public static final ConfigProperty<String> LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty
       .key(LAYOUT_OPTIMIZE_PARAM_PREFIX + "strategy")
       .defaultValue("z-order")
       .sinceVersion("0.10.0")
-      .withDocumentation("Type of layout optimization to be applied, current only supports `z-order` and `hilbert` curves.");
+      .withDocumentation("Determines ordering strategy used in records layout optimization. "
+          + "Currently supported strategies are \"linear\", \"z-order\" and \"hilbert\" values are supported.");
 
   /**
-   * There exists two method to build z-curve.
-   * one is directly mapping sort cols to z-value to build z-curve;
-   * we can find this method in Amazon DynamoDB https://aws.amazon.com/cn/blogs/database/tag/z-order/
-   * the other one is Boundary-based Interleaved Index method which we proposed. simply call it sample method.
-   * Refer to rfc-28 for specific algorithm flow.
-   * Boundary-based Interleaved Index method has better generalization, but the build speed is slower than direct method.
+   * NOTE: This setting only has effect if {@link #LAYOUT_OPTIMIZE_STRATEGY} value is set to
+   *       either "z-order" or "hilbert" (ie leveraging space-filling curves)
+   *
+   * Currently, two methods to order records along the curve are supported "build" and "sample":

Review comment:
       Should this be "direct" instead of "build"?
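
For readers unfamiliar with the strategies named in the Javadoc above, here is a minimal sketch of how "linear" (lexicographic) ordering differs from Z-order on two integer columns. It illustrates the general technique only and is not Hudi's implementation: the class `OrderingSketch`, the method `zValue`, the 16-bit width, and the choice of two `int` columns are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; none of these names exist in Hudi.
final class OrderingSketch {

    // Builds a Morton (Z-order) key from the low 16 bits of x and y:
    // bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    static long zValue(int x, int y) {
        long z = 0L;
        for (int i = 0; i < 16; i++) {
            z |= ((long) ((x >> i) & 1)) << (2 * i);
            z |= ((long) ((y >> i) & 1)) << (2 * i + 1);
        }
        return z;
    }

    // Linear ordering: compare by the first column, then by the second.
    static final Comparator<int[]> LINEAR =
        Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]);

    // Z-order: compare by the interleaved key, so both columns contribute
    // to the high-order bits of the sort key.
    static final Comparator<int[]> Z_ORDER =
        Comparator.comparingLong(p -> zValue(p[0], p[1]));

    public static void main(String[] args) {
        // Enumerate a small 4x4 grid of (x, y) points.
        List<int[]> points = new ArrayList<>();
        for (int x = 0; x < 4; x++) {
            for (int y = 0; y < 4; y++) {
                points.add(new int[] {x, y});
            }
        }

        points.sort(LINEAR);
        System.out.println("linear:");
        points.forEach(p -> System.out.println("  " + Arrays.toString(p)));

        points.sort(Z_ORDER);
        System.out.println("z-order:");
        points.forEach(p -> System.out.println("  " + Arrays.toString(p)));
    }
}
```

Under linear ordering only the leading column is well clustered, while the interleaved key keeps points that are close in both columns near each other in the sort order. Computing the key costs extra work per record, which is consistent with the Javadoc's note that the curve-based strategies may consume more compute than "linear".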

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -134,16 +134,28 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
    * @return {@link RDDCustomColumnsSortPartitioner} if sort columns are provided, otherwise empty.
    */
   protected Option<BulkInsertPartitioner<T>> getPartitioner(Map<String, String> strategyParams, Schema schema) {
-    if (getWriteConfig().isLayoutOptimizationEnabled()) {
-      // sort input records by z-order/hilbert
-      return Option.of(new RDDSpatialCurveOptimizationSortPartitioner((HoodieSparkEngineContext) getEngineContext(),
-          getWriteConfig(), HoodieAvroUtils.addMetadataFields(schema)));
-    } else if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) {
-      return Option.of(new RDDCustomColumnsSortPartitioner(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(","),
-          HoodieAvroUtils.addMetadataFields(schema), getWriteConfig().isConsistentLogicalTimestampEnabled()));
-    } else {
-      return Option.empty();
-    }
+    Option<String[]> orderByColumnsOpt =
+        Option.ofNullable(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()))
+            .map(listStr -> listStr.split(","));
+
+    return orderByColumnsOpt.map(orderByColumns -> {

Review comment:
       What happens if the sort columns config is null or set to an empty string?
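
To make the question concrete, here is a small sketch of the two edge cases in plain Java. It uses `java.util.Optional` as a stand-in for Hudi's `Option` (assuming the two behave alike for `ofNullable` and `map`), and `parseSortColumns` is a hypothetical helper, not something in the patch. The key fact is that `"".split(",")` returns a one-element array containing an empty string, so an empty config value would not map to an empty option on its own.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical illustration; not part of Hudi.
final class SortColumnsSketch {

    // Parses a comma-separated column list. Returns empty when the value is
    // null, blank, or contains no non-empty column names.
    static Optional<String[]> parseSortColumns(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String[] cols = Arrays.stream(raw.split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .toArray(String[]::new);
        return cols.length == 0 ? Optional.empty() : Optional.of(cols);
    }

    public static void main(String[] args) {
        // Splitting an empty string yields one empty element, not zero elements.
        System.out.println("\"\".split(\",\").length = " + "".split(",").length);

        System.out.println("null   -> present? " + parseSortColumns(null).isPresent());
        System.out.println("\"\"     -> present? " + parseSortColumns("").isPresent());
        System.out.println("\"a, b\" -> " + Arrays.toString(parseSortColumns("a, b").get()));
    }
}
```

Whether the partitioners in the patch tolerate a single empty column name is not shown in this thread, so filtering blank entries before constructing one is only a defensive suggestion.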







[GitHub] [hudi] alexeykudinkin commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017049008


   @nsivabalan correct, all configs are kept and marked as deprecated. The only thing that changes is that some of them no longer have any effect. How should we handle this?
   
   For example, `LAYOUT_OPTIMIZATION_ENABLE` is not used anymore, but that should not affect users:
   
   1. Those who didn't use clustering based on spatial curves will see no change (other configs are required to enable it)
   2. Those who did use clustering based on spatial curves will also not be affected, because it also required clustering to be enabled (which they should already have enabled)





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013598133


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013597581


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

