Posted to commits@hudi.apache.org by "yihua (via GitHub)" <gi...@apache.org> on 2023/03/11 06:32:49 UTC

[GitHub] [hudi] yihua opened a new pull request, #8157: [HUDI-5920] Improve documentation of parallelism configs

yihua opened a new pull request, #8157:
URL: https://github.com/apache/hudi/pull/8157

   ### Change Logs
   
   This PR improves the documentation for the following parallelism configs:
   ```
   hoodie.archive.delete.parallelism
   hoodie.bloom.index.parallelism
   hoodie.simple.index.parallelism
   hoodie.global.simple.index.parallelism
   hoodie.insert.shuffle.parallelism
   hoodie.bulkinsert.shuffle.parallelism
   hoodie.upsert.shuffle.parallelism
   hoodie.delete.shuffle.parallelism
   hoodie.rollback.parallelism
   ```
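
For readers who want to set some of these explicitly, here is a minimal sketch of building such options in Java. The class name `ParallelismOptionsSketch`, the helper `explicitShuffleParallelism`, and the value `200` are illustrative assumptions and not part of Hudi; only the config keys come from the list above. A map like this could, for instance, be handed to a Spark `DataFrameWriter` through its `options(...)` method.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hudi: collects explicit shuffle
// parallelism settings for the write operations listed above.
public final class ParallelismOptionsSketch {

    // Returns the four shuffle parallelism keys, all set to the same value.
    public static Map<String, String> explicitShuffleParallelism(int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        String value = Integer.toString(parallelism);
        Map<String, String> options = new LinkedHashMap<>();
        options.put("hoodie.insert.shuffle.parallelism", value);
        options.put("hoodie.bulkinsert.shuffle.parallelism", value);
        options.put("hoodie.upsert.shuffle.parallelism", value);
        options.put("hoodie.delete.shuffle.parallelism", value);
        return options;
    }

    public static void main(String[] args) {
        // 200 is only an example value; choose one based on the input size.
        explicitShuffleParallelism(200)
            .forEach((key, val) -> System.out.println(key + "=" + val));
    }
}
```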
   
   ### Impact
   
   Improves the config docs so that users understand how each parallelism config affects its corresponding operation.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   As above. The updated config docs will be published on the ["All Configurations"](https://hudi.apache.org/docs/configurations) page.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1468700187

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713",
       "triggerID" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6416774dc64379af30baf6d0d45956d9a19be64a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on a diff in pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on code in PR #8157:
URL: https://github.com/apache/hudi/pull/8157#discussion_r1135775680


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -247,13 +247,29 @@ public class HoodieWriteConfig extends HoodieConfig {
   public static final ConfigProperty<String> INSERT_PARALLELISM_VALUE = ConfigProperty
       .key("hoodie.insert.shuffle.parallelism")
       .defaultValue("0")
-      .withDocumentation("Parallelism for inserting records into the table. Inserts can shuffle data before writing to tune file sizes and optimize the storage layout.");
+      .withDocumentation("Parallelism for inserting records into the table. Inserts can shuffle "
+          + "data before writing to tune file sizes and optimize the storage layout. Before "
+          + "0.13.0 release, if users do not configure it, Hudi would use 200 as the default "
+          + "shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the "
+          + "parallelism deduced by Spark based on the source data. If the shuffle parallelism "
+          + "is explicitly configured by the user, the user-configured parallelism is "
+          + "used in defining the actual parallelism. If you observe small files from the insert "
+          + "operation, we suggest configuring this shuffle parallelism explicitly, so that the "
+          + "parallelism is around total_input_data_size/500MB.");

Review Comment:
   Makes sense.  Fixed now.





[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1464848306

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5075feb0a984758ac4dc2999bf503d0df3b1dbd1 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1468440307

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5075feb0a984758ac4dc2999bf503d0df3b1dbd1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672) 
   * 6416774dc64379af30baf6d0d45956d9a19be64a UNKNOWN
   




[GitHub] [hudi] nsivabalan commented on a diff in pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8157:
URL: https://github.com/apache/hudi/pull/8157#discussion_r1133042083


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -247,13 +247,29 @@ public class HoodieWriteConfig extends HoodieConfig {
   public static final ConfigProperty<String> INSERT_PARALLELISM_VALUE = ConfigProperty
       .key("hoodie.insert.shuffle.parallelism")
       .defaultValue("0")
-      .withDocumentation("Parallelism for inserting records into the table. Inserts can shuffle data before writing to tune file sizes and optimize the storage layout.");
+      .withDocumentation("Parallelism for inserting records into the table. Inserts can shuffle "
+          + "data before writing to tune file sizes and optimize the storage layout. Before "
+          + "0.13.0 release, if users do not configure it, Hudi would use 200 as the default "
+          + "shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the "
+          + "parallelism deduced by Spark based on the source data. If the shuffle parallelism "
+          + "is explicitly configured by the user, the user-configured parallelism is "
+          + "used in defining the actual parallelism. If you observe small files from the insert "
+          + "operation, we suggest configuring this shuffle parallelism explicitly, so that the "
+          + "parallelism is around total_input_data_size/500MB.");

Review Comment:
   Let's try to stick to 120MB, which is Hudi's default file size. High-scale users can tweak the configs as they wish, but for an average user, 120MB should be good.
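
To make the sizing rule under discussion concrete (parallelism of roughly total input size divided by the target file size), here is a minimal sketch in Java. The class `ShuffleParallelismSketch`, the method `suggestParallelism`, and the 10 GiB input are hypothetical and not part of Hudi; the 120MB target is taken from the comment above and is treated here as 120 * 1024 * 1024 bytes, which is an assumption.

```java
// Hypothetical helper, not part of Hudi: estimates a shuffle parallelism
// so that each shuffle partition carries roughly one target-sized file.
public final class ShuffleParallelismSketch {

    // Assumed target file size of 120MB, interpreted as 120 * 1024 * 1024 bytes.
    public static final long TARGET_FILE_SIZE_BYTES = 120L * 1024 * 1024;

    // Returns ceil(totalInputBytes / targetBytes), never less than 1.
    public static int suggestParallelism(long totalInputBytes, long targetBytes) {
        if (totalInputBytes < 0 || targetBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        long partitions = totalInputBytes / targetBytes
            + (totalInputBytes % targetBytes == 0 ? 0 : 1);
        // Clamp into the int range and keep at least one partition.
        return (int) Math.max(1L, Math.min(partitions, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024; // example input size only
        int parallelism = suggestParallelism(tenGiB, TARGET_FILE_SIZE_BYTES);
        System.out.println("hoodie.insert.shuffle.parallelism=" + parallelism);
    }
}
```

The result is only a starting point; the actual file sizes also depend on compression and on how the records are distributed across partitions.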





[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1468460014

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713",
       "triggerID" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5075feb0a984758ac4dc2999bf503d0df3b1dbd1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672) 
   * 6416774dc64379af30baf6d0d45956d9a19be64a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1472630518

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713",
       "triggerID" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e690ef6dd47345352557c1894d4f77d16f1dd01e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15757",
       "triggerID" : "e690ef6dd47345352557c1894d4f77d16f1dd01e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e690ef6dd47345352557c1894d4f77d16f1dd01e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15757) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1464849404

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5075feb0a984758ac4dc2999bf503d0df3b1dbd1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1464877673

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5075feb0a984758ac4dc2999bf503d0df3b1dbd1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672) 
   




[GitHub] [hudi] yihua merged pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua merged PR #8157:
URL: https://github.com/apache/hudi/pull/8157




[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1472411570

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713",
       "triggerID" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e690ef6dd47345352557c1894d4f77d16f1dd01e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15757",
       "triggerID" : "e690ef6dd47345352557c1894d4f77d16f1dd01e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6416774dc64379af30baf6d0d45956d9a19be64a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713) 
   * e690ef6dd47345352557c1894d4f77d16f1dd01e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15757) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8157:
URL: https://github.com/apache/hudi/pull/8157#issuecomment-1472340794

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15672",
       "triggerID" : "5075feb0a984758ac4dc2999bf503d0df3b1dbd1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713",
       "triggerID" : "6416774dc64379af30baf6d0d45956d9a19be64a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e690ef6dd47345352557c1894d4f77d16f1dd01e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e690ef6dd47345352557c1894d4f77d16f1dd01e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6416774dc64379af30baf6d0d45956d9a19be64a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15713) 
   * e690ef6dd47345352557c1894d4f77d16f1dd01e UNKNOWN
   

