Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/24 21:14:50 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request, #5966: [WIP] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting

alexeykudinkin opened a new pull request, #5966:
URL: https://github.com/apache/hudi/pull/5966

   ## What is the purpose of the pull request
   
   The dictionary-encoding setting was left stranded: it was never propagated to the Parquet writer used for bulk-inserting, so the writer fell back to `false` (not sure why the fallback is `false`). In our tests this increased the storage footprint by about **30%**.
   
   ## Brief change log
   
    - Fixed the config that enables dictionary encoding so that it is propagated to the Parquet writer
    - Removed unnecessary override for the `HoodieParquetConfig` class
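
The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use Hudi or Parquet classes: `SketchParquetConfig`, `SketchWriteConfig`, `buildBeforeFix`, and `buildAfterFix` are hypothetical names made up for illustration, and the assumption that the flag was lost through a convenience constructor with a hard-coded `false` is a guess at the mechanism, not a description of the actual Hudi code.

```java
import java.util.Objects;

public class DictionaryFlagSketch {

    /** Hypothetical stand-in for a Parquet writer config; not Hudi's actual class. */
    static final class SketchParquetConfig {
        final String compressionCodec;
        final int pageSize;
        final boolean dictionaryEnabled;

        // Convenience constructor (assumed): silently falls back to dictionaryEnabled = false.
        SketchParquetConfig(String compressionCodec, int pageSize) {
            this(compressionCodec, pageSize, false);
        }

        // Full constructor: the caller states the dictionary setting explicitly.
        SketchParquetConfig(String compressionCodec, int pageSize, boolean dictionaryEnabled) {
            this.compressionCodec = Objects.requireNonNull(compressionCodec);
            this.pageSize = pageSize;
            this.dictionaryEnabled = dictionaryEnabled;
        }
    }

    /** Hypothetical stand-in for the user-facing write config. */
    static final class SketchWriteConfig {
        private final boolean dictionaryEnabled;

        SketchWriteConfig(boolean dictionaryEnabled) {
            this.dictionaryEnabled = dictionaryEnabled;
        }

        String getCompressionCodec() { return "gzip"; }
        int getPageSize() { return 1024 * 1024; }
        boolean parquetDictionaryEnabled() { return dictionaryEnabled; }
    }

    // Before the fix (as assumed here): the flag is never forwarded, so the default wins.
    static SketchParquetConfig buildBeforeFix(SketchWriteConfig wc) {
        return new SketchParquetConfig(wc.getCompressionCodec(), wc.getPageSize());
    }

    // After the fix: the user's setting is passed through to the writer config.
    static SketchParquetConfig buildAfterFix(SketchWriteConfig wc) {
        return new SketchParquetConfig(
            wc.getCompressionCodec(), wc.getPageSize(), wc.parquetDictionaryEnabled());
    }

    public static void main(String[] args) {
        SketchWriteConfig wc = new SketchWriteConfig(true);
        System.out.println("before fix: dictionaryEnabled=" + buildBeforeFix(wc).dictionaryEnabled);
        System.out.println("after fix:  dictionaryEnabled=" + buildAfterFix(wc).dictionaryEnabled);
    }
}
```

In this sketch the user enables dictionary encoding, yet the config built without the flag still reports it as disabled. Passing the flag through the full constructor is the shape of the change this PR makes.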
    
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan merged pull request #5966: [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting

Posted by GitBox <gi...@apache.org>.
nsivabalan merged PR #5966:
URL: https://github.com/apache/hudi/pull/5966




[GitHub] [hudi] hudi-bot commented on pull request #5966: [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5966:
URL: https://github.com/apache/hudi/pull/5966#issuecomment-1165950156

   ## CI report:
   
   * 85ea89b51298cee4982757376fdb1aa7c01b7d3f UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5966: [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5966:
URL: https://github.com/apache/hudi/pull/5966#issuecomment-1165955269

   ## CI report:
   
   * 85ea89b51298cee4982757376fdb1aa7c01b7d3f UNKNOWN
   * b086b2ade66b295867e65e4219f1e78e903b403c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9519) 
   




[GitHub] [hudi] hudi-bot commented on pull request #5966: [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5966:
URL: https://github.com/apache/hudi/pull/5966#issuecomment-1166166305

   ## CI report:
   
   * 85ea89b51298cee4982757376fdb1aa7c01b7d3f UNKNOWN
   * b086b2ade66b295867e65e4219f1e78e903b403c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9519) 
   




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5966: [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5966:
URL: https://github.com/apache/hudi/pull/5966#discussion_r906417856


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieInternalRowFileWriterFactory.java:
##########
@@ -68,14 +69,17 @@ private static HoodieInternalRowFileWriter newParquetInternalRowFileWriter(
     HoodieRowParquetWriteSupport writeSupport =
             new HoodieRowParquetWriteSupport(table.getHadoopConf(), structType, filter, writeConfig);
     return new HoodieInternalRowParquetWriter(
-        path, new HoodieRowParquetConfig(
+        path,
+        new HoodieParquetConfig<>(
             writeSupport,
             writeConfig.getParquetCompressionCodec(),
             writeConfig.getParquetBlockSize(),
             writeConfig.getParquetPageSize(),
             writeConfig.getParquetMaxFileSize(),
             writeSupport.getHadoopConf(),
-            writeConfig.getParquetCompressionRatio()));
+            writeConfig.getParquetCompressionRatio(),
+            writeConfig.parquetDictionaryEnabled()

Review Comment:
   This is the culprit








[GitHub] [hudi] hudi-bot commented on pull request #5966: [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5966:
URL: https://github.com/apache/hudi/pull/5966#issuecomment-1165952689

   ## CI report:
   
   * 85ea89b51298cee4982757376fdb1aa7c01b7d3f UNKNOWN
   * b086b2ade66b295867e65e4219f1e78e903b403c UNKNOWN
   

