Posted to reviews@spark.apache.org by 10110346 <gi...@git.apache.org> on 2018/09/06 10:43:16 UTC

[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

GitHub user 10110346 opened a pull request:

    https://github.com/apache/spark/pull/22350

    [SPARK-25356][SQL]Add Parquet block size option to SparkSQL configuration

    ## What changes were proposed in this pull request?
    
    
    I think the Parquet block size should be configurable when writing in the Parquet format.
    For HDFS, `dfs.block.size` is configurable, and sometimes we want the Parquet block size to be consistent with it.
    Also, should `spark.sql.files.maxPartitionBytes` ideally be consistent with the Parquet block size when using the Parquet format?
    We may also want to shrink the Parquet block size in some tests.
    
    ## How was this patch tested?
    N/A


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark addblocksize

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22350.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22350
    
----
commit 3485b523d54e83ed3388febd06b3ac4914d181ed
Author: liuxian <li...@...>
Date:   2018-09-06T10:35:43Z

    fix

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22350
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95752/
    Test FAILed.


---



[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

Posted by 10110346 <gi...@git.apache.org>.
Github user 10110346 closed the pull request at:

    https://github.com/apache/spark/pull/22350


---



[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22350#discussion_r215812113
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -123,6 +123,9 @@ class ParquetFileFormat
         // Sets compression scheme
         conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
     
    +    // Sets Parquet block size
    +    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
    --- End diff --
    
    I doubt this is common enough to warrant an alias or documentation in `sql-programming-guide.md`. In my experience, other configurations such as `parquet.page.size`, `parquet.enable.dictionary`, or `parquet.writer.version` are used about as often as this one.
    
    I would not add this for now.


---



[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22350
  
    **[Test build #95752 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95752/testReport)** for PR 22350 at commit [`3485b52`](https://github.com/apache/spark/commit/3485b523d54e83ed3388febd06b3ac4914d181ed).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22350
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22350
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2898/
    Test PASSed.


---



[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

Posted by 10110346 <gi...@git.apache.org>.
Github user 10110346 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22350#discussion_r215598798
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -123,6 +123,9 @@ class ParquetFileFormat
         // Sets compression scheme
         conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
     
    +    // Sets Parquet block size
    +    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
    --- End diff --
    
    Yes, we can already set this via `parquet.block.size`.
    I think we should document this parameter in `sql-programming-guide.md`.
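
For reference, the existing route mentioned in this comment can be sketched as follows. This is a minimal sketch, assuming Spark SQL is on the classpath and a local master is acceptable; the object name, the 64 MiB value, and the temporary output path are illustrative and do not come from the PR. It relies on the `spark.hadoop.` prefix, which Spark copies into the Hadoop configuration, and on setting `parquet.block.size` directly on `hadoopConfiguration`.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the PR under review.
object ParquetBlockSizeExample {
  // Illustrative value (64 MiB); choose whatever matches your storage layout.
  val blockSizeBytes: Int = 64 * 1024 * 1024

  def main(args: Array[String]): Unit = {
    // Option 1: pass the Parquet key through Spark's "spark.hadoop." prefix
    // when the session is created.
    val spark = SparkSession.builder()
      .appName("parquet-block-size-example")
      .master("local[*]")
      .config("spark.hadoop.parquet.block.size", blockSizeBytes.toString)
      .getOrCreate()

    // Option 2: set the same key on the shared Hadoop configuration at runtime.
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", blockSizeBytes)

    // Write a small Parquet dataset to a fresh temporary directory.
    val out = Files.createTempDirectory("parquet-block-size").resolve("out").toString
    spark.range(0, 1000).write.parquet(out)

    spark.stop()
  }
}
```

Either setting should be enough on its own; both appear here only to show the two places the key can be supplied.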


---



[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

Posted by 10110346 <gi...@git.apache.org>.
Github user 10110346 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22350#discussion_r215819785
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -123,6 +123,9 @@ class ParquetFileFormat
         // Sets compression scheme
         conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
     
    +    // Sets Parquet block size
    +    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
    --- End diff --
    
    Sounds reasonable. I'll close it now, thanks.


---



[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22350
  
    **[Test build #95752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95752/testReport)** for PR 22350 at commit [`3485b52`](https://github.com/apache/spark/commit/3485b523d54e83ed3388febd06b3ac4914d181ed).


---



[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22350
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22350#discussion_r215595058
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -123,6 +123,9 @@ class ParquetFileFormat
         // Sets compression scheme
         conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
     
    +    // Sets Parquet block size
    +    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
    --- End diff --
    
    For clarification, we are already able to set this via `parquet.block.size` but this PR proposes an alias for it, right?


---
