Posted to reviews@spark.apache.org by 10110346 <gi...@git.apache.org> on 2018/09/06 10:43:16 UTC
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
GitHub user 10110346 opened a pull request:
https://github.com/apache/spark/pull/22350
[SPARK-25356][SQL]Add Parquet block size option to SparkSQL configuration
## What changes were proposed in this pull request?
I think the Parquet block size should be configurable when writing in the Parquet format.
For HDFS, `dfs.block.size` is configurable, and sometimes we want the Parquet block size to be consistent with it.
It is also worth asking whether `spark.sql.files.maxPartitionBytes` should be kept consistent with the Parquet block size when using the Parquet format.
We may also want to shrink the Parquet block size in some tests.
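For illustration only, the following is a minimal sketch of how the sizes mentioned above could be kept consistent today, without the option proposed in this pull request. It assumes a local SparkSession, an HDFS block size of 128 MB, and a scratch output path under `/tmp`; the object name, the chosen size, and the path are all hypothetical. It relies on the existing Hadoop key `parquet.block.size` rather than on any new SparkSQL configuration.

```scala
import org.apache.spark.sql.SparkSession

object ParquetBlockSizeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-block-size-sketch")
      .master("local[*]")
      .getOrCreate()

    // Assumption: the cluster's HDFS block size is 128 MB.
    val blockSizeBytes = 128 * 1024 * 1024

    // Write side: Parquet row group ("block") size, set through the
    // Hadoop configuration key that the Parquet writer already reads.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", blockSizeBytes)

    // Read side: maximum bytes packed into one partition when reading files.
    spark.conf.set("spark.sql.files.maxPartitionBytes", blockSizeBytes.toString)

    // Hypothetical scratch path; any writable location would do.
    spark.range(0, 1000)
      .write
      .mode("overwrite")
      .parquet("/tmp/parquet-block-size-sketch")

    spark.stop()
  }
}
```

A test that wants small row groups could pass a much smaller value for `blockSizeBytes` in the same way.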
## How was this patch tested?
N/A
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/10110346/spark addblocksize
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22350.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22350
----
commit 3485b523d54e83ed3388febd06b3ac4914d181ed
Author: liuxian <li...@...>
Date: 2018-09-06T10:35:43Z
fix
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22350
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95752/
Test FAILed.
---
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
Posted by 10110346 <gi...@git.apache.org>.
Github user 10110346 closed the pull request at:
https://github.com/apache/spark/pull/22350
---
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22350#discussion_r215812113
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
// Sets compression scheme
conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+ // Sets Parquet block size
+ conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --
I doubt this is common enough to warrant an alias and documentation in `sql-programming-guide.md`. In my experience, other configurations such as `parquet.page.size`, `parquet.enable.dictionary`, and `parquet.writer.version` are used about as often as this one.
I would not add this for now.
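As a rough sketch of the point above, several Parquet writer keys can be applied through the Hadoop configuration in the same way, with no Spark-side alias. The helper `applyParquetSettings`, the object name, the chosen values, and the output path are hypothetical and exist only for this example; the keys themselves are the ones named in this thread.

```scala
import org.apache.spark.sql.SparkSession

object ParquetWriterSettingsSketch {
  // Hypothetical helper: copies each key/value pair into the Hadoop
  // configuration that later Parquet writes in this application will see.
  def applyParquetSettings(spark: SparkSession, settings: Map[String, String]): Unit = {
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    settings.foreach { case (key, value) => hadoopConf.set(key, value) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-writer-settings-sketch")
      .master("local[*]")
      .getOrCreate()

    // Illustrative values only; they are not recommendations.
    applyParquetSettings(spark, Map(
      "parquet.block.size" -> (64 * 1024 * 1024).toString,
      "parquet.page.size" -> (1024 * 1024).toString,
      "parquet.enable.dictionary" -> "true"
    ))

    // Hypothetical scratch path.
    spark.range(0, 1000)
      .write
      .mode("overwrite")
      .parquet("/tmp/parquet-writer-settings-sketch")

    spark.stop()
  }
}
```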
---
[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22350
**[Test build #95752 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95752/testReport)** for PR 22350 at commit [`3485b52`](https://github.com/apache/spark/commit/3485b523d54e83ed3388febd06b3ac4914d181ed).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22350
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22350
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2898/
Test PASSed.
---
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
Posted by 10110346 <gi...@git.apache.org>.
Github user 10110346 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22350#discussion_r215598798
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
// Sets compression scheme
conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+ // Sets Parquet block size
+ conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --
Yes, we are already able to set this via `parquet.block.size`.
I think we should add this parameter to `sql-programming-guide.md`.
---
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
Posted by 10110346 <gi...@git.apache.org>.
Github user 10110346 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22350#discussion_r215819785
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
// Sets compression scheme
conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+ // Sets Parquet block size
+ conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --
Sounds reasonable. I'll close it now, thanks.
---
[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22350
**[Test build #95752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95752/testReport)** for PR 22350 at commit [`3485b52`](https://github.com/apache/spark/commit/3485b523d54e83ed3388febd06b3ac4914d181ed).
---
[GitHub] spark issue #22350: [SPARK-25356][SQL]Add Parquet block size option to Spark...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22350
Merged build finished. Test FAILed.
---
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22350#discussion_r215595058
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
// Sets compression scheme
conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+ // Sets Parquet block size
+ conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --
For clarification, we are already able to set this via `parquet.block.size` but this PR proposes an alias for it, right?
---