Posted to reviews@spark.apache.org by gengliangwang <gi...@git.apache.org> on 2018/07/19 06:54:34 UTC
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
GitHub user gengliangwang opened a pull request:
https://github.com/apache/spark/pull/21814
[SPARK-24858][SQL] Avoid unnecessary parquet footer reads
## What changes were proposed in this pull request?
Currently, when filter push down is enabled, the same Parquet footer is read twice in `buildReaderWithPartitionValues` of `ParquetFileFormat`.
This change reads the footer once and reuses its file metadata.
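The idea can be illustrated with a minimal sketch that does not use Spark or parquet-mr. `FileMetaData`, `readFooter`, and the counter are hypothetical stand-ins invented for this example, and the assumption that the first read serves filter push down while the second serves the writer check is inferred from the description above.

```scala
// Hypothetical stand-in for footer metadata; not the parquet-mr class.
final case class FileMetaData(createdBy: String, schema: String)

object FooterReuseSketch {
  // Counts simulated footer reads so the difference is observable.
  var footerReads = 0

  // Simulates an expensive footer read from storage.
  def readFooter(path: String): FileMetaData = {
    footerReads += 1
    FileMetaData(createdBy = "parquet-mr version 1.10.0", schema = "message spark_schema {}")
  }

  // Before: each consumer reads the footer on its own.
  def readTwice(path: String): (String, Boolean) = {
    val schema = readFooter(path).schema
    val byParquetMr = readFooter(path).createdBy.startsWith("parquet-mr")
    (schema, byParquetMr)
  }

  // After: read once and share the metadata between both consumers.
  def readOnce(path: String): (String, Boolean) = {
    val meta = readFooter(path)
    (meta.schema, meta.createdBy.startsWith("parquet-mr"))
  }
}
```

Both methods return the same pair; only the number of simulated reads differs.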
## How was this patch tested?
Unit test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gengliangwang/spark parquetFooter
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21814.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21814
----
commit 5667cc57022d840aecb6c7d0c967e2a3448a4928
Author: Gengliang Wang <ge...@...>
Date: 2018-07-19T06:43:45Z
Avoid unnecessary parquet footer reading
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21814
Merged to master.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93264/
Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1113/
Test PASSed.
---
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203621872
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
// *only* if the file was created by something other than "parquet-mr", so check the actual
// writer here for this file. We have to do this per-file, as each file in the table may
// have different writers.
- def isCreatedByParquetMr(): Boolean = {
- val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
- footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
- }
+ val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --
Yes, I just pushed a commit that makes it a lazy value.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Build finished. Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1124/
Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1115/
Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21814
**[Test build #93263 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93263/testReport)** for PR 21814 at commit [`5667cc5`](https://github.com/apache/spark/commit/5667cc57022d840aecb6c7d0c967e2a3448a4928).
* This patch **fails due to an unknown error code, -9**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/21814
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93275/
Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21814
**[Test build #93265 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93265/testReport)** for PR 21814 at commit [`c215b46`](https://github.com/apache/spark/commit/c215b466ae5cb7af30bdffcf739b554d4a9f1844).
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21814
**[Test build #93265 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93265/testReport)** for PR 21814 at commit [`c215b46`](https://github.com/apache/spark/commit/c215b466ae5cb7af30bdffcf739b554d4a9f1844).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1114/
Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93265/
Test PASSed.
---
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203622451
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
// *only* if the file was created by something other than "parquet-mr", so check the actual
// writer here for this file. We have to do this per-file, as each file in the table may
// have different writers.
- def isCreatedByParquetMr(): Boolean = {
- val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
- footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
- }
+ val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --
I think making `footerFileMetaData` a lazy value is better.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93263/
Test FAILed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21814
**[Test build #93275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93275/testReport)** for PR 21814 at commit [`f3e863b`](https://github.com/apache/spark/commit/f3e863b7e11d50f0dedde9ef5291c1a08a2654fd).
---
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203638620
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
// *only* if the file was created by something other than "parquet-mr", so check the actual
// writer here for this file. We have to do this per-file, as each file in the table may
// have different writers.
- def isCreatedByParquetMr(): Boolean = {
- val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
- footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
- }
+ val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --
Also, let's leave a comment here saying that short-circuit evaluation lets us avoid reading the footer. I have already seen multiple people get confused here.
---
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203621554
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
// *only* if the file was created by something other than "parquet-mr", so check the actual
// writer here for this file. We have to do this per-file, as each file in the table may
// have different writers.
- def isCreatedByParquetMr(): Boolean = {
- val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
- footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
- }
+ val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --
Would `lazy` be better? `timestampConversion` defaults to false.
```scala
lazy val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
```
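A minimal sketch of why `lazy` matters when the flag is false, written without Spark or parquet-mr. `readCreatedBy`, `needsConversion`, and the counter are hypothetical names for this example, and the `timestampConversion && !isCreatedByParquetMr` condition is an assumption about how the value is consumed, not a quote from the patch.

```scala
object LazyShortCircuitSketch {
  // Counts simulated footer reads.
  var footerReads = 0

  // Simulates reading the footer and returning its "created by" string.
  def readCreatedBy(path: String): String = {
    footerReads += 1
    "parquet-mr version 1.10.0"
  }

  def needsConversion(path: String, timestampConversion: Boolean): Boolean = {
    // Neither lazy value is computed until something actually uses it.
    lazy val createdBy = readCreatedBy(path)
    lazy val isCreatedByParquetMr = createdBy.startsWith("parquet-mr")
    // `&&` short-circuits: when the flag is false the right-hand side is
    // never evaluated, so the footer is never read.
    timestampConversion && !isCreatedByParquetMr
  }
}
```

With the flag set to false, `needsConversion` returns without touching the simulated footer; with it set to true, the footer is read exactly once.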
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21814
**[Test build #93263 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93263/testReport)** for PR 21814 at commit [`5667cc5`](https://github.com/apache/spark/commit/5667cc57022d840aecb6c7d0c967e2a3448a4928).
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21814
**[Test build #93264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93264/testReport)** for PR 21814 at commit [`24f69e4`](https://github.com/apache/spark/commit/24f69e4de08f5a65ffc163155074ac5c26ab8809).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21814
**[Test build #93264 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93264/testReport)** for PR 21814 at commit [`24f69e4`](https://github.com/apache/spark/commit/24f69e4de08f5a65ffc163155074ac5c26ab8809).
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Build finished. Test FAILed.
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21814
**[Test build #93275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93275/testReport)** for PR 21814 at commit [`f3e863b`](https://github.com/apache/spark/commit/f3e863b7e11d50f0dedde9ef5291c1a08a2654fd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203638199
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
// *only* if the file was created by something other than "parquet-mr", so check the actual
// writer here for this file. We have to do this per-file, as each file in the table may
// have different writers.
- def isCreatedByParquetMr(): Boolean = {
- val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
- footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
- }
+ val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --
Hm? `isCreatedByParquetMr` will be evaluated here. We should make `isCreatedByParquetMr` lazy too.
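To illustrate the point, here is a small sketch (hypothetical names, no Spark or parquet-mr involved) showing that a strict `val` defined in terms of a `lazy val` forces it at the definition site, while a second `lazy val` keeps the read deferred.

```scala
object ForcedLazySketch {
  // Counts simulated footer reads.
  var footerReads = 0

  // Simulates reading the footer's "created by" string.
  def readCreatedBy(): String = {
    footerReads += 1
    "parquet-mr version 1.10.0"
  }

  // Returns the number of reads that happened just by defining the values.
  def strictDependent(): Int = {
    footerReads = 0
    lazy val createdBy = readCreatedBy()
    // Strict val: evaluated right here, which forces `createdBy`.
    val isCreatedByParquetMr = createdBy.startsWith("parquet-mr")
    footerReads
  }

  def lazyDependent(): Int = {
    footerReads = 0
    lazy val createdBy = readCreatedBy()
    // Lazy val: nothing is evaluated until it is first used.
    lazy val isCreatedByParquetMr = createdBy.startsWith("parquet-mr")
    footerReads
  }
}
```

In this sketch `strictDependent()` triggers one simulated read and `lazyDependent()` triggers none, since the dependent value is never used.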
---
[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21814
Merged build finished. Test PASSed.
---