Posted to reviews@spark.apache.org by gengliangwang <gi...@git.apache.org> on 2018/07/19 06:54:34 UTC

[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...

GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/21814

    [SPARK-24858][SQL] Avoid unnecessary parquet footer reads

    ## What changes were proposed in this pull request?
    
    Currently, the same Parquet footer is read twice in `buildReaderWithPartitionValues` of `ParquetFileFormat` when filter push-down is enabled.
    
    This patch reads the footer once and reuses its file metadata.
    
    ## How was this patch tested?
    
    Unit test
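
The idea in the description can be sketched roughly as follows. This is a minimal illustration, not the Spark code: `FileMetaData`, `FooterSource`, and `FooterReadSketch` are hypothetical stand-ins (they are not the parquet-mr classes), and the `schema` field merely stands in for whatever the filter push-down path needs from the footer.

```scala
// Hypothetical stand-ins for illustration; these are NOT the parquet-mr classes.
final case class FileMetaData(createdBy: String, schema: String)

// Simulates a file whose footer can be read; counts how many reads happen.
final class FooterSource(meta: FileMetaData) {
  var reads: Int = 0
  def readFooter(): FileMetaData = { reads += 1; meta }
}

object FooterReadSketch {
  // Before: the push-down path and the writer check each read the footer.
  def readTwice(src: FooterSource): (String, Boolean) = {
    val schema = src.readFooter().schema
    val createdByParquetMr = src.readFooter().createdBy.startsWith("parquet-mr")
    (schema, createdByParquetMr)
  }

  // After: read the footer once and reuse its metadata for both purposes.
  def readOnce(src: FooterSource): (String, Boolean) = {
    val footerFileMetaData = src.readFooter()
    (footerFileMetaData.schema, footerFileMetaData.createdBy.startsWith("parquet-mr"))
  }
}
```

Both methods return the same pair; only the number of footer reads recorded on the `FooterSource` differs.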

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark parquetFooter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21814.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21814
    
----
commit 5667cc57022d840aecb6c7d0c967e2a3448a4928
Author: Gengliang Wang <ge...@...>
Date:   2018-07-19T06:43:45Z

    Avoid unnecessary parquet footer reading

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Merged to master.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93264/
    Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1113/
    Test PASSed.


---



[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21814#discussion_r203621872
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -384,12 +385,10 @@ class ParquetFileFormat
           // *only* if the file was created by something other than "parquet-mr", so check the actual
           // writer here for this file.  We have to do this per-file, as each file in the table may
           // have different writers.
    -      def isCreatedByParquetMr(): Boolean = {
    -        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
    -        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
    -      }
    +      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
    --- End diff --
    
    Yes, I just pushed a change to make it a lazy value.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Build finished. Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1124/
    Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1115/
    Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    **[Test build #93263 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93263/testReport)** for PR 21814 at commit [`5667cc5`](https://github.com/apache/spark/commit/5667cc57022d840aecb6c7d0c967e2a3448a4928).
     * This patch **fails due to an unknown error code, -9**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21814


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93275/
    Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    **[Test build #93265 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93265/testReport)** for PR 21814 at commit [`c215b46`](https://github.com/apache/spark/commit/c215b466ae5cb7af30bdffcf739b554d4a9f1844).


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    **[Test build #93265 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93265/testReport)** for PR 21814 at commit [`c215b46`](https://github.com/apache/spark/commit/c215b466ae5cb7af30bdffcf739b554d4a9f1844).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1114/
    Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93265/
    Test PASSed.


---



[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21814#discussion_r203622451
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -384,12 +385,10 @@ class ParquetFileFormat
           // *only* if the file was created by something other than "parquet-mr", so check the actual
           // writer here for this file.  We have to do this per-file, as each file in the table may
           // have different writers.
    -      def isCreatedByParquetMr(): Boolean = {
    -        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
    -        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
    -      }
    +      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
    --- End diff --
    
    I think making `footerFileMetaData` a lazy value is better.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93263/
    Test FAILed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    **[Test build #93275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93275/testReport)** for PR 21814 at commit [`f3e863b`](https://github.com/apache/spark/commit/f3e863b7e11d50f0dedde9ef5291c1a08a2654fd).


---



[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21814#discussion_r203638620
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -384,12 +385,10 @@ class ParquetFileFormat
           // *only* if the file was created by something other than "parquet-mr", so check the actual
           // writer here for this file.  We have to do this per-file, as each file in the table may
           // have different writers.
    -      def isCreatedByParquetMr(): Boolean = {
    -        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
    -        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
    -      }
    +      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
    --- End diff --
    
    Also, let's leave a comment here saying that we avoid reading the footer through short-circuit evaluation. I have already seen several people confused by this.


---



[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21814#discussion_r203621554
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -384,12 +385,10 @@ class ParquetFileFormat
           // *only* if the file was created by something other than "parquet-mr", so check the actual
           // writer here for this file.  We have to do this per-file, as each file in the table may
           // have different writers.
    -      def isCreatedByParquetMr(): Boolean = {
    -        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
    -        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
    -      }
    +      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
    --- End diff --
    
    Would `lazy` be better? `timestampConversion` defaults to false.
    ```scala
    lazy val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
    ```
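
To illustrate why `lazy` matters when the flag is off, here is a small self-contained sketch. It assumes the check is guarded by the flag with `&&`, which the thread implies but does not show; `LazyValSketch`, `createdBy`, `needsConversion`, and the `footerLookups` counter are hypothetical names used only for this illustration.

```scala
object LazyValSketch {
  // Counts how often the (simulated) footer metadata is consulted.
  var footerLookups: Int = 0

  // Hypothetical stand-in for footerFileMetaData.getCreatedBy().
  def createdBy(): String = { footerLookups += 1; "parquet-mr version 1.10.0" }

  // Assumed shape of the guard: conversion is needed only when the flag is on
  // and the file was not written by parquet-mr.
  def needsConversion(timestampConversion: Boolean, useLazy: Boolean): Boolean = {
    if (useLazy) {
      // Lazy: the right-hand side runs only if `&&` actually reaches it.
      lazy val isCreatedByParquetMr = createdBy().startsWith("parquet-mr")
      timestampConversion && !isCreatedByParquetMr
    } else {
      // Strict: the lookup happens here, before the flag is even checked.
      val isCreatedByParquetMr = createdBy().startsWith("parquet-mr")
      timestampConversion && !isCreatedByParquetMr
    }
  }
}
```

With `timestampConversion = false`, the lazy variant never calls `createdBy()`, while the strict variant calls it once even though the result is unused.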


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    **[Test build #93263 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93263/testReport)** for PR 21814 at commit [`5667cc5`](https://github.com/apache/spark/commit/5667cc57022d840aecb6c7d0c967e2a3448a4928).


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    **[Test build #93264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93264/testReport)** for PR 21814 at commit [`24f69e4`](https://github.com/apache/spark/commit/24f69e4de08f5a65ffc163155074ac5c26ab8809).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    **[Test build #93264 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93264/testReport)** for PR 21814 at commit [`24f69e4`](https://github.com/apache/spark/commit/24f69e4de08f5a65ffc163155074ac5c26ab8809).


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Build finished. Test FAILed.


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    **[Test build #93275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93275/testReport)** for PR 21814 at commit [`f3e863b`](https://github.com/apache/spark/commit/f3e863b7e11d50f0dedde9ef5291c1a08a2654fd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21814#discussion_r203638199
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -384,12 +385,10 @@ class ParquetFileFormat
           // *only* if the file was created by something other than "parquet-mr", so check the actual
           // writer here for this file.  We have to do this per-file, as each file in the table may
           // have different writers.
    -      def isCreatedByParquetMr(): Boolean = {
    -        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
    -        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
    -      }
    +      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
    --- End diff --
    
    Hm? `isCreatedByParquetMr` will still be evaluated here, which forces `footerFileMetaData` even if it is lazy. We should make `isCreatedByParquetMr` lazy too.
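
The concern can be reproduced in isolation with a small sketch: a strict `val` that depends on a `lazy val` forces it as soon as the enclosing object is initialized. The classes `StrictCheck` and `LazyCheck`, the `createdBy` field, and the `footerReads` counter are hypothetical and only simulate a footer read.

```scala
// The strict `val` below is initialized during construction, which forces
// the lazy `createdBy` immediately, so the simulated footer read always happens.
class StrictCheck {
  var footerReads: Int = 0
  lazy val createdBy: String = { footerReads += 1; "parquet-mr version 1.10.0" }
  val isCreatedByParquetMr: Boolean = createdBy.startsWith("parquet-mr")
}

// With both values lazy, nothing is read until the check is first used,
// and the result is cached afterwards.
class LazyCheck {
  var footerReads: Int = 0
  lazy val createdBy: String = { footerReads += 1; "parquet-mr version 1.10.0" }
  lazy val isCreatedByParquetMr: Boolean = createdBy.startsWith("parquet-mr")
}
```

Constructing `StrictCheck` already records one read; constructing `LazyCheck` records none until `isCreatedByParquetMr` is first accessed.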


---



[GitHub] spark issue #21814: [SPARK-24858][SQL] Avoid unnecessary parquet footer read...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21814
  
    Merged build finished. Test PASSed.


---
