You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by bersprockets <gi...@git.apache.org> on 2018/08/22 18:29:10 UTC

[GitHub] spark pull request #22188: [SPARK-25164][SQL] Avoid rebuilding column and pa...

GitHub user bersprockets opened a pull request:

    https://github.com/apache/spark/pull/22188

    [SPARK-25164][SQL] Avoid rebuilding column and path list for each column in parquet reader

    ## What changes were proposed in this pull request?
    
    VectorizedParquetRecordReader::initializeInternal rebuilds the column list and path list once for each column. Therefore, it indirectly iterates 2\*colCount\*colCount times for each parquet file.
    
    This inefficiency impacts jobs that read parquet-backed tables with many columns and many files. Jobs that read tables with few columns or few files are not impacted.
    
    This PR changes initializeInternal so that it builds each list only once.
    
    I ran benchmarks on my laptop with 1 worker thread, running this query:
    <pre>
    sql("select * from parquet_backed_table where id1 = 1").collect
    </pre>
    There are roughly one matching row for every 425 rows, and the matching rows are sprinkled pretty evenly throughout the table (that is, every page for column <code>id1</code> has at least one matching row).
    
    6000 columns, 1 million rows, 67 32M files:
    
    master | branch | improvement
    -------|---------|-----------
    10.87 min | 6.09 min | 44%
    
    6000 columns, 1 million rows, 23 98m files:
    
    master | branch | improvement
    -------|---------|-----------
    7.39 min | 5.80 min | 21%
    
    600 columns 10 million rows, 67 32M files:
    
    master | branch | improvement
    -------|---------|-----------
    1.95 min | 1.96 min | -0.5%
    
    60 columns, 100 million rows, 67 32M files:
    
    master | branch | improvement
    -------|---------|-----------
    0.55 min | 0.55 min | 0%
    
    ## How was this patch tested?
    
    - sql unit tests
    - pyspark-sql tests
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark SPARK-25164

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22188
    
----
commit 697de21501acbda3dbcd8ccc13a35ad3723a652e
Author: Bruce Robbins <be...@...>
Date:   2018-08-22T02:00:28Z

    Initial commit

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    **[Test build #95118 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95118/testReport)** for PR 22188 at commit [`697de21`](https://github.com/apache/spark/commit/697de21501acbda3dbcd8ccc13a35ad3723a652e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    That does seem counter intuitive, but no idea what could explain that since the new code seems like a straight-forward better version.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22188: [SPARK-25164][SQL] Avoid rebuilding column and pa...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22188


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    Normally, we do not backport such improvement PRs. However, the risk of this PR is pretty small. I think it is fine. Let me do this. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    @bersprockets The risk is pretty small I think. I am fine to backport it to the previous versions. Why 2.2 only?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    **[Test build #95118 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95118/testReport)** for PR 22188 at commit [`697de21`](https://github.com/apache/spark/commit/697de21501acbda3dbcd8ccc13a35ad3723a652e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    LGTM. Will leave here for a bit to see if anyone else comments...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95118/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by bersprockets <gi...@git.apache.org>.

Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    @gatorsmile 
    >Why 2.2 only?
    
    Only that I forgot that master is already on 2.4. We should do 2.3 as well, but I haven't tested it yet.
    
    Do I need to do anything on my end to get it into 2.2, and once I test, into 2.3?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by bersprockets <gi...@git.apache.org>.

Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    OK, I reran the tests for the lower column count cases, and the runs with the patch consistently show a tiny (1-3%) improvement compared to the master branch. So even the lower column count cases benefit a little.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by bersprockets <gi...@git.apache.org>.

Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    @gatorsmile Thanks much!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22188: [SPARK-25164][SQL] Avoid rebuilding column and pa...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22188#discussion_r212199355
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
    @@ -270,21 +270,23 @@ public boolean nextBatch() throws IOException {
       private void initializeInternal() throws IOException, UnsupportedOperationException {
         // Check that the requested schema is supported.
         missingColumns = new boolean[requestedSchema.getFieldCount()];
    +    List<ColumnDescriptor> columns = requestedSchema.getColumns();
    +    List<String[]> paths = requestedSchema.getPaths();
    --- End diff --
    
    cc @michal-databricks @mswit-databricks 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by bersprockets <gi...@git.apache.org>.

Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    @cloud-fan @gatorsmile Should we merge this also onto 2.2? It was a clean cherry-pick for me (from master to branch-2.2), and I ran the top and bottom tests (6000 columns, 1 million rows, 67 32M files, and 60 columns, 100 million rows, 67 32M files) from the PR description and got the same results.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...

Posted by bersprockets <gi...@git.apache.org>.

Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22188
  
    Thanks @vanzin. In my benchmark tests, the tiny degradation (0.5%) in the lower column count cases is pretty consistent, which concerns me a little. I am going to re-run those tests in a different environment and see what happens.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org