Posted to reviews@spark.apache.org by bersprockets <gi...@git.apache.org> on 2018/04/11 17:28:30 UTC

[GitHub] spark pull request #21043: [SPARK-23963] [SQL] Properly handle large number ...

GitHub user bersprockets opened a pull request:

    https://github.com/apache/spark/pull/21043

    [SPARK-23963] [SQL] Properly handle large number of columns in query on text-based Hive table

    ## What changes were proposed in this pull request?
    
    TableReader would get disproportionately slower as the number of columns in the query increased.
    
    I fixed the way TableReader looks up metadata for each column in the row. Previously, it looked up this data in linked lists, accessing each linked list by index (column number), so every access cost time proportional to the index. Now it looks up this data in arrays, where indexing by column number takes constant time.
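
To make the cost difference concrete, here is a minimal Scala sketch. It is not the TableReader code: `ColumnMeta`, `buildMeta`, and the other names are hypothetical stand-ins, and the only thing it relies on is that indexing a Scala `List` walks the list from its head while indexing an `Array` does not.

```scala
object IndexedLookupSketch {
  // Hypothetical stand-in for whatever per-column metadata a reader keeps.
  final case class ColumnMeta(ordinal: Int, name: String)

  // Build metadata for numCols columns as an immutable linked list.
  def buildMeta(numCols: Int): List[ColumnMeta] =
    List.tabulate(numCols)(i => ColumnMeta(i, s"c$i"))

  // Per-row loop that indexes a linked list by column number.
  def sumOrdinalsViaList(meta: List[ColumnMeta]): Long = {
    var total = 0L
    var i = 0
    val n = meta.length
    while (i < n) {
      total += meta(i).ordinal // List.apply(i) walks i nodes from the head
      i += 1
    }
    total
  }

  // Same loop over an array: each lookup is a direct offset.
  def sumOrdinalsViaArray(meta: Array[ColumnMeta]): Long = {
    var total = 0L
    var i = 0
    while (i < meta.length) {
      total += meta(i).ordinal
      i += 1
    }
    total
  }

  // Nodes walked per row by the list version: 0 + 1 + ... + (numCols - 1).
  def listNodeStepsPerRow(numCols: Int): Long =
    numCols.toLong * (numCols - 1) / 2

  def main(args: Array[String]): Unit = {
    val meta = buildMeta(6000)
    println(sumOrdinalsViaList(meta) == sumOrdinalsViaArray(meta.toArray))
    println(listNodeStepsPerRow(6000))
  }
}
```

Both loops compute the same result, but the list version's work per row grows with the square of the column count, while the array version's grows linearly.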
    
    ## How was this patch tested?
    
    All sbt unit tests
    python sql tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark tabreadfix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21043.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21043
    
----
commit 26715a8110be1d72f18604dfd4ae74a5a56d9878
Author: Bruce Robbins <be...@...>
Date:   2018-04-11T05:05:12Z

    Initial commit for testing

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    **[Test build #89212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89212/testReport)** for PR 21043 at commit [`26715a8`](https://github.com/apache/spark/commit/26715a8110be1d72f18604dfd4ae74a5a56d9878).


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    **[Test build #89212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89212/testReport)** for PR 21043 at commit [`26715a8`](https://github.com/apache/spark/commit/26715a8110be1d72f18604dfd4ae74a5a56d9878).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89212/


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    @gatorsmile 
    
    On my laptop, running
    <pre>
    import org.apache.spark.sql.SaveMode
    spark.sql("select * from hive_table").write.mode(SaveMode.Overwrite).csv("outputfile.csv")
    </pre>
    Input | master<br>runtime | branch<br>runtime
    --- | --- | ---
    6000 cols, 150k rows | 59 minutes | 2.6 minutes
    3000 cols, 150k rows | 13.6 minutes | 1.2 minutes
    20 cols, 150k rows | 7.6 seconds | 7.7 seconds
    20 cols, 1m rows | 10 seconds | 8.6 seconds
    
    The branch runtimes grow roughly in proportion to the number of columns. The branch is much faster than master for a large number of columns and about the same for a small number of columns.
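
As a quick check on the scaling in the table, the following Scala sketch computes how much each runtime grows when the column count doubles from 3000 to 6000. The object and method names are hypothetical; the numbers are copied from the table.

```scala
object RuntimeScalingSketch {
  // Runtimes in minutes for 150k rows, keyed by column count (from the table).
  val masterMinutes: Map[Int, Double] = Map(3000 -> 13.6, 6000 -> 59.0)
  val branchMinutes: Map[Int, Double] = Map(3000 -> 1.2, 6000 -> 2.6)

  // Growth factor when the column count doubles from 3000 to 6000.
  def growth(runtimes: Map[Int, Double]): Double =
    runtimes(6000) / runtimes(3000)

  def main(args: Array[String]): Unit = {
    println(f"master growth: ${growth(masterMinutes)}%.1fx")
    println(f"branch growth: ${growth(branchMinutes)}%.1fx")
  }
}
```

Doubling the columns multiplies the master runtime by about 4.3 and the branch runtime by about 2.2, which is what quadratic and linear growth, respectively, would look like.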
    
    



---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    This sounds reasonable. This is an internal change with a very low risk. 


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by Tagar <gi...@git.apache.org>.
Github user Tagar commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    @gatorsmile - thanks a lot!


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    Could you show some perf numbers in the PR?


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21043: [SPARK-23963] [SQL] Properly handle large number ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21043


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    @Tagar Merged to both 2.3 and 2.2


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by Tagar <gi...@git.apache.org>.
Github user Tagar commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    @gatorsmile could you please backport this to the Spark 2.2 branch as well?
    This PR gives a 24x improvement on 6000 columns, as @bersprockets discovered, so I think this 1-line change can be applied fairly safely to Spark 2.2 too. We see the same performance degradation on wide dataframes in Spark 2.2. Thanks both!



---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    LGTM but let's see if someone more familiar with this code has comments.


---



[GitHub] spark issue #21043: [SPARK-23963] [SQL] Properly handle large number of colu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21043
  
    Thanks! Merged to master.


---
