Posted to reviews@spark.apache.org by andrewor14 <gi...@git.apache.org> on 2016/06/24 21:32:10 UTC

[GitHub] spark pull request #13899: [SPARK-16196][SQL] Codegen caching + store rows a...

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/13899

    [SPARK-16196][SQL] Codegen caching + store rows as ColumnarBatches

    ## What changes were proposed in this pull request?
    
    This patch makes `InMemoryRelation` faster by generating code to store the input rows as `ColumnarBatches`. This code path is enabled by default but only supports primitive types, falling back to the old, slower code path if there are unsupported types (e.g. strings, arrays, UDTs) in the schema.
    
    The old code path reads the input rows into `ColumnBuilder`s, which is slow because these builders are backed by `ByteBuffer`s and incur many virtual function calls, especially when compression is enabled.
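
To make the cost difference concrete, here is a minimal Java sketch. It is not Spark's code: `LongAppender`, `ByteBufferLongColumn`, `ArrayLongColumn`, and `ColumnStorageSketch` are hypothetical names, and the sketch assumes the columnar layout can be modelled as a flat primitive array. Compression and nulls are left out.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a builder-style column API (not Spark's ColumnBuilder).
interface LongAppender {
    void append(long value);
    long get(int index);
}

// Builder-style path: every value goes through an interface call into a ByteBuffer.
final class ByteBufferLongColumn implements LongAppender {
    private final ByteBuffer buffer;

    ByteBufferLongColumn(int capacity) {
        this.buffer = ByteBuffer.allocate(capacity * Long.BYTES);
    }

    @Override
    public void append(long value) {
        buffer.putLong(value); // relative put, advances the buffer position
    }

    @Override
    public long get(int index) {
        return buffer.getLong(index * Long.BYTES); // absolute read by byte offset
    }
}

// Columnar path (assumed layout): a flat primitive array indexed directly.
final class ArrayLongColumn {
    final long[] values;

    ArrayLongColumn(int capacity) {
        this.values = new long[capacity];
    }
}

final class ColumnStorageSketch {
    // Sums a column through the interface: one interface call per value.
    static long sumViaBuilder(LongAppender column, int numRows) {
        long sum = 0;
        for (int i = 0; i < numRows; i++) {
            sum += column.get(i);
        }
        return sum;
    }

    // Sums a column by indexing the primitive array directly.
    static long sumViaArray(ArrayLongColumn column, int numRows) {
        long sum = 0;
        for (int i = 0; i < numRows; i++) {
            sum += column.values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int numRows = 1000;
        ByteBufferLongColumn viaBuffer = new ByteBufferLongColumn(numRows);
        ArrayLongColumn viaArray = new ArrayLongColumn(numRows);
        for (int i = 0; i < numRows; i++) {
            viaBuffer.append(i);
            viaArray.values[i] = i;
        }
        System.out.println(sumViaBuilder(viaBuffer, numRows));
        System.out.println(sumViaArray(viaArray, numRows));
    }
}
```

Both methods compute the same sum; the point is only that the second loop has no per-value interface dispatch, which is the kind of code a generated scan can emit for primitive columns.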
    
    The following numbers are from the read path (i.e. reading cached batches from memory). The first row is the baseline, with caching disabled. The second and third rows show cached read performance before this patch, without and with compression. The last row shows cached read performance after this patch.
    ```
    Cache random keys:                       Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
    -----------------------------------------------------------------------------------------------
    cache = F                                      890 /  920        47.1          21.2       1.0X
    cache = T columnar_batches = F compress = F   1950 / 1978        21.5          46.5       0.5X
    cache = T columnar_batches = F compress = T   1893 / 1927        22.2          45.1       0.5X
    cache = T columnar_batches = T                 540 /  544        77.7          12.9       1.6X
    ```
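
The columns of this table appear to be related by simple arithmetic on the best time. The Java sketch below reproduces them under one assumption that the PR does not state: a row count of `40 << 20` (about 41.9 million), inferred here because it matches the reported rates. `CacheBenchmarkMath` and its methods are hypothetical names used only for this illustration.

```java
final class CacheBenchmarkMath {
    // Assumption: row count inferred from the table, not stated in the PR.
    static final long ASSUMED_ROWS = 40L << 20;

    // Rate in millions of rows per second, from the best time in milliseconds.
    static double rateMillionsPerSec(double bestMs) {
        return ASSUMED_ROWS / (bestMs / 1000.0) / 1e6;
    }

    // Time per row in nanoseconds, from the best time in milliseconds.
    static double perRowNanos(double bestMs) {
        return bestMs * 1e6 / ASSUMED_ROWS;
    }

    // Speed relative to the baseline: values above 1.0 are faster than the baseline.
    static double relative(double baselineBestMs, double bestMs) {
        return baselineBestMs / bestMs;
    }

    public static void main(String[] args) {
        double[] bestMs = {890, 1950, 1893, 540}; // best times from the table
        for (double ms : bestMs) {
            System.out.printf("%6.0f ms  %5.1f M/s  %5.1f ns/row  %.1fX%n",
                ms, rateMillionsPerSec(ms), perRowNanos(ms), relative(bestMs[0], ms));
        }
    }
}
```

With the assumed row count, the computed values round to the figures in the table. For example, 890 / 540 is about 1.65, which is the 1.6X reported for the last row.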
    
    ## How was this patch tested?
    
    `CacheBenchmark`, `InMemoryColumnarQuerySuite`, existing tests
    
    ## Generated code
    
    (Will be posted shortly)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark speedup-cache

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13899.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13899
    
----
commit be1ae40a6a1c1097909006570f7ce0fa42097128
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-17T20:44:18Z

    Move it

commit bf11d278cb6420c10d4f748e2b19cead4a3f6391
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-17T21:47:20Z

    Add benchmark code

commit 82499c37f8a2a539e97febc05f3f416411dc0985
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-20T18:35:21Z

    backup

commit 2f12e96f3d23d49587e15364861dbe34bdfc8972
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-20T22:36:44Z

    Narrow benchmarked code + add back old scan code

commit 6da1e71be250fd4ddfe5cbca076ede3b78d67d0e
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-21T18:28:19Z

    Fix benchmark to time only the read path

commit fdf321e3c6d9c057193620bfba8fbc97a01e8513
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-21T21:35:01Z

    First working impl. of ColumnarBatch based caching
    
    Note, this doesn't work: spark.table("tab1").collect(), because
    we're trying to cast ColumnarBatch.Row into UnsafeRow. This works,
    however: spark.table("tab1").groupBy("i").sum("j").collect().

commit d0d2661f47d351dab0627fde44e192e144e661a6
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-22T00:45:34Z

    Always enable codegen and vectorized hashmap

commit 570d0c3470bfcd095c4a0389cd05c1a2c764bd25
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-22T19:48:37Z

    Don't benchmark aggregate

commit 3e96f4efbe17a1f7f6047d937379401daa6f252c
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-22T21:57:56Z

    Codegen memory scan using ColumnarBatches

commit 5726d11adb202136f827133ce8f9a3ab595a17f0
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-22T23:10:41Z

    Clean up the code a little

commit d255eb02f0188da630f17e8d1af711297cf03e7d
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-22T23:15:45Z

    Merge branch 'master' of github.com:apache/spark into speedup-cache

commit f4f81826b5facb83e1ab6cd0988d056feedc5d54
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-22T23:19:42Z

    Clean up a little more

commit 41d52b75fa39d09adb40a792cef4e2ffe2e0851f
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-23T22:10:41Z

    Generate code for write path to support other types
    
    Previously we could only support schemas where all columns were
    Longs because we hardcoded putLong and getLong calls in the write
    path. This led to unfathomable NPEs if we tried to cache something
    with other types.
    
    This commit fixes this by generalizing the code to build column
    batches.
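
The generalization described in this commit message can be pictured as choosing the put/get pair per column type instead of hardcoding `putLong` and `getLong`. The Java sketch below is an illustration, not the code from this PR: `WritePathCodegenSketch`, the emitted names `columns`, `row`, and `rowId`, and the exact set of primitive types are all assumptions. It returns an empty result for unsupported types, modelling a fallback to the old path.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class WritePathCodegenSketch {
    // Assumption: which primitive types are covered; the PR only says "primitive types".
    private static final Map<String, String> ACCESSOR_SUFFIX = new HashMap<>();
    static {
        ACCESSOR_SUFFIX.put("boolean", "Boolean");
        ACCESSOR_SUFFIX.put("byte", "Byte");
        ACCESSOR_SUFFIX.put("short", "Short");
        ACCESSOR_SUFFIX.put("int", "Int");
        ACCESSOR_SUFFIX.put("long", "Long");
        ACCESSOR_SUFFIX.put("float", "Float");
        ACCESSOR_SUFFIX.put("double", "Double");
    }

    // Emits one put/get statement per column, or empty if any type is unsupported.
    static Optional<String> generateWriteCode(List<String> columnTypes) {
        StringBuilder code = new StringBuilder();
        for (int i = 0; i < columnTypes.size(); i++) {
            String suffix = ACCESSOR_SUFFIX.get(columnTypes.get(i));
            if (suffix == null) {
                return Optional.empty(); // caller would fall back to the old code path
            }
            code.append("columns[").append(i).append("].put").append(suffix)
                .append("(rowId, row.get").append(suffix).append("(").append(i).append("));\n");
        }
        return Optional.of(code.toString());
    }

    public static void main(String[] args) {
        System.out.print(generateWriteCode(List.of("long", "double")).orElse("fallback\n"));
        System.out.print(generateWriteCode(List.of("long", "string")).orElse("fallback\n"));
    }
}
```

For a schema of `long` and `double`, the sketch emits a `putLong`/`getLong` line and a `putDouble`/`getDouble` line; a schema containing `string` yields the fallback.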

commit b6618d77e924dd49c9dcd2e31bbb24a3d8fa5d14
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-23T22:15:29Z

    Merge branch 'master' of github.com:apache/spark into speedup-cache

commit 06bbfdbf040e509b88e8462c80bb566e0ac314c8
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-23T23:01:44Z

    Move cache benchmark to new file

commit 1a12d06e4e3f71cd21229d9adc766d5643dfdfa3
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-23T23:43:13Z

    Abstract codegen code into ColumnarBatchScan

commit 8cdbdd0c729936d731e531ee10c2ba4e72ceec57
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-24T00:04:09Z

    Introduce CACHE_CODEGEN config to reduce dup code

commit faa6776b92a8ca5281699df3af1f1fc59aa786e8
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-24T00:34:57Z

    Add some tests for InMemoryRelation

commit 2ba6b1e2f79a1c41b56e51a7d3a01b06417f03dd
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-24T00:40:11Z

    Add some tests for InMemoryRelation

commit 7f09753a5df4465d1e4f0d57d06b53b4637f7470
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-24T00:44:21Z

    Fix InMemoryColumnarQuerySuite

commit c72c085b32179113e546fb0251032e95106b2cd3
Author: Andrew Or <an...@databricks.com>
Date:   2016-06-24T19:00:37Z

    Clean up code: abstract CachedBatch and ColumnarBatch

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    **[Test build #61203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61203/consoleFull)** for PR 13899 at commit [`c72c085`](https://github.com/apache/spark/commit/c72c085b32179113e546fb0251032e95106b2cd3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    **[Test build #64644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64644/consoleFull)** for PR 13899 at commit [`0125aa2`](https://github.com/apache/spark/commit/0125aa2f24ee6ffc227a8df83917d25a2f9eb273).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    jenkins retest this please


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    **[Test build #61206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61206/consoleFull)** for PR 13899 at commit [`0125aa2`](https://github.com/apache/spark/commit/0125aa2f24ee6ffc227a8df83917d25a2f9eb273).


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    **[Test build #61206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61206/consoleFull)** for PR 13899 at commit [`0125aa2`](https://github.com/apache/spark/commit/0125aa2f24ee6ffc227a8df83917d25a2f9eb273).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    **[Test build #3237 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3237/consoleFull)** for PR 13899 at commit [`0125aa2`](https://github.com/apache/spark/commit/0125aa2f24ee6ffc227a8df83917d25a2f9eb273).


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61203/
    Test FAILed.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    Build finished. Test FAILed.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    Closing for now; too many conflicts.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    **[Test build #61203 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61203/consoleFull)** for PR 13899 at commit [`c72c085`](https://github.com/apache/spark/commit/c72c085b32179113e546fb0251032e95106b2cd3).


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request #13899: [SPARK-16196][SQL] Codegen in-memory scan with Co...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 closed the pull request at:

    https://github.com/apache/spark/pull/13899


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    **[Test build #64644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64644/consoleFull)** for PR 13899 at commit [`0125aa2`](https://github.com/apache/spark/commit/0125aa2f24ee6ffc227a8df83917d25a2f9eb273).


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61206/
    Test FAILed.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    @rxin @sameeragarwal @ooq


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    @andrewor14 Looks interesting.
    
    I created two PRs that generate code similar to [your code](https://gist.github.com/andrewor14/7ce4c37a3c6bcd5cc2b6b16c861859e9). My PRs use the current `ByteBuffer` and support compression for primitive types. Would these PRs help you?
    https://github.com/apache/spark/pull/11956
    https://github.com/apache/spark/pull/12894
    
    I am waiting for review.



[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64644/
    Test FAILed.


[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13899
  
    **[Test build #3237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3237/consoleFull)** for PR 13899 at commit [`0125aa2`](https://github.com/apache/spark/commit/0125aa2f24ee6ffc227a8df83917d25a2f9eb273).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.

