Posted to reviews@spark.apache.org by uzadude <gi...@git.apache.org> on 2017/11/07 11:53:43 UTC

[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...

GitHub user uzadude opened a pull request:

    https://github.com/apache/spark/pull/19683

    [SPARK-21657][SQL] optimize explode quadratic memory consumption

    ## What changes were proposed in this pull request?
    
    The issue has been raised in two Jira tickets: [SPARK-21657](https://issues.apache.org/jira/browse/SPARK-21657), [SPARK-16998](https://issues.apache.org/jira/browse/SPARK-16998). Collection generators such as explode/inline create many output rows from each input row, and currently each generated row also carries the column it was generated from. For example, a row holding a 10k-element array has that array copied 10k times, once into each output row, which results in quadratic memory consumption. However, the original column is commonly projected out after the explode, so the duplication can be avoided.
    In this solution we propose to identify this situation in the optimizer and turn on a flag that omits the original column during generation.
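
As a rough illustration of the growth described above, here is a toy model in plain Python. It is not Spark's Generate operator or the code in this patch: lists stand in for rows, and the names `explode_keep_source`, `explode_drop_source`, and `cells` are made up for this sketch. It only counts how many values end up stored when each output row keeps a copy of the source array versus when that column is omitted.

```python
# Toy model only: plain Python lists stand in for rows.
# This is an assumption-laden sketch, not Spark's implementation.

def explode_keep_source(rows):
    """Emit one (array_copy, element) row per element, so every output row carries the array."""
    out = []
    for arr in rows:
        for elem in arr:
            out.append((list(arr), elem))  # each output row holds its own copy of the array
    return out


def explode_drop_source(rows):
    """Emit one (element,) row per element, omitting the source array column."""
    out = []
    for arr in rows:
        for elem in arr:
            out.append((elem,))
    return out


def cells(exploded):
    """Count the scalar values stored across all output rows."""
    total = 0
    for row in exploded:
        for field in row:
            total += len(field) if isinstance(field, list) else 1
    return total


n = 1000
rows = [list(range(n))]  # a single input row holding an n-element array

kept = cells(explode_keep_source(rows))     # n rows, each with n + 1 values
dropped = cells(explode_drop_source(rows))  # n rows, each with 1 value
print(kept, dropped)
```

In this model the first variant stores n * (n + 1) values for a single n-element array, while the second stores n, which is the gap the proposed flag is meant to close when the original column is projected out anyway.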
    
    ## How was this patch tested?
    
    1. We added a benchmark test to MiscBenchmark that shows a 16x improvement in runtime.
    2. We ran some of the other tests in MiscBenchmark and they show a 15% improvement.
    3. We ran this code on a specific case from our production data, with rows containing arrays of size ~200k, and it reduced the runtime from 6 hours to 3 minutes.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uzadude/spark optimize_explode

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19683.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19683
    
----
commit ce7c3694a99584348957dc756234bb667466be4e
Author: oraviv <or...@paypal.com>
Date:   2017-11-07T11:34:21Z

    [SPARK-21657][SQL] optimize explode quadratic memory consumption

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19683#discussion_r158436824
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/MiscBenchmark.scala ---
    @@ -227,4 +227,31 @@ class MiscBenchmark extends BenchmarkBase {
         generate stack wholestage on                   836 /  847         20.1          49.8      15.5X
          */
       }
    +
    +  ignore("generate explode big struct array") {
    +    val BASE = 1234567890
    --- End diff --
    
    Why is this needed?


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    I can reproduce the test failure locally.


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    **[Test build #85503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85503/testReport)** for PR 19683 at commit [`1c6626a`](https://github.com/apache/spark/commit/1c6626acad080404a73519735bc1b3a0fbf6e303).


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    **[Test build #85462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85462/testReport)** for PR 19683 at commit [`9edd864`](https://github.com/apache/spark/commit/9edd864e6bd3bc1fce0f6c4d2b45620addb82514).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85382/


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    **[Test build #85502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85502/testReport)** for PR 19683 at commit [`1c6626a`](https://github.com/apache/spark/commit/1c6626acad080404a73519735bc1b3a0fbf6e303).


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by uzadude <gi...@git.apache.org>.

Github user uzadude commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    This timeout in "org.apache.spark.ml.regression.LinearRegressionSuite.linear regression with intercept without regularization" doesn't seem related to our fix.


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    To re-run the tests, simply leave a comment "retest this please".


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    ok to test


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19683
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83566/


---
