Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/08/18 11:55:12 UTC

[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/18993

    [SPARK-21743][SQL][follow-up] top-most limit should not cause memory leak

    ## What changes were proposed in this pull request?
    
    This is a follow-up of https://github.com/apache/spark/pull/18955 that fixes a bug where we break whole-stage codegen for `Limit`.
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18993.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18993
    
----
commit b6d51dead6af75ee49eb59ebc48ff0a4c58353ed
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-08-18T11:53:12Z

    do not break whole stage codegen for limit

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18993#discussion_r133941933
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -1180,6 +1180,9 @@ object ConvertToLocalRelation extends Rule[LogicalPlan] {
           val projection = new InterpretedProjection(projectList, output)
           projection.initialize(0)
           LocalRelation(projectList.map(_.toAttribute), data.map(projection))
    +
    +    case Limit(IntegerLiteral(limit), LocalRelation(output, data)) =>
    --- End diff --
    
    This is to fix `SQLQuerySuite.SPARK-19650: An action on a Command should not trigger a Spark job`: a limit over a local relation should not trigger a Spark job.
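
The idea behind the added `case` can be illustrated with a simplified, self-contained model. The names below (`Plan`, `Scan`, and the pared-down `LocalRelation` and `Limit`) are hypothetical stand-ins, not Catalyst's real classes, and the body of the real rule is not shown in this diff. The sketch only shows how a limit over in-memory rows might be folded into a smaller local relation, so that nothing remains to be executed as a job.

```scala
object LimitOverLocalRelationSketch {
  // Hypothetical, simplified plan nodes; NOT Spark's Catalyst classes.
  sealed trait Plan
  final case class LocalRelation(output: Seq[String], data: Seq[Seq[Any]]) extends Plan
  final case class Scan(table: String) extends Plan
  final case class Limit(limit: Int, child: Plan) extends Plan

  // Fold a limit that sits directly on in-memory rows into a smaller
  // local relation; leave every other plan shape untouched.
  def convertToLocalRelation(plan: Plan): Plan = plan match {
    case Limit(n, LocalRelation(output, data)) =>
      LocalRelation(output, data.take(n))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val rel = LocalRelation(Seq("id"), Seq(Seq(1), Seq(2), Seq(3)))
    println(convertToLocalRelation(Limit(2, rel)))       // limit folded away
    println(convertToLocalRelation(Limit(2, Scan("t")))) // plan left unchanged
  }
}
```

In this model the first plan becomes a plain `LocalRelation` holding two rows, while the limit over `Scan` is left as it was.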



[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18993#discussion_r133972522
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -1180,6 +1180,9 @@ object ConvertToLocalRelation extends Rule[LogicalPlan] {
           val projection = new InterpretedProjection(projectList, output)
           projection.initialize(0)
           LocalRelation(projectList.map(_.toAttribute), data.map(projection))
    +
    +    case Limit(IntegerLiteral(limit), LocalRelation(output, data)) =>
    --- End diff --
    
    Yeah, you are right about that.



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80843/
    Test FAILed.



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    **[Test build #80848 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80848/testReport)** for PR 18993 at commit [`b6d51de`](https://github.com/apache/spark/commit/b6d51dead6af75ee49eb59ebc48ff0a4c58353ed).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18993#discussion_r133941776
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -63,29 +63,24 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
        */
       object SpecialLimits extends Strategy {
         override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    -      case logical.ReturnAnswer(rootPlan) => rootPlan match {
    -        case logical.Limit(IntegerLiteral(limit), logical.Sort(order, true, child)) =>
    -          execution.TakeOrderedAndProjectExec(limit, order, child.output, planLater(child)) :: Nil
    --- End diff --
    
    Somewhat unrelated: remove the `logical` and `execution` prefixes to shorten the code.



[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18993#discussion_r133941559
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
    @@ -54,14 +54,6 @@ trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
       val limit: Int
       override def output: Seq[Attribute] = child.output
     
    -  // Do not enable whole stage codegen for a single limit.
    -  override def supportCodegen: Boolean = child match {
    -    case plan: CodegenSupport => plan.supportCodegen
    -    case _ => false
    --- End diff --
    
    This is wrong: there may be more operators above `Limit`, so it is not necessarily a single `Limit`.
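
A toy model may help show why the removed override can split a stage. Everything below is hypothetical: `Op`, `fusedStages`, `limitOld`, and `limitNew` are illustrative names, the plan is reduced to a linear top-down chain, and treating the limit as always codegen-capable after the removal is an assumption of the sketch rather than something stated in the diff.

```scala
object LimitCodegenSketch {
  // Hypothetical, simplified physical operators; NOT Spark's classes.
  // A linear plan is listed top-down: the head is the root operator.
  final case class Op(name: String, supportCodegen: Boolean)

  // Split a top-down operator chain into maximal runs of codegen-capable
  // operators; each run stands for one fused stage.
  def fusedStages(plan: List[Op]): List[List[String]] = plan match {
    case Nil => Nil
    case _ =>
      val (run, rest) = plan.span(_.supportCodegen)
      val stage = if (run.nonEmpty) List(run.map(_.name)) else Nil
      stage ++ fusedStages(rest.dropWhile(!_.supportCodegen))
  }

  // Rule from the removed override: the limit supports codegen only if its child does.
  def limitOld(child: Op): Op = Op("Limit", child.supportCodegen)

  // Assumption of this sketch: without the override, the limit always supports codegen.
  val limitNew: Op = Op("Limit", supportCodegen = true)

  def main(args: Array[String]): Unit = {
    val scan = Op("RowScan", supportCodegen = false)
    val project = Op("Project", supportCodegen = true)
    println(fusedStages(List(project, limitOld(scan), scan))) // limit left out of the stage
    println(fusedStages(List(project, limitNew, scan)))       // limit fused with Project
  }
}
```

With the old rule, the limit falls outside the stage that contains `Project` even though an operator above it could be fused with it; with the override removed, both land in one stage.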



[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18993



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80848/
    Test PASSed.



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    LGTM
    
    Thanks! Merging to master.



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18993#discussion_r133971618
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -1180,6 +1180,9 @@ object ConvertToLocalRelation extends Rule[LogicalPlan] {
           val projection = new InterpretedProjection(projectList, output)
           projection.initialize(0)
           LocalRelation(projectList.map(_.toAttribute), data.map(projection))
    +
    +    case Limit(IntegerLiteral(limit), LocalRelation(output, data)) =>
    --- End diff --
    
    Technically this is not about correctness; `An action on a Command should not trigger a Spark job` is also a kind of optimization.



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    LGTM



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    cc @hvanhovell @gatorsmile 



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    LGTM pending jenkins.



[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18993#discussion_r133941801
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -63,29 +63,24 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
        */
       object SpecialLimits extends Strategy {
         override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    -      case logical.ReturnAnswer(rootPlan) => rootPlan match {
    -        case logical.Limit(IntegerLiteral(limit), logical.Sort(order, true, child)) =>
    -          execution.TakeOrderedAndProjectExec(limit, order, child.output, planLater(child)) :: Nil
    -        case logical.Limit(
    -            IntegerLiteral(limit),
    -            logical.Project(projectList, logical.Sort(order, true, child))) =>
    -          execution.TakeOrderedAndProjectExec(
    -            limit, order, projectList, planLater(child)) :: Nil
    -        case logical.Limit(IntegerLiteral(limit), child) =>
    -          // Normally wrapping child with `LocalLimitExec` here is a no-op, because
    -          // `CollectLimitExec.executeCollect` will call `LocalLimitExec.executeTake`, which
    -          // calls `child.executeTake`. If child supports whole stage codegen, adding this
    -          // `LocalLimitExec` can stop the processing of whole stage codegen and trigger the
    -          // resource releasing work, after we consume `limit` rows.
    -          execution.CollectLimitExec(limit, LocalLimitExec(limit, planLater(child))) :: Nil
    +      case ReturnAnswer(rootPlan) => rootPlan match {
    +        case Limit(IntegerLiteral(limit), Sort(order, true, child)) =>
    +          TakeOrderedAndProjectExec(limit, order, child.output, planLater(child)) :: Nil
    +        case Limit(IntegerLiteral(limit), Project(projectList, Sort(order, true, child))) =>
    +          TakeOrderedAndProjectExec(limit, order, projectList, planLater(child)) :: Nil
    +        case Limit(IntegerLiteral(limit), child) =>
    +          // With whole stage codegen, Spark releases resources only when all the output data of the
    +          // query plan are consumed. It's possible that `CollectLimitExec` only consumes a little
    +          // data from child plan and finishes the query without releasing resources. Here we wrap
    +          // the child plan with `LocalLimitExec`, to stop the processing of whole stage codegen and
    +          // trigger the resource releasing work, after we consume `limit` rows.
    --- End diff --
    
    comments updated.
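
The updated code comment describes resources being released only once a stage's output is fully consumed. A minimal sketch of that behaviour follows. `Stage` and `collectLimit` are hypothetical stand-ins, not Spark's `LocalLimitExec` or `CollectLimitExec`, and the sketch assumes the consumer asks the stage for more rows once after it has taken `n` rows.

```scala
object LocalLimitReleaseSketch {
  // Hypothetical stand-in for a whole-stage-codegen pipeline: resources are
  // released only when the stage itself reports that it has no more output.
  final class Stage(input: Iterator[Int], limit: Option[Int]) {
    private var produced = 0
    var released = false

    // A fused local limit makes the stage stop after `limit` rows.
    private def stopEarly: Boolean = limit.exists(produced >= _)

    def hasNext: Boolean = {
      val more = !stopEarly && input.hasNext
      if (!more) released = true // release resources at end of output
      more
    }

    def next(): Int = {
      if (!hasNext) throw new NoSuchElementException("stage exhausted")
      produced += 1
      input.next()
    }
  }

  // Hypothetical collect-limit consumer. Assumption: it checks `hasNext`
  // before comparing the row count, so it probes the stage once more
  // after the n-th row.
  def collectLimit(stage: Stage, n: Int): Seq[Int] = {
    val buf = Seq.newBuilder[Int]
    var i = 0
    while (stage.hasNext && i < n) { buf += stage.next(); i += 1 }
    buf.result()
  }

  def main(args: Array[String]): Unit = {
    val plain = new Stage(Iterator.range(1, 101), limit = None)
    val wrapped = new Stage(Iterator.range(1, 101), limit = Some(5))
    println((collectLimit(plain, 5), plain.released))
    println((collectLimit(wrapped, 5), wrapped.released))
  }
}
```

Both stages return the same five rows in this model, but only the one with the fused limit observes the end of its output and flips `released`.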



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    **[Test build #80848 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80848/testReport)** for PR 18993 at commit [`b6d51de`](https://github.com/apache/spark/commit/b6d51dead6af75ee49eb59ebc48ff0a4c58353ed).



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    **[Test build #80843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80843/testReport)** for PR 18993 at commit [`b6d51de`](https://github.com/apache/spark/commit/b6d51dead6af75ee49eb59ebc48ff0a4c58353ed).



[GitHub] spark pull request #18993: [SPARK-21743][SQL][follow-up] top-most limit shou...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18993#discussion_r133943823
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -1180,6 +1180,9 @@ object ConvertToLocalRelation extends Rule[LogicalPlan] {
           val projection = new InterpretedProjection(projectList, output)
           projection.initialize(0)
           LocalRelation(projectList.map(_.toAttribute), data.map(projection))
    +
    +    case Limit(IntegerLiteral(limit), LocalRelation(output, data)) =>
    --- End diff --
    
    This somewhat violates the principle that we shouldn't rely on optimization for correctness, but I suppose this is OK.



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    **[Test build #80843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80843/testReport)** for PR 18993 at commit [`b6d51de`](https://github.com/apache/spark/commit/b6d51dead6af75ee49eb59ebc48ff0a4c58353ed).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    retest this please.



[GitHub] spark issue #18993: [SPARK-21743][SQL][follow-up] top-most limit should not ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18993
  
    Merged build finished. Test FAILed.

