You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by lw-lin <gi...@git.apache.org> on 2016/07/15 03:36:52 UTC

[GitHub] spark pull request #14214: [SPARK-16545][SQL] Eliminate one unnecessary roun...

GitHub user lw-lin opened a pull request:

    https://github.com/apache/spark/pull/14214

    [SPARK-16545][SQL] Eliminate one unnecessary round of physical planning in ForeachSink

    ## Problem
    
    As reported by [SPARK-16545](https://issues.apache.org/jira/browse/SPARK-16545), in `ForeachSink` we have initialized 3 rounds of physical planning.
    
    Specifically:
    
    [1] In `StreamExecution`, [lastExecution.executedPlan](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L369)
    
    [2] In `ForeachSink`, [forearchPartition()](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala#L69) calls withNewExecutionId(..., **_queryExection_**) which further calls [**_queryExecution_**.executedPlan](https://github.com/apache/spark/blob/9a5071996b968148f6b9aba12e0d3fe888d9acd8/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L55)
     
    [3] In `ForeachSink`, [val rdd = { ... incrementalExecution = new IncrementalExecution ...}](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala#L53)
    
    ## What changes were proposed in this pull request?
    
    [1] should not be eliminated in general;
    
    **[2] is eliminated by this patch, by replacing the `queryExecution` with `incrementalExecution` provided by [3];**
    
    [3] should be eliminated but can not be done at this stage; let's revisit it when SPARK-16264 is resolved.
    
    
    ## How was this patch tested?
    
    - checked manually now there are only 2 rounds of physical planning in ForeachSink after this patch
    - existing tests ensues it cause no regression


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark physical-3x

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14214.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14214
    
----
commit 8ec635fe7403baf5149e3f6714872bf706b37cd7
Author: Liwei Lin <lw...@gmail.com>
Date:   2016-07-15T02:12:02Z

    Fix foreachPartition

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate unnecessary rounds of physi...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    Thanks for working on this, but I'm tempted to close this as "won't fix".  Its likely we are going to have to rewrite the incremental planner completely for 2.1 and this is just a minor inefficiency if I understand correctly.
    
    In a more general comments:
     - There's no tests
     - If you find yourself needing to reference a bug/JIRA in several different places throughout the code, perhaps we need to reexamine the problem more holistically, rather than continue to pile on hacks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate one unnecessary round of ph...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate one unnecessary round of ph...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    **[Test build #62362 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62362/consoleFull)** for PR 14214 at commit [`8ec635f`](https://github.com/apache/spark/commit/8ec635fe7403baf5149e3f6714872bf706b37cd7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate one unnecessary round of ph...

Posted by mariobriggs <gi...@git.apache.org>.
Github user mariobriggs commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    What i tried to do as a 'side fix' was like this,
      eliminate [1] since it was a lazy val. 
    
      Move [2] out of the code path of the main thread i.e. let ListenerBus thread pay the penalty of producing the physical plan for logging ( i was coming from a performance test scenario, so it allowed me to proceed :-) ) . So the change was that SparkListenerSQLExecutionStart only take QueryExecution as a input parameter and not physicalPlanDescription & SparkPlanInfo . However this cannot be the solution since SparkListenerSQLExecutionStart is a public API already.
    
     [3] remains.
    
    As you might have already noticed ConsoleSink also suffers from the same problem of [2] and these are inside Dataset.withTypedCallback/withCallback, but it is only for Debug purposes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate unnecessary rounds of physi...

Posted by mariobriggs <gi...@git.apache.org>.
Github user mariobriggs commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    >Yea we probably do not want to modify this public API; so what we did in this patch was, passing [3]'s incrementalExecution into the listener so we would initialize physical planning only once for [2] and [3].
    
    oh cool. I didn't realize that. I can give it a try and understand it more. Thanks... this was the trick i was missing


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate one unnecessary round of ph...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    **[Test build #62362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62362/consoleFull)** for PR 14214 at commit [`8ec635f`](https://github.com/apache/spark/commit/8ec635fe7403baf5149e3f6714872bf706b37cd7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate one unnecessary round of ph...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    @mariobriggs Thanks for the information!
    
    > 1 can be eliminated because 'executedPlan' is a ' lazy val' on QueryExecution ?
    
    Yea indeed. Its being there can provide us debug info but on second thought it might not be worth it. So let's also skip it in the case of `ForeachSink`.
    
    > ... However this cannot be the solution since SparkListenerSQLExecutionStart is a public API already
    
    Yea we probably do not want to modify this public API; so what we did in this patch was, passing [3]'s `incrementalExecution` into the listener so we'll only initialize physical planning only once for [2] and [3].
    
    > ... why not keep [1] and the change to [2] be the simple case of changing L52 to the following: new Dataset(data.sparkSession, data.queryExecution, implicitly[Encoder[T]])
    
    This is great. If reviews decide that 2 rounds of physical planning is acceptable, then let's do it your way!
    
    > ... ConsoleSink ... but it is only for Debug purposes
    
    So maybe let us live with it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate one unnecessary round of ph...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    **[Test build #62368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62368/consoleFull)** for PR 14214 at commit [`9334105`](https://github.com/apache/spark/commit/933410558429ea82f063f21236b8c5c645650a78).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate unnecessary rounds of physi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    **[Test build #62368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62368/consoleFull)** for PR 14214 at commit [`9334105`](https://github.com/apache/spark/commit/933410558429ea82f063f21236b8c5c645650a78).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate unnecessary rounds of physi...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    @marmbrus @zsxwing could you take a look and share some ideas? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate unnecessary rounds of physi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate one unnecessary round of ph...

Posted by mariobriggs <gi...@git.apache.org>.
Github user mariobriggs commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    > [1] should not be eliminated in general;
    
      I dont understand the full internal aspects of IncrementalExecution, but my generally thinking was that 1 can be eliminated because 'executedPlan' is a ' lazy val' on QueryExecution ?
    
    >[2] is eliminated by this patch, by replacing the queryExecution with incrementalExecution provided by [3];
    
    If the goal is to get it to just as minimal as possible for now and wait for SPARK-16264 (which i was also thinking where it will have to finally wait for full resolution), why not keep [1] and the change to [2] be the simple case of changing [L52](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala#L52) to the following
     
    ``` new Dataset(data.sparkSession, data.queryExecution, implicitly[Encoder[T]]) ```
    
    and no further changes required to your ealier code. Will it be the case that the wrong physical plan will logged in SparkListenerSQLExecutionStart ?
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate one unnecessary round of ph...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62362/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate unnecessary rounds of physi...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    Sure let's rewrite the incremental planner to solve problems more holistically; actually this patch is not satisfying to myself either. So I'm closing this, and -- thank you for the ideas!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14214: [SPARK-16545][SQL] Eliminate unnecessary rounds o...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin closed the pull request at:

    https://github.com/apache/spark/pull/14214


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14214: [SPARK-16545][SQL] Eliminate unnecessary rounds of physi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14214
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62368/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org