Posted to reviews@spark.apache.org by jerryshao <gi...@git.apache.org> on 2016/05/20 09:26:06 UTC

[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] A solution to ex...

GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/13221

    [SPARK-15443][SQL][Streaming] A solution to explain continuous query

    ## What changes were proposed in this pull request?
    
    Currently, calling `explain` directly on a streaming Dataset throws an exception for the optimized logical plan and the physical plan:
    
    ```
    scala> res0.explain(true)
    == Parsed Logical Plan ==
    FileSource[file:///tmp/input]
    == Analyzed Logical Plan ==
    value: string
    FileSource[file:///tmp/input]
    == Optimized Logical Plan ==
    org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with write.startStream();
    == Physical Plan ==
    org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with write.startStream();
    ```
    
    Inside `StreamExecution`, the logical plan still needs to be transformed before it is finally materialized at runtime. In the future, the optimized logical plan and the physical plan may also change from batch to batch. This PR therefore proposes a way to get an explained plan at runtime for a continuous query.
    
    Users could use it like this:
    
    ```
    val query = spark.read.format("text").stream("file:///tmp/input")
      .write
      .format("console")
      .option("checkpointLocation", "file:///tmp/checkpoint1")
      .trigger(ProcessingTime("2 seconds"))
      .startStream()
    
    // This can be called at runtime, once the query has been started
    query.explain()
    ```
    
    ## How was this patch tested?
    
    Added a unit test.
    
    This is one possible solution; please help review it. Thanks a lot for your time.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-15443

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13221.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13221
    
----
commit f19351f986047b3c29ca027476d6a22a4933a0cd
Author: jerryshao <ss...@hortonworks.com>
Date:   2016-05-20T09:11:57Z

    A solution to explain coninuous query

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-222269157
  
    > Also, a better solution would be to find a way to get the plan without actually executing the query.
    
    Yeah this seems like the best solution to me.




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-221738477
  
    @marmbrus What do you think?




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220558912
  
    **[Test build #58974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58974/consoleFull)** for PR 13221 at commit [`f19351f`](https://github.com/apache/spark/commit/f19351f986047b3c29ca027476d6a22a4933a0cd).




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220753587
  
    **[Test build #59057 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59057/consoleFull)** for PR 13221 at commit [`f19351f`](https://github.com/apache/spark/commit/f19351f986047b3c29ca027476d6a22a4933a0cd).




[GitHub] spark issue #13221: [SPARK-15443][SQL][Streaming] Properly explain continuou...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/13221
  
    I'm going to close this until I have a thorough fix for this issue. Thanks a lot for your comments.




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220756523
  
    **[Test build #59057 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59057/consoleFull)** for PR 13221 at commit [`f19351f`](https://github.com/apache/spark/commit/f19351f986047b3c29ca027476d6a22a4933a0cd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220756565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59057/




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220573726
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220573614
  
    **[Test build #58974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58974/consoleFull)** for PR 13221 at commit [`f19351f`](https://github.com/apache/spark/commit/f19351f986047b3c29ca027476d6a22a4933a0cd).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220753532
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220756564
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-221745872
  
    I think it's unfortunate that you have to actually start the query before you can see what the physical plan looks like; that seems counter to the goal of explain.
    
    How hard would it be to insert stubs for the sources so we could run the actual planner? (We would want to avoid anything with side effects, like actually instantiating a source.)
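
The stub idea in the comment above can be illustrated with a toy plan tree. This is only a sketch under stated assumptions: `Plan`, `StreamingSource`, `StubSource`, `Filter`, `Project`, `stubSources` and `render` are hypothetical names made up for the illustration and are not Spark's Catalyst classes or the code in this pull request.

```scala
// Toy model only: none of these types are Spark classes.
object StubSourceSketch {
  sealed trait Plan
  // A streaming source that would need real initialization before it could run.
  final case class StreamingSource(path: String, schema: Seq[String]) extends Plan
  // A side-effect-free placeholder that keeps only the schema.
  final case class StubSource(schema: Seq[String]) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan

  // Replace every streaming source with a stub, leaving the rest of the tree intact,
  // so that a planner could be run without instantiating any source.
  def stubSources(plan: Plan): Plan = plan match {
    case StreamingSource(_, schema) => StubSource(schema)
    case stub: StubSource           => stub
    case Filter(cond, child)        => Filter(cond, stubSources(child))
    case Project(cols, child)       => Project(cols, stubSources(child))
  }

  // Render the tree as indented text, one node per line.
  def render(plan: Plan, indent: Int = 0): String = {
    val pad = "  " * indent
    plan match {
      case StreamingSource(path, _) => s"${pad}StreamingSource[$path]"
      case StubSource(schema)       => s"${pad}StubSource[${schema.mkString(", ")}]"
      case Filter(cond, child)      => s"${pad}Filter($cond)\n${render(child, indent + 1)}"
      case Project(cols, child)     => s"${pad}Project(${cols.mkString(", ")})\n${render(child, indent + 1)}"
    }
  }

  def main(args: Array[String]): Unit = {
    val plan = Project(
      Seq("value"),
      Filter("value IS NOT NULL", StreamingSource("file:///tmp/input", Seq("value"))))
    // Only the source node changes; the operators above it are preserved.
    println(render(stubSources(plan)))
  }
}
```

In this toy version the stub keeps the schema, which is the minimum a planner would need to resolve the operators above the source.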




[GitHub] spark pull request #13221: [SPARK-15443][SQL][Streaming] Properly explain co...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao closed the pull request at:

    https://github.com/apache/spark/pull/13221




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain continuou...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/13221
  
    > but still it cannot reflect the real plan in the run-time
    
    I think you could get really close to the actual plan in most cases by just substituting dummy nodes.  We can indicate that this is a streaming plan and as such is only one possible plan.
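
One way to picture "indicate that this is a streaming plan" is a header on the explain output. The sketch below is a guess at what that could look like; `ExplainedPlan`, `format` and the header wording are hypothetical and are not taken from Spark or from this pull request.

```scala
// Hypothetical illustration only; not Spark's explain implementation.
object StreamingExplainSketch {
  // The text of a physical plan, plus whether it came from a streaming query
  // whose sources were replaced by placeholder nodes.
  final case class ExplainedPlan(isStreaming: Boolean, physicalPlan: String)

  // Prefix the plan with a header that warns the reader when the plan is
  // only one possible plan for a streaming query.
  def format(plan: ExplainedPlan): String = {
    val header =
      if (plan.isStreaming) "== Physical Plan (streaming; one possible plan) =="
      else "== Physical Plan =="
    s"$header\n${plan.physicalPlan}"
  }

  def main(args: Array[String]): Unit = {
    println(format(ExplainedPlan(isStreaming = true, physicalPlan = "Scan PlaceholderSource[value]")))
  }
}
```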




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-220573729
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58974/




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-222415753
  
    Thanks a lot @marmbrus for your reply.
    
    My concern is that ContinuousQuery still transforms the logical plan at runtime according to the Sources, so it is hard to get the exact logical plan beforehand (or without initializing the Sources). One thing we could do is add some explain-only streaming plans (logical and physical) to produce the explained tree, but still it cannot reflect the real plan in the run-time.




[GitHub] spark pull request: [SPARK-15443][SQL][Streaming] Properly explain...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/13221#issuecomment-221757617
  
    Thanks a lot @marmbrus for your suggestion. As you mentioned, we can currently get the physical plan only after the query has started, which runs counter to the goal of explain.
    
    But I think it is still quite useful for users to understand the translated plan. One thing we could do is give the API a name other than `explain`, so that we do not break its semantics. Also, a better solution would be to find a way to get the plan without actually executing the query.
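
The "different API name" option could look roughly like the toy below. It is a sketch only: `ToyQuery`, `runBatch` and `lastExecutedPlan` are hypothetical names chosen for the illustration and do not exist in Spark. The point is that such a method makes no promise to work before execution, so it does not carry the expectations attached to `explain`.

```scala
// Hypothetical illustration only; not a Spark API.
object RuntimePlanSketch {
  final class ToyQuery {
    // No plan is known until at least one batch has been planned and run.
    private var lastPlan: Option[String] = None

    // Simulates the engine planning and running one batch at runtime.
    def runBatch(batchId: Long): Unit = {
      lastPlan = Some(s"PhysicalPlan(batch=$batchId)")
    }

    // Deliberately not called `explain`: it reports only what has actually
    // been executed, and is empty before the first batch.
    def lastExecutedPlan: Option[String] = lastPlan
  }

  def main(args: Array[String]): Unit = {
    val query = new ToyQuery
    println(query.lastExecutedPlan) // empty before any batch has run
    query.runBatch(0L)
    println(query.lastExecutedPlan) // now holds the plan of batch 0
  }
}
```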
    
     

