You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by zsxwing <gi...@git.apache.org> on 2016/03/30 01:21:47 UTC

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

GitHub user zsxwing opened a pull request:

    https://github.com/apache/spark/pull/12049

    [SPARK-14257][SQL]Allow multiple continuous queries to be started from the same DataFrame

    ## What changes were proposed in this pull request?
    
    Make StreamingRelation store the closure to create the source in StreamExecution so that we can start multiple continuous queries from the same DataFrame. 
    
    ## How was this patch tested?
    
    `test("DataFrame reuse")`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark df-reuse

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12049.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12049
    
----
commit 50c39b83e9d0f0a3ebe8cf5cb2675662b320a6e9
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-03-29T00:09:28Z

    Allow multiple continuous queries to be started from the same DataFrame

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205652858
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54955/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-204511022
  
    **[Test build #2722 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2722/consoleFull)** for PR 12049 at commit [`6196790`](https://github.com/apache/spark/commit/61967904705767af7c1d95af97c7a814ea1d2207).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57974037
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
    @@ -81,4 +85,62 @@ class StreamSuite extends StreamTest with SharedSQLContext {
           AddData(inputData, 1, 2, 3, 4),
           CheckAnswer(2, 4))
       }
    +
    +  test("DataFrame reuse") {
    +    def assertDF(df: DataFrame) {
    +      withTempDir { outputDir =>
    +        withTempDir { checkpointDir =>
    +          val query = df.write.format("parquet")
    +            .option("checkpointLocation", checkpointDir.getAbsolutePath)
    +            .startStream(outputDir.getAbsolutePath)
    +          try {
    +            eventually(timeout(streamingTimeout)) {
    --- End diff --
    
    > I'd like to ban use of eventually. Most of our tests that rely on this construct end up being flakey.
    
    As there are two threads here, even if not using `eventually`, we still to need to use `await` or other similar stuff in order to collect results from another thread.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57946679
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
    @@ -81,4 +85,62 @@ class StreamSuite extends StreamTest with SharedSQLContext {
           AddData(inputData, 1, 2, 3, 4),
           CheckAnswer(2, 4))
       }
    +
    +  test("DataFrame reuse") {
    +    def assertDF(df: DataFrame) {
    +      withTempDir { outputDir =>
    +        withTempDir { checkpointDir =>
    +          val query = df.write.format("parquet")
    +            .option("checkpointLocation", checkpointDir.getAbsolutePath)
    +            .startStream(outputDir.getAbsolutePath)
    +          try {
    +            eventually(timeout(streamingTimeout)) {
    +              val outputDf = sqlContext.read.parquet(outputDir.getAbsolutePath).as[Long]
    +              checkDataset[Long](outputDf, (0L to 10L).toArray: _*)
    +            }
    +          } finally {
    +            query.stop()
    +          }
    +        }
    +      }
    +    }
    +
    +    val df = sqlContext.read.format(classOf[FakeDefaultSource].getName).stream()
    +    assertDF(df)
    +    assertDF(df)
    +    assertDF(df)
    --- End diff --
    
    This test passes if there are 2 of these but not 3?  That seems a little weird to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205489573
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54878/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205453826
  
    **[Test build #54879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54879/consoleFull)** for PR 12049 at commit [`ac51850`](https://github.com/apache/spark/commit/ac51850ccb3e84faf78fb5b85d34a6806d3df8e9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12049


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205449459
  
    **[Test build #54878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54878/consoleFull)** for PR 12049 at commit [`aa55afe`](https://github.com/apache/spark/commit/aa55afe9eff89322e50ff23fe3332ce13689244d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205495527
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57974213
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
    @@ -81,4 +85,62 @@ class StreamSuite extends StreamTest with SharedSQLContext {
           AddData(inputData, 1, 2, 3, 4),
           CheckAnswer(2, 4))
       }
    +
    +  test("DataFrame reuse") {
    +    def assertDF(df: DataFrame) {
    +      withTempDir { outputDir =>
    +        withTempDir { checkpointDir =>
    +          val query = df.write.format("parquet")
    +            .option("checkpointLocation", checkpointDir.getAbsolutePath)
    +            .startStream(outputDir.getAbsolutePath)
    +          try {
    +            eventually(timeout(streamingTimeout)) {
    +              val outputDf = sqlContext.read.parquet(outputDir.getAbsolutePath).as[Long]
    +              checkDataset[Long](outputDf, (0L to 10L).toArray: _*)
    +            }
    +          } finally {
    +            query.stop()
    +          }
    +        }
    +      }
    +    }
    +
    +    val df = sqlContext.read.format(classOf[FakeDefaultSource].getName).stream()
    +    assertDF(df)
    +    assertDF(df)
    +    assertDF(df)
    --- End diff --
    
    > This test passes if there are 2 of these but not 3? That seems a little weird to me.
    
    Indeed, we only need to test twice here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57982704
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
    @@ -81,4 +85,62 @@ class StreamSuite extends StreamTest with SharedSQLContext {
           AddData(inputData, 1, 2, 3, 4),
           CheckAnswer(2, 4))
       }
    +
    +  test("DataFrame reuse") {
    +    def assertDF(df: DataFrame) {
    +      withTempDir { outputDir =>
    +        withTempDir { checkpointDir =>
    +          val query = df.write.format("parquet")
    +            .option("checkpointLocation", checkpointDir.getAbsolutePath)
    +            .startStream(outputDir.getAbsolutePath)
    +          try {
    +            eventually(timeout(streamingTimeout)) {
    --- End diff --
    
    We have existing constructs like `testStream` or `processAllAvailable`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205925072
  
    Thanks, merging to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205650525
  
    **[Test build #54955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54955/consoleFull)** for PR 12049 at commit [`48d760e`](https://github.com/apache/spark/commit/48d760eed41a3d559ad8aa6363b6000d4b9ed54d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205869083
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-203175003
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54472/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-203174600
  
    **[Test build #54472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54472/consoleFull)** for PR 12049 at commit [`50c39b8`](https://github.com/apache/spark/commit/50c39b83e9d0f0a3ebe8cf5cb2675662b320a6e9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class StreamingRelation(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205871057
  
    **[Test build #54994 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54994/consoleFull)** for PR 12049 at commit [`48d760e`](https://github.com/apache/spark/commit/48d760eed41a3d559ad8aa6363b6000d4b9ed54d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-203175002
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205488623
  
    **[Test build #54878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54878/consoleFull)** for PR 12049 at commit [`aa55afe`](https://github.com/apache/spark/commit/aa55afe9eff89322e50ff23fe3332ce13689244d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205630404
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54932/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-204511207
  
    **[Test build #2723 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2723/consoleFull)** for PR 12049 at commit [`6196790`](https://github.com/apache/spark/commit/61967904705767af7c1d95af97c7a814ea1d2207).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205495534
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54879/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58117149
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
    @@ -71,9 +71,18 @@ class StreamExecution(
       /** The current batchId or -1 if execution has not yet been initialized. */
       private var currentBatchId: Long = -1
     
    +  private[sql] val logicalPlan = _logicalPlan.transform {
    +    case StreamingRelation(sourceCreator, output) =>
    +      // Materialize source to avoid creating it in every batch
    +      val source = sourceCreator()
    +      // We still need to use the previous `output` instead of `source.schema` as attributes in
    +      // "_logicalPlan" has already used attributes of the previous `output`.
    +      StreamingRelation(() => source, output)
    --- End diff --
    
    > Its not going to be clear from explain() which mode you are in.
    
    By the way, the stream DataFrame has not yet supported `explain`. Should we fix it now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58116652
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
    @@ -71,9 +71,18 @@ class StreamExecution(
       /** The current batchId or -1 if execution has not yet been initialized. */
       private var currentBatchId: Long = -1
     
    +  private[sql] val logicalPlan = _logicalPlan.transform {
    +    case StreamingRelation(sourceCreator, output) =>
    +      // Materialize source to avoid creating it in every batch
    +      val source = sourceCreator()
    +      // We still need to use the previous `output` instead of `source.schema` as attributes in
    +      // "_logicalPlan" has already used attributes of the previous `output`.
    +      StreamingRelation(() => source, output)
    --- End diff --
    
    I tried `Map[DataSource, Source]` but failed because of RichSource.
    
    ```
      implicit class RichSource(s: Source) {
        def toDF(): DataFrame = Dataset.ofRows(sqlContext, StreamingRelation(s))
    
        def toDS[A: Encoder](): Dataset[A] = Dataset(sqlContext, StreamingRelation(s))
      }
    ```
    If we only have `StreamingRelaction(DataSource)`, then RichSource needs to create a DataSource for Source dynamically. 
    
    So the above codes will be changed to
    ```
      implicit class RichSource(s: Source) {
        def toDF(): DataFrame = Dataset.ofRows(sqlContext, StreamingRelation(DataSource(sqlContext, className = ...)))
    
        def toDS[A: Encoder](): Dataset[A] = Dataset(sqlContext, StreamingRelation(sqlContext, className = ...))
      }
    ```
    
    Here I don't what to fill for `className`. Without code generation, we won't be able to create a new class for different Source instances. This seems too complicated.
    
    Therefore, I used the `StreamExecutionRelation` idea finally.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205499930
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58488129
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala ---
    @@ -19,16 +19,33 @@ package org.apache.spark.sql.execution.streaming
     
     import org.apache.spark.sql.catalyst.expressions.Attribute
     import org.apache.spark.sql.catalyst.plans.logical.LeafNode
    +import org.apache.spark.sql.execution.datasources.DataSource
     
     object StreamingRelation {
    -  def apply(source: Source): StreamingRelation =
    -    StreamingRelation(source, source.schema.toAttributes)
    +  def apply(dataSource: DataSource): StreamingRelation = {
    +    val source = dataSource.createSource()
    +    StreamingRelation(dataSource, source.schema.toAttributes)
    +  }
    +}
    +
    +/**
    + * Used to link a streaming [[DataSource]] into a
    + * [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]].
    + */
    +case class StreamingRelation(dataSource: DataSource, output: Seq[Attribute]) extends LeafNode {
    +  override def toString: String = dataSource.createSource().toString
    --- End diff --
    
    > This could be expensive right? I'm not sure that we want to do that in a toString.
    
    Added a `sourceName` field and used it directly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57973689
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
    @@ -71,9 +71,18 @@ class StreamExecution(
       /** The current batchId or -1 if execution has not yet been initialized. */
       private var currentBatchId: Long = -1
     
    +  private[sql] val logicalPlan = _logicalPlan.transform {
    +    case StreamingRelation(sourceCreator, output) =>
    +      // Materialize source to avoid creating it in every batch
    +      val source = sourceCreator()
    +      // We still need to use the previous `output` instead of `source.schema` as attributes in
    +      // "_logicalPlan" has already used attributes of the previous `output`.
    +      StreamingRelation(() => source, output)
    --- End diff --
    
    > I think its confusing to have an opaque function that sometimes is creating a source and sometimes returning a static source. Its not going to be clear from explain() which mode you are in.
    
    How about adding a new Relation for a static source (maybe call it StreamExecutionRelation)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205630102
  
    **[Test build #54932 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54932/consoleFull)** for PR 12049 at commit [`527f55f`](https://github.com/apache/spark/commit/527f55feef0418c36745f9287f4df1bf5067fe64).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205652778
  
    **[Test build #54955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54955/consoleFull)** for PR 12049 at commit [`48d760e`](https://github.com/apache/spark/commit/48d760eed41a3d559ad8aa6363b6000d4b9ed54d).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute])`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205494817
  
    **[Test build #54879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54879/consoleFull)** for PR 12049 at commit [`ac51850`](https://github.com/apache/spark/commit/ac51850ccb3e84faf78fb5b85d34a6806d3df8e9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58116999
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
    @@ -81,4 +85,62 @@ class StreamSuite extends StreamTest with SharedSQLContext {
           AddData(inputData, 1, 2, 3, 4),
           CheckAnswer(2, 4))
       }
    +
    +  test("DataFrame reuse") {
    +    def assertDF(df: DataFrame) {
    +      withTempDir { outputDir =>
    +        withTempDir { checkpointDir =>
    +          val query = df.write.format("parquet")
    +            .option("checkpointLocation", checkpointDir.getAbsolutePath)
    +            .startStream(outputDir.getAbsolutePath)
    +          try {
    +            eventually(timeout(streamingTimeout)) {
    +              val outputDf = sqlContext.read.parquet(outputDir.getAbsolutePath).as[Long]
    +              checkDataset[Long](outputDf, (0L to 10L).toArray: _*)
    +            }
    +          } finally {
    +            query.stop()
    +          }
    +        }
    +      }
    +    }
    +
    +    val df = sqlContext.read.format(classOf[FakeDefaultSource].getName).stream()
    +    assertDF(df)
    +    assertDF(df)
    +    assertDF(df)
    --- End diff --
    
    > I was saying that when I ran it locally it passed (without the other fixes in this PR) when there were only two.
    
    Seems weird. I just tested this test with the master branch (without any fixed in this PR) and it did fail when there were only two.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205910356
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54994/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205910352
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205499497
  
    **[Test build #54881 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54881/consoleFull)** for PR 12049 at commit [`9b5f007`](https://github.com/apache/spark/commit/9b5f00745de1b303f114c6145e2445bcaf76869e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205499932
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54881/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58478509
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala ---
    @@ -19,16 +19,33 @@ package org.apache.spark.sql.execution.streaming
     
     import org.apache.spark.sql.catalyst.expressions.Attribute
     import org.apache.spark.sql.catalyst.plans.logical.LeafNode
    +import org.apache.spark.sql.execution.datasources.DataSource
     
     object StreamingRelation {
    -  def apply(source: Source): StreamingRelation =
    -    StreamingRelation(source, source.schema.toAttributes)
    +  def apply(dataSource: DataSource): StreamingRelation = {
    +    val source = dataSource.createSource()
    +    StreamingRelation(dataSource, source.schema.toAttributes)
    +  }
    +}
    +
    +/**
    + * Used to link a streaming [[DataSource]] into a
    + * [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]].
    --- End diff --
    
    Maybe include a description of how this gets turned into a `StreamingExecutionRelation` and who's responsibility that is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-204561765
  
    **[Test build #2722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2722/consoleFull)** for PR 12049 at commit [`6196790`](https://github.com/apache/spark/commit/61967904705767af7c1d95af97c7a814ea1d2207).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class StreamingRelation(dataSource: DataSource, output: Seq[Attribute]) extends LeafNode `
      * `case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57982896
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
    @@ -71,9 +71,18 @@ class StreamExecution(
       /** The current batchId or -1 if execution has not yet been initialized. */
       private var currentBatchId: Long = -1
     
    +  private[sql] val logicalPlan = _logicalPlan.transform {
    +    case StreamingRelation(sourceCreator, output) =>
    +      // Materialize source to avoid creating it in every batch
    +      val source = sourceCreator()
    +      // We still need to use the previous `output` instead of `source.schema` as attributes in
    +      // "_logicalPlan" has already used attributes of the previous `output`.
    +      StreamingRelation(() => source, output)
    --- End diff --
    
    Or StreamRelation could just hold a `DataSource` and we could have a `Map[DataSource, Source]` here thats initialized at startup.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-204084943
  
    **[Test build #54653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54653/consoleFull)** for PR 12049 at commit [`6196790`](https://github.com/apache/spark/commit/61967904705767af7c1d95af97c7a814ea1d2207).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-203155154
  
    **[Test build #54472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54472/consoleFull)** for PR 12049 at commit [`50c39b8`](https://github.com/apache/spark/commit/50c39b83e9d0f0a3ebe8cf5cb2675662b320a6e9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205630402
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205460728
  
    **[Test build #54881 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54881/consoleFull)** for PR 12049 at commit [`9b5f007`](https://github.com/apache/spark/commit/9b5f00745de1b303f114c6145e2445bcaf76869e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57982732
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
    @@ -81,4 +85,62 @@ class StreamSuite extends StreamTest with SharedSQLContext {
           AddData(inputData, 1, 2, 3, 4),
           CheckAnswer(2, 4))
       }
    +
    +  test("DataFrame reuse") {
    +    def assertDF(df: DataFrame) {
    +      withTempDir { outputDir =>
    +        withTempDir { checkpointDir =>
    +          val query = df.write.format("parquet")
    +            .option("checkpointLocation", checkpointDir.getAbsolutePath)
    +            .startStream(outputDir.getAbsolutePath)
    +          try {
    +            eventually(timeout(streamingTimeout)) {
    +              val outputDf = sqlContext.read.parquet(outputDir.getAbsolutePath).as[Long]
    +              checkDataset[Long](outputDf, (0L to 10L).toArray: _*)
    +            }
    +          } finally {
    +            query.stop()
    +          }
    +        }
    +      }
    +    }
    +
    +    val df = sqlContext.read.format(classOf[FakeDefaultSource].getName).stream()
    +    assertDF(df)
    +    assertDF(df)
    +    assertDF(df)
    --- End diff --
    
    I was saying that when I ran it locally it passed (without the other fixes in this PR) when there were only two.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57943784
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
    @@ -71,9 +71,18 @@ class StreamExecution(
       /** The current batchId or -1 if execution has not yet been initialized. */
       private var currentBatchId: Long = -1
     
    +  private[sql] val logicalPlan = _logicalPlan.transform {
    +    case StreamingRelation(sourceCreator, output) =>
    +      // Materialize source to avoid creating it in every batch
    +      val source = sourceCreator()
    +      // We still need to use the previous `output` instead of `source.schema` as attributes in
    +      // "_logicalPlan" has already used attributes of the previous `output`.
    +      StreamingRelation(() => source, output)
    --- End diff --
    
    I think its confusing to have an opaque function that sometimes is creating a `source` and sometimes returning a static `source`.  Its not going to be clear from `explain()` which mode you are in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205489572
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58116791
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
    @@ -81,4 +85,62 @@ class StreamSuite extends StreamTest with SharedSQLContext {
           AddData(inputData, 1, 2, 3, 4),
           CheckAnswer(2, 4))
       }
    +
    +  test("DataFrame reuse") {
    +    def assertDF(df: DataFrame) {
    +      withTempDir { outputDir =>
    +        withTempDir { checkpointDir =>
    +          val query = df.write.format("parquet")
    +            .option("checkpointLocation", checkpointDir.getAbsolutePath)
    +            .startStream(outputDir.getAbsolutePath)
    +          try {
    +            eventually(timeout(streamingTimeout)) {
    --- End diff --
    
    Fixed the test by using `processAllAvailable`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-204563462
  
    **[Test build #2723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2723/consoleFull)** for PR 12049 at commit [`6196790`](https://github.com/apache/spark/commit/61967904705767af7c1d95af97c7a814ea1d2207).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class StreamingRelation(dataSource: DataSource, output: Seq[Attribute]) extends LeafNode `
      * `case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58478636
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala ---
    @@ -19,16 +19,33 @@ package org.apache.spark.sql.execution.streaming
     
     import org.apache.spark.sql.catalyst.expressions.Attribute
     import org.apache.spark.sql.catalyst.plans.logical.LeafNode
    +import org.apache.spark.sql.execution.datasources.DataSource
     
     object StreamingRelation {
    -  def apply(source: Source): StreamingRelation =
    -    StreamingRelation(source, source.schema.toAttributes)
    +  def apply(dataSource: DataSource): StreamingRelation = {
    +    val source = dataSource.createSource()
    +    StreamingRelation(dataSource, source.schema.toAttributes)
    +  }
    +}
    +
    +/**
    + * Used to link a streaming [[DataSource]] into a
    + * [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]].
    + */
    +case class StreamingRelation(dataSource: DataSource, output: Seq[Attribute]) extends LeafNode {
    +  override def toString: String = dataSource.createSource().toString
    --- End diff --
    
    This could be expensive right?  I'm not sure that we want to do that in a `toString`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205652851
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r57944719
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
    @@ -81,4 +85,62 @@ class StreamSuite extends StreamTest with SharedSQLContext {
           AddData(inputData, 1, 2, 3, 4),
           CheckAnswer(2, 4))
       }
    +
    +  test("DataFrame reuse") {
    +    def assertDF(df: DataFrame) {
    +      withTempDir { outputDir =>
    +        withTempDir { checkpointDir =>
    +          val query = df.write.format("parquet")
    +            .option("checkpointLocation", checkpointDir.getAbsolutePath)
    +            .startStream(outputDir.getAbsolutePath)
    +          try {
    +            eventually(timeout(streamingTimeout)) {
    --- End diff --
    
    I'd like to ban use of `eventually`.  Most of our tests that rely on this construct end up being flakey.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58478446
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ContinuousQueryManager.scala ---
    @@ -178,11 +178,19 @@ class ContinuousQueryManager(sqlContext: SQLContext) {
             throw new IllegalArgumentException(
               s"Cannot start query with name $name as a query with that name is already active")
           }
    +      val logicalPlan = df.logicalPlan.transform {
    +        case StreamingRelation(dataSource, output) =>
    +          // Materialize source to avoid creating it in every batch
    +          val source = dataSource.createSource()
    +          // We still need to use the previous `output` instead of `source.schema` as attributes in
    +          // "_logicalPlan" has already used attributes of the previous `output`.
    --- End diff --
    
    nit: i don't see anything named `_logicalPlan`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12049#discussion_r58488089
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala ---
    @@ -19,16 +19,33 @@ package org.apache.spark.sql.execution.streaming
     
     import org.apache.spark.sql.catalyst.expressions.Attribute
     import org.apache.spark.sql.catalyst.plans.logical.LeafNode
    +import org.apache.spark.sql.execution.datasources.DataSource
     
     object StreamingRelation {
    -  def apply(source: Source): StreamingRelation =
    -    StreamingRelation(source, source.schema.toAttributes)
    +  def apply(dataSource: DataSource): StreamingRelation = {
    +    val source = dataSource.createSource()
    +    StreamingRelation(dataSource, source.schema.toAttributes)
    +  }
    +}
    +
    +/**
    + * Used to link a streaming [[DataSource]] into a
    + * [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]].
    --- End diff --
    
    Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205600854
  
    **[Test build #54932 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54932/consoleFull)** for PR 12049 at commit [`527f55f`](https://github.com/apache/spark/commit/527f55feef0418c36745f9287f4df1bf5067fe64).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12049#issuecomment-205909806
  
    **[Test build #54994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54994/consoleFull)** for PR 12049 at commit [`48d760e`](https://github.com/apache/spark/commit/48d760eed41a3d559ad8aa6363b6000d4b9ed54d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute])`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org