You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by marmbrus <gi...@git.apache.org> on 2016/03/30 00:46:16 UTC

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/12048

    [SPARK-14255] [SQL] Streaming Aggregation

    This PR adds the ability to perform aggregations inside of a `ContinuousQuery`.  In order to implement this feature, the planning of aggregation has augmented with a new `StatefulAggregationStrategy`.  Unlike batch aggregation, stateful-aggregation uses the `StateStore` (introduced in #11645) to persist the results of partial aggregation across different invocations.  The resulting physical plan performs the aggregation using the following progression:
       - Partial Aggregation
       - Shuffle
       - Partial Merge (now there is at most 1 tuple per group)
       - StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous)
       - Partial Merge (now there is at most 1 tuple per group)
       - StateStoreSave (saves the tuple for the next batch)
       - Complete (output the current result of the aggregation)
    
    The following refactoring was also performed to allow us to plug into existing code:
     - The logic for breaking down and de-duping the physical execution of aggregation has been move into a new pattern `PhysicalAggregation`
     - The `AttributeReference` used to identify the result of an `AggregateFunction` as been moved into the `AggregateExpression` container.  This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around a `Map[(AggregateFunction, Boolean), Attribute]`.  Further clean up (using a different aggregation container for logical/physical plans) is deferred to a followup.
     - Some planning logic is moved from the `SessionState` into the `QueryExecution` to make it easier to override in the streaming case.
     - The ability to write a `StreamTest` that checks only the output of the last batch has been added to simulate the future addition of output modes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark statefulAgg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12048.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12048
    
----
commit 00f059d385ca48989a003809a12dab88fdae3371
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-25T00:36:43Z

    Basic Stateful Agg

commit 17a829bf82ef929d5363561ef8645bfe54a3dec8
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-29T01:25:11Z

    add testing for delta queries

commit b4fed4ae1516b2831b05ab44a8e60ae49a760f97
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-29T01:25:22Z

    aggregation refactoring

commit d6f7941172c3ac6adb9b2aa767504388e1186826
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-29T17:13:54Z

    plug state managment into the query execution

commit 7a1969237145137fefa54869131a14013410bae8
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-29T18:22:59Z

    Add get/put to StateStore

commit f7bb8d2475fd5f12c65f1dd37315e7ff8d47e7bb
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-29T18:23:08Z

    cleanup and refactoring

commit b46c325fb5418356154abaa692d79e8002492cb7
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-29T21:59:32Z

    cleanup

commit 1ef82c6ec4620ee3b13dab974d0f8325b3a8e545
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-29T22:43:18Z

    Merge remote-tracking branch 'apache/master' into statefulAgg
    
    Conflicts:
    	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala
    	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

commit 7a5e0ae1a8da4b589cf768dda826480095a1f3fb
Author: Michael Armbrust <mi...@databricks.com>
Date:   2016-03-29T22:45:50Z

    style

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12048#discussion_r58268028
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StreamTest.scala ---
    @@ -224,11 +239,8 @@ trait StreamTest extends QueryTest with Timeouts {
              """.stripMargin
     
         def verify(condition: => Boolean, message: String): Unit = {
    -      try {
    -        Assertions.assert(condition)
    -      } catch {
    -        case NonFatal(e) =>
    -          failTest(message, e)
    +      if (!condition) {
    --- End diff --
    
    All uses of `verify` are just doing equality checks with variables (and thus can't throw), except for those that were modified specifically such that that were going to throw exceptions upon failure.  So I think really the problem is overloading what was a simple assert to be an error handler.
    
    The issue this this construction is it now pollutes the output with the obvious:
    ```
    condition was false
    [info]   org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
    [info]          org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
    [info]          org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
    [info]          org.apache.spark.sql.StreamTest$class.verify$1(StreamTest.scala:228)
    [info]          org.apache.spark.sql.StreamTest$$anonfun$testStream$1.apply(StreamTest.scala:355)
    [info]          org.apache.spark.sql.StreamTest$$anonfun$testStream$1.apply(StreamTest.scala:271)
    [info]          scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    [info]          scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    [info]          org.apache.spark.sql.StreamTest$class.testStream(StreamTest.scala:271)
    [info]          org.apache.spark.sql.streaming.StreamSuite.testStream(StreamSuite.scala:24)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203145481
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54469/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203145960
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12048


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203247192
  
    Test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203169766
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203145471
  
    **[Test build #54469 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54469/consoleFull)** for PR 12048 at commit [`7a5e0ae`](https://github.com/apache/spark/commit/7a5e0ae1a8da4b589cf768dda826480095a1f3fb).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12048#discussion_r58159654
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StreamTest.scala ---
    @@ -224,11 +239,8 @@ trait StreamTest extends QueryTest with Timeouts {
              """.stripMargin
     
         def verify(condition: => Boolean, message: String): Unit = {
    -      try {
    -        Assertions.assert(condition)
    -      } catch {
    -        case NonFatal(e) =>
    -          failTest(message, e)
    +      if (!condition) {
    --- End diff --
    
    i had written this in this way so that if there are any errors in the lazy eval of `condition` that gets caught and message printed correctly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203145477
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203612082
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54543/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203169663
  
    **[Test build #54470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54470/consoleFull)** for PR 12048 at commit [`29355db`](https://github.com/apache/spark/commit/29355db0d3d5b66ee839f9f3798f30db7a5f85d5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203235006
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54485/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203169767
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54470/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203679135
  
    Aggregation part looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203611662
  
    **[Test build #54543 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54543/consoleFull)** for PR 12048 at commit [`6aeb27a`](https://github.com/apache/spark/commit/6aeb27afc057eb9ce917654f4dabf8bd47aed01a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203569305
  
    **[Test build #2710 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2710/consoleFull)** for PR 12048 at commit [`6aeb27a`](https://github.com/apache/spark/commit/6aeb27afc057eb9ce917654f4dabf8bd47aed01a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203234921
  
    **[Test build #54485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54485/consoleFull)** for PR 12048 at commit [`6aeb27a`](https://github.com/apache/spark/commit/6aeb27afc057eb9ce917654f4dabf8bd47aed01a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12048#discussion_r58160004
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala ---
    @@ -0,0 +1,72 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.execution.streaming
    +
    +import org.apache.spark.sql.SQLContext
    +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    +import org.apache.spark.sql.catalyst.rules.Rule
    +import org.apache.spark.sql.execution.{QueryExecution, SparkPlan, SparkPlanner, UnaryNode}
    +
    +/**
    + * A variant of [[QueryExecution]] that allows the execution of the given [[LogicalPlan]]
    + * plan incrementally. Possibly preserving state in between each execution.
    + */
    +class IncrementalExecution(
    +    ctx: SQLContext,
    +    logicalPlan: LogicalPlan,
    +    checkpointLocation: String,
    +    currentBatchId: Long) extends QueryExecution(ctx, logicalPlan) {
    +
    +  // TODO: make this always part of planning.
    +  val stateStrategy = sqlContext.sessionState.planner.StatefulAggregationStrategy :: Nil
    +
    +  // Modified planner with stateful operations.
    +  override def planner: SparkPlanner =
    +    new SparkPlanner(
    +      sqlContext.sparkContext,
    +      sqlContext.conf,
    +      stateStrategy)
    +
    +  /**
    +   * Records the current id for a given stateful operator in the query plan as the `state`
    +   * preperation walks the query plan.
    +   */
    +  private var operatorId = 0
    +
    +  /** Locates save/restore pairs surrounding aggregation. */
    +  val state = new Rule[SparkPlan] {
    +    override def apply(plan: SparkPlan): SparkPlan = plan transform {
    +      case StateStoreSave(keys, None,
    +             UnaryNode(agg,
    +               StateStoreRestore(keys2, None, child))) =>
    +        val stateId = OperatorStateId(checkpointLocation, operatorId, currentBatchId - 1)
    +        operatorId += 1
    +
    +        StateStoreSave(
    +          keys,
    +          Some(stateId),
    +          agg.withNewChildren(
    +            StateStoreRestore(
    +              keys,
    +              Some(stateId),
    +              child) :: Nil))
    --- End diff --
    
    nit: maybe its me, but this nested tree is a little hard to read


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203197496
  
    **[Test build #54485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54485/consoleFull)** for PR 12048 at commit [`6aeb27a`](https://github.com/apache/spark/commit/6aeb27afc057eb9ce917654f4dabf8bd47aed01a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12048#discussion_r58159893
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/utils.scala ---
    @@ -261,4 +246,90 @@ object Utils {
     
         finalAndCompleteAggregate :: Nil
       }
    +
    +  /**
    +   * Plans a streaming aggregation using the following progression:
    +   *  - Partial Aggregation
    +   *  - Shuffle
    +   *  - Partial Merge (now there is at most 1 tuple per group)
    +   *  - StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous)
    +   *  - PartialMerge (now there is at most 1 tuple per group)
    +   *  - StateStoreSave (saves the tuple for the next batch)
    +   *  - Complete (output the current result of the aggregation)
    --- End diff --
    
    is this Complete or Final?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203612076
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203235005
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203150507
  
    **[Test build #54470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54470/consoleFull)** for PR 12048 at commit [`29355db`](https://github.com/apache/spark/commit/29355db0d3d5b66ee839f9f3798f30db7a5f85d5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203568579
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-204589811
  
    Thanks, I'm going to merge to master and will address further comments in follow-ups.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203144311
  
    **[Test build #54469 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54469/consoleFull)** for PR 12048 at commit [`7a5e0ae`](https://github.com/apache/spark/commit/7a5e0ae1a8da4b589cf768dda826480095a1f3fb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12048#discussion_r58160046
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala ---
    @@ -28,37 +28,36 @@ package object state {
       implicit class StateStoreOps[T: ClassTag](dataRDD: RDD[T]) {
     
         /** Map each partition of a RDD along with data in a [[StateStore]]. */
    -    def mapPartitionWithStateStore[U: ClassTag](
    -        storeUpdateFunction: (StateStore, Iterator[T]) => Iterator[U],
    +    def mapPartitionsWithStateStore[U: ClassTag](
    --- End diff --
    
    nit: isnt it technically more correct to say `mapPartitionsWithStateStore*s*`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203612828
  
    **[Test build #2710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2710/consoleFull)** for PR 12048 at commit [`6aeb27a`](https://github.com/apache/spark/commit/6aeb27afc057eb9ce917654f4dabf8bd47aed01a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203569702
  
    **[Test build #54543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54543/consoleFull)** for PR 12048 at commit [`6aeb27a`](https://github.com/apache/spark/commit/6aeb27afc057eb9ce917654f4dabf8bd47aed01a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14255] [SQL] Streaming Aggregation

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12048#issuecomment-203614818
  
    @tdas the maintenance test failed twice on this PR, but not the third time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org