You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2014/12/15 18:54:28 UTC

[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/3703

    [SPARK-1442][SQL][WIP] Initial window function implementation (refactored from #2953)

    This WIP PR is refactored from PR #2953. Please refer to the original PR description for features implemented and not implemented in this PR.
    
    The original PR was a huge one, commenting on each issue could be very time consuming. After offline discussions with @guowei2, I decided to work on a refactoring branch to fix most minor issues first and then start discussion based on this refactored version.
    
    Major issues left in this PR are:
    
    1. Window spec is added to aggregation functions with a `var`, which breaks query plan immutability.
    2. When used with window specs, common aggregation functions like `COUNT`, `SUM`, `AVG` etc are not translated into Hive aggregation functions rather than Spark SQL builtin implementations.
    3. Execution code (`execution.WindowFunction`) can be further simplified.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark window-refact

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3703.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3703
    
----
commit 9897413564ff27f0a311cc2cef6322422f3807ab
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-10-24T09:55:47Z

    window function

commit 7d7a703d5e7bf37e00d074cb8c04e2150f8fbeb4
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-10-27T05:29:35Z

    window function

commit 1999e07a23c18808738e4e3b14b64c1db108eda2
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-10-27T06:03:17Z

    window function

commit 76bfd4b8b1137426b0dbcde5d56cefb0c98cfab5
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-10-27T14:16:22Z

    window function

commit 88c5789d9f6989d0fedcbdd129de097152e2d8eb
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-10-28T04:00:57Z

    window function

commit 828199a48c619d03b4ec524dbdfe9c043baa5e14
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-10-29T07:49:01Z

    fix problems after rebase

commit 03bd77d5533f76484d7589e0296283b58f2d0688
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-10-30T10:12:42Z

    change test suite and golden files

commit d06baeba2dc859f860c8fd43c292275837b3e0e6
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-11-05T03:01:33Z

    add constant objectinspector support for udafs, such as last_value(col, false)

commit 173016c08770fd2aa6ee15c3f194c2282bd46e68
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-11-26T06:58:47Z

    fix window function to support multi-different window partitions

commit ab21933e64b3ee7afdcbb622bec935a34fe0785c
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-11-27T08:40:26Z

    fix DecimalType bug after rebase

commit 66ef7a6d449f6ec1e644d2e73118d8be1cb56cde
Author: guowei2 <gu...@asiainfo.com>
Date:   2014-11-28T09:51:16Z

    fix bug about attribute reference

commit dc87d8d08c33644e61f6355ed07baf720b0e9ef9
Author: Cheng Lian <li...@databricks.com>
Date:   2014-12-04T06:34:32Z

    WIP: refactoring window functions support

commit 2da61753590fe00ecf46219f387d70d48c6dd32a
Author: Cheng Lian <li...@databricks.com>
Date:   2014-12-15T06:42:57Z

    Removed trailing spaces from query string in HiveWindowFunctionSuite

commit 922a8b9bfe0278577378c3cd9fc13cb9998b6e0f
Author: Cheng Lian <li...@databricks.com>
Date:   2014-12-15T17:18:42Z

    Fixes COUNT with window spec

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-67418117
  
    Should we close this issue in favor of #3703 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-83691288
  
    I'm confused. Why was this PR abruptly closed? Was there another active PR for window functions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-83722409
  
    I think that might have been a mistake and #2953 was supposed to be closed. Since we are unable to close PRs, there is (long story) a process that eventually closes PRs that have a comment like "mind closing this pr". That's why it got auto-closed then.
    
    That said I don't otherwise know whether this was going to proceed anyway. I don't see other PRs for this JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-69768767
  
    > Should we close this issue in favor of #3703 ?
    
    @marmbrus This is #3703. Did you mean another PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-67047485
  
      [Test build #24462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24462/consoleFull) for   PR 3703 at commit [`922a8b9`](https://github.com/apache/spark/commit/922a8b9bfe0278577378c3cd9fc13cb9998b6e0f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class WindowSpec(windowPartition: WindowPartition, windowFrame: Option[WindowFrame])`
      * `case class WindowPartition(partitionBy: Seq[Expression], sortBy: Seq[SortOrder])`
      * `case class WindowFrame(frameType: FrameType, preceding: Int, following: Int)`
      * `abstract class AggregateExpression extends Expression with Serializable `
      * `case class WindowFunction(`
      * `case class WindowFunction(`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-67047490
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24462/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3703


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-67041844
  
    Comments from the [review on Reviewable.io](https://reviewable.io:443/reviews/apache/spark/3703)
    
    
    
    
    
    
    
    ---
    
    <sup>**[sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala, line 30 \[r1\]](https://reviewable.io:443/reviews/apache/spark/3703#-JdE5SkiLKM30za1keNC)** ([raw file](https://github.com/apache/spark/blob/922a8b9bfe0278577378c3cd9fc13cb9998b6e0f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L30)):</sup>
    This can be problematic. Ideally every aggregation function that can be used with window should have a `windowSpec: Option[WindowSpec]` field which defaults to `None`, and a `withWindowSpec` method that returns a new instance of the aggregation function object itself with a window spec.
    
    ---
    
    <sup>**[sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala, line 874 \[r1\]](https://reviewable.io:443/reviews/apache/spark/3703#-JdE660e137A3Dk7Uqp2)** ([raw file](https://github.com/apache/spark/blob/922a8b9bfe0278577378c3cd9fc13cb9998b6e0f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala#L874)):</sup>
    The thread-local `windowDefs` map is used to store window definitions (`w1`, `w2` and `w3`) in queries like this:
    
    ```sql
    SELECT
        p_mfgr, p_name, p_size,
        SUM(p_size) OVER w1 AS s1,
        SUM(p_size) OVER w2 AS s2,
        SUM(p_size) OVER (w3 ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)  AS s3
    FROM
        part
    WINDOW
        w1 AS (DISTRIBUTE BY p_mfgr SORT BY p_size RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING),
        w2 AS w3,
        w3 AS (DISTRIBUTE BY p_mfgr SORT BY p_size RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    ```
    
    This map is cleaned and refilled in `collectWindowDefs` below, so it doesn't grow indefinitely.
    
    ---
    
    <sup>**[sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala, line 1060 \[r1\]](https://reviewable.io:443/reviews/apache/spark/3703#-JdEA7B6QXd6N0aDAEMh)** ([raw file](https://github.com/apache/spark/blob/922a8b9bfe0278577378c3cd9fc13cb9998b6e0f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala#L1060)):</sup>
    All builtin aggregation functions need a similar `case` clause to handle their windowed version. Otherwise they all fallback to Hive UDAF implementations.
    
    `COUNT` is picked here because its Hive version `GenericUDAFCount` implements `GenericUDAFResolver2` rather than `AbstractGenericUDAFResolver`, and is not handled by `HiveFunctionRegistry.lookupFunction`.
    
    ---
    
    
    <!-- Sent from Reviewable.io -->



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-67035848
  
      [Test build #24462 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24462/consoleFull) for   PR 3703 at commit [`922a8b9`](https://github.com/apache/spark/commit/922a8b9bfe0278577378c3cd9fc13cb9998b6e0f).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/3703#issuecomment-67035927
  
    Comments from the [review on Reviewable.io](https://reviewable.io:443/reviews/apache/spark/3703)
    
    
    
    ---
    
    Note that instead of whitelisting window function test cases in `HiveCompatibilitySuite`, a new `HiveWindowFunctionSuite` was added. This is because the current Spark SQL HiveQl parser doesn't handle comments, and window function test input files come with Hive contains comment lines.
    
    ---
    
    
    
    <!-- Sent from Reviewable.io -->



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org