Posted to dev@spark.apache.org by Li Jin <ic...@gmail.com> on 2018/04/03 20:26:32 UTC

Clarify window behavior in Spark SQL

Hi Devs,

I am seeing some behavior with window functions that is a bit unintuitive
and would like to get some clarification.

When using an aggregation function with a window, the frame boundaries
seem to change depending on whether the window has an ordering.

Example:
(1)

df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

w1 = Window.partitionBy('id')

df.withColumn('v2', mean(df.v).over(w1)).show()

+---+---+---+
| id|  v| v2|
+---+---+---+
|  0|  1|2.0|
|  0|  2|2.0|
|  0|  3|2.0|
+---+---+---+

(2)

df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

w2 = Window.partitionBy('id').orderBy('v')

df.withColumn('v2', mean(df.v).over(w2)).show()

+---+---+---+
| id|  v| v2|
+---+---+---+
|  0|  1|1.0|
|  0|  2|1.5|
|  0|  3|2.0|
+---+---+---+

Seems like orderBy('v') in example (2) also changes the frame boundaries
from (unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
currentRow).
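
As a quick sanity check (a sketch; assuming the frame-boundary constants
on pyspark.sql.Window), pinning the frame back to the full partition
while keeping the ordering reproduces the result of (1):

from pyspark.sql import Window
from pyspark.sql.functions import mean

w3 = Window.partitionBy('id').orderBy('v') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# v2 is 2.0 for all three rows again, matching example (1)
df.withColumn('v2', mean(df.v).over(w3)).show()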


I find this behavior a bit unintuitive. Is it by design, and if so, what
is the specific rule by which orderBy() interacts with the frame
boundaries?


Thanks,

Li

Re: Clarify window behavior in Spark SQL

Posted by Li Jin <ic...@gmail.com>.
Here is the original code and comments:
https://github.com/apache/spark/commit/b6b50efc854f298d5b3e11c05dca995a85bec962#diff-4a8f00ca33a80744965463dcc6662c75L277

Seems this is intentional, although I am not really sure why; maybe to
match the behavior of other SQL systems?

On Tue, Apr 3, 2018 at 5:09 PM, Reynold Xin <rx...@databricks.com> wrote:

> Seems like a bug.

Re: Clarify window behavior in Spark SQL

Posted by Sandor Murakozi <sm...@gmail.com>.
Hi Li,
You might find my pending PR useful:
https://github.com/apache/spark/pull/20045/files

It contains a large set of test cases covering the windowing
functionality, showing and checking the behavior of a number of special
cases.

On Wed, Apr 4, 2018 at 4:26 AM, Reynold Xin <rx...@databricks.com> wrote:

> Thanks Li!
Re: Clarify window behavior in Spark SQL

Posted by Reynold Xin <rx...@databricks.com>.
Thanks Li!

On Tue, Apr 3, 2018 at 7:23 PM Li Jin <ic...@gmail.com> wrote:

> Thanks all for the explanation. I am happy to update the API doc.
>
> https://issues.apache.org/jira/browse/SPARK-23861

Re: Clarify window behavior in Spark SQL

Posted by Li Jin <ic...@gmail.com>.
Thanks all for the explanation. I am happy to update the API doc.

https://issues.apache.org/jira/browse/SPARK-23861


Re: Clarify window behavior in Spark SQL

Posted by Reynold Xin <rx...@databricks.com>.
Ah ok. Thanks for commenting. Every day I learn something new about SQL.

For others to follow, SQL Server has a good explanation of the behavior:
https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql


Can somebody (Li?) update the API documentation to specify the gotchas, in
case users are not familiar with SQL window function semantics?



General Remarks
<https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#general-remarks>

More than one window function can be used in a single query with a single
FROM clause. The OVER clause for each function can differ in partitioning
and ordering.

If PARTITION BY is not specified, the function treats all rows of the query
result set as a single group.
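
For instance (a sketch in PySpark, mirroring the remark above; df is the
example DataFrame from Li's mail):

from pyspark.sql import Window
from pyspark.sql.functions import mean, row_number

w_mean = Window.partitionBy('id')                # no ordering
w_rank = Window.partitionBy('id').orderBy('v')   # a different OVER clause

df.withColumn('v_mean', mean(df.v).over(w_mean)) \
  .withColumn('v_rank', row_number().over(w_rank)) \
  .show()
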
Important!
<https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#important>

If ROWS/RANGE is specified and <window frame preceding> is used for
<window frame extent> (short syntax), then this specification is used for
the window frame boundary starting point and CURRENT ROW is used for the
boundary ending point. For example, “ROWS 5 PRECEDING” is equal to “ROWS
BETWEEN 5 PRECEDING AND CURRENT ROW”.

Note

If ORDER BY is not specified, the entire partition is used for the window
frame. This applies only to functions that do not require an ORDER BY
clause. If ROWS/RANGE is not specified but ORDER BY is specified, RANGE
UNBOUNDED PRECEDING AND CURRENT ROW is used as the default window frame.
This applies only to functions that can accept an optional ROWS/RANGE
specification. For example, ranking functions cannot accept ROWS/RANGE,
therefore this window frame is not applied even though ORDER BY is
present and ROWS/RANGE is not.
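
To make the ORDER BY default concrete in Spark SQL (a sketch; tmp is a
hypothetical temp view over Li's example data, and both queries should
return the same 1.0 / 1.5 / 2.0 result):

df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
df.createOrReplaceTempView('tmp')

# frame omitted: defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
spark.sql("""
    SELECT id, v, avg(v) OVER (PARTITION BY id ORDER BY v) AS v2
    FROM tmp
""").show()

# the same frame spelled out explicitly
spark.sql("""
    SELECT id, v, avg(v) OVER (PARTITION BY id ORDER BY v
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS v2
    FROM tmp
""").show()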


Re: Clarify window behavior in Spark SQL

Posted by Xingbo Jiang <ji...@gmail.com>.
This is actually by design: without an `ORDER BY` clause, all rows are
considered peers of the current row, which means that the frame is
effectively the entire partition. This behavior follows the window syntax
of PostgreSQL.
You can refer to the comment by yhuai:
https://github.com/apache/spark/pull/5604#discussion_r157931911
:)
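
To see the peer-row behavior directly: with ties in the ordering column,
the default RANGE frame and an explicit ROWS frame give different results
(a sketch, reusing the setup from Li's example):

from pyspark.sql import Window
from pyspark.sql.functions import mean

df2 = spark.createDataFrame([[0, 1], [0, 2], [0, 2]]).toDF('id', 'v')

w_range = Window.partitionBy('id').orderBy('v')  # default RANGE frame
w_rows = Window.partitionBy('id').orderBy('v') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# RANGE treats the two v=2 rows as peers of each other, so both get
# mean(1, 2, 2) ~ 1.67; ROWS gives 1.0, 1.5, ~1.67 instead.
df2.withColumn('v2', mean(df2.v).over(w_range)).show()
df2.withColumn('v2', mean(df2.v).over(w_rows)).show()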

2018-04-04 6:27 GMT+08:00 Reynold Xin <rx...@databricks.com>:

> Do other (non-Hive) SQL systems do the same thing?

Re: Clarify window behavior in Spark SQL

Posted by Reynold Xin <rx...@databricks.com>.
Do other (non-Hive) SQL systems do the same thing?

On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
herman@databricks.com> wrote:

> This is something we inherited from Hive:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>
> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
>> ROW.
>
> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
>> FOLLOWING.
>
>
> It sort of makes sense if you think about it. If there is no ordering
> there is no way to have a bounded frame. If there is ordering, we default
> to the most commonly used deterministic frame.
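
In DataFrame API terms, the two defaults quoted above correspond to the
following explicit frames (a sketch using the pyspark.sql.Window
constants):

from pyspark.sql import Window

# ORDER BY specified, WINDOW clause missing:
# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
w_ordered = Window.partitionBy('id').orderBy('v') \
    .rangeBetween(Window.unboundedPreceding, Window.currentRow)

# both ORDER BY and WINDOW clauses missing:
# ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
w_unordered = Window.partitionBy('id') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)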

Re: Clarify window behavior in Spark SQL

Posted by Reynold Xin <rx...@databricks.com>.
Seems like a bug.
