Posted to dev@spark.apache.org by Maciej Szymkiewicz <ms...@gmail.com> on 2017/01/18 08:18:59 UTC

[SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

Hi,

Can I ask for some clarification regarding the intended behavior of
window / TimeWindow?

The PySpark documentation states that "Windows in the order of months are
not supported". This is further confirmed by the checks in
TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).

Surprisingly enough, we can pass an interval much larger than a month by
expressing it in days or another unit of higher precision. So this
fails:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))

while the following is accepted:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))

with results which look sensible at first glance.
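
For completeness, the same reproduction as a self-contained spark-shell
session (assuming Spark 2.x, with spark as the active SparkSession; only
the imports and the trailing aggregation are added here):

import spark.implicits._
import org.apache.spark.sql.functions.window

// Fails with "Intervals greater than a month is not supported"
Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month")).count()

// Accepted, even though 999 days is far longer than a month
Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days")).count()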

Is this a matter of faulty validation logic (months are assigned only
when the interval string matches the year or month patterns,
https://git.io/vMPdi), or is this the expected behavior and I have
simply misunderstood the intentions?
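
For reference, the check behind the first link boils down to something
like the simplified sketch below (not the verbatim source).
CalendarInterval stores months in a field separate from microseconds,
and days and all smaller units land in microseconds, which is why
"999 days" sails through:

import org.apache.spark.unsafe.types.CalendarInterval

// Simplified sketch of TimeWindow.getIntervalInMicroseconds as of the
// linked revision
def getIntervalInMicroseconds(interval: String): Long = {
  val cal = CalendarInterval.fromString("interval " + interval)
  if (cal == null) {
    throw new IllegalArgumentException(
      s"The provided interval ($interval) did not correspond to a valid " +
        "interval string.")
  }
  // Only an explicit "month"/"year" unit sets cal.months; "999 days"
  // leaves it at zero and passes
  if (cal.months > 0) {
    throw new IllegalArgumentException(
      s"Intervals greater than a month is not supported ($interval).")
  }
  cal.microseconds
}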

-- 
Best,
Maciej


Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

Posted by Michael Armbrust <mi...@databricks.com>.
+1, we should just fix the error message to explain why months aren't
allowed and suggest that you manually specify some number of days.
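
One possible shape for that message, reusing the cal and interval names
from the sketch in the first post (hypothetical wording, not a committed
fix):

// Hypothetical replacement for the months check above
if (cal.months > 0) {
  throw new IllegalArgumentException(
    s"Intervals in months or years are not supported ($interval) because " +
      "months vary in length. Express the window in a fixed-size unit " +
      "instead, e.g. '30 days'.")
}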


Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
Thanks for the response, Burak.

Like any sane person, I try to steer away from objects which have both
"calendar" and "unsafe" in their fully qualified names, but if there is
no bigger picture I missed here, I would go with 1 as well, and of
course fix the error message. I understand this was introduced with
structured streaming in mind, but it is a useful feature in general,
not only at high-precision scales. To be honest, I would love to see a
generalized version which could be used (without hacking, I mean) with
an arbitrary numeric sequence. It could address at least some scenarios
in which people try to use window functions without a PARTITION BY
clause and fail miserably.
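
Something in the spirit of the manual bucketing people resort to today
(a sketch; the bucket width of 10 and the sample data are arbitrary
example values):

import spark.implicits._
import org.apache.spark.sql.functions.floor

// "Window" an arbitrary numeric sequence by hand via floor division
val width = 10
Seq(1, 7, 12, 23, 25).toDF("x")
  .groupBy((floor($"x" / width) * width).as("bucket"))
  .count()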

Regarding ambiguity... Sticking with days doesn't really resolve the
problem, does it? If one were to nitpick, it doesn't look like this
implementation even touches all the subtleties of DST or leap seconds.

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

Posted by Burak Yavuz <br...@gmail.com>.
Hi Maciej,

I believe it would be useful to either fix the documentation or fix the
implementation; I'll leave it to the community to comment. The code
right now disallows intervals provided in months and years because they
are not a "consistently" fixed amount of time. A month can be 28, 29,
30, or 31 days. A year is certainly 12 months, but is it 360 days
(sometimes used in finance), 365 days, or 366 days?

Therefore we could either:
  1) Allow windowing when intervals are given in days or smaller units,
even though the total could amount to 365 days, and fix the documentation.
  2) Explicitly disallow it, as there may be a lot of data for a given
window (though partial aggregations should help with that).

My thoughts are to go with 1. What do you think?
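
For option 1, the user-side workaround would look something like this (a
sketch; the fixed 30 days is just one arbitrary stand-in for a calendar
month, which is exactly the ambiguity the user would be asked to resolve):

import spark.implicits._
import org.apache.spark.sql.functions.window

// Approximate a monthly window with a fixed 30-day one
Seq("2017-01-01", "2017-02-15").toDF("date")
  .groupBy(window($"date", "30 days"))
  .count()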

Best,
Burak
