Posted to dev@spark.apache.org by Maciej Szymkiewicz <ms...@gmail.com> on 2016/11/30 16:27:08 UTC

[SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Hi,

I've been looking at SPARK-17845 and I am curious if there is any
reason to make it a breaking change. In Spark 2.0 and below we could use:

    Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
sys.maxsize)

In 2.1.0 this code silently produces incorrect results (ROWS BETWEEN
-1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
Window.unboundedPreceding equal to -sys.maxsize to ensure backward
compatibility?
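
For reference, a minimal sketch of the regression (assuming a local Spark
2.1.0 session; the DataFrame and the names below are illustrative only):

    import sys
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([("a", i) for i in range(4)], ["foo", "bar"])

    # Pre-2.1 idiom for a frame unbounded in both directions.
    w = (Window.partitionBy("foo").orderBy("bar")
               .rowsBetween(-sys.maxsize, sys.maxsize))

    # In 2.1.0 the lower bound is no longer recognized as unbounded, so
    # per the behaviour described above these sums silently differ from
    # the per-partition totals.
    df.select(F.sum("bar").over(w)).show()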

-- 

Maciej Szymkiewicz




Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
It could be something like this:
https://github.com/zero323/spark/commit/b1f4d8218629b56b0982ee58f5b93a40305985e0
but I am not fully satisfied with it.
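
A rough sketch of the general approach (illustrative only, not necessarily
what the commit above does):

    import sys

    _JAVA_MIN_LONG = -(1 << 63)     # Java Long.MIN_VALUE
    _JAVA_MAX_LONG = (1 << 63) - 1  # Java Long.MAX_VALUE

    def _to_java_boundary(value):
        # Treat the legacy +/- sys.maxsize sentinels, and anything beyond
        # them, as unbounded, and clamp to the Java long range so py4j
        # never receives an unrepresentable integer.
        if value <= -sys.maxsize:
            return _JAVA_MIN_LONG
        if value >= sys.maxsize:
            return _JAVA_MAX_LONG
        return value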

On 11/30/2016 07:34 PM, Reynold Xin wrote:
> Yes I'd define unboundedPreceding as -sys.maxsize, but also any value
> less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered
> unboundedPreceding too. We need to be careful with long overflow when
> transferring data over to Java.
>
>
> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz
> <mszymkiewicz@gmail.com> wrote:
>
>     It is platform specific, so it can theoretically be larger, but 2**63
>     - 1 is the standard on 64-bit platforms and 2**31 - 1 on 32-bit
>     platforms. I can submit a patch, but I am not sure how to proceed.
>     Personally I would set
>
>     unboundedPreceding = -sys.maxsize
>
>     unboundedFollowing = sys.maxsize
>
>     to keep backwards compatibility.
>
>     On 11/30/2016 06:52 PM, Reynold Xin wrote:
>>     Ah, OK. For some reason when I did the pull request, sys.maxsize
>>     was much larger than 2^63. Do you want to submit a patch to fix this?
>>
>>
>>     On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz
>>     <mszymkiewicz@gmail.com> wrote:
>>
>>         The problem is that -(1 << 63) is -(sys.maxsize + 1), so the
>>         code that used to work before is off by one.
>>
>>         On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>>         Can you give a repro? Anything less than -(1 << 63) is
>>>         considered negative infinity (i.e. unbounded preceding).
>>>
>>>         On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz
>>>         <mszymkiewicz@gmail.com> wrote:
>>>
>>>             Hi,
>>>
>>>             I've been looking at SPARK-17845 and I am curious if
>>>             there is any
>>>             reason to make it a breaking change. In Spark 2.0 and
>>>             below we could use:
>>>
>>>             Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>>>             sys.maxsize)
>>>
>>>             In 2.1.0 this code silently produces incorrect
>>>             results (ROWS BETWEEN
>>>             -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
>>>             Window.unboundedPreceding equal to -sys.maxsize to ensure
>>>             backward
>>>             compatibility?
>>>
>>>             --
>>>
>>>             Maciej Szymkiewicz
>>>
>>>
>>
>>         -- 
>>         Maciej Szymkiewicz
>>
>>
>
>     -- 
>     Maciej Szymkiewicz
>
>

-- 
Maciej Szymkiewicz


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
Sure, here you are: https://issues.apache.org/jira/browse/SPARK-18690

To be fair, I am not fully convinced it is worth it.
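
If it goes ahead, the test cases could look roughly like this (a sketch,
relying on the hypothetical _to_java_boundary helper sketched earlier in
the thread):

    import sys
    import unittest

    # Hypothetical import; _to_java_boundary is the clamping helper
    # sketched earlier in this thread, not an existing PySpark function.
    from window_utils import _to_java_boundary

    class WindowBoundaryTest(unittest.TestCase):

        def test_legacy_maxsize_sentinels_are_unbounded(self):
            # The pre-2.1 sentinels should map to the Java long extremes
            # instead of being taken literally or overflowing.
            self.assertEqual(_to_java_boundary(-sys.maxsize), -(1 << 63))
            self.assertEqual(_to_java_boundary(sys.maxsize), (1 << 63) - 1)

        def test_ordinary_offsets_pass_through(self):
            self.assertEqual(_to_java_boundary(-1), -1)
            self.assertEqual(_to_java_boundary(0), 0)

    if __name__ == "__main__":
        unittest.main()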


On 12/02/2016 12:51 AM, Reynold Xin wrote:
> Can you submit a pull request with test cases based on that change?
>
>
> On Dec 1, 2016, 9:39 AM -0800, Maciej Szymkiewicz
> <ms...@gmail.com>, wrote:
>>
>> This doesn't affect that. The only concern is what we consider to be
>> UNBOUNDED on the Python side.
>>
>>
>> On 12/01/2016 07:56 AM, assaf.mendelson wrote:
>>>
>>> I may be mistaken, but if I remember correctly Spark behaves
>>> differently when the frame is bounded in the past and when it is not.
>>> Specifically, I seem to recall a fix which made sure that when there
>>> is no lower bound, the aggregation is computed incrementally instead
>>> of re-aggregating the whole range for each window. So I believe it
>>> should be configured exactly the same as in Scala/Java so the
>>> optimization would take place.
>>>
>>> Assaf.
>>>
>>> *From:* rxin [via Apache Spark Developers List]
>>> *Sent:* Wednesday, November 30, 2016 8:35 PM
>>> *To:* Mendelson, Assaf
>>> *Subject:* Re: [SPARK-17845] [SQL][PYTHON] More self-evident window
>>> function frame boundary API
>>>
>>>  
>>>
>>> Yes I'd define unboundedPreceding as -sys.maxsize, but also any
>>> value less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered
>>> unboundedPreceding too. We need to be careful with long overflow
>>> when transferring data over to Java.
>>>
>>>  
>>>
>>>  
>>>
>>> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]> wrote:
>>>
>>> It is platform specific, so it can theoretically be larger, but 2**63 -
>>> 1 is the standard on 64-bit platforms and 2**31 - 1 on 32-bit platforms.
>>> I can submit a patch, but I am not sure how to proceed. Personally I
>>> would set
>>>
>>> unboundedPreceding = -sys.maxsize
>>> unboundedFollowing = sys.maxsize
>>>
>>> to keep backwards compatibility.
>>>
>>> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>>>
>>>     Ah, OK. For some reason when I did the pull request, sys.maxsize
>>>     was much larger than 2^63. Do you want to submit a patch to fix
>>>     this?
>>>
>>>      
>>>
>>>      
>>>
>>>     On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz
>>>     <[hidden email]> wrote:
>>>
>>>     The problem is that -(1 << 63) is -(sys.maxsize + 1), so the code
>>>     that used to work before is off by one.
>>>
>>>     On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>>
>>>         Can you give a repro? Anything less than -(1 << 63) is
>>>         considered negative infinity (i.e. unbounded preceding).
>>>
>>>          
>>>
>>>         On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz
>>>         <[hidden email]> wrote:
>>>
>>>         Hi,
>>>
>>>         I've been looking at SPARK-17845 and I am curious if
>>>         there is any
>>>         reason to make it a breaking change. In Spark 2.0 and below
>>>         we could use:
>>>
>>>         Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>>>         sys.maxsize)
>>>
>>>         In 2.1.0 this code silently produces incorrect results
>>>         (ROWS BETWEEN
>>>         -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
>>>         Window.unboundedPreceding equal to -sys.maxsize to ensure
>>>         backward compatibility?
>>>
>>>         --
>>>
>>>         Maciej Szymkiewicz
>>>
>>>          
>>>
>>>      
>>>
>>>     -- 
>>>
>>>     Maciej Szymkiewicz
>>>
>>>      
>>>
>>>  
>>>
>>> -- 
>>> Maciej Szymkiewicz
>>>
>>>  
>>>
>>>  
>>>
>>
>> --  
>> Maciej Szymkiewicz

-- 
Maciej Szymkiewicz


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Reynold Xin <rx...@databricks.com>.
Can you submit a pull request with test cases based on that change?


On Dec 1, 2016, 9:39 AM -0800, Maciej Szymkiewicz <ms...@gmail.com>, wrote:
> This doesn't affect that. The only concern is what we consider to be UNBOUNDED on the Python side.
>
> On 12/01/2016 07:56 AM, assaf.mendelson wrote:
> > I may be mistaken, but if I remember correctly Spark behaves differently when the frame is bounded in the past and when it is not. Specifically, I seem to recall a fix which made sure that when there is no lower bound, the aggregation is computed incrementally instead of re-aggregating the whole range for each window. So I believe it should be configured exactly the same as in Scala/Java so the optimization would take place.
> > Assaf.
> >
> > From: rxin [via Apache Spark Developers List]
> > Sent: Wednesday, November 30, 2016 8:35 PM
> > To: Mendelson, Assaf
> > Subject: Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API
> >
> > Yes I'd define unboundedPreceding as -sys.maxsize, but also any value less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered unboundedPreceding too. We need to be careful with long overflow when transferring data over to Java.
> >
> >
> > On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]> wrote:
> > It is platform specific, so it can theoretically be larger, but 2**63 - 1 is the standard on 64-bit platforms and 2**31 - 1 on 32-bit platforms. I can submit a patch, but I am not sure how to proceed. Personally I would set
> >
> > unboundedPreceding = -sys.maxsize
> >
> > unboundedFollowing = sys.maxsize
> > to keep backwards compatibility.
> > On 11/30/2016 06:52 PM, Reynold Xin wrote:
> > > Ah, OK. For some reason when I did the pull request, sys.maxsize was much larger than 2^63. Do you want to submit a patch to fix this?
> > >
> > >
> > > On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <[hidden email]> wrote:
> > > The problem is that -(1 << 63) is -(sys.maxsize + 1), so the code that used to work before is off by one.
> > > On 11/30/2016 06:43 PM, Reynold Xin wrote:
> > > > Can you give a repro? Anything less than -(1 << 63) is considered negative infinity (i.e. unbounded preceding).
> > > >
> > > > On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <[hidden email]> wrote:
> > > > Hi,
> > > >
> > > > I've been looking at SPARK-17845 and I am curious if there is any
> > > > reason to make it a breaking change. In Spark 2.0 and below we could use:
> > > >
> > > >     Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
> > > > sys.maxsize)
> > > >
> > > > In 2.1.0 this code silently produces incorrect results (ROWS BETWEEN
> > > > -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
> > > > Window.unboundedPreceding equal to -sys.maxsize to ensure backward
> > > > compatibility?
> > > >
> > > > --
> > > >
> > > > Maciej Szymkiewicz
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Maciej Szymkiewicz
> > >
> >
> >
> > --
> >
> > Maciej Szymkiewicz
> >
> >
>
>
> --
> Maciej Szymkiewicz

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
This doesn't affect that. The only concern is what we consider to be
UNBOUNDED on the Python side.


On 12/01/2016 07:56 AM, assaf.mendelson wrote:
>
> I may be mistaken, but if I remember correctly Spark behaves
> differently when the frame is bounded in the past and when it is not.
> Specifically, I seem to recall a fix which made sure that when there is
> no lower bound, the aggregation is computed incrementally instead of
> re-aggregating the whole range for each window. So I believe it should
> be configured exactly the same as in Scala/Java so the optimization
> would take place.
>
> Assaf.
>
>  
>
> *From:* rxin [via Apache Spark Developers List]
> *Sent:* Wednesday, November 30, 2016 8:35 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: [SPARK-17845] [SQL][PYTHON] More self-evident window
> function frame boundary API
>
>  
>
> Yes I'd define unboundedPreceding as -sys.maxsize, but also any value
> less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered
> unboundedPreceding too. We need to be careful with long overflow when
> transferring data over to Java.
>
>  
>
>  
>
> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]> wrote:
>
> It is platform specific, so it can theoretically be larger, but 2**63 - 1
> is the standard on 64-bit platforms and 2**31 - 1 on 32-bit platforms. I
> can submit a patch, but I am not sure how to proceed. Personally I
> would set
>
> unboundedPreceding = -sys.maxsize
> unboundedFollowing = sys.maxsize
>
> to keep backwards compatibility.
>
> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>
>     Ah, OK. For some reason when I did the pull request, sys.maxsize was
>     much larger than 2^63. Do you want to submit a patch to fix this?
>
>      
>
>      
>
>     On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz
>     <[hidden email]> wrote:
>
>     The problem is that -(1 << 63) is -(sys.maxsize + 1), so the code
>     that used to work before is off by one.
>
>     On 11/30/2016 06:43 PM, Reynold Xin wrote:
>
>         Can you give a repro? Anything less than -(1 << 63) is
>         considered negative infinity (i.e. unbounded preceding).
>
>          
>
>         On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz
>         <[hidden email]> wrote:
>
>         Hi,
>
>         I've been looking at SPARK-17845 and I am curious if there
>         is any
>         reason to make it a breaking change. In Spark 2.0 and below we
>         could use:
>
>         Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>         sys.maxsize)
>
>         In 2.1.0 this code silently produces incorrect results
>         (ROWS BETWEEN
>         -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
>         Window.unboundedPreceding equal to -sys.maxsize to ensure
>         backward compatibility?
>
>         --
>
>         Maciej Szymkiewicz
>
>          
>
>      
>
>     -- 
>
>     Maciej Szymkiewicz
>
>      
>
>  
>
> -- 
> Maciej Szymkiewicz
>
>  
>
>  
>

-- 
Maciej Szymkiewicz


RE: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by "assaf.mendelson" <as...@rsa.com>.
I may be mistaken, but if I remember correctly Spark behaves differently when the frame is bounded in the past and when it is not. Specifically, I seem to recall a fix which made sure that when there is no lower bound, the aggregation is computed incrementally instead of re-aggregating the whole range for each window. So I believe it should be configured exactly the same as in Scala/Java so the optimization would take place.
Assaf.
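
One way to sanity-check this (a sketch, assuming a local Spark 2.1 session
and a toy DataFrame): the physical plan should show UNBOUNDED PRECEDING
rather than a huge numeric offset, so the running-aggregate path can apply.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 2), ("a", 3)], ["foo", "bar"])

    # Frame expressed with the 2.1 sentinel constants.
    w = (Window.partitionBy("foo").orderBy("bar")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    # The window frame in the plan should read
    # ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    df.select(F.sum("bar").over(w)).explain()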

From: rxin [via Apache Spark Developers List] [mailto:ml-node+s1001551n20069h28@n3.nabble.com]
Sent: Wednesday, November 30, 2016 8:35 PM
To: Mendelson, Assaf
Subject: Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Yes I'd define unboundedPreceding as -sys.maxsize, but also any value less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered unboundedPreceding too. We need to be careful with long overflow when transferring data over to Java.


On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]> wrote:

It is platform specific, so it can theoretically be larger, but 2**63 - 1 is the standard on 64-bit platforms and 2**31 - 1 on 32-bit platforms. I can submit a patch, but I am not sure how to proceed. Personally I would set

unboundedPreceding = -sys.maxsize

unboundedFollowing = sys.maxsize

to keep backwards compatibility.
On 11/30/2016 06:52 PM, Reynold Xin wrote:
Ah, OK. For some reason when I did the pull request, sys.maxsize was much larger than 2^63. Do you want to submit a patch to fix this?


On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <[hidden email]> wrote:

The problem is that -(1 << 63) is -(sys.maxsize + 1), so the code that used to work before is off by one.
On 11/30/2016 06:43 PM, Reynold Xin wrote:
Can you give a repro? Anything less than -(1 << 63) is considered negative infinity (i.e. unbounded preceding).

On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <[hidden email]> wrote:
Hi,

I've been looking at SPARK-17845 and I am curious if there is any
reason to make it a breaking change. In Spark 2.0 and below we could use:

    Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
sys.maxsize)

In 2.1.0 this code silently produces incorrect results (ROWS BETWEEN
-1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
Window.unboundedPreceding equal to -sys.maxsize to ensure backward
compatibility?

--

Maciej Szymkiewicz





--

Maciej Szymkiewicz



--

Maciej Szymkiewicz







Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Reynold Xin <rx...@databricks.com>.
Yes I'd define unboundedPreceding as -sys.maxsize, but also any value less
than min(-sys.maxsize, _JAVA_MIN_LONG) is considered unboundedPreceding
too. We need to be careful with long overflow when transferring data over
to Java.
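
In code, the rule could look roughly like this (a sketch; the helper name
is hypothetical):

    import sys

    _JAVA_MIN_LONG = -(1 << 63)  # Java Long.MIN_VALUE

    unboundedPreceding = -sys.maxsize

    def _is_unbounded_preceding(start):
        # On 64-bit builds -sys.maxsize == _JAVA_MIN_LONG + 1, so taking
        # the sentinel and everything below it as unbounded also keeps
        # out-of-range Python ints from overflowing Java's long.
        return start <= unboundedPreceding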


On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <mszymkiewicz@gmail.com
> wrote:

> It is platform specific, so it can theoretically be larger, but 2**63 - 1 is
> the standard on 64-bit platforms and 2**31 - 1 on 32-bit platforms. I can
> submit a patch, but I am not sure how to proceed. Personally I would set
>
> unboundedPreceding = -sys.maxsize
>
> unboundedFollowing = sys.maxsize
>
> to keep backwards compatibility.
> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>
> Ah, OK. For some reason when I did the pull request, sys.maxsize was much
> larger than 2^63. Do you want to submit a patch to fix this?
>
>
> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <
> mszymkiewicz@gmail.com> wrote:
>
>> The problem is that -(1 << 63) is -(sys.maxsize + 1), so the code
>> that used to work before is off by one.
>> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>
>> Can you give a repro? Anything less than -(1 << 63) is considered
>> negative infinity (i.e. unbounded preceding).
>>
>> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <
>> mszymkiewicz@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've been looking at SPARK-17845 and I am curious if there is any
>>> reason to make it a breaking change. In Spark 2.0 and below we could use:
>>>
>>>     Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>>> sys.maxsize)
>>>
>>> In 2.1.0 this code silently produces incorrect results (ROWS BETWEEN
>>> -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
>>> Window.unboundedPreceding equal to -sys.maxsize to ensure backward
>>> compatibility?
>>>
>>> --
>>>
>>> Maciej Szymkiewicz
>>>
>>>
>>>
>>>
>>
>> --
>> Maciej Szymkiewicz
>>
>>
>
> --
> Maciej Szymkiewicz
>
>

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
It is platform specific, so it can theoretically be larger, but 2**63 - 1 is
the standard on 64-bit platforms and 2**31 - 1 on 32-bit platforms. I can
submit a patch, but I am not sure how to proceed. Personally I would set

unboundedPreceding = -sys.maxsize

unboundedFollowing = sys.maxsize

to keep backwards compatibility.
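
For reference, what sys.maxsize actually is (quick check on CPython):

    import sys

    # sys.maxsize is the platform's Py_ssize_t maximum, not Java's
    # Long.MAX_VALUE; on a 64-bit build the two happen to coincide.
    print(sys.maxsize)                   # 9223372036854775807 on 64-bit
    print(sys.maxsize == (1 << 63) - 1)  # True on 64-bit, False on 32-bit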

On 11/30/2016 06:52 PM, Reynold Xin wrote:
> Ah, OK. For some reason when I did the pull request, sys.maxsize was much
> larger than 2^63. Do you want to submit a patch to fix this?
>
>
> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz
> <mszymkiewicz@gmail.com> wrote:
>
>     The problem is that -(1 << 63) is -(sys.maxsize + 1), so the code
>     that used to work before is off by one.
>
>     On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>     Can you give a repro? Anything less than -(1 << 63) is considered
>>     negative infinity (i.e. unbounded preceding).
>>
>>     On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz
>>     <mszymkiewicz@gmail.com> wrote:
>>
>>         Hi,
>>
>>         I've been looking at SPARK-17845 and I am curious if
>>         there is any
>>         reason to make it a breaking change. In Spark 2.0 and below
>>         we could use:
>>
>>         Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>>         sys.maxsize)
>>
>>         In 2.1.0 this code silently produces incorrect results
>>         (ROWS BETWEEN
>>         -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
>>         Window.unboundedPreceding equal to -sys.maxsize to ensure
>>         backward compatibility?
>>
>>         --
>>
>>         Maciej Szymkiewicz
>>
>>
>>
>>
>
>     -- 
>     Maciej Szymkiewicz
>
>

-- 
Maciej Szymkiewicz


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Reynold Xin <rx...@databricks.com>.
Ah, OK. For some reason when I did the pull request, sys.maxsize was much
larger than 2^63. Do you want to submit a patch to fix this?


On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <ms...@gmail.com>
wrote:

> The problem is that -(1 << 63) is -(sys.maxsize + 1), so the code that
> used to work before is off by one.
> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>
> Can you give a repro? Anything less than -(1 << 63) is considered negative
> infinity (i.e. unbounded preceding).
>
> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <
> mszymkiewicz@gmail.com> wrote:
>
>> Hi,
>>
>> I've been looking at SPARK-17845 and I am curious if there is any
>> reason to make it a breaking change. In Spark 2.0 and below we could use:
>>
>>     Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>> sys.maxsize)
>>
>> In 2.1.0 this code silently produces incorrect results (ROWS BETWEEN
>> -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
>> Window.unboundedPreceding equal to -sys.maxsize to ensure backward
>> compatibility?
>>
>> --
>>
>> Maciej Szymkiewicz
>>
>>
>>
>>
>
> --
> Maciej Szymkiewicz
>
>

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
The problem is that -(1 << 63) is -(sys.maxsize + 1), so the code that
used to work before is off by one.
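
In numbers (64-bit CPython):

    import sys

    # sys.maxsize is 2**63 - 1, so the old sentinel -sys.maxsize lands
    # exactly one above the cutoff -(1 << 63) and is no longer treated
    # as unbounded.
    assert sys.maxsize == (1 << 63) - 1
    assert -(1 << 63) == -(sys.maxsize + 1)
    assert -sys.maxsize > -(1 << 63)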

On 11/30/2016 06:43 PM, Reynold Xin wrote:
> Can you give a repro? Anything less than -(1 << 63) is considered
> negative infinity (i.e. unbounded preceding).
>
> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz
> <mszymkiewicz@gmail.com> wrote:
>
>     Hi,
>
>     I've been looking at SPARK-17845 and I am curious if there is any
>     reason to make it a breaking change. In Spark 2.0 and below we
>     could use:
>
>     Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>     sys.maxsize)
>
>     In 2.1.0 this code silently produces incorrect results (ROWS
>     BETWEEN
>     -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
>     Window.unboundedPreceding equal to -sys.maxsize to ensure backward
>     compatibility?
>
>     --
>
>     Maciej Szymkiewicz
>
>
>
>

-- 
Maciej Szymkiewicz


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

Posted by Reynold Xin <rx...@databricks.com>.
Can you give a repro? Anything less than -(1 << 63) is considered negative
infinity (i.e. unbounded preceding).

On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <ms...@gmail.com>
wrote:

> Hi,
>
> I've been looking at SPARK-17845 and I am curious if there is any
> reason to make it a breaking change. In Spark 2.0 and below we could use:
>
>     Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
> sys.maxsize)
>
> In 2.1.0 this code silently produces incorrect results (ROWS BETWEEN
> -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
> Window.unboundedPreceding equal to -sys.maxsize to ensure backward
> compatibility?
>
> --
>
> Maciej Szymkiewicz
>
>
>
>