Posted to dev@spark.apache.org by "assaf.mendelson" <as...@rsa.com> on 2016/11/17 08:19:31 UTC

structured streaming and window functions

Hi,
I have been trying to figure out how structured streaming handles window functions efficiently.
The part I understand is that whenever new data arrives, it is grouped by time and the aggregated data is added to the state.
However, unlike operations such as sum, window functions need the original data and can change when data arrives late.
So if I understand correctly, this would mean that we have to save the original data and rerun the window function over it every time new data arrives.
Is this correct? Are there ways to get around this issue?

Assaf.




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/structured-streaming-and-window-functions-tp19930.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: structured streaming and window functions

Posted by "HENSLEE, AUSTIN L" <ah...@att.com>.

Forgive the slight tangent…

For anyone following this thread who may be wondering about a quick, simple solution they can apply (with a walk-through) for more straightforward sessionization needs:

There’s a nice section on sessionization in “Advanced Analytics with Spark”, by Ryza, Laserson, Owen, and Wills (starts on p.167).

It worked very well for my job, which needed to take events, put them in time order, and calculate the time between them (that particular job’s definition of a “session”).

I used their groupByKeyAndSortValues() function.

As the authors state, “Work is progressing on Spark JIRA SPARK-3655 to add a transformation like this to core Spark.”
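
For readers who want to see the shape of that job, here is a minimal sketch in plain Python (not Spark, and not the book's groupByKeyAndSortValues() implementation). It groups events by key, sorts each group by time, and computes the gap between consecutive events. The function name, the (key, timestamp) event layout, and the sample data are assumptions made up for illustration.

```python
from collections import defaultdict


def gaps_between_events(events):
    """Group (key, timestamp) pairs by key, sort each group by time,
    and return the gaps between consecutive events for every key."""
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)

    gaps = {}
    for key, timestamps in by_key.items():
        timestamps.sort()  # events may arrive in any order
        gaps[key] = [later - earlier
                     for earlier, later in zip(timestamps, timestamps[1:])]
    return gaps


# Hypothetical events: (user, event time in seconds)
events = [("alice", 30), ("bob", 5), ("alice", 10), ("alice", 70), ("bob", 9)]
gaps = gaps_between_events(events)
print(gaps)
```

This version sorts each group in memory, which is fine for a sketch but is exactly the part that a distributed implementation has to handle more carefully for large groups.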

From: Ofir Manor <of...@equalum.io>
Date: Thursday, November 17, 2016 at 9:57 AM
To: "assaf.mendelson" <as...@rsa.com>
Cc: dev <de...@spark.apache.org>
Subject: Re: structured streaming and window functions

I agree with you. I think that once we have sessionization, we could aim for richer processing capabilities per session. As far as I imagine it, a session is an ordered sequence of data on which we could apply computation (like CEP).



Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io

On Thu, Nov 17, 2016 at 5:16 PM, assaf.mendelson <as...@rsa.com> wrote:
It is true that this is sessionizing, but I brought it up as an example of finding an ordered pattern in the data.
In general, using a simple window (e.g. 24 hours) in structured streaming is explained under grouping by time and is very clear.
What I was trying to figure out is how to handle streaming cases where you actually need some sorting to find patterns, especially when some of the data may come in late.
I was trying to figure out whether there is a plan to support this and, if so, what the performance implications would be.
Assaf.

From: Ofir Manor [via Apache Spark Developers List] [mailto:ml-node+[hidden email]]
Sent: Thursday, November 17, 2016 5:13 PM
To: Mendelson, Assaf
Subject: Re: structured streaming and window functions

Assaf, I think what you are describing is actually sessionizing by user, where a session is ended by a successful login event.
In each session, you want to count the number of failed login events.
If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816 (work has not started yet).


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: [hidden email]

On Thu, Nov 17, 2016 at 2:52 PM, assaf.mendelson <[hidden email]> wrote:
Is there a plan to support SQL window functions?
Here is an example use case: let’s say we have login logs. For each user, we want to add the number of failed logins to each successful login. How would you do that with structured streaming?
As this is currently not supported, is there a plan for how to support it in the future?
Assaf.

From: Herman van Hövell tot Westerflier-2 [via Apache Spark Developers List] [mailto:[hidden email]]
Sent: Thursday, November 17, 2016 1:27 PM
To: Mendelson, Assaf
Subject: Re: structured streaming and window functions

What kind of window functions are we talking about? Structured streaming only supports time window aggregates, not the more general SQL window function (sum(x) over (partition by ... order by ...)) aggregates.

The basic idea is that you use incremental aggregation and store the aggregation buffer (not the end result) in a state store after each increment. When a new batch comes in, you perform aggregation on that batch, merge the result of that aggregation with the buffer in the state store, update the state store, and return the new result.

This is much harder than it sounds, because you need to maintain state in a fault-tolerant way and you need some eviction policy (watermarks, for instance) for aggregation buffers to prevent the state store from growing without bound.

On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden email]> wrote:
Hi,
I have been trying to figure out how structured streaming handles window functions efficiently.
The part I understand is that whenever new data arrives, it is grouped by time and the aggregated data is added to the state.
However, unlike operations such as sum, window functions need the original data and can change when data arrives late.
So if I understand correctly, this would mean that we have to save the original data and rerun the window function over it every time new data arrives.
Is this correct? Are there ways to get around this issue?

Assaf.


Re: structured streaming and window functions

Posted by Ofir Manor <of...@equalum.io>.

I agree with you. I think that once we have sessionization, we could
aim for richer processing capabilities per session. As far as I imagine it, a
session is an ordered sequence of data on which we could apply computation
(like CEP).


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io


RE: structured streaming and window functions

Posted by "assaf.mendelson" <as...@rsa.com>.

It is true that this is sessionizing, but I brought it up as an example of finding an ordered pattern in the data.
In general, using a simple window (e.g. 24 hours) in structured streaming is explained under grouping by time and is very clear.
What I was trying to figure out is how to handle streaming cases where you actually need some sorting to find patterns, especially when some of the data may come in late.
I was trying to figure out whether there is a plan to support this and, if so, what the performance implications would be.
Assaf.
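
To make the "sorting plus late data" problem concrete, here is a minimal sketch in plain Python of one possible approach: buffer events and release them in time order only once a watermark has passed them. This is not something the thread says Spark does or plans to do. The class name, the allowed_lateness parameter, and the rule "watermark = largest event time seen minus allowed lateness" are all assumptions chosen for illustration.

```python
import heapq


class ReorderBuffer:
    """Holds events until the watermark passes them, then releases them
    in event-time order. Events older than the watermark are dropped."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None
        self.heap = []  # min-heap of (event_time, payload)

    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def add(self, event_time, payload):
        """Add one event and return the events that are now safe to emit."""
        watermark = self.watermark()
        if watermark is not None and event_time < watermark:
            return []  # too late: dropped in this sketch
        heapq.heappush(self.heap, (event_time, payload))
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self._release()

    def _release(self):
        ready = []
        watermark = self.watermark()
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready


buf = ReorderBuffer(allowed_lateness=10)
out = []
arrivals = [(100, "a"), (105, "b"), (98, "c"), (115, "d"), (101, "late"), (130, "e")]
for event_time, payload in arrivals:
    out.extend(buf.add(event_time, payload))
print(out)
```

The cost is visible even in this toy: every event has to be kept in memory until the watermark passes it, output is delayed by the allowed lateness, and anything later than that is lost. Those are the performance implications the question is about.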


Re: structured streaming and window functions

Posted by Ofir Manor <of...@equalum.io>.

Assaf, I think what you are describing is actually sessionizing by user,
where a session is ended by a successful login event.
In each session, you want to count the number of failed login events.
If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816
(work has not started yet).
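
As a reference point, the batch version of that computation is short. The sketch below is plain Python, not structured streaming. It treats each successful login as the end of a session and reports how many failed logins preceded it. The function name and the (user, timestamp, succeeded) event layout are assumptions for illustration.

```python
from collections import defaultdict


def failed_logins_per_success(events):
    """events: iterable of (user, timestamp, succeeded).
    Returns one (user, timestamp, failures_in_session) tuple per
    successful login, where a success closes the user's session."""
    by_user = defaultdict(list)
    for user, ts, succeeded in events:
        by_user[user].append((ts, succeeded))

    results = []
    for user, logins in by_user.items():
        failures = 0
        for ts, succeeded in sorted(logins):  # order by time within a user
            if succeeded:
                results.append((user, ts, failures))
                failures = 0  # the success ends the session
            else:
                failures += 1
    return results


# Hypothetical log; note that ann's event at t=4 arrives after t=5.
events = [
    ("ann", 1, False), ("bob", 2, True), ("ann", 3, False),
    ("ann", 5, True), ("ann", 4, False), ("ann", 9, True),
]
counts = failed_logins_per_success(events)
print(counts)
```

The out-of-order event at t=4 is counted correctly here only because all the data is available before sorting. In a stream, a result for the login at t=5 might already have been emitted when the t=4 event shows up, which is the late-data problem raised in this thread.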

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io


RE: structured streaming and window functions

Posted by "assaf.mendelson" <as...@rsa.com>.

Is there a plan to support SQL window functions?
Here is an example use case: let’s say we have login logs. For each user, we want to add the number of failed logins to each successful login. How would you do that with structured streaming?
As this is currently not supported, is there a plan for how to support it in the future?
Assaf.


Re: structured streaming and window functions

Posted by Herman van Hövell tot Westerflier <hv...@databricks.com>.

What kind of window functions are we talking about? Structured streaming
only supports time window aggregates, not the more general SQL window
function (sum(x) over (partition by ... order by ...)) aggregates.
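
The difference between the two kinds of "window" can be sketched in plain Python (not Spark). The first function buckets rows into fixed-width time windows and emits one sum per key and window. The second mimics sum(x) over (partition by key order by ts) as a running sum, emitting one value per input row. The function names, the (key, ts, x) row layout, and the window width are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_sums(rows, width):
    """Time window aggregate: one sum per (key, window_start).
    The order of rows inside a window does not matter."""
    sums = defaultdict(int)
    for key, ts, x in rows:
        window_start = ts - ts % width
        sums[(key, window_start)] += x
    return dict(sums)


def running_sums(rows):
    """Running sum per key in ts order, in the spirit of
    sum(x) over (partition by key order by ts): one output per row."""
    totals = defaultdict(int)
    out = []
    for key, ts, x in sorted(rows, key=lambda r: (r[0], r[1])):
        totals[key] += x
        out.append((key, ts, totals[key]))
    return out


rows = [("a", 12, 5), ("a", 3, 1), ("b", 7, 2), ("a", 8, 4)]
by_window = tumbling_window_sums(rows, width=10)
running = running_sums(rows)
print(by_window)
print(running)
```

A late row only changes the single bucket it falls into in the first function, whereas in the second it changes the output of every later row in the same partition. That is one way to see why the general form is harder to maintain incrementally.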

The basic idea is that you use incremental aggregation and store the
aggregation buffer (not the end result) in a state store after each
increment. When a new batch comes in, you perform aggregation on that
batch, merge the result of that aggregation with the buffer in the state
store, update the state store, and return the new result.
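
The steps above can be sketched in a few lines of plain Python. This is a toy model, not Spark's implementation: the state store is an ordinary dict, and the aggregate is an average so that the buffer (sum, count) is visibly different from the end result. All names are made up for illustration.

```python
def aggregate_batch(batch):
    """Partial aggregation of one batch: key -> (sum, count)."""
    partial = {}
    for key, value in batch:
        s, c = partial.get(key, (0, 0))
        partial[key] = (s + value, c + 1)
    return partial


def merge_into_state(state, partial):
    """Merge the batch's partial buffers into the state store and return
    the new end results (averages) for the keys that changed."""
    updated = {}
    for key, (s, c) in partial.items():
        old_s, old_c = state.get(key, (0, 0))
        state[key] = (old_s + s, old_c + c)          # store the buffer
        updated[key] = state[key][0] / state[key][1]  # derive the result
    return updated


state = {}  # stand-in for a fault-tolerant state store
first = merge_into_state(state, aggregate_batch([("a", 2), ("a", 4), ("b", 10)]))
second = merge_into_state(state, aggregate_batch([("a", 6)]))
print(first, second, state)
```

Storing (sum, count) rather than the average is what makes the merge possible: two averages cannot be combined correctly without their counts.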

This is much harder than it sounds, because you need to maintain state in a
fault-tolerant way and you need some eviction policy (watermarks,
for instance) for aggregation buffers to prevent the state store from
growing without bound.
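
A watermark-based eviction policy can be illustrated with a small plain-Python toy. It is a simplified model and not Spark's actual semantics: it keeps one count per time window, advances a watermark to the largest event time seen minus an allowed lateness, and at the end of each batch finalizes and removes windows that end at or before the watermark. The class name and parameters are assumptions for illustration, and fault tolerance is ignored entirely.

```python
class WindowedCounts:
    """Counts events per tumbling window and evicts windows whose end
    is at or before the watermark."""

    def __init__(self, window_width, allowed_lateness):
        self.window_width = window_width
        self.allowed_lateness = allowed_lateness
        self.state = {}  # window_start -> count
        self.max_event_time = 0

    def process_batch(self, event_times):
        """Update state with one batch and return the finalized windows."""
        # Watermark computed from earlier batches decides what is too late.
        watermark = self.max_event_time - self.allowed_lateness
        for ts in event_times:
            window_start = ts - ts % self.window_width
            if window_start + self.window_width <= watermark:
                continue  # window already evicted: drop the late event
            self.state[window_start] = self.state.get(window_start, 0) + 1
            self.max_event_time = max(self.max_event_time, ts)

        # Advance the watermark, then evict every window it has passed.
        watermark = self.max_event_time - self.allowed_lateness
        finalized = {w: c for w, c in self.state.items()
                     if w + self.window_width <= watermark}
        for w in finalized:
            del self.state[w]
        return finalized


counts = WindowedCounts(window_width=10, allowed_lateness=5)
r1 = counts.process_batch([1, 4, 12])  # nothing old enough to evict yet
r2 = counts.process_batch([8, 27])     # 8 is late but still accepted
r3 = counts.process_batch([9, 25])     # 9 is too late and is dropped
print(r1, r2, r3, counts.state)
```

The state stays bounded because each window is held only until the watermark passes it. The trade-off is that data arriving later than the allowed lateness is silently ignored.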
