Posted to user@spark.apache.org by alessandro finamore <al...@polito.it> on 2014/07/04 15:46:26 UTC

window analysis with Spark and Spark streaming

Hi,

I have a large dataset of text log files on which I need to implement
"window analysis": say, extract per-minute data and compute aggregated
stats on the last X minutes.

I have to implement the windowing analysis with Spark.
This is the workflow I'm currently using (rough sketch below):
- read a file and create a new RDD with per-minute info
- loop on each new minute and integrate its data into another data
structure containing the last X minutes of data
- apply the analysis to the updated window of data
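
In code, roughly (the path, log format, and stats below are placeholder
assumptions, not my real pipeline):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of the workflow above; assumes epoch seconds
// in column 0 and uses simple counts as the per-minute "info".
object PerMinuteBinning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("PerMinuteBinning"))

    // 1) Read a file and build per-minute info.
    val perMinute = sc.textFile("/data/logs/part-00000.log")
      .map { line =>
        val epochSeconds = line.split(' ')(0).toLong
        (epochSeconds / 60, 1L) // key = minute bin
      }
      .reduceByKey(_ + _)

    // 2) Loop on each new minute, integrating it into a driver-side
    //    structure holding the last X minutes of data.
    val X = 10
    var window = Vector.empty[(Long, Long)]
    for ((minute, count) <- perMinute.collect().sortBy(_._1)) {
      window = (window :+ ((minute, count))).takeRight(X)
      // 3) Apply the analysis to the updated window.
      println(s"minute=$minute, events in last $X minutes: ${window.map(_._2).sum}")
    }
    sc.stop()
  }
}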

This works, but it suffers from limited parallelism.
Do you have any recommendations/suggestions for a better implementation?
Also, are there any recommended data collections for managing the window?
(I'm simply using Arrays for managing the data.)

While working on this I started to investigate Spark Streaming.
The problem is that I don't know if it is really possible to use it on
already collected data.
This post seems to indicate that it should be, but it is not clear to me how:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-windowing-Driven-by-absolutely-time-td1733.html

Thanks






Re: window analysis with Spark and Spark streaming

Posted by Laeeq Ahmed <la...@yahoo.com>.
Hi,

For a queue-of-RDDs stream (QueueStream), have a look here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala
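
In the spirit of that example, a minimal sketch (the data, batch
interval, and window sizes are placeholders): pre-collected RDDs are
pushed into a queue, and the resulting DStream can use the standard
windowing operations.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("QueueStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each RDD pushed into this queue becomes one batch of the stream.
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(rddQueue)

    // Standard windowed aggregation over the simulated stream:
    // stats over the last 10 batches, sliding every batch.
    stream.map(x => (x % 10, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(1))
      .print()

    ssc.start()
    // Replay already-collected data, spaced by (simulated) time.
    for (_ <- 1 to 30) {
      rddQueue.synchronized { rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10) }
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}

The point is that batch time is driven by when RDDs are pushed into
the queue, not by any timestamps inside the data.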

Regards,
Laeeq,
PhD candidate,
KTH, Stockholm.

 



Re: window analysis with Spark and Spark streaming

Posted by alessandro finamore <al...@polito.it>.
On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List]
<ml...@n3.nabble.com> wrote:
> Key idea is to simulate your app time as you enter data . So you can connect
> spark streaming to a queue and insert data in it spaced by time. Easier said
> than done :).

I see.
I'll try to implement this solution as well, so that I can compare it
with my current Spark implementation.
I'm interested in seeing if it is faster... as I assume it should be :)

> What are the parallelism issues you are hitting with your
> static approach.

In my current Spark implementation, whenever I need to get the
aggregated stats over the window, I'm re-mapping all the current bins
to have the same key so that they can be reduced all together.
This means that the data needs to be shipped to a single reducer.
As a result, adding nodes/cores to the application does not really
affect the total time :(
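
Boiled down, the pattern looks like this (the bin contents and stats
are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of the bottleneck described above: every bin in
// the window is re-keyed to one constant key, so the final reduce for
// that key runs as a single task, regardless of cluster size.
object SingleReducerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("SingleReducerSketch"))

    // (minuteBin, perMinuteStats) pairs spread over many partitions.
    val bins = sc.parallelize((1L to 60L).map(m => (m, m * 100)), 8)

    val windowStats = bins
      .map { case (_, stats) => ("window", stats) } // same key for all bins
      .reduceByKey(_ + _)                           // one key -> one reducer

    windowStats.collect().foreach(println)
    sc.stop()
  }
}

Note that reduceByKey does pre-aggregate map-side, so for simple sums
the single reducer only merges one partial per partition; the pattern
hurts most when the merged window value itself is large (e.g.
collections of records rather than scalar stats).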




-- 
--------------------------------------------------
Alessandro Finamore, PhD
Politecnico di Torino
--
Office:    +39 0115644127
Mobile:   +39 3280251485
SkypeId: alessandro.finamore
---------------------------------------------------





Re: window analysis with Spark and Spark streaming

Posted by Mayur Rustagi <ma...@gmail.com>.
The key idea is to simulate your app time as you enter data. So you can
connect Spark Streaming to a queue and insert data into it spaced by time.
Easier said than done :). What are the parallelism issues you are hitting
with your static approach?
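
For instance, a hypothetical sketch of this queue idea, which also
speaks to the earlier "window of N files" question: if each file is
pushed as exactly one batch (oneAtATime), a window of N files becomes
a window of N batch intervals. File names and sizes below are made up.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileWindowSketch")
    val batchInterval = Seconds(1)
    val ssc = new StreamingContext(conf, batchInterval)

    // One RDD pushed per file; oneAtATime = true dequeues exactly one
    // RDD per batch, so one file == one batch.
    val fileQueue = new mutable.Queue[RDD[String]]()
    val stream = ssc.queueStream(fileQueue, oneAtATime = true)

    // A "window of N files" becomes a window of N batch intervals.
    val n = 10
    stream.window(batchInterval * n, batchInterval).count().print()

    ssc.start()
    for (path <- Seq("min-001.log", "min-002.log")) { // hypothetical files
      fileQueue.synchronized { fileQueue += ssc.sparkContext.textFile(path) }
    }
    ssc.awaitTermination()
  }
}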



-- 
Sent from Gmail Mobile

Re: window analysis with Spark and Spark streaming

Posted by M Singh <ma...@yahoo.com>.
The windowing capabilities of Spark Streaming determine which events go into the RDD created for a given time window. If the batch duration is 1s, then all the events received in a particular 1s window will be part of the RDD created for that window of the stream.
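
A minimal sketch (assuming a socket source; the durations are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchWindowSketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1s batch duration

    // Each 1s batch is one RDD; window() re-groups the last 60 batches,
    // sliding forward every 10s.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.window(Seconds(60), Seconds(10)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}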




Re: window analysis with Spark and Spark streaming

Posted by alessandro finamore <al...@polito.it>.
Thanks for the replies

What is not completely clear to me is how time is managed.
I can create a DStream from files.
But if I set the window property, it will be bound to the application
time, right?

If I got it right, with a receiver I can control the way DStreams are
created.
But how can I then apply the windowing already shipped with the
framework, if it is bound to the "application time"?
I would like to define a window of N files, but the window() function
requires a duration as input...





Re: window analysis with Spark and Spark streaming

Posted by M Singh <ma...@yahoo.com>.
Another alternative could be to use Spark Streaming's textFileStream together with the windowing capabilities.
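
A minimal sketch of that approach (the directory and durations are
placeholders); textFileStream picks up files newly moved into the
directory, each becoming part of the batch in which it appears:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TextFileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TextFileStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(60)) // one batch per minute

    // Files atomically moved into this directory are picked up as they
    // appear; here we keep stats over the last 10 minutes of batches.
    val lines = ssc.textFileStream("/data/incoming") // hypothetical path
    lines.window(Seconds(600), Seconds(60)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}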




Re: window analysis with Spark and Spark streaming

Posted by Gianluca Privitera <gi...@studio.unibo.it>.
You should think about a custom receiver in order to solve the problem of the “already collected” data:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
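
A minimal sketch of such a receiver (the path and pacing are
hypothetical assumptions): it replays already-collected files on a
background thread, pushing their lines into the stream at a controlled
rate.

import java.io.File
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that replays already-collected log files in
// name order, one file per second, so "application time" advances at
// a pace we control rather than at wall-clock arrival time.
class ReplayFileReceiver(dir: String)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    // Run the replay on a background thread so onStart() returns quickly.
    new Thread("replay-thread") {
      override def run(): Unit = replay()
    }.start()
  }

  def onStop(): Unit = {} // replay() polls isStopped() and exits

  private def replay(): Unit = {
    val files = new File(dir).listFiles().sortBy(_.getName)
    for (file <- files if !isStopped()) {
      val src = Source.fromFile(file)
      try src.getLines().foreach(store) // push each line into the stream
      finally src.close()
      Thread.sleep(1000) // hypothetical pacing: one file per second
    }
  }
}
// Usage sketch: ssc.receiverStream(new ReplayFileReceiver("/data/logs"))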

Gianluca
