Posted to user@spark.apache.org by alessandro finamore <al...@polito.it> on 2014/07/04 15:46:26 UTC
window analysis with Spark and Spark streaming
Hi,
I have a large dataset of text log files on which I need to implement
"window analysis": say, extract per-minute data and compute aggregated
stats over the last X minutes.
I have to implement this windowing analysis with Spark.
This is the workflow I'm currently using:
- read a file and create a new RDD with per-minute info
- loop over each new minute and merge its data into another data structure
holding the last X minutes of data
- apply the analysis to the updated window of data
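The loop described above can be sketched in plain Python (the function names and the 5-minute window are illustrative; the actual implementation operates on RDDs):

```python
from collections import deque

WINDOW_MINUTES = 5  # the "last X minutes"; 5 is chosen purely for illustration

def per_minute_bins(records):
    """Aggregate (minute, value) records into per-minute totals
    (a stand-in for the per-minute RDD step)."""
    bins = {}
    for minute, value in records:
        bins[minute] = bins.get(minute, 0) + value
    return sorted(bins.items())

def windowed_stats(bins):
    """Slide a fixed-size window over the minutes and emit one aggregate per step."""
    window = deque(maxlen=WINDOW_MINUTES)  # minutes older than X fall off automatically
    stats = []
    for minute, total in bins:
        window.append(total)
        stats.append((minute, sum(window)))  # aggregate over the last X minutes
    return stats
```

A deque with `maxlen` may also be a better fit than plain Arrays for the window itself, since eviction of the oldest minute is automatic.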
This works, but it suffers from limited parallelism.
Do you have any recommendations/suggestions for a better implementation?
Also, are there any recommended data collections for managing the window?
(I'm simply using Arrays for managing the data.)
While working on this I started to investigate Spark Streaming.
The problem is that I don't know if it is really possible to use it on
already collected data.
This post seems to indicate that it should be, but it is not clear to me how:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-windowing-Driven-by-absolutely-time-td1733.html
Thanks
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/window-analysis-with-Spark-and-Spark-streaming-tp8806.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: window analysis with Spark and Spark streaming
Posted by Laeeq Ahmed <la...@yahoo.com>.
Hi,
For QueueRDD, have a look here.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala
Regards,
Laeeq,
PhD candidate,
KTH, Stockholm.
Re: window analysis with Spark and Spark streaming
Posted by alessandro finamore <al...@polito.it>.
On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List]
<ml...@n3.nabble.com> wrote:
> Key idea is to simulate your app time as you enter data . So you can connect
> spark streaming to a queue and insert data in it spaced by time. Easier said
> than done :).
I see.
I'll also try to implement this solution so that I can compare it with
my current Spark implementation.
I'm interested in seeing if it is faster... as I assume it should be :)
> What are the parallelism issues you are hitting with your
> static approach.
In my current Spark implementation, whenever I need to get the
aggregated stats over the window, I re-map all the current bins
to the same key so that they can be reduced together.
This means that the data needs to be shipped to a single reducer.
As a result, adding nodes/cores to the application does not really
affect the total time :(
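One way to relieve that single-reducer bottleneck is to combine partial aggregates per partition before the final merge (a treeReduce-style pattern). A plain-Python sketch of the idea, with hypothetical names:

```python
from functools import reduce

def partial_sums(partitions):
    """Reduce each partition locally first; on a cluster this part runs in parallel."""
    return [sum(part) for part in partitions]

def window_total(partitions):
    """Only the short list of per-partition partials reaches the final single reducer."""
    return reduce(lambda a, b: a + b, partial_sums(partitions), 0)
```

In Spark terms this corresponds to aggregating within each bin/partition first, so that only small partial results are shuffled, instead of re-keying all records to one key.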
--
--------------------------------------------------
Alessandro Finamore, PhD
Politecnico di Torino
--
Office: +39 0115644127
Mobile: +39 3280251485
SkypeId: alessandro.finamore
---------------------------------------------------
Re: window analysis with Spark and Spark streaming
Posted by Mayur Rustagi <ma...@gmail.com>.
The key idea is to simulate your app time as you feed in the data. So you can
connect Spark Streaming to a queue and insert data into it spaced by time.
Easier said than done :). What are the parallelism issues you are hitting
with your static approach?
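A minimal sketch of that replay idea in plain Python: feed archived, timestamped records into a queue, spaced according to (time-compressed) event time. The function name and the speedup factor are illustrative, not part of any Spark API:

```python
import queue
import threading
import time

def replay(records, q, speedup=60.0):
    """Feed (timestamp, payload) records into a queue, spaced by event time.

    `speedup` compresses the log's time axis (here one minute of log time
    becomes one second of wall time).
    """
    prev = None
    for ts, payload in records:
        if prev is not None:
            time.sleep((ts - prev) / speedup)
        q.put((ts, payload))
        prev = ts

q = queue.Queue()
records = [(0, "a"), (60, "b"), (120, "c")]  # timestamps in seconds of "log time"
threading.Thread(target=replay, args=(records, q), daemon=True).start()
```

In a real setup, the consumer side of the queue would be a Spark Streaming receiver (or a queueStream), so the framework's own windowing then operates on the simulated timeline.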
--
Sent from Gmail Mobile
Re: window analysis with Spark and Spark streaming
Posted by M Singh <ma...@yahoo.com>.
The windowing capabilities of Spark Streaming determine which events end up in the RDD created for a given time window. If the batch duration is 1s, then all the events received in a particular 1s window will be part of the RDD created for that window on that stream.
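That batching behaviour can be illustrated by bucketing timestamped events into fixed-duration intervals (a plain-Python stand-in for what the framework does per batch; names are illustrative):

```python
from collections import defaultdict

def batch_by_interval(events, interval=1.0):
    """Assign each (timestamp, event) pair to the fixed-duration batch it falls in.

    Every event received within the same `interval`-second window ends up in
    the same batch, mirroring how each batch becomes one RDD of the stream.
    """
    batches = defaultdict(list)
    for ts, ev in events:
        batches[int(ts // interval)].append(ev)
    return dict(batches)
```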
Re: window analysis with Spark and Spark streaming
Posted by alessandro finamore <al...@polito.it>.
Thanks for the replies.
What is not completely clear to me is how time is managed.
I can create a DStream from a file.
But if I set the window property, it will be bound to the application
time, right?
If I got it right, with a receiver I can control the way DStreams are
created.
But how can I then apply the windowing already shipped with the framework
if it is bound to the "application time"?
I would like to define a window of N files, but the window() function
requires a duration as input...
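Since the built-in window() takes a duration, a count-based window over the last N files would have to be maintained by hand. A plain-Python sketch of the idea (class and method names are hypothetical):

```python
from collections import deque

class FileWindow:
    """Hold the parsed contents of the last N files and aggregate over them."""

    def __init__(self, n):
        self.files = deque(maxlen=n)  # the oldest file is evicted automatically

    def add(self, parsed_file):
        self.files.append(parsed_file)

    def aggregate(self):
        # stand-in for the real per-window statistics
        return sum(len(f) for f in self.files)
```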
Re: window analysis with Spark and Spark streaming
Posted by M Singh <ma...@yahoo.com>.
Another alternative could be to use Spark Streaming's textFileStream together with its windowing capabilities.
Re: window analysis with Spark and Spark streaming
Posted by Gianluca Privitera <gi...@studio.unibo.it>.
You should think about a custom receiver in order to solve the problem of the “already collected” data.
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
Gianluca
On 04 Jul 2014, at 15:46, alessandro finamore <al...@polito.it> wrote:
[original message quoted above]