Posted to user@spark.apache.org by pankaj <pa...@gmail.com> on 2014/12/01 16:31:40 UTC

Time based aggregation in Real time Spark Streaming

Hi,

My incoming messages have a timestamp as one field, and I have to perform
aggregation over 3-minute time slices.

Message sample

"Item ID" "Item Type" "timeStamp"
1                  X               1-12-2014:12:01
1                  X               1-12-2014:12:02
1                  X               1-12-2014:12:03
1                  y               1-12-2014:12:04
1                  y               1-12-2014:12:05
1                  y               1-12-2014:12:06

Aggregation Result
ItemId  ItemType  count  aggregationStartTime  aggrEndTime
1       X         3      1-12-2014:12:01       1-12-2014:12:03
1       y         3      1-12-2014:12:04       1-12-2014:12:06

What is the best way to perform time-based aggregation in Spark?
Kindly suggest.

Thanks




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Time-based-aggregation-in-Real-time-Spark-Streaming-tp20102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Time based aggregation in Real time Spark Streaming

Posted by pankaj <pa...@gmail.com>.
Hi,

Suppose I keep a batch size of 3 minutes. In one batch there can be incoming
records with any timestamp, so it is difficult to keep track of when the
3-minute interval starts and ends. I am doing the output operation on the
worker nodes in foreachPartition, not in the driver (foreachRDD), so I cannot
use any shared variable to store the start/end time, because shared variables
like accumulators are write-only within tasks.

Is there any solution to this?
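(One workaround, sketched here as an editorial aside rather than taken from the thread, is to derive the interval boundaries from each record's own timestamp, so nothing needs to be tracked in shared state. A minimal plain-Python illustration, assuming the day-month-year:hour:minute format of the samples above:)

```python
from datetime import datetime, timedelta

def window_bounds(ts_str, width_minutes=3):
    """Derive the 3-minute window containing ts_str from the timestamp
    itself; no shared start/end variable is needed on the workers.

    window_bounds is a hypothetical helper; the timestamp format
    "1-12-2014:12:01" (day-month-year:hour:minute) is taken from the
    message samples.
    """
    ts = datetime.strptime(ts_str, "%d-%m-%Y:%H:%M")
    # Floor the minute to a multiple of the window width. This aligns
    # windows to the wall clock (12:00-12:02, 12:03-12:05, ...), which is
    # one possible choice; a different offset would reproduce the
    # 12:01-12:03 grouping shown in the sample output.
    start = ts.replace(minute=ts.minute - ts.minute % width_minutes)
    end = start + timedelta(minutes=width_minutes - 1)
    return start, end
```

(Every task can compute the same start/end pair for a given record independently, so the pair can double as part of the aggregation key.)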

Thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Time-based-aggregation-in-Real-time-Spark-Streaming-tp20102p20111.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Time based aggregation in Real time Spark Streaming

Posted by Bahubali Jain <ba...@gmail.com>.
Hi,
You can associate all the messages of a 3-minute interval with a unique key,
then group by that key and finally add up the counts.
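(A sketch of this suggestion in plain Python rather than Spark, with assumed field names and wall-clock window alignment; in Spark Streaming the same composite key would be produced in a map over each batch and fed to reduceByKey or groupByKey:)

```python
from collections import defaultdict
from datetime import datetime

def aggregate_by_window(records, width_minutes=3):
    """Key each record by (itemId, itemType, window start), then count and
    track the first/last timestamp seen per group. Field names and the
    wall-clock alignment of the windows are assumptions for illustration."""
    groups = defaultdict(list)
    for item_id, item_type, ts_str in records:
        ts = datetime.strptime(ts_str, "%d-%m-%Y:%H:%M")
        # Floor the minute to a multiple of the window width: the unique key.
        window_start = ts.replace(minute=ts.minute - ts.minute % width_minutes)
        groups[(item_id, item_type, window_start)].append(ts)
    # One result row per key: itemId, itemType, count, start, end.
    return [
        (item_id, item_type, len(times), min(times), max(times))
        for (item_id, item_type, _), times in sorted(groups.items())
    ]
```

(Because the key is computed from the record alone, the whole aggregation can run on the workers with no shared state, which also answers the follow-up question about accumulators.)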

Thanks
On Dec 1, 2014 9:02 PM, "pankaj" <pa...@gmail.com> wrote:

> Hi,
>
> My incoming messages have a timestamp as one field, and I have to perform
> aggregation over 3-minute time slices.
>
> Message sample
>
> "Item ID" "Item Type" "timeStamp"
> 1                  X               1-12-2014:12:01
> 1                  X               1-12-2014:12:02
> 1                  X               1-12-2014:12:03
> 1                  y               1-12-2014:12:04
> 1                  y               1-12-2014:12:05
> 1                  y               1-12-2014:12:06
>
> Aggregation Result
> ItemId  ItemType  count  aggregationStartTime  aggrEndTime
> 1       X         3      1-12-2014:12:01       1-12-2014:12:03
> 1       y         3      1-12-2014:12:04       1-12-2014:12:06
>
> What is the best way to perform time-based aggregation in Spark?
> Kindly suggest.
>
> Thanks
>
> ------------------------------
> View this message in context: Time based aggregation in Real time Spark
> Streaming
> <http://apache-spark-user-list.1001560.n3.nabble.com/Time-based-aggregation-in-Real-time-Spark-Streaming-tp20102.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>