Posted to user@spark.apache.org by SK <sk...@gmail.com> on 2014/11/14 02:28:11 UTC

Streaming: getting total count over all windows

Hi,

I am using the following code to generate the (score, count) for each
window:

// r._2 is the integer score
val score_count_by_window = topic.map(r => r._2).countByValue()

score_count_by_window.print()

E.g., the output for a window is as follows, which means that within the
DStream for that window there are 2 records with score 0, 3 with score 1, and
1 with score -1:
(0, 2)
(1, 3)
(-1, 1)

I would like to get the aggregate count for each score over all windows, from
program start until termination. I tried countByValueAndWindow(), but the
result is the same as countByValue() (i.e., it produces only per-window
counts). reduceByWindow also does not produce the result I am expecting. What
is the correct way to sum up the counts over multiple windows?

thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-getting-total-count-over-all-windows-tp18888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Streaming: getting total count over all windows

Posted by Mayur Rustagi <ma...@gmail.com>.
If you want counts from the beginning of the stream until the program ends,
the interface is updateStateByKey; if you only need them over a particular set
of windows, you can construct broader windows from smaller windows/batches.
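
A minimal sketch of the updateStateByKey approach (assuming ssc is your
StreamingContext; the checkpoint path is illustrative, and stateful operations
require checkpointing to be enabled):

    ssc.checkpoint("/tmp/streaming-checkpoint")  // required by updateStateByKey

    // Running total per score across all batches since the program started.
    val running_counts = topic.map(r => (r._2, 1L))
      .updateStateByKey[Long] { (newCounts: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newCounts.sum)
      }

    running_counts.print()

For the broader-window case, countByValueAndWindow takes explicit window and
slide durations (the values below are illustrative; both must be multiples of
the batch interval). Note that if the window duration equals the batch
interval, it degenerates to per-batch counts, which may explain the behavior
described in the original post:

    import org.apache.spark.streaming.{Minutes, Seconds}

    val counts_last_5_min = topic.map(r => r._2)
      .countByValueAndWindow(Minutes(5), Seconds(10))

    counts_last_5_min.print()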

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Fri, Nov 14, 2014 at 9:17 AM, jay vyas <ja...@gmail.com>
wrote:

> I would think this should be done at the application level.
> After all, the core functionality of Spark Streaming is to capture RDDs at
> some real-time interval and process them,
> not to aggregate their results.
>
> But maybe there is a better way...

Re: Streaming: getting total count over all windows

Posted by jay vyas <ja...@gmail.com>.
I would think this should be done at the application level.
After all, the core functionality of Spark Streaming is to capture RDDs at
some real-time interval and process them,
not to aggregate their results.

But maybe there is a better way...
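
A sketch of that application-level approach, accumulating each window's counts
into a driver-side map with foreachRDD (score_count_by_window is the DStream
from the original post; collect() is only reasonable here because the number
of distinct scores is small):

    import scala.collection.mutable

    // Driver-side running totals, updated once per window.
    val totals = mutable.Map[Int, Long]().withDefaultValue(0L)

    score_count_by_window.foreachRDD { rdd =>
      rdd.collect().foreach { case (score, count) =>
        totals(score) += count
      }
      println(s"Totals so far: $totals")
    }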



-- 
jay vyas