Posted to user@spark.apache.org by Hoai-Thu Vuong <th...@gmail.com> on 2014/12/29 09:00:50 UTC

word count aggregation

Dear Spark users,

I have a program that streams a folder. When a new file is created in this
folder, I count the words that appear in it and update the running counts
(I used StatefulNetworkWordCount to do this), and it works like a charm.
However, I would like to know how the top 10 words now differ from the
top 10 words one hour ago. How can I do this? I tried using
windowDuration, but it does not seem to work.

Re: word count aggregation

Posted by Tathagata Das <ta...@gmail.com>.

For windows that large (1 hour), you will probably also have to
increase the batch interval for efficiency.

TD
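
To make the trade-off concrete: a windowed operation has to cover every
batch that falls inside the window, so the number of batches per window is
the window length divided by the batch interval. The sketch below is plain
Python arithmetic, not Spark code, and the helper name batches_per_window
is made up for illustration. It also rejects a window that is not a whole
multiple of the batch interval, since Spark Streaming requires window and
slide durations to be multiples of the batch interval.

```python
def batches_per_window(window_seconds, batch_seconds):
    """Return how many batches make up one window.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    if batch_seconds <= 0:
        raise ValueError("batch interval must be positive")
    if window_seconds % batch_seconds != 0:
        raise ValueError("window must be a whole multiple of the batch interval")
    return window_seconds // batch_seconds


one_hour = 60 * 60
# Compare a few candidate batch intervals for a one-hour window.
for batch in (1, 10, 60):
    print(batch, "s batches ->", batches_per_window(one_hour, batch), "per window")
```

With a 1-second batch interval, a one-hour window spans 3600 batches; with
a 60-second interval, it spans 60.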

On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> You can use reduceByKeyAndWindow for that. Here's a pretty clean example
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala
>
> Thanks
> Best Regards

Re: word count aggregation

Posted by Akhil Das <ak...@sigmoidanalytics.com>.

You can use reduceByKeyAndWindow for that. Here's a pretty clean example
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala

Thanks
Best Regards
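
As a rough illustration of what a windowed count computes, here is a
minimal sketch in plain Python. It is not Spark code and not the linked
example: the names SlidingWordCount and top_diff are hypothetical, and the
window is measured in batches rather than in time. It keeps running totals
by adding each new batch and subtracting the batch that falls out of the
window, which is the same idea as passing both a reduce function and an
inverse reduce function to reduceByKeyAndWindow. Comparing the top words
"now" with the top words from an earlier window then reduces to comparing
two saved lists.

```python
from collections import Counter, deque


class SlidingWordCount:
    """Word counts over the most recent `window_batches` batches (illustrative)."""

    def __init__(self, window_batches):
        self.window_batches = window_batches
        self.batches = deque()   # per-batch Counters currently inside the window
        self.totals = Counter()  # running totals for the whole window

    def add_batch(self, words):
        batch = Counter(words)
        self.batches.append(batch)
        self.totals.update(batch)        # "reduce": add the new batch
        if len(self.batches) > self.window_batches:
            old = self.batches.popleft()
            self.totals.subtract(old)    # "inverse reduce": remove the expired batch
            self.totals = +self.totals   # keep only words with a positive count

    def top(self, n=10):
        return [word for word, _ in self.totals.most_common(n)]


def top_diff(now, before):
    """Return (words that entered the top list, words that dropped out)."""
    entered = [w for w in now if w not in before]
    dropped = [w for w in before if w not in now]
    return entered, dropped


# Tiny demo: a window of 2 batches and a top-2 list instead of top-10.
counter = SlidingWordCount(window_batches=2)
counter.add_batch(["spark", "spark", "storm"])
counter.add_batch(["spark", "kafka", "kafka"])
earlier = counter.top(2)                 # snapshot of the earlier window

counter.add_batch(["kafka", "kafka", "flink"])
counter.add_batch(["flink", "flink", "kafka"])
later = counter.top(2)                   # snapshot of the current window

entered, dropped = top_diff(later, earlier)
print("entered:", entered, "dropped:", dropped)
```

In a real streaming job the earlier snapshot would have to be stored
somewhere between windows; how to do that depends on the application and
is left out here.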
