Posted to user@spark.apache.org by Chris Fregly <ch...@fregly.com> on 2014/06/30 03:38:15 UTC

Re: Selecting first ten values in a RDD/partition

as brian g alluded to earlier, you can use DStream.mapPartitions() to
return the partition-local top 10 for each partition.  once you collect
the results from all the partitions, you can do a global top-10 merge
across them.

this leads to a much smaller dataset being shuffled back to the driver
to calculate the global top 10.
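
here's a minimal sketch of that in scala, assuming the per-window counts
are already available as an RDD[(String, Long)] of (hashtag, count) pairs
(the function and variable names are illustrative, not from the thread):

    import org.apache.spark.rdd.RDD

    // per-partition pruning followed by a small global merge.
    // each partition emits at most 10 candidates, so only
    // numPartitions * 10 records are ever sent to the driver.
    def top10Hashtags(counts: RDD[(String, Long)]): Array[(String, Long)] = {
      val candidates = counts.mapPartitions { iter =>
        iter.toArray.sortBy(-_._2).take(10).iterator
      }
      // global merge of the surviving candidates at the driver
      candidates.collect().sortBy(-_._2).take(10)
    }

note that RDD.top(10)(Ordering.by(_._2)) performs the same bounded
per-partition selection internally, so it can stand in for the hand-rolled
mapPartitions version.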


On Fri, May 30, 2014 at 5:05 AM, nilmish <ni...@gmail.com> wrote:

> My primary goal: to get the top 10 hashtags for every 5-minute interval.
>
> I want to do this efficiently. I have already done this by using
> reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute
> interval, keeping only the top 10 elements (see the sketch after this
> message). But this is very slow.
>
> So now I am thinking of retaining only the top 10 hashtags in each RDD,
> because only these could appear in the final answer.
>
> I am stuck at: how do I retain only the top 10 hashtags in each RDD of my
> DStream? Basically, I need to transform my DStream so that each RDD contains
> only the top 10 hashtags, keeping the number of hashtags per 5-minute
> interval low.
>
> If there is some more efficient way of doing this, please let me know that
> as well.
>
> Thanx,
> Nilesh
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6577.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
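
A sketch of the setup described in the quoted message, with the
partition-local pruning from the reply applied once per window. The socket
source, batch interval, and printed output are placeholder assumptions, not
details from the original thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val conf = new SparkConf().setAppName("TopHashtags")
    val ssc = new StreamingContext(conf, Seconds(10))

    // hypothetical source: whitespace-separated tokens arriving on a socket
    val hashtags = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .filter(_.startsWith("#"))

    // count each hashtag over a sliding 5-minute window
    val counts = hashtags.map((_, 1L)).reduceByKeyAndWindow(_ + _, Minutes(5))

    // partition-local top 10 via DStream.mapPartitions, as in the reply
    val candidates = counts.mapPartitions(_.toArray.sortBy(-_._2).take(10).iterator)

    // the global merge now touches at most numPartitions * 10 records
    candidates.foreachRDD { rdd =>
      rdd.collect().sortBy(-_._2).take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()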