Posted to user@spark.apache.org by nilmish <ni...@gmail.com> on 2014/05/29 14:04:58 UTC

Selecting first ten values in a RDD/partition

I have a DStream whose RDDs arrive every 2 sec. I have sorted each RDD and
want to retain only the top 10 values and discard the rest. How can I retain
only the top 10 values?

I am trying to get the top 10 hashtags. Instead of sorting the entire set of
5-minute counts (thereby incurring the cost of a data shuffle), I am trying
to get the top 10 hashtags in each partition. I am stuck on how to retain the
top 10 hashtags in each partition.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Selecting first ten values in a RDD/partition

Posted by Gerard Maas <ge...@gmail.com>.
DStream has a helper method, print(), that prints the first 10 elements of
each RDD. You could take some inspiration from it, as the use case is
practically the same and the code will probably be very similar:  rdd.take(10)...

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala#L591

-kr, Gerard.
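To make that concrete, here is a Spark-free sketch of the sort-then-take idea (the object and data names are illustrative, not from this thread):

```scala
// Sketch of the per-RDD "sort then take(10)" idea on plain (tag, count)
// pairs. In Spark this would run once per batch RDD, e.g.
//   rdd.sortBy { case (_, c) => -c }.take(10)
// much as DStream.print() internally calls rdd.take(...).
object Top10Sketch {
  def top10(counts: Seq[(String, Int)]): Seq[(String, Int)] =
    counts.sortBy { case (_, c) => -c }.take(10)

  def main(args: Array[String]): Unit =
    println(top10(Seq(("#spark", 42), ("#scala", 17), ("#flink", 5))))
}
```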




On Thu, May 29, 2014 at 10:08 PM, Brian Gawalt <bg...@gmail.com> wrote:

> Try looking at the .mapPartitions() method implemented for RDD[T] objects.
> It will give you direct access to an iterator containing the member objects
> of each partition for doing the kind of within-partition hashtag counts
> you're describing.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6534.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: Selecting first ten values in a RDD/partition

Posted by Brian Gawalt <bg...@gmail.com>.
Try looking at the .mapPartitions() method implemented for RDD[T] objects.
It will give you direct access to an iterator containing the member objects
of each partition for doing the kind of within-partition hashtag counts
you're describing.
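For illustration, a helper with the Iterator => Iterator shape that mapPartitions expects might look like this (a Spark-free sketch; the names are made up):

```scala
// Per-partition top-k with the Iterator => Iterator signature that
// RDD.mapPartitions expects. Sorting the whole partition is the simple
// version; a bounded priority queue would avoid materializing it all.
object PartitionTopK {
  def topK(k: Int)(it: Iterator[(String, Int)]): Iterator[(String, Int)] =
    it.toSeq.sortBy { case (_, c) => -c }.take(k).iterator
}

// With Spark this would be applied as (hypothetical val names):
//   val partitionTop10 = hashtagCounts.mapPartitions(PartitionTopK.topK(10))
```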



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6534.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Selecting first ten values in a RDD/partition

Posted by Chris Fregly <ch...@fregly.com>.
As Brian alluded to earlier, you can use DStream.mapPartitions() to return
the partition-local top 10 for each partition. Once you collect the results
from all the partitions, you can do a global top-10 merge sort across all
partitions.

This leads to a much smaller dataset being sent back to the driver to
calculate the global top 10.
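The merge step described above can be sketched without Spark as follows (names illustrative): each partition contributes at most 10 pairs, so the driver-side merge is cheap.

```scala
// Merge partition-local top-10 lists into a global top 10. Each inner
// Seq is one partition's result, so the total input is tiny compared to
// the raw hashtag counts that would otherwise be shuffled.
object GlobalTop10 {
  def merge(perPartition: Seq[Seq[(String, Int)]], k: Int = 10): Seq[(String, Int)] =
    perPartition.flatten.sortBy { case (_, c) => -c }.take(k)
}
```

In a streaming job this could run inside foreachRDD after collecting the small per-partition results with rdd.collect() (again, a sketch, not code from this thread).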


On Fri, May 30, 2014 at 5:05 AM, nilmish <ni...@gmail.com> wrote:

> My primary goal: to get the top 10 hashtags for every 5-minute interval.
>
> I want to do this efficiently. I have already done this by using
> reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute
> interval, taking only the top 10 elements. But this is very slow.
>
> So now I am thinking of retaining only the top 10 hashtags in each RDD
> because
> only these could appear in the final answer.
>
> I am stuck at: how to retain only the top 10 hashtags in each RDD of my
> DStream? Basically I need to transform my DStream so that each RDD contains
> only the top 10 hashtags, so that the number of hashtags in the 5-minute
> interval is low.
>
> If there is a more efficient way of doing this then please let me know
> that also.
>
> Thanks,
> Nilesh
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6577.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: Selecting first ten values in a RDD/partition

Posted by nilmish <ni...@gmail.com>.
My primary goal: to get the top 10 hashtags for every 5-minute interval.

I want to do this efficiently. I have already done this by using
reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute
interval, taking only the top 10 elements. But this is very slow.

So now I am thinking of retaining only the top 10 hashtags in each RDD
because only these could appear in the final answer.

I am stuck at: how to retain only the top 10 hashtags in each RDD of my
DStream? Basically I need to transform my DStream so that each RDD contains
only the top 10 hashtags, so that the number of hashtags in the 5-minute
interval is low.

If there is a more efficient way of doing this then please let me know
that also.

Thanks,
Nilesh



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6577.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Selecting first ten values in a RDD/partition

Posted by Anwar Rizal <an...@gmail.com>.
Can you clarify what you're trying to achieve here?

If you only want the top 10 of each RDD, why not sort each RDD and then
take(10)?

Or do you want the top 10 over the five-minute window?

Cheers,



On Thu, May 29, 2014 at 2:04 PM, nilmish <ni...@gmail.com> wrote:

> I have a DStream whose RDDs arrive every 2 sec. I have sorted each RDD and
> want to retain only the top 10 values and discard the rest. How can I
> retain only the top 10 values?
>
> I am trying to get the top 10 hashtags. Instead of sorting the entire set
> of 5-minute counts (thereby incurring the cost of a data shuffle), I am
> trying
> to get the top 10 hashtags in each partition. I am stuck on how to retain
> the top 10 hashtags in each partition.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>