Posted to user@spark.apache.org by Soumya Simanta <so...@gmail.com> on 2014/02/19 23:23:26 UTC

How to achieve this in Spark

I have an RDD that contains ids (Long).

subsetids

res22: org.apache.spark.rdd.RDD[Long]


I have another RDD of objects (MyObject), where one of the fields is an id
(Long).

allobjects

res23: org.apache.spark.rdd.RDD[MyObject] = MappedRDD[272]

Now I want to run filter on allobjects so that I can get a subset that
matches with the ids that are present in my first RDD (i.e., subsetids)

Say something like -

val subsetObjs = allobjects.filter( x => subsetids.contains(x.getId) )

However, RDD has no "contains" method, so I'm looking for the most
efficient way of achieving this in Spark.

Thanks.

Re: How to achieve this in Spark

Posted by Mayur Rustagi <ma...@gmail.com>.
You can convert it into a map and ship it along with the filter closure. That
is a bad implementation if your map is huge; in that case, convert the map
into a broadcast variable and send that along with the filter.
http://www.youtube.com/watch?v=w0Tisli7zn4#t=248
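To make the broadcast approach concrete, here is a minimal sketch. The names
(subsetids, allobjects, getId) follow the question; plain Scala Seqs stand in
for the RDDs so the snippet is self-contained, and the Spark-specific calls
(collect, sc.broadcast) are noted in comments.

```scala
// Minimal sketch of the filter-with-a-shipped-set approach.
// Plain Seqs stand in for the RDDs from the question; in Spark you would
// call subsetids.collect().toSet on the driver and wrap the result in
// sc.broadcast, then read bc.value inside the filter closure.
case class MyObject(getId: Long, payload: String)

val subsetids  = Seq(1L, 3L)
val allobjects = Seq(MyObject(1L, "a"), MyObject(2L, "b"), MyObject(3L, "c"))

// Driver side: materialize the ids as a Set (sc.broadcast(idSet) in Spark).
val idSet: Set[Long] = subsetids.toSet

// The filter closure captures only the small set, not another RDD.
val subsetObjs = allobjects.filter(x => idSet.contains(x.getId))
// subsetObjs keeps only the objects whose id is in subsetids
```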


Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi




Re: How to achieve this in Spark

Posted by Nathan Kronenfeld <nk...@oculusinfo.com>.
It depends on how big subsetids is.

If subsetids is small, then you can just collect it to the driver and
broadcast it as a set, to be used exactly as you stated.

If subsetids is too big for that, you need to join it with the second RDD:

subsetids.map(i => (i, i))  // transform ids into key/value form for use with join
  .join(subsetObjs)         // to get just the objs with included ids; note
                            // subsetObjs should be in the (id, object) form
  .map(_._2)                // to get back to the original (id, object) form
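The join chain can be mocked with plain Scala collections to check its
semantics (a sketch: Seqs stand in for RDDs, RDD.join is emulated with a
nested loop, and the objects are assumed to already be keyed as (id, object)
pairs):

```scala
// Sketch of the join-based approach with Seqs standing in for RDDs.
val subsetids = Seq(1L, 3L)
val objsById  = Seq((1L, "a"), (2L, "b"), (3L, "c"))  // (id, object) pairs

val keyed = subsetids.map(i => (i, i))     // ids in key/value form

// Emulate RDD.join: inner join on the key.
val joined = for {
  (k, v)    <- keyed
  (k2, obj) <- objsById
  if k == k2
} yield (k, (v, obj))                      // RDD.join yields (key, (left, right))

val result = joined.map { case (id, (_, obj)) => (id, obj) }
// result contains only the (id, object) pairs whose id is in subsetids
```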

I hope this helps.
                      -Nathan





-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenfeld@oculusinfo.com

Re: How to achieve this in Spark

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Your problem is more basic than that.  You can't reference one RDD
(subsetids) from within an operation on another RDD (allobjects.filter).

