Posted to user@spark.apache.org by Soumya Simanta <so...@gmail.com> on 2014/02/19 23:23:26 UTC
How to achieve this in Spark
I have an RDD that contains ids (Long):

subsetids
res22: org.apache.spark.rdd.RDD[Long]

I have another RDD of objects (MyObject), where one of the fields is an id (Long):

allobjects
res23: org.apache.spark.rdd.RDD[MyObject] = MappedRDD[272]

Now I want to filter allobjects so that I get the subset whose ids are present in my first RDD (i.e., subsetids).

Something like:

val subsetObjs = allobjects.filter( x => subsetids.contains(x.getId) )

However, RDD has no "contains" method, so I'm looking for the most efficient way to achieve this in Spark.

Thanks.
Re: How to achieve this in Spark
Posted by Mayur Rustagi <ma...@gmail.com>.
You can collect it into a map on the driver and ship it along with the filter closure. This is a bad implementation if your map is huge. Alternatively, you can turn the map into a broadcast variable and reference it inside the filter.
http://www.youtube.com/watch?v=w0Tisli7zn4#t=248
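A minimal sketch of the broadcast approach. Only subsetids and allobjects come from the thread; the MyObject fields, the helper name filterByIds, and the sample data are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical object type; only the id field matters here.
case class MyObject(getId: Long, name: String)

// Collect the (small) id RDD to the driver, broadcast it as a Set,
// and filter by membership. Each executor receives one read-only copy
// of the set instead of one copy per task.
def filterByIds(sc: SparkContext, ids: RDD[Long], objs: RDD[MyObject]): RDD[MyObject] = {
  val idSet = sc.broadcast(ids.collect().toSet)
  objs.filter(x => idSet.value.contains(x.getId))
}
```

Note this only works when subsetids fits comfortably in driver memory, since collect() materializes it there.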
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi
Re: How to achieve this in Spark
Posted by Nathan Kronenfeld <nk...@oculusinfo.com>.
It depends on how big subsetids is.

If subsetids is small, you can just collect it to the driver and broadcast it as a set, then filter exactly as you wrote.

If subsetids is too big for that, you need to join it with the second RDD:

subsetids.map(i => (i, i))   // transform ids into key/value form for use with join
  .join(allobjectsById)      // keep only objects with matching ids; allobjectsById is allobjects keyed as (id, object)
  .map(_._2._2)              // drop the keys to get back the objects

I hope this helps.
-Nathan
--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenfeld@oculusinfo.com
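The join recipe above can be sketched as a runnable function. The MyObject shape and the helper name filterByJoin are illustrative assumptions; only the RDD shapes come from the thread:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical object type; only the id field matters here.
case class MyObject(getId: Long, name: String)

// Key both RDDs by id and join; only ids present on both sides survive.
// (Very old Spark versions needed `import org.apache.spark.SparkContext._`
// to bring the pair-RDD functions such as join into scope.)
def filterByJoin(ids: RDD[Long], objs: RDD[MyObject]): RDD[MyObject] =
  ids.map(i => (i, i))                   // ids into key/value form for join
    .join(objs.map(o => (o.getId, o)))   // (id, (id, object)) for matching ids only
    .map(_._2._2)                        // keep just the objects
```

If subsetids can contain duplicates, apply .distinct() to it first, or each matching object will be emitted once per duplicate id.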
Re: How to achieve this in Spark
Posted by Mark Hamstra <ma...@clearstorydata.com>.
Your problem is more basic than that. You can't reference one RDD
(subsetids) from within an operation on another RDD (allobjects.filter).
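To make this concrete, a sketch under illustrative names: the id RDD must first be brought to the driver with an action such as collect(), after which the resulting local collection can safely be captured by the filter closure. Referencing the RDD itself inside the closure fails at runtime, not at compile time:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical object type; only the id field matters here.
case class MyObject(getId: Long, name: String)

// Invalid: RDDs can only be used from the driver, so something like
//   allobjects.filter(x => subsetids.filter(_ == x.getId).count() > 0)
// throws at runtime when the closure is executed on an executor.

// Valid: run a driver-side action first, then close over the plain local Set.
def filterWithLocalSet(ids: RDD[Long], objs: RDD[MyObject]): RDD[MyObject] = {
  val local: Set[Long] = ids.collect().toSet  // action executed on the driver
  objs.filter(o => local.contains(o.getId))   // closure captures a Set, not an RDD
}
```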