Posted to user@spark.apache.org by Jared Rodriguez <jr...@kitedesk.com> on 2014/04/23 16:48:26 UTC

Comparing RDD Items

Hi there,

I am new to Spark and new to Scala, although I have lots of experience on
the Java side.  I am experimenting with Spark for a new project where it
seems like it could be a good fit.  As I go through the examples, there is
one scenario I am trying to figure out: comparing the contents of an RDD
to itself to produce a new RDD.

In an overly simple example, I have:

JavaSparkContext sc = new JavaSparkContext ...
JavaRDD<String> data = sc.parallelize(buildData());

I then want to compare each entry in data to other entries and end up with:

JavaPairRDD<String, List<String>> mapped = data.???

Is this something easily handled by Spark?  My apologies if this is a
stupid question; I have spent less than 10 hours tinkering with Spark and
am trying to come up to speed.


-- 
Jared Rodriguez

Re: Comparing RDD Items

Posted by Daniel Darabos <da...@lynxanalytics.com>.
Hi! There is RDD.cartesian(), which creates the Cartesian product of two
RDDs. You could do data.cartesian(data) to get an RDD of all pairs of
lines. It will be of length data.count * data.count, of course.


