Posted to user@spark.apache.org by Blind Faith <pe...@gmail.com> on 2014/11/17 18:51:11 UTC

How can I apply such an inner join in Spark Scala/Python

So let us say I have RDDs A and B with the following values.

A = [ (1, 2), (2, 4), (3, 6) ]

B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ]

I want to apply an inner join, such that I get the following as a result.

C = [ (1, (2, 3)), (2, (4, 5)), (3, (6, 6)) ]

That is, keys which are not present in both A and B should disappear after
the inner join.

How can I achieve that? I can see outerJoin functions but no innerJoin
functions in the Spark RDD class.

Re: How can I apply such an inner join in Spark Scala/Python

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Simple join would do it.

    val a: List[(Int, Int)] = List((1, 2), (2, 4), (3, 6))
    val b: List[(Int, Int)] = List((1, 3), (2, 5), (3, 6), (4, 5), (5, 6))

    val A = sparkContext.parallelize(a)
    val B = sparkContext.parallelize(b)

    // join comes from PairRDDFunctions, which Spark makes available on RDDs
    // of pairs through an implicit conversion (import
    // org.apache.spark.SparkContext._ in Spark 1.x), so there is no need to
    // construct the wrapper by hand.
    val C = A.join(B)

    // Prints (1,(2,3)), (2,(4,5)) and (3,(6,6)), in no particular order.
    C.foreach(println)
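Since the subject asks about Python as well: a standalone sketch of the same
inner-join semantics in plain Python (no Spark needed; `inner_join` is a
hypothetical helper written here just to mirror what RDD.join computes per
key).

```python
def inner_join(a, b):
    """Inner join of two lists of (key, value) pairs, mimicking RDD.join:
    only keys present in both inputs survive, and every matching pair of
    values is emitted."""
    # Group B's values by key.
    b_by_key = {}
    for k, v in b:
        b_by_key.setdefault(k, []).append(v)
    # For each (key, value) in A, pair it with every value B holds for
    # that key; keys absent from B contribute nothing.
    return [(k, (va, vb)) for k, va in a for vb in b_by_key.get(k, [])]

A = [(1, 2), (2, 4), (3, 6)]
B = [(1, 3), (2, 5), (3, 6), (4, 5), (5, 6)]

print(inner_join(A, B))  # [(1, (2, 3)), (2, (4, 5)), (3, (6, 6))]
```

In PySpark itself the one-liner is the same as in Scala:
`sc.parallelize(A).join(sc.parallelize(B))` yields those three pairs,
with keys 4 and 5 dropped because they never appear in A.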


Thanks
Best Regards

On Mon, Nov 17, 2014 at 11:54 PM, Sean Owen <so...@cloudera.com> wrote:

> Just RDD.join() should be an inner join.

Re: How can I apply such an inner join in Spark Scala/Python

Posted by Sean Owen <so...@cloudera.com>.
Just RDD.join() should be an inner join.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org