Posted to user@spark.apache.org by Shashikant Kulkarni <sh...@gmail.com> on 2017/10/12 17:16:05 UTC

Apache Spark-Subtract two datasets

Hello,

I have two datasets, a Dataset<Class1> and a Dataset<Class2>. I want the list of records which are in Dataset<Class1> but not in Dataset<Class2>. How can I do this in Apache Spark using the Java API? I am using Apache Spark 2.2.0

Thank you
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Apache Spark-Subtract two datasets

Posted by Imran Rajjad <ra...@gmail.com>.
If the datasets hold objects of different classes, you will have to
convert both of them to RDDs and bring them to a common shape (matching
fields, keyed the same way) before you call rdd1.subtract(rdd2), since
subtract requires both RDDs to have the same element type.
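
Since the original question asks for the Java API, a rough sketch of this RDD
route might look like the following. It assumes hypothetical bean classes
Class1 and Class2 that both expose a getId() key (the real field names will
differ), and it uses subtractByKey so the surviving records keep their full
Class1 payload rather than just the key:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.sql.Dataset;
    import scala.Tuple2;

    public class SubtractSketch {
        // Class1 records whose key has no counterpart among the Class2 records.
        public static JavaPairRDD<String, Class1> onlyInFirst(Dataset<Class1> ds1,
                                                              Dataset<Class2> ds2) {
            // Key both sides by the shared identifier so they have a common shape.
            JavaPairRDD<String, Class1> left =
                ds1.toJavaRDD().mapToPair(c1 -> new Tuple2<>(c1.getId(), c1));
            JavaPairRDD<String, Class2> right =
                ds2.toJavaRDD().mapToPair(c2 -> new Tuple2<>(c2.getId(), c2));
            // subtractByKey keeps the left-hand pairs whose key does not occur on the right.
            return left.subtractByKey(right);
        }
    }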

On Thu, Oct 12, 2017 at 10:16 PM, Shashikant Kulkarni <
shashikant.kulkarni@gmail.com> wrote:

> Hello,
>
> I have two datasets, a Dataset<Class1> and a Dataset<Class2>. I want
> the list of records which are in Dataset<Class1> but not in
> Dataset<Class2>. How can I do this in Apache Spark using the Java API?
> I am using Apache Spark 2.2.0
>
> Thank you
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>


-- 
I.R

Re: Apache Spark-Subtract two datasets

Posted by Nathan Kronenfeld <nk...@uncharted.software>.
I think you want a join of type "left_anti"... see the spark-shell log below.

scala> import spark.implicits._
import spark.implicits._

scala> case class Foo (a: String, b: Int)
defined class Foo

scala> case class Bar (a: String, d: Double)
defined class Bar

scala> var fooDs = Seq(Foo("a", 1), Foo("b", 2), Foo("c", 3)).toDS
fooDs: org.apache.spark.sql.Dataset[Foo] = [a: string, b: int]

scala> var barDs = Seq(Bar("b", 2.1), Bar("c", 3.2), Bar("d", 4.3)).toDS
barDs: org.apache.spark.sql.Dataset[Bar] = [a: string, d: double]

scala> fooDs.join(barDs, Seq("a"), "left_anti").collect.foreach(println)
[a,1]
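
Since the original question asks about the Java API, a rough Java translation
of the same left_anti join might look like this. It assumes the two bean
classes share a join column named "id" (adjust to the real key column):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    public class LeftAntiSketch {
        // Class1 records that have no matching Class2 record on the join key.
        public static Dataset<Class1> onlyInFirst(Dataset<Class1> ds1,
                                                  Dataset<Class2> ds2) {
            // "left_anti" keeps only the left-hand rows with no match on the right.
            Dataset<Row> result =
                ds1.join(ds2, ds1.col("id").equalTo(ds2.col("id")), "left_anti");
            // An anti join returns only the left-hand columns, so the rows can be
            // mapped back onto the original bean class.
            return result.as(Encoders.bean(Class1.class));
        }
    }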


On Thu, Oct 12, 2017 at 1:16 PM, Shashikant Kulkarni <
shashikant.kulkarni@gmail.com> wrote:

> Hello,
>
> I have two datasets, a Dataset<Class1> and a Dataset<Class2>. I want
> the list of records which are in Dataset<Class1> but not in
> Dataset<Class2>. How can I do this in Apache Spark using the Java API?
> I am using Apache Spark 2.2.0
>
> Thank you
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>