Posted to user@spark.apache.org by Ankur Srivastava <an...@gmail.com> on 2019/02/14 23:50:19 UTC

Cross Join in Spark

Hello,

We have a use case where we need to do a Cartesian join, and for some reason
we are not able to get it to work with the Dataset API. We have a similar use
case implemented and working with RDDs.

We have two datasets:
- one dataset with two string columns, say c1 and c2. It is a small dataset
with ~1 million records. Both columns are 32-character strings, so the whole
dataset should be less than 500 MB. We broadcast this dataset.

- the other dataset is a little bigger, with ~10 million records (a rough
synthetic sketch of both datasets follows below)
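
For reference, here is a rough in-memory stand-in for the two datasets
(the column names match ours; the values are made up, and the real row
counts are ~1 million and ~10 million):

import spark.implicits._

// Hypothetical stand-ins for the two CSVs, just to make the question
// reproducible without S3 access.
val ds1 = Seq(
  ("00000000000000000000000000000000", "ffffffffffffffffffffffffffffffff")
).toDF("c1", "c2")

val ds2 = Seq(
  ("0123456789abcdef0123456789abcdef", "a", "b", "c", "d", "2019-02-14")
).toDF("c11", "c12", "c13", "c14", "c15", "ts")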

Below is the code we are using:

import org.apache.spark.sql.functions.broadcast
import spark.implicits._  // for the $"col" syntax

// Small dataset (~1 million rows): two 32-character string columns
val ds1 = spark.read.format("csv")
  .option("header", "true")
  .load(<s3-location>)
  .select("c1", "c2")
ds1.count

// Bigger dataset (~10 million rows)
val ds2 = spark.read.format("csv")
  .load(<s3-location>)
  .toDF("c11", "c12", "c13", "c14", "c15", "ts")
ds2.count

// Cross join against the broadcast small dataset, then apply the range filter
ds2.crossJoin(broadcast(ds1))
  .filter($"c1" <= $"c11" && $"c11" <= $"c2")
  .count

We even tried a regular join with the same range condition:

ds2.join(broadcast(ds1), $"c1" <= $"c11" && $"c11" <= $"c2")
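
In case the broadcast hint is being dropped somewhere, one thing we could
also try is raising the auto-broadcast threshold explicitly (the 512 MB
value below is just our guess at what would comfortably hold ds1):

// Hypothetical fallback: raise the auto-broadcast threshold so the
// planner may broadcast ds1 on its own (the 512 MB figure is arbitrary).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 512L * 1024 * 1024)
ds2.join(broadcast(ds1), $"c1" <= $"c11" && $"c11" <= $"c2").count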

I can see the broadcast is successful:

2019-02-14 23:11:55 INFO  CodeGenerator:54 - Code generated in 10.469136 ms
2019-02-14 23:11:55 INFO  TorrentBroadcast:54 - Started reading broadcast variable 29
2019-02-14 23:11:55 INFO  TorrentBroadcast:54 - Reading broadcast variable 29 took 6 ms
2019-02-14 23:11:56 INFO  CodeGenerator:54 - Code generated in 11.280087 ms

But then the stage does not progress.

What am I doing wrong? Any pointers would be helpful.

Thanks
Ankur