Posted to user@spark.apache.org by Ankur Srivastava <an...@gmail.com> on 2019/02/14 23:50:19 UTC
Cross Join in Spark
Hello,
We have a use case where we need to do a Cartesian join, and for some reason
we are not able to get it to work with the Dataset API. We have a similar use
case implemented and working with RDDs.
We have two datasets:
- one dataset with 2 string columns, say c1 and c2. It is a small dataset
with ~1 million records. Both columns are strings of 32 characters, so it
should be well under 500 MB. We broadcast this dataset.
- the other dataset is a little bigger, with ~10 million records.
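As a sanity check on the size of the broadcast side, here is a back-of-envelope sketch using the row count and column widths stated above (raw character data only; actual JVM string overhead will be larger, but still far below 500 MB):

```scala
// Back-of-envelope estimate of the broadcast side's raw size.
val rows     = 1000000L         // ~1 million records
val colWidth = 32L              // each of the two columns is a 32-character string
val rawBytes = rows * 2 * colWidth
val rawMB    = rawBytes / (1000L * 1000L)
println(s"raw character data: ~$rawMB MB")  // ~64 MB of raw characters
```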
Below is the code we are using:
val ds1 = spark.read.format("csv").option("header", "true")
  .load(<s3-location>)
  .select("c1", "c2")
ds1.count

val ds2 = spark.read.format("csv")
  .load(<s3-location>)
  .toDF("c11", "c12", "c13", "c14", "c15", "ts")
ds2.count

ds2.crossJoin(broadcast(ds1))
  .filter($"c1" <= $"c11" && $"c11" <= $"c2")
  .count
We also tried a regular join with the same condition:

ds2.join(broadcast(ds1), $"c1" <= $"c11" && $"c11" <= $"c2")
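For clarity, the filter we are applying is an interval-containment check: keep every (c1, c2, c11) where c1 <= c11 <= c2, compared lexicographically. A minimal plain-Scala sketch of the same predicate on in-memory data (values are illustrative only, not from our real datasets):

```scala
// Illustrative only: the cross-join-then-filter semantics on in-memory data.
val ranges = Seq(("a", "m"), ("n", "z"))   // (c1, c2) rows from the small dataset
val keys   = Seq("b", "q", "0")            // c11 values from the big dataset

val matches = for {
  (c1, c2) <- ranges
  c11      <- keys
  if c1 <= c11 && c11 <= c2                // lexicographic containment, as in the filter
} yield (c1, c2, c11)

println(matches)  // List((a,m,b), (n,z,q))
```

Note that with ~1 million range rows against ~10 million keys this predicate is evaluated on the order of 10^13 times, which is why the join is expensive regardless of the broadcast.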
I can see that the broadcast is successful:
2019-02-14 23:11:55 INFO CodeGenerator:54 - Code generated in 10.469136 ms
2019-02-14 23:11:55 INFO TorrentBroadcast:54 - Started reading broadcast variable 29
2019-02-14 23:11:55 INFO TorrentBroadcast:54 - Reading broadcast variable 29 took 6 ms
2019-02-14 23:11:56 INFO CodeGenerator:54 - Code generated in 11.280087 ms
But then the stage does not progress.
What am I doing wrong? Any pointers would be helpful.
Thanks
Ankur