Posted to user@spark.apache.org by Nirav Patel <np...@xactlycorp.com> on 2018/11/08 00:12:42 UTC

spark 2.2.x - BroadcastHashJoin is not happening even after checkpointing

I am joining two datasets: one with a few hundred million records and the
other with just 72 records. Without doing anything special it tries to do
a SortMergeJoin (shuffle exchange) and blows up with an OOM. I expect it
to do a map-side join (broadcast join).
I have auto broadcast on and I am not repartitioning my dataset.
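
For reference, a minimal sketch of what I would expect to work (the
dataset names, paths and the join key below are placeholders, not my
actual job). The default auto-broadcast threshold is 10 MB, and the small
side can also be forced explicitly with the broadcast() hint:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-check").getOrCreate()

// Default is 10485760 (10 MB); a 72-row dataset should be far below it.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

val bigDs   = spark.read.parquet("/path/to/big")    // a few hundred million rows
val smallDs = spark.read.parquet("/path/to/small")  // 72 rows

// Explicit hint: tells the planner to broadcast smallDs regardless of
// its estimated size statistics.
val joined = bigDs.join(broadcast(smallDs), Seq("key"))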

It works if I save the small dataset and read it back. It doesn't work if
I checkpoint!

Attaching two screenshots. The first one is where I am checkpointing the
small dataset.

[image: Screen Shot 2018-11-07 at 4.04.04 PM.png]

The plan above reads an ExistingRDD from the checkpoint. It has only 72
records and Spark still decided to do a shuffle join!

Here is what happens when I save it instead:

[image: Screen Shot 2018-11-07 at 4.03.53 PM.png]

Now it does a broadcast join.
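
To double-check which strategy was picked, I look at the physical plan
(sketch, using the hypothetical "joined" dataset from above):

// A broadcast join shows up as BroadcastHashJoin with a BroadcastExchange;
// the shuffle variant shows up as SortMergeJoin with Exchange nodes.
joined.explain()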

So my workaround is to save the small dataset and read it back.
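
Roughly like this (sketch, with a hypothetical /tmp/small_ds path;
presumably the relation read back from disk comes with a realistic size
estimate, so it falls under the auto-broadcast threshold):

smallDs.write.mode("overwrite").parquet("/tmp/small_ds")
val smallFromDisk = spark.read.parquet("/tmp/small_ds")

val joined2 = bigDs.join(smallFromDisk, Seq("key"))
joined2.explain()  // BroadcastHashJoin shows up here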

Why didn't checkpointing work?

Why doesn't it work without checkpointing or saving? (I haven't included
that lineage here as it's too big and complicated.) Checkpointing does
help truncate the previous lineage by executing it, but what happened
after that was not expected.
