Posted to user@spark.apache.org by Arun Luthra <ar...@gmail.com> on 2019/06/18 19:18:16 UTC

GC problem doing fuzzy join

I'm trying to do a brute-force fuzzy join where I compare N records against
N other records, for N^2 total comparisons.

The table is medium-sized and fits in memory, so I collect it and put it
into a broadcast variable.

The other copy of the table is in an RDD. I am basically calling the RDD
map operation, and for each record in the RDD the map function takes the
broadcast table and filters it. There appears to be heavy GC activity, so I
suspect that repeatedly creating and discarding filtered copies of the
broadcast table is what triggers it.
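
Roughly, the pattern looks like the sketch below. It is plain Python rather
than Spark code: ordinary lists stand in for the RDD and for the broadcast
value, and is_fuzzy_match, the 0.8 threshold, and the sample records are all
made up for illustration, since the real comparison is not shown here.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the collected table held in the broadcast variable.
broadcast_table = ["john smith", "jon smith", "jane doe", "john smyth"]

# Hypothetical stand-in for the other copy of the table, held in the RDD.
rdd_records = list(broadcast_table)


def is_fuzzy_match(a, b, threshold=0.8):
    """Hypothetical similarity test (assumed, not the actual comparison)."""
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold


def match_record(record, table):
    # Filtering builds a fresh list for every input record, so N records
    # produce N short-lived collections (N^2 comparisons in total).
    return record, [other for other in table if is_fuzzy_match(record, other)]


# Assumed to be equivalent in spirit to something like
# rdd.map(r => (r, broadcastTable.value.filter(...))) on the Spark side.
matches = dict(match_record(r, broadcast_table) for r in rdd_records)
```

Each call to match_record allocates a new filtered collection that becomes
garbage as soon as the next record is processed, which is the allocation
churn I suspect.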

Is there a way to fix this pattern?

Thanks,
Arun