Posted to user@spark.apache.org by Arun Luthra <ar...@gmail.com> on 2019/06/18 19:18:16 UTC
GC problem doing fuzzy join
I'm trying to do a brute force fuzzy join where I compare N records against
N other records, for N^2 total comparisons.
The table is medium size and fits in memory, so I collect it and put it
into a broadcast variable.
The other copy of the table is in an RDD. I am basically calling the RDD
map operation, and for each record in the RDD I filter the broadcast table
for matches. A lot of GC appears to be happening, so I suspect that
repeatedly creating and discarding filtered copies of the broadcast table
is what is causing it.
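To make the pattern concrete, here is a minimal sketch in plain Python
rather than Spark, so it only illustrates the allocation behaviour and not
the actual job. The function is_fuzzy_match, its 0.8 threshold, and the
sample strings are hypothetical stand-ins, since the real comparison is not
shown in this post. The first function builds a new filtered list for every
record, as described above; the second yields matching pairs one at a time
without an intermediate per-record list. Whether that per-record list is
really what drives the GC activity is an assumption, not something measured.

```python
from difflib import SequenceMatcher


def is_fuzzy_match(a, b, threshold=0.8):
    # Hypothetical similarity test; the real comparison is not given in the post.
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold


def matches_materialized(record, table):
    # Pattern from the post: filter the whole table for each record,
    # allocating a new short-lived list on every call.
    return [row for row in table if is_fuzzy_match(record, row)]


def matches_lazy(record, table):
    # Alternative: yield matching pairs one at a time, so no filtered
    # copy of the table is built per record.
    for row in table:
        if is_fuzzy_match(record, row):
            yield (record, row)


# Hypothetical sample data standing in for the collected table.
table = ["apache spark", "apache sparc", "hadoop", "hadop", "flink"]

# N^2 comparisons: every record is checked against every row of the table.
pairs = [pair for record in table for pair in matches_lazy(record, table)]
```

In Spark the lazy variant would roughly correspond to emitting matches from
a flatMap over an iterator of the broadcast value instead of returning a
filtered collection from map, but that mapping is only a suggestion here.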
Is there a way to fix this pattern?
Thanks,
Arun