Posted to user@spark.apache.org by Srikanth <sr...@gmail.com> on 2015/07/23 04:14:42 UTC

ShuffledHashJoin instead of CartesianProduct

Hello,

I'm trying to link records from two large data sources. Both datasets have
almost the same number of rows.
The goal is to match records based on multiple columns.

    val matchId = SFAccountDF.as("SF").join(ELAccountDF.as("EL"))
      .where($"SF.Email" === $"EL.EmailAddress" || $"SF.Phone" === $"EL.Phone")


Joining with an OR (||) results in a CartesianProduct, which I'm trying to
avoid.
One way to do this is to join on each column and UNION the results.


    val phoneMatch = SFAccountDF.as("SF").filter("Phone != ''")
      .join(ELAccountDF.as("EL").filter("BusinessPhone != ''"))
      .where($"SF.Phone" === $"EL.BusinessPhone")
    val emailMatch = SFAccountDF.as("SF").filter("Email != ''")
      .join(ELAccountDF.as("EL").filter("EmailAddress != ''"))
      .where($"SF.Email" === $"EL.EmailAddress")

    val matchId = phoneMatch.unionAll(emailMatch.unionAll(faxMatch.unionAll(mobileMatch)))
    matchId.cache().registerTempTable("matchId")


Is there a more elegant way to do this?

On a related note, has anyone worked on record linkage using Bloom filters,
Levenshtein distance, etc. in Spark?

Srikanth