Posted to user@spark.apache.org by Bryan <br...@gmail.com> on 2015/10/27 00:13:46 UTC

Joining large data sets

Hello.

What is the suggested practice for joining two large data streams? Currently I simply map out the key tuple on both streams and then execute a join.

I have seen several suggestions for broadcast joins, but these seem to be targeted at joining a large data set to a small one (by broadcasting the smaller set).

For joining two large data sets, it would seem better to repartition both sets in the same way and then join each partition. Is there a suggested practice for this problem?
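
To make the idea concrete, here is a minimal sketch in plain Python of what I mean by "repartition both sets in the same way, then join each partition". It does not use Spark at all. The function names (partition_by_key, co_partitioned_join) and the default of 4 partitions are hypothetical, chosen only for illustration. In Spark I assume the analogous step would be applying the same partitioner (for example a HashPartitioner) to both keyed data sets before the join.

```python
from collections import defaultdict


def partition_by_key(records, num_partitions):
    """Split (key, value) records into buckets using hash(key) % num_partitions."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def co_partitioned_join(left, right, num_partitions=4):
    """Inner join: partition both sides identically, then join bucket by bucket."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # A key hashes to the same bucket number on both sides, so matches
        # can only occur between same-numbered buckets.
        for key, left_values in left_part.items():
            for left_value in left_values:
                for right_value in right_part.get(key, []):
                    joined.append((key, (left_value, right_value)))
    return joined


if __name__ == "__main__":
    # Keys are tuples, as in the "key tuple" described above.
    left = [(("a", 1), "x"), (("b", 2), "y"), (("a", 1), "z")]
    right = [(("a", 1), 10), (("c", 3), 30)]
    for row in sorted(co_partitioned_join(left, right)):
        print(row)
```

Because both sides use the same hash function and the same partition count, each bucket pair can be joined independently, and no record needs to be compared against a bucket with a different number.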

Thank you,

Bryan Jeffrey