Posted to user@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/03/16 04:58:27 UTC

Spark join for skewed dataset

Hi,

If the join keys are skewed, is there a specific optimized join available
in Spark for such use cases?

I saw that both Scalding and Hive support a similar feature, and I am
testing skewJoinWithSmaller on one of my skewed datasets...


http://twitter.github.io/scalding/com/twitter/scalding/JoinAlgorithms.html:
skewJoinWithSmaller



https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization:
I am not sure whether this is only a proposal or has already been added


I guess we could use hash partitioning to generate new join keys for the
cases where the join key is skewed, but I was wondering if there is
something available in the API already.
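
To make the "new join keys" idea concrete, here is a rough sketch of what I
have in mind, written with plain Python collections rather than any Spark
API. The names salted_join, hot_keys and num_salts are made up for
illustration, and it assumes the skewed keys are already known: each hot key
on the large side gets a random salt, and the matching rows on the small side
are replicated once per salt so the join result stays the same.

```python
import random
from collections import defaultdict


def salted_join(large, small, hot_keys, num_salts=4, seed=0):
    """Inner-join two lists of (key, value) pairs on a salted key.

    hot_keys and num_salts are hypothetical knobs for this sketch:
    hot_keys is the set of keys assumed to be skewed, num_salts is
    how many sub-keys each hot key is spread over.
    """
    rng = random.Random(seed)

    # Large side: hot keys get a random salt, all other keys get salt 0.
    salted_large = [
        ((k, rng.randrange(num_salts) if k in hot_keys else 0), v)
        for k, v in large
    ]

    # Small side: replicate hot-key rows once per salt so that every
    # salted key on the large side still finds its match.
    index = defaultdict(list)
    for k, v in small:
        salts = range(num_salts) if k in hot_keys else (0,)
        for s in salts:
            index[(k, s)].append(v)

    # Join on the salted key, then drop the salt from the output.
    return [
        (k, (lv, sv))
        for (k, s), lv in salted_large
        for sv in index.get((k, s), [])
    ]


if __name__ == "__main__":
    large = [("a", i) for i in range(6)] + [("b", 100)]
    small = [("a", "x"), ("b", "y"), ("c", "z")]
    for row in salted_join(large, small, hot_keys={"a"}):
        print(row)
```

In a distributed setting the salted key would be what gets hash partitioned,
so the rows for a hot key end up on up to num_salts partitions instead of
one. The cost is that the small side grows by a factor of num_salts for the
hot keys.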


Thanks.

Deb