You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "David Sabater (JIRA)" <ji...@apache.org> on 2015/06/30 18:08:06 UTC

[jira] [Comment Edited] (SPARK-4644) Implement skewed join

    [ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608578#comment-14608578 ] 

David Sabater edited comment on SPARK-4644 at 6/30/15 4:07 PM:
---------------------------------------------------------------

I think the point here is to minimise the shuffling produced by the skewed key, i.e. one executor receives all tuples based on the skewed key.
As the groupBy is not using information about partitioning and it's known to be not optimal anyway, it makes sense to focus on improving this.

The point is how other MPP engines are handling skewed data? As an example Greenplum just partitions based on your distribution key and leverages storage rather than memory to proces the queries. Kind of brute force approach, broadcasting the other tables assumed smaller and doing the sort-merge join locally. I believe this is the way GPDB is doing it but worth checking this to see if helps so solve this challenge.


was (Author: dsdinter):
I think the point here is to minimise the shuffling produced by the skewed key, i.e. one executor receives all tuples based on the skewed key.
As the groupBy is not using information about partitioning and ii's known to be not optimal anyway, it makes sense to focus on improving this.

The point is how other MPP engines are handling skewed data? As an example Greenplum just partitions based on your distribution key and leverages storage rather than memory to proces the queries. Kind of brute force approach, broadcasting the other tables assumed smaller and doing the sort-merge join locally. I believe this is the way GPDB is doing it but worth checking this to see if helps so solve this challenge.

> Implement skewed join
> ---------------------
>
>                 Key: SPARK-4644
>                 URL: https://issues.apache.org/jira/browse/SPARK-4644
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>         Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have several books which are liked by most of the users. Running ALS on such skewed data will raise a OutOfMemory error, if some book has too many users which cannot be fit into memory. To solve it, we propose a skewed join implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org