Posted to issues@spark.apache.org by "Patrick Wendell (JIRA)" <ji...@apache.org> on 2014/12/01 21:38:14 UTC
[jira] [Comment Edited] (SPARK-4644) Implement skewed join

    [ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230390#comment-14230390 ] 

Patrick Wendell edited comment on SPARK-4644 at 12/1/14 8:37 PM:
-----------------------------------------------------------------

I would push back a bit on what you said about groupByKey [~sdkfjslakdfj]. I think solving groupByKey is pretty important; it's probably the most common user frustration with Spark. There are cases where the user is streaming through the data (e.g. they are doing groupByKey and then writing the results out to HDFS, or persisting them at the DISK_ONLY storage level). There are also cases where it's hard for them to significantly reduce the amount of data, and even cases where they really should be doing reduceByKey but just don't. So I wouldn't rule out solving this in a nice way across all of our operators that, in the current architecture, suffer from this issue.
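
To make the groupByKey versus reduceByKey distinction concrete, here is a minimal sketch in plain Python. It does not use Spark: the helpers group_by_key and reduce_by_key and the "hot"/"cold" sample data are hypothetical stand-ins that only mimic the shape of the two operators. The point it illustrates is that grouping has to hold every value for a key at once, while reducing keeps a single running result per key.

```python
from collections import defaultdict
from operator import add


def group_by_key(pairs):
    """Collect every value for a key into one list (groupByKey-like shape).

    Hypothetical stand-in, not the Spark API.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, func):
    """Fold values into one running result per key (reduceByKey-like shape).

    Hypothetical stand-in, not the Spark API.
    """
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


# Assumed skewed input: one "hot" key carries almost all of the records.
pairs = [("hot", 1)] * 1000 + [("cold", 1)] * 3

# Grouping first materializes a 1000-element list for the hot key.
grouped = group_by_key(pairs)
largest_group = max(len(values) for values in grouped.values())
sums_via_group = {key: sum(values) for key, values in grouped.items()}

# Reducing holds only one number per key, however skewed the input is.
sums_via_reduce = reduce_by_key(pairs, add)
```

Both paths produce the same per-key sums here; the difference is how much per-key state is held along the way, which is where a heavily skewed key hurts.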


was (Author: pwendell):
I would push back a bit on what you said about groupByKey [~sdkfjslakdfj]. I think solving groupByKey is pretty important, it's probably the most common user frustration with Spark. There are cases where the user is streaming through the data (e.g. they are doing groupByKey and then writing results out to HDFS or DISK_ONLY persistence level). Or cases where it's hard for them to significantly reduce the amount of data. So I wouldn't rule out solving this in a nice way across all of our operators that, in the current architecture, suffer from this issue.

> Implement skewed join
> ---------------------
>
>                 Key: SPARK-4644
>                 URL: https://issues.apache.org/jira/browse/SPARK-4644
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>         Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have several books that are liked by most of its users. Running ALS on such skewed data will raise an OutOfMemory error if some book has too many users to fit into memory. To solve this, we propose a skewed join implementation.


