Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:25:25 UTC

[jira] [Updated] (SPARK-11387) minimize shuffles during joins by using existing partitions and bundling messages

     [ https://issues.apache.org/jira/browse/SPARK-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11387:
---------------------------------
    Labels: bulk-closed  (was: )

> minimize shuffles during joins by using existing partitions and bundling messages
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-11387
>                 URL: https://issues.apache.org/jira/browse/SPARK-11387
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Glenn Strycker
>            Priority: Major
>              Labels: bulk-closed
>
> Currently an RDD join in Spark requires repartitioning by the join key (for large RDDs that cannot use broadcast).
> This performs poorly on highly skewed data, because every row containing a particular key ends up on one node, so a single frequent key can overload that node.
> Additionally, repartitioning is expensive, and the existing partitioning scheme may already have been optimized to minimize message passing. For example, an RDD may be the edge list of a graph that the user has already partitioned by community structure or connected components, so that similar edges are on the same partition. Using a join operation to perform message passing will then require repartitioning the edge list with the first or second vertex of each edge as the key, which discards that partitioning.
> Instead of repartitioning and shuffling, could messages across partitions be "bundled" together and passed once, almost like a broadcast operation?
> Essentially, the request here is to treat ALL RDDs of any size as broadcast-capable: each partition would be broadcast one at a time and the results aggregated. It would be up to the user to optimize the partitioning to minimize the volume of between-partition message passing.
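
The proposal above can be illustrated with a minimal sketch in plain Python. It is only a simulation of the idea, not Spark code and not an existing Spark API: partitions are modelled as lists of (key, value) pairs, and the function name bundled_join and the sample data are hypothetical. Each partition of one dataset is treated as a "bundle" that is sent once to every partition of the other dataset, whose rows never move.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Pair = Tuple[Hashable, object]
Partition = List[Pair]


def bundled_join(left: List[Partition], right: List[Partition]) -> List[List[tuple]]:
    """Join two partitioned datasets without repartitioning `left`.

    Hypothetical helper: each partition of `right` is "broadcast" in turn
    to every partition of `left`, and matches are accumulated per left
    partition, so the partitioning of `left` is preserved.
    """
    results: List[List[tuple]] = [[] for _ in left]
    for bundle in right:  # one right-hand partition at a time
        # Index the bundle once so every left partition can reuse the lookup.
        lookup: Dict[Hashable, list] = defaultdict(list)
        for key, value in bundle:
            lookup[key].append(value)
        for i, part in enumerate(left):  # left rows stay where they are
            for key, lvalue in part:
                for rvalue in lookup.get(key, ()):
                    results[i].append((key, (lvalue, rvalue)))
    return results


# Sample data (made up): two partitions of edges, two partitions of messages.
edges = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
messages = [[(1, "x")], [(3, "y"), (1, "z")]]
joined = bundled_join(edges, messages)
```

In this sketch the skewed key 1 stays spread across both edge partitions instead of being collected on one node. The cost is that every bundle visits every partition, which is why the proposal leaves it to the user to choose a partitioning that keeps cross-partition traffic small.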



