Posted to issues@spark.apache.org by "Glenn Strycker (JIRA)" <ji...@apache.org> on 2015/10/28 20:19:27 UTC

[jira] [Created] (SPARK-11387) minimize shuffles during joins by using existing partitions and bundling messages

Glenn Strycker created SPARK-11387:
--------------------------------------

             Summary: minimize shuffles during joins by using existing partitions and bundling messages
                 Key: SPARK-11387
                 URL: https://issues.apache.org/jira/browse/SPARK-11387
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Glenn Strycker


Currently, an RDD join in Spark requires repartitioning the inputs by the join key (for large RDDs that cannot be broadcast).
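
For illustration, a minimal sketch of the current behavior using the Scala RDD API; the file names, key types, and partition count are made up, and "sc" is assumed to be an existing SparkContext:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Two large pair RDDs with no known partitioner.
val left: RDD[(Long, String)] = sc.textFile("left.tsv")
  .map { line => val f = line.split("\t"); (f(0).toLong, f(1)) }
val right: RDD[(Long, String)] = sc.textFile("right.tsv")
  .map { line => val f = line.split("\t"); (f(0).toLong, f(1)) }

// Neither side has a partitioner, so this join hashes every row of both RDDs
// by key and shuffles it to whichever node owns that key.
val joined = left.join(right)

// Pre-partitioning both sides with the same partitioner lets later joins reuse
// that layout, but partitionBy is itself a shuffle, and every row for a hot key
// still lands in a single partition.
val partitioner = new HashPartitioner(200)
val coPartitionedJoin =
  left.partitionBy(partitioner).join(right.partitionBy(partitioner))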

This is very bad for highly skewed data, since every row containing a particular key ends up on a single node.

Additionally, repartitioning is expensive, and the existing partitioning scheme may already have been optimized to minimize message passing.  For example, an RDD might be the edge list of a graph that the user has already partitioned by community structure or connected components, ensuring that related edges sit on the same partition.  Using a join to perform message passing then requires repartitioning the edge list keyed by the first or second vertex of each edge, discarding that layout.
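
A hedged sketch of that edge-list scenario; the Edge class, the community field, and the partition count are invented for illustration, and "sc" is an existing SparkContext:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

case class Edge(src: Long, dst: Long, community: Int)

val rawEdges: RDD[Edge] = sc.parallelize(Seq(
  Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(4L, 5L, 1)))

// Edges laid out so that edges of the same community share a partition.
val edgesByCommunity = rawEdges
  .map(e => (e.community, e))
  .partitionBy(new HashPartitioner(200))

// Per-vertex state that messages are joined against.
val vertexState: RDD[(Long, Double)] =
  sc.parallelize(Seq((1L, 1.0), (2L, 0.5), (3L, 0.25)))

// To pass messages along edges with a join, the edge list has to be re-keyed
// by vertex id, which throws away the community layout and shuffles it again.
val messages = edgesByCommunity
  .map { case (_, e) => (e.src, e) }
  .join(vertexState)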

Instead of repartitioning and shuffling, could messages across partitions be "bundled" together and passed once, almost like a broadcast operation?

Essentially the request here is to treat ALL RDDs of any size as broadcast-capable, where each partition would be broadcast one at a time and the results aggregated.  It would be up to the user to optimize the partitioning to minimize the between-partition message-passing volume.
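
A rough user-level approximation of the idea, built only from existing operations (collect, broadcast, mapPartitions); the function name partitionwiseBroadcastJoin is hypothetical, and this sketch pulls each partition through the driver, which an engine-level version would presumably avoid:

import org.apache.spark.rdd.RDD

// Broadcast "other" one partition at a time and join the big RDD against each
// broadcast piece map-side, then union the per-piece results.  Assumes "other"
// has at least one partition and unique keys (toMap keeps one value per key).
def partitionwiseBroadcastJoin[K, V, W](
    big: RDD[(K, V)], other: RDD[(K, W)]): RDD[(K, (V, W))] = {
  val pieces = (0 until other.partitions.length).map { i =>
    // Pull a single partition of "other" to the driver and broadcast it.
    val piece = other
      .mapPartitionsWithIndex((idx, it) => if (idx == i) it else Iterator.empty)
      .collect()
      .toMap
    val bc = big.sparkContext.broadcast(piece)
    // Map-side hash join of "big" against this one broadcast partition,
    // without repartitioning "big" at all.
    big.mapPartitions { it =>
      it.flatMap { case (k, v) => bc.value.get(k).map(w => (k, (v, w))) }
    }
  }
  pieces.reduce(_ union _)
}

An engine-level version of this would presumably ship each bundled partition directly between executors rather than routing it through the driver.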



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org