Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:25:25 UTC
[jira] [Updated] (SPARK-11387) minimize shuffles during joins by using existing partitions and bundling messages
[ https://issues.apache.org/jira/browse/SPARK-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-11387:
---------------------------------
Labels: bulk-closed (was: )
> minimize shuffles during joins by using existing partitions and bundling messages
> ---------------------------------------------------------------------------------
>
> Key: SPARK-11387
> URL: https://issues.apache.org/jira/browse/SPARK-11387
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Glenn Strycker
> Priority: Major
> Labels: bulk-closed
>
> Currently an RDD join in Spark requires repartitioning by the join key (for large RDDs that cannot use broadcast).
> This performs poorly on highly skewed data, because every row containing a particular key ends up on the same node.
> Additionally, repartitioning is expensive, and the existing partitioning scheme may have been optimized to minimize message passing. For example, perhaps an RDD is an edge list for a graph, but a user has already partitioned this data by a community structure or connected components, ensuring that similar edges are on the same partition. Using a join operation to perform message passing will require repartitioning the edge list by the first or second vertex in the edge as a key.
> Instead of repartitioning and shuffling, could messages across partitions be "bundled" together and passed once, almost like a broadcast operation?
> Essentially, the request here is to treat ALL RDDs of any size as broadcast-capable: each partition would be broadcast one at a time and the results aggregated. It would be up to the user to optimize the partitioning to minimize the volume of messages passed between partitions.
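
To make the request above concrete, here is a minimal sketch in plain Python. It is not Spark code and uses no Spark API. Partitions are modelled as ordinary lists of (key, value) pairs, and the names hash_partition, shuffle_join and bundled_join are hypothetical, chosen only for this illustration. The sketch contrasts a shuffle-style join, which moves every row to the partition selected by its key, with one possible reading of the "bundled" join, in which the left side keeps its existing partitioning and each right-side partition is passed whole to the left partitions, one at a time.

```python
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Shuffle-style repartitioning: each row moves to the partition picked by its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def build_index(rows):
    """Group the values of one partition by key so lookups are cheap."""
    index = defaultdict(list)
    for key, value in rows:
        index[key].append(value)
    return index


def shuffle_join(left_parts, right_parts, num_partitions):
    """Baseline: repartition both sides by key, then join co-located partitions."""
    left = hash_partition([r for p in left_parts for r in p], num_partitions)
    right = hash_partition([r for p in right_parts for r in p], num_partitions)
    out = []
    for lp, rp in zip(left, right):
        index = build_index(rp)
        for key, v in lp:
            for w in index.get(key, ()):
                out.append((key, (v, w)))
    return out


def bundled_join(left_parts, right_parts):
    """Hypothetical 'bundled' join: left rows never move.

    Each right partition is treated as one bundle, handed to every left
    partition in turn, and the matches are accumulated per left partition.
    """
    out_parts = [[] for _ in left_parts]
    for bundle in right_parts:            # one bundle "broadcast" at a time
        index = build_index(bundle)
        for i, lp in enumerate(left_parts):
            for key, v in lp:
                for w in index.get(key, ()):
                    out_parts[i].append((key, (v, w)))
    return out_parts


# Toy data (assumed for illustration): the left side is already partitioned
# by the user, for example by community, and key 1 is skewed.
left_parts = [[(1, "a"), (1, "b"), (2, "c")], [(3, "d"), (1, "e")]]
right_parts = [[(1, "x"), (3, "y")], [(2, "z"), (4, "q")]]

shuffled = shuffle_join(left_parts, right_parts, num_partitions=2)
bundled = bundled_join(left_parts, right_parts)

# Both strategies produce the same joined rows; only the data movement differs.
same_rows = sorted(shuffled) == sorted(r for p in bundled for r in p)
print(same_rows)
```

In this naive form every bundle visits every left partition, so the sketch does nothing to reduce traffic by itself. As the description says, any saving would have to come from the user's partitioning keeping most matching keys on a small number of partitions.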
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org