Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/05/04 21:28:06 UTC
[jira] [Commented] (SPARK-7342) Partitioner implementation that uses Int keys directly
[ https://issues.apache.org/jira/browse/SPARK-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527116#comment-14527116 ]
Sean Owen commented on SPARK-7342:
----------------------------------
Since {{Integer.hashCode()}} simply returns the integer's own value, I don't think there is value in implementing this separately: as you say, it is exactly what {{HashPartitioner}} already gives you. The simple accessor will be inlined straight away, so the overhead is virtually zero.
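To make the point concrete, here is an illustrative sketch (not Spark source code): for an Int key i, i.hashCode == i, so HashPartitioner's computation reduces to a plain modulus. The helper nonNegativeMod below mirrors the non-negative-modulo step HashPartitioner applies; for non-negative keys the result coincides with a "direct" i % numPartitions.

```scala
// Illustrative check, assuming the standard JVM Integer.hashCode contract.
// nonNegativeMod is a local stand-in for the wrap-to-non-negative step a
// hash partitioner needs; it is not copied from Spark.
object HashVsDirect {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    for (i <- 0 until 1000) {
      // Integer.hashCode() is the identity on the int value itself...
      assert(i.hashCode == i)
      // ...so hashing then taking the modulus equals the "direct" scheme
      // for non-negative keys.
      assert(nonNegativeMod(i.hashCode, numPartitions) == i % numPartitions)
    }
    println("ok")
  }
}
```

For negative keys the two differ only in sign handling: the non-negative modulo maps -1 into a valid partition, whereas a raw i % numPartitions would not.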
> Partitioner implementation that uses Int keys directly
> ------------------------------------------------------
>
> Key: SPARK-7342
> URL: https://issues.apache.org/jira/browse/SPARK-7342
> Project: Spark
> Issue Type: Question
> Components: Spark Core
> Reporter: Renat Bekbolatov
> Priority: Trivial
>
> I wanted to ask whether a partitioner implementation that directly uses integer keys would be useful.
> E.g. for an element (i, t) in RDD[(Int, T)], partition id would be (i % numPartitions).
> This can be useful when we want better control over partitioning: the key portion of a pair RDD simply communicates the partition id.
> Currently, HashPartitioner can be used for this, but having such a "direct" partitioner would let us skip the key object's hash computation and also prevent partition collisions (HashPartitioner uses key.hashCode % numPartitions), if that is desirable to the user.
> One use case is RDD.treeAggregate, where we already compute a partition id and put it into the key before the reduceByKey operation.
> Another possibility is that explicitly having such a "direct" partitioner might encourage developers to introduce more sophisticated communication patterns between executors.
> Here is a pull request that has a sketch of that: https://github.com/apache/spark/pull/5884
> This is an insignificant change. If we want to keep the core Spark Partitioner implementations lean, we can skip it; I am just throwing the idea out for discussion.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org