Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/05/04 21:28:06 UTC
[jira] [Commented] (SPARK-7342) Partitioner implementation that uses Int keys directly
[ https://issues.apache.org/jira/browse/SPARK-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527116#comment-14527116 ]
Sean Owen commented on SPARK-7342:
----------------------------------
Since {{Integer.hashCode()}} simply returns the integer's own value, I don't think there is value in implementing this separately: as you say, it is exactly what {{HashPartitioner}} already gives you. The simple accessor will be inlined straight away, so the overhead is virtually zero.
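To make the point concrete, here is an illustrative sketch (not Spark source code): for an Int key i, i.hashCode == i, so HashPartitioner's computation reduces to a plain modulus. The helper nonNegativeMod below mirrors the non-negative-modulo step HashPartitioner applies; for non-negative keys the result coincides with a "direct" i % numPartitions.

```scala
// Illustrative check, assuming the standard JVM Integer.hashCode contract.
// nonNegativeMod is a local stand-in for the wrap-to-non-negative step a
// hash partitioner needs; it is not copied from Spark.
object HashVsDirect {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    for (i <- 0 until 1000) {
      // Integer.hashCode() is the identity on the int value itself...
      assert(i.hashCode == i)
      // ...so hashing then taking the modulus equals the "direct" scheme
      // for non-negative keys.
      assert(nonNegativeMod(i.hashCode, numPartitions) == i % numPartitions)
    }
    println("ok")
  }
}
```

For negative keys the two differ only in sign handling: the non-negative modulo maps -1 into a valid partition, whereas a raw i % numPartitions would not.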
> Partitioner implementation that uses Int keys directly
> ------------------------------------------------------
>
> Key: SPARK-7342
> URL: https://issues.apache.org/jira/browse/SPARK-7342
> Project: Spark
> Issue Type: Question
> Components: Spark Core
> Reporter: Renat Bekbolatov
> Priority: Trivial
>
> I wanted to ask whether a partitioner implementation that directly uses integer keys would be useful.
> E.g. for an element (i, t) in RDD[(Int, T)], partition id would be (i % numPartitions).
> This can be useful when we want better control over partitioning: the key portion of a pair RDD simply communicates the partition id.
> Currently, HashPartitioner can be used for this, but having such a "direct" partitioner would let us skip the key object's hash computation and also prevent partition collisions (HashPartitioner uses key.hashCode % numPartitions), if that is desirable to the user.
> One use case is RDD.treeAggregate, where we already compute a partition id and put it into the key before the reduceByKey operation.
> Another possibility is that explicitly having such a "direct" partitioner might encourage developers to introduce more sophisticated communication patterns between executors.
> Here is a pull request that has a sketch of that: https://github.com/apache/spark/pull/5884
> This is an insignificant change. If we want to keep the core Spark Partitioner implementations lean, we can skip it; I am just throwing the idea out for discussion.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org