Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2019/06/11 09:48:00 UTC

[jira] [Commented] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner

    [ https://issues.apache.org/jira/browse/SPARK-25360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860756#comment-16860756 ] 

zhengruifeng commented on SPARK-25360:
--------------------------------------

[~holdenk] I am afraid it is not doable to add a partitioner to the {{RDD[Long]}} generated by {{sc.range}}, referring to the definition of Partitioner:
{code:scala}
/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 *
 * Note that, partitioner must be deterministic, i.e. it must return the same partition id given
 * the same partition key.
 */
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int   // maps a key (not a position) to a partition id
}
{code}
Since the returned {{RDD[Long]}} is not a {{PairRDD}}, downstream ops (like join and sort) cannot utilize an upstream partitioner.
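
For illustration, a minimal sketch in spark-shell style (assumes an existing SparkContext {{sc}}; {{HashPartitioner}} is used only as an example):
{code:scala}
// A plain RDD[Long] from sc.range carries no partitioner: there is no key to map to a partition.
val r = sc.range(0L, 1000L, step = 1, numSlices = 4)
println(r.partitioner)       // None

// Only after keying the data can a partitioner be attached and propagated to key-based ops.
val keyed = r.map(i => (i, i)).partitionBy(new org.apache.spark.HashPartitioner(4))
println(keyed.partitioner)   // Some(org.apache.spark.HashPartitioner@...)
{code}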

 

An alternative is to add a method like {{sc.tabulate[T](start, end, step, numSlices)(f: Long => T)}}, so that the partitioner can be used by downstream ops; a rough sketch is below.
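
A rough user-side sketch of that idea (the name {{tabulate}} and its signature are hypothetical, not an existing Spark API; a real change would live inside SparkContext and derive the partition bounds directly from start/end/step):
{code:scala}
import scala.reflect.ClassTag
import org.apache.spark.{RangePartitioner, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper, for illustration only: keys each element by its index so the
// result is a pair RDD that can carry a partitioner reusable by later joins/sorts.
def tabulate[T: ClassTag](sc: SparkContext)(
    start: Long, end: Long, step: Long, numSlices: Int)(f: Long => T): RDD[(Long, T)] = {
  val keyed = sc.range(start, end, step, numSlices).map(i => (i, f(i)))
  // RangePartitioner samples the data to find bounds; a built-in sc.tabulate could
  // compute the same bounds analytically from (start, end, step) and avoid the shuffle.
  keyed.partitionBy(new RangePartitioner(numSlices, keyed))
}

// e.g. tabulate(sc)(0L, 1000L, 1L, 4)(i => i * i) yields an RDD[(Long, Long)] whose
// partitioner a later join or sortByKey can reuse.
{code}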

> Parallelized RDDs of Ranges could have known partitioner
> --------------------------------------------------------
>
>                 Key: SPARK-25360
>                 URL: https://issues.apache.org/jira/browse/SPARK-25360
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: holdenk
>            Priority: Trivial
>
> We already have the logic to split up the generator, we could expose the same logic as a partitioner. This would be useful when joining a small parallelized collection with a larger collection and other cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org