You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by sujeet jog <su...@gmail.com> on 2016/09/12 16:44:39 UTC
Partition n keys into exacly n partitions
Hi,
Is there a way to partition set of data with n keys into exactly n
partitions.
For ex : -
tuple of 1008 rows with key as x
tuple of 1008 rows with key as y and so on total 10 keys ( x, y etc )
Total records = 10080
NumOfKeys = 10
i want to partition the 10080 elements into exactly 10 partitions with each
partition having elements with unique key
Is there a way to make this happen ?.. any ideas on implementing custom
partitioner.
The current partitioner i'm using is HashPartitioner from which there are
cases where key.hascode() % numPartitions for keys of x & y become same.
hence many elements with different keys fall into single partition at
times.
Thanks,
Sujeet
Re: Partition n keys into exacly n partitions
Posted by Christophe Préaud <ch...@kelkoo.com>.
Hi,
A custom partitioner is indeed the solution.
Here is a sample code:
import org.apache.spark.Partitioner
class KeyPartitioner(keyList: Seq[Any]) extends Partitioner {
def numPartitions: Int = keyList.size + 1
def getPartition(key: Any): Int = keyList.indexOf(key) + 1
override def equals(other: Any): Boolean = other match {
case h: KeyPartitioner =>
h.numPartitions == numPartitions
case _ =>
false
}
override def hashCode: Int = numPartitions
}
It allows to repartition a RDD[(K, V)] so that all lines with the same
key value (and only those lines) will be on the same partition.
You need to pass as parameter to the constructor a Seq[K] keyList containing all the possible values for the keys in the RDD[(K, V)], e.g.:
val rdd = sc.parallelize(
Seq((1,'a),(2,'a),(3,'a),(1,'b),(2,'b),(1,'c),(3,'c),(4,'d))
)
rdd.partitionBy(new KeyPartitioner(Seq(1,2,3,4)))
will put:
- (1,'a) (1,'b) and (1,'c) in partition 1
- (2,'a) and (2,'b) in partition 2
- (3,'a) and (3,'c) in partition 3
- (4,'d) in partition 4
and nothing in partition 0
If a key is not defined in the keyList, it will be put in partition 0:
rdd.partitionBy(new KeyPartitioner(Seq(1,2,3)))
will put:
- (1,'a) (1,'b) and (1,'c) in partition 1
- (2,'a) and (2,'b) in partition 2
- (3,'a) and (3,'c) in partition 3
- (4,'d) in partition 0
Please let me know if it fits your needs.
Regards,
Christophe.
On 12/09/16 19:03, Denis Bolshakov wrote:
Just provide own partitioner.
One I wrote a partitioner which keeps similar keys together in one partitioner.
Best regards,
Denis
On 12 September 2016 at 19:44, sujeet jog <su...@gmail.com>> wrote:
Hi,
Is there a way to partition set of data with n keys into exactly n partitions.
For ex : -
tuple of 1008 rows with key as x
tuple of 1008 rows with key as y and so on total 10 keys ( x, y etc )
Total records = 10080
NumOfKeys = 10
i want to partition the 10080 elements into exactly 10 partitions with each partition having elements with unique key
Is there a way to make this happen ?.. any ideas on implementing custom partitioner.
The current partitioner i'm using is HashPartitioner from which there are cases where key.hascode() % numPartitions for keys of x & y become same.
hence many elements with different keys fall into single partition at times.
Thanks,
Sujeet
--
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.denis@gmail.com<ma...@gmail.com>
________________________________
Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris
Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
Re: Partition n keys into exacly n partitions
Posted by Denis Bolshakov <bo...@gmail.com>.
Just provide own partitioner.
One I wrote a partitioner which keeps similar keys together in one
partitioner.
Best regards,
Denis
On 12 September 2016 at 19:44, sujeet jog <su...@gmail.com> wrote:
> Hi,
>
> Is there a way to partition set of data with n keys into exactly n
> partitions.
>
> For ex : -
>
> tuple of 1008 rows with key as x
> tuple of 1008 rows with key as y and so on total 10 keys ( x, y etc )
>
> Total records = 10080
> NumOfKeys = 10
>
> i want to partition the 10080 elements into exactly 10 partitions with
> each partition having elements with unique key
>
> Is there a way to make this happen ?.. any ideas on implementing custom
> partitioner.
>
>
> The current partitioner i'm using is HashPartitioner from which there are
> cases where key.hascode() % numPartitions for keys of x & y become same.
>
> hence many elements with different keys fall into single partition at
> times.
>
>
>
> Thanks,
> Sujeet
>
--
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.denis@gmail.com