Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2015/03/16 19:50:53 UTC
partitionBy not working with HashPartitioner
Here's my use case:
I read an array into an RDD and I use a hash partitioner to partition the RDD.
This is the array type: Array[(String, Iterable[(Long, Int)])]
topK: Array[(String, Iterable[(Long, Int)])] = ...
import org.apache.spark.HashPartitioner
val hashPartitioner = new HashPartitioner(10)
// Note: saveAsTextFile returns Unit, so the save is done as a separate step.
val resultRdd = sc.parallelize(topK).partitionBy(hashPartitioner).sortByKey()
resultRdd.saveAsTextFile(fileName)
I also tried
val resultRdd = sc.parallelize(topK, 10).sortByKey()
resultRdd.saveAsTextFile(fileName)
The results:
I do get 10 partitions. However, the first output file always contains the data for the first 2 keys in the RDD, each following file contains the data for exactly 1 key (as expected), and the last file is empty, since the first file absorbed 2 keys.
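For context, HashPartitioner places a key into partition key.hashCode modulo numPartitions (made non-negative), so nothing guarantees one key per partition: two keys whose hash codes collide modulo 10 share a partition, and some other partition stays empty. A minimal sketch of that computation, runnable without a SparkContext (nonNegativeMod mirrors Spark's internal helper; the sample keys are hypothetical stand-ins for the RDD's String keys):

```scala
// Sketch of the index computation HashPartitioner performs per key:
//   partition = nonNegativeMod(key.hashCode, numPartitions)
// (mirrors org.apache.spark.util.Utils.nonNegativeMod)
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val numPartitions = 10
// Hypothetical keys standing in for the RDD's actual String keys.
val keys = Seq("k1", "k2", "k3")
val byPartition = keys.groupBy(k => nonNegativeMod(k.hashCode, numPartitions))
// If two keys' hash codes collide modulo numPartitions, they end up
// grouped under the same partition index, leaving another index empty.
byPartition.foreach { case (p, ks) => println(s"partition $p -> $ks") }
```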
The Question:
How can I make Spark write 1 file per key? Is the behaviour I'm currently seeing a bug?
-Adrian