Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2015/03/16 19:50:53 UTC
partitionBy not working with HashPartitioner
Here's my use case:
I read an array into an RDD and I use a hash partitioner to partition the RDD.
This is the array type: Array[(String, Iterable[(Long, Int)])]
topK: Array[(String, Iterable[(Long, Int)])] = ...
import org.apache.spark.HashPartitioner
val hashPartitioner = new HashPartitioner(10)
// Note: saveAsTextFile returns Unit, so the save is done as a separate step.
val resultRdd = sc.parallelize(topK).partitionBy(hashPartitioner).sortByKey()
resultRdd.saveAsTextFile(fileName)
I also tried
val resultRdd = sc.parallelize(topK, 10).sortByKey()
resultRdd.saveAsTextFile(fileName)
The results:
I do get 10 partitions. However, the first output file always contains the data for the first 2 keys in the RDD, each following file contains the data for exactly 1 key (as expected), and the last file is empty, since the first file absorbed 2 keys.
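For context, HashPartitioner places a key into partition key.hashCode modulo numPartitions (made non-negative), so nothing guarantees one key per partition: two keys whose hash codes collide modulo 10 share a partition, and some other partition stays empty. A minimal sketch of that computation, runnable without a SparkContext (nonNegativeMod mirrors Spark's internal helper; the sample keys are hypothetical stand-ins for the RDD's String keys):

```scala
// Sketch of the index computation HashPartitioner performs per key:
//   partition = nonNegativeMod(key.hashCode, numPartitions)
// (mirrors org.apache.spark.util.Utils.nonNegativeMod)
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val numPartitions = 10
// Hypothetical keys standing in for the RDD's actual String keys.
val keys = Seq("k1", "k2", "k3")
val byPartition = keys.groupBy(k => nonNegativeMod(k.hashCode, numPartitions))
// If two keys' hash codes collide modulo numPartitions, they end up
// grouped under the same partition index, leaving another index empty.
byPartition.foreach { case (p, ks) => println(s"partition $p -> $ks") }
```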
The Question:
How can I make Spark write 1 file per key? Is the behaviour I'm currently seeing a bug?
-Adrian