Posted to user@spark.apache.org by "Zhang, Liyun" <li...@intel.com> on 2016/09/06 08:13:04 UTC

How to make the result of sortByKey distributed evenly?

Hi all:
  I have a question about RDD.sortByKey.

val n = 20000
val sorted = sc.parallelize(2 to n).map(x => (x / n, x)).sortByKey()
sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")

sc.parallelize(2 to n).map(x => (x / n, x)) generates pairs like [(0,2), (0,3), ..., (0,19999), (1,20000)], so the key is heavily skewed.

I expected the result of sortByKey to be distributed evenly, but when I looked at the output I found that part-00000 is large and part-00001 is small.

hadoop fs -ls /SkewedGroupByTest/
16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r-- 1 root supergroup 0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
-rw-r--r-- 1 root supergroup 188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
-rw-r--r-- 1 root supergroup 10 2016-09-06 03:21 /SkewedGroupByTest/part-00001

How can I get evenly distributed output? I don't need all records with the same key to end up in the same part-xxxxx file; I only need the data to be sorted across part-00000 through part-xxxxx.


Thanks for any help!


Kelly Zhang/Zhang,Liyun
Best Regards


Re: How to make the result of sortByKey distributed evenly?

Posted by Fridtjof Sander <fr...@googlemail.com>.
Your data has only two keys, and basically all values are assigned to 
only one of them. Range partitioning places every record with the same 
key in the same partition, so there is no better way to distribute 
these keys than the one Spark executes.
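
To illustrate the point above, here is a minimal sketch in plain Scala (no Spark required) of why two keys cannot be spread evenly. The names `upperBounds` and `partitionFor` are hypothetical stand-ins for a range partitioner, and the bound value is an assumption; Spark derives its real bounds by sampling the keys.

```scala
// Hypothetical stand-in for a range partitioner with two partitions.
// Assumed bound: keys <= 0 go to partition 0, everything else to partition 1.
val upperBounds = Array(0)

// The partition is chosen from the key alone, so equal keys always share a partition.
def partitionFor(key: Int): Int = {
  val i = upperBounds.indexWhere(bound => key <= bound)
  if (i == -1) upperBounds.length else i
}

val n = 20000
val keys = (2 to n).map(x => x / n) // 0 for x in 2..19999, 1 for x = 20000

// Count how many records land in each partition.
val sizes = keys.groupBy(partitionFor).map { case (p, ks) => p -> ks.size }
println(sizes) // partition 0 holds 19998 records, partition 1 holds 1
```

However the bound is chosen, all 19998 records with key 0 stay together, which is consistent with the file sizes in the listing from the original post.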

What you have to do is sort and range-partition on different keys. 
Try invoking sortBy() on a non-pair RDD; this lets you use both parts 
of your data as the key to sort on. You can also set your tuple as the 
key manually, and use a constant int or something similar as the value.
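
The two suggestions above might look roughly like the following sketch. It assumes a spark-shell session or a local Spark installation; the application name, the `local[2]` master, and the choice of two partitions are illustrative assumptions, not part of the original thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context (e.g. in spark-shell) if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("even-sort-sketch").setMaster("local[2]"))

val n = 20000
val pairs = sc.parallelize(2 to n, 2).map(x => (x / n, x))

// Variant 1: sortBy with the whole (key, value) tuple as the sort key.
val sortedByTuple = pairs.sortBy(pair => pair, ascending = true, numPartitions = 2)

// Variant 2: make the tuple the key by hand, with a constant Int as the value,
// then drop the dummy value again after sorting.
val sortedByHand = pairs
  .map(pair => (pair, 1))
  .sortByKey(ascending = true, numPartitions = 2)
  .keys

// Records per partition. Range bounds come from sampling, so expect roughly
// equal sizes rather than an exact split.
val sizesByTuple = sortedByTuple.glom().map(_.length).collect()
val sizesByHand = sortedByHand.glom().map(_.length).collect()
println(sizesByTuple.mkString(", "))
println(sizesByHand.mkString(", "))
```

Because the sort key is now the full tuple, there are 19999 distinct keys to range-partition on instead of two, and the output stays globally sorted across part files. Either result can be written with saveAsTextFile as in the original post.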
