Posted to user@spark.apache.org by "Zhang, Liyun" <li...@intel.com> on 2016/09/06 08:13:04 UTC
How to make the result of sortByKey distributed evenly?
Hi all:
I have a question about RDD.sortByKey
val n=20000
val sorted=sc.parallelize(2 to n).map(x=>(x/n,x)).sortByKey()
sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")
sc.parallelize(2 to n).map(x=>(x/n,x)) generates pairs like [(0,2),(0,3),...,(0,19999),(1,20000)], so the key is skewed.
I expected the result of sortByKey to be distributed evenly, but when I looked at the output, part-00000 is large and part-00001 is small.
hadoop fs -ls /SkewedGroupByTest/
16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r-- 1 root supergroup 0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
-rw-r--r-- 1 root supergroup 188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
-rw-r--r-- 1 root supergroup 10 2016-09-06 03:21 /SkewedGroupByTest/part-00001
How can I get the result distributed evenly? I don't need records with the same key to end up in the same part-xxxxx file; I only need to guarantee that the data across part-xxxx0 ~ part-xxxxx is sorted.
Thanks for any help!
Kelly Zhang/Zhang,Liyun
Best Regards
Re: How to make the result of sortByKey distributed evenly?
Posted by Fridtjof Sander <fr...@googlemail.com>.
Your data has only two keys, and basically all values are assigned to
only one of them. sortByKey range-partitions on the key alone, so there
is no better way to distribute these keys than the one Spark executes.
What you have to do is use different keys to sort and range-partition
on. Try invoking sortBy() on a non-pair RDD. This will take both parts
of your data as the key to sort on. You can also set your tuple as the
key manually, and set a constant int or something as the value.
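A minimal sketch of both suggestions, assuming a standalone local run. The object name EvenSortSketch, the app name, the local[2] master and the two input partitions are illustrative choices, not part of the original question. In spark-shell you would reuse the provided sc instead of creating one. Because the whole (key, value) tuple becomes the sort key, the range partitioner has many distinct keys to split on, so the partition sizes should come out roughly even rather than exactly equal.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EvenSortSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local master for a standalone run; in spark-shell use the existing `sc`.
    val conf = new SparkConf().setAppName("EvenSortSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val n = 20000
    // Same skewed pairs as in the question: almost every record has key 0.
    val pairs = sc.parallelize(2 to n, 2).map(x => (x / n, x))

    // Option 1: sortBy with the whole tuple as the sort key.
    val sortedByTuple = pairs.sortBy(p => p)

    // Option 2: make the tuple the key, attach a constant dummy value,
    // sort by that composite key, then drop the dummy value again.
    val sortedByCompositeKey = pairs.map(p => (p, 0)).sortByKey().keys

    // Print the number of records in each output partition.
    sortedByTuple.glom().map(_.length).collect().zipWithIndex.foreach {
      case (size, i) => println(s"sortBy partition $i: $size records")
    }
    sortedByCompositeKey.glom().map(_.length).collect().zipWithIndex.foreach {
      case (size, i) => println(s"sortByKey partition $i: $size records")
    }

    sc.stop()
  }
}
```

Either result can then be written with saveAsTextFile as in the original code; the part files remain globally sorted, first by the original key and then by the value.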
On 06.09.16 at 10:13, Zhang, Liyun wrote:
>
> Hi all:
>
> I have a question about RDD.sortByKey
>
> val n=20000
> val sorted=sc.parallelize(2 to n).map(x=>(x/n,x)).sortByKey()
> sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")
>
> sc.parallelize(2 to n).map(x=>(x/n,x)) will generate pairs like
> [(0,2),(0,3),..,(0,19999),(1,20000)], the key is skewed.
>
> The result of sortByKey is expected to distributed evenly. But when I
> view the result and found that part-00000 is large and part-00001 is
> small.
>
> hadoop fs -ls /SkewedGroupByTest/
> 16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
> Found 3 items
> -rw-r--r-- 1 root supergroup 0 2016-09-06 03:21 /SkewedGroupByTest
> /_SUCCESS
> -rw-r--r-- 1 root supergroup 188878 2016-09-06 03:21
> /SkewedGroupByTest/part-00000
> -rw-r--r-- 1 root supergroup 10 2016-09-06 03:21
> /SkewedGroupByTest/part-00001
>
> How can I get the result distributed evenly? I don't need that the
> key in the part-xxxxx are same and only need to guarantee the data in
> part-xxxx0 ~ part-xxxxx is sorted.
>
> Thanks for any help!
>
> Kelly Zhang/Zhang,Liyun
>
> Best Regards
>