Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2015/04/10 22:53:14 UTC

[jira] [Resolved] (SPARK-5969) The pyspark.rdd.sortByKey always fills only two partitions when ascending=False.

     [ https://issues.apache.org/jira/browse/SPARK-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-5969.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.4.0

Issue resolved by pull request 4761
[https://github.com/apache/spark/pull/4761]

> The pyspark.rdd.sortByKey always fills only two partitions when ascending=False.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-5969
>                 URL: https://issues.apache.org/jira/browse/SPARK-5969
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.1
>         Environment: Linux, 64bit
>            Reporter: Milan Straka
>             Fix For: 1.4.0
>
>
> The pyspark.rdd.sortByKey always fills only two partitions when ascending=False -- the first one and the last one.
> Simple example sorting numbers 0..999 into 10 partitions in descending order:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=False, numPartitions=10).glom().map(len).collect()
> {code}
> returns the following partition sizes:
> {code}
> [469, 0, 0, 0, 0, 0, 0, 0, 0, 531]
> {code}
> When ascending=True, all works as expected:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=True, numPartitions=10).glom().map(len).collect()
> Out: [116, 96, 100, 87, 132, 101, 101, 95, 87, 85]
> {code}
> The problem is caused by line 565 in rdd.py:
> {code}
>         samples = sorted(samples, reverse=(not ascending), key=keyfunc)
> {code}
> That sorts the samples in descending order when ascending=False. However, the samples must always be in ascending order, because they are (after being subsampled down to the partition bounds) passed to a bisect_left call:
> {code}
>         def rangePartitioner(k):
>             p = bisect.bisect_left(bounds, keyfunc(k))
>             if ascending:
>                 return p
>             else:
>                 return numPartitions - 1 - p
> {code}
> As you can see, rangePartitioner already handles ascending=False by itself, so the fix for the whole problem is trivial: on line rdd.py:565, drop the reverse argument, leaving
> {{samples = sorted(samples, key=keyfunc)}}
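The mechanism described above can be reproduced outside Spark. The sketch below is a minimal standalone illustration, not Spark code: the partition count and boundary values are made up for this example, whereas Spark derives the real bounds by sampling the RDD. It shows why bisect_left requires ascending bounds, and how descending bounds produce exactly the two-partition symptom reported.

```python
import bisect

# Hypothetical bounds for illustration only; Spark computes these by sampling.
num_partitions = 10
ascending_bounds = [100, 200, 300, 400, 500, 600, 700, 800, 900]
descending_bounds = list(reversed(ascending_bounds))

def partition(key, bounds):
    # Mirrors the rangePartitioner logic quoted above, for ascending=False.
    p = bisect.bisect_left(bounds, key)
    return num_partitions - 1 - p

# Ascending bounds spread keys 0..999 over all 10 partitions.
good = {partition(k, ascending_bounds) for k in range(1000)}

# Descending bounds violate bisect_left's sortedness assumption: the binary
# search only ever lands at index 0 or len(bounds), so every key falls into
# the first or the last partition.
bad = {partition(k, descending_bounds) for k in range(1000)}

print(sorted(good))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(sorted(bad))   # [0, 9]
```

Because rangePartitioner already inverts the index when ascending=False, keeping the samples ascending is sufficient; no separate descending code path is needed.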



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org