Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2015/04/10 22:53:14 UTC
[jira] [Resolved] (SPARK-5969) The pyspark.rdd.sortByKey always
fills only two partitions when ascending=False.
[ https://issues.apache.org/jira/browse/SPARK-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen resolved SPARK-5969.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.4.0
Issue resolved by pull request 4761
[https://github.com/apache/spark/pull/4761]
> The pyspark.rdd.sortByKey always fills only two partitions when ascending=False.
> --------------------------------------------------------------------------------
>
> Key: SPARK-5969
> URL: https://issues.apache.org/jira/browse/SPARK-5969
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.2.1
> Environment: Linux, 64bit
> Reporter: Milan Straka
> Fix For: 1.4.0
>
>
> The pyspark.rdd.sortByKey always fills only two partitions when ascending=False -- the first one and the last one.
> Simple example sorting numbers 0..999 into 10 partitions in descending order:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=False, numPartitions=10).glom().map(len).collect()
> {code}
> returns the following partition sizes:
> {code}
> [469, 0, 0, 0, 0, 0, 0, 0, 0, 531]
> {code}
> When ascending=True, all works as expected:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=True, numPartitions=10).glom().map(len).collect()
> Out: [116, 96, 100, 87, 132, 101, 101, 95, 87, 85]
> {code}
> The problem is caused by line 565 of rdd.py:
> {code}
> samples = sorted(samples, reverse=(not ascending), key=keyfunc)
> {code}
> That sorts the samples in descending order if ascending=False. However, the samples must always be in ascending order, because (after being subsampled into the variable bounds) they are used in a bisect_left call, which assumes a list sorted in ascending order:
> {code}
> def rangePartitioner(k):
>     p = bisect.bisect_left(bounds, keyfunc(k))
>     if ascending:
>         return p
>     else:
>         return numPartitions - 1 - p
> {code}
> As you can see, rangePartitioner already handles ascending=False by itself, so the fix for the whole problem is trivial: just drop the reverse argument on line rdd.py:565, changing it to
> {{samples = sorted(samples, key=keyfunc)}}
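The effect described above can be reproduced without Spark. The following is only a minimal sketch, not the actual rdd.py code: the helper name partition_sizes, the flag sort_bounds_descending, and the way bounds are picked (evenly spaced from all keys rather than from a random sample) are assumptions made for illustration. Only the bisect_left lookup and the "numPartitions - 1 - p" flip mirror the snippet quoted in the report.

```python
import bisect


def partition_sizes(keys, num_partitions, ascending, sort_bounds_descending):
    """Count how many keys land in each partition (hypothetical helper)."""
    # Assumption: use every key as the "sample"; the real code samples randomly.
    samples = sorted(keys, reverse=sort_bounds_descending)
    # Pick num_partitions - 1 evenly spaced boundary keys from the samples.
    bounds = [samples[len(samples) * (i + 1) // num_partitions]
              for i in range(num_partitions - 1)]

    def range_partitioner(k):
        # bisect_left assumes bounds is sorted in ascending order.
        p = bisect.bisect_left(bounds, k)
        return p if ascending else num_partitions - 1 - p

    sizes = [0] * num_partitions
    for k in keys:
        sizes[range_partitioner(k)] += 1
    return sizes


keys = list(range(1000))
# Bounds sorted descending, as on the reported line: bisect_left misbehaves.
buggy = partition_sizes(keys, 10, ascending=False, sort_bounds_descending=True)
# Bounds always sorted ascending, as in the proposed fix.
fixed = partition_sizes(keys, 10, ascending=False, sort_bounds_descending=False)
print(buggy)
print(fixed)
```

Under these assumptions, the descending bounds send every key to either the first or the last partition, while the ascending bounds spread the keys roughly evenly across all ten. The exact sizes differ from the Spark output quoted above because no random sampling is involved here.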
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)