Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2014/08/20 01:43:19 UTC
[jira] [Created] (SPARK-3141) sortByKey() break take()
Davies Liu created SPARK-3141:
---------------------------------
Summary: sortByKey() break take()
Key: SPARK-3141
URL: https://issues.apache.org/jira/browse/SPARK-3141
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.1.0
Reporter: Davies Liu
Priority: Blocker
https://github.com/apache/spark/pull/1898/files#r16449470
I think there might be two unintended side effects of this change. This code used to work in PySpark:
sc.parallelize([5,3,4,2,1]).map(lambda x: (x,x)).sortByKey().take(1)
Now it fails with the error:
File "<...>/spark/python/pyspark/rdd.py", line 1023, in takeUpToNumLeft
yield next(iterator)
TypeError: list object is not an iterator
Changing mapFunc and sort back to generators rather than regular functions fixes that problem.
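The failure can be reproduced outside Spark with a minimal sketch (hypothetical helper names; the real code lives in pyspark/rdd.py): take() calls next() on whatever the partition-mapping function returned, which only works if that function is a generator.

```python
# Minimal illustration (not the actual Spark code) of why take() breaks
# when the partition-mapping function returns a list instead of a generator.

def take_up_to(iterator, n):
    # Mirrors takeUpToNumLeft: calls next() directly on its argument.
    taken = 0
    while taken < n:
        yield next(iterator)   # TypeError if `iterator` is a plain list
        taken += 1

def map_func_list(iterator):
    # Regular function returning a list -- next() on its result fails,
    # because a list is iterable but not itself an iterator.
    return sorted(iterator)

def map_func_gen(iterator):
    # Generator version -- its result supports next().
    for item in sorted(iterator):
        yield item

data = [(5, 5), (3, 3)]
# list(take_up_to(map_func_list(data), 1))  # raises TypeError
list(take_up_to(map_func_gen(data), 1))     # works: [(3, 3)]
```

This matches the traceback above: the yield next(iterator) line in takeUpToNumLeft is handed a list once mapFunc and sort become regular functions.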
After making that change, a second side effect appears due to the removal of flatMap: with the default partitioning scheme, the code above returns the following unexpected result:
[[(1, 1), (2, 2)]]
Removing sortByKey, e.g.:
sc.parallelize([5,3,4,2,1]).map(lambda x: (x,x)).take(1)
returns the expected result [(5, 5)]. Restoring the call to flatMap resolves this as well.
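The second symptom can also be sketched in plain Python (an illustrative stand-in, not Spark internals): without a flatMap-style flatten, each partition's sorted list is treated as a single element, so take(1) returns a whole partition instead of one key-value pair.

```python
# Illustrative sketch: partition results for the sorted RDD,
# assuming keys 1..5 end up spread across three partitions.
partitions = [[(1, 1), (2, 2)], [(3, 3), (4, 4)], [(5, 5)]]

# Without flattening, the first "element" is an entire partition's list.
no_flatten = partitions[:1]    # [[(1, 1), (2, 2)]]

# With a flatMap-style flatten, elements are the key-value pairs.
flattened = [kv for part in partitions for kv in part]
with_flatten = flattened[:1]   # [(1, 1)]
```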
--
This message was sent by Atlassian JIRA
(v6.2#6252)