Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2014/08/20 01:43:19 UTC

[jira] [Created] (SPARK-3141) sortByKey() breaks take()

Davies Liu created SPARK-3141:
---------------------------------

             Summary: sortByKey() breaks take()
                 Key: SPARK-3141
                 URL: https://issues.apache.org/jira/browse/SPARK-3141
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.1.0
            Reporter: Davies Liu
            Priority: Blocker


https://github.com/apache/spark/pull/1898/files#r16449470

I think there might be two unintended side effects of this change. This code used to work in PySpark:

sc.parallelize([5,3,4,2,1]).map(lambda x: (x,x)).sortByKey().take(1)
Now it fails with the error:

File "<...>/spark/python/pyspark/rdd.py", line 1023, in takeUpToNumLeft
    yield next(iterator)
TypeError: list object is not an iterator
Changing mapFunc and sort back to generators rather than regular functions fixes that problem.
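The failure can be reproduced without Spark. Below is a minimal sketch (plain Python, with a simplified stand-in I call take_up_to for PySpark's takeUpToNumLeft helper, which is an assumption about its shape): take() pulls elements with next(iterator), which requires an iterator, so passing the list returned by a regular function raises the TypeError above, while a generator works.

```python
def take_up_to(n, iterator):
    # Simplified stand-in for the takeUpToNumLeft helper in
    # pyspark/rdd.py: pull at most n elements via next().
    taken = 0
    while taken < n:
        yield next(iterator)   # TypeError if `iterator` is a plain list
        taken += 1

data = [(1, 1), (2, 2), (3, 3)]

# A list is iterable but is not itself an iterator, so next() fails.
try:
    list(take_up_to(1, data))
except TypeError as e:
    print(e)  # e.g. "list object is not an iterator" (wording varies by Python version)

# A generator (or any iterator) works, which is why restoring the
# generator versions of mapFunc and sort fixes the failure.
gen = (kv for kv in data)
print(list(take_up_to(1, gen)))  # [(1, 1)]
```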

After making that change, a second side effect appears, caused by the removal of flatMap: under the default partitioning scheme, the above code returns the following unexpected result:

[[(1, 1), (2, 2)]]
Removing sortByKey, e.g.:

sc.parallelize([5,3,4,2,1]).map(lambda x: (x,x)).take(1)
returns the expected result [(5, 5)]. Restoring the call to flatMap resolves this as well.
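The nesting effect can also be sketched without Spark (the partition contents below are an assumed illustration of range-partitioned keys, not actual Spark output): if each partition's sorted contents are kept as one element instead of being flattened via flatMap, take(1) returns the first partition's entire list rather than the first key-value pair.

```python
# Hypothetical partition contents after range-partitioning by key.
partitions = [[(2, 2), (1, 1)], [(4, 4), (3, 3), (5, 5)]]

# Without flatMap: each partition's sorted list stays a single element,
# so taking one element yields a nested list.
no_flatmap = [sorted(p) for p in partitions]
print(no_flatmap[:1])  # [[(1, 1), (2, 2)]] -- the unexpected result above

# With flatMap: partition contents are flattened into individual
# elements, so taking one element yields a single key-value pair.
flatmapped = [kv for p in partitions for kv in sorted(p)]
print(flatmapped[:1])  # [(1, 1)]
```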





--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org