Posted to issues@spark.apache.org by "Brad Willard (JIRA)" <ji...@apache.org> on 2014/12/06 18:17:12 UTC

[jira] [Created] (SPARK-4778) PySpark Json and groupByKey broken

Brad Willard created SPARK-4778:
-----------------------------------

             Summary: PySpark Json and groupByKey broken
                 Key: SPARK-4778
                 URL: https://issues.apache.org/jira/browse/SPARK-4778
             Project: Spark
          Issue Type: Bug
          Components: EC2, PySpark
    Affects Versions: 1.1.1
         Environment: ec2 cluster launched from ec2 script
pyspark
c3.2xlarge 6 nodes
hadoop major version 1
            Reporter: Brad Willard


When I run a groupByKey, it seems to create a single task after the groupByKey stage that never stops executing. I'm loading a smallish JSON dataset of about 4 million records. This is the code I'm running:

# sql_context is an existing SQLContext; hdfs_uri points to the JSON dataset
rdd = sql_context.jsonFile(hdfs_uri)
rdd = rdd.cache()

# group the rows by id into 160 partitions
grouped = rdd.map(lambda row: (row.id, row)).groupByKey(160)

grouped.take(1)

The groupByKey stage takes a few minutes, which I'd expect. However, the take operation never completes; it hangs indefinitely.

This is what it looks like in the UI:
http://cl.ly/image/2k1t3I253T0x

The only workaround I have at the moment is to run a map operation right after loading from JSON to convert all the Row objects into Python dictionaries. Things work after that, although the map operation is expensive. Roughly, the workaround looks like the sketch below.
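A minimal sketch of that workaround (assuming Row.asDict() is available; on older PySpark versions the dict may need to be built from the row's fields by hand):

# Convert each Row to a plain Python dict before grouping,
# so the groupByKey/take pipeline no longer hangs.
rdd = sql_context.jsonFile(hdfs_uri)
dict_rdd = rdd.map(lambda row: row.asDict())

grouped = dict_rdd.map(lambda d: (d['id'], d)).groupByKey(160)
grouped.take(1)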



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org