Posted to issues@spark.apache.org by "Brad Willard (JIRA)" <ji...@apache.org> on 2015/01/06 20:00:34 UTC

[jira] [Updated] (SPARK-5075) Memory Leak when repartitioning SchemaRDD or running queries in general

     [ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brad Willard updated SPARK-5075:
--------------------------------
    Summary: Memory Leak when repartitioning SchemaRDD or running queries in general  (was: Memory Leak when repartitioning SchemaRDD from JSON)

> Memory Leak when repartitioning SchemaRDD or running queries in general
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5075
>                 URL: https://issues.apache.org/jira/browse/SPARK-5075
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.2.0
>         Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge
>            Reporter: Brad Willard
>              Labels: ec2, json, parquet, pyspark, repartition, s3
>
> I'm trying to repartition a JSON dataset for better CPU utilization and save it in Parquet format for better performance. The JSON dataset is about 200 GB.
> from pyspark.sql import SQLContext
> sql_context = SQLContext(sc)
>
> # load the ~200 GB JSON dataset from S3, repartition it, and write it out as Parquet
> rdd = sql_context.jsonFile('s3c://some_path')
> rdd = rdd.repartition(256)
> rdd.saveAsParquetFile('hdfs://some_path')
> In Ganglia, when the dataset first loads it occupies about 200 GB of memory, which is expected. However, once the repartition starts, memory usage balloons to more than 2.5x that amount and is never released, so any subsequent operation fails with memory errors.
> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
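
For readers hitting the same pattern, a hedged workaround sketch (not from the reporter, and not verified on this cluster): repartitioning the raw JSON lines before parsing shuffles plain strings rather than parsed rows, which may keep peak memory closer to the input size. The paths below reuse the same placeholders as the report; whether this avoids the ballooning described above is an assumption.

    from pyspark.sql import SQLContext

    sql_context = SQLContext(sc)

    # Workaround sketch (assumed, untested here): repartition the raw text
    # lines first, then parse JSON, instead of shuffling the parsed SchemaRDD.
    raw = sc.textFile('s3c://some_path').repartition(256)
    rdd = sql_context.jsonRDD(raw)
    rdd.saveAsParquetFile('hdfs://some_path')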



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org