Posted to issues@spark.apache.org by "Justin Pihony (JIRA)" <ji...@apache.org> on 2016/03/08 16:14:41 UTC

[jira] [Created] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages

Justin Pihony created SPARK-13744:
-------------------------------------

             Summary: Dataframe RDD caching increases the input size for subsequent stages
                 Key: SPARK-13744
                 URL: https://issues.apache.org/jira/browse/SPARK-13744
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
         Environment: OSX
            Reporter: Justin Pihony
            Priority: Minor


With the code below, the first run of count reports an input size of ~90KB, and the next run (after cache() has been called) reports the same input size. Every subsequent run, however, reports a much larger input size (~500MB, listed as 38% cached, for a default run). This discrepancy appears to be a bug in the caching of a DataFrame's RDD as far as I can see.

{code}
import sqlContext.implicits._

case class Person(name: String = "Test", number: Double = 1000.2)

val people = sc.parallelize(1 to 10000000, 50).map { _ => Person() }.toDF

people.write.parquet("people.parquet")

val parquetFile = sqlContext.read.parquet("people.parquet")

parquetFile.rdd.count() // input size reported as ~90KB
parquetFile.rdd.cache()
parquetFile.rdd.count() // same input size as the first run
parquetFile.rdd.count() // input size now reported as ~500MB
{code}
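
For comparison, here is a minimal variant (assuming the same spark-shell session and the people.parquet file written by the snippet above) that caches the DataFrame itself rather than its RDD; checking whether the reported input size changes in the same way may help narrow down where the discrepancy comes from:

{code}
// Hypothetical comparison, assuming the same spark-shell session and the
// people.parquet file produced by the snippet above.
val parquetDF = sqlContext.read.parquet("people.parquet")

parquetDF.count() // baseline, uncached
parquetDF.cache() // cache the DataFrame instead of its RDD
parquetDF.count() // this count materializes the cache
parquetDF.count() // subsequent counts read from the in-memory cache
{code}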


