Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/03/08 16:27:40 UTC

[jira] [Commented] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages

    [ https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185069#comment-15185069 ] 

Sean Owen commented on SPARK-13744:
-----------------------------------

90KB can't be right. You have a DF of 10 million objects. What are you referring to that says 90K?
I also see it take hundreds of MB in memory, as expected.
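
One quick way to check that in-memory footprint from the same spark-shell session is the storage info on the SparkContext. This is a minimal sketch, not part of the original thread; getRDDStorageInfo is a developer API, so the exact fields can vary by version, and it assumes the reproduction below has already been run so the cached RDD is materialized.

{code}
// Assumes the reproduction below has been run in this shell, so the RDD
// behind parquetFile has been cached and materialized by a count().
// sc.getRDDStorageInfo is a @DeveloperApi; memSize is reported in bytes.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize / (1024 * 1024)} MB in memory, " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}
{code}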

> Dataframe RDD caching increases the input size for subsequent stages
> --------------------------------------------------------------------
>
>                 Key: SPARK-13744
>                 URL: https://issues.apache.org/jira/browse/SPARK-13744
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: OSX
>            Reporter: Justin Pihony
>            Priority: Minor
>
> Given the code below, the first run of count shows an input size of ~90KB, and even the next run, with cache set, results in the same input size. However, every subsequent run results in an input size that is MUCH larger (500MB, listed as 38% for a default run). As far as I can see, this size discrepancy seems to be a bug in the caching of a DataFrame's RDD.
> {code}
> import sqlContext.implicits._
>
> case class Person(name: String = "Test", number: Double = 1000.2)
>
> // Write 10 million identical rows to parquet, then read them back.
> val people = sc.parallelize(1 to 10000000, 50).map { _ => Person() }.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
>
> // First count: the input size shows as ~90KB.
> parquetFile.rdd.count()
>
> // Cache the RDD behind the DataFrame, then count again.
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
>
> // Every subsequent count reports a much larger input size (~500MB).
> parquetFile.rdd.count()
> {code}
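
For comparison, a minimal sketch (not part of the original report) that caches the DataFrame itself rather than the RDD obtained from it, reading the same parquet output written above; this may help separate DataFrame-level caching behavior from the .rdd path described in the report.

{code}
// Same data as the reproduction above, cached at the DataFrame level.
val parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.cache()
parquetFile.count()   // first count populates the cache
parquetFile.count()   // later counts read from the cached data
{code}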


