Posted to issues@spark.apache.org by "Justin Pihony (JIRA)" <ji...@apache.org> on 2016/03/08 16:14:41 UTC
[jira] [Created] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages
Justin Pihony created SPARK-13744:
-------------------------------------
Summary: Dataframe RDD caching increases the input size for subsequent stages
Key: SPARK-13744
URL: https://issues.apache.org/jira/browse/SPARK-13744
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.0
Environment: OSX
Reporter: Justin Pihony
Priority: Minor
Given the code below, the first run of count reports an input size of ~90KB, and even the first run after cache() is set reports the same input size. Every subsequent run, however, reports a MUCH larger input size (500MB, listed as 38% for a default run). This size discrepancy appears to be a bug in the caching of a DataFrame's RDD, as far as I can see.
{code}
import sqlContext.implicits._

case class Person(name: String = "Test", number: Double = 1000.2)

// Write 10 million rows across 50 partitions to Parquet
val people = sc.parallelize(1 to 10000000, 50).map { _ => Person() }.toDF
people.write.parquet("people.parquet")

val parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.rdd.count()  // first run: input size ~90KB
parquetFile.rdd.cache()
parquetFile.rdd.count()  // same input size as the first run
parquetFile.rdd.count()  // input size jumps (~500MB for a default run)
{code}
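A possible contributing factor (an assumption on my part, not confirmed in this ticket): each call to {{DataFrame.rdd}} may construct a fresh RDD, so the {{cache()}} call above marks one RDD instance while the later counts run against different instances. A sketch of a variant that holds a single RDD reference, written against the Spark 1.6 API and assuming the same {{sc}} and {{sqlContext}} as in the reproduction above:

{code}
// Sketch against the Spark 1.6 API; assumes an existing SparkContext `sc`
// and SQLContext `sqlContext`, and the "people.parquet" file written above.
val parquetFile = sqlContext.read.parquet("people.parquet")

// Capture DataFrame.rdd once: if each .rdd call builds a new RDD, then
// caching one instance and counting another would bypass the cache entirely.
val rdd = parquetFile.rdd
rdd.cache()
rdd.count()  // first action materializes the cached partitions
rdd.count()  // subsequent actions should read from the cache
{code}

If the input sizes stabilize under this variant, that would support the new-RDD-per-call explanation rather than a defect in the cache itself.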
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org