Posted to user@spark.apache.org by Todd <bi...@163.com> on 2015/08/16 03:35:08 UTC

Can't understand the size of raw RDD and its DataFrame

Hi,
With the following code snippet, I cached the raw RDD (which is already in memory, but just for illustration) and its DataFrame.
I expected the df cache to take less space than the rdd cache, but that turns out to be wrong: in the UI I see that the rdd cache takes 168 B, while the df cache takes 272 B.
What data is cached when df.cache is called and an action actually materializes it? It looks like the df only cached avg(age), which should be much smaller in size.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case class implied by the snippet (fields inferred from the calls below).
case class Student(name: String, age: Int)

val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
rdd.cache()                                  // mark the raw RDD for caching
rdd.toDF().registerTempTable("TBL_STUDENT")
val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
df.cache()                                   // mark the aggregated DataFrame for caching
df.show()                                    // action that populates both caches
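
For reference, the same numbers can also be read programmatically; sc.getRDDStorageInfo is a developer API, so this is an illustrative sketch rather than a stable recipe:

// Force both caches to materialize, then print their in-memory sizes,
// which should match what the Storage tab reports.
rdd.count()
df.show()
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id} (${info.name}): ${info.memSize} bytes in memory")
}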


Re: Re: Can't understand the size of raw RDD and its DataFrame

Posted by Rishi Yadav <ri...@infoobjects.com>.
DataFrames, in simple terms, are RDDs combined with a schema. In reality they
are much more than that and enable a very fine level of optimization;
check out project Tungsten.

In your case it was one column, as you chose. By default, it keeps the same
columns as the RDD (the fields of the case class, if you created the RDD from
a case class).
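
A minimal sketch of that point, assuming the df from your snippet: what gets
cached is the output of the query, so the schema has just the one aggregate
column, not the Student columns.

df.printSchema()
// a single double column for the average; the exact name Spark gives
// the unaliased avg(age) can vary across versions

df.explain(true)
// after df.cache() and an action (your df.show()), the physical plan
// scans the in-memory columnar cache instead of recomputing from
// TBL_STUDENT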


Author: Spark Cookbook <http://amzn.com/1783987065> (Packt)


On Sat, Aug 15, 2015 at 10:01 PM, Todd <bi...@163.com> wrote:

> I thought that the df only contains one column, and actually only one
> resulting row (select avg(age) from theTable).
> So I would think that it would take less space; it looks like my
> understanding is wrong?

Re: Re: Can't understand the size of raw RDD and its DataFrame

Posted by Todd <bi...@163.com>.
I thought that the df only contains one column, and actually only one resulting row (select avg(age) from theTable).
So I would think that it would take less space; it looks like my understanding is wrong?
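
For what it's worth, a quick check (assuming the df from the snippet) confirms the cached result really is one column and one row; the extra bytes are plausibly the fixed per-batch overhead of Spark's in-memory columnar cache (batch metadata and column statistics), not the data itself:

println(df.columns.mkString(", "))   // the single aggregate column
println(df.count())                  // 1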

At 2015-08-16 12:34:31, "Rishi Yadav" <ri...@infoobjects.com> wrote:

Why are you expecting the footprint of the DataFrame to be lower when it contains more information (RDD + Schema)?




Re: Can't understand the size of raw RDD and its DataFrame

Posted by Rishi Yadav <ri...@infoobjects.com>.
Why are you expecting the footprint of the DataFrame to be lower when it
contains more information (RDD + Schema)?
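
A small sketch of the "RDD + Schema" point, assuming the rdd from the original
post: the DataFrame carries schema metadata that the plain RDD of case-class
objects does not.

println(rdd.toDF().schema)
// something like:
// StructType(StructField(name,StringType,true), StructField(age,IntegerType,false))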
