Posted to user@spark.apache.org by Rabin Banerjee <de...@gmail.com> on 2016/11/18 15:36:14 UTC

Will spark cache table once even if I call read/cache on the same table multiple times

Hi All,

  I am working on a project where the code is divided into multiple reusable
modules, and I am not able to understand how Spark persist/cache behaves in that
context.

My question is: will Spark cache a table only once, even if I call read/cache on
the same table multiple times?

 Sample Code ::

  TableReader::

   def getTableDF(tablename: String, persist: Boolean = false): DataFrame = {
     val tabdf = sqlContext.table(tablename)
     if (persist) {
       tabdf.cache()
     }
     tabdf  // was "return tableDF", which does not compile: the val is named tabdf
   }

 Now
Module1::
 val emp = TableReader.getTableDF("employee", persist = true)
 emp.someTransformation.someAction

Module2::
 val emp = TableReader.getTableDF("employee", persist = true)
 emp.someTransformation.someAction

....

ModuleN::
 val emp = TableReader.getTableDF("employee", persist = true)
 emp.someTransformation.someAction

Will Spark cache the emp table once, or will it cache it again every time I call
this? Should I maintain a global hashmap to handle that, something like a
Map[String, DataFrame]?
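
If a map is the way to go, I imagine something like this sketch (TableRegistry and
getOrLoad are made-up names, just to illustrate the idea; it assumes the same
sqlContext as above is in scope):

   import scala.collection.mutable
   import org.apache.spark.sql.DataFrame

   object TableRegistry {
     private val loaded = mutable.Map[String, DataFrame]()

     // Read and cache each table at most once per application, handing the
     // same DataFrame reference back to every module that asks for it.
     def getOrLoad(tablename: String): DataFrame = synchronized {
       loaded.getOrElseUpdate(tablename, {
         val df = sqlContext.table(tablename)
         df.cache()
         df
       })
     }
   }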

 Regards,
Rabin Banerjee

Re: Will spark cache table once even if I call read/cache on the same table multiple times

Posted by Yong Zhang <ja...@hotmail.com>.
If you have 2 different RDDs (as the 2 different references and RDD IDs in your example show), then YES, Spark will cache two copies of exactly the same data in memory.

There is no way for Spark to compare them and know that they hold the same content. If you define them as 2 RDDs, then they are different RDDs, and they will be cached individually.
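
A small RDD-level illustration of that point (the file path here is just made up):

   val rdd1 = sc.textFile("/data/employee.txt").cache()
   val rdd2 = sc.textFile("/data/employee.txt").cache()  // a second, distinct RDD over the same file
   rdd1.count()                                          // materializes one cached copy
   rdd2.count()                                          // materializes a second, identical copy
   println(rdd1.id == rdd2.id)                           // false: different ids, cached individually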


Yong



Re: Will spark cache table once even if I call read/cache on the same table multiple times

Posted by "Taotao.Li" <ch...@gmail.com>.
Hi, you can check my Stack Overflow question:
http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812
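
The short version, as I read it: calling cache() twice on the same RDD reference is
idempotent, so only one copy is ever kept. A quick sketch:

   val rdd = sc.parallelize(1 to 1000)
   rdd.cache()
   rdd.cache()   // no-op: same RDD id, already marked for caching
   rdd.count()   // materializes the single cached copy

It is only when you build a new RDD (or DataFrame) over the same data that you get
a second, separately cached copy.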



-- 
___________________
Quant | Engineer | Boy
___________________
blog:    http://litaotao.github.io
github: www.github.com/litaotao

Re: Will spark cache table once even if I call read/cache on the same table multiple times

Posted by Rabin Banerjee <de...@gmail.com>.
Hi Yong,

  But every time val tabdf = sqlContext.table(tablename) is called, tabdf.rdd
gets a new id, which can be checked by calling tabdf.rdd.id.
And, see:
https://github.com/apache/spark/blob/b6de0c98c70960a97b07615b0b08fbd8f900fbe7/core/src/main/scala/org/apache/spark/SparkContext.scala#L268

Spark maintains a map of [RDD_ID, RDD]; as the RDD id keeps changing, will Spark
cache the same data again and again?

For example,

val tabdf = sqlContext.table("employee")
tabdf.cache()
tabdf.someTransformation.someAction
println(tabdf.rdd.id)
val tabdf1 = sqlContext.table("employee")
tabdf1.cache()  // <= Will Spark go to disk and load the data into memory again, or look into the cache?
tabdf1.someTransformation.someAction
println(tabdf1.rdd.id)
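
One way to check what actually landed in the cache (assuming a Spark 1.x-style
SQLContext, as in the snippet above):

   println(sqlContext.isCached("employee"))  // is the table registered as cached?

The Storage tab of the Spark UI also lists every cached RDD with its size, so a
doubly-cached table should show up there as two separate entries.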

Regards,
R Banerjee




Re: Will spark cache table once even if I call read/cache on the same table multiple times

Posted by Yong Zhang <ja...@hotmail.com>.
That's correct, as long as you don't change the StorageLevel.


https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166
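
A quick illustration of what the guard at that line means in practice (the file
path is made up):

   import org.apache.spark.storage.StorageLevel

   val rdd = sc.textFile("/data/employee.txt")
   rdd.persist(StorageLevel.MEMORY_ONLY)
   rdd.persist(StorageLevel.MEMORY_ONLY)  // same level again: a harmless no-op
   rdd.persist(StorageLevel.DISK_ONLY)    // throws UnsupportedOperationException:
                                          // cannot change the storage level of an
                                          // RDD after it has been assigned one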



Yong
