Posted to user@spark.apache.org by apu <ap...@gmail.com> on 2016/09/02 20:05:31 UTC

Is cache() still necessary for Spark DataFrames?

When I first learnt Spark, I was told that *cache()* is desirable anytime
one performs more than one Action on an RDD or DataFrame. For example,
consider the PySpark toy example below; it shows two approaches to doing
the same thing.

# Approach 1 (bad?)
df2 = someTransformation(df1)
a = df2.count()
b = df2.first()  # This step could take long, because df2 has to be
                 # created all over again

# Approach 2 (good?)
df2 = someTransformation(df1)
df2.cache()
a = df2.count()
b = df2.first() # Because df2 is already cached, this action is quick
df2.unpersist()

The second approach shown above is somewhat clunky, because it requires one
to cache any dataframe that will be Acted on more than once, followed by
the need to call *unpersist()* later to free up memory.
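
As an aside, the boilerplate can be reduced with a small helper. The sketch
below uses a hypothetical cached() context manager (not anything built into
Spark) so that unpersist() is never forgotten:

from contextlib import contextmanager

@contextmanager
def cached(df):
    # Hypothetical helper: keep df cached only for the duration of the block.
    df.cache()
    try:
        yield df
    finally:
        df.unpersist()

# Usage, assuming df1 and someTransformation from the example above:
with cached(someTransformation(df1)) as df2:
    a = df2.count()
    b = df2.first()  # served from the cached data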

*So my question is: is the second approach still necessary/desirable when
operating on DataFrames in newer versions of Spark (>=1.6)?*

Thanks!!

Apu

Re: Is cache() still necessary for Spark DataFrames?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

As I understand it, Spark's memory allocation is split between execution
memory and storage memory. Their sum is fixed (in simplest terms, the memory
allocated to the application), so caching data in storage memory leaves less
available for execution.
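
The split is governed by a couple of configuration properties. A hedged
sketch (the property names come from the unified memory manager in Spark
>= 1.6, the defaults vary by version, and the app name is arbitrary):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cache-test")
        # fraction of the heap shared by execution and storage together
        .set("spark.memory.fraction", "0.6")
        # portion of that shared region protected for cached (storage) data
        .set("spark.memory.storageFraction", "0.5"))
sc = SparkContext(conf=conf)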

Now


   1. cache() is an alias for persist() with the default storage level
   (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
   2. Caching is lazy and only happens once; the data is materialized by
   the first action and reused afterwards.
   3. Both DataFrames and RDDs can be cached (see the sketch after this
   list).
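
A minimal PySpark sketch of the points above, reusing df1 and
someTransformation from the original post:

from pyspark import StorageLevel

df2 = someTransformation(df1)
df2.persist(StorageLevel.MEMORY_ONLY)   # explicit level; df2.cache() would use the default
a = df2.count()                         # the first action materializes the cache
b = df2.first()                         # subsequent actions reuse the cached data
df2.unpersist()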


If you cache an RDD or DataFrame it will stay in memory until it is
evicted; Spark uses an LRU (Least Recently Used) policy for eviction. So if
your RDD is moderately small and is accessed repeatedly, caching it is
advantageous for faster access. Otherwise, leave it as it is. The Spark doc
<http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence>
explains this.

You can perform some tests by running both approaches and checking the Spark
UI (default port 4040) under the Storage tab to see how much data is cached.
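
For example, a rough timing harness (a sketch only; it assumes df1 and
someTransformation from the original post exist in your session):

import time

def timed(label, action):
    start = time.time()
    action()
    print("%-18s %.2f s" % (label, time.time() - start))

df2 = someTransformation(df1)
timed("count, uncached", df2.count)
timed("first, uncached", df2.first)

df2.cache()
timed("count, cached", df2.count)   # this action also populates the cache
timed("first, cached", df2.first)   # now served from memory
df2.unpersist()

While df2 is cached, the Storage tab shows how much of it actually fits in
memory.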

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 21:21, Davies Liu <da...@databricks.com> wrote:

> Caching an RDD/DataFrame always has some cost. In this case I'd suggest
> not caching the DataFrame; first() is usually fast enough, since it only
> computes the partitions it needs.

Re: Is cache() still necessary for Spark DataFrames?

Posted by Davies Liu <da...@databricks.com>.
Caching an RDD/DataFrame always has some cost. In this case I'd suggest
not caching the DataFrame; first() is usually fast enough, since it only
computes the partitions it needs.
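
A quick way to see the difference (a hedged sketch with a synthetic
DataFrame; it assumes a Spark 2.x SparkSession named spark, and the timings
will of course depend on your cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A synthetic 200M-row DataFrame, generated lazily.
big = spark.range(0, 200 * 1000 * 1000).selectExpr("id", "id * 2 AS doubled")

row = big.first()   # cheap: only enough partitions to produce one row are evaluated
n = big.count()     # expensive: every partition is scanned
print(row, n)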


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org