Posted to user@spark.apache.org by Kristina Rogale Plazonic <kp...@gmail.com> on 2015/07/28 20:50:25 UTC

DataFrame DAG recomputed even though DataFrame is cached?

Hi,

I'm puzzling over the following problem: when I cache a small sample of a
big dataframe, the small dataframe is recomputed when I select a column
(but not when I invoke show() or count()).

Why is that so and how can I avoid recomputation of the small sample
dataframe?

More details:

- I have a big dataframe "df" of ~190 million rows and ~10 columns, obtained
via 3 different joins; I cache it and invoke count() to make sure it really
is in memory, and I confirm this in the web UI

- val sdf = df.sample(false, 1e-6); sdf.cache(); sdf.count()  // 170 rows;
caching is also confirmed in the web UI, size in memory is 150 kB

- sdf.select("colname").show()   // this triggers a complete recomputation
of sdf, including the 3 joins!

- show(), count() and take() do not trigger recomputation of the 3 joins,
but select(), collect() and withColumn() do (a sketch of these steps, with
placeholder names, is below).
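
Putting these steps together, here is roughly what I'm running (Spark 1.4
DataFrame API). The table names, the join column "key" and "colname" are
placeholders standing in for my real data; sqlContext is the SQLContext
provided by spark-shell:

import org.apache.spark.sql.DataFrame

// Placeholder inputs standing in for the real tables
val dfA: DataFrame = sqlContext.table("table_a")
val dfB: DataFrame = sqlContext.table("table_b")
val dfC: DataFrame = sqlContext.table("table_c")
val dfD: DataFrame = sqlContext.table("table_d")

// Big dataframe built from 3 joins (~190 million rows, ~10 columns)
val df = dfA.join(dfB, "key").join(dfC, "key").join(dfD, "key")
df.cache()
df.count()                        // materializes the cache; visible in the web UI

// Small cached sample (170 rows, ~150 kB in memory)
val sdf = df.sample(false, 1e-6)
sdf.cache()
sdf.count()

// These reuse the cached sample:
sdf.show()
sdf.count()
sdf.take(10)

// These re-run the whole DAG including the 3 joins (observed on Spark 1.4.0):
sdf.select("colname").show()
sdf.collect()
sdf.withColumn("colcopy", sdf("colname")).show()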

I have --executor-memory 30G --driver-memory 10g, so memory is not a
problem. I'm using Spark 1.4.0. Could anybody shed some light on this, or
point me to where I can find more info?

Many thanks,
Kristina

Re: DataFrame DAG recomputed even though DataFrame is cached?

Posted by Michael Armbrust <mi...@databricks.com>.
We will try to address this before Spark 1.5 is released:
https://issues.apache.org/jira/browse/SPARK-9141
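
In the meantime, one generic way to sidestep the recomputation on 1.4.x (a
common lineage-breaking pattern, not something suggested in this thread) is
to rebuild the sampled DataFrame from its RDD and schema and cache that
copy; the rebuilt frame carries no plan referencing the joins, so later
queries cannot fall back to recomputing them. A minimal sketch, assuming the
sqlContext and sdf from the original message:

import org.apache.spark.sql.DataFrame

// Rebuild the sampled DataFrame from its rows; the new plan has no
// lineage back to the 3-join query.
val detached: DataFrame = sqlContext.createDataFrame(sdf.rdd, sdf.schema)
detached.cache()
detached.count()                  // materialize (this pass may still execute sdf's plan once)
detached.select("colname").show() // served from detached's cache; joins not re-run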
