Posted to user@spark.apache.org by Jacek Laskowski <ja...@japila.pl> on 2016/06/08 14:49:37 UTC

Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

Hi,

I just noticed today, while toying with Spark 2.0.0 (today's build),
that Seq(...).toDF does **not** submit a Spark job while
sc.parallelize(Seq(...)).toDF does. I was pleasantly surprised and have
been thinking about the reason for this behaviour.
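
For the record, here's roughly the spark-shell session (so spark and sc
are the shell's predefined SparkSession and SparkContext; I watched the
Jobs tab in the web UI to see whether anything got submitted):

    import spark.implicits._

    val fromSeq = Seq(1, 2, 3).toDF("n")
    // nothing shows up in the web UI

    val fromRdd = sc.parallelize(Seq(1, 2, 3)).toDF("n")
    // this one kicked off a job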

My explanation is that Datasets are just a "view" layer atop the data,
and when that data is already local/in memory there's no need to submit
a job to...well...compute it.

I'd appreciate a more in-depth answer, perhaps with links to the code. Thanks!

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

Posted by Jacek Laskowski <ja...@japila.pl>.
Makes sense. Thanks, Michael (and welcome back from #SparkSummit!). On
to exploring the space...

Jacek
On 9 Jun 2016 6:10 p.m., "Michael Armbrust" <mi...@databricks.com> wrote:

> Look at explain(). For a Seq we know it's just local data, so we can
> avoid Spark jobs for simple operations. In contrast, an RDD is opaque
> to Catalyst, so we can't perform that optimization.

Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

Posted by Michael Armbrust <mi...@databricks.com>.
Look at explain(). For a Seq we know it's just local data, so we can
avoid Spark jobs for simple operations. In contrast, an RDD is opaque
to Catalyst, so we can't perform that optimization.
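
For example, in the shell (output abbreviated; the exact plan node
names below are what I'd expect from a 2.0-era build, so treat them as
an assumption):

    import spark.implicits._

    Seq(1, 2, 3).toDF("n").explain()
    // == Physical Plan ==
    // LocalTableScan [n#0]       <- purely local data, no job needed

    sc.parallelize(Seq(1, 2, 3)).toDF("n").explain()
    // == Physical Plan ==
    // Scan ExistingRDD[n#4]      <- opaque to Catalyst, needs a job

The first plan bottoms out in a LocalRelation, which simple operations
like collect() can typically evaluate on the driver without going
through the scheduler.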
