Posted to user@spark.apache.org by jggg777 <jo...@gmail.com> on 2016/11/22 14:50:11 UTC

Is there a processing speed difference between DataFrames and Datasets?

I've seen a number of visuals showing the processing time benefits of using
Datasets+DataFrames over RDDs, but I'd assume that there are performance
benefits to using a defined case class instead of a generic Dataset[Row].  The
tale of three Spark APIs post mentions "If you want higher degree of
type-safety at compile time, want typed JVM objects, *take advantage of
Catalyst optimization, and benefit from Tungsten’s efficient code
generation, use Dataset.*"

Are there any comparisons showing the performance differences between
Datasets and DataFrames?  Or more information about how Catalyst/Tungsten
handle them differently?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-processing-speed-difference-between-DataFrames-and-Datasets-tp28117.html

Re: Is there a processing speed difference between DataFrames and Datasets?

Posted by Sean Owen <so...@cloudera.com>.
DataFrames are a narrower, more specific abstraction, meant for tabular
data. Where your data is tabular, a DataFrame makes more sense to use,
especially because knowing the structure of the data means a lot more can be
optimized under the hood for you, whereas the framework can do nothing with
an RDD of arbitrary objects. DataFrames are not somehow a "better RDD".
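
As a rough sketch of that difference (not part of the original reply), the
same filter can be written against an RDD of objects and against a DataFrame.
The Person class, the sample rows, and the local master setting are
assumptions made up for this illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type, for illustration only.
case class Person(name: String, age: Int)

object TabularVsRdd {
  // RDD of arbitrary objects: the predicate is an opaque Scala closure,
  // so Spark can only run it as written.
  def adultsRdd(spark: SparkSession, people: Seq[Person]): RDD[Person] =
    spark.sparkContext.parallelize(people).filter(_.age >= 18)

  // DataFrame: the predicate is a column expression over a known schema,
  // which the optimizer can inspect and rearrange.
  def adultsDf(spark: SparkSession, people: Seq[Person]): DataFrame = {
    import spark.implicits._
    people.toDF().filter($"age" >= 18).select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tabular-vs-rdd")
      .master("local[*]") // assumption: a local run, just for the sketch
      .getOrCreate()
    val people = Seq(Person("Ann", 34), Person("Bob", 17))

    println(adultsRdd(spark, people).count())
    // Prints the logical and physical plans chosen for the DataFrame query.
    adultsDf(spark, people).explain(true)

    spark.stop()
  }
}
```

The explain(true) call is the easiest way to see what was planned for the
DataFrame version; the RDD version has no comparable plan to look at.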

Datasets are more like the new RDDs, supporting more general objects and
programmatic access. They are still a different thing, for a different
purpose, from DataFrames, but they have an API more similar to DataFrames
and get some of the same types of benefits for simple types via Encoders.
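
To make the Encoder point concrete, here is a minimal sketch (again, not from
the original reply) of a typed Dataset next to the untyped Dataset[Row] view
of the same data. The Event class, the 1000-byte threshold, and the object
and method names are hypothetical. It makes no claim about which version is
faster; comparing the explain output is one way to see how each filter is
planned.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, Row, SparkSession}

// Hypothetical record type, for illustration only.
case class Event(user: String, bytes: Long)

object TypedVsUntyped {
  // An Encoder tells Spark how to map the case class to its internal
  // row format; Encoders.product derives one for case classes.
  val eventEncoder: Encoder[Event] = Encoders.product[Event]

  // Typed API: the predicate is a Scala function over Event objects.
  def typedLarge(ds: Dataset[Event]): Dataset[Event] =
    ds.filter((e: Event) => e.bytes > 1000L)

  // Untyped API: the same predicate as an expression on the "bytes" column.
  def untypedLarge(ds: Dataset[Event]): Dataset[Row] =
    ds.toDF().filter("bytes > 1000")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-vs-untyped")
      .master("local[*]") // assumption: a local run, just for the sketch
      .getOrCreate()
    val events = Seq(Event("a", 10L), Event("b", 5000L))
    val ds = spark.createDataset(events)(eventEncoder)

    // Compare the two plans side by side.
    typedLarge(ds).explain(true)
    untypedLarge(ds).explain(true)

    spark.stop()
  }
}
```

In the plans, the typed filter should appear as a function applied to Event
objects, while the untyped one appears as an expression on the column itself.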
