Posted to user@spark.apache.org by Darin McBeath <dd...@yahoo.com.INVALID> on 2016/07/28 16:52:47 UTC

RDD vs Dataset performance

I started playing around with Datasets on Spark 2.0 this morning, and I'm surprised by the significant performance difference I'm seeing between an RDD and a Dataset for a very basic example.


I've defined a simple case class called AnnotationText that has a handful of fields.
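
For illustration, a minimal sketch of the kind of case class I mean (the field names here are placeholders, not my actual schema):

case class AnnotationText(
  docId: Long,       // all field names here are placeholders
  annotType: String,
  startOffset: Long,
  endOffset: Long,
  text: String)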


I create a Dataset[AnnotationText] with my data, repartition(4) it on one of the columns, and cache the resulting dataset as ds (forcing the cache by executing a count action).  Everything looks good, and I have more than 10M records in my dataset ds.
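
Roughly, the setup looks like this (the source path and the docId column are placeholders; the real code differs in the details):

import spark.implicits._

val ds = spark.read
  .parquet("/path/to/annotations")  // hypothetical source
  .as[AnnotationText]
  .repartition(4, $"docId")         // repartition on one of the columns
  .cache()
ds.count()                          // action to force the cache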

When I execute the following:

ds.filter(textAnnotation => textAnnotation.text == "mutational".toLowerCase).count 

It consistently finishes in just under 3 seconds.  One of the things I notice is that it has 3 stages.  The first stage is skipped (as this had to do with the creation of ds, which was already cached).  The second stage appears to do the filtering (requires 4 tasks) but, interestingly, it shuffles output.  The third stage (requires only 1 task) appears to count the results of the shuffle.

When I look at the cached dataset (on 4 partitions) it is 82.6MB.

I then decided to convert the ds dataset to an RDD, repartition(4), and cache it as follows:

val aRDD = ds.rdd.repartition(4).cache
aRDD.count

So, I now have an RDD[AnnotationText].

When I execute the following:

aRDD.filter(textAnnotation => textAnnotation.text == "mutational".toLowerCase).count

It consistently finishes in just under half a second.  One of the things I notice is that it has only 2 stages.  The first stage is skipped (as this had to do with the creation of aRDD, which was already cached).  The second stage appears to do the filtering and count (requires 4 tasks).  Interestingly, there is no shuffle (or, consequently, a third stage).

When I look at the cached RDD (on 4 partitions) it is 2.9GB.


I was surprised by how significant the cached storage difference was between the Dataset (82.6MB) and the RDD (2.9GB) versions of the same content.  Is this kind of difference to be expected?

While I like the smaller size of the Dataset version, I was confused as to why its performance was so much slower (2.5s vs 0.5s).  I suspect it might be attributable to the shuffle and third stage required by the Dataset version, but I'm not sure.  I was under the impression that Datasets would be faster in many use cases (such as the one above).  Am I doing something wrong, or is this to be expected?

Thanks.

Darin.



Re: RDD vs Dataset performance

Posted by Reynold Xin <rx...@databricks.com>.
The performance difference comes from the need to serialize and
deserialize the data to AnnotationText objects. The extra stage is
probably very quick and shouldn't have much impact.
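
As a point of comparison, a filter expressed on the column rather than
on the typed object avoids that deserialization. A sketch, using the
text field from your lambda:

import spark.implicits._

// untyped, Column-based filter: no per-row deserialization into AnnotationText
ds.filter($"text" === "mutational").count()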

If you try caching the RDD using serialized mode, it will slow down a
lot too.
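
A sketch of what I mean, if you want to try it (MEMORY_ONLY_SER keeps
the cached partitions as serialized bytes):

import org.apache.spark.storage.StorageLevel

// cache the RDD in serialized form instead of as live Java objects
val serRDD = ds.rdd.repartition(4).persist(StorageLevel.MEMORY_ONLY_SER)
serRDD.count()  // every subsequent access now pays a deserialization cost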

