
Posted to user@spark.apache.org by Xiao JIANG <ji...@outlook.com> on 2015/08/13 21:55:07 UTC

RDD.join vs spark SQL join

Hi, may I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one I should use? What are the factors to consider here? Thanks!

RE: RDD.join vs spark SQL join

Posted by Xiao JIANG <ji...@outlook.com>.

Thank you Akhil!

Date: Fri, 14 Aug 2015 14:51:56 +0530
Subject: Re: RDD.join vs spark SQL join
From: akhil@sigmoidanalytics.com
To: jiangxiao01@outlook.com
CC: user@spark.apache.org

Both work the same way, but with Spark SQL you also get the optimizations done by the Catalyst optimizer. One important thing to consider is the number of partitions and the key distribution (when you are doing RDD.join). If the keys are not evenly distributed across machines, you can see the process choking on a single task (that is, one task takes far longer to execute than the others in that stage).

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:25 AM, Xiao JIANG <ji...@outlook.com> wrote:

Hi, may I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one I should use? What are the factors to consider here? Thanks!

Re: RDD.join vs spark SQL join

Posted by Akhil Das <ak...@sigmoidanalytics.com>.

Both work the same way, but with Spark SQL you also get the optimizations
done by the Catalyst optimizer. One important thing to consider is the number
of partitions and the key distribution (when you are doing RDD.join). If the
keys are not evenly distributed across machines, you can see the process
choking on a single task (that is, one task takes far longer to execute than
the others in that stage).
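
To make the key-distribution point concrete, here is a rough plain-Python
sketch (not Spark code) of how a hash-partitioned shuffle spreads records over
partitions. The helper names partition_sizes and skew_ratio are made up for
illustration, and "hash(key) % num_partitions" only mimics the idea of hash
partitioning; it is not Spark's actual partitioner. Integer keys are used so
the result is the same on every run.

```python
def partition_sizes(keys, num_partitions):
    """Count how many records each partition receives under hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        # Hypothetical stand-in for a hash partitioner: bucket by hash modulo.
        sizes[hash(key) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 = perfectly even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# 1000 records with distinct keys: every partition gets the same share.
even_keys = list(range(1000))

# 1000 records where a single "hot" key (7) accounts for 900 of them.
skewed_keys = [7] * 900 + list(range(100))

even_sizes = partition_sizes(even_keys, 4)
skewed_sizes = partition_sizes(skewed_keys, 4)

print(even_sizes, skew_ratio(even_sizes))      # [250, 250, 250, 250] 1.0
print(skewed_sizes, skew_ratio(skewed_sizes))  # [25, 25, 25, 925] 3.7
```

In the skewed case one partition holds 925 of the 1000 records, so the task
that processes it would run long after the other three have finished, which is
the "choking on a single task" behaviour described above. Running the same
kind of count over a sample of your real join keys is a cheap way to check for
this before joining.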

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:25 AM, Xiao JIANG <ji...@outlook.com> wrote:

> Hi,
>
> May I know the performance difference between the rdd.join function and the
> Spark SQL join operation? If I want to join several big RDDs, how should I
> decide which one I should use? What are the factors to consider here?
>
> Thanks!
>