Posted to user@spark.apache.org by shahab <sh...@gmail.com> on 2015/03/12 16:04:36 UTC

Which is more efficient : first join three RDDs and then do filtering or vice versa?

Hi,

This question has probably been answered before on the mailing list,
but I couldn't find it. Sorry for posting it again.

I need to join and filter three different RDDs, and I wonder which of the
following alternatives is more efficient:
1- first join all three RDDs and then filter the resulting joined RDD
  or
2- filter each individual RDD first and then join the resulting RDDs


Or is there probably no difference, because of lazy evaluation and Spark's
underlying optimisation?

best,
/Shahab

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

Posted by shahab <sh...@gmail.com>.
Thanks, it makes sense.

On Thursday, March 12, 2015, Daniel Siegmann <da...@teamaol.com>
wrote:

> Join causes a shuffle (sending data across the network). I expect it will
> be better to filter before you join, so you reduce the amount of data which
> is sent across the network.
>
> Note this would be true for *any* transformation which causes a shuffle.
> It would not be true if you're combining RDDs with union, since that
> doesn't cause a shuffle.

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

Posted by Daniel Siegmann <da...@teamaol.com>.
Join causes a shuffle (sending data across the network). I expect it will
be better to filter before you join, so you reduce the amount of data which
is sent across the network.

Note this would be true for *any* transformation which causes a shuffle. It
would not be true if you're combining RDDs with union, since that doesn't
cause a shuffle.
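
To make the shuffle argument concrete, here is a minimal sketch in plain
Python. It is a toy model, not Spark code: `shuffle`, `join`, `keep`,
`NUM_PARTITIONS` and the sample data are all hypothetical names made up for
illustration, and the model simply counts every record that enters a join as
sent over the network. It also assumes the filter only looks at the join key,
so it can be applied to each input on its own. As far as I know, the RDD API
treats a filter function as an opaque closure, so lazy evaluation alone would
not move it ahead of the join for you.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of partitions in the toy model


def shuffle(records):
    """Hash-partition (key, value) pairs; count every record as sent over the network."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % NUM_PARTITIONS].append((key, value))
    return partitions, len(records)


def join(left, right):
    """Inner join on key. Returns (joined pairs, number of records shuffled)."""
    left_parts, left_moved = shuffle(left)
    right_parts, right_moved = shuffle(right)
    joined = []
    for p in range(NUM_PARTITIONS):
        # Build a per-partition lookup of the left side, then probe it with the right side.
        left_by_key = defaultdict(list)
        for key, value in left_parts[p]:
            left_by_key[key].append(value)
        for key, other in right_parts[p]:
            for value in left_by_key.get(key, []):
                joined.append((key, (value, other)))
    return joined, left_moved + right_moved


def keep(record):
    """Example predicate (an assumption): it depends only on the key."""
    return record[0] % 10 == 0


# Three made-up keyed datasets standing in for the three RDDs.
a = [(k, f"a{k}") for k in range(1000)]
b = [(k, f"b{k}") for k in range(1000)]
c = [(k, f"c{k}") for k in range(1000)]


def join_then_filter():
    ab, moved1 = join(a, b)
    abc, moved2 = join(ab, c)
    return [r for r in abc if keep(r)], moved1 + moved2


def filter_then_join():
    fa, fb, fc = ([r for r in rdd if keep(r)] for rdd in (a, b, c))
    ab, moved1 = join(fa, fb)
    abc, moved2 = join(ab, fc)
    return abc, moved1 + moved2


late_result, late_moved = join_then_filter()
early_result, early_moved = filter_then_join()
print(late_moved, early_moved)
```

In this toy setup both orders produce the same 100 joined records, but
joining first pushes 4000 records through the simulated shuffle while
filtering first pushes 400. The exact ratio depends entirely on how selective
the filter is; a filter that needs columns from more than one input cannot be
moved ahead of the join this way.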
