Posted to user@spark.apache.org by Bin Wang <wb...@gmail.com> on 2015/07/16 10:02:59 UTC

Will multiple filters on the same RDD be optimized into one filter?

If I write code like this:

val rdd = input.map(_.value)
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)
...

Then the execution DAG may look like this:

         -> Filter -> ...
Map
         -> Filter -> ...

But the two filters operate on the same RDD, which means both could be
done with just one scan of the RDD. Does Spark have this kind of
optimization right now?

Re: Will multiple filters on the same RDD be optimized into one filter?

Posted by Raghavendra Pandey <ra...@gmail.com>.
Depending on what you do with them, they will be computed separately,
because you may have a long DAG in each branch. So Spark runs all the
transformation functions within a branch together rather than trying to
optimize things across branches.
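
A minimal sketch that makes this visible, assuming a local
SparkContext (the println is only there to show when the map function
runs):

import org.apache.spark.{SparkConf, SparkContext}

object TwoBranches {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("two-branches"))

    // Shared parent of both filter branches; the side effect shows
    // each time an element is mapped.
    val rdd = sc.parallelize(1 to 3).map { v => println(s"map($v)"); v }
    val f1 = rdd.filter(_ == 1)
    val f2 = rdd.filter(_ == 2)

    println(f1.count())  // job 1: runs the map over every element
    println(f2.count())  // job 2: runs the map over every element again

    sc.stop()
  }
}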
On Jul 16, 2015 1:40 PM, "Bin Wang" <wb...@gmail.com> wrote:

> What if I would use both rdd1 and rdd2 later?

Re: Will multiple filters on the same RDD be optimized into one filter?

Posted by Bin Wang <wb...@gmail.com>.
What if I would use both rdd1 and rdd2 later?

On Thu, Jul 16, 2015 at 4:08 PM, Raghavendra Pandey <ra...@gmail.com> wrote:

> If you cache the rdd, it will save some operations. But filter is a lazy
> operation anyway, and it runs based on what you do later on with rdd1 and
> rdd2...
>
> Raghavendra

Re: Will multiple filters on the same RDD be optimized into one filter?

Posted by Raghavendra Pandey <ra...@gmail.com>.
If you cache the rdd, it will save some operations. But filter is a lazy
operation anyway, and it runs based on what you do later on with rdd1 and
rdd2...

Raghavendra
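
A minimal sketch of that, reusing the names from the original snippet
(`input` stands for whatever source RDD you start from):

import org.apache.spark.storage.StorageLevel

// Persist the shared parent so the map runs only once; both filter
// branches then read the cached partitions. .cache() is shorthand
// for persist(StorageLevel.MEMORY_ONLY).
val rdd = input.map(_.value).persist(StorageLevel.MEMORY_ONLY)
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)

f1.count()  // first action computes `rdd` and fills the cache
f2.count()  // reuses the cached partitions; the map does not run again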