Posted to user@spark.apache.org by YaoPau <jo...@gmail.com> on 2014/11/14 19:20:47 UTC

Given multiple .filter()'s, is there a way to set the order?

I have an RDD "x" of millions of Strings, each of which I want to pass
through a set of filters.  My filtering code looks like this:

x.filter(filter1)   // filters out ~40% of the data
 .filter(filter2)   // filters out ~20%
 .filter(filter3)   // filters out ~2%
 .filter(filter4)   // filters out ~1%

There is no ordering requirement (filter #2 does not depend on filter #1,
etc.), but the filters differ drastically in the percentage of rows they
should eliminate.  What I'd like is short-circuit evaluation, as in an
"&&" expression: if a row fails filter #1, it gets filtered out before
the other three filters run.  In other words, something like the sketch
below.
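
(Just a sketch in Scala; filter1 through filter4 stand in for my real
predicates.)

x.filter(s => filter1(s) && filter2(s) && filter3(s) && filter4(s))  // && stops at the first predicate that fails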

But when I play around with the ordering of the filters, the runtime
doesn't seem to change.  Is Spark somehow intelligently guessing how
effective each filter will be and reordering them regardless of how I
order them?  If not, is there a way I can set the filter order?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Given-multiple-filter-s-is-there-a-way-to-set-the-order-tp18957.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Given multiple .filter()'s, is there a way to set the order?

Posted by Aaron Davidson <il...@gmail.com>.
In the situation you show, Spark will pipeline the filters together,
applying them one at a time to each row, effectively constructing an
"&&" expression. You would only see a performance difference if the
filter code itself is somewhat expensive; in that case you would want to
execute it on as few rows as possible. Otherwise, the runtime difference
between "a == b && b == c && c == d" and "a == b & b == c & c == d" is
minimal, the latter being the worst-case scenario since it always
evaluates every condition (though, as I said, Spark behaves like the
former).
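
Concretely, here is a minimal sketch (Scala; x is an RDD[String], and p1
through p4 are hypothetical String => Boolean predicates standing in for
your filters):

// Spark pipelines the chained filters: each row is tested predicate by
// predicate and dropped at the first failure, so the two forms below
// behave essentially the same at runtime.
val chained  = x.filter(p1).filter(p2).filter(p3).filter(p4)
val combined = x.filter(s => p1(s) && p2(s) && p3(s) && p4(s))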

Spark does not reorder the filters automatically. It uses the explicit
ordering you provide.
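
So if one filter is much cheaper or more selective than the others, you
can simply put it first yourself. A sketch under that assumption (the
predicate names are hypothetical):

// Run the cheap, highly selective predicate first so the expensive
// one only sees the rows that survive it.
x.filter(cheapSelectivePredicate)
 .filter(expensivePredicate)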
