Posted to user@spark.apache.org by mrm <ma...@skimlinks.com> on 2014/11/28 17:21:04 UTC

optimize multiple filter operations

Hi, 

My question is:

I have multiple filter operations that split my initial RDD into two
different groups. Together, the two groups cover the whole initial set. In
code, it's something like:

set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)

By doing this, I am making two passes over the data. Is there any way to
optimise this into a single pass?

Note: I searched the mailing list to see if this question had been asked
already, but could not find it.





Re: optimize multiple filter operations

Posted by Imran Rashid <im...@therashids.com>.
Rishi's approach will work, but it's worth mentioning that because all of
the data goes into only two groups, the resulting data will be processed by
only two tasks, so you lose almost all parallelism. Presumably you're
processing a lot of data, since you only want to do one pass, so I doubt
that would actually be helpful.

Unfortunately, I don't think there is currently a better approach than
doing two passes. Given some more info about the downstream processing,
there may be alternatives, but in general I think you are stuck.

E.g., here's a slight variation on Rishi's proposal that may or may not
help:

initial.groupBy { x =>
  (if (x == something) "key1" else "key2", util.Random.nextInt(500))
}

which splits the data by a compound key -- first a label for whether or not
the element matches, and then a subdivision into another 500 groups. This
will result in nicely balanced tasks within each group, but it also shuffles
all of the data, which can be pretty expensive. You might be better off just
doing two passes over the raw data.
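
For reference, here is a rough PySpark sketch of the same compound-key idea
(assuming the initial and something from your original post; the 500 is an
arbitrary salt count):

import random

grouped = initial.groupBy(
    lambda x: ("key1" if x == something else "key2",
               random.randint(0, 499)))  # 500 sub-groups per label

Each (label, salt) pair becomes its own key, so downstream work is spread
over up to 1000 keys instead of 2.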

Imran


Re: optimize multiple filter operations

Posted by Rishi Yadav <ri...@infoobjects.com>.
You can try this (Scala version; you can convert it to Python):

val set = initial.groupBy(x => if (x == something) "key1" else "key2")

This would do one pass over the original data.
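
A rough PySpark translation (a sketch, using the initial and something from
the original post; the set1/set2 names are just illustrative):

grouped = initial.groupBy(lambda x: "key1" if x == something else "key2")

# Recover the two sets from the grouped (key, values) pairs if needed;
# consider grouped.cache() if you read it twice:
set1 = grouped.filter(lambda kv: kv[0] == "key1").flatMap(lambda kv: kv[1])
set2 = grouped.filter(lambda kv: kv[0] == "key2").flatMap(lambda kv: kv[1])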
