You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Xiang Huo <hu...@gmail.com> on 2013/11/04 07:42:50 UTC

How to filter a sorted RDD

Hi all,

I am trying to filter a smaller RDD data set from a large RDD data set. And
the large one is sorted. So my question is that is there any way to make
the filter method does't check every element in RDD but filter out all the
other elements when one element doesn't meet the condition of filter.
Because the large data set is sorted, when there is one element doesn't
meet the requirement, all the following elements are impossible to meet.
But checking them one by one will take a relative long time.
So is there any way to save time for this part?

Thanks,

Xiang

-- 
Xiang Huo
Department of Computer Science
University of Illinois at Chicago(UIC)
Chicago, Illinois
US
Email: huoxiang5659@gmail.com
           or xhuo4@uic.edu

Re: How to filter a sorted RDD

Posted by Zongheng Yang <zo...@gmail.com>.

Element-wise: that sounds like a sequential control flow whereas RDDs
are inherently parallel collections.  I'm also interested to know if
it's possible.

Partition-wise: PartitionPruningRDD [1] may be of help.

[1] http://spark.incubator.apache.org/docs/0.8.0/api/core/org/apache/spark/rdd/PartitionPruningRDD.html

On Sun, Nov 3, 2013 at 10:42 PM, Xiang Huo <hu...@gmail.com> wrote:
> Hi all,
>
> I am trying to filter a smaller RDD data set from a large RDD data set. And
> the large one is sorted. So my question is that is there any way to make the
> filter method does't check every element in RDD but filter out all the other
> elements when one element doesn't meet the condition of filter. Because the
> large data set is sorted, when there is one element doesn't meet the
> requirement, all the following elements are impossible to meet. But checking
> them one by one will take a relative long time.
> So is there any way to save time for this part?
>
> Thanks,
>
> Xiang
>
> --
> Xiang Huo
> Department of Computer Science
> University of Illinois at Chicago(UIC)
> Chicago, Illinois
> US
> Email: huoxiang5659@gmail.com
>            or xhuo4@uic.edu

Re: How to filter a sorted RDD

Posted by Xiang Huo <hu...@gmail.com>.

Thanks! It is very helpful!


2013/11/4 Mark Hamstra <ma...@clearstorydata.com>

> scala> val rdd = sc.parallelize(1 to 1000000, 4)
> rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
> parallelize at <console>:12
>
> scala> rdd.mapPartitions { itr => itr.takeWhile(_ < 10) }.count
> .
> .
> .
> res1: Long = 9
>
> In the first partition, 9 elements are iterated through before the
> evaluation of the closure over that partition is completed; in the other
> three partitions, only one element is examined.
>
>
>
> On Sun, Nov 3, 2013 at 11:32 PM, Xiang Huo <hu...@gmail.com> wrote:
>
>> Hi Mark,
>>
>> Could you tell me more detail information about how to short-circuit the
>> filtering ?
>>
>> Thanks.
>>
>> Xiang
>>
>>
>> 2013/11/4 Mark Hamstra <ma...@clearstorydata.com>
>>
>>> You could short-circuit the filtering within the interator function
>>> supplied to mapPartitions.
>>>
>>>
>>> On Sunday, November 3, 2013, Xiang Huo wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to filter a smaller RDD data set from a large RDD data set.
>>>> And the large one is sorted. So my question is that is there any way to
>>>> make the filter method does't check every element in RDD but filter out all
>>>> the other elements when one element doesn't meet the condition of filter.
>>>> Because the large data set is sorted, when there is one element doesn't
>>>> meet the requirement, all the following elements are impossible to meet.
>>>> But checking them one by one will take a relative long time.
>>>> So is there any way to save time for this part?
>>>>
>>>> Thanks,
>>>>
>>>> Xiang
>>>>
>>>> --
>>>> Xiang Huo
>>>> Department of Computer Science
>>>> University of Illinois at Chicago(UIC)
>>>> Chicago, Illinois
>>>> US
>>>> Email: huoxiang5659@gmail.com
>>>>            or xhuo4@uic.edu
>>>>
>>>
>>
>>
>> --
>> Xiang Huo
>> Department of Computer Science
>> University of Illinois at Chicago(UIC)
>> Chicago, Illinois
>> US
>> Email: huoxiang5659@gmail.com
>>            or xhuo4@uic.edu
>>
>
>


-- 
Xiang Huo
Department of Computer Science
University of Illinois at Chicago(UIC)
Chicago, Illinois
US
Email: huoxiang5659@gmail.com
           or xhuo4@uic.edu

Re: How to filter a sorted RDD

Posted by Mark Hamstra <ma...@clearstorydata.com>.

scala> val rdd = sc.parallelize(1 to 1000000, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize at <console>:12

scala> rdd.mapPartitions { itr => itr.takeWhile(_ < 10) }.count
.
.
.
res1: Long = 9

In the first partition, 9 elements are iterated through before the
evaluation of the closure over that partition is completed; in the other
three partitions, only one element is examined.



On Sun, Nov 3, 2013 at 11:32 PM, Xiang Huo <hu...@gmail.com> wrote:

> Hi Mark,
>
> Could you tell me more detail information about how to short-circuit the
> filtering ?
>
> Thanks.
>
> Xiang
>
>
> 2013/11/4 Mark Hamstra <ma...@clearstorydata.com>
>
>> You could short-circuit the filtering within the interator function
>> supplied to mapPartitions.
>>
>>
>> On Sunday, November 3, 2013, Xiang Huo wrote:
>>
>>> Hi all,
>>>
>>> I am trying to filter a smaller RDD data set from a large RDD data set.
>>> And the large one is sorted. So my question is that is there any way to
>>> make the filter method does't check every element in RDD but filter out all
>>> the other elements when one element doesn't meet the condition of filter.
>>> Because the large data set is sorted, when there is one element doesn't
>>> meet the requirement, all the following elements are impossible to meet.
>>> But checking them one by one will take a relative long time.
>>> So is there any way to save time for this part?
>>>
>>> Thanks,
>>>
>>> Xiang
>>>
>>> --
>>> Xiang Huo
>>> Department of Computer Science
>>> University of Illinois at Chicago(UIC)
>>> Chicago, Illinois
>>> US
>>> Email: huoxiang5659@gmail.com
>>>            or xhuo4@uic.edu
>>>
>>
>
>
> --
> Xiang Huo
> Department of Computer Science
> University of Illinois at Chicago(UIC)
> Chicago, Illinois
> US
> Email: huoxiang5659@gmail.com
>            or xhuo4@uic.edu
>

Re: How to filter a sorted RDD

Posted by Xiang Huo <hu...@gmail.com>.

Hi Mark,

Could you tell me more detail information about how to short-circuit the
filtering ?

Thanks.

Xiang


2013/11/4 Mark Hamstra <ma...@clearstorydata.com>

> You could short-circuit the filtering within the interator function
> supplied to mapPartitions.
>
>
> On Sunday, November 3, 2013, Xiang Huo wrote:
>
>> Hi all,
>>
>> I am trying to filter a smaller RDD data set from a large RDD data set.
>> And the large one is sorted. So my question is that is there any way to
>> make the filter method does't check every element in RDD but filter out all
>> the other elements when one element doesn't meet the condition of filter.
>> Because the large data set is sorted, when there is one element doesn't
>> meet the requirement, all the following elements are impossible to meet.
>> But checking them one by one will take a relative long time.
>> So is there any way to save time for this part?
>>
>> Thanks,
>>
>> Xiang
>>
>> --
>> Xiang Huo
>> Department of Computer Science
>> University of Illinois at Chicago(UIC)
>> Chicago, Illinois
>> US
>> Email: huoxiang5659@gmail.com
>>            or xhuo4@uic.edu
>>
>


-- 
Xiang Huo
Department of Computer Science
University of Illinois at Chicago(UIC)
Chicago, Illinois
US
Email: huoxiang5659@gmail.com
           or xhuo4@uic.edu

Re: How to filter a sorted RDD

Posted by Mark Hamstra <ma...@clearstorydata.com>.

You could short-circuit the filtering within the interator function
supplied to mapPartitions.


On Sunday, November 3, 2013, Xiang Huo wrote:

> Hi all,
>
> I am trying to filter a smaller RDD data set from a large RDD data set.
> And the large one is sorted. So my question is that is there any way to
> make the filter method does't check every element in RDD but filter out all
> the other elements when one element doesn't meet the condition of filter.
> Because the large data set is sorted, when there is one element doesn't
> meet the requirement, all the following elements are impossible to meet.
> But checking them one by one will take a relative long time.
> So is there any way to save time for this part?
>
> Thanks,
>
> Xiang
>
> --
> Xiang Huo
> Department of Computer Science
> University of Illinois at Chicago(UIC)
> Chicago, Illinois
> US
> Email: huoxiang5659@gmail.com <javascript:_e({}, 'cvml',
> 'huoxiang5659@gmail.com');>
>            or xhuo4@uic.edu <javascript:_e({}, 'cvml', 'xhuo4@uic.edu');>
>