Posted to user@spark.apache.org by Nirav Patel <np...@xactlycorp.com> on 2016/01/23 14:48:41 UTC

How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

The problem is that I have an RDD of about 10M rows, and it keeps growing.
Every time we want to query and compute on a subset of the data, we have to
use filter and then some aggregation. As I understand it, filter goes through
every row of every partition of the RDD, which may not be efficient at all.
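
Roughly the pattern we use today; a sketch that runs in spark-shell (the key
type, bounds, and values are illustrative, not our real data):

    // Illustrative pair RDD keyed by Long, standing in for the ~10M-row data set.
    val pairRdd = sc.parallelize(1L to 10000000L).map(k => (k, k * 0.5))

    // Full scan: filter visits every row of every partition,
    // then the survivors are aggregated.
    val subset = pairRdd.filter { case (k, _) => k >= 500000L && k < 600000L }
    val total  = subset.values.sum()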

Given that Spark already has Ordered RDD functions, I don't see why such a
function would be so difficult to implement. Cassandra/HBase have had this for
years: they can fetch data from only the partitions that match your row key.
Scala's TreeMap has a range method that does the same.
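
For comparison, Scala's TreeMap only walks the relevant part of the tree
(the values here are just an illustration):

    import scala.collection.immutable.TreeMap

    val tm = TreeMap(1 -> "a", 5 -> "b", 9 -> "c")
    // Keys in [2, 8) are returned without scanning the whole map.
    tm.range(2, 8)   // Map(5 -> b)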

I think people have been looking for this for a while; I see several posts
asking about it.

http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-td20170.html#a26048

By the way, I assume there
Thanks
Nirav


Re: How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

Posted by Nirav Patel <np...@xactlycorp.com>.
@Ilya Ganelin, I'm not sure how zipWithIndex() would make this less than an
O(n) scan. The Spark docs don't mention anything about that.

I found a solution with Spark 1.5.2: OrderedRDDFunctions has a filterByRange
API.
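
A minimal sketch of how we use it in spark-shell (key type and bounds are
illustrative; the RDD must be sorted by key so the range can prune partitions):

    // Illustrative pair RDD keyed by Long.
    val pairRdd = sc.parallelize(1L to 10000000L).map(k => (k, k * 0.5))

    // sortByKey installs a RangePartitioner; filterByRange can then skip
    // whole partitions whose key range lies outside [lower, upper].
    val sorted = pairRdd.sortByKey()

    // filterByRange comes from the implicit conversion to OrderedRDDFunctions.
    val slice = sorted.filterByRange(500000L, 600000L)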

Thanks



Re: How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

Posted by Sonal Goyal <so...@gmail.com>.
One thing you can also look at is saving your data in a way that can be
accessed through file patterns, e.g. by hour, zone, etc., so that you only
load what you need.
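
A hedged sketch of that idea (the paths and the per-hour layout here are
hypothetical):

    // Write one directory per hour so later jobs can read just a slice.
    // hourlyRdd is whatever subset of your data belongs to that hour.
    hourlyRdd.saveAsTextFile("hdfs:///data/events/2016-01-24-22")

    // Later, load only the hours you need; textFile accepts Hadoop glob patterns.
    val slice = sc.textFile("hdfs:///data/events/2016-01-24-2[0-3]")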

Re: How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

Posted by Ilya Ganelin <il...@gmail.com>.
The solution I normally use is zipWithIndex() followed by the filter
operation. Filter is an O(m) operation per partition, where m is the size of
the partition and partitions are processed in parallel, rather than a single
sequential O(N) pass.
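
For concreteness, a sketch of that pattern (the element type and index bounds
are illustrative):

    // Illustrative RDD standing in for the real data.
    val rdd = sc.parallelize(1 to 10000000).map(n => s"row-$n")

    // zipWithIndex() pairs each element with a stable Long index.
    val indexed = rdd.zipWithIndex()   // RDD[(String, Long)]

    // Keep only a slice of indices, then drop the index again.
    val slice = indexed
      .filter { case (_, i) => i >= 1000000L && i < 1100000L }
      .keys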

-Ilya Ganelin
