You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@couchdb.apache.org by Brian Candler <B....@pobox.com> on 2009/05/04 23:00:46 UTC

Efficient range queries

Just a quick check to see if someone has a view strategy I can borrow.

Suppose I have a large collection of documents each of which contains a
range like this:

  lower: <value1>,
  upper: <value2>

I want to be able to search these documents, such that when given a value v,
I quickly find all documents where v lies between the lower and upper
values. Any suggestions?

Thanks,

Brian.

Re: Efficient range queries

Posted by Damien Katz <da...@apache.org>.

On May 4, 2009, at 5:15 PM, Chris Anderson wrote:

> On Mon, May 4, 2009 at 2:00 PM, Brian Candler <B....@pobox.com>  
> wrote:
>> Just a quick check to see if someone has a view strategy I can  
>> borrow.
>>
>> Suppose I have a large collection of documents each of which  
>> contains a
>> range like this:
>>
>>  lower: <value1>,
>>  upper: <value2>
>>
>> I want to be able to search these documents, such that when given a  
>> value v,
>> I quickly find all documents where v lies between the lower and upper
>> values. Any suggestions?
>>
>
> The simplest approach is to define a granularity for your queries and
> then emit each value within the range at that granularity from each
> document.

This is probably the best option given the way CouchDB views work.

On the map phase, round down the start key and end key down to the  
level of granularity, and emit each a key for each interval of  
granularity until you reach the end.

On lookup, the client then rounds the input start key and end key down  
and does a range query. It then filters the results client side,  
removing results that fall within the rounded ranges but don't  
actually fall within the input key ranges. I think this could be done  
by the _list functionality.

This method allows for fully continuous range queries, but the  
tradeoff is more key/value emits versus the extra work of filtering  
out the out of range results at query time.

-Damien

Re: Efficient range queries

Posted by Brian Candler <B....@pobox.com>.

On Mon, May 04, 2009 at 05:18:13PM -0400, Paul Davis wrote:
> There's a data structure called a nested containment list that is
> specifically meant for this type of query. It's on my todo list, but
> no where near the top. If someone wants to add an indexer that'd be
> pretty awesome.

This one? http://bioinfo.mbi.ucla.edu/pygr/docs/draft5.pdf/view

Looks interesting. However I note:

"The price for efficiency of the NCList approach is that it is most
efficient in static environments where the database is updated in bulk only
rarely, since any update would require rebuild of the entire database"

So unfortunately we don't get incremental updates. But the paper does list
some other algorithms which might allow this.

In the mean time: looks like I need to divide my ranges into bins. With
care I think this can be done with varying bin widths.

e.g. document contains range 37 to 345. I could emit keys:

  - "37", "38", "39", "4x", "5x", "6x", "7x", "8x", "9x", "1xx", "2xx",
    "30x", "31x", "32x", "33x", "340", "341", "342", "343", "344", "345"

Then in order to query for value 123:

   - multi-key fetch "123", "12x", "1xx"

Not pretty but should scale reasonably well. Maybe it would work better in
binary:

e.g. document contains range 37 to 345. Emit keys:

  - "100101", "10011x", "101xxx", "11xxxx", "1xxxxxx", "1xxxxxxx",
    "100xxxxxx", "1010xxxxx", "101010xxx", "10101100x"

Query for value 123:

  - multi-key fetch "1111011", "111101x", "11110xx", "1111xxx",
    "111xxxx", "11xxxxx", "1xxxxxx"

That gives fewer keys to emit but more to fetch. A key like "1xxxxxx" could
be emitted as "64-127" which easier to understand, and slightly shorter.

Anyway, thanks for clarifying that I hadn't missed anything obvious.

Regards,

Brian.

Re: Efficient range queries

Posted by Paul Davis <pa...@gmail.com>.

On Mon, May 4, 2009 at 5:15 PM, Chris Anderson <jc...@apache.org> wrote:
> On Mon, May 4, 2009 at 2:00 PM, Brian Candler <B....@pobox.com> wrote:
>> Just a quick check to see if someone has a view strategy I can borrow.
>>
>> Suppose I have a large collection of documents each of which contains a
>> range like this:
>>
>>  lower: <value1>,
>>  upper: <value2>
>>
>> I want to be able to search these documents, such that when given a value v,
>> I quickly find all documents where v lies between the lower and upper
>> values. Any suggestions?
>>
>
> The simplest approach is to define a granularity for your queries and
> then emit each value within the range at that granularity from each
> document.
>
> Otherwise you end up intersecting potentially very large data sets on
> the client. Going down this second route can be optimized a bit if
> there is a maximum size each range can have. If the ranges are
> unbounded, then so too will be the result sets you must intersect.
>
> Or maybe someone can come up with a more clever method.
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io
>

There's a data structure called a nested containment list that is
specifically meant for this type of query. It's on my todo list, but
no where near the top. If someone wants to add an indexer that'd be
pretty awesome.

Paul Davis

Re: Efficient range queries

Posted by Chris Anderson <jc...@apache.org>.

On Mon, May 4, 2009 at 2:00 PM, Brian Candler <B....@pobox.com> wrote:
> Just a quick check to see if someone has a view strategy I can borrow.
>
> Suppose I have a large collection of documents each of which contains a
> range like this:
>
>  lower: <value1>,
>  upper: <value2>
>
> I want to be able to search these documents, such that when given a value v,
> I quickly find all documents where v lies between the lower and upper
> values. Any suggestions?
>

The simplest approach is to define a granularity for your queries and
then emit each value within the range at that granularity from each
document.

Otherwise you end up intersecting potentially very large data sets on
the client. Going down this second route can be optimized a bit if
there is a maximum size each range can have. If the ranges are
unbounded, then so too will be the result sets you must intersect.

Or maybe someone can come up with a more clever method.

-- 
Chris Anderson
http://jchrisa.net
http://couch.io