Posted to user@couchdb.apache.org by Dan Woolley <da...@gmail.com> on 2008/12/14 18:06:25 UTC

Multiple search criteria with ranges

I'm researching CouchDB for a project dealing with real estate listing
data.  I'm very interested in CouchDB because the schemaless nature,
RESTful interface, and potential off-line usage with syncing fit my
problem very well.  I've been able to do some prototyping and search
on ranges for a single field very successfully.  I'm having trouble
wrapping my mind around views for a popular use case in real estate,
which is a query like:

Price = 350000-400000
Beds = 4-5
Baths = 2-3

Any single range above is trivial, but what is the best model for
handling this AND scenario with views?  The only thing I've been able
to come up with is three views returning doc ids - which should be
very fast - with an array intersection calculation on the client
side.  Although I haven't tried it yet, that client-side calculation
worries me with a potential dataset of 1M records - the client could
end up calculating the intersection of several 100K-element arrays.
Is that a realistic calculation?
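
To make the client-side step concrete, the intersection I have in mind
would look something like the sketch below (untested; it assumes each
view emits at most one row per document, so a doc id never appears
twice in the same list):

// idLists is an array of arrays of doc ids, one per range query
function intersectIds(idLists) {
  var counts = {};
  idLists.forEach(function (ids) {
    ids.forEach(function (id) { counts[id] = (counts[id] || 0) + 1; });
  });
  // keep only the ids that showed up in every list
  return Object.keys(counts).filter(function (id) {
    return counts[id] === idLists.length;
  });
}

// e.g. intersectIds([priceIds, bedIds, bathIds])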

Please tell me there is a better model for dealing with this type of
scenario - or that this use case is not well suited for CouchDB at
this time and I should move along.


Dan Woolley
profile:  http://www.linkedin.com/in/danwoolley
company:  http://woolleyrobertson.com
product:  http://dwellicious.com
blog:  http://tzetzefly.com




Re: Multiple search criteria with ranges

Posted by Dan Woolley <da...@gmail.com>.
Thanks for all the good ideas - CouchDB definitely has an active
community willing to help.  I'm going to try to run some tests over
the holidays with this and the server-side array intersection
methods, and see if it's worth pursuing further.


Dan Woolley
profile:  http://www.linkedin.com/in/danwoolley
company:  http://woolleyrobertson.com
product:  http://dwellicious.com
blog:  http://tzetzefly.com



On Dec 15, 2008, at 6:25 PM, Paul Davis wrote:

> I thought of a possible way to possible overcome this that may or may
> not work depending on the underlying data.
>
> General scheme is to basically make a view for each filter that you
> want to be able to combine. Then with a group reduce you could figure
> out the number of records in each range. Then fetch the records for
> the view with the fewest results and then multi get against all other
> filters removing keys if they come back with an error.
>
> More concretely:
>
> Each view would be:
>
> map:
> function(doc) {emit(doc.filter_field, 1);}
>
> reduce:
> function(keys, values) {return sum(values);}
>
> To get the count for a specific range you'd do:
>
> GET http://127.0.0.1:5984/db_name/_view/filters/by_field_x?group=true&startkey=min&endkey=max
>
> And in the client code you would merge each of the results. For things
> with a more continuous range like price, you may need to bucket
> appropriately. This query should be run once for each field.
>
> Then say we want to filter on fields X, Y, Z (assuming num_in(X) <
> num_in(Y) < num_in(Z))
>
> DocIds  = GET http://127.0.0.1:5984/db_name/_view/filters/by_field_X?startkey=minX&endkey=maxX
> Then intersect with a call to Y and Z views:
> POST http://127.0.0.1:5984/db_name/_view/filters/by_field_Y?startkey=minY&endkey=maxY
> BODY: {"keys": [DocIds]}
>
> That make sense?
>
> Paul
>
>
>
>
>
>
> On Sun, Dec 14, 2008 at 12:06 PM, Dan Woolley <da...@gmail.com>  
> wrote:
>> I'm researching Couchdb for a project dealing with real estate  
>> listing data.
>> I'm very interested in Couchdb because the schema less nature,  
>> RESTful
>> interface, and potential off-line usage with syncing fit my problem  
>> very
>> well.  I've been able to do some prototyping and search on ranges  
>> for a
>> single field very successfully.  I'm having trouble wrapping my  
>> mind around
>> views for a popular use case in real estate, which is a query like:
>>
>> Price = 350000-400000
>> Beds = 4-5
>> Baths = 2-3
>>
>> Any single range above is trivial, but what is the best model for  
>> handling
>> this AND scenario with views?  The only thing I've been able to  
>> come up with
>> is three views returning doc id's - which should be very fast -  
>> with an
>> array intersection calculation on the client side.  Although I  
>> haven't tried
>> it yet, that client side calculation worries me with a potential  
>> document
>> with 1M records - the client would potentially be dealing with  
>> calculating
>> the intersection of multiple 100K element arrays.  Is that a  
>> realistic
>> calculation?
>>
>> Please tell me there is a better model for dealing with this type of
>> scenario - or that this use case is not well suited for Couchdb at  
>> this time
>> and I should move along.
>>
>>
>> Dan Woolley
>> profile:  http://www.linkedin.com/in/danwoolley
>> company:  http://woolleyrobertson.com
>> product:  http://dwellicious.com
>> blog:  http://tzetzefly.com
>>
>>
>>
>>


Re: Multiple search criteria with ranges

Posted by Paul Davis <pa...@gmail.com>.
I thought of a possible way to overcome this that may or may not
work, depending on the underlying data.

The general scheme is to make a view for each filter that you want to
be able to combine. Then, with a group reduce, you can figure out the
number of records in each range. Fetch the records for the view with
the fewest results, then multi-get against all the other filters,
removing keys that come back with an error.

More concretely:

Each view would be:

map:
function(doc) {emit(doc.filter_field, 1);}

reduce:
function(keys, values) {return sum(values);}

To get the count for a specific range you'd do:

GET http://127.0.0.1:5984/db_name/_view/filters/by_field_x?group=true&startkey=min&endkey=max

And in the client code you would merge the results. For fields with a
more continuous range, like price, you may need to bucket
appropriately. This query should be run once for each field.
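
For example, a price view could bucket in its map function (just a
sketch; the 50,000 bucket width is an arbitrary choice, not anything
CouchDB requires):

function(doc) {
  // round continuous prices into 50,000-wide bins so that group=true
  // returns one count per bucket instead of one per distinct price
  emit(Math.floor(doc.price / 50000) * 50000, 1);
}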

Then say we want to filter on fields X, Y, Z (assuming num_in(X) <
num_in(Y) < num_in(Z))

DocIds  = GET http://127.0.0.1:5984/db_name/_view/filters/by_field_X?startkey=minX&endkey=maxX
Then intersect with a call to Y and Z views:
POST http://127.0.0.1:5984/db_name/_view/filters/by_field_Y?startkey=minY&endkey=maxY
BODY: {"keys": [DocIds]}
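
On the client side, that refinement step might look roughly like this
(a sketch only; it assumes you've already pulled the doc ids out of
each response into plain arrays):

// keep only the candidates that the Y (or Z) query also returned
function refine(candidateIds, returnedIds) {
  var seen = {};
  returnedIds.forEach(function (id) { seen[id] = true; });
  return candidateIds.filter(function (id) { return seen[id]; });
}

// candidates = doc ids from by_field_X (the narrowest range), then
// candidates = refine(candidates, idsFromY);
// candidates = refine(candidates, idsFromZ);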

That make sense?

Paul






On Sun, Dec 14, 2008 at 12:06 PM, Dan Woolley <da...@gmail.com> wrote:
> I'm researching Couchdb for a project dealing with real estate listing data.
>  I'm very interested in Couchdb because the schema less nature, RESTful
> interface, and potential off-line usage with syncing fit my problem very
> well.  I've been able to do some prototyping and search on ranges for a
> single field very successfully.  I'm having trouble wrapping my mind around
> views for a popular use case in real estate, which is a query like:
>
> Price = 350000-400000
> Beds = 4-5
> Baths = 2-3
>
> Any single range above is trivial, but what is the best model for handling
> this AND scenario with views?  The only thing I've been able to come up with
> is three views returning doc id's - which should be very fast - with an
> array intersection calculation on the client side.  Although I haven't tried
> it yet, that client side calculation worries me with a potential document
> with 1M records - the client would potentially be dealing with calculating
> the intersection of multiple 100K element arrays.  Is that a realistic
> calculation?
>
> Please tell me there is a better model for dealing with this type of
> scenario - or that this use case is not well suited for Couchdb at this time
> and I should move along.
>
>
> Dan Woolley
> profile:  http://www.linkedin.com/in/danwoolley
> company:  http://woolleyrobertson.com
> product:  http://dwellicious.com
> blog:  http://tzetzefly.com
>
>
>
>

Re: Multiple search criteria with ranges

Posted by "ara.t.howard" <ar...@gmail.com>.
On Dec 15, 2008, at 10:46 AM, Dean Landolt wrote:

> I don't think it's the actual intersection that kills, but the  
> downloading
> and json parsing of what would be megabytes of data (just the docids  
> alone
> for 200k docs would be quite a few mb).

yeah that is a good point.  it's about 1.7 mb per 100k of json ids.
if the 'client' were actually something like a rails instance running
on the same box or subnet as the db i guess that could work out, but
otherwise it's a pretty tough sell.  i wonder about doing something
creative with indexing both an attribute and an id and then limiting
by start/endkey and docid....  sure seems like this should be solvable.
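
something like this sketch is what i mean (untested, and 'price' is
just an example field):

// emit the attribute and the doc id together as the key
function(doc) {
  emit([doc.price, doc._id], null);
}

// range query: startkey=[350000]&endkey=[400000,{}]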

cheers.

a @ http://codeforpeople.com/
--
we can deny everything, except that we have the possibility of being  
better. simply reflect on that.
h.h. the 14th dalai lama




Re: Multiple search criteria with ranges

Posted by Dean Landolt <de...@deanlandolt.com>.
On Mon, Dec 15, 2008 at 12:08 PM, ara.t.howard <ar...@gmail.com>wrote:

>
> On Dec 14, 2008, at 10:06 AM, Dan Woolley wrote:
>
>  I'm researching Couchdb for a project dealing with real estate listing
>> data.  I'm very interested in Couchdb because the schema less nature,
>> RESTful interface, and potential off-line usage with syncing fit my problem
>> very well.  I've been able to do some prototyping and search on ranges for a
>> single field very successfully.  I'm having trouble wrapping my mind around
>> views for a popular use case in real estate, which is a query like:
>>
>> Price = 350000-400000
>> Beds = 4-5
>> Baths = 2-3
>>
>> Any single range above is trivial, but what is the best model for handling
>> this AND scenario with views?  The only thing I've been able to come up with
>> is three views returning doc id's - which should be very fast - with an
>> array intersection calculation on the client side.  Although I haven't tried
>> it yet, that client side calculation worries me with a potential document
>> with 1M records - the client would potentially be dealing with calculating
>> the intersection of multiple 100K element arrays.  Is that a realistic
>> calculation?
>>
>> Please tell me there is a better model for dealing with this type of
>> scenario - or that this use case is not well suited for Couchdb at this time
>> and I should move along.
>>
>
> using ruby or js i can compute the intersection of two 100k arrays in about
> 10/th a sec, for example with this code
>
>
>
> a=Array.new(100_000).map{ rand }
> b=Array.new(100_000).map{ rand }
>
> start_time=Time.now.to_f
>
> intersection = b & a
>
> end_time=Time.now.to_f
>
> puts(end_time - start_time)  #=> 0.197230815887451
>
>
> and that's on my laptop which isn't too quick using ruby which also isn't
> too quick.
>
>
> i guess to me it's seems like keeping an index of each attribute to search
> by and doing refinements is going to plenty fast, offload cpu cycles to the
> client, and keep the code orthogonal and easy to understand - you have one
> index per field, period.
>
> in addition it seems like are always going to have a natural first criteria
> and that you might be able to use startkey_docid/endkey_docid to limit the
> result set of the second and third queries to smaller and smaller ranges of
> ids (in the normal case).
>
> cheers.


I don't think it's the actual intersection that kills, but the downloading
and json parsing of what would be megabytes of data (just the docids alone
for 200k docs would be quite a few mb).

Re: Multiple search criteria with ranges

Posted by "ara.t.howard" <ar...@gmail.com>.
On Dec 14, 2008, at 10:06 AM, Dan Woolley wrote:

> I'm researching Couchdb for a project dealing with real estate  
> listing data.  I'm very interested in Couchdb because the schema  
> less nature, RESTful interface, and potential off-line usage with  
> syncing fit my problem very well.  I've been able to do some  
> prototyping and search on ranges for a single field very  
> successfully.  I'm having trouble wrapping my mind around views for  
> a popular use case in real estate, which is a query like:
>
> Price = 350000-400000
> Beds = 4-5
> Baths = 2-3
>
> Any single range above is trivial, but what is the best model for  
> handling this AND scenario with views?  The only thing I've been  
> able to come up with is three views returning doc id's - which  
> should be very fast - with an array intersection calculation on the  
> client side.  Although I haven't tried it yet, that client side  
> calculation worries me with a potential document with 1M records -  
> the client would potentially be dealing with calculating the  
> intersection of multiple 100K element arrays.  Is that a realistic  
> calculation?
>
> Please tell me there is a better model for dealing with this type of  
> scenario - or that this use case is not well suited for Couchdb at  
> this time and I should move along.

using ruby or js i can compute the intersection of two 100k arrays in
about a 10th of a second, for example with this code



a=Array.new(100_000).map{ rand }
b=Array.new(100_000).map{ rand }

start_time=Time.now.to_f

intersection = b & a  # Array#& is set intersection (| would give the union)

end_time=Time.now.to_f

puts(end_time - start_time)  #=> 0.197230815887451


and that's on my laptop, which isn't too quick, using ruby, which also
isn't too quick.


i guess to me it seems like keeping an index for each attribute to
search by and doing refinements is going to be plenty fast, offload cpu
cycles to the client, and keep the code orthogonal and easy to
understand - you have one index per field, period.

in addition it seems like you are always going to have a natural first
criterion, and you might be able to use startkey_docid/endkey_docid
to limit the result set of the second and third queries to smaller and
smaller ranges of ids (in the normal case).

cheers.


a @ http://codeforpeople.com/
--
we can deny everything, except that we have the possibility of being  
better. simply reflect on that.
h.h. the 14th dalai lama




Re: Multiple search criteria with ranges

Posted by Adam Kocoloski <ad...@gmail.com>.
Hi Dan, it's not a general-purpose solution, but in the specific  
example you gave where there's only one continuous variable you might  
be able to do something like the following.  Use a map function that  
emits your filterable quantities as a list:

emit([doc.beds, doc.baths, doc.price], null)

and then query that view multiple times with:

startkey=[4,2,350000]&endkey=[4,2,400000]
startkey=[4,3,350000]&endkey=[4,3,400000]
startkey=[5,2,350000]&endkey=[5,2,400000]
startkey=[5,3,350000]&endkey=[5,3,400000]

The advantage is that you get only the data you need; the disadvantage
is that the number of queries grows combinatorially with the number of
fields used in the filter (one query per combination of discrete
values), and there's no easy way to skip a field (you'd need to query
with every possible value of that field, or else write an additional
view that doesn't emit it).

A possible middle ground would be to query this same view with a  
single request:

startkey=[4,2,350000]&endkey=[5,3,400000]

You'll still need to filter the results on the client side, since e.g.  
a 4 bed, 2 bath, $600k listing would get included, but at least the  
data volume would be smaller than doing the whole intersection  
yourself.  If you go this route, the discrete vs. continuous variable  
thing doesn't really matter; just arrange the keys so that the one  
with the greatest discriminating power comes first.
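
The client-side filter for that coarse query is then just a pass over
the returned rows, along the lines of this sketch (where rows stands
in for the rows array from the view response):

// rows from startkey=[4,2,350000]&endkey=[5,3,400000]
// each row.key is [beds, baths, price]; re-check every bound locally
var matches = rows.filter(function (row) {
  var beds = row.key[0], baths = row.key[1], price = row.key[2];
  return beds >= 4 && beds <= 5 &&
         baths >= 2 && baths <= 3 &&
         price >= 350000 && price <= 400000;
});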

Best, Adam


On Dec 14, 2008, at 12:06 PM, Dan Woolley wrote:

> I'm researching Couchdb for a project dealing with real estate  
> listing data.  I'm very interested in Couchdb because the schema  
> less nature, RESTful interface, and potential off-line usage with  
> syncing fit my problem very well.  I've been able to do some  
> prototyping and search on ranges for a single field very  
> successfully.  I'm having trouble wrapping my mind around views for  
> a popular use case in real estate, which is a query like:
>
> Price = 350000-400000
> Beds = 4-5
> Baths = 2-3
>
> Any single range above is trivial, but what is the best model for  
> handling this AND scenario with views?  The only thing I've been  
> able to come up with is three views returning doc id's - which  
> should be very fast - with an array intersection calculation on the  
> client side.  Although I haven't tried it yet, that client side  
> calculation worries me with a potential document with 1M records -  
> the client would potentially be dealing with calculating the  
> intersection of multiple 100K element arrays.  Is that a realistic  
> calculation?
>
> Please tell me there is a better model for dealing with this type of  
> scenario - or that this use case is not well suited for Couchdb at  
> this time and I should move along.
>
>
> Dan Woolley
> profile:  http://www.linkedin.com/in/danwoolley
> company:  http://woolleyrobertson.com
> product:  http://dwellicious.com
> blog:  http://tzetzefly.com
>
>
>


Re: Multiple search criteria with ranges

Posted by Thibaut Barrère <th...@gmail.com>.
Hi,

> Faceted search like this isn't best supported directly in CouchDB
> itself. Its a feature that's been discussed for implementation but as
> of yet there aren't any concrete plans on what that implementation
> would look like.
>
> That being said, there's nothing keeping you from using an external
> indexer such as Solr that supports faceted searching like you're
> describing.

I'm also trying to implement this faceted search (I mailed the list
today about it, but the mail didn't make it; I think I didn't send it
to the current list address).

I'd be very interested to hear some feedback from people who have
actually implemented something like that - either with Solr or any
other solution.

cheers

-- Thibaut

Re: Multiple search criteria with ranges

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Dec 15, 2008 at 4:28 AM, Adam Groves <ad...@gmail.com> wrote:
> Hi Paul,
>
> I noticed you were working on various search solutions for CouchDB.
> Your EFTI project on github specifically caught my attention. Is it on
> ice at the moment?
>

More or less. I haven't pursued it much beyond discussing ideas for it
on IRC. Generally I'm waiting for a clearer picture of how the Erlang
plugin interface will pan out. The biggest questions for me are about
how to integrate the code with other aspects of CouchDB. Replication
in particular is a bit of a stumbling block in terms of how to pursue
an actual implementation.

Paul

> Kind regards
>
> Adam Groves
>
> 2008/12/14 Paul Davis <pa...@gmail.com>:
>> Faceted search like this isn't best supported directly in CouchDB
>> itself. Its a feature that's been discussed for implementation but as
>> of yet there aren't any concrete plans on what that implementation
>> would look like.
>>
>> That being said, there's nothing keeping you from using an external
>> indexer such as Solr that supports faceted searching like you're
>> describing.
>>
>> Also, patches are welcome :D
>>
>> Paul Davis
>>
>> On Sun, Dec 14, 2008 at 12:06 PM, Dan Woolley <da...@gmail.com> wrote:
>>> I'm researching Couchdb for a project dealing with real estate listing data.
>>>  I'm very interested in Couchdb because the schema less nature, RESTful
>>> interface, and potential off-line usage with syncing fit my problem very
>>> well.  I've been able to do some prototyping and search on ranges for a
>>> single field very successfully.  I'm having trouble wrapping my mind around
>>> views for a popular use case in real estate, which is a query like:
>>>
>>> Price = 350000-400000
>>> Beds = 4-5
>>> Baths = 2-3
>>>
>>> Any single range above is trivial, but what is the best model for handling
>>> this AND scenario with views?  The only thing I've been able to come up with
>>> is three views returning doc id's - which should be very fast - with an
>>> array intersection calculation on the client side.  Although I haven't tried
>>> it yet, that client side calculation worries me with a potential document
>>> with 1M records - the client would potentially be dealing with calculating
>>> the intersection of multiple 100K element arrays.  Is that a realistic
>>> calculation?
>>>
>>> Please tell me there is a better model for dealing with this type of
>>> scenario - or that this use case is not well suited for Couchdb at this time
>>> and I should move along.
>>>
>>>
>>> Dan Woolley
>>> profile:  http://www.linkedin.com/in/danwoolley
>>> company:  http://woolleyrobertson.com
>>> product:  http://dwellicious.com
>>> blog:  http://tzetzefly.com
>>>
>>>
>>>
>>>
>>
>

Re: Multiple search criteria with ranges

Posted by Adam Groves <ad...@gmail.com>.
Hi Paul,

I noticed you were working on various search solutions for CouchDB.
Your EFTI project on github specifically caught my attention. Is it on
ice at the moment?

Kind regards

Adam Groves

2008/12/14 Paul Davis <pa...@gmail.com>:
> Faceted search like this isn't best supported directly in CouchDB
> itself. Its a feature that's been discussed for implementation but as
> of yet there aren't any concrete plans on what that implementation
> would look like.
>
> That being said, there's nothing keeping you from using an external
> indexer such as Solr that supports faceted searching like you're
> describing.
>
> Also, patches are welcome :D
>
> Paul Davis
>
> On Sun, Dec 14, 2008 at 12:06 PM, Dan Woolley <da...@gmail.com> wrote:
>> I'm researching Couchdb for a project dealing with real estate listing data.
>>  I'm very interested in Couchdb because the schema less nature, RESTful
>> interface, and potential off-line usage with syncing fit my problem very
>> well.  I've been able to do some prototyping and search on ranges for a
>> single field very successfully.  I'm having trouble wrapping my mind around
>> views for a popular use case in real estate, which is a query like:
>>
>> Price = 350000-400000
>> Beds = 4-5
>> Baths = 2-3
>>
>> Any single range above is trivial, but what is the best model for handling
>> this AND scenario with views?  The only thing I've been able to come up with
>> is three views returning doc id's - which should be very fast - with an
>> array intersection calculation on the client side.  Although I haven't tried
>> it yet, that client side calculation worries me with a potential document
>> with 1M records - the client would potentially be dealing with calculating
>> the intersection of multiple 100K element arrays.  Is that a realistic
>> calculation?
>>
>> Please tell me there is a better model for dealing with this type of
>> scenario - or that this use case is not well suited for Couchdb at this time
>> and I should move along.
>>
>>
>> Dan Woolley
>> profile:  http://www.linkedin.com/in/danwoolley
>> company:  http://woolleyrobertson.com
>> product:  http://dwellicious.com
>> blog:  http://tzetzefly.com
>>
>>
>>
>>
>

Re: Multiple search criteria with ranges

Posted by Paul Davis <pa...@gmail.com>.
Faceted search like this isn't directly supported in CouchDB itself.
It's a feature that's been discussed for implementation, but as of yet
there aren't any concrete plans for what that implementation would
look like.

That being said, there's nothing keeping you from using an external
indexer such as Solr, which supports the kind of faceted searching
you're describing.

Also, patches are welcome :D

Paul Davis

On Sun, Dec 14, 2008 at 12:06 PM, Dan Woolley <da...@gmail.com> wrote:
> I'm researching Couchdb for a project dealing with real estate listing data.
>  I'm very interested in Couchdb because the schema less nature, RESTful
> interface, and potential off-line usage with syncing fit my problem very
> well.  I've been able to do some prototyping and search on ranges for a
> single field very successfully.  I'm having trouble wrapping my mind around
> views for a popular use case in real estate, which is a query like:
>
> Price = 350000-400000
> Beds = 4-5
> Baths = 2-3
>
> Any single range above is trivial, but what is the best model for handling
> this AND scenario with views?  The only thing I've been able to come up with
> is three views returning doc id's - which should be very fast - with an
> array intersection calculation on the client side.  Although I haven't tried
> it yet, that client side calculation worries me with a potential document
> with 1M records - the client would potentially be dealing with calculating
> the intersection of multiple 100K element arrays.  Is that a realistic
> calculation?
>
> Please tell me there is a better model for dealing with this type of
> scenario - or that this use case is not well suited for Couchdb at this time
> and I should move along.
>
>
> Dan Woolley
> profile:  http://www.linkedin.com/in/danwoolley
> company:  http://woolleyrobertson.com
> product:  http://dwellicious.com
> blog:  http://tzetzefly.com
>
>
>
>