Posted to user@couchdb.apache.org by Dan Woolley <da...@gmail.com> on 2008/12/14 18:06:25 UTC
Multiple search criteria with ranges
I'm researching CouchDB for a project dealing with real estate listing
data. I'm very interested in CouchDB because its schemaless nature,
RESTful interface, and potential off-line usage with syncing fit my
problem very well. I've been able to do some prototyping and search
on ranges for a single field very successfully. I'm having trouble
wrapping my mind around views for a popular use case in real estate,
which is a query like:
Price = 350000-400000
Beds = 4-5
Baths = 2-3
Any single range above is trivial, but what is the best model for
handling this AND scenario with views? The only thing I've been able
to come up with is three views returning doc ids - which should be
very fast - with an array intersection calculation on the client
side. Although I haven't tried it yet, that client-side calculation
worries me with a potential dataset of 1M records - the client could
be dealing with calculating the intersection of multiple 100K-element
arrays. Is that a realistic calculation?
Please tell me there is a better model for dealing with this type of
scenario - or that this use case is not well suited for CouchDB at
this time and I should move along.
Dan Woolley
profile: http://www.linkedin.com/in/danwoolley
company: http://woolleyrobertson.com
product: http://dwellicious.com
blog: http://tzetzefly.com
Re: Multiple search criteria with ranges
Posted by Dan Woolley <da...@gmail.com>.
Thanks for all the good ideas - CouchDB definitely has an active
community willing to help. I'm going to try to run some tests over
the holidays, with this and the array intersection methods on the
server side, and see if it's worth pursuing further.
Re: Multiple search criteria with ranges
Posted by Paul Davis <pa...@gmail.com>.
I thought of a possible way to overcome this that may or may not
work depending on the underlying data.
The general scheme is to make a view for each filter that you want to
be able to combine. Then, with a group reduce, you can figure out the
number of records in each range. Fetch the records for the view with
the fewest results, then multi-get against all the other filters,
removing keys that come back with an error.
More concretely:
Each view would be:
map:
function(doc) { emit(doc.filter_field, 1); }
reduce:
function(keys, values) { return sum(values); }
To get the count for a specific range you'd do:
GET http://127.0.0.1:5984/db_name/_view/filters/by_field_x?group=true&startkey=min&endkey=max
And in the client code you would merge each of the results. For things
with a more continuous range like price, you may need to bucket
appropriately. This query should be run once for each field.
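In client code, picking which filter to fetch first from those per-field counts might look like the following sketch. The field names and counts here are made up for illustration; the real values would come from summing the rows of each group-reduce query.

```javascript
// Hypothetical per-field totals, each obtained by summing the rows of a
// group=true reduce query restricted to that field's range.
const rangeCounts = {
  price: 1200,  // rows matching 350000..400000
  beds: 45000,  // rows matching 4..5
  baths: 60000, // rows matching 2..3
};

// Pick the most selective field: fetch its doc ids first, then check the
// remaining filters against that (smallest) candidate set.
function mostSelective(counts) {
  return Object.keys(counts).reduce((best, field) =>
    counts[field] < counts[best] ? field : best);
}

console.log(mostSelective(rangeCounts)); // "price"
```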
Then say we want to filter on fields X, Y, Z (assuming num_in(X) <
num_in(Y) < num_in(Z)):
DocIds = GET http://127.0.0.1:5984/db_name/_view/filters/by_field_X?startkey=minX&endkey=maxX
Then intersect with a call to the Y and Z views:
POST http://127.0.0.1:5984/db_name/_view/filters/by_field_Y?startkey=minY&endkey=maxY
BODY: {"keys": DocIds}
Does that make sense?
Paul
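The "fetch the smallest set, then winnow it against the other filters" step Paul describes boils down to a running set intersection. A minimal client-side sketch, independent of CouchDB itself (the id arrays are invented for the example):

```javascript
// Start from the smallest candidate set and intersect it with each
// further set of matching doc ids. A Set keeps each membership test O(1).
function intersectAll(smallest, ...others) {
  let survivors = smallest;
  for (const ids of others) {
    const lookup = new Set(ids);
    survivors = survivors.filter((id) => lookup.has(id));
  }
  return survivors;
}

const byPrice = ["a", "b", "c"];
const byBeds = ["b", "c", "d", "e"];
const byBaths = ["c", "e", "f"];
console.log(intersectAll(byPrice, byBeds, byBaths)); // ["c"]
```

Because each round only keeps survivors, later intersections operate on ever-smaller arrays, which is why starting from the most selective filter matters.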
Re: Multiple search criteria with ranges
Posted by "ara.t.howard" <ar...@gmail.com>.
On Dec 15, 2008, at 10:46 AM, Dean Landolt wrote:
> I don't think it's the actual intersection that kills, but the
> downloading
> and json parsing of what would be megabytes of data (just the docids
> alone
> for 200k docs would be quite a few mb).
yeah that is a good point. it's about 1.7 mb per 100k of json ids.
if the 'client' were actually something like a rails instance running
on the same box or subnet as the db i guess that could work out, but
otherwise it's a pretty tough sell. i wonder about doing something
creative with indexing both an attribute and an id and then limiting
by start/endkey and docid... sure seems like this should be solvable.
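ara's 1.7 mb figure checks out as simple arithmetic if the ids run about 14 characters each; the id lengths below are assumptions, not measured values from this thread.

```javascript
// Rough payload estimate for shipping N doc ids as a JSON array:
// each id costs its own length plus 3 bytes for the quotes and comma.
function jsonIdPayloadBytes(count, idLength) {
  return count * (idLength + 3) + 2; // +2 for the surrounding brackets
}

// 100k ids of 14 characters is about 1.7 MB; CouchDB's default
// 32-character uuids would be roughly double that, about 3.5 MB.
console.log(jsonIdPayloadBytes(100000, 14)); // 1700002
console.log(jsonIdPayloadBytes(100000, 32)); // 3500002
```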
cheers.
a @ http://codeforpeople.com/
--
we can deny everything, except that we have the possibility of being
better. simply reflect on that.
h.h. the 14th dalai lama
Re: Multiple search criteria with ranges
Posted by Dean Landolt <de...@deanlandolt.com>.
On Mon, Dec 15, 2008 at 12:08 PM, ara.t.howard <ar...@gmail.com>wrote:
>
> using ruby or js i can compute the intersection of two 100k arrays in
> about a fifth of a second [...]
>
> i guess to me it seems like keeping an index of each attribute to
> search by and doing refinements is going to be plenty fast [...]
>
I don't think it's the actual intersection that kills, but the downloading
and json parsing of what would be megabytes of data (just the docids alone
for 200k docs would be quite a few mb).
Re: Multiple search criteria with ranges
Posted by "ara.t.howard" <ar...@gmail.com>.
using ruby or js i can compute the intersection of two 100k arrays in
about a fifth of a second, for example with this code
a = Array.new(100_000).map { rand }
b = Array.new(100_000).map { rand }

start_time = Time.now.to_f

intersection = b & a  # Array#& is set intersection; | would be union

end_time = Time.now.to_f

puts(end_time - start_time) #=> 0.197230815887451
and that's on my laptop which isn't too quick using ruby which also
isn't too quick.
i guess to me it seems like keeping an index of each attribute to
search by and doing refinements is going to be plenty fast, offload
cpu cycles to the client, and keep the code orthogonal and easy to
understand - you have one index per field, period.
in addition it seems like you are always going to have a natural first
criterion, and you might be able to use startkey_docid/endkey_docid
to limit the result set of the second and third queries to smaller and
smaller ranges of ids (in the normal case).
cheers.
Re: Multiple search criteria with ranges
Posted by Adam Kocoloski <ad...@gmail.com>.
Hi Dan, it's not a general-purpose solution, but in the specific
example you gave where there's only one continuous variable you might
be able to do something like the following. Use a map function that
emits your filterable quantities as a list:
emit([doc.beds, doc.baths, doc.price], null)
and then query that view multiple times with:
startkey=[4,2,350000]&endkey=[4,2,400000]
startkey=[4,3,350000]&endkey=[4,3,400000]
startkey=[5,2,350000]&endkey=[5,2,400000]
startkey=[5,3,350000]&endkey=[5,3,400000]
The advantage is that you get only the data you need; the disadvantage
is that the number of queries scales non-linearly with the number of
fields used in the filter, and there's no easy way to skip fields
(you'd need to query with every possible value of that field, or else
write an additional view that doesn't emit it).
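For what it's worth, the per-combination queries can be generated mechanically. A small sketch (the function name is illustrative, not a CouchDB API):

```javascript
// Enumerate one (startkey, endkey) pair per combination of the discrete
// fields, leaving only the continuous field (price) as a real range.
function rangeQueries(bedsValues, bathsValues, priceMin, priceMax) {
  const queries = [];
  for (const beds of bedsValues) {
    for (const baths of bathsValues) {
      queries.push({
        startkey: [beds, baths, priceMin],
        endkey: [beds, baths, priceMax],
      });
    }
  }
  return queries;
}

console.log(rangeQueries([4, 5], [2, 3], 350000, 400000).length); // 4
```

The query count is the product of the discrete-field cardinalities, which is the non-linear scaling Adam warns about.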
A possible middle ground would be to query this same view with a
single request:
startkey=[4,2,350000]&endkey=[5,3,400000]
You'll still need to filter the results on the client side, since e.g.
a 4 bed, 2 bath, $600k listing would get included, but at least the
data volume would be smaller than doing the whole intersection
yourself. If you go this route, the discrete vs. continuous variable
thing doesn't really matter; just arrange the keys so that the one
with the greatest discriminating power comes first.
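The client-side filtering for that single wide query could be sketched like so, assuming each row's key is [beds, baths, price] as in the view above (the row data here is invented):

```javascript
// Rows come back from one wide query, startkey=[4,2,350000] to
// endkey=[5,3,400000]. Keys sort element-wise, so a [4,2,600000] row
// falls inside the composite range even though it violates the price
// range; those rows are discarded here.
function withinRanges(rows, [minBeds, maxBeds], [minBaths, maxBaths], [minPrice, maxPrice]) {
  return rows.filter(({ key: [beds, baths, price] }) =>
    beds >= minBeds && beds <= maxBeds &&
    baths >= minBaths && baths <= maxBaths &&
    price >= minPrice && price <= maxPrice);
}

const rows = [
  { id: "ok", key: [4, 2, 360000] },
  { id: "too-expensive", key: [4, 2, 600000] },
];
console.log(withinRanges(rows, [4, 5], [2, 3], [350000, 400000]).map((r) => r.id)); // ["ok"]
```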
Best, Adam
Re: Multiple search criteria with ranges
Posted by Thibaut Barrère <th...@gmail.com>.
Hi,
> Faceted search like this isn't directly supported in CouchDB
> itself. It's a feature that's been discussed for implementation, but
> as of yet there aren't any concrete plans for what that
> implementation would look like.
>
> That being said, there's nothing keeping you from using an external
> indexer such as Solr that supports faceted searching like you're
> describing.
I'm also trying to implement this kind of faceted search (I mailed
the list about it today, but the mail didn't make it - I think I
didn't post it to the correct address).
I'd be very interested to hear some feedback from people who actually
implemented something like that - either with Solr or any other
solution.
cheers
-- Thibaut
Re: Multiple search criteria with ranges
Posted by Paul Davis <pa...@gmail.com>.
On Mon, Dec 15, 2008 at 4:28 AM, Adam Groves <ad...@gmail.com> wrote:
> Hi Paul,
>
> I noticed you were working on various search solutions for CouchDB.
> Your EFTI project on github specifically caught my attention. Is it on
> ice at the moment?
>
More or less. I haven't pursued it much beyond discussing ideas for it
on IRC. Generally I'm waiting for a better picture of how the Erlang
plugin interface will pan out. The biggest questions for me are about
how to integrate the code with other aspects of CouchDB. Replication
in particular is a bit of a stumbling block in terms of how to pursue
an actual implementation.
Paul
Re: Multiple search criteria with ranges
Posted by Adam Groves <ad...@gmail.com>.
Hi Paul,
I noticed you were working on various search solutions for CouchDB.
Your EFTI project on github specifically caught my attention. Is it on
ice at the moment?
Kind regards
Adam Groves
Re: Multiple search criteria with ranges
Posted by Paul Davis <pa...@gmail.com>.
Faceted search like this isn't directly supported in CouchDB itself.
It's a feature that's been discussed for implementation, but as of
yet there aren't any concrete plans for what that implementation
would look like.
That being said, there's nothing keeping you from using an external
indexer such as Solr that supports faceted searching like you're
describing.
Also, patches are welcome :D
Paul Davis