Posted to user@couchdb.apache.org by Brad Anderson <br...@sankatygroup.com> on 2008/08/19 05:31:00 UTC

flexible filtering needed, with speed.

Howdy,

I have 12K docs that look like this:

{
  "_id": "000111bf7a8515da822b05ebbb8cd257",
  "_rev": "94750440",
  "month": 17,
  "store": {
   "store_num": 123,
   "city": "Atlanta",
   "state": "GA",
   "zip": "30301",
   "exterior": true,
   "interior": true,
   "restroom": true,
   "breakfast": true,
   "sunday": true,
   "adi_name": "Atlanta, GA",
   "adi_num": 123,
   "ownership": "Company",
   "playground": "Indoor",
   "seats": 123,
   "parking_spaces": 123
  },
  "raw": {
   "Other Hourly Pay": 0.28,
   "Workers Comp - State Funds Exp": 401.65,
   "Rent Expense - Company": -8,
   "Archives Expense": 82.81,
   "Revised Hours allowed per": 860.22,
   "Merch Standard": 174.78,
   "Total Property Tax": 1190.91

   ...

  }
}

I truncated 'raw' but it's usually much longer, and avg. doc size is 5K.

I'm trying to see how I will query them with views.  I want to be  
able to filter down by various store sub-fields, e.g. all the Breakfast  
= true stores in Georgia that are owned by Franchisees.  However, the  
filter will differ for just about every query.

The 'reduce' function would then be averaging each line in the 'raw'  
field.
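A minimal sketch of such a reduce, assuming the map emits each doc's 'raw' object as its value (the field name in the comment follows the sample doc above; the per-field {sum, count} accumulator shape is my own choice, so that rereduce stays exact):

```javascript
// Sketch: average each 'raw' line across the selected docs.
// Assumes the map emits the 'raw' object as its value.
function reduce(keys, values, rereduce) {
  var acc = {}; // per-field running { sum, count }
  values.forEach(function (v) {
    for (var field in v) {
      if (!acc[field]) acc[field] = { sum: 0, count: 0 };
      if (rereduce) {
        // values are already { sum, count } accumulators
        acc[field].sum += v[field].sum;
        acc[field].count += v[field].count;
      } else {
        // values are raw numbers, e.g. "Archives Expense": 82.81
        acc[field].sum += v[field];
        acc[field].count += 1;
      }
    }
  });
  return acc;
}
```

Returning sum and count rather than a running average keeps the rereduce step correct; the client divides sum by count per field at the end.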

I have played around with views that take the store filters, but just  
returning the 'raw' field as the value from the map function is  
brutally slow in Futon.  This is because the view is built on first  
access, which takes about 3-4 mins (on a MBP with 4GB RAM, 2.2GHz  
dual core, 7200RPM disk).  I understand that the next time this  
specific store group is requested it's fast...  but the groups will  
all be so dynamic that this seems prohibitively slow.

So, I thought, should I be doing this in two steps?  Set up the key to  
be the store fields and whatever else I might want to query on (month  
or whatever timeframe), and return the doc IDs as the values from the  
first query?  I would then send in a complex key to do the filtering.  
This would require waiting for the _bulk_get functionality, and I'd  
send that list of IDs into a 2nd query to get the raw data to feed  
to 'map'.

This is slow now on 12K docs... It needs to be stupid-fast at that low  
number of docs, because the plan is for *way* more data.

The filtering part is tailor-made for an RDBMS, but the doc handling  
(all the 'raw' fields will differ store by store and industry by  
industry, change over time, and in general be free-form) is perfect  
for CouchDB.  Thoughts?  I want to use the right tool for the job, and  
that's looking like an RDBMS, sadly.  That is, unless I'm completely  
misusing Couch.  In which case, swift blows to the head are welcome.

Cheers,
BA



Re: flexible filtering needed, with speed.

Posted by Ralf Nieuwenhuijsen <ra...@gmail.com>.
Don't take Futon as a speed measure; it may also be slowed down by
the rendering if your documents are big (there is a lot of stuff
going on client-side as well).

The truth is, for any data being searched, people usually only care
about 3-5 different types of search.

You can, of course, go nuts with the indexing and just generate every
possible index you could possibly need.

Here is one of my favorites; this creates an index for every unique field.

function (doc) {
  for (var k in doc) {
    emit([k, 1, doc[k]], doc);
  }
}

You can query it like:

Use startkey=["someField",1,null] and endkey=["someField",2,null]
to get the index for 'someField'.  (null collates before every other
value in CouchDB, so those bounds bracket all the ["someField",1,...]
rows.)

Of course, this baby is going to create a huge index if used with too
many or too-big documents, but I would at least try something like
that.

I use the above view function to make sure I can get the data sorted
however I want.
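For anyone wanting to kick the tires outside CouchDB first, here is that per-field view run against a tiny doc with a stubbed emit() (the stub and the sample doc are mine, purely for illustration; the value emitted is the doc itself):

```javascript
// Stub of CouchDB's emit(), collecting rows for inspection.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// The per-field view: one index entry per top-level field.
function map(doc) {
  for (var k in doc) {
    emit([k, 1, doc[k]], doc);
  }
}

// Abbreviated sample doc; a real one has 'store' and 'raw' too.
map({ _id: "abc", month: 17 });
// rows now holds one entry per field, e.g. key ["month", 1, 17]
```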


Re: flexible filtering needed, with speed.

Posted by Chris Anderson <jc...@grabb.it>.
On Mon, Aug 18, 2008 at 8:31 PM, Brad Anderson <br...@sankatygroup.com> wrote:
> So, I thought, should I be doing this in two steps?  Set up the key to be
> store and whatever else I might want to query on (Month or whatever
> timeframe), and return the doc id's as the values on the original query?  I
> would then send in a complex key to do the filtering.  This would require
> waiting for the _bulk_get functionality, and I'd send that list of ID's into
> a 2nd query to get the raw data to send it to 'map'.
>

You have dynamic queries, so unless you can get CouchDB to build one
index that works for all of them... maybe try emitting each key/value
pair from your documents, like:

function (doc) {
  var key, keys = [];
  for (key in doc) keys.push(key);
  for (key in doc) {
    emit([key, doc[key]], keys);
  }
}

This will give you a sorted index of every key/value pair you deal
with. So, for example, if the user is searching by "city", the
document would show up under the key ["city", "Atlanta"].

By listing all the document's keys in the value, you can do your
duck-typing from the view value alone without having to load the
document. Only request the documents which have all the necessary
keys.
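A sketch of that client-side check, assuming rows come back from the view above with each doc's key list as the value (idsWithAllKeys is a hypothetical helper name, not a CouchDB API):

```javascript
// Keep only the ids of docs whose key list covers every field
// the current query needs; each row is { id, key, value } where
// value is the list of the doc's keys, as emitted by the view.
function idsWithAllKeys(rows, requiredKeys) {
  return rows.filter(function (row) {
    return requiredKeys.every(function (k) {
      return row.value.indexOf(k) !== -1;
    });
  }).map(function (row) { return row.id; });
}
```

Only the surviving ids then need to be fetched as full documents.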

This is a little complex - it's the sort of thing SQL tries to paper
over. So it depends on your application. It might be a better fit for
an RDBMS, but there's a lot that can be expressed with CouchDB.

Hope this is helpful.

Chris




-- 
Chris Anderson
http://jchris.mfdz.com