Posted to user@couchdb.apache.org by David King <dk...@ketralnis.com> on 2008/06/28 08:03:04 UTC

General-understanding questions about views

I'm trying to gain a fundamental understanding of views and indexed  
data. If this is documented in a FAQ, please direct me there instead :)

In trying to map my understanding from SQL, it appears that the answer  
to quickly querying data is by pre-calculating query result-sets and  
storing them in tables, called views. A view is a table populated by a  
function that runs against every object that is written or modified in  
the database.

1. How would you implement a query against a value that changes after  
the view is populated, like the current time? That is, if I wanted  
things younger than a week, a permanent view like this:

function(doc) {
	if(doc.date > now() - timeinterval('1 week')) {
		emit(null,doc);
	}
}

(date-syntax liberally made up) the results of that query, if  
populated when the data is changed, would quickly be invalid, because  
now() has changed. Is this accurate? How would you performantly run a  
query like this?

2. Same question for a permanent view containing the youngest 10 items  
(this one might be easier)?

3. The wiki doesn't mention parameterised views. So if I have a  
document with an 'author' field, and I want a view such that I can see  
everything that a given author wrote, do I need a view per author?  
Given thousands of authors, what is the performance cost for running a  
document through a few thousand author-functions?

4. I know that the distribution bits are still being fleshed out, but  
is it the intention that eventually views can be stored or calculated  
on a separate server from the data (since they are implemented as  
tables)?

Re: General-understanding questions about views

Posted by David King <dk...@ketralnis.com>.

> Now you're getting to the technical part. This quote from Damien is
> the best I can do for you:
> http://damienkatz.net/2008/02/incremental_map_1.html
> ... in this design, the reductions happen at index-update time, and
> the reductions are stored directly inside the inner nodes of the view
> b+tree index. Then at query time, the intermediate results are reduced
> to their final result. The number of reductions that happen at query
> time are logarithmic with respect to the number of matching
> key/values.

So for modifications and deletions, the map results are changed, the  
tree of intermediate values is partially dirty, and the reduction only  
has to be partially re-done.

Very cool.
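
A minimal sketch of that idea, assuming a plain binary tree rather than
CouchDB's actual b+tree (the SumTree class and its recomputed counter are
hypothetical names made up for illustration). Each inner node caches the
sum of everything below it, so changing one leaf only re-reduces the nodes
on the path from that leaf to the root:

```javascript
// Hypothetical sketch, not CouchDB's implementation: a binary tree whose
// inner nodes cache the reduction (here: a sum) of everything below them.
class SumTree {
  constructor(values) {
    // Round the leaf count up to a power of two so the tree is complete.
    this.size = 1;
    while (this.size < values.length) this.size *= 2;
    this.nodes = new Array(2 * this.size).fill(0);
    values.forEach((v, i) => { this.nodes[this.size + i] = v; });
    // Build the cached reductions bottom-up, once.
    for (let i = this.size - 1; i >= 1; i--) {
      this.nodes[i] = this.nodes[2 * i] + this.nodes[2 * i + 1];
    }
    this.recomputed = 0; // counts inner nodes re-reduced by updates
  }

  // Change one leaf and re-reduce only its ancestors.
  update(index, value) {
    let i = this.size + index;
    this.nodes[i] = value;
    for (i = Math.floor(i / 2); i >= 1; i = Math.floor(i / 2)) {
      this.nodes[i] = this.nodes[2 * i] + this.nodes[2 * i + 1];
      this.recomputed++;
    }
  }

  // The root holds the reduction over all leaves.
  total() {
    return this.nodes[1];
  }
}

const amounts = [10, 20, 30, 40, 50, 60, 70, 80];
const tree = new SumTree(amounts);
tree.update(2, 0); // "delete" the third purchase by zeroing it
console.log(tree.total(), tree.recomputed); // prints: 330 3
```

With eight leaves, one change touches three inner nodes instead of all
eight values, which is the logarithmic behaviour the quote describes.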

Re: General-understanding questions about views

Posted by Chris Anderson <jc...@grabb.it>.

On Sat, Jun 28, 2008 at 9:17 PM, David King <dk...@ketralnis.com> wrote:
> But in the example I gave (that I got from the CouchDB wiki), there's no way
> for the reduction to be accurate in the face of deletions and modifications
> without re-calculating it for every single item in the database. That is,
> given a database of 5 million rows, every time one is modified or deleted
> (or possibly even added, depending on implementation), all 5 million rows
> have to pass through that function
>

Now you're getting to the technical part. This quote from Damien is
the best I can do for you:

http://damienkatz.net/2008/02/incremental_map_1.html

... in this design, the reductions happen at index-update time, and
the reductions are stored directly inside the inner nodes of the view
b+tree index. Then at query time, the intermediate results are reduced
to their final result. The number of reductions that happen at query
time are logarithmic with respect to the number of matching
key/values.

-- 
Chris Anderson
http://jchris.mfdz.com

Re: General-understanding questions about views

Posted by David King <dk...@ketralnis.com>.

>> How would that total be updated if something were deleted or updated?
>> Does the sum() function have to be evaluated over doc.Amount of *every*
>> doc on every update?
> I don't grok the internals well enough to say exactly why, but I do
> know that one of the main features of CouchDB is the fact that only
> the changed documents (and a minimum of aggregation "re-reductions")
> need to be recomputed when an already mapped doc is updated or
> deleted.

But in the example I gave (that I got from the CouchDB wiki), there's  
no way for the reduction to be accurate in the face of deletions and  
modifications without re-calculating it for every single item in the  
database. That is, given a database of 5 million rows, every time one  
is modified or deleted (or possibly even added, depending on  
implementation), all 5 million rows have to pass through that function.

Is there another way to handle modifications and deletions?

Re: General-understanding questions about views

Posted by Chris Anderson <jc...@grabb.it>.

On Sat, Jun 28, 2008 at 8:35 PM, David King <dk...@ketralnis.com> wrote:
>
> How would that total be updated if something were deleted or updated? Does
> the sum() function have to be evaluated over doc.Amount of *every* doc on
> every update?
>

I don't grok the internals well enough to say exactly why, but I do
know that one of the main features of CouchDB is the fact that only
the changed documents (and a minimum of aggregation "re-reductions")
need to be recomputed when an already mapped doc is updated or
deleted.

Also, be aware that the views aren't computed on update, but rather on
the next query after an update. If you have thousands of inserts,
updates, or deletions between view queries, the next query can take
some time to return.
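
A rough sketch of that behaviour, under stated assumptions: LazyView is a
hypothetical in-memory stand-in, and its map function receives emit as a
second argument instead of using the global emit that the view server
provides. Writes only queue changes; the next query pays for indexing them:

```javascript
// Hypothetical sketch: an index that is brought up to date on read, not on write.
class LazyView {
  constructor(mapFn) {
    this.mapFn = mapFn;
    this.rowsByDoc = new Map(); // doc id -> rows emitted for that doc
    this.pending = [];          // changes not yet indexed
  }

  // Writes are cheap: they only record that a doc changed.
  save(doc) { this.pending.push({ id: doc._id, doc }); }
  remove(id) { this.pending.push({ id, doc: null }); }

  // The first query after a batch of writes pays for indexing them.
  query() {
    for (const { id, doc } of this.pending) {
      this.rowsByDoc.delete(id); // drop stale rows for this doc
      if (doc !== null) {
        const rows = [];
        this.mapFn(doc, (key, value) => rows.push({ id, key, value }));
        this.rowsByDoc.set(id, rows);
      }
    }
    const indexed = this.pending.length;
    this.pending = [];
    const rows = [...this.rowsByDoc.values()].flat();
    rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
    return { indexed, rows };
  }
}

const view = new LazyView((doc, emit) => emit(doc.author, null));
view.save({ _id: "a", author: "jan" });
view.save({ _id: "b", author: "chris" });
view.remove("a");
const first = view.query();  // indexes all three pending changes
const second = view.query(); // nothing pending, so nothing to index
console.log(first.indexed, second.indexed);
```

Only the documents that changed since the last query are passed through
the map function again; everything else stays as it was indexed.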

Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: General-understanding questions about views

Posted by David King <dk...@ketralnis.com>.

> Feel free to send in more questions as they come :-)

Okay, I have another one :)

How are views with a reduce function affected by deletes and updates?  
Given the example (<http://wiki.apache.org/couchdb/HttpViewApi>) of:

     "total_purchases": {
       "map": "function(doc) { if (doc.Type == 'purchase') emit(doc.Customer, doc.Amount) }",
       "reduce": "function(keys, values) { return sum(values) }"
     }

How would that total be updated if something were deleted or updated?  
Does the sum() function have to be evaluated over doc.Amount of  
*every* doc on every update?
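
For reference, here is a small sketch of what that view computes, run
outside CouchDB. The map and reduce bodies are the ones from the wiki
example; emit and sum are local stand-ins for what the view server
supplies, the documents are made up, and the keys argument is simplified
to a plain list of keys:

```javascript
// Stand-ins for the helpers the view server provides to view functions.
const rows = [];
function emit(key, value) { rows.push({ key, value }); }
function sum(values) { return values.reduce((a, b) => a + b, 0); }

// The map and reduce functions from the wiki example.
function map(doc) { if (doc.Type == 'purchase') emit(doc.Customer, doc.Amount); }
function reduce(keys, values) { return sum(values); }

// Made-up documents for illustration.
const docs = [
  { Type: 'purchase', Customer: 'alice', Amount: 10 },
  { Type: 'purchase', Customer: 'bob', Amount: 5 },
  { Type: 'purchase', Customer: 'alice', Amount: 7 },
  { Type: 'refund', Customer: 'alice', Amount: 3 }, // ignored by map
];
docs.forEach(map);

// Group the emitted rows by key, then reduce each group.
const groups = new Map();
for (const { key, value } of rows) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}
const totals = {};
for (const [key, values] of groups) {
  totals[key] = reduce(values.map(() => key), values);
}
console.log(totals); // prints: { alice: 17, bob: 5 }
```

This naive version re-reduces every row on each run, which is exactly the
cost the question is asking about.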

Re: General-understanding questions about views

Posted by Chris Anderson <jc...@grabb.it>.

> I'm an Erlang developer. Any tips on where to start reading if I plan to
> begin to understand the code-base?

As someone new to Erlang I've started my explorations in
couch_httpd.erl (because it maps transparently to the HTTP API that
I'm familiar with.)


-- 
Chris Anderson
http://jchris.mfdz.com

Re: General-understanding questions about views

Posted by David King <dk...@ketralnis.com>.

> Here we have to tackle the first issue: Do not try to map what you know
> from SQL to CouchDB.

Maybe I just worded it poorly. How about "I'm trying to understand how  
couchdb would solve some problems that I'm currently solving with SQL  
and come up with some intermediately mappable vocabulary so that I  
can gloss over the bits that I don't understand until I do so that I  
can tackle one concept at a time". But that's much longer :)

> Your map functions must return the same result for the same input  
> [...]
> So what you would do here instead, is:
> function(doc) {
>  emit(doc.date, null);
> }
> and query with /db/_view/date/name?startkey=timestamp_from_interval('1 week')&endkey=now()

Ah, very cool. I was misunderstanding the key, which answers almost  
all of my questions at once.

> Same thing. I note that you explicitly mention permanent views. Do not
> use temporary views in production, only during development.

That answers my next question too.

> Feel free to send in more questions as they come :-)

Oh I will :)

I'm an Erlang developer. Any tips on where to start reading if I plan  
to begin to understand the code-base?

Re: General-understanding questions about views

Posted by Jan Lehnardt <ja...@apache.org>.

On Jun 28, 2008, at 08:03, David King wrote:

> I'm trying to gain a fundamental understanding of views and indexed  
> data. If this is documented in a FAQ, please direct me there  
> instead :)
>
> In trying to map my understanding from SQL,

Here we have to tackle the first issue: Do not try to map what you know
from SQL to CouchDB. Try to understand independently how CouchDB
works and then try to apply your problems to it. A translation will not
work and possibly leave you thinking CouchDB is crap because it is not
an RDBMS which is surely not the case. On the other hand, it might
be perfectly possible that CouchDB is not the right tool for your job,
but it is certainly cool that you are checking it out :)

> it appears that the answer to quickly querying data is by pre- 
> calculating query result-sets and storing them in tables, called  
> views. A view is a table populated by a function that runs against  
> every object that is written or modified in the database.
>
> 1. How would you implement a query against a value that changes  
> after the view is populated, like the current time? That is, if I  
> wanted things younger than a week, a permanent view like this:
>
> function(doc) {
> 	if(doc.date > now() - timeinterval('1 week')) {
> 		emit(null,doc);
> 	}
> }
> (date-syntax liberally made up) the results of that query, if  
> populated when the data is changed, would quickly be invalid,  
> because now() has changed. Is this accurate? How would you  
> performantly run a query like this?

Your map functions must return the same result for the same input, so
things like now() cannot be used. And you usually don't. The most
interesting feature of the result set (or table, as you call it) of the
map function is that the 'first column', the 'key', can be used for fast
lookups. So what you would do here instead is:

function(doc) {
   emit(doc.date, null);
}

and query with /db/_view/date/name?startkey=timestamp_from_interval('1 week')&endkey=now()

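Looking this up is fast: the view index is a b+tree sorted by key, so
finding the start of the range takes logarithmic time rather than a scan
of every document.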
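
To see why a key-sorted index makes the one-week query cheap, here is a
sketch over an in-memory sorted array. It is an illustration under
assumptions, not how CouchDB stores views: dates are taken to be
milliseconds since the epoch, the ids are made up, "now" is fixed so the
example is repeatable, and lowerBound and range are hypothetical helpers:

```javascript
// Hypothetical sketch: rows emitted by emit(doc.date, null), kept sorted by key.
// Dates are assumed to be stored as milliseconds since the epoch.
const DAY = 24 * 60 * 60 * 1000;
const now = Date.UTC(2008, 5, 28); // fixed "now" (28 June 2008)

const index = [
  { key: now - 30 * DAY, id: 'last-month-post' },
  { key: now - 10 * DAY, id: 'ten-days-post' },
  { key: now - 6 * DAY, id: 'recent-post' },
  { key: now - 1 * DAY, id: 'yesterday-post' },
]; // already sorted by key

// Binary search: index of the first row whose key is >= target.
function lowerBound(rows, target) {
  let lo = 0, hi = rows.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (rows[mid].key < target) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Rows with startkey <= key <= endkey, found without scanning the whole index.
function range(rows, startkey, endkey) {
  const from = lowerBound(rows, startkey);
  let to = from;
  while (to < rows.length && rows[to].key <= endkey) to++;
  return rows.slice(from, to);
}

const lastWeek = range(index, now - 7 * DAY, now).map((row) => row.id);
console.log(lastWeek); // prints: [ 'recent-post', 'yesterday-post' ]
```

The index itself never depends on the current time; only the startkey
and endkey supplied at query time do.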

> 2. Same question for a permanent view containing the youngest 10  
> items (this one might be easier)?

Same thing. I note that you explicitly mention permanent views. Do not
use temporary views in production, only during development.

> 3. The wiki doesn't mention parameterised views. So if I have a  
> document with an 'author' field, and I want a view such that I can  
> see everything that a given author wrote, do I need a view per  
> author? Given thousands of authors, what is the performance cost for  
> running a document through a few thousand author-functions?

Same as above:

function(doc) {
   emit(doc.author, null);
}

GET /db/_view/authors/name?key=authorname

One view, extremely fast lookups.
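
A sketch of building that request for any author. The viewUrl helper is
hypothetical, and it assumes that key parameters are sent as JSON, so a
string key is quoted and URL-encoded; check the HTTP view API page for
your CouchDB version before relying on this:

```javascript
// Hypothetical helper: builds a view query URL of the shape used in this thread.
// Assumption: key parameters are JSON values, so a string key keeps its quotes.
function viewUrl(db, design, view, params) {
  const query = Object.entries(params)
    .map(([name, value]) => `${name}=${encodeURIComponent(JSON.stringify(value))}`)
    .join('&');
  return `/${db}/_view/${design}/${view}?${query}`;
}

// One view serves every author; only the key parameter changes per request.
const url = viewUrl('db', 'authors', 'name', { key: 'David King' });
console.log(url); // prints: /db/_view/authors/name?key=%22David%20King%22
```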

> 4. I know that the distribution bits are still being fleshed out,  
> but is it the intention that eventually views can be stored or  
> calculated on a separate server from the data (since they are  
> implemented as tables)?

Not sure what you mean by 'since they are implemented as tables', but
maybe that is just the SQL-lingua that is confusing me. We don't have
tables (things might look like them, though). But yes, eventually, you
will be able to distribute view creation. We haven't gotten around to
that yet.

Feel free to send in more questions as they come :-)

Cheers
Jan
--