Posted to user@couchdb.apache.org by Brian Candler <B....@pobox.com> on 2009/01/30 11:18:59 UTC

Incremental map/reduce

I'm trying to understand how couchdb does incremental updates to its
map/reduce views. By this, I understand that it only has to read those
documents which have changed since the view was last updated.

I have an idea how it might work, so let me pose it as an example. Here's a
database with 6 documents and a very simple summing map/reduce function.

-------------------------------------------------------------
#!/bin/sh
set -xe

HOST1=http://localhost:5984
LOCAL1=maptest
DB1="$HOST1/$LOCAL1"

curl -X DELETE "$DB1"; echo
curl -X PUT "$DB1"; echo
curl -sX PUT -d '{"values":[1,2]}' "$DB1/doc1"
curl -sX PUT -d '{"values":[]}' "$DB1/doc2"
curl -sX PUT -d '{"values":[3,4,5]}' "$DB1/doc3"
curl -sX PUT -d '{"values":[6,7]}' "$DB1/doc4"
curl -sX PUT -d '{"values":[8]}' "$DB1/doc5"
curl -sX PUT -d '{"values":[9]}' "$DB1/doc6"
curl -sX PUT -T - "$DB1/_design/test" <<EOS
{
  "views": {
    "sum": {
      "map": "function(doc) { if (doc.values) { for (var i in doc.values) { emit(null,doc.values[i]) } } }",
      "reduce": "function(key, values) { return sum(values) }"
    }
  }
}
EOS

echo -e "\n\nReading view..."
curl "$DB1/_view/test/sum"
-------------------------------------------------------------

I am guessing there must be an upper bound to the number of keys and values
passed at once to the reduce function, so let me assume for now that the
limit is 3. Then the process is:

           doc1       doc3     doc4  doc5 doc6
MAP         / \     /  |  \     / \    |   |
           1   2   3   4   5   6   7   8   9
REDUCE      \  |  /     \  |  /     \  |  /
               6          15          24
REDUCE          `--------. | ,--------'
                          45
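
For illustration, that hypothetical process can be sketched in a few lines
(the block size of 3 and the function below are my own guesses at the
mechanism, not CouchDB's actual code):

```python
# Purely illustrative sketch of the hypothesised scheme above: reduce
# the mapped values in blocks of at most 3, then keep re-reducing the
# partial sums until a single value remains.
def reduce_in_blocks(values, block=3):
    while len(values) > 1:
        values = [sum(values[i:i + block])
                  for i in range(0, len(values), block)]
    return values[0] if values else 0

mapped = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # emitted values from doc1..doc6
# first pass over blocks of 3 gives the intermediate sums [6, 15, 24]
first_pass = [sum(mapped[i:i + 3]) for i in range(0, len(mapped), 3)]
total = reduce_in_blocks(mapped)        # re-reduces [6, 15, 24] to 45
```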

Now, suppose I delete doc5. In order to update this sum incrementally, I
think I would have to:
(1) Regenerate the map block(s) which included doc5
    - so this would now be [7,9] instead of [7,8,9]
(2) Reduce those block(s) - giving 16 instead of 24
(3) Down the reduction tree, re-reduce those blocks which depend on that
    input

Is that an accurate description of the process?

In order to do this, couchdb would have to remember all the intermediate
reduce values, and also know that the intermediate value 24 was derived from
doc4, doc5 and doc6. Furthermore, it would have to re-map doc4 and doc6 to
generate the new value, even though they had not changed.

Alternatively, it could keep all the map results materialised, each tagged
with the docid where it came from, so it could simply remove the 8 from
the map when doc5 is deleted. Is that what it does? (If so, I think it could
be useful to expose that in the view, so that a single view could be used
as both map and map+reduce)
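
A sketch of that alternative (purely hypothetical; the names and data model
are invented for illustration): keep the emitted values materialised per
docid, so a deletion only drops that document's rows:

```python
# Hypothetical model of materialised map output, keyed by docid, so a
# deletion removes only that document's rows -- doc4 and doc6 are never
# re-mapped.
rows_by_doc = {}

def map_doc(docid, doc):
    rows_by_doc[docid] = list(doc.get("values", []))

def delete_doc(docid):
    rows_by_doc.pop(docid, None)    # neighbouring docs stay untouched

def reduce_view():
    return sum(v for rows in rows_by_doc.values() for v in rows)

for i, vals in enumerate([[1, 2], [], [3, 4, 5], [6, 7], [8], [9]], 1):
    map_doc("doc%d" % i, {"values": vals})
before = reduce_view()              # 45
delete_doc("doc5")
after = reduce_view()               # 37, without touching doc4 or doc6
```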

Now, the other thing which I don't understand is group_level.

I can understand that emit gives (key,value) pairs, so group=true gives me
the intermediate reduction values for rows with identical keys. For example,
if I change the above map function to

      "map": "function(doc) { if (doc.values) { for (var i in doc.values) { emit(doc._id,doc.values[i]) } } }",

this gives the result I expect, and I imagine that the reduction process
is working something like this:

           doc1       doc3     doc4  doc5 doc6
MAP         / \     /  |  \     / \    |   |
           1   2   3   4   5   6   7   8   9
REDUCE      \ /     \  |  /     \ /    |   |
             3        12        13     8   9   ==> group=true
REDUCE        `------. | ,-----'        \ /
                      28                17
REDUCE                  `------. ,-----'
                               45              ==> group=false

Is that right? But in that case, what does group_level=2 do? Is this
something which comes into play if emitted keys are structured as arrays?

Anyway, sorry for the long E-mail. I'm just trying to get all this clear in
my head :-)

Many thanks,

Brian.

Re: Incremental map/reduce

Posted by Jan Lehnardt <ja...@apache.org>.
We have a vote running on dev@, please join there :)

Cheers
Jan
--


On 31 Jan 2009, at 18:56, Brian Candler wrote:

> [...]


Re: Incremental map/reduce

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jan 30, 2009 at 08:56:25PM +0530, Niket Patel wrote:
> http://couchdb.markmail.org/message/hn6k7qxc62r2whtf
> It has scripts in Python and Ruby to get pretty output from curl
>
> and the \n at the end of the response was also discussed aggressively
> before :-)

OK, I see that in the thread you reference.

Noah wrote:

| This proposal could go two ways:
| 
| * CouchDB consistently omits the trailing newline from text files
| * CouchDB consistently adds the trailing newline to text files
| 
| Where "file" is a reference to the HTTP response entity-body with a media
| type of application/json, which is in turn a text based format.

I think there is a third option: Couchdb consistently adds a newline to the
end of any JSON-encoded text which *it* generates.

Arguments about cryptographic signatures are moot: Couchdb does not store
the JSON document it receives, byte-for-byte as-is. It parses it into some
internal data structure, and then encodes that data structure back into
JSON. White-space is not preserved; and an example was posted a few days ago
showing a unicode character being converted into its \uxxxx form.
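
That round trip is easy to see with any JSON library; for instance, with
Python's json module (a stand-in here, not what CouchDB actually uses):

```python
# Stand-in demonstration: a decode/encode round trip, like CouchDB's
# parse-and-re-serialise cycle, normalises whitespace and escapes
# non-ASCII characters into their \uxxxx form.
import json

original = '{ "name" : "café" }'
roundtrip = json.dumps(json.loads(original))
# the spacing is normalised and the accented character comes back as a
# \uxxxx escape, so byte-for-byte identity with the input is lost
```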

The thread ended with Chris Anderson writing:

| I'm not absolutely certain of the merits in this case, but as long as we're
| consistent I don't see the harm in adding newlines to JSON output. There's
| nothing in the JSON RFC that says we can't.
| 
| I've added a ticket and a patch for when we decide which way to go:
| https://issues.apache.org/jira/browse/COUCHDB-107 

and JIRA remains marked "Please discuss (again, sorry)"

So, for what it's worth, I add my +1 to adding a trailing newline to
Couchdb-generated JSON.

Regards,

Brian.

Re: Incremental map/reduce

Posted by Niket Patel <ne...@me.com>.
Brian,

http://couchdb.markmail.org/message/hn6k7qxc62r2whtf
It has scripts in Python and Ruby to get pretty output from curl

and the \n at the end of the response was also discussed aggressively
before :-)
Things should not be changed just for nice curl output, but that is a
personal preference.

- Niket

On Jan 30, 2009, at 8:39 PM, Brian Candler wrote:

> [...]


Re: Incremental map/reduce

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jan 30, 2009 at 03:17:58PM +0100, Patrick Antivackis wrote:
> normally in your view if you add a reduce=false parameter you should only
> get the map

That's exactly what I was thinking of, and I have no idea why I overlooked
this parameter on the wiki :-) Thanks for putting me straight.

I have one, teeny suggestion. I notice that couchdb does insert newlines in
its JSON sometimes, e.g.

$ curl 'http://127.0.0.1:5984/maptest/_view/test/sum?reduce=false'
{"total_rows":9,"offset":0,"rows":[
{"id":"doc1","key":"doc1","value":1},
{"id":"doc1","key":"doc1","value":2},
{"id":"doc3","key":"doc3","value":3},
{"id":"doc3","key":"doc3","value":4},
{"id":"doc3","key":"doc3","value":5},
{"id":"doc4","key":"doc4","value":6},
{"id":"doc4","key":"doc4","value":7},
{"id":"doc5","key":"doc5","value":8},
{"id":"doc6","key":"doc6","value":9}
]}$ 

But not in others:

$ curl 'http://127.0.0.1:5984/maptest/_view/test/sum?group=true'
{"rows":[{"key":"doc1","value":3},{"key":"doc3","value":12},{"key":"doc4","value":13},{"key":"doc5","value":8},{"key":"doc6","value":9}]}$ 

What I'd really like is for it to send a single newline at the end of the
entire response, so that curl output terminates nicely with my shell prompt
on a new line. Would this be considered an excessive waste of bandwidth? :-)

Cheers,

Brian.

Re: Incremental map/reduce

Posted by Patrick Antivackis <pa...@gmail.com>.
Hello Brian,
normally in your view if you add a reduce=false parameter you should only
get the map

2009/1/30 Brian Candler <B....@pobox.com>

> [...]

Re: Incremental map/reduce

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jan 30, 2009 at 11:34:23AM +0100, Jan Lehnardt wrote:
> Hi Brian,
>
> http://damienkatz.net/2008/02/incremental_map.html
> http://damienkatz.net/2008/02/incremental_map_1.html
> http://horicky.blogspot.com/2008/10/couchdb-implementation.html

Thank you.

According to the last of these, "The intermediate objects of the map() and
the reduce() is stored in the view indexes."

So potentially it would be possible to query just the map part of a
map/reduce view? That could be useful. A few times now I've had to remove
the reduce function from a view just to check whether the map was doing what
I expected :-)

Regards,

Brian.

Re: Incremental map/reduce

Posted by Jan Lehnardt <ja...@apache.org>.
Hi Brian,

http://damienkatz.net/2008/02/incremental_map.html
http://damienkatz.net/2008/02/incremental_map_1.html
http://horicky.blogspot.com/2008/10/couchdb-implementation.html

Cheers
Jan
--
On 30 Jan 2009, at 11:18, Brian Candler wrote:

> [...]


Re: Incremental map/reduce

Posted by Brian Candler <B....@pobox.com>.
On Sat, Jan 31, 2009 at 01:42:54PM -0800, Chris Anderson wrote:
> It's how it works today. The reason we see a small cost with each
> reduce query is that the intermediate reduction values are cached
> according to the btree structure, instead of according to the query
> params. So unless your range happens to match exactly the keys
> underneath a given inner node (and probably at this point even if it
> does) you'll end up running at least one javascript reduction per
> reduce query.

... and a group=true or group_level=N query basically generates a load of
individual group=false queries across different key ranges, so may take O(n)
in the number of groups returned.

I'm much happier about this now. Thank you!

Cheers,

Brian.

Re: Incremental map/reduce

Posted by Chris Anderson <jc...@apache.org>.
On Sat, Jan 31, 2009 at 1:00 PM, Brian Candler <B....@pobox.com> wrote:
> [...]
>
> Is storing the reductions a planned future feature, rather than describing
> how it works today?

It's how it works today. The reason we see a small cost with each
reduce query is that the intermediate reduction values are cached
according to the btree structure, instead of according to the query
params. So unless your range happens to match exactly the keys
underneath a given inner node (and probably at this point even if it
does) you'll end up running at least one javascript reduction per
reduce query.
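
That caching scheme can be sketched in miniature (a toy model, not the
actual btree code; the three-leaf nodes, the counter, and the sum-reduce are
invented for illustration):

```python
# Toy model of btree-cached reductions: each inner node caches the sum
# of its leaves; a range query reuses cached values for fully covered
# nodes and must re-reduce any partially covered edge node.
NODE_SIZE = 3
leaves = [("doc1", 1), ("doc1", 2), ("doc3", 3), ("doc3", 4), ("doc3", 5),
          ("doc4", 6), ("doc4", 7), ("doc5", 8), ("doc6", 9)]
nodes = [leaves[i:i + NODE_SIZE] for i in range(0, len(leaves), NODE_SIZE)]
cached = [sum(v for _, v in node) for node in nodes]     # [6, 15, 24]

def range_reduce(startkey, endkey):
    parts, fresh = [], 0        # fresh = reductions run at query time
    for node, cache in zip(nodes, cached):
        inside = [v for k, v in node if startkey <= k <= endkey]
        if len(inside) == len(node):
            parts.append(cache)          # whole node in range: cache hit
        elif inside:
            parts.append(sum(inside))    # edge node: must re-reduce
            fresh += 1
    return sum(parts), fresh
```

This reproduces the earlier query results in the thread: the full range
gives 45 from cached values alone, while startkey="doc3"&endkey="doc5"
gives 33 but has to re-reduce both partially covered edge nodes.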

Thanks for looking after the wiki!

Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Incremental map/reduce

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jan 30, 2009 at 10:32:15AM -0800, Chris Anderson wrote:
> Once you understand how normal reduce queries (with group=false) work,
> eg: those that return a single reduction value for whatever key-range
> you specify, group_level queries are not more complex. Group_level
> queries are essentially a macro, which run one normal (group=false)
> reduce query automatically for each interval on a set of intervals as
> defined by the level.

Ah - it was new to me that map/reduce queries with group=false could run
over arbitrary key intervals:

$ kurl 'http://localhost:5984/maptest/_view/test/sum'
{"rows":[{"key":null,"value":45}]}
$ kurl 'http://localhost:5984/maptest/_view/test/sum?startkey="doc3"'
{"rows":[{"key":null,"value":42}]}
$ kurl 'http://localhost:5984/maptest/_view/test/sum?startkey="doc3"&endkey="doc5"'
{"rows":[{"key":null,"value":33}]}

This means that couchdb *must* be performing the reduce part of the query
on-demand, as opposed to keeping precomputed values stored like the map
part.

In SQL terms, this is like "count(*)" doing an index scan, rather than
having the answer precomputed in a materialised view. And suddenly the
various forms of reduce make much more sense.

However at http://damienkatz.net/2008/02/incremental_map.html it says:

"This requirement of reduce functions allows CouchDB to store off
intermediated reductions directly into inner nodes of btree indexes, and the
view index updates and retrievals will have logarithmic cost. It also allows
the indexes to be spread across machines and reduced at query time with
logarithmic cost."

Is storing the reductions a planned future feature, rather than describing
how it works today?

Sorry to keep asking, but I'm going to try to update the wiki in as accurate
a way as I can.

Thanks,

Brian.

Re: Incremental map/reduce

Posted by Chris Anderson <jc...@apache.org>.
On Fri, Jan 30, 2009 at 2:18 AM, Brian Candler <B....@pobox.com> wrote:
> Now, the other thing which I don't understand is group_level.

I think you're understanding most of this well enough that the other
links will get you the rest of the way there. I'll respond on
group_level because it is typically the most confusing (or at least it
was to me) until you realize how overwhelmingly simple it is.

Once you understand how normal reduce queries (with group=false) work,
eg: those that return a single reduction value for whatever key-range
you specify, group_level queries are not more complex. Group_level
queries are essentially a macro, which run one normal (group=false)
reduce query automatically for each interval on a set of intervals as
defined by the level.

So with group_level=1, and keys like

["a",1,1]
["a",3,4]
["a",3,8]
["b",2,6]
["b",2,6]
["c",1,5]
["c",4,2]

CouchDB will internally run 3 reduce queries for you. One that reduces
all rows where the first element of the key = "a", one for "b", and
one for "c".

If you were to query with group_level=2, you'd get a reduce query run
for each unique set of keys (according to their first two elements),
eg ["a",1], ["a",3], ["b",2], ["c",1], ["c",4]

group=true is the conceptual equivalent of group_level=exact , so
CouchDB runs a reduce per unique key in the map row set.
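
That macro behaviour can be sketched in miniature (Python purely for
illustration; the row-counting reduce and the helper name are invented, and
real output depends on the view's own reduce function):

```python
# Illustrative sketch of group_level as a macro: bucket the sorted map
# rows by a key prefix and run one reduction (here: a row count) per
# bucket.  groupby requires the rows to arrive sorted by key, which is
# exactly how a view index hands them over.
from itertools import groupby

rows = [["a", 1, 1], ["a", 3, 4], ["a", 3, 8],
        ["b", 2, 6], ["b", 2, 6], ["c", 1, 5], ["c", 4, 2]]

def group_reduce(rows, level=None):
    # level=None stands in for group=true, i.e. group_level=exact
    prefix = (lambda k: k) if level is None else (lambda k: k[:level])
    return [(key, len(list(grp)))            # reduce = count of rows
            for key, grp in groupby(rows, key=prefix)]
```

With level=1 this runs one reduction each for the "a", "b" and "c" buckets;
with level=2 one per two-element prefix, matching the five buckets above.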

I find that thinking of group_level and group=true as macros, which
are just running a series of group=false queries internally, clarifies
understanding and expectations about these features. For instance, in
a group=true query, since we see that Couch will run a reduce query
per unique key (in the overall key range, as specified by start and
end keys), we can expect the cost to be O(n) where n is the number of
unique keys in the range. I used to be surprised when group=true
queries were "slow" but now that I understand the mechanism it's hard
to see how they couldn't be.

In the future we may cache the final reduction values for group_level
queries in another index, which could speed up these queries when they
are run a second time, as well as potentially allowing for
sort-by-value queries to be done more efficiently. Then you'll be able
to ask what the most popular tags in a corpus are. Currently queries
like that need to be done by running group=true, then sorting on the
client.

Hope that helps!

-- 
Chris Anderson
http://jchris.mfdz.com