You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@couchdb.apache.org by Brian Candler <B....@pobox.com> on 2009/12/07 11:34:53 UTC

Reducible checksum?

I am thinking about storing some derived data which is associated with key
ranges of a view. (Example: an image which provides a graphical summary of a
key range).

I would like to determine when it's time to regenerate an image, that is,
when the underlying view has changed within that range.

One thought I had was if I could make a reduce function which was some sort
of checksum of the key/value pairs. Then I could just do a reduce query
across the key range, and see if the reduce value has changed. It would be
like an etag for the range.

Unfortunately, I can't just do something simple like an md5sum across the
range, because couchdb implements a tree of reduces and re-reduces, and may
decide to restructure this tree. I'd like a checksum which is invariant
across all possible reduce trees for the same data.

Something simple would be to XOR all the keys and values together, but
sometimes this would not detect changes which happen to XOR to the same
data.

Perhaps I should md5 each (key,value) pair, and then XOR all those together
in the reduce function.

Since my docs have updated timestamps, maybe I should just take the max() of
the updated timestamp for each doc, together with a count of the docs (so as
to be able to detect deletions)

I just wondered if anyone had already made an elegant solution for this? Or
some completely different way of determining whether a view has changed
between a given startkey and endkey?

Thanks,

Brian.

Re: Reducible checksum?

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Dec 9, 2009 at 1:31 AM, Brian Candler <B....@pobox.com> wrote:
> On Mon, Dec 07, 2009 at 02:27:18PM -0800, Chris Anderson wrote:
>> If there's a generic way to do this, and it
>> is cheap enough, it could be generalized to handle view etags. Your
>> row count + max timestamp trick seems sensible to me, but obviously is
>> not generalizable.
>>
>> Presumably you could avoid hashing the keys and values by leaning on
>> the document._rev. However, that just pushes the problem back a step.
>
> Aha. These days the _rev includes a cryptographic checksum of the document
> contents, correct?
>
> In that case I think all we need is a simple sum, modulo 2^128, of the _rev
> hashes. This is commutative and associative, and very cheap. (An XOR has the
> problem that if the same document is emitted an even number of times, it
> vanishes).
>
> Now, we know that the set of (k,v) pair(s) emitted for any particular doc
> can't change unless you modify the design doc. So we could simply add this
> sum to a hash of the design doc as a final step, to get the etag. You'd also
> have to include the view params in the final hash (e.g. startkey, endkey,
> skip) because it's possible that the same set of docs could appear under
> different parts of the key space, emitting different keys and/or values, and
> you'd want different etags for those.
>
> This leaves one question: I know the view index already contains
> {{key,_id},value}, but does it include _rev? If not, would the extra storage
> overhead be acceptable?

There have been request for _rev in the view rows in the past but
we've always responded with the recommendation to just emit the rev as
part of your value.

If we were to use a solution like this for the view etags we'd want to
hard-code it, but for now you should be able to prototype your
approach with the current API.

>
> The alternative I can see is to take a hash of each {{key,_id},value} and to
> sum those hashes. The trouble is this is more expensive computation-wise,
> especially if the value is large (e.g. when you emit the whole document).
>
> Cheers,
>
> Brian.
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Reducible checksum?

Posted by Brian Candler <B....@pobox.com>.
On Mon, Dec 07, 2009 at 02:27:18PM -0800, Chris Anderson wrote:
> If there's a generic way to do this, and it
> is cheap enough, it could be generalized to handle view etags. Your
> row count + max timestamp trick seems sensible to me, but obviously is
> not generalizable.
> 
> Presumably you could avoid hashing the keys and values by leaning on
> the document._rev. However, that just pushes the problem back a step.

Aha. These days the _rev includes a cryptographic checksum of the document
contents, correct?

In that case I think all we need is a simple sum, modulo 2^128, of the _rev
hashes. This is commutative and associative, and very cheap. (An XOR has the
problem that if the same document is emitted an even number of times, it
vanishes).

Now, we know that the set of (k,v) pair(s) emitted for any particular doc
can't change unless you modify the design doc. So we could simply add this
sum to a hash of the design doc as a final step, to get the etag. You'd also
have to include the view params in the final hash (e.g. startkey, endkey,
skip) because it's possible that the same set of docs could appear under
different parts of the key space, emitting different keys and/or values, and
you'd want different etags for those.

This leaves one question: I know the view index already contains
{{key,_id},value}, but does it include _rev? If not, would the extra storage
overhead be acceptable?

The alternative I can see is to take a hash of each {{key,_id},value} and to
sum those hashes. The trouble is this is more expensive computation-wise,
especially if the value is large (e.g. when you emit the whole document).

Cheers,

Brian.

Re: Reducible checksum?

Posted by Chris Anderson <jc...@apache.org>.
On Mon, Dec 7, 2009 at 2:34 AM, Brian Candler <B....@pobox.com> wrote:
> I am thinking about storing some derived data which is associated with key
> ranges of a view. (Example: an image which provides a graphical summary of a
> key range).
>
> I would like to determine when it's time to regenerate an image, that is,
> when the underlying view has changed within that range.
>
> One thought I had was if I could make a reduce function which was some sort
> of checksum of the key/value pairs. Then I could just do a reduce query
> across the key range, and see if the reduce value has changed. It would be
> like an etag for the range.
>
> Unfortunately, I can't just do something simple like an md5sum across the
> range, because couchdb implements a tree of reduces and re-reduces, and may
> decide to restructure this tree. I'd like a checksum which is invariant
> across all possible reduce trees for the same data.
>
> Something simple would be to XOR all the keys and values together, but
> sometimes this would not detect changes which happen to XOR to the same
> data.
>
> Perhaps I should md5 each (key,value) pair, and then XOR all those together
> in the reduce function.
>
> Since my docs have updated timestamps, maybe I should just take the max() of
> the updated timestamp for each doc, together with a count of the docs (so as
> to be able to detect deletions)

This is a great question. If there's a generic way to do this, and it
is cheap enough, it could be generalized to handle view etags. Your
row count + max timestamp trick seems sensible to me, but obviously is
not generalizable.

Presumably you could avoid hashing the keys and values by leaning on
the document._rev. However, that just pushes the problem back a step.

What we need for a general solution is a commutative and associated
checksum function, which would be a funny beast indeed.

Chris

>
> I just wondered if anyone had already made an elegant solution for this? Or
> some completely different way of determining whether a view has changed
> between a given startkey and endkey?
>
> Thanks,
>
> Brian.
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io