Posted to user@couchdb.apache.org by ara howard <ar...@gmail.com> on 2008/11/13 18:01:02 UTC

dirty reads - update strategies

what are people's strategies for dealing with the following scenario

doc_a = get 'id_a'

doc_b = get 'id_b'

obj_c = { 'sum' : doc_a.x + doc_b.y }

put obj_c


this kind of thing is tricky even in a traditional RDBMS, since the
default transaction isolation level may or may not allow the
application to see an uncommitted write by another transaction.

the only way i can think of to get consistency from an op like the  
above would be to do

bulk_put [ obj_c, doc_a, doc_b ]

in other words, if you are ever going to compute values from couch
docs to produce another doc, it would seem that it's required to put
*all* read information back in order to ensure that the sources have
not changed since the time that you read them.  the issue with this,
of course, is that a result computed from many documents is going to
cause exponential slowdown, since the potential for overlapping writes
will increase with the number of documents and the size of the
updates themselves will increase similarly.
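
The bulk_put idea above can be sketched with a small in-memory stand-in for the database. This is not the CouchDB API: `TinyStore`, `ConflictError`, integer revs, and the all-or-nothing behaviour of `bulk_put` are all assumptions made for illustration. It only shows why writing the source docs back with their original revs makes the whole write fail when one of them has changed in the meantime.

```ruby
# Hypothetical in-memory stand-in for a document store with revision checks.
# Not the CouchDB API; names and behaviour are assumptions for illustration.
class ConflictError < StandardError; end

class TinyStore
  def initialize
    @docs = {}   # id => { 'rev' => Integer, 'body' => Hash }
  end

  def get(id)
    entry = @docs.fetch(id)
    { 'id' => id, 'rev' => entry['rev'], 'body' => entry['body'].dup }
  end

  # All-or-nothing write: every doc must carry the rev currently stored
  # (nil for a new doc), otherwise nothing at all is written.
  def bulk_put(docs)
    stale = docs.reject { |d| current_rev(d['id']) == d['rev'] }
    raise ConflictError, "stale: #{stale.map { |d| d['id'] }.join(', ')}" unless stale.empty?

    docs.each do |d|
      @docs[d['id']] = { 'rev' => (d['rev'] || 0) + 1, 'body' => d['body'].dup }
    end
    true
  end

  private

  def current_rev(id)
    @docs.key?(id) ? @docs[id]['rev'] : nil
  end
end

store = TinyStore.new
store.bulk_put([
  { 'id' => 'id_a', 'rev' => nil, 'body' => { 'x' => 1 } },
  { 'id' => 'id_b', 'rev' => nil, 'body' => { 'y' => 2 } }
])

doc_a = store.get('id_a')
doc_b = store.get('id_b')
obj_c = { 'id' => 'id_c', 'rev' => nil,
          'body' => { 'sum' => doc_a['body']['x'] + doc_b['body']['y'] } }

# A concurrent writer changes id_a after we read it.
store.bulk_put([{ 'id' => 'id_a', 'rev' => doc_a['rev'], 'body' => { 'x' => 10 } }])

begin
  store.bulk_put([obj_c, doc_a, doc_b])   # doc_a's rev is now stale
  puts 'written'
rescue ConflictError => e
  puts "rejected (#{e.message})"
end
```

Note that every source doc travels back over the wire in full, which is the cost described above.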

a solution i can imagine is something like

list = get 'some_view'

obj = computed_value_from list

obj[ '_depends_on' ] = list.map{|element| [element.id, element.rev]}

put obj


so basically a method to do a put with not only your rev, but that of  
'n' dependent docs where only the [id, rev] pair for the dependent  
docs need be posted.  am i making any sense here?
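
A minimal sketch of how such a dependency-checked put might behave, again against an in-memory stand-in. Nothing here is an existing CouchDB feature: `DepStore`, `StaleDependency`, and the handling of `_depends_on` are hypothetical, and the check of the doc's own rev is left out to keep it short.

```ruby
# Hypothetical sketch of a put that checks '_depends_on' [id, rev] pairs.
# No such feature is claimed for CouchDB; DepStore is invented for illustration.
class StaleDependency < StandardError
  attr_reader :ids

  def initialize(ids)
    @ids = ids
    super("dependencies changed: #{ids.join(', ')}")
  end
end

class DepStore
  def initialize
    @docs = {}   # id => { 'rev' => Integer, 'body' => Hash }
  end

  def rev_of(id)
    @docs.key?(id) ? @docs[id]['rev'] : nil
  end

  # Write one doc; reject it if any [id, rev] pair in '_depends_on'
  # no longer matches the stored revision. Returns the new rev.
  def put(id, body)
    deps  = body.fetch('_depends_on', [])
    stale = deps.reject { |dep_id, dep_rev| rev_of(dep_id) == dep_rev }.map(&:first)
    raise StaleDependency.new(stale) unless stale.empty?

    new_rev = (rev_of(id) || 0) + 1
    @docs[id] = { 'rev' => new_rev, 'body' => body }
    new_rev
  end
end

store = DepStore.new
rev_a = store.put('id_a', { 'x' => 1 })
rev_b = store.put('id_b', { 'y' => 2 })

summary = { 'sum' => 3, '_depends_on' => [['id_a', rev_a], ['id_b', rev_b]] }
store.put('id_c', summary)        # accepted: both revs are still current

store.put('id_a', { 'x' => 5 })   # id_a moves to a new rev
begin
  store.put('id_d', summary)      # rejected: id_a's rev is stale
rescue StaleDependency => e
  puts e.message
end
```

Only the [id, rev] pairs are sent, not the source docs themselves, and the error names the docs that changed.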

cheers.



a @ http://codeforpeople.com/
--
we can deny everything, except that we have the possibility of being  
better. simply reflect on that.
h.h. the 14th dalai lama

Re: dirty reads - update strategies

Posted by Paul Carey <pa...@gmail.com>.

This doesn't address your concern about large writes, but to avoid
inconsistent state across docs you could simply perform your write
> bulk_put [ obj_c, doc_a, doc_b ]
to a separate database, e.g. db_cached

So you could query db_cached, knowing that obj_c is consistent with
doc_a and doc_b, but still being able to retrieve the up-to-date data
from db_live if you wish.
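
As a rough sketch of that idea, two plain Ruby hashes can stand in for db_live and db_cached. The helper name `snapshot_sum` and the hash layout are assumptions for illustration only, not a CouchDB client API.

```ruby
# Hypothetical sketch: two plain hashes stand in for db_live and db_cached.
def snapshot_sum(db_live, db_cached)
  doc_a = db_live.fetch('id_a').dup
  doc_b = db_live.fetch('id_b').dup
  obj_c = { 'sum' => doc_a['x'] + doc_b['y'] }

  # Store the derived doc together with the exact inputs it was computed from.
  db_cached.merge!('id_a' => doc_a, 'id_b' => doc_b, 'id_c' => obj_c)
  obj_c
end

db_live   = { 'id_a' => { 'x' => 1 }, 'id_b' => { 'y' => 2 } }
db_cached = {}

snapshot_sum(db_live, db_cached)
db_live['id_a']['x'] = 100    # the live data moves on

puts db_cached['id_c']['sum']   # still matches the cached copies of a and b
puts db_live['id_a']['x']       # the up-to-date value is still available
```

The cached triple stays consistent with itself by construction; it does not make the two reads from db_live atomic.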

Paul

Re: dirty reads - update strategies

Posted by "ara.t.howard" <ar...@gmail.com>.

On Nov 13, 2008, at 10:44 AM, Ayende Rahien wrote:

> This is a hack, but you could use bulk_docs to do this, which would
> fail if a or b were updated already. This would cause other items
> (that use a or b but do not change them) to fail updating.

indeed.  but it comes late: the reads may already be inconsistent
since they are done serially, and it's impossibly complex at the
moment since offending docs cannot be identified (i know a change
request is now in to include this info in the error).

cheers.

Re: dirty reads - update strategies

Posted by Ayende Rahien <ay...@ayende.com>.

This is a hack, but you could use bulk_docs to do this, which would fail if
a or b were updated already. This would cause other items (that use a or b
but do not change them) to fail updating.

Re: dirty reads - update strategies

Posted by "ara.t.howard" <ar...@gmail.com>.

On Nov 13, 2008, at 11:28 AM, Damien Katz wrote:

>
> Yes, I mean values as computed values.  The main post shouldn't be
> updated with a comment count or anything computed like that. It's
> fine if comments have a reference to their parent, and it's fine if
> the comments are tagged as children of the post. This way, when the
> main post is opened, the comment count can be computed from a view,
> or when viewing a comment, the user is also shown the parent, and
> maybe subcomments if it's a threaded discussion.



okay that makes good sense - same for RDBMS of course too.  basically  
you're saying 'stay normalized.'  i wasn't clear about your meaning of  
'values' - which clearly excludes 'ids.'

cheers.

Re: dirty reads - update strategies

Posted by Damien Katz <da...@apache.org>.

On Nov 13, 2008, at 1:10 PM, ara.t.howard wrote:

>
> On Nov 13, 2008, at 10:39 AM, Damien Katz wrote:
>
>> My answer is "Don't do that". Values in documents shouldn't depend  
>> on values in other documents, that's a better fit for a relational  
>> or OO DB. In your example though, CouchDB's views could be used to  
>> compute the sums.
>
> i don't think that's realistic.  consider something like the  
> following:
>
> let's say we write a publishing system, users can create documents  
> with content and tags.  at the end of the month the editor is going  
> to write a summary of the content from that month, obviously this  
> summary should be tagged with the union of the tags from all  
> summarized content - for later searching.  regardless of whether we  
> store the tags inside the document or outside of it we have quite a  
> task - we need to get a consistent read of all content for the
> month, with all its tags, in order to properly construct the
> summary document with its aggregate tags. this isn't strict
> dependence - it's merely a read/write consistency issue which nearly
> any application is going to face.  we can argue that it's not
> important that the summary of tags exactly mirrors the tags of its
> constituent parts, but that kind of thinking results not in an
> information store, but a collection of valueless data.

CouchDB views are a consistent snapshot of the database, and your
reports are generated from the views. The view APIs are the place to
look for better reporting capabilities.
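
For the tag example, the shape of such a view can be mimicked in plain Ruby. Real CouchDB views are map/reduce functions run by the view server (JavaScript by default); the code below only imitates that shape, and the field names `month` and `tags` and both helper names are assumptions.

```ruby
# Hypothetical documents; the field names ('month', 'tags') are assumptions.
docs = [
  { 'id' => 'p1', 'month' => '2008-11', 'tags' => %w[couchdb views] },
  { 'id' => 'p2', 'month' => '2008-11', 'tags' => %w[views ruby] },
  { 'id' => 'p3', 'month' => '2008-10', 'tags' => %w[erlang] }
]

# "map": emit one [key, value] row per tag, keyed by month.
def map_tags(doc)
  doc['tags'].map { |tag| [doc['month'], tag] }
end

# "reduce": collapse the values for each key into a sorted, de-duplicated list.
def tags_by_month(docs)
  rows = docs.flat_map { |doc| map_tags(doc) }
  rows.group_by(&:first).transform_values { |pairs| pairs.map(&:last).uniq.sort }
end

p tags_by_month(docs)['2008-11']   # => ["couchdb", "ruby", "views"]
```

The union of tags is derived at read time from the source docs, so there is no stored aggregate to fall out of date.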

>
>
> anyhow, i think it's important to be able to agree upon best
> practices for this kind of operation.  saying that values shouldn't
> depend on values in other documents is quite a statement - it means
> couch should not be used for any information store where the
> information value needs to grow recursively.

What I mean is you should never depend on the accuracy of computed
values in documents that are based on other documents, particularly
with replication.

> in my case we're modeling financial information which gets processed  
> in increasingly sophisticated ways - where documents are inputs to  
> processes which produce other documents.  i can't think of an  
> application that does not do the same thing: a blog comment depends  
> on the blog post, a 'friends list' depends on the users, etc.

>
>
> are you referring to 'values' as different from 'ids' ?

Yes, I mean values as computed values.  The main post shouldn't be
updated with a comment count or anything computed like that. It's fine
if comments have a reference to their parent, and it's fine if the
comments are tagged as children of the post. This way, when the main
post is opened, the comment count can be computed from a view, or when
viewing a comment, the user is also shown the parent, and maybe
subcomments if it's a threaded discussion.
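
A small sketch of that layout, with the count derived rather than stored. It is plain Ruby imitating a counting view, not actual view-server code, and the `type` and `parent` field names and both helper names are assumptions.

```ruby
# Hypothetical documents: comments reference their post, the post stores no count.
docs = [
  { 'id' => 'post1', 'type' => 'post' },
  { 'id' => 'c1', 'type' => 'comment', 'parent' => 'post1' },
  { 'id' => 'c2', 'type' => 'comment', 'parent' => 'post1' },
  { 'id' => 'post2', 'type' => 'post' }
]

# "map": only comments emit a row, keyed by the post they belong to.
def map_comment(doc)
  doc['type'] == 'comment' ? [[doc['parent'], 1]] : []
end

# "reduce": sum the emitted 1s per key, as a counting view would.
def comment_counts(docs)
  docs.flat_map { |doc| map_comment(doc) }
      .each_with_object(Hash.new(0)) { |(post_id, n), counts| counts[post_id] += n }
end

counts = comment_counts(docs)
puts counts['post1']   # 2
puts counts['post2']   # 0, no comments emitted for this post
```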

-Damien

Re: dirty reads - update strategies

Posted by "ara.t.howard" <ar...@gmail.com>.

On Nov 13, 2008, at 10:39 AM, Damien Katz wrote:

> My answer is "Don't do that". Values in documents shouldn't depend  
> on values in other documents, that's a better fit for a relational  
> or OO DB. In your example though, CouchDB's views could be used to  
> compute the sums.

i don't think that's realistic.  consider something like the following:

let's say we write a publishing system, users can create documents  
with content and tags.  at the end of the month the editor is going to  
write a summary of the content from that month, obviously this summary  
should be tagged with the union of the tags from all summarized  
content - for later searching.  regardless of whether we store the  
tags inside the document or outside of it we have quite a task - we  
need to get a consistent read of all content for the month, with all
its tags, in order to properly construct the summary document with
its aggregate tags.  this isn't strict dependence - it's merely a
read/write consistency issue which nearly any application is going to
face.  we can argue that it's not important that the summary of tags
exactly mirrors the tags of its constituent parts, but that kind of
thinking results not in an information store, but a collection of
valueless data.

anyhow, i think it's important to be able to agree upon best practices  
for this kind of operation.  saying that values shouldn't depend on
values in other documents is quite a statement - it means couch should
not be used for any information store where the information value needs
to grow recursively.  in my case we're modeling financial information
which gets processed in increasingly sophisticated ways - where  
documents are inputs to processes which produce other documents.  i  
can't think of an application that does not do the same thing: a blog  
comment depends on the blog post, a 'friends list' depends on the  
users, etc.

are you referring to 'values' as different from 'ids' ?

kind regards.

Re: dirty reads - update strategies

Posted by Nuno Job <nu...@gmail.com>.

jan is at codebits now :D

sorry for the off topic :P

Re: dirty reads - update strategies

Posted by Damien Katz <da...@apache.org>.

My answer is "Don't do that". Values in documents shouldn't depend on
values in other documents; that's a better fit for a relational or OO
DB. In your example though, CouchDB's views could be used to compute
the sums.

-Damien
