
Posted to user@couchdb.apache.org by "Poyau, John" <jo...@lmco.com> on 2011/05/03 16:38:52 UTC

Document Timestamp On Replication

Hi All,
Our project, the Learning Registry <https://github.com/LearningRegistry/LearningRegistry>, is using CouchDB, and we are trying to figure out the best way to update document timestamps on replication.  Currently we have a resource_data database where each resource_data document has an update_timestamp field that is set when the document is pushed to the database and any time that the document is updated thereafter.  We want the update_timestamp to be the time of replication when the document gets replicated.  What is the CouchDB way of updating a document's update_timestamp on replication?

Re: Document Timestamp On Replication

Posted by Paul Davis <pa...@gmail.com>.

On Tue, May 3, 2011 at 10:38 AM, Poyau, John <jo...@lmco.com> wrote:
> Hi All,
> Our project the learningregistry<https://github.com/LearningRegistry/LearningRegistry> is using couchdb and we are trying to figure out what is the best way to update document timestamps on replication.  Currently we have a resource_data database where resource_data document has a  update_timestamp field that is set when the document is pushed to database and any time that the document is updated thereafter.  We want the update_timestamp to be time of replication when the document gets replicated.   What is the couchdb way of update the documents update_timestamp on replication.
>
>

The CouchDB way is Don't Do That. If you change a doc during
replication then your replications will never finish.

RE: EXTERNAL: Re: Document Timestamp On Replication

Posted by "Poyau, John" <jo...@lmco.com>.

Owen,

Thank you for all the replies. I have answered inline.

-----Original Message-----
From: Owen Marshall [mailto:omarshall@facilityone.com] 
Sent: Wednesday, May 04, 2011 12:15 PM
To: user@couchdb.apache.org
Cc: Poyau, John
Subject: EXTERNAL: Re: Document Timestamp On Replication

On 05/04/2011 11:29 AM, Poyau, John wrote:
> -We want to keep track of the time that a document is added/updated in 
> a source database

Then you definitely want an updated field per-document.

Implementing this varies with your needs. You could use a single timestamp that gets clobbered each time, if you don't need a huge auditing trail. You could also do a list of timestamps if it would prove helpful.

One other technique that I'm especially fond of is to store changes as attachments to each document. This gives you great audit trails -- who made what change when. You could go so far as to store the full document state before the change.

But if you don't need that level of auditing, a timestamp field is the way to go.

> -We want to keep track of the time that a document get replicated to a target databases on replication.

Don't. Don't don't don't.

But because I hate it when the answer is "you're doing it wrong" and nothing else, some notes:

* You will definitely want to separate the replication time from the update time (as they clearly aren't the same thing.)

* Further, that *cannot* go in the document, clearly.
I know, I was planning on using a separate document to track update_timestamp.

* You'd need at a minimum filtered/named replication to send the documents you want, and an update handler to put the "replicated time"
in some other document.

Again though, you never answered the simple question of *why* you want to know this. Let me be clear: what you are trying to do adds a bunch of complexity to your documents, your replication, and your program. And I'm not sure why you want to do it so badly.

I am implementing a spec that requires an update_timestamp that tracks the time of any change to the document, including when it is replicated to a target db.  I think you make a good point that an update_timestamp is not exactly the same as a replication_timestamp.

What problem do you think you are solving by storing the replicated time?
Knowing when the document was added to the target database.
--
Owen Marshall
FacilityONE
omarshall@facilityone.com | (502) 805-2126

Re: Document Timestamp On Replication

Posted by Jim Klo <ji...@sri.com>.

Not sure that's quite exactly what we need to do.

Mike's concept is relatively spot on.

If I understand it correctly, since=<sequence number> gives a constantly changing result: as updates/insertions occur, the sequence is incremented.

If I use an example timeline:

0-----10-----20-----30-----40----->

How would one request the range of results from the view from seq 0 to seq 30 repeatedly, and paginate as well? To me, since= seems to imply that if I use seq 20, I'm going to get results from 21 to current. Or am I misunderstanding?


Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International




On May 4, 2011, at 12:17 PM, Chris Anderson wrote:

> This feature exists.
> 
> Query the view with ?update_seq=true, and you will get a line in the
> output that tells you what the DB seq the view is current as of.
> 
> Then you can subscribe to filtered changes with since= the seq, and
> have a continuously updated transactional view of the view.
> 
> Cheers,
> Chris
> 
> On Wed, May 4, 2011 at 10:35 AM, Mike Leddy <mi...@loop.com.br> wrote:
>> What you are describing could be resolved with a feature (that I believe
>> does not exist).
>> 
>> If you could supply a database sequence number when querying a view
>> ie. return the results from the view when the database sequence number
>> was "x" then the MVCC guarantees of couchdb would guarantee exactly
>> what you want.
>> 
>> Databases and views being append only pretty much resolves everything
>> (unless a compact completes while you are still holding on to
>> the sequence number).
>> 
>> Regards,
>> 
>> Mike
>> 
>> 
>> On Wed, 2011-05-04 at 09:36 -0700, Jim Klo wrote:
>>> I might be able to shed some light here, as I'm working w/ John on this project.
>>> 
>>> Essentially I believe ultimately we need to be able to do is build a view that is similar to the _changes but filtered.
>>> 
>>> Basically we need a way to maintain 'local' transactional integrity.  So assuming our Couch is busy receiving updates from other Couch nodes.  If we have a user that is querying a range against Couch that, lets has would have 100,000 results.  We need to be able to paginate through that range and be guaranteed that it's not going to be modified via some update happening in another thread.
>>> 
>>> If you are familiar with OAI-PMH, essential we need to build flow control into the application, but if I request a range of objects at 12:00pm... I need to be able to paginate through that range probably until 12:05pm, potentially, without any updates between 12:00 and 12:05 effecting the result set.
>>> 
>>> The idea about having the document with the local timestamp is that we would be able to create a view by timestamp to query in this manner.
>>> 
>>> If you guys have alternate ideas on how we might achieve this - I think we'd be open to discussion.
>>> 
>>> 
>>> Jim Klo
>>> Senior Software Engineer
>>> Center for Software Engineering
>>> SRI International
>>> 
>>> 
>>> 
>>> 
>>> On May 4, 2011, at 9:14 AM, Owen Marshall wrote:
>>> 
>>>> On 05/04/2011 11:29 AM, Poyau, John wrote:
>>>>> -We want to keep track of the time that a document is added/updated in a source database
>>>> 
>>>> Then you definitely want an updated field per-document.
>>>> 
>>>> Implementing this varies with your needs. You could use a single
>>>> timestamp that gets clobbered each time, if you don't need a huge
>>>> auditing trail. You could also do a list of timestamps if it would prove
>>>> helpful.
>>>> 
>>>> One other technique that I'm especially fond of is to store changes as
>>>> attachments to each document. This gives you great audit trails -- who
>>>> made what change when. You could go so far as to store the full document
>>>> state before the change.
>>>> 
>>>> But if you don't need that level of auditing, a timestamp field is the
>>>> way to go.
>>>> 
>>>>> -We want to keep track of the time that a document get replicated to a target databases on replication.
>>>> 
>>>> Don't. Don't don't don't.
>>>> 
>>>> But because I hate it when the answer is "you're doing it wrong" and
>>>> nothing else, some notes:
>>>> 
>>>> * You will definitely want to separate the replication time from the
>>>> update time (as they clearly aren't the same thing.)
>>>> 
>>>> * Further, that *cannot* go in the document, clearly.
>>>> 
>>>> * You'd need at a minimum filtered/named replication to send the
>>>> documents you want, and an update handler to put the "replicated time"
>>>> in some other document.
>>>> 
>>>> Again though, you never answered the simple question of *why* you want
>>>> to know this. Let me be clear: what you are trying to do adds a bunch of
>>>> complexity to your documents, your replication, and your program. And
>>>> I'm not sure why you want to do it so badly.
>>>> 
>>>> What problem do you think you are solving by storing the replicated time?
>>>> 
>>>> --
>>>> Owen Marshall
>>>> FacilityONE
>>>> omarshall@facilityone.com | (502) 805-2126
>>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Chris Anderson
> http://jchrisa.net
> http://couchbase.com

Re: Document Timestamp On Replication

Posted by Chris Anderson <jc...@apache.org>.

This feature exists.

Query the view with ?update_seq=true, and you will get a line in the
output that tells you what the DB seq the view is current as of.

Then you can subscribe to filtered changes with since= the seq, and
have a continuously updated transactional view of the view.

Cheers,
Chris

On Wed, May 4, 2011 at 10:35 AM, Mike Leddy <mi...@loop.com.br> wrote:
> What you are describing could be resolved with a feature (that I believe
> does not exist).
>
> If you could supply a database sequence number when querying a view
> ie. return the results from the view when the database sequence number
> was "x" then the MVCC guarantees of couchdb would guarantee exactly
> what you want.
>
> Databases and views being append only pretty much resolves everything
> (unless a compact completes while you are still holding on to
> the sequence number).
>
> Regards,
>
> Mike
>
>
> On Wed, 2011-05-04 at 09:36 -0700, Jim Klo wrote:
>> I might be able to shed some light here, as I'm working w/ John on this project.
>>
>> Essentially I believe ultimately we need to be able to do is build a view that is similar to the _changes but filtered.
>>
>> Basically we need a way to maintain 'local' transactional integrity.  So assuming our Couch is busy receiving updates from other Couch nodes.  If we have a user that is querying a range against Couch that, lets has would have 100,000 results.  We need to be able to paginate through that range and be guaranteed that it's not going to be modified via some update happening in another thread.
>>
>> If you are familiar with OAI-PMH, essential we need to build flow control into the application, but if I request a range of objects at 12:00pm... I need to be able to paginate through that range probably until 12:05pm, potentially, without any updates between 12:00 and 12:05 effecting the result set.
>>
>> The idea about having the document with the local timestamp is that we would be able to create a view by timestamp to query in this manner.
>>
>> If you guys have alternate ideas on how we might achieve this - I think we'd be open to discussion.
>>
>>
>> Jim Klo
>> Senior Software Engineer
>> Center for Software Engineering
>> SRI International
>>
>>
>>
>>
>> On May 4, 2011, at 9:14 AM, Owen Marshall wrote:
>>
>> > On 05/04/2011 11:29 AM, Poyau, John wrote:
>> >> -We want to keep track of the time that a document is added/updated in a source database
>> >
>> > Then you definitely want an updated field per-document.
>> >
>> > Implementing this varies with your needs. You could use a single
>> > timestamp that gets clobbered each time, if you don't need a huge
>> > auditing trail. You could also do a list of timestamps if it would prove
>> > helpful.
>> >
>> > One other technique that I'm especially fond of is to store changes as
>> > attachments to each document. This gives you great audit trails -- who
>> > made what change when. You could go so far as to store the full document
>> > state before the change.
>> >
>> > But if you don't need that level of auditing, a timestamp field is the
>> > way to go.
>> >
>> >> -We want to keep track of the time that a document get replicated to a target databases on replication.
>> >
>> > Don't. Don't don't don't.
>> >
>> > But because I hate it when the answer is "you're doing it wrong" and
>> > nothing else, some notes:
>> >
>> > * You will definitely want to separate the replication time from the
>> > update time (as they clearly aren't the same thing.)
>> >
>> > * Further, that *cannot* go in the document, clearly.
>> >
>> > * You'd need at a minimum filtered/named replication to send the
>> > documents you want, and an update handler to put the "replicated time"
>> > in some other document.
>> >
>> > Again though, you never answered the simple question of *why* you want
>> > to know this. Let me be clear: what you are trying to do adds a bunch of
>> > complexity to your documents, your replication, and your program. And
>> > I'm not sure why you want to do it so badly.
>> >
>> > What problem do you think you are solving by storing the replicated time?
>> >
>> > --
>> > Owen Marshall
>> > FacilityONE
>> > omarshall@facilityone.com | (502) 805-2126
>> >
>>
>
>
>



-- 
Chris Anderson
http://jchrisa.net
http://couchbase.com

Re: Document Timestamp On Replication

Posted by Mike Leddy <mi...@loop.com.br>.

What you are describing could be resolved with a feature (that I believe
does not exist).

If you could supply a database sequence number when querying a view,
i.e. return the results from the view as of the time the database sequence
number was "x", then the MVCC guarantees of CouchDB would guarantee exactly
what you want.

Databases and views being append only pretty much resolves everything
(unless a compact completes while you are still holding on to
the sequence number). 

Regards,

Mike


On Wed, 2011-05-04 at 09:36 -0700, Jim Klo wrote:
> I might be able to shed some light here, as I'm working w/ John on this project.
> 
> Essentially I believe ultimately we need to be able to do is build a view that is similar to the _changes but filtered.
> 
> Basically we need a way to maintain 'local' transactional integrity.  So assuming our Couch is busy receiving updates from other Couch nodes.  If we have a user that is querying a range against Couch that, lets has would have 100,000 results.  We need to be able to paginate through that range and be guaranteed that it's not going to be modified via some update happening in another thread.
> 
> If you are familiar with OAI-PMH, essential we need to build flow control into the application, but if I request a range of objects at 12:00pm... I need to be able to paginate through that range probably until 12:05pm, potentially, without any updates between 12:00 and 12:05 effecting the result set.
> 
> The idea about having the document with the local timestamp is that we would be able to create a view by timestamp to query in this manner.
> 
> If you guys have alternate ideas on how we might achieve this - I think we'd be open to discussion.
> 
> 
> Jim Klo
> Senior Software Engineer
> Center for Software Engineering
> SRI International
> 
> 
> 
> 
> On May 4, 2011, at 9:14 AM, Owen Marshall wrote:
> 
> > On 05/04/2011 11:29 AM, Poyau, John wrote:
> >> -We want to keep track of the time that a document is added/updated in a source database
> > 
> > Then you definitely want an updated field per-document.
> > 
> > Implementing this varies with your needs. You could use a single
> > timestamp that gets clobbered each time, if you don't need a huge
> > auditing trail. You could also do a list of timestamps if it would prove
> > helpful.
> > 
> > One other technique that I'm especially fond of is to store changes as
> > attachments to each document. This gives you great audit trails -- who
> > made what change when. You could go so far as to store the full document
> > state before the change.
> > 
> > But if you don't need that level of auditing, a timestamp field is the
> > way to go.
> > 
> >> -We want to keep track of the time that a document get replicated to a target databases on replication.
> > 
> > Don't. Don't don't don't.
> > 
> > But because I hate it when the answer is "you're doing it wrong" and
> > nothing else, some notes:
> > 
> > * You will definitely want to separate the replication time from the
> > update time (as they clearly aren't the same thing.)
> > 
> > * Further, that *cannot* go in the document, clearly.
> > 
> > * You'd need at a minimum filtered/named replication to send the
> > documents you want, and an update handler to put the "replicated time"
> > in some other document.
> > 
> > Again though, you never answered the simple question of *why* you want
> > to know this. Let me be clear: what you are trying to do adds a bunch of
> > complexity to your documents, your replication, and your program. And
> > I'm not sure why you want to do it so badly.
> > 
> > What problem do you think you are solving by storing the replicated time?
> > 
> > -- 
> > Owen Marshall
> > FacilityONE
> > omarshall@facilityone.com | (502) 805-2126
> > 
>

Re: Document Timestamp On Replication

Posted by Jim Klo <ji...@sri.com>.

Besides our use case not running inside the browser, I'm not sure that works for us, as we explicitly don't want any updates.  Our end solution is a REST service to be used by others.

We literally want a snapshot of the view at a specific point in time and be able to operate on that snapshot without it changing.  

A good physical analogy: I want to take a bucketful of water (documents) from a lake (CouchDB) being fed by a river (continuous data inserts/updates), and empty the bucket by the spoonful (pagination). When I'm finished emptying the bucket, I'll go back to the lake and refill the bucket (new request); the lake now has more water (documents) than when I took the first bucketful, since it's fed by the river (continuous inserts/updates of new documents).  In this scenario, the state of the bucket is only affected by operations done with the spoon; the lake has no effect upon the bucket. This is very close to what we want to be able to achieve using CouchDB.

To extend the analogy a bit to show what we do not want to occur and want to prevent [as I believe this is close to how CouchDB actually works]: suppose we had a hose that continuously filled the bucket from the lake while we were emptying the bucket by the spoonful.  We could easily get into a state where the bucket begins to overflow, or where we can never actually empty the bucket unless the rate of emptying with the spoon exceeds the rate at which the bucket fills.  I can't change the rate at which bucket emptying occurs, nor can I predict or change the rate at which the hose fills the bucket from the lake.  Ultimately, in this extension, how do we get rid of the hose?  It seems update_seq=true gets me part of the way there. What seems to be missing, and really what I need/want, is a before=<seq id> instead of a since=<seq id>.

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International

On May 5, 2011, at 2:07 PM, Chris Anderson wrote:

> Not sure I follow all the requirements, but here is what I've done in the past.
> 
> on page load: query the view with update_seq=true
> 
> render the screen with up to date data as of seq X
> 
> open a changes request with since=X&include_docs=true
> 
> each doc that comes down the pipe, run the map function again (in the
> browser) and take whatever is emitted and stick it in your
> datastructure that represents the view (or just directly update the
> dom). also if an old version of the doc emitted something different,
> remove whatever stuff in your in-page representation corresponds to
> the old version of the doc.
> 
> now you have a screen that is kept up to date with a consistent
> representation of what you'd get in a hard-reload, with a
> transactional guarantee that no updates will be skipped.
> 
> Chris
> 
> On Wed, May 4, 2011 at 1:21 PM, Owen Marshall <om...@facilityone.com> wrote:
>> On 05/04/2011 04:13 PM, Eli Stevens (Gmail) wrote:
>>> 11:59 - Document D inserted on Node 2.  Replication hasn't happened yet.
>>> 12:00 - First access of view page 1 on Node 1.  Only A, B, C are present.
>>> 12:01 - D is replicated to Node 1.
>> 
>> Mmm, yes, you're absolutely correct; depending on that view would carry
>> with it the risk of an update race. It would (likely) work if replicates
>> were consistently low-latency, but that's not a guarantee.
>> 
>> Correct me if I'm wrong, but that view would work if:
>> 
>> 1. you capture last_seq from _changes pre-view run
>> 2. run the view, capturing the output
>> 3. check _changes for any updates since=your captured last_seq
>> 4. filter those IDs out of your captured view.
>> 
>> Yuck.
>> 
>> --
>> Owen Marshall
>> FacilityONE
>> omarshall@facilityone.com | (502) 805-2126
>> 
>> 
> 
> 
> 
> -- 
> Chris Anderson
> http://jchrisa.net
> http://couchbase.com

Re: Document Timestamp On Replication

Posted by Chris Anderson <jc...@apache.org>.

Not sure I follow all the requirements, but here is what I've done in the past.

on page load: query the view with update_seq=true

render the screen with up to date data as of seq X

open a changes request with since=X&include_docs=true

for each doc that comes down the pipe, run the map function again (in the
browser) and take whatever is emitted and stick it in the data
structure that represents the view (or just directly update the
DOM). also, if an old version of the doc emitted something different,
remove whatever in your in-page representation corresponds to
the old version of the doc.

now you have a screen that is kept up to date with a consistent
representation of what you'd get in a hard-reload, with a
transactional guarantee that no updates will be skipped.
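
[Editor's sketch] The steps above can be illustrated with a small in-memory
model. This is only a sketch under assumptions, not CouchDB client code:
LocalView, map_by_type and the change dictionaries are hypothetical names
and shapes chosen for the example, sequence numbers are assumed to be plain
integers, and no HTTP requests are made. It shows loading the view rows as
of seq X and then re-running the map function for each change with a
higher seq, dropping whatever the old version of the doc had emitted.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, str]  # (key, value) pair emitted by the map function


def map_by_type(doc: dict) -> List[Row]:
    """Hypothetical map function: emit (type, title) for docs with a type."""
    if "type" in doc:
        return [(doc["type"], doc.get("title", ""))]
    return []


class LocalView:
    """Client-side copy of a view, kept current from a changes stream."""

    def __init__(self, map_fn: Callable[[dict], List[Row]]) -> None:
        self.map_fn = map_fn
        self.rows_by_id: Dict[str, List[Row]] = {}
        self.seq = 0

    def load(self, docs: List[dict], update_seq: int) -> None:
        # Initial state: what the view contained as of update_seq.
        self.rows_by_id = {d["_id"]: self.map_fn(d) for d in docs}
        self.seq = update_seq

    def apply_change(self, change: dict) -> None:
        # Ignore anything at or before the seq we already reflect.
        if change["seq"] <= self.seq:
            return
        # Drop whatever the old version of the doc emitted.
        self.rows_by_id.pop(change["id"], None)
        # Re-run the map function on the new version, unless it was deleted.
        if not change.get("deleted"):
            self.rows_by_id[change["id"]] = self.map_fn(change["doc"])
        self.seq = change["seq"]

    def rows(self) -> List[Tuple[str, str, str]]:
        out = [(k, v, doc_id)
               for doc_id, rows in self.rows_by_id.items()
               for k, v in rows]
        return sorted(out, key=lambda r: (r[0], r[2]))
```

In a real page the initial docs and the change dictionaries would come from
the view query and the changes request described above; here they are passed
in by hand.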

Chris

On Wed, May 4, 2011 at 1:21 PM, Owen Marshall <om...@facilityone.com> wrote:
> On 05/04/2011 04:13 PM, Eli Stevens (Gmail) wrote:
>> 11:59 - Document D inserted on Node 2.  Replication hasn't happened yet.
>> 12:00 - First access of view page 1 on Node 1.  Only A, B, C are present.
>> 12:01 - D is replicated to Node 1.
>
> Mmm, yes, you're absolutely correct; depending on that view would carry
> with it the risk of an update race. It would (likely) work if replicates
> were consistently low-latency, but that's not a guarantee.
>
> Correct me if I'm wrong, but that view would work if:
>
> 1. you capture last_seq from _changes pre-view run
> 2. run the view, capturing the output
> 3. check _changes for any updates since=your captured last_seq
> 4. filter those IDs out of your captured view.
>
> Yuck.
>
> --
> Owen Marshall
> FacilityONE
> omarshall@facilityone.com | (502) 805-2126
>
>

-- 
Chris Anderson
http://jchrisa.net
http://couchbase.com

Re: Document Timestamp On Replication

Posted by Owen Marshall <om...@facilityone.com>.

On 05/04/2011 04:13 PM, Eli Stevens (Gmail) wrote:
> 11:59 - Document D inserted on Node 2.  Replication hasn't happened yet.
> 12:00 - First access of view page 1 on Node 1.  Only A, B, C are present.
> 12:01 - D is replicated to Node 1.

Mmm, yes, you're absolutely correct; depending on that view would carry
with it the risk of an update race. It would (likely) work if replicates
were consistently low-latency, but that's not a guarantee.

Correct me if I'm wrong, but that view would work if:

1. you capture last_seq from _changes pre-view run
2. run the view, capturing the output
3. check _changes for any updates since=your captured last_seq
4. filter those IDs out of your captured view.

Yuck.
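
[Editor's sketch] The four steps above could look roughly like this. It is a
sketch under assumptions: FakeDb is a hypothetical in-memory stand-in for one
database with integer sequence numbers, not a CouchDB client, and the
racing_write hook exists only to simulate a write that lands between step 1
and step 2.

```python
from typing import Callable, Dict, List, Optional, Tuple


class FakeDb:
    """Hypothetical in-memory stand-in for a single database."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}
        self.log: List[Tuple[int, str]] = []  # (seq, doc id) per write
        self.seq = 0

    def put(self, doc_id: str, body: dict) -> None:
        self.seq += 1
        self.docs[doc_id] = dict(body)
        self.log.append((self.seq, doc_id))

    def last_seq(self) -> int:
        return self.seq

    def changes_since(self, since: int) -> List[str]:
        # Ids of docs written after the given seq.
        return [doc_id for seq, doc_id in self.log if seq > since]

    def view_ids(self) -> List[str]:
        # Stand-in for "run the view": ids of all current docs, sorted.
        return sorted(self.docs)


def snapshot_ids(db: FakeDb,
                 racing_write: Optional[Callable[[FakeDb], None]] = None
                 ) -> List[str]:
    captured = db.last_seq()                    # 1. capture last_seq
    if racing_write is not None:
        racing_write(db)                        # simulated concurrent write
    rows = db.view_ids()                        # 2. run the view
    changed = set(db.changes_since(captured))   # 3. changes since captured seq
    return [r for r in rows if r not in changed]  # 4. filter those ids out
```

Note one consequence of step 4 as written: a document that already existed
but was updated after the captured seq is dropped from the result entirely,
rather than shown in its older state.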

-- 
Owen Marshall
FacilityONE
omarshall@facilityone.com | (502) 805-2126

Re: Document Timestamp On Replication

Posted by "Eli Stevens (Gmail)" <wi...@gmail.com>.

On Wed, May 4, 2011 at 10:41 AM, Owen Marshall
<om...@facilityone.com> wrote:
> So, if a user runs this view on N1 at 12:00, and changes/adds are
> replicated in from N2 at 12:01, **it doesn't matter!** Those documents
> will have an updatetime > 12:00, so they won't be seen.

This isn't the use case being described though.

00:00 - Documents A, B, C inserted on Node 1, replicated to Node 2.
11:59 - Document D inserted on Node 2.  Replication hasn't happened yet.
12:00 - First access of view page 1 on Node 1.  Only A, B, C are present.
12:01 - D is replicated to Node 1.
12:02 - Second access of view page 1 on Node 1, A, B, C, D should be present.
12:05 - First view page 2 is accessed, D should not be present.

I'm not quite sure why the OP doesn't want D shown at 12:05, though.

HTH,
Eli

Re: Document Timestamp On Replication

Posted by Owen Marshall <om...@facilityone.com>.

On 05/04/2011 12:36 PM, Jim Klo wrote:
> We need to be able to paginate through that range and be guaranteed that it's not going to be modified via some update happening in another thread.
>
>[...] I request a range of objects at 12:00pm... I need to be able to paginate through that range probably until 12:05pm, potentially, without any updates between 12:00 and 12:05 effecting the result set.

Ah, *now* I see what you are going for.

First, pretend we are working only with one node. This is important :)

Let's assume that we are only concerned about filtering out new writes
-- that is, we want to be able to run a query and page around from 12:00
- 12:05, and not see a document added at 12:01.

So, include the document update time inside the doc, make that part (or
all!) of your view's key, and use endkey to ensure that you only see
documents before 12:00.
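
[Editor's sketch] A minimal single-node illustration of the idea, assuming
ISO 8601 timestamp strings (which sort lexicographically) in a field named
updatetime. build_view and query are hypothetical helpers that imitate a
view keyed on update time and an endkey/skip/limit query; they are not
CouchDB API calls.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (updatetime key, doc id)


def build_view(docs: List[dict]) -> List[Row]:
    """Imitate a view whose key is the document's update time."""
    return sorted((d["updatetime"], d["_id"]) for d in docs)


def query(rows: List[Row], endkey: str, skip: int = 0, limit: int = 2) -> List[Row]:
    """Return one page of rows whose key is <= endkey."""
    eligible = [r for r in rows if r[0] <= endkey]
    return eligible[skip:skip + limit]


docs = [
    {"_id": "A", "updatetime": "2011-05-04T11:00:00Z"},
    {"_id": "B", "updatetime": "2011-05-04T11:30:00Z"},
    {"_id": "C", "updatetime": "2011-05-04T11:59:00Z"},
]
cutoff = "2011-05-04T12:00:00Z"  # fixed when the user starts paging

page1 = query(build_view(docs), endkey=cutoff, skip=0)

# A new document arrives at 12:01, while the user is still paging.
docs.append({"_id": "D", "updatetime": "2011-05-04T12:01:00Z"})

page2 = query(build_view(docs), endkey=cutoff, skip=2)
```

Because the cutoff is held constant across pages, the document written at
12:01 sorts after the endkey and never shows up in page2.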

Handling updates becomes a bit more complex, but it's all based on how
your application needs to work.

One idea would be to, on update, store the old document inline. So you'd
go from:

{id: A, updatetime: 12:00, foo: bar}
  ->
{id: A, updatetime: 12:01, foo: baz, history: {id: A, updatetime: 12:00, foo: bar}}

Then your view just emits all updatetimes for documents *AND* the
updatetimes for all history values. You can use the same endkey filter
as before.

(That's assuming this level of tracking is needed for your program. It
may not be -- but that's up to you. And you could also decouple history
from individual documents. There are plenty of ways to skin this
particular cat.)
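
[Editor's sketch] One way to picture the inline-history idea, under
assumptions: the document shape follows the example above (a single nested
history object, which this sketch also allows to nest further), the times
are simple "HH:MM" strings, and map_update_times and state_as_of are
hypothetical helper names. The map step emits one row per update time,
including the ones stored under history, and the same "key <= endkey" filter
then picks the state that was current at the cutoff.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, str]  # (updatetime, value of foo at that time)


def map_update_times(doc: dict) -> List[Row]:
    """Emit the current update time plus every update time kept in history."""
    rows: List[Row] = []
    node: Optional[dict] = doc
    while node is not None:
        rows.append((node["updatetime"], node["foo"]))
        node = node.get("history")  # assumption: older states nest here
    return rows


def state_as_of(doc: dict, endkey: str) -> Optional[str]:
    """Value of foo in the newest version with updatetime <= endkey."""
    eligible = [r for r in map_update_times(doc) if r[0] <= endkey]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r[0])[1]


doc = {
    "id": "A", "updatetime": "12:01", "foo": "baz",
    "history": {"id": "A", "updatetime": "12:00", "foo": "bar"},
}
```

With this document, a reader paging with a 12:00 cutoff still sees the older
value, while a later cutoff sees the updated one.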

OK, now... this works fine for one node -- why won't it work for 2 or more?

Remember that replication is _not special_; the act of replicating a
document from n1->n2 is equivalent to PUTting/POSTing that document on
n1, then on n2.

So, if a user runs this view on N1 at 12:00, and changes/adds are
replicated in from N2 at 12:01, **it doesn't matter!** Those documents
will have an updatetime > 12:00, so they won't be seen.

Now, there are some possible issues that can come up with conflicts, but
you've got to handle those anyway if you want to use replication -- and
keeping track of when a document was replicated in won't help you with
that. As a matter of fact, that will have more pain points, compared to
well-understood conflicts.

Replication is just another insert/update stream. If you write your view
to work properly on one node (in your case, show only updates before a
given time), it's going to work when you replicate with other nodes.

See:
http://wiki.apache.org/couchdb/HTTP_view_API
http://wiki.apache.org/couchdb/View_collation

for more.

--
Owen Marshall
FacilityONE
omarshall@facilityone.com | (502) 805-2126

Re: Document Timestamp On Replication

Posted by Jim Klo <ji...@sri.com>.

I might be able to shed some light here, as I'm working w/ John on this project.

Essentially, I believe what we ultimately need to be able to do is build a view that is similar to _changes but filtered.

Basically we need a way to maintain 'local' transactional integrity.  Assume our Couch is busy receiving updates from other Couch nodes, and we have a user querying a range against Couch that, let's say, would have 100,000 results.  We need to be able to paginate through that range and be guaranteed that it's not going to be modified via some update happening in another thread.

If you are familiar with OAI-PMH, essentially we need to build flow control into the application: if I request a range of objects at 12:00pm, I need to be able to paginate through that range, potentially until 12:05pm, without any updates between 12:00 and 12:05 affecting the result set.

The idea about having the document with the local timestamp is that we would be able to create a view by timestamp to query in this manner.

If you guys have alternate ideas on how we might achieve this - I think we'd be open to discussion.

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International

On May 4, 2011, at 9:14 AM, Owen Marshall wrote:

> On 05/04/2011 11:29 AM, Poyau, John wrote:
>> -We want to keep track of the time that a document is added/updated in a source database
> 
> Then you definitely want an updated field per-document.
> 
> Implementing this varies with your needs. You could use a single
> timestamp that gets clobbered each time, if you don't need a huge
> auditing trail. You could also do a list of timestamps if it would prove
> helpful.
> 
> One other technique that I'm especially fond of is to store changes as
> attachments to each document. This gives you great audit trails -- who
> made what change when. You could go so far as to store the full document
> state before the change.
> 
> But if you don't need that level of auditing, a timestamp field is the
> way to go.
> 
>> -We want to keep track of the time that a document get replicated to a target databases on replication.
> 
> Don't. Don't don't don't.
> 
> But because I hate it when the answer is "you're doing it wrong" and
> nothing else, some notes:
> 
> * You will definitely want to separate the replication time from the
> update time (as they clearly aren't the same thing.)
> 
> * Further, that *cannot* go in the document, clearly.
> 
> * You'd need at a minimum filtered/named replication to send the
> documents you want, and an update handler to put the "replicated time"
> in some other document.
> 
> Again though, you never answered the simple question of *why* you want
> to know this. Let me be clear: what you are trying to do adds a bunch of
> complexity to your documents, your replication, and your program. And
> I'm not sure why you want to do it so badly.
> 
> What problem do you think you are solving by storing the replicated time?
> 
> -- 
> Owen Marshall
> FacilityONE
> omarshall@facilityone.com | (502) 805-2126
>

Re: Document Timestamp On Replication

Posted by Owen Marshall <om...@facilityone.com>.

On 05/04/2011 11:29 AM, Poyau, John wrote:
> -We want to keep track of the time that a document is added/updated in a source database

Then you definitely want an updated field per-document.

Implementing this varies with your needs. You could use a single
timestamp that gets clobbered each time, if you don't need a huge
auditing trail. You could also do a list of timestamps if it would prove
helpful.

One other technique that I'm especially fond of is to store changes as
attachments to each document. This gives you great audit trails -- who
made what change when. You could go so far as to store the full document
state before the change.

But if you don't need that level of auditing, a timestamp field is the
way to go.

> -We want to keep track of the time that a document get replicated to a target databases on replication.

Don't. Don't don't don't.

But because I hate it when the answer is "you're doing it wrong" and
nothing else, some notes:

* You will definitely want to separate the replication time from the
update time (as they clearly aren't the same thing.)

* Further, that *cannot* go in the document, clearly.

* You'd need at a minimum filtered/named replication to send the
documents you want, and an update handler to put the "replicated time"
in some other document.
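
A rough Python model of the last two notes, under stated assumptions: real CouchDB replication filters and update handlers are JavaScript functions stored in a design document, so this only sketches the logic they would need. The `replicated:` ID prefix and the names `should_replicate`, `tracking_doc`, `doc_rev`, and `replicated_timestamp` are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical ID prefix marking the local-only tracking documents.
TRACKING_PREFIX = "replicated:"


def should_replicate(doc):
    """Decision a replication filter would make: never send tracking docs."""
    return not doc["_id"].startswith(TRACKING_PREFIX)


def tracking_doc(doc, now=None):
    """Build the separate document that records when doc arrived.

    The replicated document itself is left untouched, so its revision
    does not change and replication can settle.
    """
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "_id": TRACKING_PREFIX + doc["_id"],
        "doc_rev": doc.get("_rev"),
        "replicated_timestamp": stamp,
    }
```

The point of the design is that the "replicated time" lives beside the data, not in it, and is itself excluded from replication.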

Again though, you never answered the simple question of *why* you want
to know this. Let me be clear: what you are trying to do adds a bunch of
complexity to your documents, your replication, and your program. And
I'm not sure why you want to do it so badly.

What problem do you think you are solving by storing the replicated time?

--
Owen Marshall
FacilityONE
omarshall@facilityone.com | (502) 805-2126

Re: Document Timestamp On Replication

Posted by "Poyau, John" <jo...@lmco.com>.

-----Original Message-----
From: Owen Marshall [mailto:omarshall@facilityone.com] 
Sent: Wednesday, May 04, 2011 10:33 AM
To: Poyau, John
Cc: user@couchdb.apache.org
Subject: EXTERNAL: Re: Document Timestamp On Replication

On 05/03/2011 04:54 PM, Poyau, John wrote:
> Thank you for your reply.  I am aware of the issue that you mentioned and that is why I posted my question to the list.
> 
> One approach that I thought of is to use a separate document to hold 
> the update_timestamp. These documents would not get replicated (they would be filtered out on replication). Documents that track the update_timestamp would get created using a document update handler, as you mentioned.
> 
> I posted my question to see if there was any other way to track document updates even through replication. 

What exactly are you trying to accomplish? Why do you want to update "update_timestamp" every time a document is replicated?

-We want to keep track of the time that a document is added/updated in a source database
-We want to keep track of the time that a document gets replicated to a target database.

If you are just looking to track the time a document was updated, just put an updated timestamp field on the document. When the user wants to update a document, bump the timestamp at that time. Let CouchDB take care of replicating the document. It's very good at that :)

If you want to handle conflicts (and you definitely should!) GET your documents with conflicts=true. You can then look at the conflicting documents and decide what to do. Options include:

* Picking the document with the biggest timestamp and discarding the other one.
* POST a new update that merges the changes from both documents and has a third and newer timestamp.
* Ask the user what to do!

Conflict resolution is very much up to you. Pick the one that makes the most sense to your program. Either way, know that replication conflicts
*will* occur as a result of the multi-version concurrency model. But conflicts are not bad, nor are they hard to handle.

The question you originally asked was how to update the timestamp _on replication_. That's bad. Why do you want to do this? The only reason I could think of is that you are trying to force some consistency across nodes. Don't. CouchDB is very good at that. As a matter of fact, if you ever think about layering your own synchronization on top of CouchDB, you should immediately stop, because you won't be happy with the results.

Of course, if that's not what you want to accomplish, please correct me.

--
Owen Marshall
FacilityONE
omarshall@facilityone.com | (502) 805-2126

Re: Document Timestamp On Replication

Posted by Owen Marshall <om...@facilityone.com>.

On 05/03/2011 04:54 PM, Poyau, John wrote:
> Thank you for your reply.  I am aware of the issue that you mentioned and that is why I posted my question to the list.
> 
> One approach that I thought of is to use a separate document to hold the update_timestamp. 
> These documents would not get replicated (they would be filtered out on replication). Documents that track the update_timestamp would get created using a document update handler, as you mentioned.
> 
> I posted my question to see if there was any other way to track document updates even through replication. 

What exactly are you trying to accomplish? Why do you want to update
"update_timestamp" every time a document is replicated?

If you are just looking to track the time a document was updated, just
put an updated timestamp field on the document. When the user wants to
update a document, bump the timestamp at that time. Let CouchDB take
care of replicating the document. It's very good at that :)

If you want to handle conflicts (and you definitely should!) GET your
documents with conflicts=true. You can then look at the conflicting
documents and decide what to do. Options include:

* Picking the document with the biggest timestamp and discarding the
other one.
* POST a new update that merges the changes from both documents and has
a third and newer timestamp.
* Ask the user what to do!

Conflict resolution is very much up to you. Pick the one that makes the
most sense to your program. Either way, know that replication conflicts
*will* occur as a result of the multi-version concurrency model. But
conflicts are not bad, nor are they hard to handle.
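
As a minimal sketch of the first option (keep the revision with the biggest timestamp), assuming the conflicting revisions have already been fetched into a list of dicts and that `update_timestamp` holds ISO 8601 strings in the same timezone, so that string comparison matches time order. The function name `pick_winner` is hypothetical.

```python
def pick_winner(revisions, key="update_timestamp"):
    """Split conflicting revisions into (winner, losers) by newest timestamp.

    Assumes every revision carries a comparable value under `key`,
    e.g. ISO 8601 strings that all use the same timezone.
    """
    if not revisions:
        raise ValueError("no revisions to choose from")
    ordered = sorted(revisions, key=lambda rev: rev[key], reverse=True)
    return ordered[0], ordered[1:]
```

The losing revisions would then be deleted so that the conflict is actually resolved; merging or asking the user would replace the sort with application-specific logic.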

The question you originally asked was how to update the timestamp _on
replication_. That's bad. Why do you want to do this? The only reason I
could think of is that you are trying to force some consistency across
nodes. Don't. CouchDB is very good at that. As a matter of fact, if you
ever think about layering your own synchronization on top of CouchDB,
you should immediately stop, because you won't be happy with the results.

Of course, if that's not what you want to accomplish, please correct me.

-- 
Owen Marshall
FacilityONE
omarshall@facilityone.com | (502) 805-2126

RE: Document Timestamp On Replication

Posted by "Poyau, John" <jo...@lmco.com>.

Owen,

Thank you for your reply.  I am aware of the issue that you mentioned and that is why I posted my question to the list.

One approach that I thought of is to use a separate document to hold the update_timestamp. 
These documents would not get replicated (they would be filtered out on replication). Documents that track the update_timestamp would get created using a document update handler, as you mentioned.

I posted my question to see if there was any other way to track document updates even through replication. 

Again thank you



-----Original Message-----
From: Owen Marshall [mailto:omarshall@facilityone.com] 
Sent: Tuesday, May 03, 2011 4:39 PM
To: user@couchdb.apache.org
Cc: Poyau, John
Subject: EXTERNAL: Re: Document Timestamp On Replication

On 05/03/2011 10:38 AM, Poyau, John wrote:
> What is the CouchDB way of updating a document's update_timestamp on replication?

Are you using continuous replication? If so, _don't ever do this_!

If you do, you will likely end up passing the same document back and forth between nodes, i.e., the user pushes a doc to node A, node A replicates to node B, node B bumps the update_timestamp, node A pulls the new document, node A bumps the update_timestamp... repeat ad nauseam.

Even if you aren't using continuous replication, you still _shouldn't do this_! Replicate your documents and handle conflicts when they happen.

If you have a truly good reason, you might be able to accomplish this with update handlers:
http://wiki.apache.org/couchdb/Document_Update_Handlers

But if you are not 100% completely positively sure, I'd urge you to read the excellent CouchDB guide on replication and on conflicts:

http://guide.couchdb.org/draft/replication.html
http://guide.couchdb.org/draft/conflicts.html

--
Owen Marshall
FacilityONE
omarshall@facilityone.com | (502) 805-2126

Re: Document Timestamp On Replication

Posted by Owen Marshall <om...@facilityone.com>.

On 05/03/2011 10:38 AM, Poyau, John wrote:
> What is the CouchDB way of updating a document's update_timestamp on replication?

Are you using continuous replication? If so, _don't ever do this_!

If you do, you will likely end up passing the same document back and
forth between nodes, i.e., the user pushes a doc to node A, node A
replicates to node B, node B bumps the update_timestamp, node A pulls
the new document, node A bumps the update_timestamp... repeat ad nauseam.
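
The loop can be shown with a toy simulation. This is a deliberately simplified model, not CouchDB's replicator: revisions are plain integers instead of real revision IDs, each node holds a single document, and the names `replicate` and `rounds_until_quiet` are hypothetical.

```python
def replicate(source, target, bump_on_arrival):
    """Copy the source doc to the target if their revisions differ.

    Returns True if a document was transferred.
    """
    if source["rev"] == target.get("rev"):
        return False  # target already has this revision, nothing to do
    target.update(source)
    if bump_on_arrival:
        # Changing the doc on arrival creates a new revision on the target.
        target["rev"] += 1
        target["update_timestamp"] += 1
    return True


def rounds_until_quiet(bump_on_arrival, max_rounds=10):
    """Run A->B then B->A each round; return the first round with no
    transfers, or None if the nodes never settle within max_rounds."""
    a = {"rev": 1, "update_timestamp": 0}
    b = {}
    for n in range(1, max_rounds + 1):
        moved = replicate(a, b, bump_on_arrival)
        moved = replicate(b, a, bump_on_arrival) or moved
        if not moved:
            return n
    return None
```

In this model `rounds_until_quiet(False)` settles after the document has been copied once, while `rounds_until_quiet(True)` returns `None` because every transfer creates a fresh revision for the other node to pull.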

Even if you aren't using continuous replication, you still _shouldn't do
this_! Replicate your documents and handle conflicts when they happen.

If you have a truly good reason, you might be able to accomplish this
with update handlers:
http://wiki.apache.org/couchdb/Document_Update_Handlers

But if you are not 100% completely positively sure, I'd urge you to read
the excellent CouchDB guide on replication and on conflicts:

http://guide.couchdb.org/draft/replication.html
http://guide.couchdb.org/draft/conflicts.html

-- 
Owen Marshall
FacilityONE
omarshall@facilityone.com | (502) 805-2126