Posted to user@couchdb.apache.org by Zachary Zolton <za...@gmail.com> on 2012/03/05 17:23:35 UTC

Strategy for reliable _changes feed workers

Hi,

I've enjoyed using _changes for non-critical functionality such as keeping
a user interface up-to-date, but it also seems like a nice paradigm for
background processing.

For reliable background processing, my mind is inevitably drawn towards the
following questions:

  * What is the last sequence number processed?
  * Have we already attempted to process this update?
  * How many times have we failed to process this update?

Unfortunately, this all sounds like global state, which makes me feel
un-relaxed.

Now, you could help alleviate this by adding a work queue to your stack,
but it feels like CouchDB could handle this style of processing without it.
Also, some program would then have to sit between the _changes feed and
my background processors, which seems like a bottleneck.

How are you using the _changes feed for reliable background processing?


Cheers,

Zach

Re: Strategy for reliable _changes feed workers

Posted by Gabriel de Oliveira Barbosa <ma...@gmail.com>.
I asked a related question on
Quora <http://www.quora.com/CouchDB/What-are-the-best-way-to-cache-last-seq-of-_changes>,
but there are no replies.

On 5 March 2012 at 13:23, Zachary Zolton <za...@gmail.com> wrote:

> Hi,
>
> I've enjoyed using _changes for non-critical functionality such as keeping
> a user interface up-to-date, but it also seems like a nice paradigm for
> background processing as well.
>
> For reliable background processing, my mind is inevitably drawn towards the
> following questions:
>
>  * What is the last sequence number processed?
>  * Have we already attempted to process this update?
>  * How many times have we failed to process this update?
>
> Unfortunately, this all sounds like global state, which makes me feel
> un-relaxed.
>
> Now, you could help alleviate this by adding a work queue to your stack,
> but it feels like CouchDB could handle this style of processing without it.
> Also, some program would then have to sit between the _changes feed and
> my background processors, which seems like a bottleneck.
>
> How are you using the _changes feed for reliable background processing?
>
>
> Cheers,
>
> Zach
>

Re: Strategy for reliable _changes feed workers

Posted by Zachary Zolton <za...@gmail.com>.
>
> There's an issue with this in that you have an eventually consistent
> database and you require a consistent update (read my write) on a cluster.
> Yes you'd need to be a bit unlucky to have this happen but the race
> condition isn't that impossible to meet in a big enough system.
>
> That's especially true if "workers" are replicating from the central
> database, writing that update to the local DB and then replicating back,
> resulting in conflicts, confusion and horror. If your workers do something
> that isn't idempotent (e.g. launch a rocket) you kind of need a handshake
> between the different clients, it's not enough to just take an unprocessed
> document, you have to make sure that no one else has taken it, too. In the
> case of replicating workers you have a consistent winner and a set of
> conflicts. If your change to the database is the winner you do the work, if
> it's a conflict you don't.
> Cheers
> Simon
>

You're right about non-idempotent operations. I guess the workers should
try to put the document into an "in-process" state, and proceed only if
they don't receive an update conflict from Couch?
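
One way to sketch that claim step: CouchDB rejects a write whose _rev is
stale with an HTTP 409 Conflict, so on a single node only one worker's
update can succeed. In the Python sketch below, FakeDb, claim, and the
state/claimed_by fields are hypothetical names, and FakeDb is an in-memory
stand-in that only imitates the revision check; it is not a real client.

```python
class Conflict(Exception):
    """Stands in for CouchDB's HTTP 409 response to a stale _rev."""


class FakeDb:
    """In-memory stand-in that imitates CouchDB's revision check."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return dict(self.docs[doc_id])  # copy, like a fresh GET

    def put(self, doc):
        current = self.docs.get(doc["_id"])
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise Conflict(doc["_id"])
        generation = int(current["_rev"].split("-")[0]) + 1 if current else 1
        stored = dict(doc, _rev=f"{generation}-fake")
        self.docs[doc["_id"]] = stored
        return stored["_rev"]


def claim(db, doc, worker_id):
    """Try to move doc to "in-process"; return True only if we won."""
    if doc.get("state") != "new":
        return False
    attempt = dict(doc, state="in-process", claimed_by=worker_id)
    try:
        db.put(attempt)
    except Conflict:
        return False  # another worker updated the document first
    return True
```

As Simon points out, this check is only as strong as the database's
consistency: on a cluster, or with workers that replicate, two claims can
both appear to succeed.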

And then, you need another background watcher to look for workers that have
had a doc in-process longer than some timeout value. So, that's like another
program and view... Maybe a proper work queue is the Right Tool?
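
The timeout watcher could be as small as a filter over in-process
documents. A sketch in Python, assuming each claim records a hypothetical
claimed_at Unix timestamp next to the hypothetical state and claimed_by
fields; in practice the candidates would probably come from a view keyed
on that timestamp rather than from a list in memory.

```python
import time


def stale_claims(docs, timeout_seconds, now=None):
    """Return in-process docs whose claim is older than the timeout."""
    now = time.time() if now is None else now
    return [
        doc for doc in docs
        if doc.get("state") == "in-process"
        and now - doc["claimed_at"] > timeout_seconds
    ]


def release(doc):
    """Build the update that returns a stale doc to the queue."""
    released = dict(doc, state="new")
    released.pop("claimed_by", None)
    released.pop("claimed_at", None)
    return released
```

Writing the released body back is itself subject to update conflicts, so
two watchers racing to release the same document would not both succeed.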

Re: Strategy for reliable _changes feed workers

Posted by Simon Metson <si...@googlemail.com>.
Hi, 


On Monday, 5 March 2012 at 21:32, Robert Newson wrote:

> I'd urge caution here. The _changes feed allows the replicator to
> avoid reprocessing updates that the target has already seen but,
> crucially, replication is not broken if the feed includes old updates.
> In BigCouch, and hence a future version of CouchDB, the changes feed
> can sometimes contain rows from before the since= value, in the case
> of failover to a different replica of a shard.
> 
> 

+1, we've seen this exact issue in one of our projects.

> Clearly, in BigCouch, you could not depend on the changes feed to
> ensure you process an item exactly once, so I suggest it's a bad
> practice to assume the same of CouchDB. Instead, I would create a view
> that includes unprocessed items. Once processed (whatever that
> entails), update the document to indicate it has been processed. This
> will work everywhere.
> 
> 
> 

There's an issue with this in that you have an eventually consistent database and you require a consistent update (read my write) on a cluster. Yes you'd need to be a bit unlucky to have this happen but the race condition isn't that impossible to meet in a big enough system. 

That's especially true if "workers" are replicating from the central database, writing that update to the local DB and then replicating back, resulting in conflicts, confusion and horror. If your workers do something that isn't idempotent (e.g. launch a rocket) you kind of need a handshake between the different clients; it's not enough to just take an unprocessed document, you have to make sure that no one else has taken it, too. In the case of replicating workers you have a consistent winner and a set of conflicts. If your change to the database is the winner you do the work, if it's a conflict you don't.
Cheers
Simon
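
The winner check for replicating workers might look like the Python
sketch below. It relies on two CouchDB behaviours: a plain GET of a
document returns the deterministically chosen winning revision, and
adding ?conflicts=true lists the losing revisions under _conflicts. The
function names and the state/claimed_by fields are hypothetical, and the
sketch assumes replication has already delivered every competing claim,
which a worker cannot easily confirm.

```python
def should_do_work(winning_doc, worker_id):
    """True if this worker's claim is the winning revision.

    winning_doc is the body returned by a plain GET of the document
    after replication, i.e. the deterministic winner.
    """
    return (
        winning_doc.get("state") == "in-process"
        and winning_doc.get("claimed_by") == worker_id
    )


def lost_claims(winning_doc):
    """Losing revision ids, as listed by a GET with ?conflicts=true."""
    return winning_doc.get("_conflicts", [])
```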

Re: Strategy for reliable _changes feed workers

Posted by Mark Hahn <ma...@hahnca.com>.
I couldn't solve the consistency problem.  I ended up allowing multiple
workers on the same task, albeit rarely.  Then I detect the collision and
one worker knows to stop, if possible.  Luckily they are idempotent so
stopping is really just for efficiency's sake.

Re: Strategy for reliable _changes feed workers

Posted by Zachary Zolton <za...@gmail.com>.
>
> Clearly, in BigCouch, you could not depend on the changes feed to
> ensure you process an item exactly once, so I suggest it's a bad
> practice to assume the same of CouchDB. Instead, I would create a view
> that includes unprocessed items. Once processed (whatever that
> entails), update the document to indicate it has been processed. This
> will work everywhere.
>

So the view of items to be processed becomes the "global state". That makes
sense!

I suppose once the view of to-be-processed items becomes empty the workers
could wait for _changes.
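
A sketch of that loop in Python, with the database access passed in as
plain callables so nothing here depends on a particular client library.
The names run_worker, fetch_pending, process and wait_for_change are
hypothetical, and max_rounds exists only so the loop can be stopped in a
test.

```python
def run_worker(fetch_pending, process, wait_for_change, max_rounds=None):
    """Alternate between draining the queue view and waiting on _changes.

    fetch_pending()   -> list of docs from the "to be processed" view
    process(doc)      -> handles one doc and marks it processed
    wait_for_change() -> blocks until the database changes, e.g. a
                         GET /db/_changes?feed=longpoll request
    """
    rounds = 0
    handled = 0
    while max_rounds is None or rounds < max_rounds:
        pending = fetch_pending()
        if pending:
            for doc in pending:
                process(doc)
                handled += 1
        else:
            wait_for_change()  # used only as a wake-up signal
        rounds += 1
    return handled
```

Here the _changes feed only wakes the worker up; what to process is still
decided by the view, so a replayed or missed row does no harm.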

Re: Strategy for reliable _changes feed workers

Posted by Robert Newson <rn...@apache.org>.
I'd urge caution here. The _changes feed allows the replicator to
avoid reprocessing updates that the target has already seen but,
crucially, replication is not broken if the feed includes old updates.
In BigCouch, and hence a future version of CouchDB, the changes feed
can sometimes contain rows from before the since= value, in the case
of failover to a different replica of a shard.

Clearly, in BigCouch, you could not depend on the changes feed to
ensure you process an item exactly once, so I suggest it's a bad
practice to assume the same of CouchDB. Instead, I would create a view
that includes unprocessed items. Once processed (whatever that
entails), update the document to indicate it has been processed. This
will work everywhere.

B.
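
As an illustration of the view-based approach (not taken from the
message above), here is a minimal Python sketch. The design document
name "jobs", the view name "unprocessed" and the processed field are all
assumptions, and the map function assumes every document in the database
is a work item.

```python
# Design document; the map function is JavaScript, as CouchDB views expect.
# "jobs", "unprocessed" and the "processed" field are hypothetical names.
DESIGN_DOC = {
    "_id": "_design/jobs",
    "views": {
        "unprocessed": {
            "map": (
                "function (doc) {"
                " if (!doc.processed) { emit(doc._id, null); }"
                " }"
            )
        }
    },
}


def is_unprocessed(doc):
    """Python mirror of the map function's condition, for local testing."""
    return not doc.get("processed")


def mark_processed(doc):
    """Return the updated body to write back; the doc keeps its _rev."""
    return dict(doc, processed=True)
```

Once the updated body is saved, the document drops out of the view, so
the view itself records what is left to do.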

On 5 March 2012 21:13, Jens Alfke <je...@couchbase.com> wrote:
>
> On Mar 5, 2012, at 8:23 AM, Zachary Zolton wrote:
>
> > How are you using the _changes feed for reliable background processing?
>
> Well, the _changes feed is a key part of the CouchDB replicator, which uses it exactly as you’ve described.
>
> >  * What is the last sequence number processed?
> >  * Have we already attempted to process this update?
> >  * How many times have we failed to process this update?
>
> The replicator stores a “checkpoint” value, which is the latest sequence ID that it has completely processed. The logic of its full operation is pretty complex (though of course the source code is available).
>
> —Jens

Re: Strategy for reliable _changes feed workers

Posted by Jens Alfke <je...@couchbase.com>.
On Mar 5, 2012, at 8:23 AM, Zachary Zolton wrote:

> How are you using the _changes feed for reliable background processing?

Well, the _changes feed is a key part of the CouchDB replicator, which uses it exactly as you’ve described.

>  * What is the last sequence number processed?
>  * Have we already attempted to process this update?
>  * How many times have we failed to process this update?

The replicator stores a “checkpoint” value, which is the latest sequence ID that it has completely processed. The logic of its full operation is pretty complex (though of course the source code is available).

—Jens
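
A much simplified sketch of the checkpoint idea in Python; it is not the
replicator's actual algorithm. It assumes the worker keeps its checkpoint
in a document such as _local/worker (documents under _local/ are not
replicated) and that the caller supplies a hypothetical already_done
check, since the feed can replay rows the worker has seen before.

```python
def consume(changes, checkpoint, handle, already_done):
    """Process one batch of _changes rows and return the new checkpoint.

    changes           -> parsed _changes response:
                         {"results": [...], "last_seq": ...}
    checkpoint        -> dict such as {"_id": "_local/worker", "since": ...}
    handle(row)       -> does the work for one row
    already_done(row) -> True if this doc revision was handled before
    """
    for row in changes["results"]:
        if already_done(row):
            continue  # replayed row, e.g. from before the since= value
        handle(row)
    # Advance only after the whole batch succeeded. A crash mid-batch
    # means the batch is replayed, so handle() should be idempotent.
    return dict(checkpoint, since=changes["last_seq"])
```

The returned checkpoint would then be saved and its since value passed as
the since= parameter of the next _changes request.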