Posted to dev@couchdb.apache.org by Paul Davis <pa...@gmail.com> on 2009/01/19 00:12:03 UTC

Update notifications including update sequence

Hey,

I'm working on this Lucene indexing stuff and I'm trying to write it
in such a way that I don't have to pound couchdb once per update. I
know that others have gone with either reindexing every N updates or
after a timeout, but I'm not sure that's the behavior people would want
for full text indexing.

The general update_notification outline is:

1. Receive notification with type == "updated"
2. while _all_docs_by_seq returns more data:
        index updates
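
Roughly, in Python (just a sketch to make the loop concrete: index_doc
is a placeholder, and the exact _all_docs_by_seq paging and response
shape here are my assumptions, not anything from the patch):

    import json
    import sys
    import urllib.request

    COUCH = "http://127.0.0.1:5984"
    last_seq = 0  # highest update_seq we've indexed so far

    def index_doc(db, doc_id):
        # Placeholder: fetch the doc and add/replace it in the Lucene index.
        pass

    def fetch_by_seq(db, since, limit=100):
        # Page through _all_docs_by_seq just past `since`; each row's key
        # is (assumed to be) the update_seq of that change.
        url = "%s/%s/_all_docs_by_seq?startkey=%d&limit=%d" % (COUCH, db, since, limit)
        with urllib.request.urlopen(url) as resp:
            return json.loads(resp.read())["rows"]

    def index_until_caught_up(db):
        global last_seq
        while True:
            rows = fetch_by_seq(db, last_seq)
            if not rows:
                return
            for row in rows:
                index_doc(db, row["id"])
                last_seq = row["key"]

    # update_notification processes get one JSON object per line on stdin.
    for line in sys.stdin:
        note = json.loads(line)
        if note.get("type") == "updated":
            index_until_caught_up(note["db"])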

The kicker is that while we're in that while loop, we may be receiving
more update notifications. Naively we could queue them up and process
them all, which means hitting couchdb at least once per write to the db
(which sucks), or we could discard all but one and just restart the
indexer when it thinks it's finished, etc.

After thinking about this, it seemed that a simple way to know whether
you need to start indexing again is for the notification sent to
update_notification processes to include the db's update_seq. An
indexer that is already storing the update_seq it has reached can then
just compare the two to see whether there's new work, without having to
make an HTTP request.

Then it just becomes "index until there are no new docs, then discard
all update notifications with an update_seq we've already indexed past."
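
Concretely, if the notification carried the db's update_seq, the
dispatch loop could drop stale notifications without an HTTP round
trip. Continuing the sketch above (the update_seq field on the
notification is the proposed addition, not something that exists today):

    for line in sys.stdin:
        note = json.loads(line)
        if note.get("type") != "updated":
            continue
        # Already indexed up to (or past) this point: nothing new to do.
        if note.get("update_seq", 0) <= last_seq:
            continue
        index_until_caught_up(note["db"])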

I've attached a patch that is extremely trivial, but I'd like to hear
whether anyone has feedback on the merits, or if there's just a better
way that I'm not thinking of.

Thanks,
Paul Davis

Re: Update notifications including update sequence

Posted by Paul Davis <pa...@gmail.com>.
On Sun, Jan 18, 2009 at 7:17 PM, Chris Anderson <jc...@gmail.com> wrote:
> On Sun, Jan 18, 2009 at 3:12 PM, Paul Davis <pa...@gmail.com> wrote:
>> I attached a patch that is extremely trivial, but I'd like to hear if
>> anyone has feed back on the merits or if there's just a better way
>> that I'm not thinking of.
>>
>
> I think this is a good way to do it (and useful for other things). The
> patch looks solid. I'd have to look more closely at the code to see if
> the update_seq's interactions with deferred commit need to be
> accounted for here.
>

Good call. I didn't think too hard about anything other than just
taking what was in the db record when the notification was sent.

>
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: Update notifications including update sequence

Posted by Chris Anderson <jc...@gmail.com>.
On Sun, Jan 18, 2009 at 3:12 PM, Paul Davis <pa...@gmail.com> wrote:
> I attached a patch that is extremely trivial, but I'd like to hear if
> anyone has feed back on the merits or if there's just a better way
> that I'm not thinking of.
>

I think this is a good way to do it (and useful for other things). The
patch looks solid. I'd have to look more closely at the code to see if
the update_seq's interactions with deferred commit need to be
accounted for here.



-- 
Chris Anderson
http://jchris.mfdz.com

Re: Update notifications including update sequence

Posted by Chris Anderson <jc...@gmail.com>.
On Sun, Jan 18, 2009 at 10:52 PM, Paul Davis
<pa...@gmail.com> wrote:
> On Mon, Jan 19, 2009 at 1:46 AM, Antony Blakey <an...@gmail.com> wrote:
>>
>> On 19/01/2009, at 3:51 PM, Paul Davis wrote:
>>
>>> There can be many _external processes for a single definition. So, not
>>> only are requests not serialized, they can be concurrent etc.
>>
>> Hmmm. I must be particularly thick today, because my reading of the code has
>> a single couch_external_manager creating and maintaining an instance of
>> couch_external_server *per* UrlName, with each couch_external_server
>> instance corresponding to a single invocation of the external process
>> backing that URL.
>>
>> Where am I going wrong?
>>
>
> Wow. I am the dumb one here. I was just checking it out again as well
> to pin down the spot you'd need. Turns out that everything I said
> about _external is dead wrong. Though, if it helps, the model I had in
> my head is definitely how view server processes work XD
>
> And now that I just got that into my head I'm scrapping the update
> notification side of my couchdb-lucene stuff and running it all from
> _external.
>
> Apologies for wasting everyone's time.
>

Not a waste of time. Perhaps on another thread, we should consider
enhancements to the db-update-notification process.



-- 
Chris Anderson
http://jchris.mfdz.com

Re: Update notifications including update sequence

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Jan 19, 2009 at 1:46 AM, Antony Blakey <an...@gmail.com> wrote:
>
> On 19/01/2009, at 3:51 PM, Paul Davis wrote:
>
>> There can be many _external processes for a single definition. So, not
>> only are requests not serialized, they can be concurrent etc.
>
> Hmmm. I must be particularly thick today, because my reading of the code has
> a single couch_external_manager creating and maintaining an instance of
> couch_external_server *per* UrlName, with each couch_external_server
> instance corresponding to a single invocation of the external process
> backing that URL.
>
> Where am I going wrong?
>

Wow. I am the dumb one here. I was just checking it out again as well
to pin down the spot you'd need. Turns out that everything I said
about _external is dead wrong. Though, if it helps, the model I had in
my head is definitely how view server processes work XD

And now that I just got that into my head I'm scrapping the update
notification side of my couchdb-lucene stuff and running it all from
_external.

Apologies for wasting everyone's time.

Paul Davis

> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> A Buddhist walks up to a hot-dog stand and says, "Make me one with
> everything". He then pays the vendor and asks for change. The vendor says,
> "Change comes from within".
>
>
>
>

Re: Update notifications including update sequence

Posted by Antony Blakey <an...@gmail.com>.
On 19/01/2009, at 3:51 PM, Paul Davis wrote:

> There can be many _external processes for a single definition. So, not
> only are requests not serialized, they can be concurrent etc.

Hmmm. I must be particularly thick today, because my reading of the  
code has a single couch_external_manager creating and maintaining an  
instance of couch_external_server *per* UrlName, with each  
couch_external_server instance corresponding to a single invocation of  
the external process backing that URL.

Where am I going wrong?

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

A Buddhist walks up to a hot-dog stand and says, "Make me one with  
everything". He then pays the vendor and asks for change. The vendor  
says, "Change comes from within".




Re: Update notifications including update sequence

Posted by Antony Blakey <an...@gmail.com>.
On 19/01/2009, at 5:05 PM, Antony Blakey wrote:

>> The ideas from the other thread about having a UUID per db and
>> compaction are interesting, are either of those included the fs  
>> layout
>> stuff you were working on?
>
> No. UUIDs are useful for the fs because you need a strictly  
> functional mapping from name -> file, and using a UUID is begging  
> the question.

s/UUIDs are useful/UUIDs are *not* useful/

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

In anything at all, perfection is finally attained not when there is  
no longer anything to add, but when there is no longer anything to  
take away.
   -- Antoine de Saint-Exupery



Re: Update notifications including update sequence

Posted by Antony Blakey <an...@gmail.com>.
On 19/01/2009, at 3:51 PM, Paul Davis wrote:

> There can be many _external processes for a single definition. So, not
> only are requests not serialized, they can be concurrent etc.

OK. I'll be patching my deployment to ensure a single external process  
per external definition.

IMO the _external system is considerably less useful in this form,
especially for external indexing. Concurrency and consistency should
be a matter for the external system to control, because it's the
external system that understands/imposes/relaxes the concurrency and
serialization requirements.

Maybe what's needed is an _external that acts more like a real server,
even if the single command channel then needs a request-multiplexing
protocol.


> A single _external process should only see monotonically increasing
> update_seq's. I think it's techincally possible to have a smaller
> update_seq processed later in time in a different os process though
> (later in time <= few ms).

Possible => broken.

> The ideas from the other thread about having a UUID per db and
> compaction are interesting, are either of those included the fs layout
> stuff you were working on?

No. UUIDs are useful for the fs because you need a strictly functional  
mapping from name -> file, and using a UUID is begging the question.

The compaction issue isn't real. My first thought is that the purge  
issue could be dealt with by a) having a notification of the purge and  
b) having the purge_seq be set to the update_seq of the snapshot seen  
by the purge. Maybe it works that way already.

I definitely prefer state transitions to be reified rather than  
notified, and IMO it's more consistent with the overall couch model.

Personally I think an _external system with a somewhat richer protocol
is required, rolling notifications in with the requests, so that an
external system can maintain accurate state-correspondence with the
canonical couch data, without exceptions, e.g. without needing some
sideband for database life-cycle events. It should also be possible to
make queries against a given snapshot, using either the request channel
or an additional HTTP parameter. The request channel is by far the
better idea because the snapshot can be implicitly scoped by the
request.

I am adding the db UUID, view function UUID and revs in view results  
for my own purposes, but there wasn't much interest on the list, and I  
haven't the time to convince/shepherd/clean/publish etc that proposal  
or implementation.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

I contend that we are both atheists. I just believe in one fewer god  
than you do. When you understand why you dismiss all the other  
possible gods, you will understand why I dismiss yours.
   --Stephen F Roberts



Re: Update notifications including update sequence

Posted by Paul Davis <pa...@gmail.com>.
On Sun, Jan 18, 2009 at 11:56 PM, Antony Blakey <an...@gmail.com> wrote:
>
> On 19/01/2009, at 2:53 PM, Paul Davis wrote:
>
>> On Sun, Jan 18, 2009 at 10:51 PM, Antony Blakey <an...@gmail.com>
>> wrote:
>>>
>>> I've previously posted a solution using _external that doesn't hit couch
>>> every update, and that maintains MVCC consistency and lazy-update view
>>> behaviour.
>>>
>>
>> Right. I tried looking through mark mail for a link to your
>> implementation but came up empty handed. I'd contemplated something
>> similar as well. The issue though is that Lucene index writers are
>> AFAIK not reentrant.
>
> Thread 'couchdb' started by Tim Parkin around 20/21 December.
>

Odd. I'd only noticed the last 2 or 3 posts of that thread before.
Thanks for the tip.

> IndexWriters are mutexed using a lock file.
>

Ew.

>> Thus the headache of coordinating multiple random
>> processes would start to suck. Lots.
>
> My reading of the code was that there was a single process for each
> _external definition (although admittedly that was early in my understanding
> of gen_server). Major consistency issues result if requests to the _external
> aren't serialized.
>

There can be many _external processes for a single definition. So, not
only are requests not serialized, they can be concurrent etc.

>>> The problem with using notifications is lack of snapshot coordination
>>> between the update process and the external process.
>>>
>>
>> I'd say this is use case dependent.
>
> It does mean that you can't guarantee that an external request (that does
> reference a given MVCC snapshot) is getting data from the same snapshot.
>
> You're right that's use case dependent, but the issue is whether the use
> case is 'free text indexing' or is a client use case. If the later, then you
> need to handle the situation where it *does* matter, so an implementation
> that has random characteristics is IMO less than optimal.
>

Err, right. It's use case dependent. If your (client-defined) use case
requires certain characteristics, the update_notification/_external
process may just not be the right tool for the job.

>>> The synchronisation between sequential _external calls is obvious e.g.
>>> guaranteeing that the _external process sees a monotonic increasing
>>> update_seq.
>>>
>>
>> I don't follow.
>
> I mean you'll never get a request in the context of an update_seq that your
> _external process has already advanced beyond, because the update_seqs seen
> by the external are a) serialized and b) only see a monotonic increasing
> sequence of update_seq values. Hence you can safely run an update process
> and set a 'last_update_seq_seen' (which is the key to avoiding hitting couch
> again) knowing that you never have to backtrack.
>

A single _external process should only see monotonically increasing
update_seq's. I think it's technically possible for a smaller
update_seq to be processed later in time in a different OS process,
though (later in time <= a few ms).

The ideas from the other thread about having a UUID per db and
compaction are interesting; are either of those included in the fs
layout stuff you were working on?

Paul

> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> Human beings, who are almost unique in having the ability to learn from the
> experience of others, are also remarkable for their apparent disinclination
> to do so.
>  -- Douglas Adams
>
>
>

Re: Update notifications including update sequence

Posted by Antony Blakey <an...@gmail.com>.
On 19/01/2009, at 2:53 PM, Paul Davis wrote:

> On Sun, Jan 18, 2009 at 10:51 PM, Antony Blakey <antony.blakey@gmail.com 
> > wrote:
>> I've previously posted a solution using _external that doesn't hit  
>> couch
>> every update, and that maintains MVCC consistency and lazy-update  
>> view
>> behaviour.
>>
>
> Right. I tried looking through mark mail for a link to your
> implementation but came up empty handed. I'd contemplated something
> similar as well. The issue though is that Lucene index writers are
> AFAIK not reentrant.

Thread 'couchdb' started by Tim Parkin around 20/21 December.

IndexWriters are mutexed using a lock file.

> Thus the headache of coordinating multiple random
> processes would start to suck. Lots.

My reading of the code was that there was a single process for each  
_external definition (although admittedly that was early in my  
understanding of gen_server). Major consistency issues result if  
requests to the _external aren't serialized.

>> The problem with using notifications is lack of snapshot coordination
>> between the update process and the external process.
>>
>
> I'd say this is use case dependent.

It does mean that you can't guarantee that an external request (that  
does reference a given MVCC snapshot) is getting data from the same  
snapshot.

You're right that it's use case dependent, but the issue is whether the
use case is 'free text indexing' or a client use case. If the latter,
then you need to handle the situation where it *does* matter, so an
implementation that has random characteristics is IMO less than
optimal.

>> The synchronisation between sequential _external calls is obvious  
>> e.g.
>> guaranteeing that the _external process sees a monotonic increasing
>> update_seq.
>>
>
> I don't follow.

I mean you'll never get a request in the context of an update_seq that
your _external process has already advanced beyond, because the
requests seen by the external are a) serialized and b) carry a
monotonically increasing sequence of update_seq values. Hence you can
safely run an update process and set a 'last_update_seq_seen' (which is
the key to avoiding hitting couch again) knowing that you never have to
backtrack.
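
Roughly, for an _external process (this assumes the request arrives as
one JSON object per line on stdin with the db info, including db_name
and update_seq, under "info", and that one JSON response per line goes
back on stdout; catch_up and last_update_seq_seen are just illustrative
names):

    import json
    import sys

    last_update_seq_seen = 0

    def catch_up(db_name, target_seq):
        # Placeholder: walk _all_docs_by_seq from last_update_seq_seen up
        # to target_seq and feed those documents to the index.
        pass

    for line in sys.stdin:
        req = json.loads(line)
        seq = req["info"]["update_seq"]
        if seq > last_update_seq_seen:
            # Requests are serialized and seqs only grow, so no backtracking.
            catch_up(req["info"]["db_name"], seq)
            last_update_seq_seen = seq
        sys.stdout.write(json.dumps({"code": 200, "json": {"ok": True}}) + "\n")
        sys.stdout.flush()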

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Human beings, who are almost unique in having the ability to learn  
from the experience of others, are also remarkable for their apparent  
disinclination to do so.
   -- Douglas Adams



Re: Update notifications including update sequence

Posted by Paul Davis <pa...@gmail.com>.
On Sun, Jan 18, 2009 at 10:51 PM, Antony Blakey <an...@gmail.com> wrote:
> I've previously posted a solution using _external that doesn't hit couch
> every update, and that maintains MVCC consistency and lazy-update view
> behaviour.
>

Right. I tried looking through MarkMail for a link to your
implementation but came up empty-handed. I'd contemplated something
similar as well. The issue, though, is that Lucene index writers are
AFAIK not reentrant, so the headache of coordinating multiple random
processes would start to suck. Lots.

> The problem with using notifications is lack of snapshot coordination
> between the update process and the external process.
>

I'd say this is use case dependent.

> The synchronisation between sequential _external calls is obvious e.g.
> guaranteeing that the _external process sees a monotonic increasing
> update_seq.
>

I don't follow.

You've mentioned an SQLite-backed _external process similar to the
GeoCouch project a few times on the mailing list. How do you manage to
keep things sane in the face of possibly multiple writers? I couldn't
figure anything out other than starting something with lock files,
which is just plain dirty. And FTI indexing is obviously too expensive
to do multiple times, so I can't just create an index per spawned OS
process or some such.
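
(The lock file version I mean is roughly this, assuming a POSIX
advisory flock; the names are made up:)

    import fcntl

    def with_index_write_lock(lock_path, write_fn):
        # Advisory lock: only one OS process at a time gets past flock();
        # everyone else blocks here until the current writer finishes.
        with open(lock_path, "w") as lock_file:
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                write_fn()  # the actual index write
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)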

Thanks,
Paul Davis

> On 19/01/2009, at 9:42 AM, Paul Davis wrote:
>
>> Hey,
>>
>> I'm working on this Lucene indexing stuff and I'm trying to write it
>> in such a way that I don't have to pound couchdb once per update. I
>> know that others have either gone every N updates or after a timeout,
>> but I'm not sure that's behavior that people would want in terms of
>> full text indexing.
>>
>> The general update_notification outline is:
>>
>> 1. Receive notification with type == "updated"
>> 2. while _all_docs_by_seq returns more data:
>>       index updates
>>
>> The kicker is that it's possible that while we're doing the while
>> loop, we're receiving more update notifications. Naively we could just
>> queue them up and process them all which leads to us hitting couchdb
>> at least once per write to the db (which is teh suck) or we could
>> discard them all except for one and just restart the indexer when it
>> thinks it's finished etc etc.
>>
>> After thinking about this, I thought that a simple way to actually
>> know if you need to start indexing again is if the notification sent
>> to update_notifications included the update_seq of the db. Then your
>> indexer that is already storing the current update_seq can just
>> compare if there's something new that needs to be worked on without
>> having to make an http request.
>>
>> Then it just becomes "index till no new docs, then discard all update
>> notifications with an update_seq we've already indexed past.
>>
>> I attached a patch that is extremely trivial, but I'd like to hear if
>> anyone has feed back on the merits or if there's just a better way
>> that I'm not thinking of.
>>
>> Thanks,
>> Paul Davis
>> <update_notification_sequene.patch>
>
> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> You can't just ask customers what they want and then try to give that to
> them. By the time you get it built, they'll want something new.
>  -- Steve Jobs
>
>
>
>

Re: Update notifications including update sequence

Posted by Antony Blakey <an...@gmail.com>.
I've previously posted a solution using _external that doesn't hit
couch every update, and that maintains MVCC consistency and lazy-update
view behaviour.

The problem with using notifications is the lack of snapshot
coordination between the update process and the external process.

The synchronisation between sequential _external calls is obvious, e.g.
guaranteeing that the _external process sees a monotonically increasing
update_seq.

On 19/01/2009, at 9:42 AM, Paul Davis wrote:

> Hey,
>
> I'm working on this Lucene indexing stuff and I'm trying to write it
> in such a way that I don't have to pound couchdb once per update. I
> know that others have either gone every N updates or after a timeout,
> but I'm not sure that's behavior that people would want in terms of
> full text indexing.
>
> The general update_notification outline is:
>
> 1. Receive notification with type == "updated"
> 2. while _all_docs_by_seq returns more data:
>        index updates
>
> The kicker is that it's possible that while we're doing the while
> loop, we're receiving more update notifications. Naively we could just
> queue them up and process them all which leads to us hitting couchdb
> at least once per write to the db (which is teh suck) or we could
> discard them all except for one and just restart the indexer when it
> thinks it's finished etc etc.
>
> After thinking about this, I thought that a simple way to actually
> know if you need to start indexing again is if the notification sent
> to update_notifications included the update_seq of the db. Then your
> indexer that is already storing the current update_seq can just
> compare if there's something new that needs to be worked on without
> having to make an http request.
>
> Then it just becomes "index till no new docs, then discard all update
> notifications with an update_seq we've already indexed past.
>
> I attached a patch that is extremely trivial, but I'd like to hear if
> anyone has feed back on the merits or if there's just a better way
> that I'm not thinking of.
>
> Thanks,
> Paul Davis
> <update_notification_sequene.patch>

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

You can't just ask customers what they want and then try to give that  
to them. By the time you get it built, they'll want something new.
   -- Steve Jobs