Posted to dev@couchdb.apache.org by Paul Davis <pa...@gmail.com> on 2009/01/19 00:12:03 UTC
Update notifications including update sequence
Hey,
I'm working on this Lucene indexing stuff and I'm trying to write it
in such a way that I don't have to pound couchdb once per update. I
know that others have either gone every N updates or after a timeout,
but I'm not sure that's behavior that people would want in terms of
full text indexing.
The general update_notification outline is:
1. Receive notification with type == "updated"
2. while _all_docs_by_seq returns more data:
       index updates
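That outline can be sketched roughly like this (Python purely for illustration; CouchDB writes one JSON object per line to the notifier's stdin, and index_batch here is a hypothetical stand-in for the _all_docs_by_seq loop):

```python
import io
import json
import sys

def handle_notifications(index_batch, stream=None):
    """Dispatch CouchDB update notifications to an indexing callback.

    CouchDB writes one JSON object per line to the notifier's stdin,
    e.g. {"type": "updated", "db": "mydb"}. index_batch stands in for
    step 2: loop over _all_docs_by_seq until it returns no more rows.
    """
    stream = stream or sys.stdin
    for line in stream:
        note = json.loads(line)
        if note.get("type") == "updated":
            index_batch(note["db"])

# Example with a fake stdin stream:
seen = []
handle_notifications(seen.append,
                     io.StringIO('{"type": "updated", "db": "mydb"}\n'))
```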
The kicker is that it's possible that while we're doing the while
loop, we're receiving more update notifications. Naively we could just
queue them up and process them all which leads to us hitting couchdb
at least once per write to the db (which is teh suck) or we could
discard them all except for one and just restart the indexer when it
thinks it's finished etc etc.
After thinking about this, I thought that a simple way to know whether
you need to start indexing again would be for the notification sent
to update_notifications to include the update_seq of the db. Then your
indexer, which is already storing the current update_seq, can just
check whether there's something new that needs to be worked on without
having to make an HTTP request.
Then it just becomes "index until no new docs, then discard all update
notifications with an update_seq we've already indexed past."
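As a sketch of that rule (hypothetical names; the Indexer class is a toy stand-in for the Lucene indexer, and the seq comparison is the only decision that matters):

```python
class Indexer:
    """Toy indexer that remembers the last update_seq it committed."""
    def __init__(self):
        self.last_seq = 0
        self.runs = 0

    def run_until_caught_up(self, current_seq):
        # Stand-in for: while _all_docs_by_seq returns rows, index them.
        self.runs += 1
        self.last_seq = current_seq

def on_notification(indexer, note):
    """Restart indexing only if the notification's update_seq is new.

    With update_seq included in the notification (what the patch
    proposes), stale notifications are discarded without any HTTP
    round-trip to CouchDB.
    """
    seq = note.get("update_seq", 0)
    if seq > indexer.last_seq:
        indexer.run_until_caught_up(seq)

idx = Indexer()
on_notification(idx, {"type": "updated", "update_seq": 5})
on_notification(idx, {"type": "updated", "update_seq": 3})  # stale, discarded
on_notification(idx, {"type": "updated", "update_seq": 9})
```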
I attached a patch that is extremely trivial, but I'd like to hear if
anyone has feedback on its merits or if there's just a better way
that I'm not thinking of.
Thanks,
Paul Davis
Re: Update notifications including update sequence
Posted by Paul Davis <pa...@gmail.com>.
On Sun, Jan 18, 2009 at 7:17 PM, Chris Anderson <jc...@gmail.com> wrote:
> On Sun, Jan 18, 2009 at 3:12 PM, Paul Davis <pa...@gmail.com> wrote:
>> I attached a patch that is extremely trivial, but I'd like to hear if
>> anyone has feed back on the merits or if there's just a better way
>> that I'm not thinking of.
>>
>
> I think this is a good way to do it (and useful for other things). The
> patch looks solid. I'd have to look more closely at the code to see if
> the update_seq's interactions with deferred commit need to be
> accounted for here.
>
Good call. I didn't think too hard about anything other than just
taking what was in the db record when the notification was sent.
>
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>
Re: Update notifications including update sequence
Posted by Chris Anderson <jc...@gmail.com>.
On Sun, Jan 18, 2009 at 3:12 PM, Paul Davis <pa...@gmail.com> wrote:
> I attached a patch that is extremely trivial, but I'd like to hear if
> anyone has feed back on the merits or if there's just a better way
> that I'm not thinking of.
>
I think this is a good way to do it (and useful for other things). The
patch looks solid. I'd have to look more closely at the code to see if
the update_seq's interactions with deferred commit need to be
accounted for here.
--
Chris Anderson
http://jchris.mfdz.com
Re: Update notifications including update sequence
Posted by Chris Anderson <jc...@gmail.com>.
On Sun, Jan 18, 2009 at 10:52 PM, Paul Davis
<pa...@gmail.com> wrote:
> On Mon, Jan 19, 2009 at 1:46 AM, Antony Blakey <an...@gmail.com> wrote:
>>
>> On 19/01/2009, at 3:51 PM, Paul Davis wrote:
>>
>>> There can be many _external processes for a single definition. So, not
>>> only are requests not serialized, they can be concurrent etc.
>>
>> Hmmm. I must be particularly thick today, because my reading of the code has
>> a single couch_external_manager creating and maintaining an instance of
>> couch_external_server *per* UrlName, with each couch_external_server
>> instance corresponding to a single invocation of the external process
>> backing that URL.
>>
>> Where am I going wrong?
>>
>
> Wow. I am the dumb one here. I was just checking it out again as well
> to pin down the spot you'd need. Turns out that everything I said
> about _external is dead wrong. Though, if it helps, the model I had in
> my head is definitely how view server processes work XD
>
> And now that I just got that into my head I'm scrapping the update
> notification side of my couchdb-lucene stuff and running it all from
> _external.
>
> Apologies for wasting everyone's time.
>
Not a waste of time. Perhaps on another thread, we should consider
enhancements to the db-update-notification process.
--
Chris Anderson
http://jchris.mfdz.com
Re: Update notifications including update sequence
Posted by Paul Davis <pa...@gmail.com>.
On Mon, Jan 19, 2009 at 1:46 AM, Antony Blakey <an...@gmail.com> wrote:
>
> On 19/01/2009, at 3:51 PM, Paul Davis wrote:
>
>> There can be many _external processes for a single definition. So, not
>> only are requests not serialized, they can be concurrent etc.
>
> Hmmm. I must be particularly thick today, because my reading of the code has
> a single couch_external_manager creating and maintaining an instance of
> couch_external_server *per* UrlName, with each couch_external_server
> instance corresponding to a single invocation of the external process
> backing that URL.
>
> Where am I going wrong?
>
Wow. I am the dumb one here. I was just checking it out again as well
to pin down the spot you'd need. Turns out that everything I said
about _external is dead wrong. Though, if it helps, the model I had in
my head is definitely how view server processes work XD
And now that I just got that into my head I'm scrapping the update
notification side of my couchdb-lucene stuff and running it all from
_external.
Apologies for wasting everyone's time.
Paul Davis
> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> A Buddhist walks up to a hot-dog stand and says, "Make me one with
> everything". He then pays the vendor and asks for change. The vendor says,
> "Change comes from within".
>
>
>
>
Re: Update notifications including update sequence
Posted by Antony Blakey <an...@gmail.com>.
On 19/01/2009, at 3:51 PM, Paul Davis wrote:
> There can be many _external processes for a single definition. So, not
> only are requests not serialized, they can be concurrent etc.
Hmmm. I must be particularly thick today, because my reading of the
code has a single couch_external_manager creating and maintaining an
instance of couch_external_server *per* UrlName, with each
couch_external_server instance corresponding to a single invocation of
the external process backing that URL.
Where am I going wrong?
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
A Buddhist walks up to a hot-dog stand and says, "Make me one with
everything". He then pays the vendor and asks for change. The vendor
says, "Change comes from within".
Re: Update notifications including update sequence
Posted by Antony Blakey <an...@gmail.com>.
On 19/01/2009, at 5:05 PM, Antony Blakey wrote:
>> The ideas from the other thread about having a UUID per db and
>> compaction are interesting, are either of those included the fs
>> layout
>> stuff you were working on?
>
> No. UUIDs are useful for the fs because you need a strictly
> functional mapping from name -> file, and using a UUID is begging
> the question.
s/UUIDs are useful/UUIDs are *not* useful/
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
In anything at all, perfection is finally attained not when there is
no longer anything to add, but when there is no longer anything to
take away.
-- Antoine de Saint-Exupery
Re: Update notifications including update sequence
Posted by Antony Blakey <an...@gmail.com>.
On 19/01/2009, at 3:51 PM, Paul Davis wrote:
> There can be many _external processes for a single definition. So, not
> only are requests not serialized, they can be concurrent etc.
OK. I'll be patching my deployment to ensure a single external process
per external definition.
IMO the _external system is considerably less useful in this form,
especially for external indexing. Concurrency and consistency should
be a matter for the external system to control, because it's the
external system that understands/imposes/relaxes the concurrency and
serialization requirements.
Maybe what's needed is an external that acts more like a real server,
even if the single command channel needs a request multiplexing protocol.
> A single _external process should only see monotonically increasing
> update_seq's. I think it's techincally possible to have a smaller
> update_seq processed later in time in a different os process though
> (later in time <= few ms).
Possible => broken.
> The ideas from the other thread about having a UUID per db and
> compaction are interesting, are either of those included the fs layout
> stuff you were working on?
No. UUIDs are useful for the fs because you need a strictly functional
mapping from name -> file, and using a UUID is begging the question.
The compaction issue isn't real. My first thought is that the purge
issue could be dealt with by a) having a notification of the purge and
b) having the purge_seq be set to the update_seq of the snapshot seen
by the purge. Maybe it works that way already.
I definitely prefer state transitions to be reified rather than
notified, and IMO it's more consistent with the overall couch model.
Personally I think an _external system with a richer protocol is
required, rolling notifications in with the requests, so that an
external system can maintain an accurate state-correspondence with the
canonical couch data, without exceptions, e.g. without needing some
sideband for database life-cycle events. It should also be possible to
make queries against a given snapshot, using either the request channel
or an additional HTTP parameter. The request channel is by far the
better idea because the snapshot can be implicitly scoped by the request.
I am adding the db UUID, view function UUID and revs in view results
for my own purposes, but there wasn't much interest on the list, and I
haven't the time to convince/shepherd/clean/publish etc that proposal
or implementation.
Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
I contend that we are both atheists. I just believe in one fewer god
than you do. When you understand why you dismiss all the other
possible gods, you will understand why I dismiss yours.
--Stephen F Roberts
Re: Update notifications including update sequence
Posted by Paul Davis <pa...@gmail.com>.
On Sun, Jan 18, 2009 at 11:56 PM, Antony Blakey <an...@gmail.com> wrote:
>
> On 19/01/2009, at 2:53 PM, Paul Davis wrote:
>
>> On Sun, Jan 18, 2009 at 10:51 PM, Antony Blakey <an...@gmail.com>
>> wrote:
>>>
>>> I've previously posted a solution using _external that doesn't hit couch
>>> every update, and that maintains MVCC consistency and lazy-update view
>>> behaviour.
>>>
>>
>> Right. I tried looking through mark mail for a link to your
>> implementation but came up empty handed. I'd contemplated something
>> similar as well. The issue though is that Lucene index writers are
>> AFAIK not reentrant.
>
> Thread 'couchdb' started by Tim Parkin around 20/21 December.
>
Odd. I'd only noticed the last 2 or 3 posts of that thread before.
Thanks for the tip.
> IndexWriters are mutexed using a lock file.
>
Ew.
>> Thus the headache of coordinating multiple random
>> processes would start to suck. Lots.
>
> My reading of the code was that there was a single process for each
> _external definition (although admittedly that was early in my understanding
> of gen_server). Major consistency issues result if requests to the _external
> aren't serialized.
>
There can be many _external processes for a single definition. So, not
only are requests not serialized, they can be concurrent etc.
>>> The problem with using notifications is lack of snapshot coordination
>>> between the update process and the external process.
>>>
>>
>> I'd say this is use case dependent.
>
> It does mean that you can't guarantee that an external request (that does
> reference a given MVCC snapshot) is getting data from the same snapshot.
>
> You're right that's use case dependent, but the issue is whether the use
> case is 'free text indexing' or is a client use case. If the later, then you
> need to handle the situation where it *does* matter, so an implementation
> that has random characteristics is IMO less than optimal.
>
Err, right. It's use case dependent. If your (client defined) use case
requires certain characteristics, the update_notification/_external
process may just not be the right tool for the job etc etc.
>>> The synchronisation between sequential _external calls is obvious e.g.
>>> guaranteeing that the _external process sees a monotonic increasing
>>> update_seq.
>>>
>>
>> I don't follow.
>
> I mean you'll never get a request in the context of an update_seq that your
> _external process has already advanced beyond, because the update_seqs seen
> by the external are a) serialized and b) only see a monotonic increasing
> sequence of update_seq values. Hence you can safely run an update process
> and set a 'last_update_seq_seen' (which is the key to avoiding hitting couch
> again) knowing that you never have to backtrack.
>
A single _external process should only see monotonically increasing
update_seq's. I think it's technically possible for a smaller
update_seq to be processed later in time in a different os process
though (later in time <= few ms).
The ideas from the other thread about having a UUID per db and
compaction are interesting; are either of those included in the fs
layout stuff you were working on?
Paul
> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> Human beings, who are almost unique in having the ability to learn from the
> experience of others, are also remarkable for their apparent disinclination
> to do so.
> -- Douglas Adams
>
>
>
Re: Update notifications including update sequence
Posted by Antony Blakey <an...@gmail.com>.
On 19/01/2009, at 2:53 PM, Paul Davis wrote:
> On Sun, Jan 18, 2009 at 10:51 PM, Antony Blakey <antony.blakey@gmail.com
> > wrote:
>> I've previously posted a solution using _external that doesn't hit
>> couch
>> every update, and that maintains MVCC consistency and lazy-update
>> view
>> behaviour.
>>
>
> Right. I tried looking through mark mail for a link to your
> implementation but came up empty handed. I'd contemplated something
> similar as well. The issue though is that Lucene index writers are
> AFAIK not reentrant.
Thread 'couchdb' started by Tim Parkin around 20/21 December.
IndexWriters are mutexed using a lock file.
> Thus the headache of coordinating multiple random
> processes would start to suck. Lots.
My reading of the code was that there was a single process for each
_external definition (although admittedly that was early in my
understanding of gen_server). Major consistency issues result if
requests to the _external aren't serialized.
>> The problem with using notifications is lack of snapshot coordination
>> between the update process and the external process.
>>
>
> I'd say this is use case dependent.
It does mean that you can't guarantee that an external request (that
does reference a given MVCC snapshot) is getting data from the same
snapshot.
You're right that's use case dependent, but the issue is whether the
use case is 'free text indexing' or is a client use case. If the
latter, then you need to handle the situation where it *does* matter,
so an implementation that has random characteristics is IMO less than
optimal.
>> The synchronisation between sequential _external calls is obvious
>> e.g.
>> guaranteeing that the _external process sees a monotonic increasing
>> update_seq.
>>
>
> I don't follow.
I mean you'll never get a request in the context of an update_seq that
your _external process has already advanced beyond, because the
update_seqs seen by the external are a) serialized and b) a
monotonically increasing sequence of values. Hence you can
safely run an update process and set a 'last_update_seq_seen' (which
is the key to avoiding hitting couch again) knowing that you never
have to backtrack.
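A minimal sketch of that invariant (hypothetical names; it assumes the serialized, monotonically increasing update_seq stream described above):

```python
def maybe_update(last_seen_seq, request_seq, run_update_pass):
    """Return the new last_update_seq_seen after handling a request.

    Because a single serialized _external process only ever sees
    increasing update_seqs, a request at or below last_seen_seq needs
    no work and no round-trip to couch; a newer seq triggers exactly
    one catch-up pass, and we never have to backtrack.
    """
    if request_seq > last_seen_seq:
        run_update_pass()
        return request_seq
    return last_seen_seq

passes = []
seq = maybe_update(0, 4, lambda: passes.append(4))    # new work: one pass
seq = maybe_update(seq, 4, lambda: passes.append(4))  # same seq: no pass
```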
Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
Human beings, who are almost unique in having the ability to learn
from the experience of others, are also remarkable for their apparent
disinclination to do so.
-- Douglas Adams
Re: Update notifications including update sequence
Posted by Paul Davis <pa...@gmail.com>.
On Sun, Jan 18, 2009 at 10:51 PM, Antony Blakey <an...@gmail.com> wrote:
> I've previously posted a solution using _external that doesn't hit couch
> every update, and that maintains MVCC consistency and lazy-update view
> behaviour.
>
Right. I tried looking through mark mail for a link to your
implementation but came up empty handed. I'd contemplated something
similar as well. The issue though is that Lucene index writers are
AFAIK not reentrant. Thus the headache of coordinating multiple random
processes would start to suck. Lots.
> The problem with using notifications is lack of snapshot coordination
> between the update process and the external process.
>
I'd say this is use case dependent.
> The synchronisation between sequential _external calls is obvious e.g.
> guaranteeing that the _external process sees a monotonic increasing
> update_seq.
>
I don't follow.
You've mentioned an SQLite db _external process similar to the GeoCouch
project a few times on the mailing list. How do you manage to keep
things sane in the face of possibly multiple writers? I couldn't
figure anything out other than starting something with lock files,
which is just plain dirty. And FTI indexing is obviously too expensive
to do multiple times, so I can't just create an index per spawned os
process or some such.
Thanks,
Paul Davis
> On 19/01/2009, at 9:42 AM, Paul Davis wrote:
>
>> Hey,
>>
>> I'm working on this Lucene indexing stuff and I'm trying to write it
>> in such a way that I don't have to pound couchdb once per update. I
>> know that others have either gone every N updates or after a timeout,
>> but I'm not sure that's behavior that people would want in terms of
>> full text indexing.
>>
>> The general update_notification outline is:
>>
>> 1. Receive notification with type == "updated"
>> 2. while _all_docs_by_seq returns more data:
>> index updates
>>
>> The kicker is that it's possible that while we're doing the while
>> loop, we're receiving more update notifications. Naively we could just
>> queue them up and process them all which leads to us hitting couchdb
>> at least once per write to the db (which is teh suck) or we could
>> discard them all except for one and just restart the indexer when it
>> thinks it's finished etc etc.
>>
>> After thinking about this, I thought that a simple way to actually
>> know if you need to start indexing again is if the notification sent
>> to update_notifications included the update_seq of the db. Then your
>> indexer that is already storing the current update_seq can just
>> compare if there's something new that needs to be worked on without
>> having to make an http request.
>>
>> Then it just becomes "index till no new docs, then discard all update
>> notifications with an update_seq we've already indexed past.
>>
>> I attached a patch that is extremely trivial, but I'd like to hear if
>> anyone has feed back on the merits or if there's just a better way
>> that I'm not thinking of.
>>
>> Thanks,
>> Paul Davis
>> <update_notification_sequene.patch>
>
> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> You can't just ask customers what they want and then try to give that to
> them. By the time you get it built, they'll want something new.
> -- Steve Jobs
>
>
>
>
Re: Update notifications including update sequence
Posted by Antony Blakey <an...@gmail.com>.
I've previously posted a solution using _external that doesn't hit
couch every update, and that maintains MVCC consistency and
lazy-update view behaviour.
The problem with using notifications is lack of snapshot coordination
between the update process and the external process.
The synchronisation between sequential _external calls is obvious,
e.g. guaranteeing that the _external process sees a monotonically
increasing update_seq.
On 19/01/2009, at 9:42 AM, Paul Davis wrote:
> Hey,
>
> I'm working on this Lucene indexing stuff and I'm trying to write it
> in such a way that I don't have to pound couchdb once per update. I
> know that others have either gone every N updates or after a timeout,
> but I'm not sure that's behavior that people would want in terms of
> full text indexing.
>
> The general update_notification outline is:
>
> 1. Receive notification with type == "updated"
> 2. while _all_docs_by_seq returns more data:
> index updates
>
> The kicker is that it's possible that while we're doing the while
> loop, we're receiving more update notifications. Naively we could just
> queue them up and process them all which leads to us hitting couchdb
> at least once per write to the db (which is teh suck) or we could
> discard them all except for one and just restart the indexer when it
> thinks it's finished etc etc.
>
> After thinking about this, I thought that a simple way to actually
> know if you need to start indexing again is if the notification sent
> to update_notifications included the update_seq of the db. Then your
> indexer that is already storing the current update_seq can just
> compare if there's something new that needs to be worked on without
> having to make an http request.
>
> Then it just becomes "index till no new docs, then discard all update
> notifications with an update_seq we've already indexed past.
>
> I attached a patch that is extremely trivial, but I'd like to hear if
> anyone has feed back on the merits or if there's just a better way
> that I'm not thinking of.
>
> Thanks,
> Paul Davis
> <update_notification_sequene.patch>
Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
You can't just ask customers what they want and then try to give that
to them. By the time you get it built, they'll want something new.
-- Steve Jobs