You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@couchdb.apache.org by Miles Fidelman <mf...@meetinghouse.net> on 2009/10/23 21:22:55 UTC

massive replication?

A couple of the recent CouchDB powerpoint presentations illustrate 
replication across massive numbers of database instances.  Does that 
represent anything remotely possible, for real? 

We have an application that involves replicating data across very large 
numbers of nodes that are intermittently connected - where we'd like 
collections of data to replicate and synchronize as connectivity 
allows.  Think something like USENET news as an analogy.

I'm trying to sort out whether or not CouchDB is a potential platform.

Thanks very much,

Miles Fidelman


-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

Re: massive replication?

Posted by Adam Kocoloski <ko...@apache.org>.

On Oct 23, 2009, at 3:27 PM, Andrew Melo wrote:

> On Fri, Oct 23, 2009 at 2:22 PM, Miles Fidelman
> <mf...@meetinghouse.net> wrote:
>> A couple of the recent CouchDB powerpoint presentations illustrate
>> replication across massive numbers of database instances.  Does that
>> represent anything remotely possible, for real?
>> We have an application that involves replicating data across very  
>> large
>> numbers of nodes that are intermittently connected - where we'd like
>> collections of data to replicate and synchronize as connectivity  
>> allows.
>>  Think something like USENET news as an analogy.
>
> Hey Miles,
>
> I don't know if it would fit your needs (I've not used it personally),
> but this may be worth looking into:
>
> http://tilgovi.github.com/couchdb-lounge/
>
> Thanks,
> Andrew

Lounge is very cool, but it's more suited to distributing very large  
datasets over a smallish number of always-connected couches.  Miles'  
situation is tailor-made for CouchDB's replication model.  Best,

Adam

Re: massive replication?

Posted by Andrew Melo <an...@gmail.com>.

On Fri, Oct 23, 2009 at 2:22 PM, Miles Fidelman
<mf...@meetinghouse.net> wrote:
> A couple of the recent CouchDB powerpoint presentations illustrate
> replication across massive numbers of database instances.  Does that
> represent anything remotely possible, for real?
> We have an application that involves replicating data across very large
> numbers of nodes that are intermittently connected - where we'd like
> collections of data to replicate and synchronize as connectivity allows.
>  Think something like USENET news as an analogy.

Hey Miles,

I don't know if it would fit your needs (I've not used it personally),
but this may be worth looking into:

http://tilgovi.github.com/couchdb-lounge/

Thanks,
Andrew




>
> I'm trying to sort out whether or not CouchDB is a potential platform.
>
> Thanks very much,
>



> Miles Fidelman
>
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
>
>
>

Re: massive replication?

Posted by Paul Davis <pa...@gmail.com>.

Miles,

Oh. Gotchya. But the fun part is the protocol implementation ;)

Paul Davis

On Mon, Oct 26, 2009 at 2:21 PM, Miles Fidelman
<mf...@meetinghouse.net> wrote:
> Paul,
>
> I'm actually suggesting using NNTP infrastructure, or something like it, to
> propagate updates - rather than trying to reinvent the protocol and
> supporting infrastructure in CouchDB.
> Something like:
> change -> local CouchDB instance -<continuous replication (output)> -> local
> NNTP daemon (specific newsgroup) -> "the net"
>
> "the net" -> local NNTP daemon (specific newsgroup) -> <continuous
> replication (input) -> local CouchDB instance
>
> All messages eventually get to all Couch instances - but the order and delay
> can vary considerably.
>
> Miles
>
>
> Paul Davis wrote:
>>
>> Miles,
>>
>> This sounds like what I was trying to propose. More concretely:
>>
>>
>>>
>>> I keep coming back to NNTP (USENET) as a model for many-to-many
>>> messaging:
>>>
>>> - push a message into a newsgroup on any NNTP node subscribing to that
>>> newsgroup
>>>
>>
>> A message group is persisted in a DB. In the future, a single db with
>> filtered replication could work, but a db per group will probably be
>> easier. Not all nodes have all db's.
>>
>>
>>>
>>> - nodes exchange "I-have"/"You-have" on a regular basis
>>>
>>
>> Replication of the "network status db" that holds what nodes have what
>> update_seq/host pairs. The pairing is important. The rate of
>> replication obviously affects delays and what not.
>>
>>
>>>
>>> - message propagate to all subscribing nodes by essentially a flooding or
>>> epidemic routing mechanism
>>>
>>
>> This is the fun part. Given your local copy of a group, how do you
>> pick a peer to pull from. Or push to if you're feeling proactive. Do
>> we sync in both directions, etc etc. Depending on the behavior this
>> could manifest in multiple ways. If you have a single authority per
>> 'subscription source' then you'd want to keep
>> "authority/update_seq/located_at" triples. This way you know that if
>> the largest update_seq seen is N, then you can replicate from any node
>> that has that sequence.
>>
>>
>>>
>>> - pretty quickly, a message propagates to all nodes subscribing to the
>>> newsgroup
>>>
>>
>> Message delivery via replication. Judging the quickly part would
>> require a more formal definition of quickly and then building the
>> thing. Depending on your definition of quickly there are definitely
>> different types of design decisions to be made.
>>
>>
>>>
>>> - lack of connectivity simply delays message propagation
>>>
>>
>> Or reroutes through an alternate node. If we know that four nodes have
>> a copy of the db, we can replicate from any that are alive.
>> Replication is incremental, a link can disappear in the middle of a
>> replication and it'll resume from the previous check point.
>>
>>
>>>
>>> - the whole system scales massively, and is very robust in the face of
>>> connectivity outages, node failures, etc. (messages can flow across
>>> multiple
>>> routes)
>>>
>>
>> It scales on my brain debugger assuming the propagation algorithm
>> isn't extremely naīve. But as I said, I haven't built it yet so who
>> knows.
>>
>>
>>>
>>> In some sense, what I'm thinking of would look a lot like:
>>>
>>> - a group of CouchDB nodes all subscribe to a newsgroup
>>>
>>
>> I'm confused by the term subscription here. Generally I'd assume that
>> means that a foreign host knows that the local node is interested in
>> something. For fully distributed awesome, I think it'd be better
>> phrased as "nodes can declare what they want" and the algorithm will
>> attempt to maintain their local state somewhere close to the global
>> state. Kind of like the difference between newspaper delivery and
>> buying a newspaper at any one of the many shops in town.
>>
>>
>>>
>>> - each node publishes changes as messages to that newsgroup
>>>
>>
>> $ curl -X PUT -d '{"message": "First post!!1!1"}'
>> http://127.0.0.1:5984/newsgroup/memesrus
>>
>> Part of the replication routing could include a proactive step in
>> pushing messages to nodes it  knows care about a message.
>>
>>
>>>
>>> - NNTP takes care of getting messages everywhere, eventually
>>>
>>
>> The 'network state in a db' means that regardless of who's interested
>> in what, if we make sure that the first step in node communication is
>> a 'i know about these endpoints in these states' then we'll push
>> information to the people that care. Eventually.
>>
>>
>>>
>>> - each node looks for incoming messages and applies them as changes
>>>
>>
>> Replication FTW!
>>
>>
>>>
>>> - use a shared key to secure things (note: some implementations of NNTP
>>> already support secure messaging)
>>>
>>
>> This is slightly harder. Read only means you'd need something infront
>> of couch to prevent readers. But OAuth or distributing SSL certs or
>> similar wouldn't be out of the question.
>>
>>
>>>
>>> A similar approach could be taken using:
>>> - a distributed hash table as a message que (that's what spread and
>>> splines
>>> seem to do)
>>>
>>
>> I haven't looked to hard at these, but "DHT as message queue" seems
>> contradictory to the idea that most DHT's (all?) that I know of don't
>> allow range queries.
>>
>>
>>>
>>> - the DIS or HLA protocols (used for distributed simulation - keeping
>>> multiple copies of a "world" synchronized)
>>>
>>
>> I couldn't get past the first wiki page Google gave me, so, I dunno.
>>
>> HTH,
>> Paul Davis
>>
>
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
>
>
>

Re: massive replication?

Posted by Miles Fidelman <mf...@meetinghouse.net>.

Paul,

I'm actually suggesting using NNTP infrastructure, or something like it, 
to propagate updates - rather than trying to reinvent the protocol and 
supporting infrastructure in CouchDB. 

Something like: 

change -> local CouchDB instance -<continuous replication (output)> -> 
local NNTP daemon (specific newsgroup) -> "the net"

"the net" -> local NNTP daemon (specific newsgroup) -> <continuous 
replication (input) -> local CouchDB instance

All messages eventually get to all Couch instances - but the order and 
delay can vary considerably.

Miles


Paul Davis wrote:
> Miles,
>
> This sounds like what I was trying to propose. More concretely:
>
>   
>> I keep coming back to NNTP (USENET) as a model for many-to-many messaging:
>>
>> - push a message into a newsgroup on any NNTP node subscribing to that
>> newsgroup
>>     
>
> A message group is persisted in a DB. In the future, a single db with
> filtered replication could work, but a db per group will probably be
> easier. Not all nodes have all db's.
>
>   
>> - nodes exchange "I-have"/"You-have" on a regular basis
>>     
>
> Replication of the "network status db" that holds what nodes have what
> update_seq/host pairs. The pairing is important. The rate of
> replication obviously affects delays and what not.
>
>   
>> - message propagate to all subscribing nodes by essentially a flooding or
>> epidemic routing mechanism
>>     
>
> This is the fun part. Given your local copy of a group, how do you
> pick a peer to pull from. Or push to if you're feeling proactive. Do
> we sync in both directions, etc etc. Depending on the behavior this
> could manifest in multiple ways. If you have a single authority per
> 'subscription source' then you'd want to keep
> "authority/update_seq/located_at" triples. This way you know that if
> the largest update_seq seen is N, then you can replicate from any node
> that has that sequence.
>
>   
>> - pretty quickly, a message propagates to all nodes subscribing to the
>> newsgroup
>>     
>
> Message delivery via replication. Judging the quickly part would
> require a more formal definition of quickly and then building the
> thing. Depending on your definition of quickly there are definitely
> different types of design decisions to be made.
>
>   
>> - lack of connectivity simply delays message propagation
>>     
>
> Or reroutes through an alternate node. If we know that four nodes have
> a copy of the db, we can replicate from any that are alive.
> Replication is incremental, a link can disappear in the middle of a
> replication and it'll resume from the previous check point.
>
>   
>> - the whole system scales massively, and is very robust in the face of
>> connectivity outages, node failures, etc. (messages can flow across multiple
>> routes)
>>     
>
> It scales on my brain debugger assuming the propagation algorithm
> isn't extremely naïve. But as I said, I haven't built it yet so who
> knows.
>
>   
>> In some sense, what I'm thinking of would look a lot like:
>>
>> - a group of CouchDB nodes all subscribe to a newsgroup
>>     
>
> I'm confused by the term subscription here. Generally I'd assume that
> means that a foreign host knows that the local node is interested in
> something. For fully distributed awesome, I think it'd be better
> phrased as "nodes can declare what they want" and the algorithm will
> attempt to maintain their local state somewhere close to the global
> state. Kind of like the difference between newspaper delivery and
> buying a newspaper at any one of the many shops in town.
>
>   
>> - each node publishes changes as messages to that newsgroup
>>     
>
> $ curl -X PUT -d '{"message": "First post!!1!1"}'
> http://127.0.0.1:5984/newsgroup/memesrus
>
> Part of the replication routing could include a proactive step in
> pushing messages to nodes it  knows care about a message.
>
>   
>> - NNTP takes care of getting messages everywhere, eventually
>>     
>
> The 'network state in a db' means that regardless of who's interested
> in what, if we make sure that the first step in node communication is
> a 'i know about these endpoints in these states' then we'll push
> information to the people that care. Eventually.
>
>   
>> - each node looks for incoming messages and applies them as changes
>>     
>
> Replication FTW!
>
>   
>> - use a shared key to secure things (note: some implementations of NNTP
>> already support secure messaging)
>>     
>
> This is slightly harder. Read only means you'd need something infront
> of couch to prevent readers. But OAuth or distributing SSL certs or
> similar wouldn't be out of the question.
>
>   
>> A similar approach could be taken using:
>> - a distributed hash table as a message que (that's what spread and splines
>> seem to do)
>>     
>
> I haven't looked to hard at these, but "DHT as message queue" seems
> contradictory to the idea that most DHT's (all?) that I know of don't
> allow range queries.
>
>   
>> - the DIS or HLA protocols (used for distributed simulation - keeping
>> multiple copies of a "world" synchronized)
>>     
>
> I couldn't get past the first wiki page Google gave me, so, I dunno.
>
> HTH,
> Paul Davis
>   


-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

Re: massive replication?

Posted by Paul Davis <pa...@gmail.com>.

Miles,

This sounds like what I was trying to propose. More concretely:

> I keep coming back to NNTP (USENET) as a model for many-to-many messaging:
>
> - push a message into a newsgroup on any NNTP node subscribing to that
> newsgroup

A message group is persisted in a DB. In the future, a single db with
filtered replication could work, but a db per group will probably be
easier. Not all nodes have all db's.

> - nodes exchange "I-have"/"You-have" on a regular basis

Replication of the "network status db" that holds what nodes have what
update_seq/host pairs. The pairing is important. The rate of
replication obviously affects delays and what not.

> - message propagate to all subscribing nodes by essentially a flooding or
> epidemic routing mechanism

This is the fun part. Given your local copy of a group, how do you
pick a peer to pull from. Or push to if you're feeling proactive. Do
we sync in both directions, etc etc. Depending on the behavior this
could manifest in multiple ways. If you have a single authority per
'subscription source' then you'd want to keep
"authority/update_seq/located_at" triples. This way you know that if
the largest update_seq seen is N, then you can replicate from any node
that has that sequence.

> - pretty quickly, a message propagates to all nodes subscribing to the
> newsgroup

Message delivery via replication. Judging the quickly part would
require a more formal definition of quickly and then building the
thing. Depending on your definition of quickly there are definitely
different types of design decisions to be made.

> - lack of connectivity simply delays message propagation

Or reroutes through an alternate node. If we know that four nodes have
a copy of the db, we can replicate from any that are alive.
Replication is incremental, a link can disappear in the middle of a
replication and it'll resume from the previous check point.

> - the whole system scales massively, and is very robust in the face of
> connectivity outages, node failures, etc. (messages can flow across multiple
> routes)

It scales on my brain debugger assuming the propagation algorithm
isn't extremely naïve. But as I said, I haven't built it yet so who
knows.

> In some sense, what I'm thinking of would look a lot like:
>
> - a group of CouchDB nodes all subscribe to a newsgroup

I'm confused by the term subscription here. Generally I'd assume that
means that a foreign host knows that the local node is interested in
something. For fully distributed awesome, I think it'd be better
phrased as "nodes can declare what they want" and the algorithm will
attempt to maintain their local state somewhere close to the global
state. Kind of like the difference between newspaper delivery and
buying a newspaper at any one of the many shops in town.

> - each node publishes changes as messages to that newsgroup

$ curl -X PUT -d '{"message": "First post!!1!1"}'
http://127.0.0.1:5984/newsgroup/memesrus

Part of the replication routing could include a proactive step in
pushing messages to nodes it  knows care about a message.

> - NNTP takes care of getting messages everywhere, eventually

The 'network state in a db' means that regardless of who's interested
in what, if we make sure that the first step in node communication is
a 'i know about these endpoints in these states' then we'll push
information to the people that care. Eventually.

> - each node looks for incoming messages and applies them as changes

Replication FTW!

> - use a shared key to secure things (note: some implementations of NNTP
> already support secure messaging)

This is slightly harder. Read only means you'd need something infront
of couch to prevent readers. But OAuth or distributing SSL certs or
similar wouldn't be out of the question.

> A similar approach could be taken using:
> - a distributed hash table as a message que (that's what spread and splines
> seem to do)

I haven't looked to hard at these, but "DHT as message queue" seems
contradictory to the idea that most DHT's (all?) that I know of don't
allow range queries.

> - the DIS or HLA protocols (used for distributed simulation - keeping
> multiple copies of a "world" synchronized)

I couldn't get past the first wiki page Google gave me, so, I dunno.

HTH,
Paul Davis

Re: massive replication?

Posted by Jesse Hallett <ha...@gmail.com>.

After reading some Wikipedia it looks to me that Chris is right about an
information-spreading gossip protocol being the way to go.  You could have
each node replicate with a small number of other nodes at random - or with
its "neighbor" nodes - some fixed number of times for minute.  That should
get updates distributed effectively.

Then the problem comes down to maintaining a list of active peers.  One way
to do the would be to set bittorrent-style trackers: a handful of nodes that
are unlikely to go down very often.  Every Couch instance would register
itself with a tracker.

Another approach would be, as someone suggested, keeping a special database
in each node with a list of peers.  On every replication nodes would also
replicate this database so that everybody has the same list.  Peers could be
marked as inactive if they time out consistently.  The tricky part here is
what address to give new nodes as a starting point for their first
replication.

The problem that I see with a bus of updates is that update sequences are
local and will likely not match up from node to node.  You could use
timestamps instead of update sequences though if you expect all of the nodes
to have relatively synchronized clocks.  But I suggest avoiding this path.
Replications in CouchDB are cheap and it is ok to have lots of redundant
replication attempts.

Some sort of supervisory agent would be required for any solution.  But
CouchDB's replication ability should make the design of that agent far
easier than it would be with most other systems.

On Oct 26, 2009 9:13 AM, "Adam Kocoloski" <ko...@apache.org> wrote:

On Oct 26, 2009, at 11:35 AM, Chris Anderson wrote: > On Mon, Oct 26, 2009
at 8:33 AM, Miles Fidelm...
Sounds that way to me, too, although that could be because CouchDB is the
hammer I know really well.

I'm still trying to figure out how multicast fits into the picture.  I can
see it really helping to reduce bandwidth and server load in a case where
the nodes are all expected to be online 100% of the time, but if nodes are
coming and going they're likely to be requesting feeds at different starting
sequences much of the time.  What's the win in that case?

Best, Adam

Re: massive replication?

Posted by Miles Fidelman <mf...@meetinghouse.net>.

Adam Kocoloski wrote:
> On Oct 26, 2009, at 11:35 AM, Chris Anderson wrote:
>
>>
>>>>
>>>> 2) When these CouchDB servers drop off for an extended period and then
>>>> rejoin, how do they subscribe to the update feed from the 
>>>> replication bus at
>>>> a particular sequence?  This is really the key element of the 
>>>> setup.  When I
>>>> think of multicasting I think of video feeds and such, where if you 
>>>> drop off
>>>> and rejoin you don't care about the old stuff you missed.  That's 
>>>> not the
>>>> case here.  Does the bus store all this old feed data?
>>>
>>> Think of something like RSS, but with distributed infrastructure.
>>> A node would publish an update to a specific address (e.g., like 
>>> publishing
>>> an RSS feed).
>>>
>>> All nodes would subscribe to the feed, and receive new messages in 
>>> sequence.
>>>  When picking up updates, you ask for everything after a particular 
>>> sequence
>>> number.  The update service maintains the data.
>>
>> The best candidate for an update service like this is probably a 
>> CouchDB.
>
> Sounds that way to me, too, although that could be because CouchDB is 
> the hammer I know really well.
>
> I'm still trying to figure out how multicast fits into the picture.  I 
> can see it really helping to reduce bandwidth and server load in a 
> case where the nodes are all expected to be online 100% of the time, 
> but if nodes are coming and going they're likely to be requesting 
> feeds at different starting sequences much of the time.  What's the 
> win in that case?
>
Doesn't seem that way to me.  At the very least, for a fully distributed 
design (which is what we're seeking), this would require a backbone of 
multiple CouchDB instances plus a management infrastructure of some sort.

What I'm looking for is a way to avoid:

1. any kind of central node
2. the need to manage an unbounded number of 1-1 links between nodes

That requires some kind of many-many protocol that takes care of moving 
messages around.

Miles


-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

Re: massive replication?

Posted by Adam Kocoloski <ko...@apache.org>.

On Oct 26, 2009, at 11:35 AM, Chris Anderson wrote:

> On Mon, Oct 26, 2009 at 8:33 AM, Miles Fidelman
> <mf...@meetinghouse.net> wrote:
>> Adam Kocoloski wrote:
>>>
>>> On Oct 26, 2009, at 10:45 AM, Miles Fidelman wrote:
>>>
>>>> The environment we're looking at is more of a mesh where  
>>>> connectivity is
>>>> coming up and down - think mobile ad hoc networks.
>>>>
>>>> I like the idea of a replication bus, perhaps using something  
>>>> like spread
>>>> (http://www.spread.org/) or spines (www.spines.org) as a multi- 
>>>> cast fabric.
>>>>
>>>> I'm thinking something like continuous replication - but where the
>>>> updates are pushed to a multi-cast port rather than to a specific  
>>>> node, with
>>>> each node subscribing to update feeds.
>>>>
>>>> Anybody have any thoughts on how that would play with the current
>>>> replication and conflict resolution schemes?
>>>>
>>>> Miles Fidelman
>>>
>>> Hi Miles, this sounds like really cool stuff.  Caveat: I have no
>>> experience using Spread/Spines and very little experience with IP
>>> multicasting, which I guess is what those tools try to reproduce in
>>> internet-like environments.  So bear with me if I ask stupid  
>>> questions.
>>>
>>> 1) Would the CouchDB servers be responsible for error detection and
>>> correction?  I imagine that complicates matters considerably, but it
>>> wouldn't be impossible.
>>
>> Good question.  I hadn't quite thought that far ahead.  I think the  
>> basic
>> answer is no (assume reliable multicast), but... some kind of healing
>> mechanism would probably be required (see below).
>>>
>>> 2) When these CouchDB servers drop off for an extended period and  
>>> then
>>> rejoin, how do they subscribe to the update feed from the  
>>> replication bus at
>>> a particular sequence?  This is really the key element of the  
>>> setup.  When I
>>> think of multicasting I think of video feeds and such, where if  
>>> you drop off
>>> and rejoin you don't care about the old stuff you missed.  That's  
>>> not the
>>> case here.  Does the bus store all this old feed data?
>>
>> Think of something like RSS, but with distributed infrastructure.
>> A node would publish an update to a specific address (e.g., like  
>> publishing
>> an RSS feed).
>>
>> All nodes would subscribe to the feed, and receive new messages in  
>> sequence.
>>  When picking up updates, you ask for everything after a particular  
>> sequence
>> number.  The update service maintains the data.
>
> The best candidate for an update service like this is probably a  
> CouchDB.

Sounds that way to me, too, although that could be because CouchDB is  
the hammer I know really well.

I'm still trying to figure out how multicast fits into the picture.  I  
can see it really helping to reduce bandwidth and server load in a  
case where the nodes are all expected to be online 100% of the time,  
but if nodes are coming and going they're likely to be requesting  
feeds at different starting sequences much of the time.  What's the  
win in that case?

Best, Adam

Re: massive replication?

Posted by Chris Anderson <jc...@apache.org>.

On Mon, Oct 26, 2009 at 8:33 AM, Miles Fidelman
<mf...@meetinghouse.net> wrote:
> Adam Kocoloski wrote:
>>
>> On Oct 26, 2009, at 10:45 AM, Miles Fidelman wrote:
>>
>>> The environment we're looking at is more of a mesh where connectivity is
>>> coming up and down - think mobile ad hoc networks.
>>>
>>> I like the idea of a replication bus, perhaps using something like spread
>>> (http://www.spread.org/) or spines (www.spines.org) as a multi-cast fabric.
>>>
>>> I'm thinking something like continuous replication - but where the
>>> updates are pushed to a multi-cast port rather than to a specific node, with
>>> each node subscribing to update feeds.
>>>
>>> Anybody have any thoughts on how that would play with the current
>>> replication and conflict resolution schemes?
>>>
>>> Miles Fidelman
>>
>> Hi Miles, this sounds like really cool stuff.  Caveat: I have no
>> experience using Spread/Spines and very little experience with IP
>> multicasting, which I guess is what those tools try to reproduce in
>> internet-like environments.  So bear with me if I ask stupid questions.
>>
>> 1) Would the CouchDB servers be responsible for error detection and
>> correction?  I imagine that complicates matters considerably, but it
>> wouldn't be impossible.
>
> Good question.  I hadn't quite thought that far ahead.  I think the basic
> answer is no (assume reliable multicast), but... some kind of healing
> mechanism would probably be required (see below).
>>
>> 2) When these CouchDB servers drop off for an extended period and then
>> rejoin, how do they subscribe to the update feed from the replication bus at
>> a particular sequence?  This is really the key element of the setup.  When I
>> think of multicasting I think of video feeds and such, where if you drop off
>> and rejoin you don't care about the old stuff you missed.  That's not the
>> case here.  Does the bus store all this old feed data?
>
> Think of something like RSS, but with distributed infrastructure.
> A node would publish an update to a specific address (e.g., like publishing
> an RSS feed).
>
> All nodes would subscribe to the feed, and receive new messages in sequence.
>  When picking up updates, you ask for everything after a particular sequence
> number.  The update service maintains the data.

The best candidate for an update service like this is probably a CouchDB.

>>
>> 3) Which steps of the replication do you envision using the replication
>> bus?  Just the _changes feed (essentially a list of docid:rev pairs) or the
>> actual documents themselves?
>
> Any change to a local copy of the database (i.e., everything).
>>
>> The conflict resolution model shouldn't care about whether replication is
>> p2p or uses this bus.  Best,
>
> Thanks,
>
> Miles
>
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
>
>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: massive replication?

Posted by Miles Fidelman <mf...@meetinghouse.net>.

Paul Davis wrote:
> Miles,
>
> One the one hand, it sounds like you could solve this with XMPP and
> use CouchDB as a backing store. People have already connected RabbitMQ
> and CouchDB, I can't imagine that connecting ejabberd and CouchDB
> would be much harder. The pubsub extensions could very much be what
> you're wanting.
>
>   
A couple of issues with this - XMPP requires connectivity, it's not 
really asynchronous message passing.  And, both XMPP and RabbitMQ are 
really hub&spokes, not distributed models.

I keep coming back to NNTP (USENET) as a model for many-to-many messaging:

- push a message into a newsgroup on any NNTP node subscribing to that 
newsgroup
- nodes exchange "I-have"/"You-have" on a regular basis
- message propagate to all subscribing nodes by essentially a flooding 
or epidemic routing mechanism
- pretty quickly, a message propagates to all nodes subscribing to the 
newsgroup
- lack of connectivity simply delays message propagation
- the whole system scales massively, and is very robust in the face of 
connectivity outages, node failures, etc. (messages can flow across 
multiple routes)

In some sense, what I'm thinking of would look a lot like:

- a group of CouchDB nodes all subscribe to a newsgroup
- each node publishes changes as messages to that newsgroup
- NNTP takes care of getting messages everywhere, eventually
- each node looks for incoming messages and applies them as changes
- use a shared key to secure things (note: some implementations of NNTP 
already support secure messaging)

A similar approach could be taken using:
- a distributed hash table as a message que (that's what spread and 
splines seem to do)
- the DIS or HLA protocols (used for distributed simulation - keeping 
multiple copies of a "world" synchronized)

Miles

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

Re: massive replication?

Posted by Benoit Chesneau <bc...@gmail.com>.

On Mon, Oct 26, 2009 at 4:57 PM, Paul Davis <pa...@gmail.com> wrote:
> Miles,
>
> One the one hand, it sounds like you could solve this with XMPP and
> use CouchDB as a backing store. People have already connected RabbitMQ
> and CouchDB, I can't imagine that connecting ejabberd and CouchDB
> would be much harder. The pubsub extensions could very much be what
> you're wanting.

there are some code in github to star made by collecta guys. I wonder
if they use couchdb as backend.

http://github.com/twonds/ejabberd_couchdb/

- benoît

Re: massive replication?

Posted by Paul Davis <pa...@gmail.com>.

Miles,

One the one hand, it sounds like you could solve this with XMPP and
use CouchDB as a backing store. People have already connected RabbitMQ
and CouchDB, I can't imagine that connecting ejabberd and CouchDB
would be much harder. The pubsub extensions could very much be what
you're wanting.

Though, that doesn't let you write awesome distributed routing
protocols. So, really, what fun could that be?

I'm basically thinking something along the lines of:

1. Initial peer discovery is seeded or auto discovered from multicast
messages. Or some combination thereof depending on your network
topology.
2. Nodes gossip information they've found through replication with
nodes they know about. Something like, create a database that contains
network connections and replicate that along side with your data. A
view on this database would calculate each subscription end point's
highest known update_seq and who has it.
3. Each subscription end point is a database that can be replicated.

There are at least two hard parts I can spot right now:

1. Replication currently has no endpoint for detection. We don't
expose a list of _local/docs to anything so detecting when a node is
replicating too us would be harder. Adding a custom patch to alert a
db_update_notification process with _local/doc updates would be fairly
trivial.

2. The fun part of replication strategy in trying to maximize network
throughput, avoid stampeding a single updater, etc etc. There's enough
literature on this to get a general idea of where to go, but it'd
still probably take some tuning and experimentation. But with plenty
of monitoring and some foresight, giving your self the ability to
tweak system parameters and measure the effects would be quite a lot
of fun. ... Holy crap, I'm contemplating writing the basic routing
logic in Python/Ruby and replicating it out to the network, then as
each node gets a copy the router replaces its logic dynamically. Now
that'd be just plain awesome.

Note that this sort of protocol doesn't attempt to avoid bandwidth
usage by multi-casting documents which may be a non-starter depending
on your topology and characteristics. Instead the idea is to figure
out how to play with your network topology to react and match your
physical link network yet with enough redundancy that it doesn't break
as physical links come and go.

Granted, that's all off the top of my head, so until I sit down and
implement that on a couple hundred nodes it may well be a bunch of hot
air. :)

Paul Davis

On Mon, Oct 26, 2009 at 11:33 AM, Miles Fidelman
<mf...@meetinghouse.net> wrote:
> Adam Kocoloski wrote:
>>
>> On Oct 26, 2009, at 10:45 AM, Miles Fidelman wrote:
>>
>>> The environment we're looking at is more of a mesh where connectivity is
>>> coming up and down - think mobile ad hoc networks.
>>>
>>> I like the idea of a replication bus, perhaps using something like spread
>>> (http://www.spread.org/) or spines (www.spines.org) as a multi-cast fabric.
>>>
>>> I'm thinking something like continuous replication - but where the
>>> updates are pushed to a multi-cast port rather than to a specific node, with
>>> each node subscribing to update feeds.
>>>
>>> Anybody have any thoughts on how that would play with the current
>>> replication and conflict resolution schemes?
>>>
>>> Miles Fidelman
>>
>> Hi Miles, this sounds like really cool stuff.  Caveat: I have no
>> experience using Spread/Spines and very little experience with IP
>> multicasting, which I guess is what those tools try to reproduce in
>> internet-like environments.  So bear with me if I ask stupid questions.
>>
>> 1) Would the CouchDB servers be responsible for error detection and
>> correction?  I imagine that complicates matters considerably, but it
>> wouldn't be impossible.
>
> Good question.  I hadn't quite thought that far ahead.  I think the basic
> answer is no (assume reliable multicast), but... some kind of healing
> mechanism would probably be required (see below).
>>
>> 2) When these CouchDB servers drop off for an extended period and then
>> rejoin, how do they subscribe to the update feed from the replication bus at
>> a particular sequence?  This is really the key element of the setup.  When I
>> think of multicasting I think of video feeds and such, where if you drop off
>> and rejoin you don't care about the old stuff you missed.  That's not the
>> case here.  Does the bus store all this old feed data?
>
> Think of something like RSS, but with distributed infrastructure.
> A node would publish an update to a specific address (e.g., like publishing
> an RSS feed).
>
> All nodes would subscribe to the feed, and receive new messages in sequence.
>  When picking up updates, you ask for everything after a particular sequence
> number.  The update service maintains the data.
>>
>> 3) Which steps of the replication do you envision using the replication
>> bus?  Just the _changes feed (essentially a list of docid:rev pairs) or the
>> actual documents themselves?
>
> Any change to a local copy of the database (i.e., everything).
>>
>> The conflict resolution model shouldn't care about whether replication is
>> p2p or uses this bus.  Best,
>
> Thanks,
>
> Miles
>
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
>
>
>

Re: massive replication?

Posted by Miles Fidelman <mf...@meetinghouse.net>.

Adam Kocoloski wrote:
> On Oct 26, 2009, at 10:45 AM, Miles Fidelman wrote:
>
>> The environment we're looking at is more of a mesh where connectivity 
>> is coming up and down - think mobile ad hoc networks.
>>
>> I like the idea of a replication bus, perhaps using something like 
>> spread (http://www.spread.org/) or spines (www.spines.org) as a 
>> multi-cast fabric.
>>
>> I'm thinking something like continuous replication - but where the 
>> updates are pushed to a multi-cast port rather than to a specific 
>> node, with each node subscribing to update feeds.
>>
>> Anybody have any thoughts on how that would play with the current 
>> replication and conflict resolution schemes?
>>
>> Miles Fidelman
>
> Hi Miles, this sounds like really cool stuff.  Caveat: I have no 
> experience using Spread/Spines and very little experience with IP 
> multicasting, which I guess is what those tools try to reproduce in 
> internet-like environments.  So bear with me if I ask stupid questions.
>
> 1) Would the CouchDB servers be responsible for error detection and 
> correction?  I imagine that complicates matters considerably, but it 
> wouldn't be impossible.
Good question.  I hadn't quite thought that far ahead.  I think the 
basic answer is no (assume reliable multicast), but... some kind of 
healing mechanism would probably be required (see below).
>
> 2) When these CouchDB servers drop off for an extended period and then 
> rejoin, how do they subscribe to the update feed from the replication 
> bus at a particular sequence?  This is really the key element of the 
> setup.  When I think of multicasting I think of video feeds and such, 
> where if you drop off and rejoin you don't care about the old stuff 
> you missed.  That's not the case here.  Does the bus store all this 
> old feed data?
Think of something like RSS, but with distributed infrastructure. 

A node would publish an update to a specific address (e.g., like 
publishing an RSS feed).

All nodes would subscribe to the feed, and receive new messages in 
sequence.  When picking up updates, you ask for everything after a 
particular sequence number.  The update service maintains the data.
>
> 3) Which steps of the replication do you envision using the 
> replication bus?  Just the _changes feed (essentially a list of 
> docid:rev pairs) or the actual documents themselves?
Any change to a local copy of the database (i.e., everything).
> The conflict resolution model shouldn't care about whether replication 
> is p2p or uses this bus.  Best,
Thanks,

Miles


-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

Re: massive replication?

Posted by Adam Kocoloski <ko...@apache.org>.

On Oct 26, 2009, at 10:45 AM, Miles Fidelman wrote:

> Chris Anderson wrote:
>>
>> If you do a hub-and-spoke or ring topology you can simplify the
>> replication problem a bit. But a gossip protocol is more resistant to
>> down nodes. I'd like to see a replication bus in the open source
>> project.
>>
>> However, continuous replication will pretty much work.
>>
> The environment we're looking at is more of a mesh where  
> connectivity is coming up and down - think mobile ad hoc networks.
>
> I like the idea of a replication bus, perhaps using something like  
> spread (http://www.spread.org/) or spines (www.spines.org) as a  
> multi-cast fabric.
>
> I'm thinking something like continuous replication - but where the  
> updates are pushed to a multi-cast port rather than to a specific  
> node, with each node subscribing to update feeds.
>
> Anybody have any thoughts on how that would play with the current  
> replication and conflict resolution schemes?
>
> Miles Fidelman

Hi Miles, this sounds like really cool stuff.  Caveat: I have no  
experience using Spread/Spines and very little experience with IP  
multicasting, which I guess is what those tools try to reproduce in  
internet-like environments.  So bear with me if I ask stupid questions.

1) Would the CouchDB servers be responsible for error detection and  
correction?  I imagine that complicates matters considerably, but it  
wouldn't be impossible.

2) When these CouchDB servers drop off for an extended period and then  
rejoin, how do they subscribe to the update feed from the replication  
bus at a particular sequence?  This is really the key element of the  
setup.  When I think of multicasting I think of video feeds and such,  
where if you drop off and rejoin you don't care about the old stuff  
you missed.  That's not the case here.  Does the bus store all this  
old feed data?

3) Which steps of the replication do you envision using the  
replication bus?  Just the _changes feed (essentially a list of  
docid:rev pairs) or the actual documents themselves?

The conflict resolution model shouldn't care about whether replication  
is p2p or uses this bus.  Best,

Adam

Re: massive replication?

Posted by Miles Fidelman <mf...@meetinghouse.net>.

Chris Anderson wrote:
>
> If you do a hub-and-spoke or ring topology you can simplify the
> replication problem a bit. But a gossip protocol is more resistant to
> down nodes. I'd like to see a replication bus in the open source
> project.
>
> However, continuous replication will pretty much work.
>   
The environment we're looking at is more of a mesh where connectivity is 
coming up and down - think mobile ad hoc networks.

I like the idea of a replication bus, perhaps using something like 
spread (http://www.spread.org/) or spines (www.spines.org) as a 
multi-cast fabric.

I'm thinking something like continuous replication - but where the 
updates are pushed to a multi-cast port rather than to a specific node, 
with each node subscribing to update feeds.

Anybody have any thoughts on how that would play with the current 
replication and conflict resolution schemes?

Miles Fidelman

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

Re: massive replication?

Posted by Chris Anderson <jc...@apache.org>.

On Fri, Oct 23, 2009 at 2:47 PM, Miles Fidelman
<mf...@meetinghouse.net> wrote:
> Daniel Trümper wrote:
>>
>> Hi,
>>
>>> 2. It seems like there's a point at which explicit 1-1 replication starts
>>> to be an administrative nightmare.  Some kind of publish-subscribe or
>>> multi-cast update model seems needed.
>>
>> Would the new continuous replication feature be what you need? With this
>> all changes to A get automatically replicated to B, if I get things right
>> here...
>
> Well it's more of, when I change A, I want the changes to propagate to B
> through Z (or "all") - with some sort of multi-cast addressing rather than
> having to identify every node explicitly.

If you do a hub-and-spoke or ring topology you can simplify the
replication problem a bit. But a gossip protocol is more resistant to
down nodes. I'd like to see a replication bus in the open source
project.

However, continuous replication will pretty much work.

>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
>
>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: massive replication?

Posted by Daniel Trümper <tr...@googlemail.com>.

>>> 2. It seems like there's a point at which explicit 1-1 replication  
>>> starts to be an administrative nightmare.  Some kind of publish- 
>>> subscribe or multi-cast update model seems needed.
>> Would the new continuous replication feature be what you need? With  
>> this all changes to A get automatically replicated to B, if I get  
>> things right here...
> Well it's more of, when I change A, I want the changes to propagate  
> to B through Z (or "all") - with some sort of multi-cast addressing  
> rather than having to identify every node explicitly.
Hm, I guess you would have to create/write/program a custom layer on  
top of each CouchDB instance. Maybe an adpted version of the couchdb- 
proxy [1] that would listen on the multicast address. With this you  
could notice new instances being up and the replication from the  
master machines would be handled by the proxy, i.e. the proxy would  
send the replication commands to the master. You could then even do  
this in layers of proxies and CouchDBs...

But: while writing the above lines I notice that you may be want to  
have a look at Zookeeper [2] !?

> ZooKeeper is a centralized service for maintaining configuration  
> information, naming, providing distributed synchronization, and  
> providing group services.

You could basically wrap the start script and write additional  
information (like the CouchDB URL) into Zookeeper. One node could then  
periodically read the info about existing CouchDB instances and  
trigger or configure the replications...

Daniel

[1]: http://github.com/benoitc/couchdbproxy
[2]: http://hadoop.apache.org/zookeeper/

Re: massive replication?

Posted by Miles Fidelman <mf...@meetinghouse.net>.

Daniel Trümper wrote:
> Hi,
>
>> 2. It seems like there's a point at which explicit 1-1 replication 
>> starts to be an administrative nightmare.  Some kind of 
>> publish-subscribe or multi-cast update model seems needed.
> Would the new continuous replication feature be what you need? With 
> this all changes to A get automatically replicated to B, if I get 
> things right here...
Well it's more of, when I change A, I want the changes to propagate to B 
through Z (or "all") - with some sort of multi-cast addressing rather 
than having to identify every node explicitly.

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

Re: massive replication?

Posted by Daniel Trümper <tr...@googlemail.com>.

Hi,

> 2. It seems like there's a point at which explicit 1-1 replication  
> starts to be an administrative nightmare.  Some kind of publish- 
> subscribe or multi-cast update model seems needed.
Would the new continuous replication feature be what you need? With  
this all changes to A get automatically replicated to B, if I get  
things right here...

Daniel

Re: massive replication?

Posted by Miles Fidelman <mf...@meetinghouse.net>.

Adam Kocoloski wrote:
> On Oct 23, 2009, at 3:22 PM, Miles Fidelman wrote:
>
>> A couple of the recent CouchDB powerpoint presentations illustrate 
>> replication across massive numbers of database instances.  Does that 
>> represent anything remotely possible, for real?
>> We have an application that involves replicating data across very 
>> large numbers of nodes that are intermittently connected - where we'd 
>> like collections of data to replicate and synchronize as connectivity 
>> allows.  Think something like USENET news as an analogy.
>>
>> I'm trying to sort out whether or not CouchDB is a potential platform.
>>
>> Thanks very much,
>>
>> Miles Fidelman
>
> Hi Miles, we certainly tried to design for that use case.  Best,
>
Ok.... but that begs a few follow-up questions:

1. Has anybody tried it? (And documented it?)

2. It seems like there's a point at which explicit 1-1 replication 
starts to be an administrative nightmare.  Some kind of 
publish-subscribe or multi-cast update model seems needed.

Miles

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

Re: massive replication?

Posted by Adam Kocoloski <ko...@apache.org>.

On Oct 23, 2009, at 3:22 PM, Miles Fidelman wrote:

> A couple of the recent CouchDB powerpoint presentations illustrate  
> replication across massive numbers of database instances.  Does that  
> represent anything remotely possible, for real?
> We have an application that involves replicating data across very  
> large numbers of nodes that are intermittently connected - where  
> we'd like collections of data to replicate and synchronize as  
> connectivity allows.  Think something like USENET news as an analogy.
>
> I'm trying to sort out whether or not CouchDB is a potential platform.
>
> Thanks very much,
>
> Miles Fidelman

Hi Miles, we certainly tried to design for that use case.  Best,

Adam