Posted to user@geronimo.apache.org by li...@bway.net on 2006/01/04 19:46:44 UTC

Replication using totem protocol

On Google I saw that the Apache Geronimo incubator had traces of a totem
protocol implementation (author "adc" ??). Is this in Geronimo, i.e.
replication using totem?

If you google "java totem protocol" the first link is my evs4j project,
the second one is a reference to old apache geronimo incubator code.


Re: Replication using totem protocol

Posted by li...@bway.net.
I will take a closer look at it. My first impression was that
ActiveCluster assumes a JMS or JMS-like API as a transport.

> lichtner@bway.net wrote:
>
>>>Given the inherent overhead in total order protocols, I think we
>>>should work to limit the messages passed over the protocol, to only
>>>the absolute minimum to make our cluster work reliably.
>>>Specifically, I think this is only the distributed lock.  For state
>>>replication we can use a much more efficient tcp or udp based protocol.
>>>
>>>
>>
>>As I said, if your workload has low data sharing (e.g. session
>>replication), you should not use totem. It's designed for systems where
>>_each_ processor needs _most_ of the messages.
>>
>>
> Geronimo has a number of replication use cases (I'll be enumerating them
> in a document that I am putting together at the moment). Totem may well
> suit some of these. If we were to look seriously at using it, I think
> the first technical consideration would be API. Geronimo already has
> ActiveCluster (AC) in the incubator and WADI (an HttpSession and SFSB
> clustering solution built on AC). AC is both an API to basic
> clustering functionality and a number of pluggable impls. My suggestion
> would be that we look at how we could map Totem to the AC API.
>
> Do Totem and AC (http://activecluster.codehaus.org/) look like a good
> match ?
>
>
> Jules
>
>
> --
> "Open Source is a self-assembling organism. You dangle a piece of
> string into a super-saturated solution and a whole operating-system
> crystallises out around it."
>
> /**********************************
>  * Jules Gosnell
>  * Partner
>  * Core Developers Network (Europe)
>  *
>  *    www.coredevelopers.net
>  *
>  * Open Source Training & Support.
>  **********************************/
>
>



Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
lichtner@bway.net wrote:

>>Given the inherent overhead in total order protocols, I think we
>>should work to limit the messages passed over the protocol, to only
>>the absolute minimum to make our cluster work reliably.
>>Specifically, I think this is only the distributed lock.  For state
>>replication we can use a much more efficient tcp or udp based protocol.
>>    
>>
>
>As I said, if your workload has low data sharing (e.g. session
>replication), you should not use totem. It's designed for systems where
>_each_ processor needs _most_ of the messages.
>  
>
Geronimo has a number of replication use cases (I'll be enumerating them
in a document that I am putting together at the moment). Totem may well
suit some of these. If we were to look seriously at using it, I think
the first technical consideration would be API. Geronimo already has
ActiveCluster (AC) in the incubator and WADI (an HttpSession and SFSB
clustering solution built on AC). AC is both an API to basic
clustering functionality and a number of pluggable impls. My suggestion
would be that we look at how we could map Totem to the AC API.

Do Totem and AC (http://activecluster.codehaus.org/) look like a good 
match ?


Jules


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by James Strachan <ja...@gmail.com>.
On 1/13/06, Jules Gosnell <ju...@coredevelopers.net> wrote:
>
> > Perhaps Totem might be useful here ?
> >
> ActiveSpace (in incubator) may already have support for this sort of thing?
> If you had a working clustered cache with a cache loader pointing at the
> entity...


Yes, ActiveSpace currently has a distributed JCache using optimistic
transactions which relies on total ordering; it could well be a good use
case for using totem. I'd be interested in a performance comparison with
Totem v ActiveMQ (indeed it'd be trivial to integrate Totem into ActiveMQ as
a transport; we already have multicast, UDP, TCP, SSL, HTTP et al).

FWIW when I get a chance I've been meaning to refactor the ActiveSpace code to
make use of Spring Remoting / Lingo (probably moving it into the Lingo
project) so it'd be much easier to decouple the remoting code.

--

James
-------
http://radio.weblogs.com/0112098/

Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.
> > If you cluster an entity bean on two nodes naively, you lose many of the
> > benefits of caching. This is because neither node, at the beginning of a
> > transaction, knows whether the other node has changed the bean's contents
> > since it was last loaded into cache, so the cache must be assumed
> > invalid. Thus, you find yourself going to the db much more frequently
> > than you would like, and the number of trips increases linearly with the
> > number of clients - i.e. you are no longer scalable.
>
>
> It depends on your transaction isolation level; i.e. do you want to do a
> dirty read or not. You should be able to enable dirty reads to get
> scalability & performance.

I like dirty reads from a theoretical standpoint because if you can do
dirty reads it means you have a high-class message bus. However, I don't
expect people to ask for dirty reads unless you mean that they are going
to roll back automatically. Example: inventory.

Non-transactional applications however could use dirty data and still
find it useful even if they don't roll back.

> The only way to really and truly know if the cache is up to date is to use a
> pessimistic read lock; but that's what databases are great for - so you might
> as well use the DB and not the cache in those circumstances. i.e. you always
> use caches for dirty reads

Major databases currently do not use read locks. Oracle and SQL Server use
mv2pl (multiversion two-phase locking). MySQL on the InnoDB storage engine
does also, and so does PostgreSQL. (Interestingly, the first article I know
of on this is Bernstein and Goodman, 1978 (!)). Sybase I don't know; I think
it may have fallen a bit behind.

When a tx starts you assign a cluster-wide unique id to it. That's its
'position' in time (known in Oracle as the scn, system change number).
When the tx writes a data item it creates a new version, tagged with this
scn. When a transaction wants to read the data, it reads the last version
before _its_ scn. So when you read you definitely don't need a lock. When
you write you can either use a lock (mv2pl) or proceed until you find a
conflict, in which case you roll back. The latter should be used in
workloads that have very little contention. Or you can use it in general
also, but then you need automatic retries, as with MDBs, and you should
really not send any data to the DBMS until you know for sure, to cut down
on the time required to roll back.
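
A minimal sketch of the read rule described above, in illustrative Java
(the names are assumptions, not code from evs4j or Geronimo):

    import java.util.Map;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of multiversion timestamp ordering: each write installs a
    // version tagged with the writer's scn; a reader sees the newest
    // version before its own scn, so reads never take locks.
    public class MultiVersionStore<K, V> {

        private final ConcurrentHashMap<K, TreeMap<Long, V>> versions =
                new ConcurrentHashMap<K, TreeMap<Long, V>>();

        // A write creates a new version tagged with the writing tx's scn.
        public void write(K key, V value, long scn) {
            TreeMap<Long, V> chain =
                    versions.computeIfAbsent(key, k -> new TreeMap<Long, V>());
            synchronized (chain) {
                chain.put(scn, value);
            }
        }

        // A read returns the newest version strictly before the reader's
        // scn, i.e. the last write before this tx's position in time.
        public V read(K key, long readerScn) {
            TreeMap<Long, V> chain = versions.get(key);
            if (chain == null) return null;
            synchronized (chain) {
                Map.Entry<Long, V> entry = chain.lowerEntry(readerScn);
                return entry == null ? null : entry.getValue();
            }
        }
    }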

> * invalidation; as each entity changes it sends an invalidation message -
> which is really simple & doesn't need total ordering, just something
> lightweight & fast. (Actually pure multicast is fine for invalidation stuff
> since messages are tiny & reliability is not that big a deal, particularly
> if coupled with a cache timeout/reload policy).
>
> * broadcasting the new data to interested parties (say everyone else in the
> cluster). This typically requires either (i) a global publisher (maybe
> listening to the DB transaction log) or (ii) total ordering if each entity
> bean server sends its changes.

That's the one.

> The former is good for very high update rates or very sparse caches, the
> latter is good when everyone interested in the cluster needs to cache mostly
> the same stuff & the cache size is sufficient that most nodes have all the
> same data in their cache. The former is more lightweight and simpler & a
> good first step :)

You can split up the data also. You can keep 4 replicas of each data item
instead of N, and just migrate it around. But for semi-constant data like
reference data, e.g. stock symbols or client data, you can keep copies
everywhere.

Guglielmo

Re: Replication using totem protocol

Posted by James Strachan <ja...@gmail.com>.
On 1/13/06, Jules Gosnell <ju...@coredevelopers.net> wrote:
>
> lichtner@bway.net wrote:
>
> >>lichtner@bway.net wrote:
> >>Okay. I will ask you a question then. What are you doing as far as caching
> >>entity beans?
> >
> >
> In terms of replication or some form of distributed invalidation, I'm
> not aware that this has been discussed yet.
>
> This is another one for the forthcoming doc - briefly :
>
> If you cluster an entity bean on two nodes naively, you lose many of the
> benefits of caching. This is because neither node, at the beginning of a
> transaction, knows whether the other node has changed the bean's contents
> since it was last loaded into cache, so the cache must be assumed
> invalid. Thus, you find yourself going to the db much more frequently
> than you would like, and the number of trips increases linearly with the
> number of clients - i.e. you are no longer scalable.


It depends on your transaction isolation level; i.e. do you want to do a
dirty read or not. You should be able to enable dirty reads to get
scalability & performance.

The only way to really and truly know if the cache is up to date is to use a
pessimistic read lock; but that's what databases are great for - so you might
as well use the DB and not the cache in those circumstances. i.e. you always
use caches for dirty reads
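
For reference, the dirty-read choice James describes is just an
isolation-level setting; a minimal JDBC sketch, with placeholder
connection parameters:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class DirtyReadConnection {
        // READ_UNCOMMITTED permits dirty reads: a reader may see changes
        // another transaction has written but not yet committed.
        public static Connection open(String jdbcUrl, String user, String pw)
                throws SQLException {
            Connection con = DriverManager.getConnection(jdbcUrl, user, pw);
            con.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
            return con;
        }
    }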

> If you can arrange for the cache on one node to notify the cache on
> other nodes, whenever an entity is changed, then the other caches can
> optimise their actions, so that rather than assuming that all beans are
> invalid, they can pinpoint the ones that actually are invalid and only
> reload those.
>
> You could go one step further and send, not an invalidation, but a
> replication message. This would contain the Entity's new value and head
> off any reloading from the DB at all....


The two main strategies here are

* invalidation; as each entity changes it sends an invalidation message -
which is really simple & doesn't need total ordering, just something
lightweight & fast. (Actually pure multicast is fine for invalidation stuff
since messages are tiny & reliability is not that big a deal, particularly
if coupled with a cache timeout/reload policy).

* broadcasting the new data to interested parties (say everyone else in the
cluster). This typically requires either (i) a global publisher (maybe
listening to the DB transaction log) or (ii) total ordering if each entity
bean server sends its changes.

The former is good for very high update rates or very sparse caches, the
latter is good when everyone interested in the cluster needs to cache mostly
the same stuff & the cache size is sufficient that most nodes have all the
same data in their cache. The former is more lightweight and simpler & a
good first step :)
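
A minimal sketch of the invalidation strategy over plain multicast (the
group address and port are arbitrary examples, and this is not
ActiveCluster's API):

    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;
    import java.util.Map;

    // When an entity changes, the writer multicasts its key; peers evict
    // that key and reload lazily from the DB on the next read. No ordering
    // or reliability machinery is needed for this.
    public class CacheInvalidator {
        private static final String GROUP_ADDR = "230.0.0.1"; // arbitrary example
        private static final int PORT = 4446;                 // arbitrary example

        // Called by the node that changed the entity.
        public static void broadcastInvalidation(String entityKey) throws Exception {
            byte[] payload = entityKey.getBytes("UTF-8");
            MulticastSocket socket = new MulticastSocket();
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(GROUP_ADDR), PORT));
            socket.close();
        }

        // Runs on every other node; evicts whatever key arrives.
        public static void listen(Map<String, ?> localCache) throws Exception {
            MulticastSocket socket = new MulticastSocket(PORT);
            socket.joinGroup(InetAddress.getByName(GROUP_ADDR));
            byte[] buf = new byte[256];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                String key = new String(packet.getData(), 0, packet.getLength(), "UTF-8");
                localCache.remove(key); // next read misses and reloads from the DB
            }
        }
    }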

James

--

James
-------
http://radio.weblogs.com/0112098/

Re: Replication using totem protocol

Posted by li...@bway.net.
> It's been talked about but currently not implemented. I'm catching up on the
> conversation and haven't looked at the pointers yet so I have a bit of
> reading to do.
>
> Are you thinking about using Totem to replicate Entity cache information
> in a cluster?

Yes.

You can take your pick of concurrency control protocol:

1. Lock locally for reads and possibly roll back if a write arrived with
an earlier transaction id. This works well for low sharing and MDB-based
transactions, because JMS will retry the transaction automatically.

2. Lock globally, shared locks for reads and exclusive locks for writes.
This is good if you really cannot handle rollbacks or you have close to
100% sharing so you are destined to wait anyway. It actually turns out
that if you have this type of workload _and_ you expect your application
to operate at peak throughput then totem provides pretty predictable
latency (typically a few ms), which is nice.

3. Multiversion timestamp ordering. See 1, but without read locks.

4. Multiversion two-phase locking. Get exclusive locks for writes, create
a new version for reads. This is my favorite.
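
A sketch of what option 2 can look like on top of totally ordered
broadcast; the bus interface below is a stand-in, not evs4j's actual API.
Because every node delivers lock requests in the same order, every node
computes the same queue, and the node at the head owns the lock:

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;

    public class TotalOrderLock {

        public interface TotalOrderBus {            // stand-in transport
            void broadcast(String resource, String node);
        }

        private final ConcurrentHashMap<String, Queue<String>> queues =
                new ConcurrentHashMap<String, Queue<String>>();
        private final String localNode;
        private final TotalOrderBus bus;

        public TotalOrderLock(String localNode, TotalOrderBus bus) {
            this.localNode = localNode;
            this.bus = bus;
        }

        // Acquire: broadcast the request, then wait until the totally
        // ordered delivery below puts this node at the head of the queue.
        public void acquire(String resource) throws InterruptedException {
            bus.broadcast(resource, localNode);
            Queue<String> q = queueFor(resource);
            synchronized (q) {
                while (!localNode.equals(q.peek())) q.wait();
            }
        }

        // The bus calls this on every node, in the same global order, so
        // all replicas of the queue stay identical.
        public void onDeliver(String resource, String node) {
            Queue<String> q = queueFor(resource);
            synchronized (q) {
                q.add(node);
                q.notifyAll();
            }
        }

        public void release(String resource) {
            // A release must also travel through the bus so every node
            // pops its queue at the same point in the total order.
        }

        private Queue<String> queueFor(String resource) {
            return queues.computeIfAbsent(resource, r -> new ArrayDeque<String>());
        }
    }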

Guglielmo



Re: Replication using totem protocol

Posted by Matt Hogstrom <ma...@hogstrom.org>.
It's been talked about but currently not implemented. I'm catching up on the
conversation and haven't looked at the pointers yet so I have a bit of reading
to do.

Are you thinking about using Totem to replicate Entity cache information in a 
cluster?

lichtner@bway.net wrote:
>>lichtner@bway.net wrote:
>>
>>>Well, you guys let me know if I can help you in any way.
>>
>>Keep on talking ;-)
> 
> 
> Okay. I will ask you a question then. What are you doing as far as caching
> entity beans?
> 
> 
> 
> 

Re: Replication using totem protocol (Blevins or D'Amour, you're being called)

Posted by Jeff Genender <jg...@apache.org>.
Great question...

This is a question for Gianni or David Blevins... the OpenEJB experts ;-)

lichtner@bway.net wrote:
>> lichtner@bway.net wrote:
>>> Well, you guys let me know if I can help you in any way.
>> Keep on talking ;-)
> 
> Okay. I will ask you a question then. What are you doing as far as caching
> entity beans?

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
further thoughts at bottom....

Jules Gosnell wrote:

> lichtner@bway.net wrote:
>
>>> lichtner@bway.net wrote:
>>>   
>>>
>>>> Well, you guys let me know if I can help you in any way.
>>>>     
>>>
>>> Keep on talking ;-)
>>>   
>>
>>
>> Okay. I will ask you a question then. What are you doing as far as caching
>> entity beans?
>>  
>>
> In terms of replication or some form of distributed invalidation, I'm
> not aware that this has been discussed yet.
>
> This is another one for the forthcoming doc - briefly :
>
> If you cluster an entity bean on two nodes naively, you lose many of 
> the benefits of caching. This is because neither node, at the 
> beginning of a transaction, knows whether the other node has changed 
> the bean's contents since it was last loaded into cache, so the cache
> must be assumed invalid. Thus, you find yourself going to the db much 
> more frequently than you would like, and the number of trips increases 
> linearly with the number of clients - i.e. you are no longer scalable.
>
> If you can arrange for the cache on one node to notify the cache on 
> other nodes, whenever an entity is changed, then the other caches can 
> optimise their actions, so that rather than assuming that all beans are
> invalid, they can pinpoint the ones that actually are invalid and only 
> reload those.
>
> You could go one step further and send, not an invalidation, but a 
> replication message. This would contain the Entity's new value and 
> head off any reloading from the DB at all....
>
> All of this needs to be properly integrated with e.g. transactions, 
> locking etc...
>
> Perhaps Totem might be useful here ?
>
ActiveSpace (in incubator) may already have support for this sort of thing ?
If you had a working clustered cache with a cache loader pointing at the 
entity...

Jules

>
> Jules
>
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by li...@bway.net.
> You could go one step further and send, not an invalidation, but a
> replication message. This would contain the Entity's new value and head
> off any reloading from the DB at all....
>
> All of this needs to be properly integrated with e.g. transactions,
> locking etc...
>
> Perhaps Totem might be useful here ?

Yes, I would say in this area it would be worth using totem. As I said
in a different email, you would have to pick a concurrency control
protocol. I favor multi-version 2-phase locking, or multiversion timestamp
ordering in the case of transactions that don't mind rolling back.


Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
lichtner@bway.net wrote:

>>lichtner@bway.net wrote:
>>    
>>
>>>Well, you guys let me know if I can help you in any way.
>>>      
>>>
>>Keep on talking ;-)
>>    
>>
>
>Okay. I will ask you a question then. What are you doing as far as caching
>entity beans?
>  
>
In terms of replication or some form of distributed invalidation, I'm
not aware that this has been discussed yet.

This is another one for the forthcoming doc - briefly :

If you cluster an entity bean on two nodes naively, you lose many of the 
benefits of caching. This is because neither node, at the beginning of a 
transaction, knows whether the other node has changed the bean's contents
since it was last loaded into cache, so the cache must be assumed 
invalid. Thus, you find yourself going to the db much more frequently 
than you would like, and the number of trips increases linearly with the 
number of clients - i.e. you are no longer scalable.

If you can arrange for the cache on one node to notify the cache on 
other nodes, whenever an entity is changed, then the other caches can 
optimise their actions, so that rather than assuming that all beans are
invalid, they can pinpoint the ones that actually are invalid and only 
reload those.

You could go one step further and send, not an invalidation, but a 
replication message. This would contain the Entity's new value and head 
off any reloading from the DB at all....

All of this needs to be properly integrated with e.g. transactions, 
locking etc...

Perhaps Totem might be useful here ?


Jules


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by li...@bway.net.
> lichtner@bway.net wrote:
>> Well, you guys let me know if I can help you in any way.
>
> Keep on talking ;-)

Okay. I will ask you a question then. What are you doing as far as caching
entity beans?


Re: Replication using totem protocol

Posted by Jeff Genender <jg...@apache.org>.

lichtner@bway.net wrote:
> Well, you guys let me know if I can help you in any way.

Keep on talking ;-)


> 
>> I think there is a time and place for this and can be leveraged in other
>> protocols.  As a minimum it can be a pluggable protocol.  Its a great
>> start.
>>
>> lichtner@bway.net wrote:
>>>> Given the inherent overhead in total order protocols, I think we
>>>> should work to limit the messages passed over the protocol, to only
>>>> the absolute minimum to make our cluster work reliably.
>>>> Specifically, I think this is only the distributed lock.  For state
>>>> replication we can use a much more efficient tcp or udp based protocol.
>>> As I said, if your workload has low data sharing (e.g. session
>>> replication), you should not use totem. It's designed for systems where
>>> _each_ processor needs _most_ of the messages.
>>>
>>>
>>>
> 

Re: Replication using totem protocol

Posted by li...@bway.net.
Well, you guys let me know if I can help you in any way.

> I think there is a time and place for this, and it can be leveraged in other
> protocols.  At a minimum it can be a pluggable protocol.  It's a great
> start.
>
> lichtner@bway.net wrote:
>>> Given the inherent overhead in total order protocols, I think we
>>> should work to limit the messages passed over the protocol, to only
>>> the absolute minimum to make our cluster work reliably.
>>> Specifically, I think this is only the distributed lock.  For state
>>> replication we can use a much more efficient tcp or udp based protocol.
>>
>> As I said, if your workload has low data sharing (e.g. session
>> replication), you should not use totem. It's designed for systems where
>> _each_ processor needs _most_ of the messages.
>>
>>
>>
>



Re: Replication using totem protocol

Posted by Jeff Genender <jg...@apache.org>.
I think there is a time and place for this, and it can be leveraged in other
protocols.  At a minimum it can be a pluggable protocol.  It's a great start.

lichtner@bway.net wrote:
>> Given the inherent overhead in total order protocols, I think we
>> should work to limit the messages passed over the protocol, to only
>> the absolute minimum to make our cluster work reliably.
>> Specifically, I think this is only the distributed lock.  For state
>> replication we can use a much more efficient tcp or udp based protocol.
> 
> As I said, if your workload has low data sharing (e.g. session
> replication), you should not use totem. It's designed for systems where
> _each_ processor needs _most_ of the messages.
> 
> 
> 

Re: Replication using totem protocol

Posted by li...@bway.net.
> Given the inherent overhead in total order protocols, I think we
> should work to limit the messages passed over the protocol, to only
> the absolute minimum to make our cluster work reliably.
> Specifically, I think this is only the distributed lock.  For state
> replication we can use a much more efficient tcp or udp based protocol.

As I said, if your workload has low data sharing (e.g. session
replication), you should not use totem. It's designed for systems where
_each_ processor needs _most_ of the messages.





Re: Replication using totem protocol

Posted by Dain Sundstrom <da...@iq80.com>.
Given the inherent overhead in total order protocols, I think we
should work to limit the messages passed over the protocol to only
the absolute minimum to make our cluster work reliably.
Specifically, I think this is only the distributed lock.  For state
replication we can use a much more efficient TCP or UDP based protocol.

-dain

On Jan 12, 2006, at 3:02 PM, Alan D. Cabrera wrote:

> The nice thing about it is that there are no ACKS/NACKS so, it's  
> not very chatty.  The bad thing is that you have to wait for the  
> token to come your way before you can broadcast; if there are a lot  
> of participants in the group the latency will be larger than you
> might like.
>
> http://citeseer.csail.mit.edu/amir95totem.html
>
>
> Regards,
> Alan
>
> Jeff Genender wrote, On 1/12/2006 2:53 PM:
>> Guglielmo,
>> Ok... let's chat about the technical components of Totem.  What are the
>> strengths and weaknesses?  Are there scaling issues, and if so, are
>> there some mitigation strategies from an algorithm perspective?
>> Feel free to elaborate...this is great stuff.
>> Thanks,
>> Jeff
>> lichtner@bway.net wrote:
>>>> Yes...awesome.  Bruce had chatted with me about this too...I am  
>>>> very
>>>> interested.
>>>
>>> Thanks.
>>>
>>>
>>>> Guglielmo, I would be very interested in speaking with you  
>>>> further on
>>>> this.
>>>
>>> I am available to speak more about it. If you need my phone  
>>> number, it's
>>> six one nine, two five five, nine seven eight six.
>>>
>>>
>>>> This looks like something we could heavily use.  What are your
>>>> thoughts?
>>>
>>> I think totem is a great protocol. Whether _you_ need it depends  
>>> on the
>>> application. I originally wrote this code back in 2000, and it  
>>> took me
>>> this long to find the ideal application for it.
>>>
>>> I would like to recommend the following article by Ken Birman  
>>> (probably
>>> the grandad of process groups):
>>>
>>> http://portal.acm.org/citation.cfm?id=326136
>>>
>>> Unfortunately you need to be a member of the acm to read it (I  
>>> used to be,
>>> but right now I am not.) This article describes his experiences  
>>> using
>>> ISIS, an early process group library, to build some interesting  
>>> systems.
>>>
>>> Guglielmo
>>>
>


Re: Fwd: Replication using totem protocol

Posted by "Alan D. Cabrera" <li...@toolazydogs.com>.
The nice thing about it is that there are no ACKS/NACKS so, it's not 
very chatty.  The bad thing is that you have to wait for the token to 
come your way before you can broadcast; if there are a lot of 
participants in the group the latency will be larger than you might like.

http://citeseer.csail.mit.edu/amir95totem.html
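
A tiny sketch of the trade-off described here (illustrative, not evs4j's
API): a sender blocks until the token reaches it, so with n members and
per-hop time h the worst-case wait before sending is roughly (n - 1) * h:

    import java.util.concurrent.Semaphore;

    // A sender may only broadcast while holding the rotating token; the
    // wait for the token is where the extra latency comes from.
    public class TokenGate {
        private final Semaphore token = new Semaphore(0);

        // Called by the ring layer when the token reaches this node.
        public void tokenArrived() {
            token.release();
        }

        // Blocks until the token arrives, then the message can go on the wire.
        public void broadcast(byte[] message) throws InterruptedException {
            token.acquire();
            // ... send message(s), then pass the token to the next node ...
        }
    }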


Regards,
Alan

Jeff Genender wrote, On 1/12/2006 2:53 PM:
> Guglielmo,
> 
> Ok... let's chat about the technical components of Totem.  What are the
> strengths and weaknesses?  Are there scaling issues, and if so, are there
> some mitigation strategies from an algorithm perspective?
> 
> Feel free to elaborate...this is great stuff.
> 
> Thanks,
> 
> Jeff
> 
> 
> lichtner@bway.net wrote:
> 
>>>Yes...awesome.  Bruce had chatted with me about this too...I am very
>>>interested.
>>
>>Thanks.
>>
>>
>>>Guglielmo, I would be very interested in speaking with you further on
>>>this.
>>
>>I am available to speak more about it. If you need my phone number, it's
>>six one nine, two five five, nine seven eight six.
>>
>>
>>>This looks like something we could heavily use.  What are your thoughts?
>>
>>I think totem is a great protocol. Whether _you_ need it depends on the
>>application. I originally wrote this code back in 2000, and it took me
>>this long to find the ideal application for it.
>>
>>I would like to recommend the following article by Ken Birman (probably
>>the grandad of process groups):
>>
>>http://portal.acm.org/citation.cfm?id=326136
>>
>>Unfortunately you need to be a member of the acm to read it (I used to be,
>>but right now I am not.) This article describes his experiences using
>>ISIS, an early process group library, to build some interesting systems.
>>
>>Guglielmo
>>



Re: Fwd: Replication using totem protocol

Posted by Jeff Genender <jg...@apache.org>.
Guglielmo,

Ok... let's chat about the technical components of Totem.  What are the
strengths and weaknesses?  Are there scaling issues, and if so, are there
some mitigation strategies from an algorithm perspective?

Feel free to elaborate...this is great stuff.

Thanks,

Jeff


lichtner@bway.net wrote:
>> Yes...awesome.  Bruce had chatted with me about this too...I am very
>> interested.
> 
> Thanks.
> 
>> Guglielmo, I would be very interested in speaking with you further on
>> this.
> 
> I am available to speak more about it. If you need my phone number, it's
> six one nine, two five five, nine seven eight six.
> 
>> This looks like something we could heavily use.  What are your thoughts?
> 
> I think totem is a great protocol. Whether _you_ need it depends on the
> application. I originally wrote this code back in 2000, and it took me
> this long to find the ideal application for it.
> 
> I would like to recommend the following article by Ken Birman (probably
> the grandad of process groups):
> 
> http://portal.acm.org/citation.cfm?id=326136
> 
> Unfortunately you need to be a member of the acm to read it (I used to be,
> but right now I am not.) This article describes his experiences using
> ISIS, an early process group library, to build some interesting systems.
> 
> Guglielmo
> 

Re: Replication using totem protocol

Posted by Dain Sundstrom <da...@iq80.com>.
On Jan 12, 2006, at 1:00 PM, lichtner@bway.net wrote:

> I would like to recommend the following article by Ken Birman  
> (probably
> the grandad of process groups):
>
> http://portal.acm.org/citation.cfm?id=326136
>
> Unfortunately you need to be a member of the acm to read it (I used  
> to be,
> but right now I am not.) This article describes his experiences using
> ISIS, an early process group library, to build some interesting  
> systems.

A quick search on scholar.google.com turned up a free version:

http://scholar.google.com/url?sa=U&q=http://historical.ncstrl.org/tr/ps/cornellcs/TR99-1726.ps

You can also view it in HTML if you don't have a PostScript viewer
(it is built into Apple Preview), but the images don't show up:

http://scholar.google.com/scholar?num=100&hl=en&lr=&safe=off&q=cache:Fri1iJMdcxUJ:historical.ncstrl.org/tr/ps/cornellcs/TR99-1726.ps+

-dain


Re: Fwd: Replication using totem protocol

Posted by li...@bway.net.
> Yes...awesome.  Bruce had chatted with me about this too...I am very
> interested.

Thanks.

> Guglielmo, I would be very interested in speaking with you further on
> this.

I am available to speak more about it. If you need my phone number, it's
six one nine, two five five, nine seven eight six.

> This looks like something we could heavily use.  What are your thoughts?

I think totem is a great protocol. Whether _you_ need it depends on the
application. I originally wrote this code back in 2000, and it took me
this long to find the ideal application for it.

I would like to recommend the following article by Ken Birman (probably
the grandad of process groups):

http://portal.acm.org/citation.cfm?id=326136

Unfortunately you need to be a member of the acm to read it (I used to be,
but right now I am not.) This article describes his experiences using
ISIS, an early process group library, to build some interesting systems.

Guglielmo



Re: Fwd: Replication using totem protocol

Posted by Jeff Genender <jg...@apache.org>.
Yes...awesome.  Bruce had chatted with me about this too...I am very
interested.

Guglielmo, I would be very interested in speaking with you further on
this.  This looks like something we could heavily use.  What are your
thoughts?

Jeff

Dain Sundstrom wrote:
> I'm not sure if you saw this email....
> 
> According to the website (http://www.bway.net/~lichtner/evs4j.html):
> 
>     "Extended Virtual Synchrony for Java (EVS4J), an Apache-Licensed,
> pure-Java implementation of the fastest known totally ordered reliable
> multicast protocol."
> 
> 
> Once you have a totally ordered messaging protocol, implementing a
> distributed lock is trivial (I can go into detail if you want).  I
> suggest we ask Guglielmo if he would like to donate his implementation
> to this incubator project and if he would like to work on a pessimistic
> distributed locking implementation.
> 
> What do you think?
> 
> -dain
> 
> Begin forwarded message:
> 
>> From: lichtner@bway.net
>> Date: January 4, 2006 10:46:44 AM PST
>> To: user@geronimo.apache.org
>> Subject: Replication using totem protocol
>> Reply-To: user@geronimo.apache.org
>>
>>
>> On google I saw that the apache geronimo incubator had traces of a totem
>> protocol implementation (author "adc" ??). Is this in geronimo,
>> replication using totem?
>>
>> If you google "java totem protocol" the first link is my evs4j project,
>> the second one is a reference to old apache geronimo incubator code.

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
inline at end...


Dain Sundstrom wrote:

> On Jan 12, 2006, at 12:28 PM, lichtner@bway.net wrote:
>
>> I didn't see it - I'm not sure why.
>>
>>> According to the website (http://www.bway.net/~lichtner/evs4j.html):
>>>
>>>      "Extended Virtual Synchrony for Java (EVS4J), an Apache-
>>> Licensed, pure-Java implementation of the fastest known totally
>>> ordered reliable multicast protocol."
>>
>>
>> Yes, I wrote that.
>>
>>> Once you have a totally ordered messaging protocol, implementing a
>>> distributed lock is trivial (I can go into detail if you want).
>>
>>
>> Yes. You just send a totally-ordered message and wait for it to arrive.
>>
>>> I suggest we ask Guglielmo if he would like to donate his
>>> implementation to this incubator project
>>
>>
>> I don't know about donating it. Who would they want me to transfer the
>> copyright to?
>
>
> No.  You license the code to the Apache Software Foundation giving  
> the foundation the rights to relicense under any license (so the  
> foundation can upgrade the license as they did with ASL2).  We do ask  
> that you change the copyrights on the version of the code you give to  
> the ASF to something like "Copyright 2004 The Apache Software  
> Foundation or its licensors, as applicable."
>
>>> and if he would like to work on a pessimistic distributed locking
>>
>> implementation.
>>
>>> What do you think?
>>
>>
>> I would definitely like to work on it, but I still work for a  
>> living, so
>> that's something to think about. (I happen to be between jobs right  
>> now.)
>
>
> Nothing better to do between jobs than coding :)
>
>> Also, what do you need the locks for?
>
>
> Locking web sessions and stateful session beans in the cluster when a  
> node is working on it.

Dain,

I see where you are coming from and it looks interesting. Virtual 
synchrony might well be useful in terms of guaranteeing lock-ordering 
etc., although at first glance, I have reservations about the 
scalability of multicast - but I need to read up on Totem...

I'll give Totem a mention in the clustering doc that I am putting
together.


Jules

>
> -dain



-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
see inline at bottom....

Dain Sundstrom wrote:

> On Jan 12, 2006, at 12:28 PM, lichtner@bway.net wrote:
>
>> I didn't see it - I'm not sure why.
>>
>>> According to the website (http://www.bway.net/~lichtner/evs4j.html):
>>>
>>>      "Extended Virtual Synchrony for Java (EVS4J), an Apache-
>>> Licensed, pure-Java implementation of the fastest known totally
>>> ordered reliable multicast protocol."
>>
>>
>> Yes, I wrote that.
>>
>>> Once you have a totally ordered messaging protocol, implementing a
>>> distributed lock is trivial (I can go into detail if you want).
>>
>>
>> Yes. You just send a totally-ordered message and wait for it to arrive.
>>
>>> I suggest we ask Guglielmo if he would like to donate his
>>> implementation to this incubator project
>>
>>
>> I don't know about donating it. Who would they want me to transfer the
>> copyright to?
>
>
> No.  You license the code to the Apache Software Foundation giving  
> the foundation the rights to relicense under any license (so the  
> foundation can upgrade the license as they did with ASL2).  We do ask  
> that you change the copyrights on the version of the code you give to  
> the ASF to something like "Copyright 2004 The Apache Software  
> Foundation or its licensors, as applicable."
>
>>> and if he would like to work on a pessimistic distributed locking
>>
>> implementation.
>>
>>> What do you think?
>>
>>
>> I would definitely like to work on it, but I still work for a  
>> living, so
>> that's something to think about. (I happen to be between jobs right  
>> now.)
>
>
> Nothing better to do between jobs than coding :)
>
>> Also, what do you need the locks for?
>
>
> Locking web sessions and stateful session beans in the cluster when a  
> node is working on it.
>
So, Dain, what do you have in mind ? Something to go into WADI or 
ActiveSpace, or something else ? Could you expand.... I would be 
interested in comparing this approach to the one currently used by WADI 
and getting everyone's input.

Cheers,


Jules

> -dain



-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.
Still, it doesn't seem like there is much interest in using totem. For
session replication you can use primary-backup, if anything.

On Sun, 22 Jan 2006, Geir Magnusson Jr wrote:

> Catching up :
>
> lichtner@bway.net wrote:
> >> No.  You license the code to the Apache Software Foundation giving
> >> the foundation the rights to relicense under any license (so the
> >> foundation can upgrade the license as they did with ASL2).  We do ask
> >> that you change the copyrights on the version of the code you give to
> >> the ASF to something like "Copyright 2004 The Apache Software
> >> Foundation or its licensors, as applicable."
> >
> > That _is_ transferring the copyright.
>
> No, it isn't.  You are still the copyright holder of the contributed
> material.  The (c) statement that Dain suggested represents the
> collective copyright of the whole package, which is your original code
> (for which you hold the copyright), and additions from other people (who
> individually hold copyright or share copyright depending on the
> contribution.)
>
> That's why it's "or its licensors", which you would certainly be.
>
> >
> > As I told Jeff on the phone, I would definitely consider this if it
> > turns out that evs4j will really be used, but I would rather not grant
> > someone an unlimited license at the present time. Jeff said we are going
> > to have a discussion, so we'll know more soon enough.
>
> The Apache License is fairly close to an unlimited license, so if it's
> available under the AL, you are already there.
>
> The only thing different is that you are giving the ASF the ability to
> distribute the collective work under other terms other than the current
> version of the Apache License.
>
> I hope that makes you feel a little more comfortable about things.
>
> geir
>
>

Re: Replication using totem protocol

Posted by Geir Magnusson Jr <ge...@pobox.com>.
Catching up :

lichtner@bway.net wrote:
>> No.  You license the code to the Apache Software Foundation giving
>> the foundation the rights to relicense under any license (so the
>> foundation can upgrade the license as they did with ASL2).  We do ask
>> that you change the copyrights on the version of the code you give to
>> the ASF to something like "Copyright 2004 The Apache Software
>> Foundation or its licensors, as applicable."
> 
> That _is_ transferring the copyright.

No, it isn't.  You are still the copyright holder of the contributed 
material.  The (c) statement that Dain suggested represents the 
collective copyright of the whole package, which is your original code 
(for which you hold the copyright), and additions from other people (who 
individually hold copyright or share copyright depending on the 
contribution.)

That's why it's "or its licensors", which you would certainly be.

> 
> As I told Jeff on the phone, I would definitely consider this if it
> turns out that evs4j will really be used, but I would rather not grant
> someone an unlimited license at the present time. Jeff said we are going
> to have a discussion, so we'll know more soon enough.

The Apache License is fairly close to an unlimited license, so if it's 
available under the AL, you are already there.

The only thing different is that you are giving the ASF the ability to 
distribute the collective work under other terms other than the current 
version of the Apache License.

I hope that makes you feel a little more comfortable about things.

geir


Re: Replication using totem protocol

Posted by Rajith Attapattu <ra...@gmail.com>.
Thanks a lot for the info !!!

Regards,

Rajith


On 1/17/06, lichtner <li...@bway.net> wrote:
>
>
> By reading selected parts of this book you can get a background on various
> issues that you have asked about:
>
> http://citeseer.ist.psu.edu/birman96building.html
>
> On Tue, 17 Jan 2006, Rajith Attapattu wrote:
>
> > > Can u guys talk more about locking mechanisms pros and cons wrt in
> memory
> > > replication and storaged backed replication.
> >
> > >I don't know what you have in mind here by 'storage-backed'.
> >
> > Sorry if I was not clear on that. what i meant was in memory vs
> serialized
> > form, either stored in a file or database or some other mechanism.
> >
> > >>you want to guarantee that the user's work is _never_lost, just send
> all
> > session updates to yourself in a totem-protocol 'safe' message
> > hmm can we really make a garuntee here even that you assumption
> > holds (Assuming 4 nodes and likely to survive node crashes up to 4 - R =
> 2
> > node crashes.)
> >
> > Also I didn't understand how u arrived at the 4-R value. I guess it's
> bcos I
> > don't have much knowledge about totem.
> > If there is a short answer and if it's not beyond the scope of the
> thread
> > can u try one more time to explain the thoery behind your assumption
> >
> > Regards,
> >
> > Rajith.
> >
> > On 1/17/06, lichtner <li...@bway.net> wrote:
> > >
> > >
> > > On Tue, 17 Jan 2006, Rajith Attapattu wrote:
> > >
> > > > Can u guys talk more about locking mechanisms pros and cons wrt in
> > > memory
> > > > replication and storaged backed replication.
> > >
> > > I don't know what you have in mind here by 'storage-backed'.
> > >
> > > > Also what if a node goes down while the lock is aquirred?? I assume
> > > there is
> > > > a time out.
> > >
> > > Which architecture do you have in mind here? I think the question is
> > > relevant if you use a standalone lock server, but if you don't then
> you
> > > just put the lock queue with the data item in question.
> > >
> > > > When it comes to partition (either network/power failure or vistual)
> or
> > > > healing (same new nodes comming up as well??) what are some of the
> > > > algorithms and stratergies that are widely used to handle those
> > > situations
> > > > ?? any pointers will be great.
> > >
> > > I believe the best strategy depends on what type of state the
> application
> > > has. Clearly if the state took zero time to transfer over you could
> > > compare version numbers, transfer the state to the nodes that happen
> to be
> > > out-of-date, and you are back in business. OTOH if the state is 1Gb
> you
> > > will take a different approach. There is not much to look up here.
> Think
> > > about it carefull and you can come up with the best state transfer for
> > > your application.
> > >
> > > Session state is easier than others because it consists of miryads
> small,
> > > independent data items that do not support concurrent access.
> > >
> > > > so if u are in the middle of filling a 10 page application on the
> web
> > > and
> > > > while in the 9th page and the server goes down, if you can restart
> again
> > > > with the 7 or 8th page (a resonable percentage of data was preserved
> > > through
> > > > merge/split/change) I guess it would be tolarable if not excellent
> in a
> > > very
> > > > busy server.
> > >
> > > Since this is a question about availability consider a cluster, say 4
> > > nodes, with a minimum R=2, say, where all the sessions are replicated
> on
> > > _each_ node. If you want to guarantee that the user's work is _never_
> > > lost, just send all session updates to yourself in a totem-protocol
> 'safe'
> > > message, which is delivered only after the message has been received
> (but
> > > not delivered) by all the nodes, and wait for your own message to
> arrive.
> > > This takes between 1 and 2 token rotations, which on 4 nodes I guess
> would
> > > be between 10-20 milliseconds, which is not a lot as http request
> > > latencies go.
> > >
> > > As a result of this after an http request returns, the work done is
> likely
> > > to survive node crashes up to 4 - R = 2 node crashes.
> > >
> > >
> >
>

Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.
By reading selected parts of this book you can get a background on various
issues that you have asked about:

http://citeseer.ist.psu.edu/birman96building.html

On Tue, 17 Jan 2006, Rajith Attapattu wrote:

> > Can you guys talk more about locking mechanisms, pros and cons, wrt in-memory
> > replication and storage-backed replication.
>
> >I don't know what you have in mind here by 'storage-backed'.
>
> Sorry if I was not clear on that. What I meant was in-memory vs serialized
> form, either stored in a file or database or some other mechanism.
>
> >>you want to guarantee that the user's work is _never_ lost, just send all
> >>session updates to yourself in a totem-protocol 'safe' message
> hmm, can we really make a guarantee here, even if your assumption
> holds (assuming 4 nodes and likely to survive up to 4 - R = 2
> node crashes)?
>
> Also I didn't understand how you arrived at the 4 - R value. I guess it's
> because I don't have much knowledge about totem.
> If there is a short answer and if it's not beyond the scope of the thread,
> can you try one more time to explain the theory behind your assumption
>
> Regards,
>
> Rajith.
>
> On 1/17/06, lichtner <li...@bway.net> wrote:
> >
> >
> > On Tue, 17 Jan 2006, Rajith Attapattu wrote:
> >
> > > Can u guys talk more about locking mechanisms pros and cons wrt in
> > memory
> > > replication and storaged backed replication.
> >
> > I don't know what you have in mind here by 'storage-backed'.
> >
> > > Also what if a node goes down while the lock is aquirred?? I assume
> > there is
> > > a time out.
> >
> > Which architecture do you have in mind here? I think the question is
> > relevant if you use a standalone lock server, but if you don't then you
> > just put the lock queue with the data item in question.
> >
> > > When it comes to partition (either network/power failure or vistual) or
> > > healing (same new nodes comming up as well??) what are some of the
> > > algorithms and stratergies that are widely used to handle those
> > situations
> > > ?? any pointers will be great.
> >
> > I believe the best strategy depends on what type of state the application
> > has. Clearly if the state took zero time to transfer over you could
> > compare version numbers, transfer the state to the nodes that happen to be
> > out-of-date, and you are back in business. OTOH if the state is 1Gb you
> > will take a different approach. There is not much to look up here. Think
> > about it carefull and you can come up with the best state transfer for
> > your application.
> >
> > Session state is easier than others because it consists of miryads small,
> > independent data items that do not support concurrent access.
> >
> > > so if u are in the middle of filling a 10 page application on the web
> > and
> > > while in the 9th page and the server goes down, if you can restart again
> > > with the 7 or 8th page (a resonable percentage of data was preserved
> > through
> > > merge/split/change) I guess it would be tolarable if not excellent in a
> > very
> > > busy server.
> >
> > Since this is a question about availability consider a cluster, say 4
> > nodes, with a minimum R=2, say, where all the sessions are replicated on
> > _each_ node. If you want to guarantee that the user's work is _never_
> > lost, just send all session updates to yourself in a totem-protocol 'safe'
> > message, which is delivered only after the message has been received (but
> > not delivered) by all the nodes, and wait for your own message to arrive.
> > This takes between 1 and 2 token rotations, which on 4 nodes I guess would
> > be between 10-20 milliseconds, which is not a lot as http request
> > latencies go.
> >
> > As a result of this after an http request returns, the work done is likely
> > to survive node crashes up to 4 - R = 2 node crashes.
> >
> >
>

Re: Replication using totem protocol

Posted by Rajith Attapattu <ra...@gmail.com>.
> Can you guys talk more about locking mechanisms, pros and cons, wrt in-memory
> replication and storage-backed replication.

>I don't know what you have in mind here by 'storage-backed'.

Sorry if I was not clear on that. What I meant was in-memory vs serialized
form, either stored in a file or database or some other mechanism.

>>you want to guarantee that the user's work is _never_ lost, just send all
>>session updates to yourself in a totem-protocol 'safe' message
hmm, can we really make a guarantee here, even if your assumption
holds (assuming 4 nodes and likely to survive up to 4 - R = 2
node crashes)?

Also I didn't understand how you arrived at the 4 - R value. I guess it's
because I don't have much knowledge about totem.
If there is a short answer and if it's not beyond the scope of the thread,
can you try one more time to explain the theory behind your assumption?

Regards,

Rajith.

On 1/17/06, lichtner <li...@bway.net> wrote:
>
>
> On Tue, 17 Jan 2006, Rajith Attapattu wrote:
>
> > Can u guys talk more about locking mechanisms pros and cons wrt in
> memory
> > replication and storaged backed replication.
>
> I don't know what you have in mind here by 'storage-backed'.
>
> > Also what if a node goes down while the lock is aquirred?? I assume
> there is
> > a time out.
>
> Which architecture do you have in mind here? I think the question is
> relevant if you use a standalone lock server, but if you don't then you
> just put the lock queue with the data item in question.
>
> > When it comes to partition (either network/power failure or vistual) or
> > healing (same new nodes comming up as well??) what are some of the
> > algorithms and stratergies that are widely used to handle those
> situations
> > ?? any pointers will be great.
>
> I believe the best strategy depends on what type of state the application
> has. Clearly if the state took zero time to transfer over you could
> compare version numbers, transfer the state to the nodes that happen to be
> out-of-date, and you are back in business. OTOH if the state is 1Gb you
> will take a different approach. There is not much to look up here. Think
> about it carefully and you can come up with the best state transfer for
> your application.
>
> Session state is easier than others because it consists of myriads of small,
> independent data items that do not support concurrent access.
>
> > so if u are in the middle of filling a 10 page application on the web
> and
> > while in the 9th page and the server goes down, if you can restart again
> > with the 7 or 8th page (a resonable percentage of data was preserved
> through
> > merge/split/change) I guess it would be tolarable if not excellent in a
> very
> > busy server.
>
> Since this is a question about availability consider a cluster, say 4
> nodes, with a minimum R=2, say, where all the sessions are replicated on
> _each_ node. If you want to guarantee that the user's work is _never_
> lost, just send all session updates to yourself in a totem-protocol 'safe'
> message, which is delivered only after the message has been received (but
> not delivered) by all the nodes, and wait for your own message to arrive.
> This takes between 1 and 2 token rotations, which on 4 nodes I guess would
> be between 10-20 milliseconds, which is not a lot as http request
> latencies go.
>
> As a result of this, after an http request returns, the work done is likely
> to survive up to 4 - R = 2 node crashes.
>
>

Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.
On Tue, 17 Jan 2006, Rajith Attapattu wrote:

> Can you guys talk more about locking mechanisms, pros and cons, with respect
> to in-memory replication and storage-backed replication.

I don't know what you have in mind here by 'storage-backed'.

> Also, what if a node goes down while the lock is acquired? I assume there
> is a timeout.

Which architecture do you have in mind here? I think the question is
relevant if you use a standalone lock server, but if you don't then you
just put the lock queue with the data item in question.

> When it comes to partitions (either network/power failure or virtual) or
> healing (including new nodes coming up), what are some of the algorithms
> and strategies that are widely used to handle those situations? Any
> pointers will be great.

I believe the best strategy depends on what type of state the application
has. Clearly if the state took zero time to transfer over you could
compare version numbers, transfer the state to the nodes that happen to be
out-of-date, and you are back in business. OTOH if the state is 1Gb you
will take a different approach. There is not much to look up here. Think
about it carefully and you can come up with the best state transfer for
your application.

Session state is easier than others because it consists of myriad small,
independent data items that do not support concurrent access.

> So if you are in the middle of filling out a 10-page application on the
> web, and on the 9th page the server goes down, if you can restart again
> with the 7th or 8th page (a reasonable percentage of the data was preserved
> through merge/split/change) I guess it would be tolerable, if not excellent,
> on a very busy server.

Since this is a question about availability consider a cluster, say 4
nodes, with a minimum R=2, say, where all the sessions are replicated on
_each_ node. If you want to guarantee that the user's work is _never_
lost, just send all session updates to yourself in a totem-protocol 'safe'
message, which is delivered only after the message has been received (but
not delivered) by all the nodes, and wait for your own message to arrive.
This takes between 1 and 2 token rotations, which on 4 nodes I guess would
be between 10-20 milliseconds, which is not a lot as http request
latencies go.

As a result of this after an http request returns, the work done is likely
to survive node crashes up to 4 - R = 2 node crashes.
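To make the mechanism concrete, here is a minimal Java sketch of the idea.
The SafeChannel/DeliveryListener API below is hypothetical (it is not
EVS4J's actual interface); the point is only the pattern of sending the
update as a 'safe' message and blocking until your own copy comes back:

import java.util.concurrent.CountDownLatch;

interface SafeChannel {                    // hypothetical transport API
    void sendSafe(byte[] payload);         // totally ordered 'safe' delivery
    void setListener(DeliveryListener l);
}

interface DeliveryListener {
    void onDeliver(String senderId, byte[] payload);
}

class SafeSessionReplicator implements DeliveryListener {
    private final SafeChannel channel;
    private final String selfId;
    private volatile CountDownLatch pending;

    SafeSessionReplicator(SafeChannel channel, String selfId) {
        this.channel = channel;
        this.selfId = selfId;
        channel.setListener(this);
    }

    // Blocks until our own update is delivered back to us, i.e. every
    // node in the current configuration has received it (typically 1-2
    // token rotations).
    void replicate(byte[] sessionUpdate) throws InterruptedException {
        pending = new CountDownLatch(1);
        channel.sendSafe(sessionUpdate);
        pending.await();
    }

    public void onDeliver(String senderId, byte[] payload) {
        applyUpdate(payload);              // keep the local replica current
        if (selfId.equals(senderId) && pending != null) {
            pending.countDown();           // our update is now 'safe'
        }
    }

    private void applyUpdate(byte[] payload) { /* merge into session store */ }
}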


Re: Replication using totem protocol

Posted by Rajith Attapattu <ra...@gmail.com>.
Can you guys talk more about locking mechanisms, pros and cons, with respect
to in-memory replication and storage-backed replication.

Also, what if a node goes down while the lock is acquired? I assume there is
a timeout.

When it comes to partitions (either network/power failure or virtual) or
healing (including new nodes coming up), what are some of the algorithms
and strategies that are widely used to handle those situations? Any
pointers will be great.
(All I know is that there is no algorithm that guarantees 100% recovery and
fail-over, but a reasonable expectation is that all is not lost and you can
continue from somewhere.)

So if you are in the middle of filling out a 10-page application on the web,
and on the 9th page the server goes down, if you can restart again
with the 7th or 8th page (a reasonable percentage of the data was preserved
through merge/split/change) I guess it would be tolerable, if not excellent,
on a very busy server.


I guess the success of any clustering framework is not to solve all concerns
regarding every possible solution, but to have a good abstraction of the
high-level concerns (and delay the impl concerns to as late as the application
level if possible) BUT!!! bundled with a few sensible impls/strategies, so
that if people have very specific situations they can make the decisions
themselves as to how they are going to manage performance vs. compliance vs.
scalability and HA.

Regards,

Rajith.

On 1/17/06, lichtner <li...@bway.net> wrote:
>
>
>
> On Tue, 17 Jan 2006, Jules Gosnell wrote:
>
> > just when you thought that this thread would die :-)
>
> I think Jeff Genender wanted a discussion to be sparked, and it worked.
>
> > So, I am wondering how might I use e.g. a shared disc or majority voting
> > in this situation ? In order to decide which fragment was the original
> > cluster and which was the piece that had broken off ? but then what
> > would the piece that had broken off do ? shutdown ?
>
> Wait to rejoin the cluster. Since it is not "the" cluster, it waits. It is
> not safe to make any updates.
>
> _How_ a group decides it is "the" cluster can be done in several ways.
> A shared-disk cluster can do this by a locking operation on a disk (I would
> have to research the details on this), a cluster with a database can get a
> lock from the database (and keep the connection open). And one way to do
> this in a shared-nothing cluster is to use a quorum of N/2 + 1, where N is
> the maximum number of nodes. Clearly it has to be the majority or else you
> can have a split-brain cluster.
>
> > Do you think that we need to worry about situations where a piece of
> > state has more than one client, so a network partition may result in two
> > copies diverging in different and incompatible directions, rather than
> > only one diverging.
>
> If you use a quorum or quorum-resource as above you do not have this
> problem. You can turn down the requests or let them block until the
> cluster re-discovers the 'failed' nodes.
>
> > I can imagine this happening in an Entity Bean (but
> > we should be able to use the DB to resolve this) or an application POJO.
> > I haven't considered the latter case and it looks pretty hopeless to me,
> > unless you have some alternative route over which the two fragments can
> > communicate... but then, if you did, would you not pair it with your
> > original network, so that the one failed over to the other or replicated
> > its activity, so that you never perceived a split in the first place ?
> > Is this a common solution, or do people use other mechanisms here ?
>
> I do believe that membership and quorum is all you need.
>
> Guglielmo
>

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
Rajith Attapattu wrote:

> This is a very educational thread, maybe Jules can incorporate some of 
> the ideas into his document on clustering.
> I do have a few questions on Guglielmo's points on session 
> replication. (Not an expert so please bear with my questions)
>  

Rajith, Guglielmo, these are good questions, so I hope you don't mind if 
I answer some of them for WADI...

> >1. The user should configure a minimum-degree-of-replication R. This is
> >the number of replicas of a specific session which need to be 
> available in
> >order for an HTTP request to be serviced.
>  
> 1.) How do you figure out the most efficient value for R?
> I assume when R increases, network chatter increases at a magnitude of 
> X, and X depends on whether it's a multicast protocol or 1->1 (first 
> of all, is this assumption correct?).

Since WADI is using 1->1 messaging, increasing the number of replicants 
will increase the number of messages accordingly. I expect that 1 or at 
most 2 backup copies will be sufficient for most sites.

> And when R decreases, the chance of a request hitting a server where the 
> session is not replicated is higher.

This is not something that is really considered a significant saving in 
WADI (see my last posting's explanation of why you only want one 
'active' copy of a session). WADI will keep session backups serialised, 
to save resources being constantly expended deserialising session 
backups that may never be accessed. I guess actually, you could consider 
that WADI will do a lazy deserialisation in the case that you have 
outlined, as primary and secondary copies will actually swap roles with 
attendant serialisation/passivation and deserialisation/activation 
coordinated by messages.

If you are running a reasonable sized cluster (e.g. 30 nodes - it's all 
relative) with a small number of backups configured (e.g. 1), then, in 
the case of a session affinity breakdown (due to the leaving of a 
primary's node), you have a 1/30 chance that the request will hit the 
primary, a 1/30 that you will hit the secondary and a 28/30 that you 
will miss :-) So, you are right :-)

>  
> So the sweet spot is a balance between the above two factors? Or have I 
> missed any other critical factor(s)?

In WADI, I don't really see this as a sweetspot. If you are running with 
session affinity, then requests should rarely miss their primary. If, in 
exceptional circumstances they do, and you have to arrange a role-swap 
between primary-secondary or a migration, you are still looking at cost.

If, however,  you did your deserialisation of replicants up front and 
thus avoided further messages when a secondary was hit, by maintaining 
all copies 'active' (I think you would not be spec compliant if you did 
this), then you would find a sweetspot, but only because you had paid up 
front with a lot of deserialisation to create it. Furthermore, in 
creating this sweetspot, you are constraining yourself (although not 
strictly) in terms of cluster size, because you don't want many more 
nodes than session copies.... So I don't see any advantage in such a 
'sweetspot'. I would rather be spec compliant, pay the deserialisation 
cost lazily and not worry about using as many nodes as I like :-).

I thought long and hard about whether I should allow multiple 'active' 
copies of a session and my decision to disallow this has had a large 
impact on WADI's design.

I am still happy that it is the right decision for HttpSessions, but 
there may be some slack here that SFSBs could exploit. I need to 
investigate.

>  
> 2.) When you say minimum-degree-of-replication it implies to me a 
> floor? Is there a ceiling value like 
> maximum-degree-of-replication? I guess we don't want the session 
> replication to grow beyond a point.
>  
> >2. When an HTTP request arrives, if the cluster which received it does not
> >have R copies then it blocks (it waits until there are.) This should work in
> >data centers because partitions are likely to be very short-lived (aka
> >virtual partitions, which are due to congestion, not to any hardware
> >issue.)
>  
> 1) Can you please elaborate a bit more on this, I didn't really 
> understand it. When you said wait until, does it mean
>     a) wait till there are R replicas in the cluster?
>     b) or until the session is replicated within the server where the http 
> request is received?
>  
> 2) When you said virtual partition did you mean a subset of nodes 
> being isolated due to congestion? By isolation I mean they are not 
> able to replicate their sessions or receive replicated sessions 
> from nodes outside of the subset due to congestion. Is this 
> correct?
>  
> 3) Assuming an HTTP request arrives and the cluster does not have R 
> copies. How different is this situation from "an HTTP request arrives 
> but no session replica in that server"?
>  
> >3. If at any time an HTTP request reaches a server which does not have 
> >itself a replica of the session it sends a client redirect to a node 
> >which does.
> How can this be achieved? Is it by having a central coordinator that 
> handles a mapping, or is this information replicated in all nodes on 
> the entire cluster?
>  
> information == "which nodes have replicas of each session"
>  
> The point below gave me the impression that some sort of inventory has to 
> be maintained centrally or cluster-wide (ideally distributed, in case the 
> controller dies).
>  
> >4. When a new cluster is formed (with nodes coming or going), it takes an
> >inventory of all the sessions and their version numbers. Sessions 
> which do
> >not have the necessary degree of replication need to be fixed, which 
> will
> >require some state transfer, and possibly migration of some session for
> >proper load balancing.
>  
> Again, how does the replication healing/shedding work? Assuming nodes 
> die or come back carrying their state, 
> how does the cluster decide on adding or removing sessions to maintain 
> the optimal R value?
> Where does the brain/logic for this sit? Ideally distributable in 
> case the controller dies.
>  
> General comments/questions
> -------------------------------------------
>  
> 1. How much do the current impls like WADI, ACluster and ASpace 
> address the above concerns?

WADI does not currently implement a solution, but I have the one that I 
described in mind.
I think (but am open to correction) that AC is happy to leave this issue 
to its consumer.
James can answer for ActiveSpace and maybe correct me on AC?

Jules

>  
> 2.) What aspects of the above concerns can be addressed with totem 
> better than other protocols?
>  
> 3. Can a SEDA-like architecture solve the problem of deciding the value 
> of R dynamically at runtime from time to time based on load and network 
> latency? I guess the network latency can be measured with some 
> metrics around token passing or something like that.
>  
> Answers are greatly appreciated.
>  
> Regards,
>  
> Rajith.
>  
> On 1/16/06, *lichtner* <lichtner@bway.net <ma...@bway.net>> 
> wrote:
>
>
>
>     On Mon, 16 Jan 2006, Jules Gosnell wrote:
>
>     > REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a
>     node dies ...
>
>     I figured. I imagine that if I had to add this distinction to totem I
>     would add a message where the node in question announces that it is
>     leaving, and then stops forwarding the token. On the other hand,
>     it does
>     not need to announce anything, and the other nodes will detect that it
>     left. In fact totem does not judge a node either way: you can leave
>     because you want to or under duress, and the consequences as far as
>     distributed algorithms go are probably minimal. I think the only place
>     where this
>     might matter is for logging purposes (but that could be handled at the
>     application level) or to speed up the membership protocol, although it's
>     already pretty fast.
>
>     So I would not draw a distinction there.
>
>     > By also treating nodes joining, leaving and dying, as split and
>     merge
>     > operations I can reduce the number of cases that I have to deal
>     with.
>
>     I would even add that the difference is known only to the application.
>
>     > and ensure that what might be very uncommonly run code (run on
>     network
>     > partition/healing) is the same code that is commonly run on e.g.
>     node
>     > join/leave - so it is likely to be more robust.
>
>     Sounds good.
>
>     > In the case of a binary split, I envisage two sets of nodes losing
>     > contact with each other. Each cluster fragment will repair its
>     internal
>     > structure. I expect that after this repair, neither fragment
>     will carry
>     > a complete copy of the cluster's original state (unless we are
>     > replicating 1->all, which WADI will usually not do), rather, the two
>     > datasets will intersect and their union will be the original
>     dataset.
>     > Replicated state will carry a version number.
>
>     I think a version number should work very well.
>
>     > If client affinity survives the split (i.e. clients continue to
>     talk to
>     > the same nodes), then we should find ourselves in a working
>     state, with
>     > two smaller clusters carrying overlapping and diverging state. Each
>     > piece of state should be static in one subcluster and divergent
>     in the
>     > other (it has only one client). The version carried by each piece of
>     > state may be used to decide which is the most recent version.
>     >
>     > (If client affinity is not maintained, then, without a
>     backchannel of
>     > some sort, we are in trouble).
>     >
>     > When a merge occurs, WADI will be able to merge the internal
>     > representations of the participants, delegating awkward
>     decisions about
>     > divergent state to deploy-time pluggable algorithms. Hopefully, each
>     > piece of state will only have diverged in one cluster fragment
>     so the
>     > choosing which copy to go forward with will be trivial.
>
>     > A node death can just be thought of as a 'split' which never
>     'merges'.
>
>     Definitely :)
>
>     > Of course, multiple splits could occur concurrently and merging
>     them is
>     > a little more complicated than I may have implied, but I am getting
>     > there....
>
>     Although I consider the problem of session replication less than
>     glamorous, since it is at hand, I would approach it this way:
>
>     1. The user should configure a minimum-degree-of-replication R.
>     This is
>     the number of replicas of a specific session which need to be
>     available in
>     order for an HTTP request to be serviced.
>
>     2. When an HTTP request arrives, if the cluster which received it
>     does not
>     have R copies then it blocks (it waits until there are.) This
>     should work in
>     data centers because partitions are likely to be very short-lived (aka
>     virtual partitions, which are due to congestion, not to any hardware
>     issue.)
>
>     3. If at any time an HTTP request reaches a server which does not
>     have itself a
>     replica of the session it sends a client redirect to a node which
>     does.
>
>     4. When a new cluster is formed (with nodes coming or going), it
>     takes an
>     inventory of all the sessions and their version numbers. Sessions
>     which do
>     not have the necessary degree of replication need to be fixed,
>     which will
>     require some state transfer, and possibly migration of some
>     session for
>     proper load balancing.
>
>     Guglielmo
>
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.
On Mon, 16 Jan 2006, Rajith Attapattu wrote:

> This is a very educational thread, maybe Jules can incorporate some of the
> ideas into his document on clustering.

Let's hope the thread also eventually translates into working code :)

> >1. The user should configure a minimum-degree-of-replication R. This is
> >the number of replicas of a specific session which need to be available in
> >order for an HTTP request to be serviced.
>
> 1.) How do you figure out the most efficient value for R?

I am not sure what you mean by efficient. If you mean that it maximizes
availability, I have seen a derivation in this book:

"Fault Tolerance in Distributed Systems"
Pankaj Jalote, 1994
Chapter 7, Section 5, "Degree of Replication"

He shows that the availability _as_ a function of the number of replicas
goes up and then down again, basically because more replicas defend
against failures but require more housekeeping, and the resources used to
do housekeeping cannot be used for servicing transactions.
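As a purely illustrative toy model (not Jalote's derivation, and with
arbitrary constants), the up-then-down shape falls out of multiplying a
survival term by a housekeeping penalty:

public class ReplicationSweetSpot {
    public static void main(String[] args) {
        double p = 0.2;    // assumed probability a given replica is down
        double c = 0.02;   // assumed per-extra-replica housekeeping cost
        for (int r = 1; r <= 8; r++) {
            // available if at least one replica is up, discounted by the
            // capacity spent keeping the extra replicas in sync
            double a = (1 - Math.pow(p, r)) * (1 - c * (r - 1));
            System.out.printf("R=%d  availability ~ %.4f%n", r, a);
        }
    }
}

With these made-up numbers availability peaks around R=3 and declines
after that.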

I believe it is very difficult to compute availability analytically, and
that the majority of downtime would not be due to hardware failures. It's
probably 1) power failures and 2) software failures. I think Pfister talks
about the various causes of downtime in his book.

> I assume when R increases, network chatter increases at a magnitude of X, and
> X depends on whether it's a multicast protocol or 1->1 (first of all, is this
> assumption correct?).

I think for this thread we were assuming reliable multicast. See also
the thread about infiniband, which completely changes the calculus because
of the lack of context switching - that would be closer to just using a
symmetric multiprocessor.

> And when R decreases, the chance of a request hitting a server where the
> session is not replicated is higher.

That doesn't matter. When the request hits a server where the session is
not replicated you send a redirect - the system is available, but perhaps
the latency for that particular request is larger than for others.
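A minimal sketch of that redirect as a servlet filter; the ReplicaDirectory
helper is hypothetical, not part of WADI or ActiveCluster:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

interface ReplicaDirectory {               // hypothetical: who holds what
    boolean isLocal(String sessionId);
    String nodeWith(String sessionId);     // e.g. "http://node3:8080"
}

public class RedirectToReplicaFilter implements Filter {
    private ReplicaDirectory directory;

    public void init(FilterConfig cfg) {
        directory = (ReplicaDirectory)
            cfg.getServletContext().getAttribute("replicaDirectory");
    }

    public void doFilter(ServletRequest req, ServletResponse res,
                         FilterChain chain) throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        String sessionId = http.getRequestedSessionId();
        if (sessionId != null && !directory.isLocal(sessionId)) {
            // The system stays available; only this one request pays the
            // extra round trip of a client redirect.
            ((HttpServletResponse) res).sendRedirect(
                directory.nodeWith(sessionId) + http.getRequestURI());
            return;
        }
        chain.doFilter(req, res);
    }

    public void destroy() {}
}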

> So the sweet spot is a balance between the above two factors? Or have I missed
> any other critical factor(s)?

See reference above.

> 2.) When you say minimum-degree-of-replication it implies to me a floor? Is
> there a ceiling value like maximum-degree-of-replication? I guess we
> don't want the session replication to grow beyond a point.

Yes. See above. Availability goes down past a certain value of R.

> >2. When an HTTP request arrives, if the cluster which received it does not
> >have R copies then it blocks (it waits until there are.) This should work in
> >data centers because partitions are likely to be very short-lived (aka
> >virtual partitions, which are due to congestion, not to any hardware
> >issue.)
>
> 1) Can you please elaborate a bit more on this, I didn't really understand it.
> When you said wait until, does it mean
>     a) wait till there are R replicas in the cluster?

Any time there is a change in the composition of the cluster, it
must review its global state and, if necessary, arrange for new session
replicas to be installed on some nodes, for replicas to be migrated, or
for replicas to be deleted. For example, if R=3 and replica no. 2 of
session 49030 was on node N7, which just bowed out, the cluster might
decide to install a replica of session 49030 on node N3.

Rearranging replicas, aka state-transfer, takes time. While that happens
you block new http requests for the relevant sessions.
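A hedged Java sketch of that review step; the names are illustrative, not
taken from WADI, ActiveCluster or EVS4J:

import java.util.*;

class ReplicaRebalancer {
    private final int minReplicas;                      // R
    private final Map<String, Set<String>> replicaMap;  // sessionId -> node ids

    ReplicaRebalancer(int minReplicas, Map<String, Set<String>> replicaMap) {
        this.minReplicas = minReplicas;
        this.replicaMap = replicaMap;
    }

    // Called when a new configuration (view) is installed. Returns the
    // sessions that must block new requests until state transfer restores
    // them to R replicas.
    List<String> onViewChange(Set<String> liveNodes) {
        List<String> needTransfer = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> e : replicaMap.entrySet()) {
            e.getValue().retainAll(liveNodes);          // drop departed nodes
            if (e.getValue().size() < minReplicas) {
                needTransfer.add(e.getKey());           // e.g. session 49030
            }
        }
        return needTransfer;                            // feed a transfer queue
    }
}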

>     b) or until the session is replicated within the server where the http
> request is received?

No. See above. Although when rearranging replicas you have some freedom
and you are free to give priority to some nodes over others.

> 2) When you said virtual partition did you mean a subset of nodes
> being isolated due to congestion?

Yes.

> By isolation I mean they are not able to
> replicate their sessions or receive replicated sessions from other
> nodes outside of the subset due to congestion. Is this correct?

It's also possible that all nodes are up to date on a given session, and
the virtual partition heals before the user tries to update the session
again.

A partition occurs when nodes 1 and 2 agree with each other that nodes 3
and 4 are no longer around and install a new group, a.k.a. "view", a.k.a.
"configuration".

But 3 and 4 may appear again soon after (e.g. 5 seconds) and so the
partition may end up having few consequences if any.

> 3) Assuming an HTTP request arrives and the cluster does not have R copies.
> How different is this situation from "an HTTP request arrives but no session
> replica in that server"?
>
> >3. If at any time an HTTP request reaches a server which does not have itself
> >a replica of the session it sends a client redirect to a node which does.
> How can this be achieved? Is it by having a central coordinator that handles
> a mapping, or is this information replicated in all nodes on the entire
> cluster?
>
> information == "which nodes have replicas of each session"
>
> The point below gave me the impression that some sort of inventory has to be
> maintained centrally or cluster-wide (ideally distributed, in case the
> controller dies).

You could use R replicas here also.

> >4. When a new cluster is formed (with nodes coming or going), it takes an
> >inventory of all the sessions and their version numbers. Sessions which do
> >not have the necessary degree of replication need to be fixed, which will
> >require some state transfer, and possibly migration of some session for
> >proper load balancing.
>
> Again, how does the replication healing/shedding work? Assuming nodes die or
> come back carrying their state,
> how does the cluster decide on adding or removing sessions to maintain the
> optimal R value?
> Where does the brain/logic for this sit? Ideally distributable in case the
> controller dies.

You can design either distributed or centralized.

> General comments/questions
> -------------------------------------------
>
> 1. How much do the current impls like WADI, ACluster and ASpace address
> the above concerns?

The part about the organization of the state and state transfer has to be
coded. I think those tools are agnostic as far as application state goes.

> 2.) What aspects of the above concerns can be addressed with totem better
> than other protocols?

I don't see totem addressing state transfer. It does provide membership
and very well-behaved reliable multicast. Most importantly, since messages
are totally ordered it makes things much easier. Although in the case of
session replication there is no data sharing, so ordering is not as
critical.

> 3. Can a SEDA-like architecture solve the problem of deciding the value of R
> dynamically at runtime from time to time based on load and network latency? I
> guess the network latency can be measured with some metrics around token
> passing or something like that.

The value of R depends on how available you want your system to be.

I know what SEDA does but I don't see its relevance here, except to say
that if the application is based on SEDA you will get a better-behaved
application (and spend a lot of time coding it.)

Re: Replication using totem protocol

Posted by Rajith Attapattu <ra...@gmail.com>.
This is a very educational thread, maybe Jules can incorporate some of the
ideas into his document on clustering.
I do have a few questions on Guglielmo's points on session replication. (Not
an expert so please bear with my questions)

>1. The user should configure a minimum-degree-of-replication R. This is
>the number of replicas of a specific session which need to be available in
>order for an HTTP request to be serviced.

1.) How do you figure out the most efficient value for R?
I assume when R increases, network chatter increases at a magnitude of X, and
X depends on whether it's a multicast protocol or 1->1 (first of all, is this
assumption correct?).
And when R decreases, the chance of a request hitting a server where the
session is not replicated is higher.

So the sweet spot is a balance between the above two factors? Or have I missed
any other critical factor(s)?

2.) When you say minimum-degree-of-replication it implies to me a floor? Is
there a ceiling value like maximum-degree-of-replication? I guess we
don't want the session replication to grow beyond a point.

>2. When an HTTP request arrives, if the cluster which received it does not
>have R copies then it blocks (it waits until there are.) This should work in
>data centers because partitions are likely to be very short-lived (aka
>virtual partitions, which are due to congestion, not to any hardware
>issue.)

1) Can you please elaborate a bit more on this, I didn't really understand it.
When you said wait until, does it mean
    a) wait till there are R replicas in the cluster?
    b) or until the session is replicated within the server where the http
request is received?

2) When you said virtual partition did you mean a subset of nodes
being isolated due to congestion? By isolation I mean they are not able to
replicate their sessions or receive replicated sessions from other
nodes outside of the subset due to congestion. Is this correct?

3) Assuming an HTTP request arrives and the cluster does not have R copies.
How different is this situation from "an HTTP request arrives but no session
replica in that server"?

>3. If at any time an HTTP request reaches a server which does not have itself
>a replica of the session it sends a client redirect to a node which does.
How can this be achieved? Is it by having a central coordinator that handles
a mapping, or is this information replicated in all nodes on the entire
cluster?

information == "which nodes have replicas of each session"

The point below gave me the impression that some sort of inventory has to be
maintained centrally or cluster-wide (ideally distributed, in case the
controller dies).

>4. When a new cluster is formed (with nodes coming or going), it takes an
>inventory of all the sessions and their version numbers. Sessions which do
>not have the necessary degree of replication need to be fixed, which will
>require some state transfer, and possibly migration of some session for
>proper load balancing.

Again, how does the replication healing/shedding work? Assuming nodes die or
come back carrying their state,
how does the cluster decide on adding or removing sessions to maintain the
optimal R value?
Where does the brain/logic for this sit? Ideally distributable in case the
controller dies.

General comments/questions
-------------------------------------------

1. How much do the current impls like WADI, ACluster and ASpace address
the above concerns?

2.) What aspects of the above concerns can be addressed with totem better
than other protocols?

3. Can a SEDA-like architecture solve the problem of deciding the value of R
dynamically at runtime from time to time based on load and network latency? I
guess the network latency can be measured with some metrics around token
passing or something like that.

Answers are greatly appreciated.

Regards,

Rajith.

On 1/16/06, lichtner <li...@bway.net> wrote:
>
>
>
> On Mon, 16 Jan 2006, Jules Gosnell wrote:
>
> > REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies
> ...
>
> I figured. I imagine that if I had to add this distinction to totem I
> would add a message where the node in question announces that it is
> leaving, and then stops forwarding the token. On the other hand, it does
> not need to announce anything, and the other nodes will detect that it
> left. In fact totem does not judge a node either way: you can leave
> because you want to or under duress, and the consequences as far as
> distributed algorithms go are probably minimal. I think the only place where
> this might matter is for logging purposes (but that could be handled at the
> application level) or to speed up the membership protocol, although it's
> already pretty fast.
>
> So I would not draw a distinction there.
>
> > By also treating nodes joining, leaving and dying, as split and merge
> > operations I can reduce the number of cases that I have to deal with.
>
> I would even add that the difference is known only to the application.
>
> > and ensure that what might be very uncommonly run code (run on network
> > partition/healing) is the same code that is commonly run on e.g. node
> > join/leave - so it is likely to be more robust.
>
> Sounds good.
>
> > In the case of a binary split, I envisage two sets of nodes losing
> > contact with each other. Each cluster fragment will repair its internal
> > structure. I expect that after this repair, neither fragment will carry
> > a complete copy of the cluster's original state (unless we are
> > replicating 1->all, which WADI will usually not do), rather, the two
> > datasets will intersect and their union will be the original dataset.
> > Replicated state will carry a version number.
>
> I think a version number should work very well.
>
> > If client affinity survives the split (i.e. clients continue to talk to
> > the same nodes), then we should find ourselves in a working state, with
> > two smaller clusters carrying overlapping and diverging state. Each
> > piece of state should be static in one subcluster and divergent in the
> > other (it has only one client). The version carried by each piece of
> > state may be used to decide which is the most recent version.
> >
> > (If client affinity is not maintained, then, without a backchannel of
> > some sort, we are in trouble).
> >
> > When a merge occurs, WADI will be able to merge the internal
> > representations of the participants, delegating awkward decisions about
> > divergent state to deploy-time pluggable algorithms. Hopefully, each
> > piece of state will only have diverged in one cluster fragment so the
> > choosing which copy to go forward with will be trivial.
>
> > A node death can just be thought of as a 'split' which never 'merges'.
>
> Definitely :)
>
> > Of course, multiple splits could occur concurrently and merging them is
> > a little more complicated than I may have implied, but I am getting
> > there....
>
> Although I consider the problem of session replication less than
> glamorous, since it is at hand, I would approach it this way:
>
> 1. The user should configure a minimum-degree-of-replication R. This is
> the number of replicas of a specific session which need to be available in
> order for an HTTP request to be serviced.
>
> 2. When an HTTP request arrives, if the cluster which received it does not
> have R copies then it blocks (it waits until there are.) This should work in
> data centers because partitions are likely to be very short-lived (aka
> virtual partitions, which are due to congestion, not to any hardware
> issue.)
>
> 3. If at any time an HTTP request reaches a server which does not have itself
> a replica of the session it sends a client redirect to a node which does.
>
> 4. When a new cluster is formed (with nodes coming or going), it takes an
> inventory of all the sessions and their version numbers. Sessions which do
> not have the necessary degree of replication need to be fixed, which will
> require some state transfer, and possibly migration of some session for
> proper load balancing.
>
> Guglielmo
>

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
Andy Piper wrote:

> At 09:25 AM 1/18/2006, Jules Gosnell wrote:
>
>> I haven't been able to convince myself to take the quorum approach 
>> because...
>>
>> shared-something approach:
>> - the shared something is a Single Point of Failure (SPoF) - although 
>> you could use an HA something.
>
>
> That's how WAS and WLS do it. Use an HA database, SAN or dual-ported 
> scsi. The latter is cheap. The former are probably already available 
> to customers if they really care about availability.

Well, I guess we will have to consider making something along these 
lines available... - I guess we need a pluggable QuorumStrategy.

>
>> - If the node holding the lock 'goes crazy', but does not die, the 
>> rest of the
>
>
> This is generally why you use leases. Then your craziness is only 
> believed for a fixed amount of time.

Understood.

>
>> cluster becomes a fragment - so it becomes an SPoF as well.
>> - used in isolation, it does not take into account that the lock may 
>> be held by the smallest cluster fragment
>
>
> You generally solve this again with leases. i.e. a lock that is valid 
> for some period.

I don't follow you here - but we have lost quite a bit of context. I 
think that I was saying that if the fragment that owned the 
shared-something was the smaller of the two, then 'freezing' the larger 
fragment would not be optimal - but, I guess you could use the 
shared-something to negotiate between the two fragments and decide which 
to freeze and which to allow to continue...

I don't see leases helping here - but maybe I have mistaken the context?

>
>> shared-nothing approach:
>
>
> Nice in theory but tricky to implement well. Consensus works well here.
>
>> - I prefer this approach, but, as you have stated, if the two halves 
>> are equally sized...
>> - What if there are two concurrent fractures (does this happen?)
>> - ActiveCluster notifies you of one membership change at a time - so 
>> you would have to decide on an algorithm for 'chunking' node loss, so 
>> that you could decide when a fragmentation had occurred...
>
>
> If you really want to do this reliably you have to assume that AC will 
> send you bogus notifications. Ideally you want to achieve a consensus 
> on membership to avoid this. It sounds like totem solves some of these 
> issues.

Totem does seem to have some advanced consensus stuff, which, I am 
?assuming?, relies on its virtual synchrony. This stuff would probably 
be very useful under ActiveCluster to manage membership change and 
partition notifications, as it would, I understand, guarantee that every 
node received a consistent view of what was going on.

For the peer->peer messaging aspect of AC (1->1 and 1->all), I don't 
think VS is required. In fact it might be an unwelcome overhead. I don't 
know enough about the internals of AC and Totem to know if it would be 
possible to reuse Totem's VS/consensus stuff on-top-of/along-side AMQs 
e.g. peer:// protocol stack and underneath AC's membership notification 
API, but it seems to me that ultimately the best solution would be a 
hybrid, that uses these approaches where needed and not where not...

Have I got the right end of the stick? Perhaps you can choose which 
messages are virtually synchronous and which are not in Totem? I am 
pretty sure though, that it was using multicast, so is not the best 
solution for 1->1 messaging....


Jules

>
> andy 



-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by Andy Piper <an...@bea.com>.
At 09:25 AM 1/18/2006, Jules Gosnell wrote:
>I haven't been able to convince myself to take the quorum approach because...
>
>shared-something approach:
>- the shared something is a Single Point of Failure (SPoF) - 
>although you could use an HA something.

That's how WAS and WLS do it. Use an HA database, SAN or dual-ported 
scsi. The latter is cheap. The former are probably already available 
to customers if they really care about availability.

>- If the node holding the lock 'goes crazy', but does not die, the 
>rest of the

This is generally why you use leases. Then your craziness is only 
believed for a fixed amount of time.

>cluster becomes a fragment - so it becomes an SPoF as well.
>- used in isolation, it does not take into account that the lock may 
>be held by the smallest cluster fragment

You generally solve this again with leases. i.e. a lock that is valid 
for some period.
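A minimal Java sketch of a lease, ignoring clock skew and fencing: the
holder must renew before expiry, so a node that 'goes crazy' is believed
for at most leaseMillis:

class Lease {
    private String holder;
    private long expiresAt;
    private final long leaseMillis;

    Lease(long leaseMillis) { this.leaseMillis = leaseMillis; }

    synchronized boolean tryAcquire(String nodeId, long now) {
        // A lapsed lease is free for the taking, however crazy its
        // previous holder still believes itself to be.
        if (holder == null || now >= expiresAt || holder.equals(nodeId)) {
            holder = nodeId;
            expiresAt = now + leaseMillis;
            return true;
        }
        return false;
    }

    synchronized boolean renew(String nodeId, long now) {
        // Renewal only succeeds while the lease is still valid.
        if (nodeId.equals(holder) && now < expiresAt) {
            expiresAt = now + leaseMillis;
            return true;
        }
        return false;
    }
}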

>shared-nothing approach:

Nice in theory but tricky to implement well. Consensus works well here.

>- I prefer this approach, but, as you have stated, if the two halves 
>are equally sized...
>- What if there are two concurrent fractures (does this happen?)
>- ActiveCluster notifies you of one membership change at a time - so 
>you would have to decide on an algorithm for 'chunking' node loss, 
>so that you could decide when a fragmentation had occurred...

If you really want to do this reliably you have to assume that AC 
will send you bogus notifications. Ideally you want to achieve a 
consensus on membership to avoid this. It sounds like totem solves 
some of these issues.

andy 


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.
On Wed, 18 Jan 2006, Jules Gosnell wrote:

> I haven't been able to convince myself to take the quorum approach
> because...
>
> shared-something approach:
> - the shared something is a Single Point of Failure (SPoF) - although
> you could use an HA something.

It's not really a SPoF. You just fail over to a different resource. All
you need is a lock. You could use two Java processes anywhere on the
network which listen on a socket, and use only one at a time. If one is not
listening, you try the other one.
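A hedged sketch of that failover in Java, where holding the open socket
_is_ holding the lock; the server side (accepting one connection at a time
and refusing extras) is assumed:

import java.io.IOException;
import java.net.Socket;

class ClusterLockClient {
    private final String[] hosts;   // e.g. {"lock1", "lock2"}
    private final int port;

    ClusterLockClient(String[] hosts, int port) {
        this.hosts = hosts;
        this.port = port;
    }

    // Returns an open socket to whichever lock process answered.
    // Keeping the connection open keeps the lock; closing releases it.
    Socket acquire() throws IOException {
        IOException last = null;
        for (String host : hosts) {
            try {
                return new Socket(host, port);
            } catch (IOException e) {
                last = e;           // not listening; try the other one
            }
        }
        if (last != null) throw last;
        throw new IOException("no lock hosts configured");
    }
}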

> - If the node holding the lock 'goes crazy', but does not die, the rest
> of the cluster becomes a fragment - so it becomes an SPoF as well.

If by 'goes crazy' you mean that it's up but it's not doing anything,
totem defends against this by detecting processors which fail to make
progress after N token rotations, and when they do it declares them failed.

But if you mean that it just sends corrupt data or starts using broken
algorithms etc. then I would need to research it a bit. But definitely
defending against these byzantine failures will be more expensive. I
believe the solution is that you have to process operations on multiple
nodes and compare the results.

I believe this is how Tandem machines work. Each cpu step is voted on.
Byzantine failures can happen because of cosmic rays, or other
physics-related issues.

Definitely much more fun.

> - used in isolation, it does not take into account that the lock may be
> held by the smallest cluster fragment

Yes, it does. The question is, why do you have a partition? If you have a
partition because a network element failed, then put some redundancy in
your network topology. If the partition is virtual, i.e.
congestion-induced, then wait a few seconds for it to heal.

And if you get too many virtual partitions it means you either need to
tweak your failure detection parameters (token-loss-timeout in totem) or
your load is too high and you need to add some capacity to the cluster.

> shared-nothing approach:
> - I prefer this approach, but, as you have stated, if the two halves are
> equally sized...

I didn't mean to say that. In this approach you _must_ set a minimum
quorum which is _the_ majority of the size of the rack. If you own five
machines, make the quorum three.

> - What if there are two concurrent fractures (does this happen?)

It's no different than any other partition.

> - ActiveCluster notifies you of one membership change at a time - so you
> would have to decide on an algorithm for 'chunking' node loss, so that
> you could decide when a fragmentation had occurred...

The problem exists anyway. Even in totem you can have several
'configurations' installed in quick succession. In order to defend against
this you need to design your state transfer algorithms around it.

> perhaps a hybrid of the two would be able to cover more bases... -
> shared-nothing falling back to shared-something if your fragment is
> sized N/2.

You can definitely make a totally custom algorithm for determining your
majority partition, and it's a fact that hybrid approaches can solve
difficult problems, but for the reasons I said above I believe that you
can do just fine if you have a redundant network and you keep some cpu
unused.

> As far as my plans for WADI, I think I am happy to stick with the, 'rely
> on affinity and keep going' approach.
>
> As far as situations where a distributed object may have more than one
> client, I can see that quorum offers the hope of a solution, but,
> without some very careful thought, I would still be hesitant to stake my
> shirt on it :-) for the reasons given above...
>
> I hadn't really considered 'pausing' a cluster fragment, so this is a
> useful idea. I guess that I have been thinking more in terms of
> long-lived fractures, rather than short-lived ones. If the latter are
> that much more common, then this is great input and I need to take it
> into account.
>
> The issue about 'chunking' node loss interests me... I see that the
> EVS4J Listener returns a set of members, so it is possible to express
> the loss of more than one node. How is membership decided and node loss
> aggregated ?

Read the totem protocol article. The membership protocol is in there. But
as I said you can still get a flurry of configurations installed one after
the other. It is only a problem if you plan to do your cluster
re-organization all at once.
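One illustrative way to design around such a flurry (this is a generic
pattern, not EVS4J's Listener contract) is to act only on the latest
configuration, after the membership has been quiet for a settle period:

import java.util.Set;
import java.util.concurrent.*;

class SettlingViewListener {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    private final long settleMillis;
    private ScheduledFuture<?> pending;

    SettlingViewListener(long settleMillis) { this.settleMillis = settleMillis; }

    // Called each time a new configuration is installed.
    synchronized void onNewConfiguration(final Set<String> members) {
        if (pending != null) {
            pending.cancel(false);   // a newer view supersedes the old one
        }
        pending = timer.schedule(new Runnable() {
            public void run() { reorganize(members); }
        }, settleMillis, TimeUnit.MILLISECONDS);
    }

    void reorganize(Set<String> members) {
        // take the session inventory, schedule state transfer, etc.
    }
}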

Guglielmo

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
lichtner wrote:

>On Tue, 17 Jan 2006, Jules Gosnell wrote:
>
>  
>
>>just when you thought that this thread would die :-)
>>    
>>
>
>I think Jeff Genender wanted a discussion to be sparked, and it worked.
>
>  
>
>>So, I am wondering how might I use e.g. a shared disc or majority voting
>>in this situation ? In order to decide which fragment was the original
>>cluster and which was the piece that had broken off ? but then what
>>would the piece that had broken off do ? shutdown ?
>>    
>>
>
>Wait to rejoin the cluster. Since it is not "the" cluster, it waits. It is
>not safe to make any updates.
>
>  
>
>_How_ a group decides it is "the" cluster can be done in several ways.
>A shared-disk cluster can do this by a locking operation on a disk (I would
>have to research the details on this), a cluster with a database can get a lock
>from the database (and keep the connection open). And one way to do this
>in a shared-nothing cluster is to use a quorum of N/2 + 1, where N is the
>maximum number of nodes. Clearly it has to be the majority or else you can
>have a split-brain cluster.
>  
>
I haven't been able to convince myself to take the quorum approach 
because...

shared-something approach:
- the shared something is a Single Point of Failure (SPoF) - although 
you could use an HA something.
- If the node holding the lock 'goes crazy', but does not die, the rest 
of the cluster becomes a fragment - so it becomes an SPoF as well.
- used in isolation, it does not take into account that the lock may be 
held by the smallest cluster fragment

shared-nothing approach:
- I prefer this approach, but, as you have stated, if the two halves are 
equally sized...
- What if there are two concurrent fractures (does this happen?)
- ActiveCluster notifies you of one membership change at a time - so you 
would have to decide on an algorithm for 'chunking' node loss, so that 
you could decide when a fragmentation had occurred...

perhaps a hybrid of the two would be able to cover more bases... - 
shared-nothing falling back to shared-something if your fragment is 
sized N/2.

As far as my plans for WADI, I think I am happy to stick with the, 'rely 
on affinity and keep going' approach.

As far as situations where a distributed object may have more than one 
client, I can see that quorum offers the hope of a solution, but, 
without some very careful thought, I would still be hesitant to stake my 
shirt on it :-) for the reasons given above...

I hadn't really considered 'pausing' a cluster fragment, so this is a 
useful idea. I guess that I have been thinking more in terms of 
long-lived fractures, rather than short-lived ones. If the latter are 
that much more common, then this is great input and I need to take it 
into account.

The issue about 'chunking' node loss interests me... I see that the 
EVS4J Listener returns a set of members, so it is possible to express 
the loss of more than one node. How is membership decided and node loss 
aggregated ?

Thanks again for your time,


Jules

>  
>
>>Do you think that we need to worry about situations where a piece of
>>state has more than one client, so a network partition may result in two
>>copies diverging in different and incompatible directions, rather than
>>only one diverging.
>>    
>>
>
>If you use a quorum or quorum-resource as above you do not have this
>problem. You can turn down the requests or let them block until the
>cluster re-discovers the 'failed' nodes.
>
>  
>
>>I can imagine this happening in an Entity Bean (but
>>we should be able to use the DB to resolve this) or an application POJO.
>>I haven't considered the latter case and it looks pretty hopeless to me,
>>unless you have some alternative route over which the two fragments can
>>communicate... but then, if you did, would you not pair it with your
>>original network, so that the one failed over to the other or replicated
>>its activity, so that you never perceived a split in the first place ?
>>Is this a common solution, or do people use other mechanisms here ?
>>    
>>
>
>I do believe that membership and quorum is all you need.
>
>Guglielmo
>  
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.

On Tue, 17 Jan 2006, Jules Gosnell wrote:

> just when you thought that this thread would die :-)

I think Jeff Genender wanted a discussion to be sparked, and it worked.

> So, I am wondering how might I use e.g. a shared disc or majority voting
> in this situation ? In order to decide which fragment was the original
> cluster and which was the piece that had broken off ? but then what
> would the piece that had broken off do ? shutdown ?

Wait to rejoin the cluster. Since it is not "the" cluster, it waits. It is
not safe to make any updates.

_How_ a group decides it is "the" cluster can be done in several ways.
A shared-disk cluster can do this by a locking operation on a disk (I would
have to research the details on this), a cluster with a database can get a lock
from the database (and keep the connection open). And one way to do this
in a shared-nothing cluster is to use a quorum of N/2 + 1, where N is the
maximum number of nodes. Clearly it has to be the majority or else you can
have a split-brain cluster.
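A minimal sketch of that rule in Java; note that N is the configured
maximum, fixed up front, not the size of the current view:

class QuorumPolicy {
    private final int maxNodes;   // N, fixed at configuration time

    QuorumPolicy(int maxNodes) { this.maxNodes = maxNodes; }

    // Only a fragment holding a strict majority may act as "the" cluster;
    // any smaller fragment waits, read-only, until it rejoins.
    boolean hasQuorum(int membersInView) {
        return membersInView >= (maxNodes / 2) + 1;
    }
}

For example, with five machines new QuorumPolicy(5).hasQuorum(3) is true,
so a three-node fragment proceeds while a two-node fragment waits.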

> Do you think that we need to worry about situations where a piece of
> state has more than one client, so a network partition may result in two
> copies diverging in different and incompatible directions, rather than
> only one diverging.

If you use a quorum or quorum-resource as above you do not have this
problem. You can turn down the requests or let them block until the
cluster re-discovers the 'failed' nodes.

> I can imagine this happening in an Entity Bean (but
> we should be able to use the DB to resolve this) or an application POJO.
> I haven't considered the latter case and it looks pretty hopeless to me,
> unless you have some alternative route over which the two fragments can
> communicate... but then, if you did, would you not pair it with your
> original network, so that the one failed over to the other or replicated
> its activity, so that you never perceived a split in the first place ?
> Is this a common solution, or do people use other mechanisms here ?

I do believe that membership and quorum is all you need.

Guglielmo

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
just when you thought that this thread would die :-)

So, Guglielmo,

in an earlier posting on this thread you said "BTW, how does AC defend 
against the problem of a split-brain cluster?
Shared scsi disk? Majority voting? Curious."

So, I am wondering how might I use e.g. a shared disc or majority voting 
in this situation ? In order to decide which fragment was the original 
cluster and which was the piece that had broken off ? but then what 
would the piece that had broken off do ? shutdown ?

Do you think that we need to worry about situations where a piece of 
state has more than one client, so a network partition may result in two 
copies diverging in different and incompatible directions, rather than 
only one diverging. I can imagine this happening in an Entity Bean (but 
we should be able to use the DB to resolve this) or an application POJO. 
I haven't considered the latter case and it looks pretty hopeless to me, 
unless you have some alternative route over which the two fragments can 
communicate... but then, if you did, would you not pair it with your 
original network, so that the one failed over to the other or replicated 
its activity, so that you never perceived a split in the first place ? 
Is this a common solution, or do people use other mechanisms here ?

thanks again for your time,


Jules


lichtner wrote:

>On Tue, 17 Jan 2006, Jules Gosnell wrote:
>
>  
>
>>>I believe that if you put some spare capacity in your cluster you will get
>>>good availability. For example, if your minimum R is 2 and the normal
>>>operating value is 4, when a node fails you will not be frantically doing
>>>state transfer.
>>>
>>>
>>>      
>>>
>>OK - so your system is a little more relaxed about the exact number of
>>replicants. You specify upper and lower bounds rather than an absolute
>>number, then you move towards the upper bound when you have the capacity?
>>    
>>
>
>That's the idea. It's a bit like having hot spares, but all nodes are
>treated on the same footing.
>
>  
>
>>>I would also just send a redirect. I don't think it's worth relocating a
>>>session.
>>>
>>>      
>>>
>>If you can communicate the session's location to the load-balancer, then
>>I agree, but some load-balancers are pretty dumb :-)
>>    
>>
>
>I see .. I was hoping somebody was not going to say that. Even so, it
>depends on the latency of the request when it is actually redirected. After
>all, this only happens after a failure. But no matter, you can also move the
>session over.
>
>Guglielmo
>  
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.

On Tue, 17 Jan 2006, Jules Gosnell wrote:

> >I believe that if you put some spare capacity in your cluster you will get
> >good availability. For example, if your minimum R is 2 and the normal
> >operating value is 4, when a node fails you will not be frantically doing
> >state transfer.
> >
> >
> OK - so your system is a little more relaxed about the exact number of
> replicants. You specify upper and lower bounds rather than an absolute
> number, then you move towards the upper bound when you have the capacity?

That's the idea. It's a bit like having hot spares, but all nodes are
treated on the same footing.

> >I would also just send a redirect. I don't think it's worth relocating a
> >session.
> >
> If you can communicate the session's location to the load-balancer, then
> I agree, but some load-balancers are pretty dumb :-)

I see .. I was hoping somebody was not going to say that. Even so, it
depends on the latency of the request when it is actually redirected. After
all, this only happens after a failure. But no matter, you can also move the
session over.

Guglielmo

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
lichtner wrote:

>On Mon, 16 Jan 2006, Jules Gosnell wrote:
>
>  
>
>>>2. When an HTTP request arrives, if the cluster which received it does not
>>>have R copies then it blocks (it waits until there are.) This should work in
>>>data centers because partitions are likely to be very short-lived (aka
>>>virtual partitions, which are due to congestion, not to any hardware
>>>issue.)
>>>
>>>
>>>      
>>>
>>Interesting. I was intending to actively repopulate the cluster
>>fragment, as soon as the split was detected. I figure that
>>- the longer that sessions spend without their full complement of
>>backups, the more likely that a further failure may result in data loss.
>>- the split is an exceptional circumstance at which you would expect to
>>pay an exceptional cost (regenerating missing primaries from backups and
>>vice-versa)
>>
>>by waiting for a request to arrive for a session before ensuring it has
>>its correct complement of backups, you extend the time during which it
>>is 'at risk'. By doing this 'lazily', you will also have to perform an
>>additional check on every request arrival, which you would not have to
>>do if you had regenerated missing state at the point that you noticed
>>the split.
>>    
>>
>
>Actually I didn't mean to say that you should do it lazily. You most
>definitely do it aggressively, but I would not try to do _all_ the state
>transfer ASAP, because this can kill availability.
>  
>
Ah - OK, my misunderstanding - so you do it aggressively but there is 
still the possibility of a request arriving before you have finished 
regenerating, so you handle that by holding it up - got you. I agree.

>If I had to do the state transfer using totem I would use priority queues,
>so that you know that while the system is doing state transfer it is still
>operating at, say, 80% efficiency.
>
>It was not about lazy vs. greedy.
>
>I believe that if you put some spare capacity in your cluster you will get
>good availability. For example, if your minimum R is 2 and the normal
>operating value is 4, when a node fails you will not be frantically doing
>state transfer.
>  
>
OK - so your system is a little more relaxed about the exact number of 
replicants. You specify upper and lower bounds rather than an absolute 
number, then you move towards the upper bound when you have the capacity?

>  
>
>>>3. If at any time an HTTP request reaches a server which does not have
>>>itself a replica of the session it sends a client redirect to a node which does.
>>>
>>>
>>>      
>>>
>>WADI can relocate request to session, as you suggest (via redirect or
>>proxy), or session to request, by migration. Relocation of request
>>should scale better since requests are generally smaller and, in the web
>>tier, may run concurrently through the same session, whereas sessions
>>are generally larger and may only be migrated serially (since only one
>>copy at a time may be 'active').
>>    
>>
>
>I would also just send a redirect. I don't think it's worth relocating a
>session.
>  
>
If you can communicate the session's location to the load-balancer, then 
I agree, but some load-balancers are pretty dumb :-)

>  
>
>>>and possibly migration of some sessions for
>>>proper load balancing.
>>>
>>>
>>>      
>>>
>>forcing the balancing of state around the cluster is something that I
>>have considered with WADI, but not yet tried to implement. The type of
>>load-balancer that is being used has a big impact here. If you cannot
>>communicate a change of session location satisfactorily to the Http load
>>balancer, then you have to just go with wherever it decides a session is
>>located.... With SFSBs we should have much more control at the client
>>side, so this becomes a real option.
>>    
>>
>
>In my opinion load balancing is not something that a cluster api can
>address effectively. Half the problem is evaluating how busy the system is
>in the first place.
>
>  
>
agreed

>>all in all, though, it sounds like we see pretty much eye to eye :-)
>>    
>>
>
>Better than the other way ..
>
>  
>
>>the lazy partition regeneration is an interesting idea and this is the
>>second time it has been suggested to me, so I will give it some serious
>>thought.
>>    
>>
>
>Again, I wasn't advocating lazy state transfer. But perhaps it has
>applications somewhere.
>
>  
>
understood - and I think a hybrid approach will probably just incur the 
costs of both the other approaches - but I may still kick it around.


Jules

>>Thanks for taking the time to share your thoughts,
>>    
>>
>
>No problem.
>  
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.

On Mon, 16 Jan 2006, Jules Gosnell wrote:

> >2. When an HTTP request arrives, if the cluster which received it does not
> >have R copies then it blocks (it waits until there are.) This should work in
> >data centers because partitions are likely to be very short-lived (aka
> >virtual partitions, which are due to congestion, not to any hardware
> >issue.)
> >
> >
> Interesting. I was intending to actively repopulate the cluster
> fragment, as soon as the split was detected. I figure that
> - the longer that sessions spend without their full complement of
> backups, the more likely that a further failure may result in data loss.
> - the split is an exceptional circumstance at which you would expect to
> pay an exceptional cost (regenerating missing primaries from backups and
> vice-versa)
>
> by waiting for a request to arrive for a session before ensuring it has
> its correct complement of backups, you extend the time during which it
> is 'at risk'. By doing this 'lazily', you will also have to perform an
> additional check on every request arrival, which you would not have to
> do if you had regenerated missing state at the point that you noticed
> the split.

Actually I didn't mean to say that you should do it lazily. You most
definitely do it aggressively, but I would not try to do _all_ the state
transfer ASAP, because this can kill availability.

If I had to do the state transfer using totem I would use priority queues,
so that you know that while the system is doing state transfer it is still
operating at, say, 80% efficiency.
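
For illustration, a minimal sketch of such a two-priority outbound queue
(the names are hypothetical and evs4j's actual send path may differ;
strict precedence is shown here, so holding state transfer to a fixed
budget like 80% would need a rate limiter on top):

    import java.util.concurrent.PriorityBlockingQueue;
    import java.util.concurrent.atomic.AtomicLong;

    class PrioritizedSender {
        static final int FOREGROUND = 0;     // request and lock traffic
        static final int STATE_TRANSFER = 1; // background replica repair

        private static final AtomicLong SEQ = new AtomicLong();

        static class Outbound implements Comparable<Outbound> {
            final int priority;
            final long seq = SEQ.incrementAndGet(); // FIFO within a priority
            final byte[] payload;
            Outbound(int priority, byte[] payload) {
                this.priority = priority;
                this.payload = payload;
            }
            public int compareTo(Outbound o) {
                if (priority != o.priority)
                    return Integer.compare(priority, o.priority);
                return Long.compare(seq, o.seq);
            }
        }

        private final PriorityBlockingQueue<Outbound> queue =
                new PriorityBlockingQueue<Outbound>();

        void enqueue(int priority, byte[] payload) {
            queue.put(new Outbound(priority, payload));
        }

        // Called whenever the protocol can put the next message on the wire:
        // state transfer only proceeds when no foreground traffic is waiting.
        byte[] next() throws InterruptedException {
            return queue.take().payload;
        }
    }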

It was not about lazy vs. greedy.

I believe that if you put some spare capacity in your cluster you will get
good availability. For example, if your minimum R is 2 and the normal
operating value is 4, when a node fails you will not be frantically doing
state transfer.

> >3. If at any time an HTTP request reaches a server which does not itself
> >have a replica of the session it sends a client redirect to a node which does.
> >
> >
> WADI can relocate request to session, as you suggest (via redirect or
> proxy), or session to request, by migration. Relocation of request
> should scale better since requests are generally smaller and, in the web
> tier, may run concurrently through the same session, whereas sessions
> are generally larger and may only be migrated serially (since only one
> copy at a time may be 'active').

I would also just send a redirect. I don't think it's worth relocating a
session.

> > and possibly migration of some sessions for
> >proper load balancing.
> >
> >
> forcing the balancing of state around the cluster is something that I
> have considered with WADI, but not yet tried to implement. The type of
> load-balancer that is being used has a big impact here. If you cannot
> communicate a change of session location satisfactorily to the Http load
> balancer, then you have to just go with wherever it decides a session is
> located.... With SFSBs we should have much more control at the client
> side, so this becomes a real option.

In my opinion load balancing is not something that a cluster api can
address effectively. Half the problem is evaluating how busy the system is
in the first place.

> all in all, though, it sounds like we see pretty much eye to eye :-)

Better than the other way ..

> the lazy partition regeneration is an interesting idea and this is the
> second time it has been suggested to me, so I will give it some serious
> thought.

Again, I wasn't advocating lazy state transfer. But perhaps it has
applications somewhere.

> Thanks for taking the time to share your thoughts,

No problem.

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
lichtner wrote:

>On Mon, 16 Jan 2006, Jules Gosnell wrote:
>
>  
>
>>REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...
>>    
>>
>
>I figured. I imagine that if I had to add this distinction to totem I
>would add a message where the node in question announces that it is
>leaving, and then stops forwarding the token. On the other hand, it does
>not need to announce anything, and the other nodes will detect that it
>left. In fact totem does not judge a node either way: you can leave
>because you want to or under duress, and the consequences as far as
>distributed algorithms are concerned are probably minimal. I think the
>only place where this might matter is for logging purposes (but that
>could be handled at the application level) or to speed up the membership
>protocol, although it's already pretty fast.
>
>So I would not draw a distinction there.
>
>  
>
>>By also treating nodes joining, leaving and dying, as split and merge
>>operations I can reduce the number of cases that I have to deal with.
>>    
>>
>
>I would even add that the difference is known only to the application.
>
>  
>
>>and ensure that what might be very uncommonly run code (run on network
>>partition/healing) is the same code that is commonly run on e.g. node
>>join/leave - so it is likely to be more robust.
>>    
>>
>
>Sounds good.
>
>  
>
>>In the case of a binary split, I envisage two sets of nodes losing
>>contact with each other. Each cluster fragment will repair its internal
>>structure. I expect that after this repair, neither fragment will carry
>>a complete copy of the cluster's original state (unless we are
>>replicating 1->all, which WADI will usually not do), rather, the two
>>datasets will intersect and their union will be the original dataset.
>>Replicated state will carry a version number.
>>    
>>
>
>I think a version number should work very well.
>
>  
>
>>If client affinity survives the split (i.e. clients continue to talk to
>>the same nodes), then we should find ourselves in a working state, with
>>two smaller clusters carrying overlapping and diverging state. Each
>>piece of state should be static in one subcluster and divergent in the
>>other (it has only one client). The version carried by each piece of
>>state may be used to decide which is the most recent version.
>>
>>(If client affinity is not maintained, then, without a backchannel of
>>some sort, we are in trouble).
>>
>>When a merge occurs, WADI will be able to merge the internal
>>representations of the participants, delegating awkward decisions about
>>divergent state to deploy-time pluggable algorithms. Hopefully, each
>>piece of state will only have diverged in one cluster fragment so
>>choosing which copy to go forward with will be trivial.
>>    
>>
>
>  
>
>>A node death can just be thought of as a 'split' which never 'merges'.
>>    
>>
>
>Definitely :)
>
>  
>
>>Of course, multiple splits could occur concurrently and merging them is
>>a little more complicated than I may have implied, but I am getting
>>there....
>>    
>>
>
>Although I consider the problem of session replication less than
>glamorous, since it is at hand, I would approach it this way:
>
>1. The user should configure a minimum-degree-of-replication R. This is
>the number of replicas of a specific session which need to be available in
>order for an HTTP request to be serviced.
>  
>
WADI's current architecture is similar. I would specify, e.g., a 
max-num-of-backups. Provided that the cluster has sufficient members, 
each session would maintain this number of backup copies.

The fact that one session is the primary and the others are backups is 
an important distinction that WADI makes. Making/refreshing session 
backups involves serialisation. Serialisation of an HttpSession may 
involve the notification of passivation to components within the 
session. It is important that only one copy of the session thinks that 
it is the 'active' copy at any one time.

>2. When an HTTP request arrives, if the cluster which received it does not
>have R copies then it blocks (it waits until there are.) This should work in
>data centers because partitions are likely to be very short-lived (aka
>virtual partitions, which are due to congestion, not to any hardware
>issue.)
>  
>
Interesting. I was intending to actively repopulate the cluster 
fragment, as soon as the split was detected. I figure that
- the longer that sessions spend without their full complement of 
backups, the more likely that a further failure may result in data loss.
- the split is an exceptional circumstance at which you would expect to 
pay an exceptional cost (regenerating missing primaries from backups and 
vice-versa)

by waiting for a request to arrive for a session before ensuring it has 
its correct complement of backups, you extend the time during which it 
is 'at risk'. By doing this 'lazily', you will also have to perform an 
additional check on every request arrival, which you would not have to 
do if you had regenerated missing state at the point that you noticed 
the split.

having said this, if splits are generally short-lived then I can see that 
this approach would save a lot of cycles.

maybe there is an intermediate approach ? after a split is detected, you 
react lazily for a while, then you decide that the problem is not 
short-lived and regenerate the remaining missing structure in your 
cluster fragment.

>3. If at any time an HTTP request reaches a server which does not itself
>have a replica of the session it sends a client redirect to a node which does.
>  
>
WADI can relocate request to session, as you suggest (via redirect or 
proxy), or session to request, by migration. Relocation of request 
should scale better since requests are generally smaller and, in the web 
tier, may run concurrently through the same session, whereas sessions 
are generally larger and may only be migrated serially (since only one 
copy at a time may be 'active').

>4. When a new cluster is formed (with nodes coming or going), it takes an
>inventory of all the sessions and their version numbers. Sessions which do
>not have the necessary degree of replication need to be fixed, which will
>require some state transfer,
>
yes - agreed

> and possibly migration of some sessions for
>proper load balancing.
>  
>
forcing the balancing of state around the cluster is something that I 
have considered with WADI, but not yet tried to implement. The type of 
load-balancer that is being used has a big impact here. If you cannot 
communicate a change of session location satisfactorily to the Http load 
balancer, then you have to just go with wherever it decides a session is 
located.... With SFSBs we should have much more control at the client 
side, so this becomes a real option.

all in all, though, it sounds like we see pretty much eye to eye :-)

the lazy partition regeneration is an interesting idea and this is the 
second time it has been suggested to me, so I will give it some serious 
thought.

Thanks for taking the time to share your thoughts,


Jules

>Guglielmo
>  
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.

On Mon, 16 Jan 2006, Jules Gosnell wrote:

> REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...

I figured. I imagine that if I had to add this distinction to totem I
would add a message where the node in question announces that it is
leaving, and then stops forwarding the token. On the other hand, it does
not need to announce anything, and the other nodes will detect that it
left. In fact totem does not judge a node either way: you can leave
because you want to or under duress, and the consequences as far as
distributed algorithms are concerned are probably minimal. I think the
only place where this might matter is for logging purposes (but that
could be handled at the application level) or to speed up the membership
protocol, although it's already pretty fast.

So I would not draw a distinction there.

> By also treating nodes joining, leaving and dying, as split and merge
> operations I can reduce the number of cases that I have to deal with.

I would even add that the difference is known only to the application.

> and ensure that what might be very uncommonly run code (run on network
> partition/healing) is the same code that is commonly run on e.g. node
> join/leave - so it is likely to be more robust.

Sounds good.

> In the case of a binary split, I envisage two sets of nodes losing
> contact with each other. Each cluster fragment will repair its internal
> structure. I expect that after this repair, neither fragment will carry
> a complete copy of the cluster's original state (unless we are
> replicating 1->all, which WADI will usually not do), rather, the two
> datasets will intersect and their union will be the original dataset.
> Replicated state will carry a version number.

I think a version number should work very well.

> If client affinity survives the split (i.e. clients continue to talk to
> the same nodes), then we should find ourselves in a working state, with
> two smaller clusters carrying overlapping and diverging state. Each
> piece of state should be static in one subcluster and divergent in the
> other (it has only one client). The version carried by each piece of
> state may be used to decide which is the most recent version.
>
> (If client affinity is not maintained, then, without a backchannel of
> some sort, we are in trouble).
>
> When a merge occurs, WADI will be able to merge the internal
> representations of the participants, delegating awkward decisions about
> divergent state to deploy-time pluggable algorithms. Hopefully, each
> piece of state will only have diverged in one cluster fragment so
> choosing which copy to go forward with will be trivial.

> A node death can just be thought of as a 'split' which never 'merges'.

Definitely :)

> Of course, multiple splits could occur concurrently and merging them is
> a little more complicated than I may have implied, but I am getting
> there....

Although I consider the problem of session replication less than
glamorous, since it is at hand, I would approach it this way:

1. The user should configure a minimum-degree-of-replication R. This is
the number of replicas of a specific session which need to be available in
order for an HTTP request to be serviced.

2. When an HTTP request arrives, if the cluster which received it does not
have R copies then it blocks (it waits until there are.) This should work in
data centers because partitions are likely to be very short-lived (aka
virtual partitions, which are due to congestion, not to any hardware
issue.)

3. If at any time an HTTP request reaches a server which does not itself
have a replica of the session it sends a client redirect to a node which does.

4. When a new cluster is formed (with nodes coming or going), it takes an
inventory of all the sessions and their version numbers. Sessions which do
not have the necessary degree of replication need to be fixed, which will
require some state transfer, and possibly migration of some sessions for
proper load balancing.
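
For illustration, points 2 and 3 might be sketched as a servlet filter
like the one below. SessionRegistry and its methods are hypothetical
stand-ins for a cluster-wide view of replica placement; no WADI or AC
API is implied:

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.*;

    // Hypothetical cluster-wide view of where session replicas live.
    interface SessionRegistry {
        boolean hasLocalReplica(String sessionId);
        String urlOfReplicaHolder(String sessionId);
        void awaitReplicas(String sessionId, int r); // blocks until R copies
    }

    public class ReplicaGuardFilter implements Filter {
        private SessionRegistry registry;
        private int r; // minimum degree of replication, R

        public void init(FilterConfig cfg) {
            registry = (SessionRegistry)
                    cfg.getServletContext().getAttribute("sessionRegistry");
            String p = cfg.getInitParameter("minReplicas");
            r = (p != null) ? Integer.parseInt(p) : 2;
        }

        public void doFilter(ServletRequest req, ServletResponse res,
                             FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest hreq = (HttpServletRequest) req;
            String id = hreq.getRequestedSessionId();
            if (id != null) {
                if (!registry.hasLocalReplica(id)) {
                    // Point 3: this node has no replica; bounce the client.
                    ((HttpServletResponse) res)
                            .sendRedirect(registry.urlOfReplicaHolder(id));
                    return;
                }
                // Point 2: block until R copies exist again; partitions in
                // a data center are expected to be short-lived.
                registry.awaitReplicas(id, r);
            }
            chain.doFilter(req, res);
        }

        public void destroy() {}
    }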

Guglielmo

Work

Posted by lichtner <li...@bway.net>.
I am actually looking for another job/contract right now (in the San Diego
area, or I can telecommute), so I thought I would mention it in case
anybody knows of any openings.

Guglielmo


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.
On the subject of partitions, I remembered this paper I read a few years
ago which shows that partitions, whether caused by hardware failures or by
heavy traffic, are a fact of life:

"Understanding Partitions and the 'No Partition' Assumption"
A. Ricciardi et al.

http://citeseer.ist.psu.edu/32449.html

Re: Replication using totem protocol

Posted by Jules Gosnell <ju...@coredevelopers.net>.
lichtner wrote:

>As Jules requested I am looking at the AC api. I report my observations
>below:
>
>ClusterEvent appears to represent membership-related events. These you
>can generate from evs4j, as follows: write an adapter that implements
>evs4j.Listener. In the onConfiguration(..) method you get notified of
>new configurations (new groups). You can generate ClusterEvent.ADD_NODE
>etc. by doing a diff of the old configuration and the new one.
>Evs4j does not support arbitrary algorithms for electing coordinators.
>In fact, in totem there is no coordinator. If a specific election is
>important for you, you can design one using totem's messages. If not,
>in evs4j node names are integers, so the coordinator can be the lowest
>integer. This is checked by evs4j.Configuration.getCoordinator().
>
>I don't know the difference between REMOVE_NODE and FAILED_NODE. In totem
>there is no difference between the two.
>  
>
REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...

I don't think REMOVE_NODE is actually tied to an Event in the external 
API....

>The only other class I think I need to comment on is Cluster. It
>resembles a jms session, even being coupled to actual jms interfaces. You
>can definitely implement producers and consumers and put them on top of
>evs4j. The method send(Destination, Message) would have to encode Message
>on top of fixed-length evs4j messages. No problem here.
>
>Personally, I would not have mixed half the jms api with an original api.
>I don't think it sends a good message as far as risk management goes. I
>think people are prepared to deal with a product that says 'we assume jms'
>or 'we are completely home-grown because we are so much better', but not a
>mix of the two. Anyway that's not for me to say. Whatever works.
>  
>
I'll leave this one to James.

>In conclusion, yes, I think you could build an implementation of AC on top
>of evs4j.
>
>BTW, how does AC defend against the problem of a split-brain cluster?
>Shared scsi disk? Majority voting? Curious.
>  
>
Well, I think AC's approach is that it is an app-space problem - but 
James may care to comment.

As an AC user I am considering the following approach for WADI 
(HttpSession and SFSB clustering solution).

(1) simplifying the notifications that I might get from a cluster to the 
following :
- Split : 0-N nodes have left the cluster
- Merge : 0-N nodes have joined the cluster
- Change : A node has updated its public, distributed state

'Split' should now be generic enough to encompass the following common 
cases:

- a node leaving cleanly (having evacuated its state, therefore carrying 
NO state) [clean node shutdown]
- a node dying (therefore carrying state) [catastrophic failure]
- a group of nodes falling out of contact (still carrying state) 
[network partition]

'Join' can encompass:

- a new node joining (therefore carrying NO state).
- a group of nodes coming back into contact after a split (carrying 
state that needs to be merged) [network healing]


'Change'

- same as in AC - each node makes public, by distribution, a small amount 
of data; this is republished each time it is updated.


By also treating nodes joining, leaving and dying, as split and merge 
operations I can reduce the number of cases that I have to deal with, 
and ensure that what might be very uncommonly run code (run on network 
partition/healing) is the same code that is commonly run on e.g. node 
join/leave - so it is likely to be more robust.
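
A hypothetical shape for these three notifications, just to make the
scheme concrete (WADI's real interfaces may well differ):

    import java.util.Map;
    import java.util.Set;

    interface ClusterTopologyListener {
        // 0-N nodes have left: clean shutdown, death, or network partition.
        void onSplit(Set<String> departed);

        // 0-N nodes have joined: fresh nodes, or a healed partition whose
        // members may carry state that needs to be merged.
        void onMerge(Set<String> joined);

        // A node has republished its small piece of public, distributed state.
        void onChange(String node, Map<String, Object> publishedState);
    }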

In the case of a binary split, I envisage two sets of nodes losing 
contact with each other. Each cluster fragment will repair its internal 
structure. I expect that after this repair, neither fragment will carry 
a complete copy of the cluster's original state (unless we are 
replicating 1->all, which WADI will usually not do), rather, the two 
datasets will intersect and their union will be the original dataset. 
Replicated state will carry a version number.

If client affinity survives the split (i.e. clients continue to talk to 
the same nodes), then we should find ourselves in a working state, with 
two smaller clusters carrying overlapping and diverging state. Each 
piece of state should be static in one subcluster and divergent in the 
other (it has only one client). The version carried by each piece of 
state may be used to decide which is the most recent version.

(If client affinity is not maintained, then, without a backchannel of 
some sort, we are in trouble).

When a merge occurs, WADI will be able to merge the internal 
representations of the participants, delegating awkward decisions about 
divergent state to deploy-time pluggable algorithms. Hopefully, each 
piece of state will only have diverged in one cluster fragment so 
choosing which copy to go forward with will be trivial.
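
A minimal sketch of that version-based choice, with the genuinely
diverged case handed to the deploy-time pluggable policy just described
(illustrative names only, not WADI's API):

    class VersionedState {
        final long version;
        final byte[] data;
        VersionedState(long version, byte[] data) {
            this.version = version;
            this.data = data;
        }
    }

    // Deploy-time pluggable policy for the genuinely-diverged case.
    interface DivergenceResolver {
        VersionedState resolve(VersionedState a, VersionedState b);
    }

    class MergePolicy {
        private final DivergenceResolver resolver;

        MergePolicy(DivergenceResolver resolver) {
            this.resolver = resolver;
        }

        VersionedState merge(VersionedState local, VersionedState remote) {
            if (local == null) return remote;   // only one fragment had it
            if (remote == null) return local;
            if (local.version != remote.version)
                // The common case: only one fragment's copy advanced.
                return local.version > remote.version ? local : remote;
            // Equal versions with possibly different bytes: both fragments
            // may have diverged from the same base; defer to the resolver.
            return resolver.resolve(local, remote);
        }
    }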

A node death can just be thought of as a 'split' which never 'merges'.

Of course, multiple splits could occur concurrently and merging them is 
a little more complicated than I may have implied, but I am getting 
there....

WADI's approach should work for HttpSession and SFSB, where there is a 
single client who will be talking to a single node. In the case of some 
other type, where clients for the same resource may end up in different 
cluster fragments, this approach will be insufficient.

I would be very interested in hearing your thoughts on the subject,


Jules

>Guglielmo
>  
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Replication using totem protocol

Posted by lichtner <li...@bway.net>.
As Jules requested I am looking at the AC api. I report my observations
below:

ClusterEvent appears to represent membership-related events. These you
can generate from evs4j, as follows: write an adapter that implements
evs4j.Listener. In the onConfiguration(..) method you get notified of
new configurations (new groups). You can generate ClusterEvent.ADD_NODE
etc. by doing a diff of the old configuration and the new one.
Evs4j does not support arbitrary algorithms for electing coordinators.
In fact, in totem there is no coordinator. If a specific election is
important for you, you can design one using totem's messages. If not,
in evs4j node names are integers, so the coordinator can be the lowest
integer. This is checked by evs4j.Configuration.getCoordinator().
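
A minimal sketch of that adapter (evs4j.Listener, ClusterEvent.ADD_NODE
and Configuration.getCoordinator() are as described above; the callback
and accessor shapes here are assumptions):

    import java.util.HashSet;
    import java.util.Set;

    class MembershipDiffAdapter {
        // Stand-in for firing ClusterEvent.ADD_NODE / REMOVE_NODE.
        interface Events {
            void addNode(int node);
            void removeNode(int node);
        }

        private Set<Integer> current = new HashSet<Integer>();
        private final Events events;

        MembershipDiffAdapter(Events events) {
            this.events = events;
        }

        // Would be driven from evs4j.Listener.onConfiguration(..): diff the
        // old membership against the new one and fire the difference.
        void onConfiguration(Set<Integer> newMembers) {
            for (int n : newMembers)
                if (!current.contains(n)) events.addNode(n);
            for (int n : current)
                if (!newMembers.contains(n)) events.removeNode(n);
            current = new HashSet<Integer>(newMembers);
        }

        // Node names are integers, so the coordinator can simply be the
        // lowest one, as Configuration.getCoordinator() does.
        int coordinator() {
            int min = Integer.MAX_VALUE;
            for (int n : current) min = Math.min(min, n);
            return min;
        }
    }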

I don't know the difference between REMOVE_NODE and FAILED_NODE. In totem
there is no difference between the two.

The only other class I think I need to comment on is Cluster. It
resembles a jms session, even being coupled to actual jms interfaces. You
can definitely implement producers and consumers and put them on top of
evs4j. The method send(Destination, Message) would have to encode Message
on top of fixed-length evs4j messages. No problem here.

Personally, I would not have mixed half the jms api with an original api.
I don't think it sends a good message as far as risk management goes. I
think people are prepared to deal with a product that says 'we assume jms'
or 'we are completely home-grown because we are so much better', but not a
mix of the two. Anyway that's not for me to say. Whatever works.

In conclusion, yes, I think you could build an implementation of AC on top
of evs4j.

BTW, how does AC defend against the problem of a split-brain cluster?
Shared scsi disk? Majority voting? Curious.

Guglielmo

Re: Replication using totem protocol

Posted by li...@bway.net.
> Interesting.  Can you suggest a protocol we should use for
> pessimistic distributed locking?   I expect the cluster size to be
> between 2-16 nodes with the sweet spot at 4 nodes.   Each node will
> be processing about 500-1000 tps and each transaction will require on average
> about 1-4 lock requests (most likely just one request for the web
> session).  Nodes should be able to join and leave the cluster easily.

If you must be a pessimist, then get shared locks for reads, exclusive
locks for writes, two locks conflicting if at least one of them is an
exclusive lock. Hold the locks acquired until after commit (strict 2pl).

To get a lock you send a totem message and wait for it to arrive. A few ms.
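
For illustration, a minimal sketch of the lock table every node would
maintain, applying requests in totem delivery order so all the tables
agree. The actual totem send and self-delivery wait, and the queueing of
denied requests, are omitted, and all names are hypothetical:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class TotemLockTable {
        enum Mode { SHARED, EXCLUSIVE }

        static class LockState {
            final Set<String> sharedHolders = new HashSet<String>();
            String exclusiveHolder; // null if none
        }

        private final Map<String, LockState> locks =
                new HashMap<String, LockState>();

        // Called on every node when a lock message is delivered in total
        // order. Two requests conflict iff at least one is exclusive.
        synchronized boolean deliver(String resource, String node, Mode mode) {
            LockState s = locks.get(resource);
            if (s == null) {
                s = new LockState();
                locks.put(resource, s);
            }
            if (s.exclusiveHolder != null) return false;
            if (mode == Mode.EXCLUSIVE) {
                if (!s.sharedHolders.isEmpty()) return false;
                s.exclusiveHolder = node;
            } else {
                s.sharedHolders.add(node);
            }
            return true; // the requester unblocks when it sees its own grant
        }

        // Strict 2PL: everything is released only after commit.
        synchronized void releaseAll(String node) {
            for (LockState s : locks.values()) {
                s.sharedHolders.remove(node);
                if (node.equals(s.exclusiveHolder)) s.exclusiveHolder = null;
            }
        }
    }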

The latency for 4 nodes should be very respectable. For 16 nodes it might
still be acceptable. I would measure the throughput/latency curve in your
lab and based on that you can decide at what point you need something more
sophisticated (which for me would be an independent replicated lock
manager which can be reached through short tcp messages and some basic
load balancing.)

This paper actually shows some simulations of various concurrency control
protocols, so you can make an educated decision:

http://citeseer.ist.psu.edu/299097.html

Guglielmo



Re: Replication using totem protocol

Posted by Dain Sundstrom <da...@iq80.com>.
On Jan 12, 2006, at 3:43 PM, lichtner@bway.net wrote:

>> No.  You license the code to the Apache Software Foundation giving
>> the foundation the rights to relicense under any license (so the
>> foundation can upgrade the license as they did with ASL2).  We do ask
>> that you change the copyrights on the version of the code you give to
>> the ASF to something like "Copyright 2004 The Apache Software
>> Foundation or its licensors, as applicable."
>
> That _is_ transferring the copyright.
>
> As I told Jeff on the phone, I would definitely consider this if it
> turns out that evs4j will really be used, but I would rather not grant
> someone
> an unlimited license at the present time. Jeff said we are going to  
> have a
> discussion, so we'll know more soon enough.
>
>> Nothing better to do between jobs than coding :)
>
> You should see the next program I am writing ;)
>
>>> Also, what do you need the locks for?
>>
>> Locking web sessions and stateful session beans in the cluster when a
>> node is working on it.
>
> I see. I don't think I would pass the token around all the nodes  
> just for
> session replication. It's a low-sharing workload, meaning you could  
> have
> 50 servers but you only want 3 copies of a session, say.
>
> But you could write a high-available lock manager using totem, say,  
> with
> three copies of the system, and write a low-latency tcp-based  
> protocol to
> grab the lock. The time to get the lock would be the tcp round-trip  
> plus
> the time it takes for totem to send itself a 'safe' message, which on
> average takes 1.5 token rotations (as opposed to 0.5). And you would
> load-balance among the three copies. That would probably get a  
> latency of
> about 5 ms total to get a lock (just a gut feeling) and also  
> scalability.
> And you can always add more copies.

Interesting.  Can you suggest a protocol we should use for  
pessimistic distributed locking?   I expect the cluster size to be  
between 2-16 nodes with the sweet spot at 4 nodes.   Each node will  
be processing about 500-1000 tps and each transaction will require on average  
about 1-4 lock requests (most likely just one request for the web  
session).  Nodes should be able to join and leave the cluster easily.

-dain

Re: Replication using totem protocol

Posted by li...@bway.net.
> No.  You license the code to the Apache Software Foundation giving
> the foundation the rights to relicense under any license (so the
> foundation can upgrade the license as they did with ASL2).  We do ask
> that you change the copyrights on the version of the code you give to
> the ASF to something like "Copyright 2004 The Apache Software
> Foundation or its licensors, as applicable."

That _is_ transferring the copyright.

As I told Jeff on the phone, I would definitely consider this if it turns
out that evs4j will really be used, but I would rather not grant someone
an unlimited license at the present time. Jeff said we are going to have a
discussion, so we'll know more soon enough.

> Nothing better to do between jobs than coding :)

You should see the next program I am writing ;)

>> Also, what do you need the locks for?
>
> Locking web sessions and stateful session beans in the cluster when a
> node is working on it.

I see. I don't think I would pass the token around all the nodes just for
session replication. It's a low-sharing workload, meaning you could have
50 servers but you only want 3 copies of a session, say.

But you could write a high-available lock manager using totem, say, with
three copies of the system, and write a low-latency tcp-based protocol to
grab the lock. The time to get the lock would be the tcp round-trip plus
the time it takes for totem to send itself a 'safe' message, which on
average takes 1.5 token rotations (as opposed to 0.5). And you would
load-balance among the three copies. That would probably get a latency of
about 5 ms total to get a lock (just a gut feeling) and also scalability.
And you can always add more copies.

Guglielmo


Re: Replication using totem protocol

Posted by Dain Sundstrom <da...@iq80.com>.
On Jan 12, 2006, at 12:28 PM, lichtner@bway.net wrote:

> I didn't see it - I'm not sure why.
>
>> According to the website (http://www.bway.net/~lichtner/evs4j.html):
>>
>>      "Extended Virtual Synchrony for Java (EVS4J), an Apache-
>> Licensed, pure-Java implementation of the fastest known totally
>> ordered reliable multicast protocol."
>
> Yes, I wrote that.
>
>> Once you have a totally ordered messaging protocol, implementing a
>> distributed lock is trivial (I can go into detail if you want).
>
> Yes. You just send a totally-ordered message and wait for it to  
> arrive.
>
>> I suggest we ask Guglielmo if he would like to donate his
>> implementation to this incubator project
>
> I don't know about donating it. Who would they want me to transfer the
> copyright to?

No.  You license the code to the Apache Software Foundation giving  
the foundation the rights to relicense under any license (so the  
foundation can upgrade the license as they did with ASL2).  We do ask  
that you change the copyrights on the version of the code you give to  
the ASF to something like "Copyright 2004 The Apache Software  
Foundation or its licensors, as applicable."

>> and if he would like to work on a pessimistic distributed locking
> implementation.
>> What do you think?
>
> I would definitely like to work on it, but I still work for a  
> living, so
> that's something to think about. (I happen to be between jobs right  
> now.)

Nothing better to do between jobs than coding :)

> Also, what do you need the locks for?

Locking web sessions and stateful session beans in the cluster when a  
node is working on it.

-dain

Re: Fwd: Replication using totem protocol

Posted by li...@bway.net.
I didn't see it - I'm not sure why.

> According to the website (http://www.bway.net/~lichtner/evs4j.html):
>
>      "Extended Virtual Synchrony for Java (EVS4J), an Apache-
> Licensed, pure-Java implementation of the fastest known totally
> ordered reliable multicast protocol."

Yes, I wrote that.

> Once you have a totally ordered messaging protocol, implementing a
> distributed lock is trivial (I can go into detail if you want).

Yes. You just send a totally-ordered message and wait for it to arrive.

> I suggest we ask Guglielmo if he would like to donate his
> implementation to this incubator project

I don't know about donating it. Who would they want me to transfer the
copyright to?

> and if he would like to work on a pessimistic distributed locking
implementation.
> What do you think?

I would definitely like to work on it, but I still work for a living, so
that's something to think about. (I happen to be between jobs right now.)

Also, what do you need the locks for?

Guglielmo


Re: Infiniband

Posted by lichtner <li...@bway.net>.
On Fri, 13 Jan 2006, Alan D. Cabrera wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Do you have any references to the where one could get a peek at the
> transport API?

http://infiniband.sourceforge.net/


Re: Infiniband

Posted by "Alan D. Cabrera" <li...@toolazydogs.com>.
lichtner@bway.net wrote, On 1/13/2006 11:51 AM:
> With regard to clustering, I also want to mention a remote option, which
> is to use infiniband RDMA for inter-node communication.
> 
> With an infiniband link between two machines you can copy a buffer
> directly from the memory of one to the memory of the other, without
> switching context. This means the kernel scheduler is not involved at all,
> and there are no copies.
> 
> I think the bandwidth can be up to 30Gbps right now. Pathscale makes an IB
> adapter which plugs into the new HTX hypertransport slot, which is to say
> it bypasses the pci bus (!). They report an 8-byte message latency of 1.32
> microseconds.
> 
> I think IB costs about $500 per node. But the cost is going down steadily
> because the people who use IB typically buy thousands of network cards at
> a time (for supercomputers.)
> 
> The infiniband transport would be native code, so you could use JNI.
> However, it would definitely be worth it.

Do you have any references to the where one could get a peek at the 
transport API?


Regards,
Alan


Re: Infiniband

Posted by lichtner <li...@bway.net>.
I think I have found some information which if I had hardware available
would lead me to skip the prototyping stage entirely:

This paper benchmarks the performance of infiniband through 1) UDAPL and
2) Sockets Direct Protocol (SDP) - also available from openib.org:

"Sockets Direct Protocol over Infiniband Clusters: Is it Beneficial?"
Balaji et al.

I think the value of SDP over UDAPL is that it looks like a socket
interface, which means porting applications over could even be trivial.
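
That is the appeal of a sockets-level shim: code written against the
ordinary socket API runs unchanged, with SDP substituted underneath. A
trivial example (placeholder host and port; nothing here is
Infiniband-specific):

    import java.io.OutputStream;
    import java.net.Socket;

    class PlainSocketSender {
        public static void main(String[] args) throws Exception {
            // Ordinary TCP socket code; an SDP layer would carry the same
            // bytes over Infiniband without any source change here.
            Socket s = new Socket("peer.example.com", 9000);
            OutputStream out = s.getOutputStream();
            out.write("same code over SDP or TCP".getBytes());
            out.close();
            s.close();
        }
    }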

However, I think that SDP in that paper does not have zero copy, and that
is why the paper shows that UDAPL is faster.

However, Mellanox has a zero-copy SDP:

"Transparently Achieving Superior Socket Performance Using Zero Copy
Socket Direct Protocol over 20Gb/s Infiniband Links"
Goldenberg et al.

It's just enlightening to see the two main sources of waste in network
applications be removed one at a time, namely 1) context switches and 2)
copies.

Using zero-copy SDP from Java should be pretty easy, although interfacing
with UDAPL would also be valuable.

I have found evidence that Sun was planning to include support for Sockets
Direct Protocol in jdk 1.6, but that they gave up because infiniband is
not mainstream hardware (yet).

I think IBM may have put some of this in WebSphere 6:

http://domino.research.ibm.com/comm/research.nsf/pages/r.distributed.innovation.html?Open&printable

That would be just typical of IBM, understating or flat out
hiding important achievements. When Parallel Sysplex came out IBM did a
test with 100% data sharing, meaning _all_ reads and writes were remote
(to the CF), and measured the scalability, and it's basically linear, but
off the diagonal by (only) 13%. Instead of understanding that mainframes
now scaled horizontally, the press focused on the "overhead". This
prompted Lou Gerstner to remark that if he invited the press to his house
and showcased his dog walking on water, the press would report "Gerstner
buys dog that can't swim."

I think if I had a few thousand dollars to spare I would definitely get
a couple of opteron boxes and get this off the ground.

I think from the numbers you can conclude that even where a person wants
to be perverse and keep using object serialization, they will still get
much better throughput (if not better latency) because half the cpu can be
spent executing the jvm rather than switching context and copying data
from memory to memory.

I hope somebody with a budget picks this up soon.

Guglielmo

On Sun, 15 Jan 2006, James Strachan wrote:

> On 14 Jan 2006, at 22:27, lichtner wrote:
> > On Fri, 13 Jan 2006, James Strachan wrote:
> >
> >>> The infiniband transport would be native code, so you could use JNI.
> >>> However, it would definitely be worth it.
> >>
> >> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> >> & google every once in a while to see if one shows up :)
> >>
> >> It looks like MPI has support for Infiniband; would it be worth
> >> trying to wrap that in JNI?
> >> http://www-unix.mcs.anl.gov/mpi/
> >> http://www-unix.mcs.anl.gov/mpi/mpich2/
> >
> > I did find that HP has a Java interface for MPI. However, to me it
> > doesn't
> > necessarily seem that this is the way to go. I think for writing
> > distributed computations it would be the right choice, but I think
> > that
> > the people who write those choose to work in a natively compiled
> > language,
> > and I think that this may be the reason why this Java mpi doesn't
> > appear
> > to be that well-known.
> >
> > However I did find something which might work for us, namely UDAPL
> > from the DAT Collaborative. A consortium created a spec for
> > interface to
> > anything that provides RDMA capabilities:
> >
> > http://www.datcollaborative.org/udapl.html
> >
> > The header files and the spec are right there.
> >
> > I downloaded the only release made by infiniband.sf.net and they claim
> > that it only works with kernel 2.4, and that for 2.6 you have to use
> > openib.org. The latter claims to provide an implementation of UDAPL:
> >
> > http://openib.org/doc.html
> >
> > The wiki has a lot of info.
> >
> > From the mailing list archive you can tell that this project has a
> > lot of
> > momentum:
> >
> > http://openib.org/pipermail/openib-general/
>
> Awesome! Thanks for all the links
>
>
> > I think the next thing to do would be to prove that using RDMA as
> > opposed
> > to udp is worthwhile. I think it is, because JITs are so fast now,
> > but I
> > think that before planning anything long-term I would get two
> > infiniband-enabled boxes and write a little prototype.
>
> Agreed; the important issue is gonna be, can Java with JNI (or Unsafe
> or one of the alternatives to JNI: http://weblog.janek.org/Archive/
> 2005/07/28/AlternativestoJavaNativeI.html) work with RDMA using
> native ByteBuffers so that the zero copy is avoided and so that
> things perform better than just using some Infiniband-optimised TCP/
> IP implementation. Though to be able to test this we need to make a
> prototype Java API to RDMA :) But it is definitely well worth the
> experiment IMHO
>
> The main big win is obviously avoiding the double copy of working
> with TCP/IP though there are other benefits like improved flow
> control (you know that you can send a message to a consumer & how
> much capacity it has at any point in time so there is no need to
> worry about slow/dead consumer detection) another is concurrency; in
> a point-cast model, sending to multiple consumers in 1 thread is
> trivial (and multi-threading definitely slows down messaging as we
> found in ActiveMQ).
>
> In ActiveMQ the use of RDMA would allow us to do straight through
> processing for messages which could dramatically cut down on the
> number of objects created per message & the GC overhead. (Indeed
> we've been musing that it might be possible to avoid most per-message
> dispatch object allocations if selectors are not used and we wrote a
> custom RDMA friendly version of ActiveMQMessage; we should also be
> able to optimise the use of the Journal as we can just pass around
> the ByteBuffer rather than using OpenWire marshalling.
>
>
> > I think Appro sells
> > infiniband blades with Mellanox hcas.
> >
> > There is also IBM's proprietary API for clustering mainframes, the
> > Coupling Facility:
> >
> > http://www.research.ibm.com/journal/sj36-2.html
> >
> > There are some amazing articles there.
>
> Great stuff - thanks for the link!
>
>
> > Personally I also think there is value in implementing replication
> > using
> > udp (process groups libraries such as evs4j), so I would pursue
> > both at
> > the same time.
>
> Yeah; like many things in distributed systems & messaging; it all
> depends on what you are doing as to what solution is the best for
> your scenario. Certainly both technologies are useful tools to have
> in your bag when creating middleware. I personally see RDMA as a
> possible faster alternative for TCP/IP inside message brokers such as
> ActiveMQ as well as for request-response messaging such as in openejb.
>
> James
> -------
> http://radio.weblogs.com/0112098/
>

Re: Infiniband

Posted by James Strachan <ja...@yahoo.co.uk>.
On 14 Jan 2006, at 22:27, lichtner wrote:
> On Fri, 13 Jan 2006, James Strachan wrote:
>
>>> The infiniband transport would be native code, so you could use JNI.
>>> However, it would definitely be worth it.
>>
>> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
>> & google every once in a while to see if one shows up :)
>>
>> It looks like MPI has support for Infiniband; would it be worth
>> trying to wrap that in JNI?
>> http://www-unix.mcs.anl.gov/mpi/
>> http://www-unix.mcs.anl.gov/mpi/mpich2/
>
> I did find that HP has a Java interface for MPI. However, to me it  
> doesn't
> necessarily seem that this is the way to go. I think for writing
> distributed computations it would be the right choice, but I think  
> that
> the people who write those choose to work in a natively compiled  
> language,
> and I think that this may be the reason why this Java mpi doesn't  
> appear
> to be that well-known.
>
> However I did find something which might work for us, namely UDAPL
> from the DAT Collaborative. A consortium created a spec for  
> interface to
> anything that provides RDMA capabilities:
>
> http://www.datcollaborative.org/udapl.html
>
> The header files and the spec are right there.
>
> I downloaded the only release made by infiniband.sf.net and they claim
> that it only works with kernel 2.4, and that for 2.6 you have to use
> openib.org. The latter claims to provide an implementation of UDAPL:
>
> http://openib.org/doc.html
>
> The wiki has a lot of info.
>
> From the mailing list archive you can tell that this project has a  
> lot of
> momentum:
>
> http://openib.org/pipermail/openib-general/

Awesome! Thanks for all the links


> I think the next thing to do would be to prove that using RDMA as  
> opposed
> to udp is worthwhile. I think it is, because JITs are so fast now,  
> but I
> think that before planning anything long-term I would get two
> infiniband-enabled boxes and write a little prototype.

Agreed; the important issue is gonna be, can Java with JNI (or Unsafe  
or one of the alternatives to JNI: http://weblog.janek.org/Archive/ 
2005/07/28/AlternativestoJavaNativeI.html) work with RDMA using  
native ByteBuffers so that the zero copy is avoided and so that  
things perform better than just using some Infiniband-optimised TCP/ 
IP implementation. Though to be able to test this we need to make a  
prototype Java API to RDMA :) But it is definitely well worth the  
experiment IMHO
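
For illustration, the JNI surface being discussed might look like the
sketch below: native methods taking direct ByteBuffers, which live
outside the Java heap so the adapter can transfer the pages without a
copy. The native methods and library name are hypothetical; no real
UDAPL or openib binding is implied:

    import java.nio.ByteBuffer;

    class RdmaChannel {
        static { System.loadLibrary("rdmajni"); } // hypothetical native lib

        // Registers the buffer's memory region with the HCA; returns a handle.
        private native long register(ByteBuffer direct);

        // Posts an RDMA write of buf[position..limit) to the remote region.
        private native void write(long handle, ByteBuffer direct,
                                  long remoteAddr, int rkey);

        void send(byte[] message, long remoteAddr, int rkey) {
            // allocateDirect gives memory outside the Java heap, which the
            // GC will not move -- a precondition for zero-copy RDMA.
            ByteBuffer buf = ByteBuffer.allocateDirect(message.length);
            buf.put(message).flip();
            write(register(buf), buf, remoteAddr, rkey);
        }
    }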

The main big win is obviously avoiding the double copy of working  
with TCP/IP though there are other benefits like improved flow  
control (you know that you can send a message to a consumer & how  
much capacity it has at any point in time so there is no need to  
worry about slow/dead consumer detection) another is concurrency; in  
a point-cast model, sending to multiple consumers in 1 thread is  
trivial (and multi-threading definitely slows down messaging as we  
found in ActiveMQ).

In ActiveMQ the use of RDMA would allow us to do straight through  
processing for messages which could dramatically cut down on the  
number of objects created per message & the GC overhead. (Indeed  
we've been musing that it might be possible to avoid most per-message  
dispatch object allocations if selectors are not used and we wrote a  
custom RDMA friendly version of ActiveMQMessage; we should also be  
able to optimise the use of the Journal as we can just pass around  
the ByteBuffer rather than using OpenWire marshalling.


> I think Appro sells
> infiniband blades with Mellanox hcas.
>
> There is also IBM's proprietary API for clustering mainframes, the
> Coupling Facility:
>
> http://www.research.ibm.com/journal/sj36-2.html
>
> There are some amazing articles there.

Great stuff - thanks for the link!


> Personally I also think there is value in implementing replication  
> using
> udp (process groups libraries such as evs4j), so I would pursue  
> both at
> the same time.

Yeah; like many things in distributed systems & messaging; it all  
depends on what you are doing as to what solution is the best for  
your scenario. Certainly both technologies are useful tools to have  
in your bag when creating middleware. I personally see RDMA as a  
possible faster alternative for TCP/IP inside message brokers such as  
ActiveMQ as well as for request-response messaging such as in openejb.

James
-------
http://radio.weblogs.com/0112098/


		

Re: Infiniband

Posted by lichtner <li...@bway.net>.

On Fri, 13 Jan 2006, James Strachan wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> & google every once in a while to see if one shows up :)
>
> It looks like MPI has support for Infiniband; would it be worth
> trying to wrap that in JNI?
> http://www-unix.mcs.anl.gov/mpi/
> http://www-unix.mcs.anl.gov/mpi/mpich2/

I did find that HP has a Java interface for MPI. However, to me it doesn't
necessarily seem that this is the way to go. I think for writing
distributed computations it would be the right choice, but I think that
the people who write those choose to work in a natively compiled language,
and I think that this may be the reason why this Java mpi doesn't appear
to be that well-known.

However I did find something which might work for us, namely UDAPL
from the DAT Collaborative. A consortium created a spec for interface to
anything that provides RDMA capabilities:

http://www.datcollaborative.org/udapl.html

The header files and the spec are right there.

I downloaded the only release made by infiniband.sf.net and they claim
that it only works with kernel 2.4, and that for 2.6 you have to use
openib.org. The latter claims to provide an implementation of UDAPL:

http://openib.org/doc.html

The wiki has a lot of info.

From the mailing list archive you can tell that this project has a lot of
momentum:

http://openib.org/pipermail/openib-general/

I think the next thing to do would be to prove that using RDMA as opposed
to udp is worthwhile. I think it is, because JITs are so fast now, but I
think that before planning anything long-term I would get two
infiniband-enabled boxes and write a little prototype. I think Appro sells
infiniband blades with Mellanox hcas.

There is also IBM's proprietary API for clustering mainframes, the
Coupling Facility:

http://www.research.ibm.com/journal/sj36-2.html

There are some amazing articles there.

Personally I also think there is value in implementing replication using
udp (process groups libraries such as evs4j), so I would pursue both at
the same time.

Guglielmo

Re: Infiniband

Posted by lichtner <li...@bway.net>.
On Fri, 13 Jan 2006, James Strachan wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> & google every once in a while to see if one shows up :)
>
> It looks like MPI has support for Infiniband; would it be worth
> trying to wrap that in JNI?
> http://www-unix.mcs.anl.gov/mpi/
> http://www-unix.mcs.anl.gov/mpi/mpich2/

I don't know MPI. Do you think it's a better interface, or that it is much
easier?

I will take a look at MPI.

Guglielmo

Re: Infiniband

Posted by James Strachan <ja...@gmail.com>.
On 13 Jan 2006, at 11:51, lichtner@bway.net wrote:
> With regard to clustering, I also want to mention a remote option,  
> which
> is to use infiniband RDMA for inter-node communication.
>
> With an infiniband link between two machines you can copy a buffer
> directly from the memory of one to the memory of the other, without
> switching context. This means the kernel scheduler is not involved  
> at all,
> and there are no copies.

I love infiniband RDMA! :)


> I think the bandwidth can be up to 30Gbps right now. Pathscale  
> makes an IB
> adapter which plugs into the new HTX hypertransport slot, which is  
> to say
> it bypasses the pci bus (!). They report an 8-byte message latency  
> of 1.32
> microseconds.
>
> I think IB costs about $500 per node. But the cost is going down  
> steadily
> because the people who use IB typically buy thousands of network  
> cards at
> a time (for supercomputers.)
>
> The infiniband transport would be native code, so you could use JNI.
> However, it would definitely be worth it.

Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages  
& google every once in a while to see if one shows up :)

It looks like MPI has support for Infiniband; would it be worth  
trying to wrap that in JNI?
http://www-unix.mcs.anl.gov/mpi/
http://www-unix.mcs.anl.gov/mpi/mpich2/

James
-------
http://radio.weblogs.com/0112098/


Infiniband

Posted by li...@bway.net.
With regard to clustering, I also want to mention a remote option, which
is to use infiniband RDMA for inter-node communication.

With an infiniband link between two machines you can copy a buffer
directly from the memory of one to the memory of the other, without
switching context. This means the kernel scheduler is not involved at all,
and there are no copies.

I think the bandwidth can be up to 30Gbps right now. Pathscale makes an IB
adapter which plugs into the new HTX hypertransport slot, which is to say
it bypasses the pci bus (!). They report an 8-byte message latency of 1.32
microseconds.

I think IB costs about $500 per node. But the cost is going down steadily
because the people who use IB typically buy thousands of network cards at
a time (for supercomputers.)

The infiniband transport would be native code, so you could use JNI.
However, it would definitely be worth it.

Guglielmo


Re: Fwd: Replication using totem protocol

Posted by Jeff Genender <jg...@apache.org>.
Yes...awesome.  Bruce had chatted with me about this too...I am very
interested.

Guglielmo, I would be very interested in speaking with you further on
this.  This looks like something we could heavily use.  What are your
thoughts?

Jeff

Dain Sundstrom wrote:
> I'm not sure if you saw this email....
> 
> According to the website (http://www.bway.net/~lichtner/evs4j.html):
> 
>     "Extended Virtual Synchrony for Java (EVS4J), an Apache-Licensed,
> pure-Java implementation of the fastest known totally ordered reliable
> multicast protocol."
> 
> 
> Once you have a totally ordered messaging protocol, implementing a
> distributed lock is trivial (I can go into detail if you want).  I
> suggest we ask Guglielmo if he would like to donate his implementation
> to this incubator project and if he would like to work on a pessimistic
> distributed locking implementation.
> 
> What do you think?
> 
> -dain
> 
> Begin forwarded message:
> 
>> From: lichtner@bway.net
>> Date: January 4, 2006 10:46:44 AM PST
>> To: user@geronimo.apache.org
>> Subject: Replication using totem protocol
>> Reply-To: user@geronimo.apache.org
>>
>>
>> On google I saw that the apache geronimo incubator had traces of a totem
>> protocol implementation (author "adc" ??). Is this in geronimo,
>> replication using totem?
>>
>> If you google "java totem protocol" the first link is my evs4j project,
>> the second one is a reference to old apache geronimo incubator code.


Fwd: Replication using totem protocol

Posted by Dain Sundstrom <da...@iq80.com>.
I'm not sure if you saw this email....

According to the website (http://www.bway.net/~lichtner/evs4j.html):

     "Extended Virtual Synchrony for Java (EVS4J), an Apache- 
Licensed, pure-Java implementation of the fastest known totally  
ordered reliable multicast protocol."


Once you have a totally ordered messaging protocol, implementing a  
distributed lock is trivial (I can go into detail if you want).  I  
suggest we ask Guglielmo if he would like to donate his  
implementation to this incubator project and if he would like to work  
on a pessimistic distributed locking implementation.

What do you think?

-dain

Begin forwarded message:

> From: lichtner@bway.net
> Date: January 4, 2006 10:46:44 AM PST
> To: user@geronimo.apache.org
> Subject: Replication using totem protocol
> Reply-To: user@geronimo.apache.org
>
>
> On google I saw that the apache geronimo incubator had traces of a  
> totem
> protocol implementation (author "adc" ??). Is this in geronimo,
> replication using totem?
>
> If you google "java totem protocol" the first link is my evs4j  
> project,
> the second one is a reference to old apache geronimo incubator code.

