Posted to dev@geronimo.apache.org by Andy Piper <an...@bea.com> on 2005/08/02 14:06:51 UTC

Re: Clustering (long)

Hi Jules

At 05:37 AM 7/27/2005, Jules Gosnell wrote:

>I agree on the SPoF thing - but I think you misunderstand my 
>Coordinator arch. I do not have a single static Coordinator node, 
>but a dynamic Coordinator role, into which a node may be elected. 
>Thus every node is a potential Coordinator. If the elected 
>Coordinator dies, another is immediately elected. The election 
>strategy is pluggable, although it will probably end up being 
>hardwired to "oldest-cluster-member". The reason behind this is that 
>re-laying out your cluster is much simpler if it is done in a single 
>vm. I originally tried to do it in multiple vms, each taking 
>responsibility for pieces of the cluster, but if the vms' views are 
>not completely in sync, things get very hairy, and completely in 
>sync is an expensive thing to achieve - and would introduce a 
>cluster-wide single point of contention. So I do it in a single vm, 
>as fast as I can, with failover, in case that vm evaporates. Does 
>that sound better than the scenario that you had in mind ?

This is exactly the "hard" computer science problem that you 
shouldn't be trying to solve if at all possible. It's hard because 
network partitions or hung processes (think GC) make it very easy for 
your colleagues to think you are dead when you do not share that 
view. The result is two processes that both think they are the 
coordinator, and anarchy can ensue (commonly called split-brain 
point you at papers if you want, but I really suggest that you aim 
for an implementation that is independent of a central coordinator. 
Note that a central coordinator is necessary if you want to implement 
a strongly-consistent in-memory database, but this is not usually a 
requirement for session replication say.
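
To give a flavour of the usual mitigation (purely illustrative - the 
names are hypothetical, this is not WADI or ActiveCluster code): 
coordinator-based designs typically fence a stale coordinator with an 
election epoch, so that a GC-paused node which still believes it holds 
the role is simply ignored:

    // Illustrative sketch only: peers track the highest election
    // epoch they have seen and ignore commands carrying any lower
    // epoch, which defuses a paused "zombie" coordinator.
    final class EpochFence {
        private long highestEpochSeen;

        // Called with the epoch carried by each command from a
        // claimed coordinator.
        synchronized boolean accept(long senderEpoch) {
            if (senderEpoch < highestEpochSeen) {
                return false;   // stale coordinator - drop the command
            }
            highestEpochSeen = senderEpoch;
            return true;
        }
    }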

http://research.microsoft.com/Lampson/58-Consensus/Abstract.html 
gives a good introduction to some of these things. I also presented 
at JavaOne on related issues, you should be able to download the 
presentation from dev2dev.bea.com at some point (not there yet - I 
just checked).

>The Coordinator is not there to support session replication, but 
>rather the management of the distributed map (a few buckets of 
>which live on each node) which is used by WADI to discover very 
>efficiently whether a session exists and where it is located. This 
>map must be rearranged, in the most efficient way possible, each 
>time a node joins or leaves the cluster.

Understood. Once you have a fault-tolerant singleton coordinator you 
can solve lots of interesting problems, it's just hard and often not 
worth the effort or the expense (typical implementations involve HA 
HW or an HA DB or at least 3 server processes).
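
(For what it's worth, the "at least 3 server processes" falls out of 
majority-quorum arithmetic - an illustrative sketch only:)

    // A majority quorum of n processes tolerates floor((n - 1) / 2)
    // failures, so n = 3 is the smallest membership that can still
    // form a majority after losing one process.
    final class QuorumMath {
        static int toleratedFailures(int n) {
            return (n - 1) / 2;   // n=3 -> 1, n=5 -> 2
        }
    }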

>Replication is NYI - but I'm running a few mental background threads 
>that suggest that an extension to the index will mean that it 
>associates the session's id not just to its current location, but 
>also to the location of a number of replicants. I also have ideas on 
>how a session might choose nodes into which it will place its 
>replicants and how I can avoid the primary session copy ever being 
>colocated with a replicant (potential SPoF - if you only have one 
>replicant), etc...

Right, definitely something you want to avoid.

>Yes, I can see that happening - I have an improvement (NYI) to 
>WADI's evacuation strategy (how sessions are evacuated when a node 
>wishes to leave). Each session will be evacuated to the node which 
>owns the bucket into which its id hashes. This is because colocation 
>of the session with the bucket allows many messages concerned with 
>its future destruction and relocation to be optimised away. Future 
>requests falling elsewhere but needing this session should, in the 
>most efficient case, be relocated to this same node; otherwise the 
>session may be relocated, but at a cost...

How do you relocate the request? Many HW load-balancers do not 
support this (or else it requires using proprietary APIs), so you 
probably have to count on moving sessions in the normal failover case.

>I would be very grateful for any thoughts or feedback that you could 
>give me. I hope to get much more information about WADI into the 
>wiki over the next few weeks. That should help generate more 
>discussion, although I would be more than happy for people to ask me 
>questions here on Geronimo-dev because this will give me an idea of 
>what documentation I should write and how existing documentation may 
>be lacking or misleading.

I guess my general comment would be that you might find it better to 
think specifically about the end-user problem you are trying to solve 
(say session replication) and work towards a solution based on that. 
Most short-cuts / optimizations that vendors make are specific to the 
problem domain and do not generally apply to all clustering problems.

Hope this helps

andy 



Re: Clustering (long)

Posted by Jules Gosnell <ju...@coredevelopers.net>.
Joe Bohn wrote:

> You can define an order to the semaphores when locking and thereby 
> avoid a deadlock.

Good idea - if I order the nodes according to some unique id and try for 
the lease on their bucket table in the same order, then multiple nodes 
trying at the same time should not deadlock... - it will take a little 
longer, since I will be acquiring locks sequentially, not concurrently, 
but ...

I order locks in a single vm all the time, yet didn't make the mental 
leap to doing it in different vms without your pointing it out - thanks :-)

Jules

> If each node being added or terminating itself honors the order then 
> you will never have a deadlock.   However, you still need to deal with 
> the case of an uncontrolled failure either adding or removing a node 
> and possibly never releasing a lock.
>
> Joe
>
> Jules Gosnell wrote:
>
>> hmm... hmmm... :-)
>>
>> more thoughts on (1) and (2)...
>>
>> When a node leaves/joins it needs to acquire a lease on the bucket 
>> tables of every node that it intends to move buckets from/to. If two 
>> nodes are doing this at the same time, their requirements will collide 
>> (deadlock) somewhere in the cluster. At this point they may be 
>> notified and e.g. compare ip addresses to decide who continues and 
>> who backs off for a while.
>>
>> So, (1) and (2), whilst being possible, are probably more complex than 
>> I initially imagined. If we have Paxos for the more general purpose 
>> case (3) anyway, it would probably be smart just to go with this, 
>> until such optimisations become necessary, if at all.
>>
>> Jules
>>
>>
>> Jules Gosnell wrote:
>>
>>> hmmm...
>>>
>>> now I'm wondering about my solutions to (1) and (2) - if more than 
>>> one node tries to join or leave at the same time I may be in trouble 
>>> - so it may be safer to go straight to (3) for all cases...
>>>
>>> more thought needed :-)
>>>
>>> Jules
>>>
>>>
>>>
>>> Jules Gosnell wrote:
>>>
>>>> I've had a look at the Lampson paper, but didn't take it all in on 
>>>> the first pass - I think it will need some serious concentration. 
>>>> The Paxos algorithm looks interesting, I will definitely pursue 
>>>> this avenue.
>>>>
>>>> I've also given a little thought to exactly why I need a 
>>>> Coordinator and how Paxos might be used to replace it. My use of a 
>>>> Coordinator and plans for its future do not actually seem that far 
>>>> from Paxos, on a preliminary reading.
>>>>
>>>> Given that WADI currently uses a distributed map of 
>>>> sessionId:sessionLocation, that this distribution is achieved by 
>>>> sharing out responsibility for the set number of buckets that 
>>>> comprise the map roughly evenly between the cluster members and 
>>>> that this is currently my most satisfying design, I can break my 
>>>> problem space (for bucket arrangement) down into 3 basic cases :
>>>>
>>>> 1) Node joins
>>>> 2) Node leaves in controlled fashion
>>>> 3) Node dies
>>>>
>>>> If the node under discussion is the only cluster member, then no 
>>>> bucket rearrangement is necessary - this node will either create or 
>>>> destroy the full set of buckets. I'll leave this set of subcases as 
>>>> trivial.
>>>>
>>>> 1)  The joining node will need to assume responsibility for a 
>>>> number of buckets. If buckets-per-node is to be kept roughly the 
>>>> same for every node, it is likely that the joining node will 
>>>> require transfer of a small number of buckets from every current 
>>>> cluster member i.e. we are starting a bucket rearrangement that 
>>>> will involve every cluster member and only need be done if the join 
>>>> is successful. So, although we wish to avoid an SPoF, if that SPoF 
>>>> turns out to be the joining node, then I don't see it as a problem. 
>>>> If the joining node dies, then we no longer have to worry about 
>>>> rearranging our buckets (unless we have lost some that had already 
>>>> been transferred - see (3)). Thus the joining node may be used as a 
>>>> single Coordinator/Leader for this negotiation without fear of the 
>>>> SPoF problem. Are we on the same page here ?
>>>>
>>>> 2) The same argument may be applied in reverse to a node leaving in 
>>>> a controlled fashion. It will wish to evacuate its buckets roughly 
>>>> equally to all remaining cluster members. If it shuts down cleanly, 
>>>> this would form part of its shutdown protocol. If it dies before or 
>>>> during the execution of this protocol then we are back at (3), if 
>>>> not, then the SPoF issue may again be put to one side.
>>>>
>>>> 3) This is where things get tricky :-) Currently WADI has, for the 
>>>> sake of simplicity, one single algorithm / thread / 
>>>> point-of-failure which recalculates a complete bucket arrangement 
>>>> if it detects (1), (2) or (3). It would be simple enough to offload 
>>>> the work done for (1) and (2) to the node joining/leaving and this 
>>>> should reduce wadi's current vulnerability, but we still need to 
>>>> deal with catastrophic failure. Currently WADI rebuilds the missing 
>>>> buckets by querying the cluster for the locations of any sessions 
>>>> that fall within them, but it could equally carry a replicated 
>>>> backup and dust it off as part of this procedure. It's just a 
>>>> trade-off between work done up front and work done in exceptional 
>>>> circumstance... This is the place where the Paxos algorithm may 
>>>> come in handy - bucket recomposition and rearrangement. I need to 
>>>> give this further thought. For the immediate future, however, I 
>>>> think WADI will stay with a single Coordinator in this situation, 
>>>> which fails-over if http://activecluster.codehaus.org says it 
>>>> should - I'm delegating the really thorny problem to James :-). I 
>>>> agree with you that this is an SPoF and that WADI's ability to 
>>>> recover from failure here depends directly on how we decide if a 
>>>> node is alive or dead - a very tricky thing to do.
>>>>
>>>> In conclusion then, I think that we have usefully identified a 
>>>> weakness that will become more relevant as the rest of WADI's 
>>>> features mature. The Lampson paper mentioned describes an algorithm 
>>>> for allowing nodes to reach a consensus on actions to be performed, 
>>>> in a redundant manner with no SPoF and I shall consider how this 
>>>> might replace WADI's current single Coordinator, whilst also 
>>>> looking at performing other Coordination on joining/leaving nodes 
>>>> where its failure, coinciding with that of its host node, will be 
>>>> irrelevant, since the very condition that it was intended to 
>>>> resolve has ceased to exist.
>>>>
>>>> How does that sound, Andy ? Do you agree with my thoughts on (1) & 
>>>> (2) ? This is great input - thanks,
>>>>
>>>>
>>>> Jules
>>>>
>>>>
>>>> Jules Gosnell wrote:
>>>>
>>>>> Andy Piper wrote:
>>>>>
>>>>>> Hi Jules
>>>>>>
>>>>>> At 05:37 AM 7/27/2005, Jules Gosnell wrote:
>>>>>>
>>>>>>> I agree on the SPoF thing - but I think you misunderstand my 
>>>>>>> Coordinator arch. I do not have a single static Coordinator 
>>>>>>> node, but a dynamic Coordinator role, into which a node may be 
>>>>>>> elected. Thus every node is a potential Coordinator. If the 
>>>>>>> elected Coordinator dies, another is immediately elected. The 
>>>>>>> election strategy is pluggable, although it will probably end up 
>>>>>>> being hardwired to "oldest-cluster-member". The reason behind 
>>>>>>> this is that re-laying out your cluster is much simpler if it is 
>>>>>>> done in a single vm. I originally tried to do it in multiple 
>>>>>>> vms, each taking responsibility for pieces of the cluster, but 
>>>>>>> if the vms' views are not completely in sync, things get very 
>>>>>>> hairy, and completely in sync is an expensive thing to achieve - 
>>>>>>> and would introduce a cluster-wide single point of contention. 
>>>>>>> So I do it in a single vm, as fast as I can, with failover, in 
>>>>>>> case that vm evaporates. Does that sound better than the 
>>>>>>> scenario that you had in mind ?
>>>>>>
>>>>>> This is exactly the "hard" computer science problem that you 
>>>>>> shouldn't be trying to solve if at all possible. It's hard because 
>>>>>> network partitions or hung processes (think GC) make it very easy 
>>>>>> for your colleagues to think you are dead when you do not share 
>>>>>> that view. The result is two processes that both think they are the 
>>>>>> coordinator, and anarchy can ensue (commonly called split-brain 
>>>>>> syndrome). I can point you at papers if you want, but I really 
>>>>>> suggest that you aim for an implementation that is independent of 
>>>>>> a central coordinator. Note that a central coordinator is 
>>>>>> necessary if you want to implement a strongly-consistent 
>>>>>> in-memory database, but this is not usually a requirement for 
>>>>>> session replication say.
>>>>>>
>>>>>> http://research.microsoft.com/Lampson/58-Consensus/Abstract.html 
>>>>>> gives a good introduction to some of these things. I also 
>>>>>> presented at JavaOne on related issues, you should be able to 
>>>>>> download the presentation from dev2dev.bea.com at some point (not 
>>>>>> there yet - I just checked).
>>>>>
>>>>> OK - I will have a look at these papers and reconsider... perhaps 
>>>>> I can come up with some sort of fractal algorithm which 
>>>>> recursively breaks down the cluster into subclusters each of which 
>>>>> is capable of doing likewise to itself and then lay out the 
>>>>> buckets recursively via this metaphor... - this would be much more 
>>>>> robust, as you point out, but, I think, a more complicated 
>>>>> architecture. I will give it some serious thought. Have you any 
>>>>> suggestions/papers as to how you might do something like this in a 
>>>>> distributed manner, bearing in mind that as a node joins, some 
>>>>> existing nodes will see it as having joined and some will not yet 
>>>>> have noticed and vice-versa on leaving....
>>>>>
>>>>>>
>>>>>>> The Coordinator is not there to support session replication, but 
>>>>>>> rather the management of the distributed map (a few buckets of 
>>>>>>> which live on each node) which is used by WADI to discover 
>>>>>>> very efficiently whether a session exists and where it is 
>>>>>>> located. This map must be rearranged, in the most efficient way 
>>>>>>> possible, each time a node joins or leaves the cluster.
>>>>>>
>>>>>> Understood. Once you have a fault-tolerant singleton coordinator 
>>>>>> you can solve lots of interesting problems, it's just hard and 
>>>>>> often not worth the effort or the expense (typical 
>>>>>> implementations involve HA HW or an HA DB or at least 3 server 
>>>>>> processes).
>>>>>
>>>>> Since I am only currently using the singleton coordinator for 
>>>>> bucket arrangement, I may just live with it for the moment, in 
>>>>> order to move forward, but make a note to replace it and start 
>>>>> background threads on how that might be achieved...
>>>>>
>>>>>>
>>>>>>> Replication is NYI - but I'm running a few mental background 
>>>>>>> threads that suggest that an extension to the index will mean 
>>>>>>> that it associates the session's id not just to its current 
>>>>>>> location, but also to the location of a number of replicants. I 
>>>>>>> also have ideas on how a session might choose nodes into which 
>>>>>>> it will place its replicants and how I can avoid the primary 
>>>>>>> session copy ever being colocated with a replicant (potential 
>>>>>>> SPoF - if you only have one replicant), etc...
>>>>>>
>>>>>> Right, definitely something you want to avoid.
>>>>>>
>>>>>>> Yes, I can see that happening - I have an improvement (NYI) to 
>>>>>>> WADI's evacuation strategy (how sessions are evacuated when a 
>>>>>>> node wishes to leave). Each session will be evacuated to the 
>>>>>>> node which owns the bucket into which its id hashes. This is 
>>>>>>> because colocation of the session with the bucket allows many 
>>>>>> messages concerned with its future destruction and relocation to 
>>>>>>> be optimised away. Future requests falling elsewhere but needing 
>>>>>>> this session should, in the most efficient case, be relocated to 
>>>>>> this same node; otherwise the session may be relocated, but at 
>>>>>>> a cost...
>>>>>>
>>>>>> How do you relocate the request? Many HW load-balancers do not 
>>>>>> support this (or else it requires using proprietary APIs), so you 
>>>>>> probably have to count on moving sessions in the normal failover case.
>>>>>
>>>>> If I can squeeze the behaviour that I require out of the 
>>>>> load-balancer, then, depending on the request type, I may be able 
>>>>> to get away with a redirection with a changed session cookie or 
>>>>> url param, or, failing this, an http-proxy across from a filter 
>>>>> above the servlet on one side to the http-port on the node that 
>>>>> owns the session...
>>>>>
>>>>> The LB-integration object is pluggable and the aim is to supply 
>>>>> wadi with a good selection of LB integrations - currently I only 
>>>>> have a ModJK[2] plugin working. This is able to 'restick' clients 
>>>>> to their session's new location (although messing with the session 
>>>>> id is a little dodgy...).
>>>>>
>>>>>>
>>>>>> I would be very grateful for any thoughts or feedback that you 
>>>>>>> could give me. I hope to get much more information about WADI 
>>>>>>> into the wiki over the next few weeks. That should help generate 
>>>>>>> more discussion, although I would be more than happy for people 
>>>>>>> to ask me questions here on Geronimo-dev because this will give 
>>>>>>> me an idea of what documentation I should write and how existing 
>>>>>>> documentation may be lacking or misleading.
>>>>>>
>>>>>> I guess my general comment would be that you might find it better 
>>>>>> to think specifically about the end-user problem you are trying 
>>>>>> to solve (say session replication) and work towards a solution 
>>>>>> based on that. Most short-cuts / optimizations that vendors make 
>>>>>> are specific to the problem domain and do not generally apply to 
>>>>>> all clustering problems.
>>>>>
>>>>> The end problem is really clustered web and ejb sessions at the 
>>>>> moment, although it looks as if by the time we have solved these 
>>>>> issues we may well have written a fault-tolerant 
>>>>> distributed/partitioned index that might be very useful as a 
>>>>> generic distributed cache building block.
>>>>>
>>>>> One thing that I do want wadi to do is to still work when 
>>>>> replication is switched off. i.e., if a session only exists as a 
>>>>> primary copy, even if affinity breaks down, wadi will continue to 
>>>>> correctly render requests for that session unless some form of 
>>>>> catastrophic failure causes the session to evaporate. This means 
>>>>> that I need to ensure the session's timely evacuation from a node 
>>>>> that chooses to leave the cluster to a remaining node, so that it 
>>>>> may remain active beyond the lifetime of its original node. All of 
>>>>> this must work flawlessly under stress, so that an admin may add 
>>>>> or remove nodes to a running cluster without having to worry about 
>>>>> the user state that it is managing. Nodes are added by simply 
>>>>> starting them, and nodes removed via e.g. ctl-c-ing them.
>>>>>
>>>>> If it is decided that a few more nines are needed in terms of 
>>>>> session availability and the cluster owner understands the extra 
>>>>> cost involved in in-vm replication in terms of extra hardware and 
>>>>> bandwidth that they will have to purchase and is happy to go with 
>>>>> in-vm-replication, then it should be sufficient to up the number 
>>>>> of replicated copies kept by the cluster from '0' to e.g. '2' and 
>>>>> restart (It might even be possible to vary this setting on a node 
>>>>> to node basis so that this change does not even involve a complete 
>>>>> cluster cold start). WADI should deal with the rest.
>>>>>
>>>>> So, I believe that I have a pretty clear idea of what WADI will 
>>>>> do, and aside from the replication stuff (phase2) it currently 
>>>>> does most of what I had in mind for phase1, except that it is not 
>>>>> yet happy under stress. I figure it will probably take one or two 
>>>>> more redesign/reimplementation iterations to get it to this stage, 
>>>>> then I can consider replication.
>>>>>
>>>>> I have spoken to members of the OpenEJB team about  wadi's ability 
>>>>> to relocate requests as well as sessions and we came to the 
>>>>> conclusion that it was just as applicable in the EJB world as the 
>>>>> web world. If the node an ejb client is talking to leaves the 
>>>>> cluster in between calls, the client may try to contact it and 
>>>>> then failover to another node that it hopes holds the session. If, 
>>>>> due to other nodes leaving/joining it is not always clear which 
>>>>> node will contain the session, the ability to reply to an RMI and 
>>>>> just say "not here - there!" - i.e. an rmi redirection - would not 
>>>>> be hard to add and would resolve this situation. Transactions are 
>>>>> another item which I have marked phase2.
>>>>>
>>>>> So, I am trying hard to stay very focussed on the problem domain, 
>>>>> otherwise this will never get finished :-)
>>>>>
>>>>> Right, off to read those papers now - thanks for your posting and 
>>>>> your interest,
>>>>>
>>>>> Jules
>>>>>
>>>>>>
>>>>>> Hope this helps
>>>>>>
>>>>>> andy 
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Clustering (long)

Posted by Joe Bohn <jo...@earthlink.net>.
You can define an order to the semaphores when locking and thereby avoid 
a deadlock.  If each node being added or terminating itself honors the 
order then you will never have a deadlock.   However, you still need to 
deal with the case of an uncontrolled failure either adding or removing 
a node and possibly never releasing a lock.
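
(A common way to handle that last case - sketched below with 
hypothetical names, nothing WADI-specific - is to make the lock a lease 
that expires unless the holder keeps renewing it, so a dead node's lock 
eventually frees itself:)

    // Illustrative only: a lease that a live holder must keep renewing.
    // If the holder dies without releasing, the lease simply times out
    // and other nodes may reclaim it.
    final class ExpiringLease {
        private final long ttlMillis;
        private volatile long expiresAtMillis;

        ExpiringLease(long ttlMillis) {
            this.ttlMillis = ttlMillis;
            renew();
        }
        void renew() {
            expiresAtMillis = System.currentTimeMillis() + ttlMillis;
        }
        boolean isExpired() {
            return System.currentTimeMillis() > expiresAtMillis;
        }
    }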

Joe

Jules Gosnell wrote:

> hmm... hmmm... :-)
>
> more thoughts on (1) and (2)...
>
> When a node leaves/joins it needs to acquire a lease on the bucket 
> tables of every node that it intends to move buckets from/to. If two 
> nodes are doing this at the same time, their requirements will collide 
> (deadlock) somewhere in the cluster. At this point they may be 
> notified and e.g. compare ip addresses to decide who continues and who 
> backs off for a while.
>
> So, (1) and (2), whilst being possible, are probably more complex than 
> I initially imagined. If we have Paxos for the more general purpose 
> case (3) anyway, it would probably be smart just to go with this, 
> until such optimisations become necessary, if at all.
>
> Jules
>
>
> Jules Gosnell wrote:
>
>> hmmm...
>>
>> now I'm wondering about my solutions to (1) and (2) - if more than 
>> one node tries to join or leave at the same time I may be in trouble 
>> - so it may be safer to go straight to (3) for all cases...
>>
>> more thought needed :-)
>>
>> Jules
>>
>>
>>
>> Jules Gosnell wrote:
>>
>>> I've had a look at the Lampson paper, but didn't take it all in on 
>>> the first pass - I think it will need some serious concentration. 
>>> The Paxos algorithm looks interesting, I will definitely pursue this 
>>> avenue.
>>>
>>> I've also given a little thought to exactly why I need a Coordinator 
>>> and how Paxos might be used to replace it. My use of a Coordinator 
>>> and plans for its future do not actually seem that far from Paxos, 
>>> on a preliminary reading.
>>>
>>> Given that WADI currently uses a distributed map of 
>>> sessionId:sessionLocation, that this distribution is achieved by 
>>> sharing out responsibility for the set number of buckets that 
>>> comprise the map roughly evenly between the cluster members and that 
>>> this is currently my most satisfying design, I can break my problem 
>>> space (for bucket arrangement) down into 3 basic cases :
>>>
>>> 1) Node joins
>>> 2) Node leaves in controlled fashion
>>> 3) Node dies
>>>
>>> If the node under discussion is the only cluster member, then no 
>>> bucket rearrangement is necessary - this node will either create or 
>>> destroy the full set of buckets. I'll leave this set of subcases as 
>>> trivial.
>>>
>>> 1)  The joining node will need to assume responsibility for a number 
>>> of buckets. If buckets-per-node is to be kept roughly the same for 
>>> every node, it is likely that the joining node will require transfer 
>>> of a small number of buckets from every current cluster member i.e. 
>>> we are starting a bucket rearrangement that will involve every 
>>> cluster member and only need be done if the join is successful. So, 
>>> although we wish to avoid an SPoF, if that SPoF turns out to be the 
>>> joining node, then I don't see it as a problem. If the joining node 
>>> dies, then we no longer have to worry about rearranging our buckets 
>>> (unless we have lost some that had already been transferred - see 
>>> (3)). Thus the joining node may be used as a single 
>>> Coordinator/Leader for this negotiation without fear of the SPoF 
>>> problem. Are we on the same page here ?
>>>
>>> 2) The same argument may be applied in reverse to a node leaving in 
>>> a controlled fashion. It will wish to evacuate its buckets roughly 
>>> equally to all remaining cluster members. If it shuts down cleanly, 
>>> this would form part of its shutdown protocol. If it dies before or 
>>> during the execution of this protocol then we are back at (3), if 
>>> not, then the SPoF issue may again be put to one side.
>>>
>>> 3) This is where things get tricky :-) Currently WADI has, for the 
>>> sake of simplicity, one single algorithm / thread / point-of-failure 
>>> which recalculates a complete bucket arrangement if it detects (1), 
>>> (2) or (3). It would be simple enough to offload the work done for 
>>> (1) and (2) to the node joining/leaving and this should reduce 
>>> wadi's current vulnerability, but we still need to deal with 
>>> catastrophic failure. Currently WADI rebuilds the missing buckets by 
>>> querying the cluster for the locations of any sessions that fall 
>>> within them, but it could equally carry a replicated backup and dust 
>>> it off as part of this procedure. It's just a trade-off between work 
>>> done up front and work done in exceptional circumstance... This is 
>>> the place where the Paxos algorithm may come in handy - bucket 
>>> recomposition and rearrangement. I need to give this further 
>>> thought. For the immediate future, however, I think WADI will stay 
>>> with a single Coordinator in this situation, which fails-over if 
>>> http://activecluster.codehaus.org says it should - I'm delegating 
>>> the really thorny problem to James :-). I agree with you that this 
>>> is an SPoF and that WADI's ability to recover from failure here 
>>> depends directly on how we decide if a node is alive or dead - a 
>>> very tricky thing to do.
>>>
>>> In conclusion then, I think that we have usefully identified a 
>>> weakness that will become more relevant as the rest of WADI's 
>>> features mature. The Lampson paper mentioned describes an algorithm 
>>> for allowing nodes to reach a consensus on actions to be performed, 
>>> in a redundant manner with no SPoF and I shall consider how this 
>>> might replace WADI's current single Coordinator, whilst also 
>>> looking at performing other Coordination on joining/leaving nodes 
>>> where its failure, coinciding with that of its host node, will be 
>>> irrelevant, since the very condition that it was intended to resolve 
>>> has ceased to exist.
>>>
>>> How does that sound, Andy ? Do you agree with my thoughts on (1) & 
>>> (2) ? This is great input - thanks,
>>>
>>>
>>> Jules
>>>
>>>
>>> Jules Gosnell wrote:
>>>
>>>> Andy Piper wrote:
>>>>
>>>>> Hi Jules
>>>>>
>>>>> At 05:37 AM 7/27/2005, Jules Gosnell wrote:
>>>>>
>>>>>> I agree on the SPoF thing - but I think you misunderstand my 
>>>>>> Coordinator arch. I do not have a single static Coordinator node, 
>>>>>> but a dynamic Coordinator role, into which a node may be elected. 
>>>>>> Thus every node is a potential Coordinator. If the elected 
>>>>>> Coordinator dies, another is immediately elected. The election 
>>>>>> strategy is pluggable, although it will probably end up being 
>>>>>> hardwired to "oldest-cluster-member". The reason behind this is 
>>>>>> that re-laying out your cluster is much simpler if it is done in a 
>>>>>> single vm. I originally tried to do it in multiple vms, each 
>>>>>> taking responsibility for pieces of the cluster, but if the vms' 
>>>>>> views are not completely in sync, things get very hairy, and 
>>>>>> completely in sync is an expensive thing to achieve - and would 
>>>>>> introduce a cluster-wide single point of contention. So I do it 
>>>>>> in a single vm, as fast as I can, with failover, in case that vm 
>>>>>> evaporates. Does that sound better than the scenario that you had 
>>>>>> in mind ?
>>>>>
>>>>> This is exactly the "hard" computer science problem that you 
>>>>> shouldn't be trying to solve if at all possible. It's hard because 
>>>>> network partitions or hung processes (think GC) make it very easy 
>>>>> for your colleagues to think you are dead when you do not share 
>>>>> that view. The result is two processes that both think they are the 
>>>>> coordinator, and anarchy can ensue (commonly called split-brain 
>>>>> syndrome). I can point you at papers if you want, but I really 
>>>>> suggest that you aim for an implementation that is independent of 
>>>>> a central coordinator. Note that a central coordinator is 
>>>>> necessary if you want to implement a strongly-consistent in-memory 
>>>>> database, but this is not usually a requirement for session 
>>>>> replication say.
>>>>>
>>>>> http://research.microsoft.com/Lampson/58-Consensus/Abstract.html 
>>>>> gives a good introduction to some of these things. I also 
>>>>> presented at JavaOne on related issues, you should be able to 
>>>>> download the presentation from dev2dev.bea.com at some point (not 
>>>>> there yet - I just checked).
>>>>
>>>> OK - I will have a look at these papers and reconsider... perhaps I 
>>>> can come up with some sort of fractal algorithm which recursively 
>>>> breaks down the cluster into subclusters each of which is capable 
>>>> of doing likewise to itself and then lay out the buckets 
>>>> recursively via this metaphor... - this would be much more robust, 
>>>> as you point out, but, I think, a more complicated architecture. I 
>>>> will give it some serious thought. Have you any suggestions/papers 
>>>> as to how you might do something like this in a distributed manner, 
>>>> bearing in mind that as a node joins, some existing nodes will see 
>>>> it as having joined and some will not yet have noticed and 
>>>> vice-versa on leaving....
>>>>
>>>>>
>>>>>> The Coordinator is not there to support session replication, but 
>>>>>> rather the management of the distributed map (a few buckets of 
>>>>>> which live on each node) which is used by WADI to discover very 
>>>>>> efficiently whether a session exists and where it is located. 
>>>>>> This map must be rearranged, in the most efficient way possible, 
>>>>>> each time a node joins or leaves the cluster.
>>>>>
>>>>> Understood. Once you have a fault-tolerant singleton coordinator 
>>>>> you can solve lots of interesting problems, it's just hard and 
>>>>> often not worth the effort or the expense (typical implementations 
>>>>> involve HA HW or an HA DB or at least 3 server processes).
>>>>
>>>> Since I am only currently using the singleton coordinator for 
>>>> bucket arrangement, I may just live with it for the moment, in 
>>>> order to move forward, but make a note to replace it and start 
>>>> background threads on how that might be achieved...
>>>>
>>>>>
>>>>>> Replication is NYI - but I'm running a few mental background 
>>>>>> threads that suggest that an extension to the index will mean 
>>>>>> that it associates the session's id not just to its current 
>>>>>> location, but also to the location of a number of replicants. I 
>>>>>> also have ideas on how a session might choose nodes into which it 
>>>>>> will place its replicants and how I can avoid the primary session 
>>>>>> copy ever being colocated with a replicant (potential SPoF - if 
>>>>>> you only have one replicant), etc...
>>>>>
>>>>> Right, definitely something you want to avoid.
>>>>>
>>>>>> Yes, I can see that happening - I have an improvement (NYI) to 
>>>>>> WADI's evacuation strategy (how sessions are evacuated when a 
>>>>>> node wishes to leave). Each session will be evacuated to the node 
>>>>>> which owns the bucket into which its id hashes. This is because 
>>>>>> colocation of the session with the bucket allows many messages 
>>>>>> concerned with its future destruction and relocation to be 
>>>>>> optimised away. Future requests falling elsewhere but needing 
>>>>>> this session should, in the most efficient case, be relocated to 
>>>>>> this same node; otherwise the session may be relocated, but at a 
>>>>>> cost...
>>>>>
>>>>> How do you relocate the request? Many HW load-balancers do not 
>>>>> support this (or else it requires using proprietary APIs), so you 
>>>>> probably have to count on moving sessions in the normal failover case.
>>>>
>>>> If I can squeeze the behaviour that I require out of the 
>>>> load-balancer, then, depending on the request type, I may be able to 
>>>> get away with a redirection with a changed session cookie or url 
>>>> param, or, failing this, an http-proxy across from a filter above 
>>>> the servlet on one side to the http-port on the node that owns the 
>>>> session...
>>>>
>>>> The LB-integration object is pluggable and the aim is to supply 
>>>> wadi with a good selection of LB integrations - currently I only 
>>>> have a ModJK[2] plugin working. This is able to 'restick' clients 
>>>> to their session's new location (although messing with the session 
>>>> id is a little dodgy...).
>>>>
>>>>>
>>>>>> I would be very grateful for any thoughts or feedback that you 
>>>>>> could give me. I hope to get much more information about WADI 
>>>>>> into the wiki over the next few weeks. That should help generate 
>>>>>> more discussion, although I would be more than happy for people 
>>>>>> to ask me questions here on Geronimo-dev because this will give 
>>>>>> me an idea of what documentation I should write and how existing 
>>>>>> documentation may be lacking or misleading.
>>>>>
>>>>> I guess my general comment would be that you might find it better 
>>>>> to think specifically about the end-user problem you are trying to 
>>>>> solve (say session replication) and work towards a solution based 
>>>>> on that. Most short-cuts / optimizations that vendors make are 
>>>>> specific to the problem domain and do not generally apply to all 
>>>>> clustering problems.
>>>>
>>>> The end problem is really clustered web and ejb sessions at the 
>>>> moment, although it looks as if by the time we have solved these 
>>>> issues we may well have written a fault-tolerant 
>>>> distributed/partitioned index that might be very useful as a 
>>>> generic distributed cache building block.
>>>>
>>>> One thing that I do want wadi to do is to still work when 
>>>> replication is switched off. i.e., if a session only exists as a 
>>>> primary copy, even if affinity breaks down, wadi will continue to 
>>>> correctly render requests for that session unless some form of 
>>>> catastrophic failure causes the session to evaporate. This means 
>>>> that I need to ensure the session's timely evacuation from a node 
>>>> that chooses to leave the cluster to a remaining node, so that it 
>>>> may remain active beyond the lifetime of its original node. All of 
>>>> this must work flawlessly under stress, so that an admin may add or 
>>>> remove nodes to a running cluster without having to worry about the 
>>>> user state that it is managing. Nodes are added by simply starting 
>>>> them, and nodes removed via e.g. ctl-c-ing them.
>>>>
>>>> If it is decided that a few more nines are needed in terms of 
>>>> session availability and the cluster owner understands the extra 
>>>> cost involved in in-vm replication in terms of extra hardware and 
>>>> bandwidth that they will have to purchase and is happy to go with 
>>>> in-vm-replication, then it should be sufficient to up the number of 
>>>> replicated copies kept by the cluster from '0' to e.g. '2' and 
>>>> restart (It might even be possible to vary this setting on a node 
>>>> to node basis so that this change does not even involve a complete 
>>>> cluster cold start). WADI should deal with the rest.
>>>>
>>>> So, I believe that I have a pretty clear idea of what WADI will do, 
>>>> and aside from the replication stuff (phase2) it currently does 
>>>> most of what I had in mind for phase1, except that it is not yet 
>>>> happy under stress. I figure it will probably take one or two more 
>>>> redesign/reimplementation iterations to get it to this stage, then 
>>>> I can consider replication.
>>>>
>>>> I have spoken to members of the OpenEJB team about  wadi's ability 
>>>> to relocate requests as well as sessions and we came to the 
>>>> conclusion that it was just as applicable in the EJB world as the 
>>>> web world. If the node an ejb client is talking to leaves the 
>>>> cluster in between calls, the client may try to contact it and then 
>>>> failover to another node that it hopes holds the session. If, due 
>>>> to other nodes leaving/joining it is not always clear which node 
>>>> will contain the session, the ability to reply to an RMI and just 
>>>> say "not here - there!" - i.e. an rmi redirection - would not be 
>>>> hard to add and would resolve this situation. Transactions are 
>>>> another item which I have marked phase2.
>>>>
>>>> So, I am trying hard to stay very focussed on the problem domain, 
>>>> otherwise this will never get finished :-)
>>>>
>>>> Right, off to read those papers now - thanks for your posting and 
>>>> your interest,
>>>>
>>>> Jules
>>>>
>>>>>
>>>>> Hope this helps
>>>>>
>>>>> andy 
>>>>
>>>
>>>
>>
>>
>
>

-- 
Joe Bohn     

joe.bohn@earthlink.net 
"He is no fool who gives what he cannot keep, to gain what he cannot lose."   -- Jim Elliot


Re: Clustering (long)

Posted by Jules Gosnell <ju...@coredevelopers.net>.
hmm... hmmm... :-)

more thoughts on (1) and (2)...

When a node leaves/joins it needs to acquire a lease on the bucket 
tables of every node that it intends to move buckets from/to. If two 
nodes are doing this at the same time, their requirements will collide 
(deadlock) somewhere in the cluster. At this point they may be notified 
and e.g. compare ip addresses to decide who continues and who backs off 
for a while.
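
i.e. a tie-break along these lines (hypothetical sketch, not existing 
WADI code):

    // On detecting a collision, the node whose id sorts lower keeps its
    // leases; the other releases everything and retries after a
    // randomised pause so the two don't simply collide again.
    final class CollisionTieBreak {
        private static final java.util.Random RANDOM = new java.util.Random();

        static boolean shouldBackOff(String myId, String otherId) {
            return myId.compareTo(otherId) > 0;   // higher id yields
        }

        static void backOff() throws InterruptedException {
            Thread.sleep(500 + RANDOM.nextInt(1000));   // 0.5s - 1.5s pause
        }
    }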

So, (1) and (2), whilst being possible, are probably more complex than I 
initially imagined. If we have Paxos for the more general purpose case 
(3) anyway, it would probably be smart just to go with this, until such 
optimisations become necessary, if at all.

Jules


Jules Gosnell wrote:

> hmmm...
>
> now I'm wondering about my solutions to (1) and (2) - if more than one 
> node tries to join or leave at the same time I may be in trouble - so 
> it may be safer to go straight to (3) for all cases...
>
> more thought needed :-)
>
> Jules
>
>
>
> Jules Gosnell wrote:
>
>> I've had a look at the Lampson paper, but didn't take it all in on 
>> the first pass - I think it will need some serious concentration. The 
>> Paxos algorithm looks interesting, I will definitely pursue this avenue.
>>
>> I've also given a little thought to exactly why I need a Coordinator 
>> and how Paxos might be used to replace it. My use of a Coordinator 
>> and plans for its future do not actually seem that far from Paxos, on 
>> a preliminary reading.
>>
>> Given that WADI currently uses a distributed map of 
>> sessionId:sessionLocation, that this distribution is achieved by 
>> sharing out responsibility for the set number of buckets that 
>> comprise the map roughly evenly between the cluster members and that 
>> this is currently my most satisfying design, I can break my problem 
>> space (for bucket arrangement) down into 3 basic cases :
>>
>> 1) Node joins
>> 2) Node leaves in controlled fashion
>> 3) Node dies
>>
>> If the node under discussion is the only cluster member, then no 
>> bucket rearrangement is necessary - this node will either create or 
>> destroy the full set of buckets. I'll leave this set of subcases as 
>> trivial.
>>
>> 1)  The joining node will need to assume responsibility for a number 
>> of buckets. If buckets-per-node is to be kept roughly the same for 
>> every node, it is likely that the joining node will require transfer 
>> of a small number of buckets from every current cluster member i.e. 
>> we are starting a bucket rearrangement that will involve every 
>> cluster member and only need be done if the join is successful. So, 
>> although we wish to avoid an SPoF, if that SPoF turns out to be the 
>> joining node, then I don't see it as a problem. If the joining node 
>> dies, then we no longer have to worry about rearranging our buckets 
>> (unless we have lost some that had already been transferred - see 
>> (3)). Thus the joining node may be used as a single 
>> Coordinator/Leader for this negotiation without fear of the SPoF 
>> problem. Are we on the same page here ?
>>
>> 2) The same argument may be applied in reverse to a node leaving in a 
>> controlled fashion. It will wish to evacuate its buckets roughly 
>> equally to all remaining cluster members. If it shuts down cleanly, 
>> this would form part of its shutdown protocol. If it dies before or 
>> during the execution of this protocol then we are back at (3), if 
>> not, then the SPoF issue may again be put to one side.
>>
>> 3) This is where things get tricky :-) Currently WADI has, for the 
>> sake of simplicity, one single algorithm / thread / point-of-failure 
>> which recalculates a complete bucket arrangement if it detects (1), 
>> (2) or (3). It would be simple enough to offload the work done for 
>> (1) and (2) to the node joining/leaving and this should reduce wadi's 
>> current vulnerability, but we still need to deal with catastrophic 
>> failure. Currently WADI rebuilds the missing buckets by querying the 
>> cluster for the locations of any sessions that fall within them, but 
>> it could equally carry a replicated backup and dust it off as part of 
>> this procedure. It's just a trade-off between work done up front and 
>> work done in exceptional circumstance... This is the place where the 
>> Paxos algorithm may come in handy - bucket recomposition and 
>> rearrangement. I need to give this further thought. For the immediate 
>> future, however, I think WADI will stay with a single Coordinator in 
>> this situation, which fails-over if http://activecluster.codehaus.org 
>> says it should - I'm delegating the really thorny problem to James 
>> :-). I agree with you that this is an SPoF and that WADI's ability to 
>> recover from failure here depends directly on how we decide if a node 
>> is alive or dead - a very tricky thing to do.
>>
>> In conclusion then, I think that we have usefully identified a 
>> weakness that will become more relevant as the rest of WADI's 
>> features mature. The Lampson paper mentioned describes an algorithm 
>> for allowing nodes to reach a consensus on actions to be performed, 
>> in a redundant manner with no SPoF and I shall consider how this 
>> might replace WADI's current single Coordinator, whilst also looking 
>> at performing other Coordination on joining/leaving nodes where its 
>> failure, coinciding with that of its host node, will be irrelevant, 
>> since the very condition that it was intended to resolve has ceased 
>> to exist.
>>
>> How does that sound, Andy ? Do you agree with my thoughts on (1) & 
>> (2) ? This is great input - thanks,
>>
>>
>> Jules
>>
>>
>> Jules Gosnell wrote:
>>
>>> Andy Piper wrote:
>>>
>>>> Hi Jules
>>>>
>>>> At 05:37 AM 7/27/2005, Jules Gosnell wrote:
>>>>
>>>>> I agree on the SPoF thing - but I think you misunderstand my 
>>>>> Coordinator arch. I do not have a single static Coordinator node, 
>>>>> but a dynamic Coordinator role, into which a node may be elected. 
>>>>> Thus every node is a potential Coordinator. If the elected 
>>>>> Coordinator dies, another is immediately elected. The election 
>>>>> strategy is pluggable, although it will probably end up being 
>>>>> hardwired to "oldest-cluster-member". The reason behind this is 
>>>>> that re-laying out your cluster is much simpler if it is done in a 
>>>>> single vm. I originally tried to do it in multiple vms, each 
>>>>> taking responsibility for pieces of the cluster, but if the vms' 
>>>>> views are not completely in sync, things get very hairy, and 
>>>>> completely in sync is an expensive thing to achieve - and would 
>>>>> introduce a cluster-wide single point of contention. So I do it in 
>>>>> a single vm, as fast as I can, with failover, in case that vm 
>>>>> evaporates. Does that sound better than the scenario that you had 
>>>>> in mind ?
>>>>
>>>> This is exactly the "hard" computer science problem that you 
>>>> shouldn't be trying to solve if at all possible. It's hard because 
>>>> network partitions or hung processes (think GC) make it very easy 
>>>> for your colleagues to think you are dead when you do not share 
>>>> that view. The result is two processes that both think they are the 
>>>> coordinator, and anarchy can ensue (commonly called split-brain 
>>>> syndrome). I can point you at papers if you want, but I really 
>>>> suggest that you aim for an implementation that is independent of a 
>>>> central coordinator. Note that a central coordinator is necessary 
>>>> if you want to implement a strongly-consistent in-memory database, 
>>>> but this is not usually a requirement for session replication say.
>>>>
>>>> http://research.microsoft.com/Lampson/58-Consensus/Abstract.html 
>>>> gives a good introduction to some of these things. I also presented 
>>>> at JavaOne on related issues, you should be able to download the 
>>>> presentation from dev2dev.bea.com at some point (not there yet - I 
>>>> just checked).
>>>
>>>
>>>
>>>
>>> OK - I will have a look at these papers and reconsider... perhaps I 
>>> can come up with some sort of fractal algorithm which recursively 
>>> breaks down the cluster into subclusters each of which is capable of 
>>> doing likewise to itself and then lay out the buckets recursively 
>>> via this metaphor... - this would be much more robust, as you point 
>>> out, but, I think, a more complicated architecture. I will give it 
>>> some serious thought. Have you any suggestions/papers as to how you 
>>> might do something like this in a distributed manner, bearing in 
>>> mind that as a node joins, some existing nodes will see it as having 
>>> joined and some will not yet have noticed and vice-versa on leaving....
>>>
>>>>
>>>>> The Coordinator is not there to support session replication, but 
>>>>> rather the management of the distributed map (a few buckets of 
>>>>> which live on each node) which is used by WADI to discover very 
>>>>> efficiently whether a session exists and where it is located. This 
>>>>> map must be rearranged, in the most efficient way possible, each 
>>>>> time a node joins or leaves the cluster.
>>>>
>>>> Understood. Once you have a fault-tolerant singleton coordinator 
>>>> you can solve lots of interesting problems, it's just hard and often 
>>>> not worth the effort or the expense (typical implementations 
>>>> involve HA HW or an HA DB or at least 3 server processes).
>>>
>>>
>>>
>>>
>>> Since I am only currently using the singleton coordinator for bucket 
>>> arrangement, I may just live with it for the moment, in order to 
>>> move forward, but make a note to replace it and start background 
>>> threads on how that might be achieved...
>>>
>>>>
>>>>> Replication is NYI - but I'm running a few mental background 
>>>>> threads that suggest that an extension to the index will mean that 
>>>>> it associates the session's id not just to its current location, 
>>>>> but also to the location of a number of replicants. I also have 
>>>>> ideas on how a session might choose nodes into which it will place 
>>>>> its replicants and how I can avoid the primary session copy ever 
>>>>> being colocated with a replicant (potential SPoF - if you only 
>>>>> have one replicant), etc...
>>>>
>>>> Right, definitely something you want to avoid.
>>>>
>>>>> Yes, I can see that happening - I have an improvement (NYI) to 
>>>>> WADI's evacuation strategy (how sessions are evacuated when a node 
>>>>> wishes to leave). Each session will be evacuated to the node which 
>>>>> owns the bucket into which its id hashes. This is because 
>>>>> colocation of the session with the bucket allows many messages 
>>>>> concerned with its future destruction and relocation to be 
>>>>> optimised away. Future requests falling elsewhere but needing this 
>>>>> session should, in the most efficient case, be relocated to this 
>>>>> same node; otherwise the session may be relocated, but at a cost...
>>>>
>>>> How do you relocate the request? Many HW load-balancers do not 
>>>> support this (or else it requires using proprietary APIs), so you 
>>>> probably have to count on moving sessions in the normal failover case.
>>>
>>>
>>>
>>>
>>> If I can squeeze the behaviour that I require out of the 
>>> load-balancer, then, depending on the request type, I may be able to 
>>> get away with a redirection with a changed session cookie or url 
>>> param, or, failing this, an http-proxy across from a filter above 
>>> the servlet on one side to the http-port on the node that owns the 
>>> session...
>>>
>>> The LB-integration object is pluggable and the aim is to supply wadi 
>>> with a good selection of LB integrations - currently I only have a 
>>> ModJK[2] plugin working. This is able to 'restick' clients to their 
>>> session's new location (although messing with the session id is a 
>>> little dodgy...).
>>>
>>>>
>>>>> I would be very grateful for any thoughts or feedback that you 
>>>>> could give me. I hope to get much more information about WADI into 
>>>>> the wiki over the next few weeks. That should help generate more 
>>>>> discussion, although I would be more than happy for people to ask 
>>>>> me questions here on Geronimo-dev because this will give me an 
>>>>> idea of what documentation I should write and how existing 
>>>>> documentation may be lacking or misleading.
>>>>
>>>> I guess my general comment would be that you might find it better 
>>>> to think specifically about the end-user problem you are trying to 
>>>> solve (say session replication) and work towards a solution based 
>>>> on that. Most short-cuts / optimizations that vendors make are 
>>>> specific to the problem domain and do not generally apply to all 
>>>> clustering problems.
>>>
>>>
>>>
>>>
>>> The end problem is really clustered web and ejb sessions at the 
>>> moment, although it looks as if by the time we have solved these 
>>> issues we may well have written a fault-tolerant 
>>> distributed/partitioned index that might be very useful as a generic 
>>> distributed cache building block.
>>>
>>> One thing that I do want wadi to do is to still work when 
>>> replication is switched off. i.e., if a session only exists as a 
>>> primary copy, even if affinity breaks down, wadi will continue to 
>>> correctly render requests for that session unless some form of 
>>> catastrophic failure causes the session to evaporate. This means 
>>> that I need to ensure the session's timely evacuation from a node 
>>> that chooses to leave the cluster to a remaining node, so that it 
>>> may remain active beyond the lifetime of its original node. All of 
>>> this must work flawlessly under stress, so that an admin may add or 
>>> remove nodes to a running cluster without having to worry about the 
>>> user state that it is managing. Nodes are added by simply starting 
>>> them, and nodes removed via e.g. ctl-c-ing them.
>>>
>>> If it is decided that a few more nines are needed in terms of 
>>> session availability and the cluster owner understands the extra 
>>> cost involved in in-vm replication in terms of extra hardware and 
>>> bandwidth that they will have to purchase and is happy to go with 
>>> in-vm-replication, then it should be sufficient to up the number of 
>>> replicated copies kept by the cluster from '0' to e.g. '2' and 
>>> restart (It might even be possible to vary this setting on a node to 
>>> node basis so that this change does not even involve a complete 
>>> cluster cold start). WADI should deal with the rest.
>>>
>>> So, I believe that I have a pretty clear idea of what WADI will do, 
>>> and aside from the replication stuff (phase2) it currently does most 
>>> of what I had in mind for phase1, except that it is not yet happy 
>>> under stress. I figure it will probably take one or two more 
>>> redesign/reimplementation iterations to get it to this stage, then I 
>>> can consider replication.
>>>
>>> I have spoken to members of the OpenEJB team about  wadi's ability 
>>> to relocate requests as well as sessions and we came to the 
>>> conclusion that it was just as applicable in the EJB world as the 
>>> web world. If the node an ejb client is talking to leaves the 
>>> cluster in between calls, the client may try to contact it and then 
>>> failover to another node that it hopes holds the session. If, due to 
>>> other nodes leaving/joining it is not always clear which node will 
>>> contain the session, the ability to reply to an RMI and just say 
>>> "not here - there!" - i.e. an rmi redirection - would not be hard to 
>>> add and would resolve this situation. Transactions are another item 
>>> which I have marked phase2.
>>>
>>> So, I am trying hard to stay very focussed on the problem domain, 
>>> otherwise this will never get finished :-)
>>>
>>> Right, off to read those papers now - thanks for your posting and 
>>> your interest,
>>>
>>> Jules
>>>
>>>>
>>>> Hope this helps
>>>>
>>>> andy 
>>>
>>
>>
>
>


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Clustering (long)

Posted by Jules Gosnell <ju...@coredevelopers.net>.
hmmm...

now I'm wondering about my solutions to (1) and (2) - if more than one 
node tries to join or leave at the same time, I may be in trouble - so 
it may be safer to go straight to (3) for all cases...

more thought needed :-)

Jules



Jules Gosnell wrote:

> I've had a look at the Lampson paper, but didn't take it all in on the 
> first pass - I think it will need some serious concentration. The 
> Paxos algorithm looks interesting, I will definitely pursue this avenue.
>
> I've also given a little thought to exactly why I need a Coordinator 
> and how Paxos might be used to replace it. My use of a Coordinator and 
> plans for its future do not actually seem that far from Paxos, on a 
> preliminary reading.
>
> Given that WADI currently uses a distributed map of 
> sessionId:sessionLocation, that this distribution is achieved by 
> sharing out responsibility for the set number of buckets that comprise 
> the map roughly evenly between the cluster members and that this is 
> currently my most satisfying design, I can break my problem space (for 
> bucket arrangement) down into 3 basic cases:
>
> 1) Node joins
> 2) Node leaves in controlled fashion
> 3) Node dies
>
> If the node under discussion is the only cluster member, then no 
> bucket rearrangement is necessary - this node will either create or 
> destroy the full set of buckets. I'll leave this set of subcases as 
> trivial.
>
> 1) The joining node will need to assume responsibility for a number 
> of buckets. If buckets-per-node is to be kept roughly the same for 
> every node, it is likely that the joining node will require transfer 
> of a small number of buckets from every current cluster member, i.e. 
> we are starting a bucket rearrangement that will involve every 
> cluster member and only need be done if the join is successful. So, 
> although we wish to avoid an SPoF, if that SPoF turns out to be the 
> joining node, then I don't see it as a problem. If the joining node 
> dies, then we no longer have to worry about rearranging our buckets 
> (unless we have lost some that had already been transferred - see 
> (3)). Thus the joining node may be used as a single 
> Coordinator/Leader for this negotiation without fear of the SPoF 
> problem. Are we on the same page here?
>
> 2) The same argument may be applied in reverse to a node leaving in a 
> controlled fashion. It will wish to evacuate its buckets roughly 
> equally to all remaining cluster members. If it shuts down cleanly, 
> this would form part of its shutdown protocol. If it dies before or 
> during the execution of this protocol, then we are back at (3); if 
> not, then the SPoF issue may again be put to one side.
>
> 3) This is where things get tricky :-) Currently WADI has, for the 
> sake of simplicity, one single algorithm / thread / point-of-failure 
> which recalculates a complete bucket arrangement if it detects (1), 
> (2) or (3). It would be simple enough to offload the work done for (1) 
> and (2) to the node joining/leaving, and this should reduce WADI's 
> current vulnerability, but we still need to deal with catastrophic 
> failure. Currently WADI rebuilds the missing buckets by querying the 
> cluster for the locations of any sessions that fall within them, but 
> it could equally carry a replicated backup and dust it off as part of 
> this procedure. It's just a trade-off between work done up front and 
> work done in exceptional circumstances... This is the place where the 
> Paxos algorithm may come in handy - bucket recomposition and 
> rearrangement. I need to give this further thought. For the immediate 
> future, however, I think WADI will stay with a single Coordinator in 
> this situation, which fails over if http://activecluster.codehaus.org 
> says it should - I'm delegating the really thorny problem to James 
> :-). I agree with you that this is an SPoF and that WADI's ability to 
> recover from failure here depends directly on how we decide if a node 
> is alive or dead - a very tricky thing to do.
>
> In conclusion then, I think that we have usefully identified a 
> weakness that will become more relevant as the rest of WADI's features 
> mature. The Lampson paper mentioned describes an algorithm for 
> allowing nodes to reach a consensus on actions to be performed, in a 
> redundant manner with no SPoF, and I shall consider how this might 
> replace WADI's current single Coordinator, whilst also looking at 
> performing other Coordination on joining/leaving nodes where its 
> failure, coinciding with that of its host node, will be irrelevant, 
> since the very condition that it was intended to resolve has ceased to 
> exist.
>
> How does that sound, Andy? Do you agree with my thoughts on (1) & 
> (2)? This is great input - thanks,
>
>
> Jules
>
>
> Jules Gosnell wrote:
>
>> Andy Piper wrote:
>>
>>> Hi Jules
>>>
>>> At 05:37 AM 7/27/2005, Jules Gosnell wrote:
>>>
>>>> I agree on the SPoF thing - but I think you misunderstand my 
>>>> Coordinator arch. I do not have a single static Coordinator node, 
>>>> but a dynamic Coordinator role, into which a node may be elected. 
>>>> Thus every node is a potential Coordinator. If the elected 
>>>> Coordinator dies, another is immediately elected. The election 
>>>> strategy is pluggable, although it will probably end up being 
>>>> hardwired to "oldest-cluster-member". The reason behind this is 
>>>> that relaying out your cluster is much simpler if it is done in a 
>>>> single vm. I originally tried to do it in multiple vms, each taking 
>>>> responsibility for pieces of the cluster, but if the vms views are 
>>>> not completely in sync, things get very hairy, and completely in 
>>>> sync is an expensive thing to achieve - and would introduce a 
>>>> cluster-wide single point of contention. So I do it in a single vm, 
>>>> as fast as I can, with fail over, in case that vm evaporates. Does 
>>>> that sound better than the scenario that you had in mind ?
>>>
>>>
>>>
>>>
>>> This is exactly the "hard" computer science problem that you 
>>> shouldn't be trying to solve if at all possible. Its hard because 
>>> network partitions or hung processes (think GC) make it very easy 
>>> for your colleagues to think you are dead when you do not share that 
>>> view. The result is two processes who think they are the coordinator 
>>> and anarchy can ensue (commonly called split-brain syndrome). I can 
>>> point you at papers if you want, but I really suggest that you aim 
>>> for an implementation that is independent of a central coordinator. 
>>> Note that a central coordinator is necessary if you want to 
>>> implement a strongly-consistent in-memory database, but this is not 
>>> usually a requirement for session replication say.
>>>
>>> http://research.microsoft.com/Lampson/58-Consensus/Abstract.html 
>>> gives a good introduction to some of these things. I also presented 
>>> at JavaOne on related issues, you should be able to download the 
>>> presentation from dev2dev.bea.com at some point (not there yet - I 
>>> just checked).
>>
>>
>>
>> OK - I will have a look at these papers and reconsider... perhaps I 
>> can come up with some sort of fractal algorithm which recursively 
>> breaks down the cluster into subclusters, each of which is capable 
>> of doing likewise to itself, and then lay out the buckets 
>> recursively via this metaphor... - this would be much more robust, 
>> as you point out, but, I think, a more complicated architecture. I 
>> will give it some serious thought. Have you any suggestions/papers 
>> as to how you might do something like this in a distributed manner, 
>> bearing in mind that as a node joins, some existing nodes will see 
>> it as having joined and some will not yet have noticed, and 
>> vice-versa on leaving....
>>
>>>
>>>> The Coordinator is not there to support session replication, but 
>>>> rather the management of the distributed map (map of which a few 
>>>> buckets live on each node) which is used by WADI to discover very 
>>>> efficiently whether a session exists and where it is located. This 
>>>> map must be rearranged, in the most efficient way possible, each 
>>>> time a node joins or leaves the cluster.
>>>
>>>
>>>
>>>
>>> Understood. Once you have a fault-tolerant singleton coordinator you 
>>> can solve lots of interesting problems, its just hard and often not 
>>> worth the effort or the expense (typical implementations involve HA 
>>> HW or an HA DB or at least 3 server processes).
>>
>>
>>
>> Since I am only currently using the singleton coordinator for bucket 
>> arrangement, I may just live with it for the moment, in order to move 
>> forward, but make a note to replace it and start background threads 
>> on how that might be achieved...
>>
>>>
>>>> Replication is NYI - but I'm running a few mental background 
>>>> threads that suggest that an extension to the index will mean that 
>>>> it associates the session's id not just to its current location, 
>>>> but also to the location of a number of replicants. I also have 
>>>> ideas on how a session might choose nodes into which it will place 
>>>> its replicants and how I can avoid the primary session copy ever 
>>>> being colocated with a replicant (potential SPoF - if you only have 
>>>> one replicant), etc...
>>>
>>>
>>>
>>>
>>> Right definitely something you want to avoid.
>>>
>>>> Yes, I can see that happening - I have an improvement (NYI) to 
>>>> WADI's evacuation strategy (how sessions are evacuated when a node 
>>>> wishes to leave). Each session will be evacuated to the node which 
>>>> owns the bucket into which its id hashes. This is because 
>>>> colocation of the session with the bucket allows many messages 
>>>> concerned with its future destruction and relocation to be 
>>>> optimised away. Future requests falling elsewhere but needing this 
>>>> session should, in the most efficient case, be relocated to this 
>>>> same node, otherwise the session may be relocated, but at a cost...
>>>
>>>
>>>
>>>
>>> How do you relocate the request? Many HW load-balancers do not 
>>> support this (or else it requires using proprietary APIs), so you 
>>> probably have to count on
>>> moving sessions in the normal failover case.
>>
>>
>>
>> If I can squeeze the behaviour that I require out of the 
>> load-balancer then, depending on the request type, I may be able to 
>> get away with a redirection with a changed session cookie or URL 
>> param or, failing that, an http-proxy across from a filter above the 
>> servlet on one side to the http port on the node that owns the 
>> session...
>>
>> The LB-integration object is pluggable and the aim is to supply WADI 
>> with a good selection of LB integrations - currently I only have a 
>> ModJK[2] plugin working. This is able to 'restick' clients to their 
>> session's new location (although messing with the session id is a 
>> little dodgy...).
>>
>>>
>>>> I would be very grateful in any thoughts or feedback that you could 
>>>> give me. I hope to get much more information about WADI into the 
>>>> wiki over the next few weeks. That should help generate more 
>>>> discussion, although I would be more than happy for people to ask 
>>>> me questions here on Geronimo-dev because this will give me an idea 
>>>> of what documentation I should write and how existing documentation 
>>>> may be lacking or misleading.
>>>
>>>
>>>
>>>
>>> I guess my general comment would be that you might find it better to 
>>> think specifically about the end-user problem you are trying to 
>>> solve (say session replication) and work towards a solution based on 
>>> that. Most short-cuts / optimizations that vendors make are specific 
>>> to the problem domain and do not generally apply to all clustering 
>>> problems.
>>
>>
>>
>> The end problem is really clustered web and ejb sessions at the 
>> moment, although it looks as if by the time we have solved these 
>> issues we may well have written a fault-tolerant 
>> distributed/partitioned index that might be very useful as a generic 
>> distributed cache building block.
>>
>> One thing that I do want WADI to do is to keep working when 
>> replication is switched off, i.e. if a session only exists as a 
>> primary copy, then even if affinity breaks down, WADI will continue 
>> to correctly serve requests for that session unless some form of 
>> catastrophic failure causes the session to evaporate. This means that 
>> I need to ensure the session's timely evacuation from a node that 
>> chooses to leave the cluster to a remaining node, so that it may 
>> remain active beyond the lifetime of its original node. All of this 
>> must work flawlessly under stress, so that an admin may add nodes to 
>> or remove nodes from a running cluster without having to worry about 
>> the user state that it is managing. Nodes are added by simply 
>> starting them, and removed via e.g. ctl-c-ing them.
>>
>> If it is decided that a few more nines of session availability are 
>> needed, and the cluster owner understands the extra cost of in-vm 
>> replication in terms of the extra hardware and bandwidth that they 
>> will have to purchase and is happy to go with it, then it should be 
>> sufficient to up the number of replicated copies kept by the cluster 
>> from '0' to e.g. '2' and restart (it might even be possible to vary 
>> this setting on a node-to-node basis, so that this change does not 
>> even involve a complete cluster cold start). WADI should deal with 
>> the rest.
>>
>> So, I believe that I have a pretty clear idea of what WADI will do, 
>> and aside from the replication stuff (phase2) it currently does most 
>> of what I had in mind for phase1, except that it is not yet happy 
>> under stress. I figure it will probably take one or two more 
>> redesign/reimplementation iterations to get it to this stage, then I 
>> can consider replication.
>>
>> I have spoken to members of the OpenEJB team about WADI's ability to 
>> relocate requests as well as sessions and we came to the conclusion 
>> that it was just as applicable in the EJB world as the web world. If 
>> the node an EJB client is talking to leaves the cluster in between 
>> calls, the client may try to contact it and then fail over to another 
>> node that it hopes holds the session. If, due to other nodes 
>> leaving/joining, it is not always clear which node will contain the 
>> session, the ability to reply to an RMI call and just say "not here - 
>> there!" - i.e. an RMI redirection - would not be hard to add and 
>> would resolve this situation. Transactions are another item which I 
>> have marked phase2.
>>
>> So, I am trying hard to stay very focussed on the problem domain, 
>> otherwise this will never get finished :-)
>>
>> Right, off to read those papers now - thanks for your posting and 
>> your interest,
>>
>> Jules
>>
>>>
>>> Hope this helps
>>>
>>> andy 


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Clustering (long)

Posted by Jules Gosnell <ju...@coredevelopers.net>.
I've had a look at the Lampson paper, but didn't take it all in on the 
first pass - I think it will need some serious concentration. The Paxos 
algorithm looks interesting, I will definitely pursue this avenue.

I've also given a little thought to exactly why I need a Coordinator and 
how Paxos might be used to replace it. My use of a Coordinator and plans 
for its future do not actually seem that far from Paxos, on a 
preliminary reading.

Given that WADI currently uses a distributed map of 
sessionId:sessionLocation, that this distribution is achieved by sharing 
out responsibility for the set number of buckets that comprise the map 
roughly evenly between the cluster members and that this is currently my 
most satisfying design, I can break my problem space (for bucket 
arrangement) down into 3 basic cases:

1) Node joins
2) Node leaves in controlled fashion
3) Node dies

If the node under discussion is the only cluster member, then no bucket 
rearrangement is necessary - this node will either create or destroy the 
full set of buckets. I'll leave this set of subcases as trivial.
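
To pin down what I mean by the map and its buckets, here is a rough 
sketch of the shape of the thing - illustrative names only, not WADI's 
real classes:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: a fixed number of buckets partitions the sessionId space.
    // Each bucket is owned by exactly one node at a time; ownership
    // moves as nodes join and leave, but a session's bucket never
    // changes, so a lookup is one hash plus one hop to the bucket owner.
    public class BucketMap {
        private final int numBuckets;          // fixed for the cluster's lifetime
        private final String[] bucketOwner;    // bucket index -> owning node
        // sessionId -> location, but only for sessions in buckets we own
        private final Map<String, String> locations = new HashMap<String, String>();

        public BucketMap(int numBuckets) {
            this.numBuckets = numBuckets;
            this.bucketOwner = new String[numBuckets];
        }

        // every node computes the same bucket for the same id
        public int bucketFor(String sessionId) {
            return (sessionId.hashCode() & 0x7fffffff) % numBuckets;
        }

        // to locate a session, ask the node that owns its bucket
        public String ownerOf(String sessionId) {
            return bucketOwner[bucketFor(sessionId)];
        }

        // called on the owning node when a session is created or moves
        public void record(String sessionId, String node) {
            locations.put(sessionId, node);
        }
    }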

1) The joining node will need to assume responsibility for a number of 
buckets. If buckets-per-node is to be kept roughly the same for every 
node, it is likely that the joining node will require transfer of a 
small number of buckets from every current cluster member, i.e. we are 
starting a bucket rearrangement that will involve every cluster member 
and only need be done if the join is successful. So, although we wish 
to avoid an SPoF, if that SPoF turns out to be the joining node, then I 
don't see it as a problem. If the joining node dies, then we no longer 
have to worry about rearranging our buckets (unless we have lost some 
that had already been transferred - see (3)). Thus the joining node may 
be used as a single Coordinator/Leader for this negotiation without 
fear of the SPoF problem. Are we on the same page here?
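
Roughly, the transfer calculation I have in mind for a join looks like 
this - and the same sum run in reverse serves for a controlled leave. A 
sketch only; the messaging and failure handling are the real work:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: the joiner should end up with roughly numBuckets/numNodes
    // buckets, taken a few at a time from each existing member, so no
    // single member is unduly disturbed and the joiner itself can drive
    // the whole negotiation. (An empty 'current' map - the first node -
    // yields an empty plan: that node simply creates all the buckets.)
    public class JoinPlan {
        // current: node -> buckets it owns now
        // returns: node -> buckets that node should hand to the joiner
        public static Map<String, List<Integer>> planFor(
                Map<String, List<Integer>> current, int numBuckets) {
            int nodesAfterJoin = current.size() + 1;
            int fairShare = numBuckets / nodesAfterJoin; // joiner's target
            int wanted = fairShare;
            Map<String, List<Integer>> transfers =
                    new LinkedHashMap<String, List<Integer>>();
            for (Map.Entry<String, List<Integer>> e : current.entrySet()) {
                List<Integer> owned = e.getValue();
                // each member gives up only its surplus over the new fair share
                int give = Math.min(wanted, Math.max(0, owned.size() - fairShare));
                if (give > 0) {
                    transfers.put(e.getKey(),
                                  new ArrayList<Integer>(owned.subList(0, give)));
                    wanted -= give;
                }
            }
            return transfers;
        }
    }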

2) The same argument may be applied in reverse to a node leaving in a 
controlled fashion. It will wish to evacuate its buckets roughly equally 
to all remaining cluster members. If it shuts down cleanly, this would 
form part of its shutdown protocol. If it dies before or during the 
execution of this protocol, then we are back at (3); if not, then the 
SPoF issue may again be put to one side.

3) This is where things get tricky :-) Currently WADI has, for the sake 
of simplicity, one single algorithm / thread / point-of-failure which 
recalculates a complete bucket arrangement if it detects (1), (2) or 
(3). It would be simple enough to offload the work done for (1) and (2) 
to the node joining/leaving, and this should reduce WADI's current 
vulnerability, but we still need to deal with catastrophic failure. 
Currently WADI rebuilds the missing buckets by querying the cluster for 
the locations of any sessions that fall within them, but it could 
equally carry a replicated backup and dust it off as part of this 
procedure. It's just a trade-off between work done up front and work 
done in exceptional circumstances... This is the place where the Paxos 
algorithm may come in handy - bucket recomposition and rearrangement. I 
need to give this further thought. For the immediate future, however, I 
think WADI will stay with a single Coordinator in this situation, which 
fails over if http://activecluster.codehaus.org says it should - I'm 
delegating the really thorny problem to James :-). I agree with you that 
this is an SPoF and that WADI's ability to recover from failure here 
depends directly on how we decide if a node is alive or dead - a very 
tricky thing to do.
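
In code, the rebuild-by-query idea is roughly the following (a sketch; 
Node stands in for whatever the clustering layer provides, and in 
reality the per-bucket query would be a single cluster-wide broadcast):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: when a node dies, each bucket it owned is rebuilt by
    // asking the survivors which of their sessions hash into that
    // bucket - work done in the exceptional case instead of paying for
    // replication up front.
    interface Node {
        // sessionId -> location, for this node's sessions in the bucket
        Map<String, String> sessionsIn(int bucket);
    }

    class BucketRebuilder {
        Map<String, String> rebuild(int lostBucket, Iterable<Node> survivors) {
            Map<String, String> index = new HashMap<String, String>();
            for (Node survivor : survivors) {
                index.putAll(survivor.sessionsIn(lostBucket));
            }
            return index; // the lost bucket's index, reconstructed
        }
    }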

In conclusion then, I think that we have usefully identified a weakness 
that will become more relevant as the rest of WADI's features mature. 
The Lampson paper mentioned describes an algorithm for allowing nodes to 
reach a consensus on actions to be performed, in a redundant manner with 
no SPoF, and I shall consider how this might replace WADI's current 
single Coordinator, whilst also looking at performing other Coordination 
on joining/leaving nodes where its failure, coinciding with that of its 
host node, will be irrelevant, since the very condition that it was 
intended to resolve has ceased to exist.

How does that sound, Andy? Do you agree with my thoughts on (1) & (2)? 
This is great input - thanks,


Jules


Jules Gosnell wrote:

> Andy Piper wrote:
>
>> Hi Jules
>>
>> At 05:37 AM 7/27/2005, Jules Gosnell wrote:
>>
>>> I agree on the SPoF thing - but I think you misunderstand my 
>>> Coordinator arch. I do not have a single static Coordinator node, 
>>> but a dynamic Coordinator role, into which a node may be elected. 
>>> Thus every node is a potential Coordinator. If the elected 
>>> Coordinator dies, another is immediately elected. The election 
>>> strategy is pluggable, although it will probably end up being 
>>> hardwired to "oldest-cluster-member". The reason behind this is that 
>>> relaying out your cluster is much simpler if it is done in a single 
>>> vm. I originally tried to do it in multiple vms, each taking 
>>> responsibility for pieces of the cluster, but if the vms views are 
>>> not completely in sync, things get very hairy, and completely in 
>>> sync is an expensive thing to achieve - and would introduce a 
>>> cluster-wide single point of contention. So I do it in a single vm, 
>>> as fast as I can, with fail over, in case that vm evaporates. Does 
>>> that sound better than the scenario that you had in mind ?
>>
>>
>>
>> This is exactly the "hard" computer science problem that you 
>> shouldn't be trying to solve if at all possible. Its hard because 
>> network partitions or hung processes (think GC) make it very easy for 
>> your colleagues to think you are dead when you do not share that 
>> view. The result is two processes who think they are the coordinator 
>> and anarchy can ensue (commonly called split-brain syndrome). I can 
>> point you at papers if you want, but I really suggest that you aim 
>> for an implementation that is independent of a central coordinator. 
>> Note that a central coordinator is necessary if you want to implement 
>> a strongly-consistent in-memory database, but this is not usually a 
>> requirement for session replication say.
>>
>> http://research.microsoft.com/Lampson/58-Consensus/Abstract.html 
>> gives a good introduction to some of these things. I also presented 
>> at JavaOne on related issues, you should be able to download the 
>> presentation from dev2dev.bea.com at some point (not there yet - I 
>> just checked).
>
>
> OK - I will have a look at these papers and reconsider... perhaps I 
> can come up with some sort of fractal algorithm which recursively 
> breaks down the cluster into subclusters, each of which is capable of 
> doing likewise to itself, and then lay out the buckets recursively 
> via this metaphor... - this would be much more robust, as you point 
> out, but, I think, a more complicated architecture. I will give it 
> some serious thought. Have you any suggestions/papers as to how you 
> might do something like this in a distributed manner, bearing in mind 
> that as a node joins, some existing nodes will see it as having 
> joined and some will not yet have noticed, and vice-versa on 
> leaving....
>
>>
>>> The Coordinator is not there to support session replication, but 
>>> rather the management of the distributed map (map of which a few 
>>> buckets live on each node) which is used by WADI to discover very 
>>> efficiently whether a session exists and where it is located. This 
>>> map must be rearranged, in the most efficient way possible, each 
>>> time a node joins or leaves the cluster.
>>
>>
>>
>> Understood. Once you have a fault-tolerant singleton coordinator you 
>> can solve lots of interesting problems, its just hard and often not 
>> worth the effort or the expense (typical implementations involve HA 
>> HW or an HA DB or at least 3 server processes).
>
>
> Since I am only currently using the singleton coordinator for bucket 
> arrangement, I may just live with it for the moment, in order to move 
> forward, but make a note to replace it and start background threads on 
> how that might be achieved...
>
>>
>>> Replication is NYI - but I'm running a few mental background threads 
>>> that suggest that an extension to the index will mean that it 
>>> associates the session's id not just to its current location, but 
>>> also to the location of a number of replicants. I also have ideas on 
>>> how a session might choose nodes into which it will place its 
>>> replicants and how I can avoid the primary session copy ever being 
>>> colocated with a replicant (potential SPoF - if you only have one 
>>> replicant), etc...
>>
>>
>>
>> Right definitely something you want to avoid.
>>
>>> Yes, I can see that happening - I have an improvement (NYI) to 
>>> WADI's evacuation strategy (how sessions are evacuated when a node 
>>> wishes to leave). Each session will be evacuated to the node which 
>>> owns the bucket into which its id hashes. This is because colocation 
>>> of the session with the bucket allows many messages concerned with 
>>> its future destruction and relocation to be optimised away. Future 
>>> requests falling elsewhere but needing this session should, in the 
>>> most efficient case, be relocated to this same node, otherwise the 
>>> session may be relocated, but at a cost...
>>
>>
>>
>> How do you relocate the request? Many HW load-balancers do not 
>> support this (or else it requires using proprietary APIs), so you 
>> probably have to count on
>> moving sessions in the normal failover case.
>
>
> If I can squeeze the behaviour that I require out of the 
> load-balancer then, depending on the request type, I may be able to 
> get away with a redirection with a changed session cookie or URL 
> param or, failing that, an http-proxy across from a filter above the 
> servlet on one side to the http port on the node that owns the 
> session...
>
> The LB-integration object is pluggable and the aim is to supply WADI 
> with a good selection of LB integrations - currently I only have a 
> ModJK[2] plugin working. This is able to 'restick' clients to their 
> session's new location (although messing with the session id is a 
> little dodgy...).
>
>>
>>> I would be very grateful in any thoughts or feedback that you could 
>>> give me. I hope to get much more information about WADI into the 
>>> wiki over the next few weeks. That should help generate more 
>>> discussion, although I would be more than happy for people to ask me 
>>> questions here on Geronimo-dev because this will give me an idea of 
>>> what documentation I should write and how existing documentation may 
>>> be lacking or misleading.
>>
>>
>>
>> I guess my general comment would be that you might find it better to 
>> think specifically about the end-user problem you are trying to solve 
>> (say session replication) and work towards a solution based on that. 
>> Most short-cuts / optimizations that vendors make are specific to the 
>> problem domain and do not generally apply to all clustering problems.
>
>
> The end problem is really clustered web and ejb sessions at the 
> moment, although it looks as if by the time we have solved these 
> issues we may well have written a fault-tolerant 
> distributed/partitioned index that might be very useful as a generic 
> distributed cache building block.
>
> One thing that I do want WADI to do is to keep working when 
> replication is switched off, i.e. if a session only exists as a 
> primary copy, then even if affinity breaks down, WADI will continue 
> to correctly serve requests for that session unless some form of 
> catastrophic failure causes the session to evaporate. This means that 
> I need to ensure the session's timely evacuation from a node that 
> chooses to leave the cluster to a remaining node, so that it may 
> remain active beyond the lifetime of its original node. All of this 
> must work flawlessly under stress, so that an admin may add nodes to 
> or remove nodes from a running cluster without having to worry about 
> the user state that it is managing. Nodes are added by simply 
> starting them, and removed via e.g. ctl-c-ing them.
>
> If it is decided that a few more nines of session availability are 
> needed, and the cluster owner understands the extra cost of in-vm 
> replication in terms of the extra hardware and bandwidth that they 
> will have to purchase and is happy to go with it, then it should be 
> sufficient to up the number of replicated copies kept by the cluster 
> from '0' to e.g. '2' and restart (it might even be possible to vary 
> this setting on a node-to-node basis, so that this change does not 
> even involve a complete cluster cold start). WADI should deal with 
> the rest.
>
> So, I believe that I have a pretty clear idea of what WADI will do, 
> and aside from the replication stuff (phase2) it currently does most 
> of what I had in mind for phase1, except that it is not yet happy 
> under stress. I figure it will probably take one or two more 
> redesign/reimplementation iterations to get it to this stage, then I 
> can consider replication.
>
> I have spoken to members of the OpenEJB team about WADI's ability to 
> relocate requests as well as sessions and we came to the conclusion 
> that it was just as applicable in the EJB world as the web world. If 
> the node an EJB client is talking to leaves the cluster in between 
> calls, the client may try to contact it and then fail over to another 
> node that it hopes holds the session. If, due to other nodes 
> leaving/joining, it is not always clear which node will contain the 
> session, the ability to reply to an RMI call and just say "not here - 
> there!" - i.e. an RMI redirection - would not be hard to add and 
> would resolve this situation. Transactions are another item which I 
> have marked phase2.
>
> So, I am trying hard to stay very focussed on the problem domain, 
> otherwise this will never get finished :-)
>
> Right, off to read those papers now - thanks for your posting and your 
> interest,
>
> Jules
>
>>
>> Hope this helps
>>
>> andy 


-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/


Re: Clustering (long)

Posted by Jules Gosnell <ju...@coredevelopers.net>.
Andy Piper wrote:

> Hi Jules
>
> At 05:37 AM 7/27/2005, Jules Gosnell wrote:
>
>> I agree on the SPoF thing - but I think you misunderstand my 
>> Coordinator arch. I do not have a single static Coordinator node, but 
>> a dynamic Coordinator role, into which a node may be elected. Thus 
>> every node is a potential Coordinator. If the elected Coordinator 
>> dies, another is immediately elected. The election strategy is 
>> pluggable, although it will probably end up being hardwired to 
>> "oldest-cluster-member". The reason behind this is that relaying out 
>> your cluster is much simpler if it is done in a single vm. I 
>> originally tried to do it in multiple vms, each taking responsibility 
>> for pieces of the cluster, but if the vms views are not completely in 
>> sync, things get very hairy, and completely in sync is an expensive 
>> thing to achieve - and would introduce a cluster-wide single point of 
>> contention. So I do it in a single vm, as fast as I can, with fail 
>> over, in case that vm evaporates. Does that sound better than the 
>> scenario that you had in mind ?
>
>
> This is exactly the "hard" computer science problem that you shouldn't 
> be trying to solve if at all possible. Its hard because network 
> partitions or hung processes (think GC) make it very easy for your 
> colleagues to think you are dead when you do not share that view. The 
> result is two processes who think they are the coordinator and anarchy 
> can ensue (commonly called split-brain syndrome). I can point you at 
> papers if you want, but I really suggest that you aim for an 
> implementation that is independent of a central coordinator. Note that 
> a central coordinator is necessary if you want to implement a 
> strongly-consistent in-memory database, but this is not usually a 
> requirement for session replication say.
>
> http://research.microsoft.com/Lampson/58-Consensus/Abstract.html gives 
> a good introduction to some of these things. I also presented at 
> JavaOne on related issues, you should be able to download the 
> presentation from dev2dev.bea.com at some point (not there yet - I 
> just checked).

OK - I will have a look at these papers and reconsider... perhaps I can 
come up with some sort of fractal algorithm which recursively breaks 
down the cluster into subclusters, each of which is capable of doing 
likewise to itself, and then lay out the buckets recursively via this 
metaphor... - this would be much more robust, as you point out, but, I 
think, a more complicated architecture. I will give it some serious 
thought. Have you any suggestions/papers as to how you might do 
something like this in a distributed manner, bearing in mind that as a 
node joins, some existing nodes will see it as having joined and some 
will not yet have noticed, and vice-versa on leaving....

>
>> The Coordinator is not there to support session replication, but 
>> rather the management of the distributed map (map of which a few 
>> buckets live on each node) which is used by WADI to discover very 
>> efficiently whether a session exists and where it is located. This 
>> map must be rearranged, in the most efficient way possible, each time 
>> a node joins or leaves the cluster.
>
>
> Understood. Once you have a fault-tolerant singleton coordinator you 
> can solve lots of interesting problems, its just hard and often not 
> worth the effort or the expense (typical implementations involve HA HW 
> or an HA DB or at least 3 server processes).

Since I am only currently using the singleton coordinator for bucket 
arrangement, I may just live with it for the moment, in order to move 
forward, but make a note to replace it and start background threads on 
how that might be achieved...
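
For reference, the pluggable election strategy mentioned above is a 
tiny thing - a deterministic function over the membership, so that, 
given an agreed view (which is of course the hard part), every node 
picks the same Coordinator without extra messaging. A sketch, with 
illustrative names:

    import java.util.List;

    // Sketch of a pluggable election strategy; Member is illustrative.
    class Member {
        final String id;
        final long joinedAt; // when this member entered the cluster

        Member(String id, long joinedAt) {
            this.id = id;
            this.joinedAt = joinedAt;
        }
    }

    interface ElectionStrategy {
        Member elect(List<Member> members);
    }

    // "oldest-cluster-member": the earliest joiner wins; deterministic,
    // so all nodes sharing the same view agree without negotiation.
    class OldestMemberStrategy implements ElectionStrategy {
        public Member elect(List<Member> members) {
            Member oldest = null;
            for (Member m : members) {
                if (oldest == null || m.joinedAt < oldest.joinedAt) {
                    oldest = m;
                }
            }
            return oldest; // null for an empty view: no coordinator
        }
    }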

>
>> Replication is NYI - but I'm running a few mental background threads 
>> that suggest that an extension to the index will mean that it 
>> associates the session's id not just to its current location, but 
>> also to the location of a number of replicants. I also have ideas on 
>> how a session might choose nodes into which it will place its 
>> replicants and how I can avoid the primary session copy ever being 
>> colocated with a replicant (potential SPoF - if you only have one 
>> replicant), etc...
>
>
> Right definitely something you want to avoid.
>
>> Yes, I can see that happening - I have an improvement (NYI) to WADI's 
>> evacuation strategy (how sessions are evacuated when a node wishes to 
>> leave). Each session will be evacuated to the node which owns the 
>> bucket into which its id hashes. This is because colocation of the 
>> session with the bucket allows many messages concerned with its 
>> future destruction and relocation to be optimised away. Future 
>> requests falling elsewhere but needing this session should, in the 
>> most efficient case, be relocated to this same node, otherwise the 
>> session may be relocated, but at a cost...
>
>
> How do you relocate the request? Many HW load-balancers do not support 
> this (or else it requires using proprietary APIs), so you probably 
> have to count on
> moving sessions in the normal failover case.

If I can squeeze the behaviour that I require out of the load-balancer 
then, depending on the request type, I may be able to get away with a 
redirection with a changed session cookie or URL param or, failing 
that, an http-proxy across from a filter above the servlet on one side 
to the http port on the node that owns the session...

The LB-integration object is pluggable and the aim is to supply WADI 
with a good selection of LB integrations - currently I only have a 
ModJK[2] plugin working. This is able to 'restick' clients to their 
session's new location (although messing with the session id is a 
little dodgy...).

>
>> I would be very grateful in any thoughts or feedback that you could 
>> give me. I hope to get much more information about WADI into the wiki 
>> over the next few weeks. That should help generate more discussion, 
>> although I would be more than happy for people to ask me questions 
>> here on Geronimo-dev because this will give me an idea of what 
>> documentation I should write and how existing documentation may be 
>> lacking or misleading.
>
>
> I guess my general comment would be that you might find it better to 
> think specifically about the end-user problem you are trying to solve 
> (say session replication) and work towards a solution based on that. 
> Most short-cuts / optimizations that vendors make are specific to the 
> problem domain and do not generally apply to all clustering problems.

The end problem is really clustered web and ejb sessions at the moment, 
although it looks as if by the time we have solved these issues we may 
well have written a fault-tolerant distributed/partitioned index that 
might be very useful as a generic distributed cache building block.

One thing that I do want WADI to do is to keep working when replication 
is switched off, i.e. if a session only exists as a primary copy, then 
even if affinity breaks down, WADI will continue to correctly serve 
requests for that session unless some form of catastrophic failure 
causes the session to evaporate. This means that I need to ensure the 
session's timely evacuation from a node that chooses to leave the 
cluster to a remaining node, so that it may remain active beyond the 
lifetime of its original node. All of this must work flawlessly under 
stress, so that an admin may add nodes to or remove nodes from a 
running cluster without having to worry about the user state that it is 
managing. Nodes are added by simply starting them, and removed via e.g. 
ctl-c-ing them.
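
The controlled-leave path might look something like this - a sketch 
only, with placeholders where WADI's real transport and placement logic 
would go:

    import java.util.Map;

    // Sketch: ctl-c triggers a JVM shutdown hook that evacuates every
    // locally-held session to a remaining node before the process exits
    // (a hook runs on normal exit and ctl-c, but not on kill -9).
    public class EvacuatingNode {
        interface Transport { // placeholder for the real messaging layer
            void send(String sessionId, byte[] state, String targetNode);
        }

        private final Map<String, byte[]> localSessions; // id -> serialized state
        private final Transport transport;

        public EvacuatingNode(Map<String, byte[]> localSessions, Transport transport) {
            this.localSessions = localSessions;
            this.transport = transport;
            Runtime.getRuntime().addShutdownHook(new Thread() {
                public void run() { evacuateAll(); }
            });
        }

        private void evacuateAll() {
            for (Map.Entry<String, byte[]> e : localSessions.entrySet()) {
                transport.send(e.getKey(), e.getValue(), targetFor(e.getKey()));
            }
        }

        // ideally the node owning the bucket this id hashes into, so that
        // later lookups find the session already colocated with its bucket
        private String targetFor(String sessionId) {
            return "some-remaining-node"; // placeholder
        }
    }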

If it is decided that a few more nines of session availability are 
needed, and the cluster owner understands the extra cost of in-vm 
replication in terms of the extra hardware and bandwidth that they will 
have to purchase and is happy to go with it, then it should be 
sufficient to up the number of replicated copies kept by the cluster 
from '0' to e.g. '2' and restart (it might even be possible to vary 
this setting on a node-to-node basis, so that this change does not even 
involve a complete cluster cold start). WADI should deal with the rest.
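
On placement itself, the one hard rule is the one mentioned earlier - a 
replicant must never be colocated with the primary copy. A sketch of 
that choice (illustrative only):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: once the cluster-wide replica count is raised from '0' to
    // e.g. '2', each session needs replicaCount nodes chosen such that
    // no replicant is ever colocated with the primary copy.
    public class ReplicaPlacement {
        public static List<String> choose(String primaryNode,
                                          List<String> allNodes,
                                          int replicaCount) {
            List<String> replicas = new ArrayList<String>();
            for (String node : allNodes) {
                if (replicas.size() == replicaCount) {
                    break;
                }
                if (!node.equals(primaryNode)) { // never with the primary
                    replicas.add(node);
                }
            }
            return replicas; // may come up short in a very small cluster
        }
    }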

So, I believe that I have a pretty clear idea of what WADI will do, and 
aside from the replication stuff (phase2) it currently does most of 
what I had in mind for phase1, except that it is not yet happy under 
stress. I figure it will probably take one or two more 
redesign/reimplementation iterations to get it to this stage, then I 
can consider replication.

I have spoken to members of the OpenEJB team about WADI's ability to 
relocate requests as well as sessions and we came to the conclusion 
that it was just as applicable in the EJB world as the web world. If 
the node an EJB client is talking to leaves the cluster in between 
calls, the client may try to contact it and then fail over to another 
node that it hopes holds the session. If, due to other nodes 
leaving/joining, it is not always clear which node will contain the 
session, the ability to reply to an RMI call and just say "not here - 
there!" - i.e. an RMI redirection - would not be hard to add and would 
resolve this situation. Transactions are another item which I have 
marked phase2.
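
The redirection itself need only carry the new location, with the 
client-side proxy catching it and retrying. Something like the 
following sketch (illustrative names; nothing like this exists in WADI 
or OpenEJB yet):

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Sketch of "not here - there!": the server answers a call for a
    // session it no longer holds with a redirect naming the new owner,
    // and the client-side proxy retries there, transparently.
    class RedirectException extends RemoteException {
        final String newLocation; // node that now holds the session

        RedirectException(String newLocation) {
            super("session has moved to " + newLocation);
            this.newLocation = newLocation;
        }
    }

    interface SessionService extends Remote {
        Object invoke(String sessionId, String method, Object[] args)
                throws RemoteException;
    }

    class RedirectingProxy {
        Object call(SessionService stub, String sessionId, String method,
                    Object[] args) throws RemoteException {
            try {
                return stub.invoke(sessionId, method, args);
            } catch (RedirectException moved) {
                // follow the redirect once and retry the call there
                return lookup(moved.newLocation).invoke(sessionId, method, args);
            }
        }

        // placeholder: obtain a stub for the named node
        SessionService lookup(String node) {
            throw new UnsupportedOperationException("sketch only");
        }
    }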

So, I am trying hard to stay very focussed on the problem domain, 
otherwise this will never get finished :-)

Right, off to read those papers now - thanks for your posting and your 
interest,

Jules

>
> Hope this helps
>
> andy 



-- 
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 *
 *    www.coredevelopers.net
 *
 * Open Source Training & Support.
 **********************************/