Posted to users@tomcat.apache.org by John Bass <jo...@anicca.net> on 2011/09/13 11:51:42 UTC

Clustering / High Availability edge cases?

Hi all,

I'm relatively new to clustering with Tomcat and I'm trying to understand
the edge cases.  If I'd like to guarantee continuous availability, what are
the caveats?

As I understand it, Tomcat clustering will ensure that session information
is persisted in the event of a failure.  That's fine; however, what about
long running I/O operations?  What if my node dies in the middle of serving
an HTTP response?  In the event of a node failure, I'm assuming that there's
no way to recover from that and the failure will be visible to a client
application.

Similarly, if a node fails during a long running calculation, I'm assuming
that there's no way to persist that execution state.

Are those assumptions correct?  If anyone has any other comments on further
scenarios where clustering and session persistence will not be useful in an
HA context, I'd love to hear them.

thanks,

John

Re: Clustering / High Availability edge cases?

Posted by Christopher Schultz <ch...@christopherschultz.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark,

On 9/15/2011 3:17 AM, Mark Thomas wrote:
> On 14/09/2011 23:03, Christopher Schultz wrote:
>> John,
>> 
>> On 9/13/2011 5:51 AM, John Bass wrote:
>>> In the event of a node failure, I'm assuming that there's no
>>> way to recover from that and the failure will be visible to a
>>> client application.
>> 
>> Correct: no other node in the cluster can serve the response
>> being generated by a dying Tomcat instance. As Pid points out,
>> this isn't unique to Tomcat.
> 
> Wrong. See my longer response.

To pick a nit, the dying Tomcat dies. No other server can send its
response. Instead, your load-balancer can retry the request on another
server. That's not the same thing -- especially when OP is talking
about long-running requests.

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk5ziGMACgkQ9CaO5/Lv0PBvCgCcDKrLZwF2mZI7VnAA4mLDLYEC
S0AAoIf96gjZdnesKzor34CtG1QZhwRU
=eaLo
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org

Re: Clustering / High Availability edge cases?

Posted by Mark Thomas <ma...@apache.org>.

On 14/09/2011 23:03, Christopher Schultz wrote:
> John,
> 
> On 9/13/2011 5:51 AM, John Bass wrote:
>> In the event of a node failure, I'm assuming that there's no way
>> to recover from that and the failure will be visible to a client 
>> application.
> 
> Correct: no other node in the cluster can serve the response being 
> generated by a dying Tomcat instance. As Pid points out, this
> isn't unique to Tomcat.

Wrong. See my longer response.

Mark

Re: Clustering / High Availability edge cases?

Posted by Christopher Schultz <ch...@christopherschultz.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John,

On 9/13/2011 5:51 AM, John Bass wrote:
> In the event of a node failure, I'm assuming that there's no way to
> recover from that and the failure will be visible to a client 
> application.

Correct: no other node in the cluster can serve the response being
generated by a dying Tomcat instance. As Pid points out, this isn't
unique to Tomcat.

> Similarly, if a node fails during a long running calculation, I'm
> assuming that there's no way to persist that execution state.

There's nothing that Tomcat does that would persist any state, unless
your "long-running calculation" periodically saves its state into the
user's session, and you are using distributed or persisted sessions.

If you have long-running tasks, I would encourage you to architect the
code such that the state /can/ be saved somewhere trivial such as in
an HttpSession or even a relational database, and that a replacement
data processing thread can take over and resume operation on a
partially-completed job. If the initial node goes down, a second
request that goes to another node can resume that operation
in-progress without starting over again.
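
To illustrate the resume-from-checkpoint idea described above, here is a
minimal sketch. It makes several assumptions: the class name ResumableSum,
the job id, and the in-memory Map are all hypothetical, and the Map merely
stands in for a replicated HttpSession attribute or a database row. The
"crash" is simulated by stopping after a fixed number of steps.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ResumableSum {
    // Stand-in for shared storage (replicated session attribute or database row).
    static final Map<String, long[]> STORE = new ConcurrentHashMap<>();

    /**
     * Sums 1..n, checkpointing {nextIndex, partialSum} after every step.
     * Stops after maxSteps steps to simulate a node dying mid-job.
     * Returns the total if the job finished, or -1 if it was interrupted.
     */
    static long run(String jobId, long n, long maxSteps) {
        long[] state = STORE.getOrDefault(jobId, new long[] {1, 0});
        long next = state[0];
        long sum = state[1];
        long steps = 0;
        while (next <= n) {
            if (steps == maxSteps) {
                return -1; // simulated crash; the last checkpoint survives in STORE
            }
            sum += next;
            next++;
            steps++;
            STORE.put(jobId, new long[] {next, sum}); // checkpoint after each step
        }
        return sum;
    }

    public static void main(String[] args) {
        long first = run("job-1", 100, 40);              // "node A" dies after 40 steps
        long second = run("job-1", 100, Long.MAX_VALUE); // "node B" resumes from the checkpoint
        System.out.println(first + " " + second);        // prints "-1 5050"
    }
}
```

The second call does not start over: it picks up at the index stored by the
first call. A real job would checkpoint far less often than every step, since
each checkpoint costs a session replication or a database write.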

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk5xJJgACgkQ9CaO5/Lv0PBbrwCgofwAJWqCWEImqEvpDZE16QqX
oLAAnjFJDJWeJBElIUaImqZbRrTS4wY/
=+cpD
-----END PGP SIGNATURE-----

Re: Clustering / High Availability edge cases?

Posted by Mark Thomas <ma...@apache.org>.

On 13/09/2011 10:51, John Bass wrote:
> Hi all,
> 
> I'm relatively new to clustering with Tomcat and I'm trying to understand
> the edge cases.  If I'd like to guarantee continuous availability, what are
> the caveats?
> 
> As I understand it, Tomcat clustering will ensure that session information
> is persisted in the event of a failure.  That's fine, however, what about
> long running I/O operations?  What if my node dies in the middle of serving
> an HTTP response?  In the event of a node failure, I'm assuming that there's
> no way to recover from that and the failure will be visible to a client
> application.

Wrong. Recovery options depend on the exact failure mode and the
load-balancer configuration.

The typical sequence of events is:
- load-balancer sends request to Tomcat
- request fails
- load-balancer detects failure (either by return code or lack of response)
- load-balancer replays request to a different Tomcat node
- Tomcat generates response
- load-balancer returns response to the client
- client is unaware of failure, although the request may appear slow,
  particularly if the failure was detected via a timeout

The load-balancer configuration will control the exact circumstances
under which a request will be replayed.
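
The sequence above can be sketched as a toy dispatcher. This is only an
illustration under stated assumptions: RetryingBalancer, NodeFailure and
dispatch are hypothetical names, nodes are modelled as plain functions, and
no real load-balancer works exactly this way. It also assumes the request is
safe to replay, which is something the load-balancer configuration would
have to decide.

```java
import java.util.List;
import java.util.function.Function;

public class RetryingBalancer {
    /** Thrown by a simulated node that fails while handling a request. */
    static class NodeFailure extends RuntimeException {
        NodeFailure(String msg) { super(msg); }
    }

    /**
     * Tries each node in order and returns the first successful response.
     * Assumes the request can be replayed without side effects.
     */
    static String dispatch(List<Function<String, String>> nodes, String request) {
        NodeFailure last = null;
        for (Function<String, String> node : nodes) {
            try {
                return node.apply(request);
            } catch (NodeFailure e) {
                last = e; // failure detected: replay the request on the next node
            }
        }
        throw new NodeFailure("all nodes failed" + (last == null ? "" : ": " + last.getMessage()));
    }

    public static void main(String[] args) {
        Function<String, String> dead = req -> { throw new NodeFailure("node A down"); };
        Function<String, String> healthy = req -> "200 OK for " + req;
        // The client only ever sees the response from the healthy node.
        System.out.println(dispatch(List.of(dead, healthy), "GET /report"));
    }
}
```

If every node fails, the failure does become visible to the client, which
is the case the sketch reports by throwing NodeFailure.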

> Similarly, if a node fails during a long running calculation, I'm assuming
> that there's no way to persist that execution state.

Out of the box, no. You'd need to code that within the app.

> Are those assumptions correct?  If anyone has any other comments on further
> scenarios where clustering and session persistence will not be useful in an
> HA context, I'd love to hear them.

Another failure mode to consider is node failure after the request has
been processed but before the updated session data has been replicated
to other nodes in the cluster.

If you use synchronous replication (the replication happens before the
response is completed) then this can't happen but your responses are
delayed until the replication completes.

If you use asynchronous replication then there is the possibility of
node failure before the data is replicated. Also, you must use sticky
sessions in this case since you don't want the next request being
directed to a different node before the updated session data has been
replicated.
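
The difference between the two modes can be shown with a small model. This
is a sketch under assumptions, not Tomcat's replication code: the class
ReplicationWindowDemo and its methods are hypothetical, and two HashMaps
stand in for the session stores on the primary node and on one other node.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class ReplicationWindowDemo {
    final Map<String, String> primary = new HashMap<>();
    final Map<String, String> replica = new HashMap<>();
    // Updates accepted by the primary but not yet sent to the replica.
    final Queue<String[]> pending = new ArrayDeque<>();

    /** Synchronous mode: the replica is updated before the "response" returns. */
    void putSync(String key, String value) {
        primary.put(key, value);
        replica.put(key, value);
    }

    /** Asynchronous mode: the "response" returns first; replication happens later. */
    void putAsync(String key, String value) {
        primary.put(key, value);
        pending.add(new String[] {key, value});
    }

    /** Delivers queued updates, as a background sender eventually would. */
    void flush() {
        String[] update;
        while ((update = pending.poll()) != null) {
            replica.put(update[0], update[1]);
        }
    }

    /** Simulates the primary dying: queued updates vanish, only the replica remains. */
    Map<String, String> failPrimary() {
        pending.clear();
        primary.clear();
        return replica;
    }

    public static void main(String[] args) {
        ReplicationWindowDemo sync = new ReplicationWindowDemo();
        sync.putSync("cart", "3 items");
        System.out.println(sync.failPrimary().get("cart"));  // prints "3 items"

        ReplicationWindowDemo async = new ReplicationWindowDemo();
        async.putAsync("cart", "3 items");
        System.out.println(async.failPrimary().get("cart")); // prints "null": the update was lost
    }
}
```

Until flush() runs, a request routed to the other node would also see stale
data, which is why sticky sessions matter in the asynchronous case.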

Finally, if you are using the BackupManager, multiple node failures in quick
succession will cause the loss of session data. With this manager, each
node distributes the backup copies of the session data (each primary
session has a single backup) around the other nodes in the cluster. So,
for example, in a four node cluster if node A has 30 primary sessions 10
of those will be backed up on node B, 10 on node C and 10 on node D.

If node A fails, the remaining nodes will detect this, make themselves
the primary node for the sessions they are backing up and start the
process of creating new backups on one of the remaining nodes. If a
second node fails before this is complete there is the possibility of
session loss.
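
The four-node example can be worked through with a toy model. The sketch
below is an assumption-laden illustration rather than the manager's actual
algorithm: BackupLossDemo, Session, create and fail are hypothetical names,
backups are spread round-robin, and the second failure is assumed to happen
before any new backup has been created.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BackupLossDemo {
    /** A session lives on a primary node and, if backed up, on one backup node. */
    static class Session {
        String primary;
        String backup; // null means no backup copy exists yet
        Session(String primary, String backup) {
            this.primary = primary;
            this.backup = backup;
        }
    }

    /** Creates sessions on `primary`, spreading backups round-robin over `others`. */
    static List<Session> create(String primary, List<String> others, int count) {
        List<Session> sessions = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            sessions.add(new Session(primary, others.get(i % others.size())));
        }
        return sessions;
    }

    /**
     * Fails `node`. Sessions whose primary was on it are promoted on their backup
     * node and left without a backup; sessions with no surviving copy are dropped.
     * Returns the number of sessions lost.
     */
    static int fail(List<Session> sessions, String node) {
        int lost = 0;
        Iterator<Session> it = sessions.iterator();
        while (it.hasNext()) {
            Session s = it.next();
            if (node.equals(s.primary)) {
                if (s.backup == null) {
                    it.remove(); // no copy anywhere: the session is gone
                    lost++;
                } else {
                    s.primary = s.backup; // backup node becomes the primary
                    s.backup = null;
                }
            } else if (node.equals(s.backup)) {
                s.backup = null; // backup copy gone; the primary still holds the session
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        List<Session> sessions = create("A", List.of("B", "C", "D"), 30);
        int lostAfterA = fail(sessions, "A"); // every session is promoted on B, C or D
        int lostAfterB = fail(sessions, "B"); // B fails before new backups exist
        System.out.println(lostAfterA + " " + lostAfterB + " " + sessions.size()); // prints "0 10 20"
    }
}
```

In this worst case the 10 sessions that had been promoted on node B are
lost, while the 20 on nodes C and D survive. If the new backups had been
created before node B failed, nothing would have been lost.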

Mark

Re: Clustering / High Availability edge cases?

Posted by Pid <pi...@pidster.com>.

On 13/09/2011 10:51, John Bass wrote:
> Hi all,
> 
> I'm relatively new to clustering with Tomcat and I'm trying to understand
> the edge cases.  If I'd like to guarantee continuous availability, what are
> the caveats?
> 
> As I understand it, Tomcat clustering will ensure that session information
> is persisted in the event of a failure.  That's fine, however, what about
> long running I/O operations?

Operations relating to the session, or something else?

> What if my node dies in the middle of serving an HTTP response?  

Well, the point to point connection will be severed, so an error will
likely occur in the client.  This is not an issue unique to Tomcat.

> In the event of a node failure, I'm assuming that there's
> no way to recover from that and the failure will be visible to a client
> application.

Depends what the client & node are doing at the time & whether there's
anything in between them (e.g. a proxy).

> Similarly, if a node fails during a long running calculation, I'm assuming
> that there's no way to persist that execution state.

Again, not a Tomcat specific problem.


p

> Are those assumptions correct?  If anyone has any other comments on further
> scenarios where clustering and session persistence will not be useful in an
> HA context, I'd love to hear them.
> 
> thanks,
> 
> John
>