Posted to users@tomcat.apache.org by DHM <dh...@gmail.com> on 2008/11/10 21:21:35 UTC

Re: Apache / Tomcat Cluster Worker Fails and Entire Cluster becomes Unavailable

Hi -

Based on your suggestion, which I do appreciate, we added the connect,
prepost and connection_pool timeouts, so the workers file now looks like:

worker.list=loadbalancer
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=cbap1,cbap2
worker.loadbalancer.sticky_session=1
worker.cbap1.port=8690
worker.cbap1.host= edited for privacy
worker.cbap1.type=ajp13
worker.cbap1.lbfactor=10
worker.cbap1.socket_keepalive=1
# worker.cbap1.cachesize=5
worker.cbap1.connect_timeout=10000
worker.cbap1.prepost_timeout=5000
worker.cbap1.connection_pool_timeout=7000

This has helped a lot with another issue that we hadn't paid attention to
before: connections.  Our connection count was climbing far too high and
not recovering after a Sun Cluster patch we applied.  Now we see a nice
steady number of connections, which seems much better.
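
A hedged aside on the settings above, as a minimal sketch of how the same
timeouts might be applied to both members. The mod_jk documentation gives
connect_timeout and prepost_timeout in milliseconds but connection_pool_timeout
in seconds, so 7000 would mean almost two hours. The value 600, the
reply_timeout lines and the placeholder host names below are assumptions and
do not come from the original configuration.

```properties
# Sketch only: hosts are placeholders; 600 and reply_timeout are assumed values.
worker.list=loadbalancer
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=cbap1,cbap2
worker.loadbalancer.sticky_session=1

worker.cbap1.type=ajp13
worker.cbap1.host=app1.example.invalid
worker.cbap1.port=8690
worker.cbap1.lbfactor=10
worker.cbap1.socket_keepalive=1
# Milliseconds to wait for the CPong reply right after connecting.
worker.cbap1.connect_timeout=10000
# Milliseconds to wait for the CPong reply before forwarding each request.
worker.cbap1.prepost_timeout=5000
# Seconds an idle pooled connection is kept before mod_jk closes it.
worker.cbap1.connection_pool_timeout=600
# Milliseconds allowed between two response packets from Tomcat (assumed value).
worker.cbap1.reply_timeout=120000

# Second member, configured the same way.
worker.cbap2.type=ajp13
worker.cbap2.host=app2.example.invalid
worker.cbap2.port=8690
worker.cbap2.lbfactor=10
worker.cbap2.socket_keepalive=1
worker.cbap2.connect_timeout=10000
worker.cbap2.prepost_timeout=5000
worker.cbap2.connection_pool_timeout=600
worker.cbap2.reply_timeout=120000
```

If connection_pool_timeout is used, the mod_jk documentation also suggests
setting connectionTimeout on the Tomcat AJP connector to the matching value in
milliseconds (600000 for the 600 seconds assumed here).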

BUT... we had another issue today where 1 of my cluster members went into
an error state and the whole cluster seemed hung.  Summary of the log
findings: after running 10 days with decent activity, 1 of my workers had a
problem.  Events happened as follows:

1 - App 1 becomes unstable with an OutOfMemoryError: PermGen space error:
org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.OutOfMemoryError: PermGen space
We are reviewing the max sizes for that to avoid this.

2 - App 1 limps along, but because the new connection settings have taken
effect I can see App 1 is declared in an error state:
[Mon Nov 10 15:47:43 2008] [4673:0001] [info]  service::jk_lb_worker.c (906):
service failed, worker cbap1 is in error state

3 - In my cluster of 1 web and 2 app servers, application speed is severely
impacted, to the point of being unusable.

4 - Cannot shut down App 1 gracefully.  It has to be killed from the command
prompt.

5 - Cluster performance returns now that App 1 is dead.

6 - Restart App 1

7 - App 2 sees that the member has joined but cannot establish a cluster with
it.  So now, with OSCache, we are trying to re-establish the cache of objects
held in memory.

Nov 10, 2008 4:05:52 PM
com.opensymphony.oscache.plugins.clustersupport.JavaGroupsBroadcastingListener
memberJoined
INFO: A new member at address 'EDITED' has joined the cluster
bufferedreader ready: false

Nov 10, 2008 4:05:58 PM org.jgroups.protocols.FD_SOCK run
SEVERE: socket address for EDITED could not be fetched, retrying
fetchEstimateSpec = 413
Nov 10, 2008 4:06:06 PM org.jgroups.protocols.FD_SOCK run
SEVERE: socket address for EDITED could not be fetched, retrying
Nov 10, 2008 4:06:15 PM org.jgroups.protocols.FD_SOCK run

8 - Now the cluster cannot manage memory properly, causing the cluster to
synch on its objects, which is the alternative to the clustering of memory
objects.  This is not desirable.

9 - Decide to stop Tomcat on App 2.  Cannot.  It also has to be killed.

10 - Bring App 2 up and the cluster then rejoins normally.  Running OK again.

So from the event which occurred on App 1, I have to go through and eventually
kill both app servers, which is not ideal.

Question -

What else should I look at in how the member is handled when it goes into an
error state, given that the cluster is still being taken down with it?
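
A possible starting point for that investigation, sketched under assumptions
rather than taken from this configuration: mod_jk's status worker can show the
state of each balancer member, and recover_time on the load balancer controls
how long a member in error state is left alone before it is retried. The
worker name jkstatus, the /jkstatus path and the 60-second value are
illustrative only.

```properties
# Hypothetical additions to workers.properties; names and values are assumptions.
# Expose a status worker alongside the existing load balancer.
worker.list=loadbalancer,jkstatus
worker.jkstatus.type=status

# Seconds before the balancer retries a member that went into error state
# (60 is the documented default, written out here to make it visible).
worker.loadbalancer.recover_time=60

# The status worker also has to be mapped in the Apache httpd configuration,
# for example with: JkMount /jkstatus jkstatus
```

The status page would then show whether cbap1 is actually marked as being in
error while requests are still slow, which helps separate a mod_jk routing
problem from the OSCache/JGroups cluster problem on the Tomcat side.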

Thanks!

On Wed, Oct 15, 2008 at 12:48 AM, Mladen Turk <mt...@apache.org> wrote:

> DHM wrote:
>
>> Hi -
>>
>> worker.list=loadbalancer
>> worker.loadbalancer.type=lb
>> worker.loadbalancer.balance_workers=cbap1,cbap2
>> worker.loadbalancer.sticky_session=1
>> worker.cbap1.port=8690
>> worker.cbap1.host= edited for privacy
>> worker.cbap1.type=ajp13
>> worker.cbap1.lbfactor=10
>> worker.cbap1.socket_keepalive=1
>> # worker.cbap1.cachesize=5
>>
>>
>> worker.cbap2.port=8690
>> worker.cbap2.host=edited for privacy
>> worker.cbap2.type=ajp13
>> worker.cbap2.lbfactor=10
>> worker.cbap2.socket_keepalive=1
>> # worker.cbap2.cachesize=5
>>
>>
>>
>> I am looking for suggestion as to other configuration properties which
>> should be added to help with error handling of the worker in error
>> state.  Or other points to review welcome.
>> THANK YOU for any help here.
>>
>
> You should add connect_timeout and prepost_timeout to
> each of the ajp workers. Those are exactly meant to
> be used with hung Tomcats.
>
> Regards
> --
> ^(TM)
>
> ---------------------------------------------------------------------
> To start a new topic, e-mail: users@tomcat.apache.org
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
>
>