Posted to users@tomcat.apache.org by nageshsrao <na...@yahoo.com> on 2007/08/14 12:32:29 UTC

Frequent "SEVERE: Unable to receive message through TCP channel" messages

Hi,

In our production environment we have two Tomcat instances [5.0.27] running on two
Linux boxes [RHAS 3.0 update 8], accessed through Apache using mod_jk2.0.

We very frequently see the following messages in catalina.out, and there
have been about 2 instances where Tomcat stopped responding and we had to restart.
The only errors we see are the following: INFO messages that keep
telling us a member has disappeared and been added, and once in a while
SEVERE messages.

Could you let us know what could be causing this problem? Is any
additional configuration needed? This environment has been running for
almost 18 months in production, and of late [in the last 6 months] we have
seen this happen twice. I have attached the error log found in
catalina.out and the server.xml from both Tomcat instances.

http://www.nabble.com/file/p12142134/catalina-error.out catalina-error.out
http://www.nabble.com/file/p12142134/server-app1.xml server-app1.xml
http://www.nabble.com/file/p12142134/server-app2.xml server-app2.xml
--
View this message in context: http://www.nabble.com/Frequent-%22SEVERE%3A-Unable-to-receive-message-through-TCP-channel%22-messages-tf4266454.html#a12142134
Sent from the Tomcat - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org

Re: Frequent "SEVERE: Unable to receive message through TCP channel" messages

Posted by Rainer Jung <ra...@kippdata.de>.

nageshsrao wrote:
> We are getting the GC output printed to the same catalina.out, and we see that
> the memberAdded messages appear at almost the same time as the GC output. Does
> that prove that longer GC pauses are causing this? Is there any other data
> point or proof we can get?

E.g. add the JVM option -XX:+PrintGCApplicationStoppedTime, which logs how
long the application threads were stopped.
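
One way to use that output is to scan catalina.out for pauses longer than the membership timeout. The following is a minimal sketch, not part of Tomcat. It assumes the HotSpot line format "Total time for which application threads were stopped: N seconds", which may differ between JVM versions. The class name GcPauseScan and the command-line arguments are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcPauseScan {
    // Assumed line format of -XX:+PrintGCApplicationStoppedTime output.
    // Some JVMs print a decimal comma depending on locale, so both are accepted.
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: "
            + "([0-9]+(?:[.,][0-9]+)?) seconds");

    // Returns every stop time (in seconds) that is longer than the threshold.
    static List<Double> pausesLongerThan(List<String> lines, double thresholdSeconds) {
        List<Double> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double seconds = Double.parseDouble(m.group(1).replace(',', '.'));
                if (seconds > thresholdSeconds) {
                    hits.add(seconds);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java GcPauseScan <logfile> <thresholdSeconds>");
            return;
        }
        // ISO-8859-1 maps every byte to a character, so odd bytes in the log cannot fail the read.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        double threshold = Double.parseDouble(args[1]);
        for (double pause : pausesLongerThan(lines, threshold)) {
            System.out.println("stopped for " + pause + " s, longer than " + threshold + " s");
        }
    }
}
```

Run with the log file and the heartbeat timeout in seconds (3 in the configuration discussed here). Any line it prints is a pause long enough to make the other node drop this one.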

> Regarding "network problems", we are asking the network team to capture the
> multicast traffic between these nodes. Is there anything else you suggest we
> do?

If you are doing the multicast only inside a subnet, the usual basic 
network monitoring should be sufficient. But during phases where 
you have problems that might be network related, it is often good to keep in 
touch with the network people and ask whether they know about 
any general network problems.

If you do multicasting crossing the borders of subnets, the network 
needs to use multicast group membership protocols, which involves 
complicated configuration of routers. Most users though don't need to 
cross subnets.

> Regarding "increase the membership timeout", we plan to increase this to 5
> minutes. Do you have any other suggestions? Tomcat startup takes almost 70
> seconds (it hosts almost 32 apps), and all of them are clustered.

I would expect that your GC, even with a big heap, won't take longer than 
20 seconds. Most likely it's much less. On the other hand, if you go to 5 
minutes, you would always need to wait 5 minutes between shutting down 
one node and starting it up again. It seems unrealistic to me that IT 
staff will stick to that. I would suggest 30 seconds and a clear message in 
the startup script to remind people using it that they have to wait 
30 seconds after stopping and before starting again.
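
The wait between stop and start could also be enforced rather than only announced. Below is a rough sketch of a helper that a startup script might call before launching Tomcat. It is not an existing Tomcat tool: the class name RestartGuard and the marker file tomcat.stopped are hypothetical, and it assumes the stop script touches that file when it shuts the node down.

```java
import java.io.File;

class RestartGuard {
    // How long the caller still has to wait before the node may be started again.
    // The result is clamped to [0, timeoutMillis] so a clock jump cannot cause a longer wait.
    static long remainingWaitMillis(long stoppedAtMillis, long nowMillis, long timeoutMillis) {
        long elapsed = nowMillis - stoppedAtMillis;
        return Math.max(0L, Math.min(timeoutMillis, timeoutMillis - elapsed));
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical marker file, assumed to be touched by the stop script.
        File marker = new File(args.length > 0 ? args[0] : "tomcat.stopped");
        long timeoutMillis = 30_000L; // the 30 second membership timeout suggested above

        // lastModified() returns 0 when the file does not exist, which yields no wait.
        long wait = remainingWaitMillis(
                marker.lastModified(), System.currentTimeMillis(), timeoutMillis);
        if (wait > 0) {
            System.out.println("Node was stopped recently, waiting "
                    + (wait / 1000 + 1) + " s before start");
            Thread.sleep(wait);
        }
        System.out.println("OK to start Tomcat");
    }
}
```

The timeout constant has to match whatever membership timeout is configured in server.xml. If the two drift apart, the guard waits either too long or not long enough.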

Regards,

Rainer


Re: Frequent "SEVERE: Unable to receive message through TCP channel" messages

Posted by nageshsrao <na...@yahoo.com>.

We are getting the GC output printed to the same catalina.out, and we see that
the memberAdded messages appear at almost the same time as the GC output. Does
that prove that longer GC pauses are causing this? Is there any other data
point or proof we can get?

Regarding "network problems", we are asking the network team to capture the
multicast traffic between these nodes. Is there anything else you suggest we
do?

Regarding "increase the membership timeout", we plan to increase this to 5
minutes. Do you have any other suggestions? Tomcat startup takes almost 70
seconds (it hosts almost 32 apps), and all of them are clustered.

regards,


Rainer Jung-3 wrote:
> 
> You configured a 3 second timeout for your heartbeat. If a node doesn't 
> receive a heartbeat packet for 3 seconds, it assumes the other node is 
> dead and closes the incoming replication connection. If the other node 
> is not really dead, it will try to use this replication connection, which 
> will no longer work.
> 
> Why could this happen? One possible reason is GC pauses. If your 
> GC pauses are longer than your membership heartbeat timeout, you run 
> into such problems.
> 
> During normal operations you should not observe any memberDisappeared 
> messages. They should only show up when you stop a node, when it crashes, 
> or when you've got serious network problems that affect the multicast 
> heartbeat packets.
> 
> If you decide to increase the membership timeout (which sounds like a 
> good idea), keep in mind that you need to wait the given time between 
> stopping and restarting a node.
> 
> Regards,
> 
> Rainer
> 

Re: Frequent "SEVERE: Unable to receive message through TCP channel" messages

Posted by Rainer Jung <ra...@kippdata.de>.

You configured a 3 second timeout for your heartbeat. If a node doesn't 
receive a heartbeat packet for 3 seconds, it assumes the other node is 
dead and closes the incoming replication connection. If the other node 
is not really dead, it will try to use this replication connection, which 
will no longer work.

Why could this happen? One possible reason is GC pauses. If your 
GC pauses are longer than your membership heartbeat timeout, you run 
into such problems.
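
To make the timing argument concrete, here is a small simulation. It is only an illustration and not Tomcat's membership code: the class HeartbeatTimeoutSketch, its method names, and the sample arrival times are all made up. It counts how often a gap between two received heartbeats is longer than the timeout, which is the situation a long GC pause on the sending node would create.

```java
class HeartbeatTimeoutSketch {
    // A member counts as disappeared once no heartbeat has arrived for longer than the timeout.
    static boolean isExpired(long lastHeartbeatMillis, long nowMillis, long timeoutMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }

    // Counts the gaps between consecutive heartbeat arrivals that exceed the timeout.
    static int countDropouts(long[] arrivalsMillis, long timeoutMillis) {
        int dropouts = 0;
        for (int i = 1; i < arrivalsMillis.length; i++) {
            if (isExpired(arrivalsMillis[i - 1], arrivalsMillis[i], timeoutMillis)) {
                dropouts++;
            }
        }
        return dropouts;
    }

    public static void main(String[] args) {
        // Made-up arrival times: heartbeats every 500 ms, with one 4.2 s gap
        // standing in for a long GC pause on the sending node.
        long[] arrivals = {0L, 500L, 1000L, 5200L, 5700L};

        System.out.println("dropouts with 3 s timeout:  " + countDropouts(arrivals, 3_000L));
        System.out.println("dropouts with 30 s timeout: " + countDropouts(arrivals, 30_000L));
    }
}
```

With the sample data, the 4.2 second gap counts as one dropout under a 3 second timeout and as none under a 30 second timeout.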

During normal operations you should not observe any memberDisappeared 
messages. They should only show up when you stop a node, when it crashes, 
or when you've got serious network problems that affect the multicast 
heartbeat packets.

If you decide to increase the membership timeout (which sounds like a 
good idea), keep in mind that you need to wait the given time between 
stopping and restarting a node.

Regards,

Rainer
