You are viewing a plain text version of this content. The canonical link for it is here.
Posted to server-user@james.apache.org by Cesar Bonadio <bo...@picture.com.br> on 2004/04/13 16:59:36 UTC

James Running Out Of Connections

Hello 

I still have the problema of james running out of connection, today
I start a new debug session and found in the smtp log hundreds of
messages
like the following:

13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40


the interesting is that some messages only occours a few times but this
work #40 appears several hundred times.

Any idea if this can be the source of the problem

Thanks




RE: James Running Out Of Connections

Posted by Cesar Bonadio <bo...@picture.com.br>.
> That sounds like a problem we may have already fixed.  I'd like to know what you see with the current test build.  If it isn't fixed in the test build, we should be able to modify the instrumentation to help track this problem.
> 
> 	--- Noel
> 

Ok
I will try to put the test build to run in one server

[]s



> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
> For additional commands, e-mail: server-user-help@james.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by Cesar Bonadio <bo...@picture.com.br>.
On Thu, 2004-04-15 at 13:02, Noel J. Bergman wrote:

> > I am not using mordred, I configured to use the other method 
> 
> Oh?  When did you make that change?  


when I installed the build version, the other version 2.1.3 is still
running with modred


> Only just now?  Are the other systems that have this problem still using mordred?


I will try to change back to modred and see if the problem occours
again?


> 
> 	--- Noel
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
> For additional commands, e-mail: server-user-help@james.apache.org
> 

RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
> I am not using mordred, I configured to use the other method 

Oh?  When did you make that change?  Only just now?  Are the other systems that have this problem still using mordred?

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by Cesar Bonadio <bo...@picture.com.br>.
I am not using mordred, I configured to use the other method


On Thu, 2004-04-15 at 09:48, Noel J. Bergman wrote:

> Cesar Bonadio wrote:
> > I put the test build to run in our server and so far the
> > problem did not happen again
> 
> > I will keep check until tomorrow, if everything is fine I will
> > install this version in all of our server
> 
> Good to hear.
> 
> Unfortunately, I am seeing this problem consistently now with the current code in CVS.  The only change I can think of that might have effected it is, oddly, a change to mordred.  I don't think that any other change has been anywhere near this code path, although I'm looking.
> 
> I've made some logging changes to help track this down.
> 
> 	--- Noel
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
> For additional commands, e-mail: server-user-help@james.apache.org
> 

RE: James Running Out Of Connections

Posted by Cesar Bonadio <bo...@picture.com.br>.
On Fri, 2004-04-16 at 12:55, Noel J. Bergman wrote:
> > I am running 1.4.2_03 and its working fine too
> 
> With which code?  2.2.0a18 or the code currently in CVS?  I hope to post a

2.2.0a18

[]s



---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
> I am running 1.4.2_03 and its working fine too

With which code?  2.2.0a18 or the code currently in CVS?  I hope to post a
new test build late today with everything planned for 2.2.0 except for DSN
format bounces, so hopefully you can verify that it continues to work well
in your environment.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by Cesar Bonadio <bo...@picture.com.br>.
> 
> Seems to be running fine with 1.4.2_04, using the same code that was crashing consistently with 1.4.1_02.
> 

I am running 1.4.2_03 and its working fine too


> 	--- Noel
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
> For additional commands, e-mail: server-user-help@james.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
> > I did a thread dump, and could see the threads waiting on
> > the synchronized pool object.  The odd thing is that I
> > could not find anything that owned it.

> Apparently, this is a defect in Sun JVM 1.4.0 and 1.4.1.  Testing
> with 1.4.2 now.  The problem has been occuring every 6 hours, so
> we'll see in the AM.

Seems to be running fine with 1.4.2_04, using the same code that was crashing consistently with 1.4.1_02.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
> I did a thread dump, and could see the threads waiting on
> the synchronized pool object.  The odd thing is that I
> could not find anything that owned it.

Apparently, this is a defect in Sun JVM 1.4.0 and 1.4.1.  Testing with 1.4.2 now.  The problem has been occuring every 6 hours, so we'll see in the AM.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
> I am using org.apache.james.util.mordred.JdbcDataSource

> So the test build with mordred is working just fine for me.

With the changes submitted yesterday to fix a deadlock problem, I was seeing consistent failures within about 6 hours.  I did a thread dump, and could see the threads waiting on the synchronized pool object.  The odd thing is that I could not find anything that owned it.

I am testing a change that uses an ArrayList instead of a Vector (inherently synchromized), and ensures that all uses of the pool are explicitly synchronized.  One conjecture from a friend is that there might be a bug in the JVM I'm using (one of the Sun 1.4.1 vintages), and that simplifying the synchronization could relieve it.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by Cesar Bonadio <bo...@picture.com.br>.
Sorry my mistake

I am using 

org.apache.james.util.mordred.JdbcDataSource

I just checked the config.xml in the Test Build server and the database
is using mordred.

So the test build with mordred is working just fine for me.


[]s





On Thu, 2004-04-15 at 14:13, Noel J. Bergman wrote:

> > The only change I can think of that might have effected it is, oddly,
> > a change to mordred.  I don't think that any other change has been
> > anywhere near this code path, although I'm looking.
> 
> I'm very much leaning towards that being the cause.  When you encounter this problem, look in Phoenix's temp/ directory, and see if there are a bunch of .m64 files that correspond with when the problem starts.
> 
> My current hypothesis is that a problem in mordred is causing all of the calling threads to hang when they try to save to the spool.  One thread must be stuck, and all of the others are blocked on a synchronize.  I had never encounted this before, but have several times since making a change to mordred yesterday.
> 
> 	--- Noel
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
> For additional commands, e-mail: server-user-help@james.apache.org
> 

RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
> The only change I can think of that might have effected it is, oddly,
> a change to mordred.  I don't think that any other change has been
> anywhere near this code path, although I'm looking.

I'm very much leaning towards that being the cause.  When you encounter this problem, look in Phoenix's temp/ directory, and see if there are a bunch of .m64 files that correspond with when the problem starts.

My current hypothesis is that a problem in mordred is causing all of the calling threads to hang when they try to save to the spool.  One thread must be stuck, and all of the others are blocked on a synchronize.  I had never encounted this before, but have several times since making a change to mordred yesterday.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
Cesar Bonadio wrote:
> I put the test build to run in our server and so far the
> problem did not happen again

> I will keep check until tomorrow, if everything is fine I will
> install this version in all of our server

Good to hear.

Unfortunately, I am seeing this problem consistently now with the current code in CVS.  The only change I can think of that might have effected it is, oddly, a change to mordred.  I don't think that any other change has been anywhere near this code path, although I'm looking.

I've made some logging changes to help track this down.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by Cesar Bonadio <bo...@picture.com.br>.
I put the test build to run in our server and so far the problem did not
happen again
the connection log shows a smaller number of connections than netstat
what is correct

Usually I would aready have more connections in the connection log than
in netstat after
12 hours of server up.

I will keep check until tomorrow, if everything is fine I will install
this version in 
all of our server


[]s




On Thu, 2004-04-15 at 00:57, Noel J. Bergman wrote:

> I encountered this problem tonight, and have opened up a JIRA issue to track this problem: http://nagoya.apache.org/jira/browse/JAMES-254
> 
> Unfortunately, I reset the server before checking the state.  I did review the logs, and I have a few ideas about what might be causing it, but there wasn't sufficient information in the logs to be conclusive.  I am making some changes to the logging, so that exceptions that terminate the connection are logged at ERROR instead of DEBUG level.
> 
> Some of you have apparently been encountering this problem all-too-frequently, and already have DEBUG enabled for the SMTPHandler.  What I'd like for you to do is check the smtpserver log to see if you have a bunch of messages of the form:
> 
>  $DATE ERROR smtpserver: SMTP Connection has idled out.
> 
> but without a subsequent exception, which might be of the form:
> 
>  $DATE DEBUG smtpserver: Socket to $remoteHost ($remoteIP) timeout.
> 
> Also, if the connection log says that you are maxed, and netstat --inet does not show that many connections, do a KILL -3 on the james process so that we can get a thread dump, before you restart James.
> 
> Right now I am letting a connection idle out before the HELO to see what it does.  If that works properly, I'll let it idle out after the HELO, and will walk my way through until I reproduce the problem.
> 
> 	--- Noel
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
> For additional commands, e-mail: server-user-help@james.apache.org
> 

RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
I encountered this problem tonight, and have opened up a JIRA issue to track this problem: http://nagoya.apache.org/jira/browse/JAMES-254

Unfortunately, I reset the server before checking the state.  I did review the logs, and I have a few ideas about what might be causing it, but there wasn't sufficient information in the logs to be conclusive.  I am making some changes to the logging, so that exceptions that terminate the connection are logged at ERROR instead of DEBUG level.

Some of you have apparently been encountering this problem all-too-frequently, and already have DEBUG enabled for the SMTPHandler.  What I'd like for you to do is check the smtpserver log to see if you have a bunch of messages of the form:

 $DATE ERROR smtpserver: SMTP Connection has idled out.

but without a subsequent exception, which might be of the form:

 $DATE DEBUG smtpserver: Socket to $remoteHost ($remoteIP) timeout.

Also, if the connection log says that you are maxed, and netstat --inet does not show that many connections, do a KILL -3 on the james process so that we can get a thread dump, before you restart James.

Right now I am letting a connection idle out before the HELO to see what it does.  If that works properly, I'll let it idle out after the HELO, and will walk my way through until I reproduce the problem.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
Please do not use HTML e-mail.

> > Have you run netstat to check the active connections?
> Yes there was 30 ESTABLISHED connections with 26 in the SMTP service

OK, if they are ESTABLISHED, with whom are they ESTABLISHED?  That does not sound like you have a lot of "hung" connections.

> smtpserver-2004-04-13-10-53.log:13/04/04 10:53:29 DEBUG smtpserver: Calling reset() default Worker #40
> ......... 34576 times the same message

Hmm ... if you are saying that you had 34576 messages in one second, that should eliminate data arriving that quickly.

> smtpserver-2004-04-13-10-53.log:13/04/04 10:58:59 DEBUG smtpserver: Calling stop() default Worker #40
> smtpserver-2004-04-13-10-53.log:13/04/04 10:58:59 DEBUG smtpserver: Calling disposeWatchdog() default Worker #40
> smtpserver-2004-04-13-10-53.log:13/04/04 10:58:59 DEBUG smtpserver: Watchdog default Worker #40 is exiting run().

OK, that was alive for 30 seconds, and appears to have cleaned up just fine.  If you knew to whom that was connected, you could find out what they were doing.

> What I find is that when I start the server the number of
> connection in the connection log is smaller but closer to
> the number os connections with NETSTAT but after 2-3 hours
> the number of connections in the connection log starts to grow
> little by little and passes the number of connections in NETSTAT

That sounds like a problem we may have already fixed.  I'd like to know what you see with the current test build.  If it isn't fixed in the test build, we should be able to modify the instrumentation to help track this problem.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


RE: James Running Out Of Connections

Posted by Cesar Bonadio <bo...@picture.com.br>.
On Tue, 2004-04-13 at 13:31, Noel J. Bergman wrote:

> Which version of James?  


2.1.3


> Have you run netstat to check the active connections?  

Yes there was 30 ESTABLISHED connections with 26 in the SMTP service


>  Is it possible that your service is simply being hit by lots of connections today from internet worms?


Not really the system works fine today


> 
> > in the smtp log hundreds of messages like the following:
> > 13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
> > 13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
> > ...
> > Any idea if this can be the source of the problem
> 
> Actually, that could be helpful.  That message is emitted by the Watchdog for that connection when it is reset.  The reset happens on one of two conditions:
> 
>   - a new command is entered
>   - a certain amount (20K default) of data is received during
>     processing of the DATA command


The messages started as:

smtpserver-2004-04-13-10-53.log:13/04/04 10:53:27 DEBUG smtpserver:
Watchdog default Worker #40 has time to sleep 360000
smtpserver-2004-04-13-10-53.log:13/04/04 10:53:28 DEBUG smtpserver:
Calling reset() default Worker #40
smtpserver-2004-04-13-10-53.log:13/04/04 10:53:29 DEBUG smtpserver:
Calling reset() default Worker #40
......... 34576 times the same message
smtpserver-2004-04-13-10-53.log:13/04/04 10:58:59 DEBUG smtpserver:
Calling stop() default Worker #40
smtpserver-2004-04-13-10-53.log:13/04/04 10:58:59 DEBUG smtpserver:
Calling disposeWatchdog() default Worker #40
smtpserver-2004-04-13-10-53.log:13/04/04 10:58:59 DEBUG smtpserver:
Watchdog default Worker #40 is exiting run().

then it resume with normal operation


What I find is that when I start the server the number of connection in
the connection log is smaller but closer to the
number os connections with NETSTAT but after 2-3 hours the number of
connections in the connection log starts to grow
little by little and passes the number of connections in NETSTAT

Like now I have in the connection log

13/04/04 16:41:39 DEBUG connections: Releasing one connection, leaving a
total of 62
13/04/04 16:41:39 DEBUG connections: Returning a
org.apache.james.util.connection.ServerConnection$ClientConnectionRunner
to the pool
13/04/04 16:41:39 DEBUG connections: Retrieving a
org.apache.james.util.connection.ServerConnection$ClientConnectionRunner
from the pool
13/04/04 16:41:39 DEBUG connections: Adding one connection for a total
of 63
13/04/04 16:41:39 DEBUG connections: Starting connection on
200.229.128.62:25 from 200.77.160.40:3805
13/04/04 16:41:39 DEBUG connections: Ending connection on
200.229.128.62:25 from 200.247.6.3:1409
13/04/04 16:41:39 DEBUG connections: Releasing one connection, leaving a
total of 62
13/04/04 16:41:39 DEBUG connections: Returning a
org.apache.james.util.connection.ServerConnection$ClientConnectionRunner
to the pool
13/04/04 16:41:39 DEBUG connections: Ending connection on
200.229.128.62:25 from 67.64.81.156:10326
13/04/04 16:41:39 DEBUG connections: Releasing one connection, leaving a
total of 61
13/04/04 16:41:39 DEBUG connections: Returning a
org.apache.james.util.connection.ServerConnection$ClientConnectionRunner
to the pool

But NETSTAT says:

[root@mx05 logs]# netstat -n | grep ESTA | wc -l
     41
[root@mx05 logs]# date
Tue Apr 13 16:41:50 BRT 2004


So after 2 days my server starts to refuse connections because it thinks
that there are more than 200 simultaneous connections what is
not correct.


Any idea on how to debug more information about that?

Thanks







> 
> Either someone is entering 100s of lines, they are sending you a huge message in a short period of time, or there is a really weird bug in the code.  You showed 7 entries at 10:58:24, so that would imply 140K in one second.  Are all of the 100s of entries at the same time?  What is the rate at which they are logged?
> 
> I'm leaning towards someone sending a blank command, which would not be logged but would result in the watchdog being reset.
> 
> 	--- Noel
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
> For additional commands, e-mail: server-user-help@james.apache.org
> 

RE: James Running Out Of Connections

Posted by "Noel J. Bergman" <no...@devtech.com>.
Which version of James?  Have you run netstat to check the active connections?   Is it possible that your service is simply being hit by lots of connections today from internet worms?

> in the smtp log hundreds of messages like the following:
> 13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
> 13/04/04 10:58:24 DEBUG smtpserver: Calling reset() default Worker #40
> ...
> Any idea if this can be the source of the problem

Actually, that could be helpful.  That message is emitted by the Watchdog for that connection when it is reset.  The reset happens on one of two conditions:

  - a new command is entered
  - a certain amount (20K default) of data is received during
    processing of the DATA command

Either someone is entering 100s of lines, they are sending you a huge message in a short period of time, or there is a really weird bug in the code.  You showed 7 entries at 10:58:24, so that would imply 140K in one second.  Are all of the 100s of entries at the same time?  What is the rate at which they are logged?

I'm leaning towards someone sending a blank command, which would not be logged but would result in the watchdog being reset.

	--- Noel


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org