Posted to server-dev@james.apache.org by Danny Angus <da...@apache.org> on 2002/10/13 23:53:43 UTC

Socket Performance

In my opinion James socket problems would be greatly reduced in impact if James behaviour was as follows..

connections are accepted
-> resources are consumed
-> limits are approached
-> connections are refused
-> resources are freed
-> connections are accepted


rather than the current situation, which is that connections are accepted until resources are exhausted, and James never recovers.
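The accept/refuse cycle above can be sketched roughly as follows. This is hypothetical illustration code, not the actual James ConnectionManager; the class and method names are my own:

```java
// Hypothetical sketch, not actual James code: admit connections until a
// limit is reached, refuse at the limit, and resume admitting once
// handlers finish and free their slots.
class BoundedAcceptor {
    private final int maxConnections;
    private int active = 0;

    BoundedAcceptor(int maxConnections) {
        this.maxConnections = maxConnections;
    }

    // Called when a connection arrives; true = admit, false = refuse.
    synchronized boolean tryAdmit() {
        if (active >= maxConnections) {
            return false; // at the limit: refuse rather than exhaust resources
        }
        active++;
        return true;
    }

    // Called when a handler finishes; frees a slot so accepting can resume.
    synchronized void release() {
        if (active > 0) {
            active--;
        }
    }
}
```

The point is that refusal is a recoverable state: as soon as `release()` runs, `tryAdmit()` starts returning true again.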


In addition, it concerns me that we can't run James under the -server JVM option on linux, because Avalon causes a failure (attached message).
Tomcat 3 under heavy and sustained load ends up with an out of memory exception; -server cures it, largely because of the more aggressive garbage collection.

In my opinion it is right for us to optimise our use of resources, but impossible to create a server that will sustain any load applied to it. What we need to do is ensure that the server will continue to function, even if this means rejecting connections.
This route will provide a scalable and robust solution.

d.


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: Socket Performance

Posted by Danny Angus <da...@apache.org>.
Sorry, it didn't attach after all (!). Here it is.
d.


> >In addition it concerns me that we can't run James under the 
> -server JVM otpion on linux because Avalon causes a failure 
> (attached message)
> >
> 
> Not seeing an attached message - can you resend?

Re: Socket Performance

Posted by Stephen McConnell <mc...@apache.org>.

Danny Angus wrote:

>In my opinion James socket problems would be greatly reduced in impact if James behaviour was as follows..
>
>connections are accepted
>-> resources are consumed
>-> limits are approached
>-> connections are refused
>-> resources are freed
>-> connections are accepted
>
>
>rather than the current situation, which is that connections are accepted until resources are exhausted, and James never recovers.
>
>
>In addition it concerns me that we can't run James under the -server JVM option on linux because Avalon causes a failure (attached message)
>

Not seeing an attached message - can you resend?

Steve.

>Tomcat 3 under heavy and sustained load ends up with an out of memory exception; -server cures it, largely because of the more aggressive garbage collection.
>
>In my opinion it is right for us to optimise our use of resources, but impossible to create a server that will sustain any load applied to it. What we need to do is ensure that the server will continue to function, even if this means rejecting connections.
>This route will provide a scalable and robust solution.
>
>d.
>
>

-- 

Stephen J. McConnell

OSM SARL
digital products for a global economy
mailto:mcconnell@osm.net
http://www.osm.net






RE: Socket Performance

Posted by "Noel J. Bergman" <no...@devtech.com>.
> The current problem is that the mis-use of the Scheduler

I'd prefer to simply say that it is being applied for a purpose that does
not match its design point.  :-)

> Basically, with only five concurrent connections,
> you can easily kill the Scheduler implementation
> with consistent load.

I've done that several times today with Harmeet, testing various changes to
his idea of how to make use of the Scheduler or TimerTask.  Either way the
memory usage kills the server.  Here is a set of tests:

time,messages,data(K),errors,connections,SSL connections
15:16,705,606,32,490,0
15:17,700,587,23,461,0
15:18,464,374,38,308,0
15:19,4,2,223,0,0

time,messages,data(K),errors,connections,SSL connections
15:39,687,584,29,476,0
15:40,629,536,24,425,0
15:41,523,420,36,339,0

time,messages,data(K),errors,connections,SSL connections
16:19,868,730,39,614,0
16:20,858,738,27,572,0

time,messages,data(K),errors,connections,SSL connections
16:33,1044,872,21,709,0
16:34,872,740,48,570,0
16:35,0,0,192,0,0

Those are 4 separate tests, each one stopping when the server died.  Each
test shows James dying within 2-3 minutes.  The final test was with
Harmeet's latest code, and with sendMail commented out.

Here is my current log (still running) for Peter's server with the sendMail
commented out:

time,messages,data(K),errors,connections,SSL connections
18:27,579,479,45,410,0
18:28,682,575,3,465,0
18:29,692,554,4,450,0
18:30,679,572,16,453,0
18:31,682,560,5,453,0
18:32,691,568,4,455,0
18:33,690,565,5,454,0
18:34,678,564,6,459,0
18:35,677,565,2,452,0
18:36,692,566,7,448,0
18:37,684,577,10,451,0
18:38,681,552,10,463,0
18:39,674,568,16,460,0
18:40,691,577,9,459,0
18:41,688,565,6,455,0
18:42,679,555,4,464,0
18:43,688,570,4,465,0
18:44,687,551,8,457,0
18:45,681,558,4,450,0
18:46,680,564,12,461,0
18:47,695,598,6,446,0
18:48,680,577,3,456,0
18:49,676,562,3,460,0
18:50,684,586,11,457,0
18:51,686,554,7,465,0
18:52,694,578,7,448,0
18:53,689,563,5,462,0
18:54,683,568,2,454,0
18:55,681,572,5,453,0
18:56,683,563,6,460,0
18:57,687,571,4,463,0
18:58,688,571,12,454,0
18:59,690,573,7,458,0
19:00,679,557,8,445,0
19:01,693,576,2,449,0
19:02,637,532,2,428,0
19:03,684,572,2,451,0

That's 36 minutes.  Performance appears to be consistent, and according to
Peter, CPU and memory on his end appear steady as well.  We're leaving this
running.  Last night I was not able to achieve this result, but he has since
made changes to the watchdog implementation, and commented out sendMail
(turns out that there may be some issues in the spool manager, too).

> The priority queue will hold on to the events, causing out
> of memory errors.  This is one reason why I believe the
> scheduler is the wrong approach.

Harmeet believes that he can come up with a new data structure that will
provide for rapid change.  I've pointed out to him that it needs to support
1000s of resets per second (or more).  He believes that using two threads
per connection doesn't scale.  That is his specific issue.  I suggest that
as the number of connections goes up, a priority queue has its own scaling
problems, and for James' purposes, I believe that the two thread solution
scales suitably.  However, the interface to the watchdog mechanism doesn't
mandate the two thread solution, so the implementation can be changed.  The
interface to the Scheduler, however, isn't designed to be a watchdog.

The Watchdog interface is simple:

public interface Watchdog {
    void startWatchdog();
    void resetWatchdog();
    void stopWatchdog();
}

The current implementation(s) have a constructor like:

    public WatchdogImpl(long timeout, WatchdogTarget target);

That's it.  There is no lookup in the interface, neither is there a mandate
for either a queued or multi-threaded implementation.  If we needed to
handle so many watchdogs that we didn't want to spawn a thread for each, a
new Watchdog implementation could decide how to handle it.  One thread,
multiple threads balancing the length of a priority queue, or whatever.  I
think that this is a simple, reusable interface that does what we need, and
the current implementation is demonstrating that it can handle the load
we're putting on it.
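As a concrete illustration, here is a rough one-thread-per-watchdog implementation of the interface above. This is a sketch under my own assumptions, not the actual James code; in particular, the `WatchdogTarget.execute()` signature and the class name are assumptions:

```java
// The interfaces as described in this thread.
interface Watchdog {
    void startWatchdog();
    void resetWatchdog();
    void stopWatchdog();
}

interface WatchdogTarget {
    void execute(); // assumed callback, fired when the watchdog trips
}

// Hypothetical one-thread-per-watchdog sketch (not the actual James
// implementation): each reset pushes the trip deadline forward, and the
// thread fires the target only if the deadline is actually reached.
class ThreadedWatchdog implements Watchdog, Runnable {
    private final long timeout;
    private final WatchdogTarget target;
    private long deadline;
    private boolean running = false;

    ThreadedWatchdog(long timeout, WatchdogTarget target) {
        this.timeout = timeout;
        this.target = target;
    }

    public synchronized void startWatchdog() {
        deadline = System.currentTimeMillis() + timeout;
        running = true;
        new Thread(this).start();
    }

    public synchronized void resetWatchdog() {
        deadline = System.currentTimeMillis() + timeout; // push trip time out
        notifyAll();
    }

    public synchronized void stopWatchdog() {
        running = false;
        notifyAll();
    }

    public void run() {
        synchronized (this) {
            while (running) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    running = false;
                    target.execute(); // connection idled past the timeout
                } else {
                    try {
                        wait(remaining); // woken early by reset/stop
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }
    }
}
```

Note that, unlike a priority-queue scheduler, nothing here accumulates: a reset just rewrites one `long`, and when the watchdog stops, the thread and its state go away.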

	--- Noel


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: Socket Performance

Posted by "Peter M. Goldstein" <pe...@yahoo.com>.
Danny,

> In my opinion James socket problems would be greatly reduced in impact if
> James behaviour was as follows..
> 
> connections are accepted
> -> resources are consumed
> -> limits are approached
> -> connections are refused
> -> resources are freed
> -> connections are accepted

Some of this capability is already present.  It simply requires correct
configuration.  Use of the <connections> sub element <maxconnections>
(newly introduced with the ConnectionManager change of a few weeks ago)
allows you to throttle the number of connections per server connection.

The current problem is that the mis-use of the Scheduler requires that
the maxconnections number be kept artificially low.  Basically, with
only five concurrent connections, you can easily kill the Scheduler
implementation with consistent load.

> In addition it concerns me that we can't run James under the -server JVM
> option on linux because Avalon causes a failure (attached message)
> Tomcat 3 under heavy and sustained load ends up with an out of memory
> exception, -server cures it, largely because of the more aggressive
> garbage collection.

It concerns me too.  We should push the Avalon folks to figure out what
the problem is.  Possibly this would fix the Scheduler crash, possibly
not.  Seems doubtful to me, as the problem results from the fact that
the global scheduler or timer has references to events that have been
expired and thus GC won't remove these events.  As far as I can tell the
exact same problem exists with Harmeet's scheduler as does with the
previous scheduler.  The priority queue will hold on to the events,
causing out of memory errors.  This is one reason why I believe the
scheduler is the wrong approach.
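The retention problem being described can be illustrated in miniature. This is a hypothetical sketch of the pattern, not James or Avalon code; the class and method names are my own:

```java
import java.util.PriorityQueue;

// Illustration of the retention problem described above (a sketch, not
// the Avalon Scheduler): "cancelling" an event only marks it, so the
// priority queue keeps the reference alive until the deadline passes.
class TimeoutEvent implements Comparable<TimeoutEvent> {
    final long deadline;
    boolean cancelled = false;

    TimeoutEvent(long deadline) { this.deadline = deadline; }

    public int compareTo(TimeoutEvent other) {
        return Long.compare(deadline, other.deadline);
    }
}

class MarkingScheduler {
    private final PriorityQueue<TimeoutEvent> queue =
            new PriorityQueue<TimeoutEvent>();

    void schedule(TimeoutEvent e) { queue.add(e); }

    // Cancel only marks the event; the queue still holds the reference.
    void cancel(TimeoutEvent e) { e.cancelled = true; }

    // Events (cancelled or not) are only discarded once their deadline
    // is reached -- so under load, cancelled events pile up in memory.
    void expireUpTo(long now) {
        while (!queue.isEmpty() && queue.peek().deadline <= now) {
            queue.poll();
        }
    }

    int pending() { return queue.size(); }
}
```

With thousands of connections each resetting a long timeout, the queue of marked-but-not-yet-expired events grows without bound, which matches the out of memory behaviour reported in this thread.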
 
> In my opinion it is right for us to optimise our use of resources, but
> impossible to create a server that will sustain any load applied, what we
> need to do is ensure that the server will continue to function, even if
> this means rejecting connections.
> This route will provide a scalable and robust solution.

I don't disagree with this point.  And a correctly configured server
(after the watchdog fix) does this properly.  Specifically, each service
requires a base number of threads (~2) to function.  Each service
requires either 1 or 2 threads per handler, depending on whether we're
using the old code or the new code.  The SpoolManager consumes the number
of spool threads plus one.  The NNTP Repository consumes the number of
spooler threads plus one.  Fetchpop consumes a single thread.  So sum
that all up based on your configuration, and set that to the max of your
thread pool.  If you do, no problem.  
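As a worked example of that arithmetic (the counts below are illustrative assumptions, not James defaults or a recommended configuration):

```java
// Hypothetical worked example of the thread budget described above;
// every number here is an assumption chosen for illustration.
class ThreadBudget {
    static int total(int services, int baseThreadsPerService,
                     int handlersPerService, int threadsPerHandler,
                     int spoolThreads) {
        int perService = baseThreadsPerService
                + handlersPerService * threadsPerHandler;
        return services * perService
                + (spoolThreads + 1)   // SpoolManager
                + (spoolThreads + 1)   // NNTP Repository
                + 1;                   // Fetchpop
    }
}
```

For example, three services with ten handlers each under the new (two-thread) code and five spool threads would need 3 * (2 + 20) + 6 + 6 + 1 = 79 threads in the pool.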

This is basically what I've been trying to work towards.  Obviously
James can't take arbitrarily high loads.  But the current maximum load
is well below what a real production system should be able to take.  And
the current response of the server in the case of overload is clearly
not acceptable.  The server needs to be robust.

How do we solve this problem?  Proper configuration and a source base
that doesn't tip over from OutOfMemoryErrors.  I believe the current
patch helps alleviate this situation.  I understand you're having
issues, and can only tell you that I am not.  I'm happy to work with you
to get through those issues, but I need more info on your configuration
and assembly.

--Peter

P.S.: The problem from last night's test has been identified.  Basically
the problem lay in the spool.  The spool processing fell woefully behind
the rate at which emails were coming in.  This led to a multi-GB backlog
in the spool of hundreds of thousands of files of ~1 KB each.  This led to O/S
level problems, as Win2k doesn't handle this very well.  It's taken me
well over an hour to attempt to delete these files, and I'm not done
yet.  But there is no indication of a problem with the handler.  Just a
problem with the underlying O/S.      





RE: Socket Performance

Posted by "Noel J. Bergman" <no...@devtech.com>.
> I believe that "del * /Q" is the thing to do (not the godforsaken Windows
> GUI delete)

It still took ages.

> what JDK version.. I can't ever get it to run for long without getting a
> hotspot error

java version "1.3.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0)
Java HotSpot(TM) Server VM (build 1.3.0, mixed mode)





RE: Socket Performance

Posted by Danny Angus <da...@apache.org>.
> There are bugs in it that Peter is working on, but first he had 
> to delete 12
> hours worth of messages (100s of 1000s).

I'm not a keen Windows user, but I believe that "del * /Q" is the thing to do (not the godforsaken Windows GUI delete).

> > In addition it concerns me that we can't run James under the -server JVM
> > option on linux
> 
> I do.

Surely not? What JDK version? I can't ever get it to run for long without getting a hotspot error.




RE: Socket Performance

Posted by "Noel J. Bergman" <no...@devtech.com>.
>  -> connections are accepted
>  -> resources are consumed
>  -> limits are approached
>  -> connections are refused
>  -> resources are freed
>  -> connections are accepted

I believe that the new code facilitates that, but the code still needs work.
There are bugs in it that Peter is working on, but first he had to delete 12
hours worth of messages (100s of 1000s).  Meanwhile, I have been working
with Harmeet, too, and crashing his server in about 2 mins each test.  Both
of them have removed the sendMail call from SMTPHandler so that we can test
JUST the handler (and so that Peter doesn't have to delete all of those
messages -- which is a completely separate overall issue).

Harmeet's server still crashes just in the SMTPHandler with an out of memory error.
Right now I am hitting Peter's server at about the same rate as I was
hitting Harmeet's, except that there doesn't appear to be any leaking
because instead of creating scheduler entries or TimerTask objects (the
Avalon default scheduler and java.util.Timer both maintain priority queues
and simply mark cancelled objects as removable), there is a constant and
small watchdog.  More on this in another message.

> In addition it concerns me that we can't run James under the -server JVM
> option on linux

I do.

>  what we need to do is ensure that the server will continue
> to function, even if this means rejecting connections.

Agreed.

	--- Noel

