Posted to users@tomcat.apache.org by Patrik Kudo <ku...@pingpong.net> on 2010/03/16 14:58:58 UTC

Sporadic errors during high load

Hi all!

We run a fairly large web application which we're currently trying to do some load tests on but we're experiencing some sporadic errors which we can't find the cause of.

We run a load test scenario using the Proxysniffer load testing tool on a machine connected to the same switch as the server under load. The load test simulates 3100 users looping over 27 pages of varying complexity. Each loop takes 2175 seconds (roughly 36 minutes) on average and the average response time per page is 0.16 seconds. The test runs for about 5 hours and after a while, normally around 1 hour but sometimes as soon as a little more than 30 minutes and sometimes longer, there are occasional errors. The errors always come clustered, with a bunch on each occurrence. After each occurrence everything runs fine for a length of time until the next occurrence.

Proxysniffer reports all errors as "Network Connection aborted by Server" but when we look at each error in detail we can see that they don't all occur at the same stage in the request cycle. Some occur on "transmit http request", some on "open network connection", some on "wait for server response", but all within the same second.

On one of the tests we had a total of more than 3,000,000 requests and only 14 errors, divided over 2 occurrences, during the 5-hour test.

The problem is 100% reproducible with the current setup and the other setups we've tested, but the errors occur with some randomness.

The application logs show nothing unusual. The access logs show nothing unusual. We've included the session IDs in the Tomcat logs, and the failing URLs don't show up in the access log at all for the given session ID (the cookies are shown in the error report).

During the test the machine is under some load, but I wouldn't call it heavy load. The application is quite database-intensive, so PostgreSQL works a lot harder than Java/Tomcat.

At first we used Apache 2.2 with mod_jk in front of Tomcat; the errors were more numerous then, and we got a bunch of errors in mod_jk.log stating Apache could not connect to Tomcat. To be able to pinpoint the problem we've now excluded Apache httpd and run only Tomcat with the NIO HTTP connector. We also tried the vanilla HTTP connector.
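
For reference, the kind of server-side check we can run during an error burst to see whether the kernel is dropping connections (a rough sketch; the exact counter names may differ between FreeBSD releases):

# rough sketch: look for accept-queue (listen backlog) drops while the errors occur
netstat -s -p tcp | grep -i "listen queue"
# kernel cap on the backlog a listening socket may request
sysctl kern.ipc.somaxconn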

We've tried both the default garbage collector with default settings and the flags "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode". There was no significant difference in times or errors between the two.

We've been able to match some of the errors with full collections reported by the flags "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps", but some errors occur where no full GC is taking place.
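
For reference, this is roughly how we line the GC log up against the error timestamps (a sketch of our setup: the -verbose:gc output ends up in catalina.out for us, and since -XX:+PrintGCTimeStamps prints seconds since JVM start we add the JVM start time to get wall-clock times):

# rough sketch: list the full collections and their pause times from the GC output
grep "Full GC" logs/catalina.out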



I'm running out of ideas here... What am I missing? What am I doing wrong? What could I try?



The full JVM flags are:

# general options
JAVA_OPTS="-server -Dbuild.compiler.emacs=true"
# Memory limits (we've tried both higher and lower values here)
JAVA_OPTS="${JAVA_OPTS} -XX:MaxPermSize=192m -Xmx1800m -Xms1800m"
# GC logging
JAVA_OPTS="${JAVA_OPTS}  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
# GC engine (tried excluding this and using the default values)
#JAVA_OPTS="${JAVA_OPTS}  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode"
# GC tuning (tried excluding these as well)
#JAVA_OPTS="${JAVA_OPTS}  -Xmn2g -XX:ParallelGCThreads=8 -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31"
# JVM options
JAVA_OPTS="${JAVA_OPTS} -Dfile.encoding=utf-8 -Djava.awt.headless=true"


Software involved:
FreeBSD 8.0-RELEASE-p2 with diablo-jdk1.6.0 (we also tried openjdk6). Tomcat 6.0.26 (previously 6.0.20, with the same problem). The application uses org.apache.commons.dbcp.BasicDataSource to connect to PostgreSQL 8.4.2 on the same machine. Most of the application uses Hibernate and Ehcache to access the database, but some parts use vanilla JDBC and some older parts still use a homebrew connection pool. We use Spring for transaction management and autowiring of some handler/service objects.

Hardware:
16 CPU cores (Intel(R) Xeon(R) X5550  @ 2.67GHz)
32 GB RAM


Thanks in advance,
Patrik Kudo


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Sporadic errors during high load

Posted by ssureshceg <ss...@hotmail.com>.

We are also seeing a similar issue. We are running a load test of 3000 concurrent
users from 3000 IP addresses, at approximately 4 connections per IP address per
second, loading 5 pages which contain a few portlets. The test runs for an hour,
but after 5 minutes we start seeing failures in establishing connections. At the
TCP layer we can see the SYN packet from the client to the server, but we are not
seeing the SYN-ACK back. The SYN-ACK is not missing for every connection, but the
failure rate grows as time goes on. The load test is simulated using the Spirent tool.
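
For what it's worth, this is roughly how we check it on the server side (a sketch; the interface name and port are placeholders for whatever applies to the box, and the exact netstat counter wording varies by kernel):

# rough sketch of the server-side checks during the test (RHEL 5 box)
netstat -s | grep -i -E "listen|SYNs"
# watch incoming SYNs on the connector port (interface and port are placeholders)
tcpdump -n -i eth0 'tcp[tcpflags] & tcp-syn != 0 and port 8081'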

The same 3000 concurrent connections from 1 IP address work very well using
the Grinder tool. We didn't see any failures.

First we tried our portal web application with the default HTTP connector and
saw the failures as mentioned above. Next we tried the Apache web server with
mod_jk in front of the portal web application and saw the same results.

By portal web app I mean Liferay 5.2.6 + Tomcat 6.0.24.

The following is the configuration we have:

1. With the HTTP connector, this is the configuration:

<Connector port="8081" maxHttpHeaderSize="8192"
       maxThreads="200" minSpareThreads="50" maxSpareThreads="100"
       enableLookups="false" acceptCount="400"
       protocol="HTTP/1.1"
       acceptorThreadCount="8"
       connectionTimeout="20000"
       redirectPort="8444"  disableUploadTimeout="true"
       URIEncoding="UTF-8" />

There are no other changes to the default settings.
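
One thing we are not sure about is whether the acceptCount="400" above is actually honoured by the kernel, since the listen backlog is capped by these sysctls (a sketch; the values mentioned are only the usual RHEL 5 defaults):

# kernel caps on the accept/SYN backlogs (defaults are often 128 and 1024 on RHEL 5)
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog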

2. With the Apache web server + mod_jk in front of Tomcat:


 <!-- Define an AJP 1.3 Connector on port 8009 -->
 <Connector port="8010" protocol="AJP/1.3" redirectPort="8444"
URIEncoding="UTF-8" backlog="100"/>

workers.properties file content:
worker.list=liferay

worker.liferay.type=ajp13
worker.liferay.host=localhost
worker.liferay.port=8010

Host Details –

[root@ecp-qa-6 ~]# uname -a
Linux ecp-qa-6.cisco.com 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT
2010 x86_64 x86_64 x86_64 GNU/Linux
[root@ecp-qa-6 ~]#


[root@ecp-qa-6 ~]# java -version
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
[root@ecp-qa-6 ~]#


Appreciate your help.

Cheers 
Suresh Subramanian

Patrik Kudo wrote:
> 
> Hi all!
> 
> We run a fairly large web application which we're currently trying to do
> some load tests on but we're experiencing some sporadic errors which we
> can't find the cause of.
> 
> We run a load test scenario using the Proxysniffer load testing tool on a
> machine connected to the same switch as the server under load. The load
> test simulates 3100 users looping over 27 pages of varying complexity.
> Each loop takes 2175 seconds on average and the average response time per
> page is 0.16 seconds. The test runs for about 5 hours and after a while,
> normaly around 1 hour but sometimes as soon as after a little more than 30
> minutes and sometimes longer, there are occasional errors. The errors
> always come clustered with a bunch on each occurance. After each occurance
> everything runs fine for a lenght of time until the next occurance.
> 
> Proxysniffer reports all errors as "Network Connection aborted by Server"
> but when we look at each error in detail we can see that they don't all
> occur at the same stage in the request cycle. Some occur on "transmit http
> request", some on "open network connection", some on "wait for server
> response", but all within the same second.
> 
> On one of the tests we had a total of more than 3000000 requests and had
> only 14 errors divided over 2 occations during the 5 hour test.
> 
> The problem is 100% reproducable with the current setup and the setups
> we've tested but the errors occur with some randomness.
> 
> The application logs show nothing unusual. The access logs show nothing
> unusual. We've included the session ids in the tomcat logs and the failing
> urls doesn't show up in the access log at all for the given session id
> (cookies are shown in the error report). 
> 
> During the test the machine is under some load, but I wouldn't call it
> heavy load. The application is quite database intensive so postgres works
> a lot harder than java/tomcat.
> 
> At first we used apache 2.2 with mod_jk to in front of tomcat and the
> errors were more numerous at that time and we got a bunch of errors in the
> mod_jk.log stating apache could not connect to tomcat. To be able to
> pinpoint the problem we've now excluded apache httpd and run only tomcat
> with the NIO HTTP connector. We also tried the vanilla HTTP connector.
> 
> We've tried to use both the default garbage collector with default
> settings and the flags "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode". No significant difference in times and errors
> with both settings.
> 
> We've been able to match some of the errors with full collections reported
> by the flags "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" but
> some errors occur where there are no full GC occuring.
> 
> 
> 
> I'm running out of ideas here... What am I missing? What am I doing wrong?
> What could I try?
> 
> 
> 
> The full JVM flags are:
> 
> # general options
> JAVA_OPTS="-server -Dbuild.compiler.emacs=true"
> # Memory limits (we've tried both higher and lower values here)
> JAVA_OPTS="${JAVA_OPTS} -XX:MaxPermSize=192m -Xmx1800m -Xms1800m"
> # GC logging
> JAVA_OPTS="${JAVA_OPTS}  -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps"
> # GC engine (Tried with excluding this and usinging the default values)
> #JAVA_OPTS="${JAVA_OPTS}  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode"
> # GC tuning (tried with excluding these as well)
> #JAVA_OPTS="${JAVA_OPTS}  -Xmn2g -XX:ParallelGCThreads=8
> -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90
> -XX:MaxTenuringThreshold=31"
> # JVM options
> JAVA_OPTS="${JAVA_OPTS} -Dfile.encoding=utf-8 -Djava.awt.headless=true"
> 
> 
> Software involved:
> FreeBSD 8.0-RELEASE-p2 with diablo-jdk1.6.0 (we also tried openjdk6).
> Tomcat 6.0.26 (previously 6.0.20 with same problem). The application uses
> org.apache.commons.dbcp.BasicDataSource to connect to postgresql 8.4.2 on
> the same machine. Most part of the application uses hibernate and ehcache
> to access the database but some part use vanilla jdbc and some older parts
> still use a homebrew connection pool. We use spring for transaction
> management and autowiring of some handler/service objects.
> 
> Hardware:
> 16 CPU cores (Intel(R) Xeon(R) X5550  @ 2.67GHz)
> 32 GB RAM
> 
> 
> Thanks in advance,
> Patrik Kudo
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
> 
> 
> 



-- 
View this message in context: http://old.nabble.com/Sporadic-errors-during-high-load-tp27918213p28024936.html
Sent from the Tomcat - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Sporadic errors during high load

Posted by Peter Crowther <pe...@melandra.com>.
Thanks for a comprehensive statement of the problem - so many people don't
include the basics, let alone the details!

A few thoughts inline.

On 16 March 2010 13:58, Patrik Kudo <ku...@pingpong.net> wrote:

> We run a load test scenario using the Proxysniffer load testing tool on a
> machine connected to the same switch as the server under load. The load test
> simulates 3100 users


Why this number?  What happens if you increase it - does the incidence of
the problem increase?  This might make it easier to track down.


> looping over 27 pages of varying complexity.


Again, can you force the issue by tuning which pages are requested?


> Proxysniffer reports all errors as "Network Connection aborted by Server"
> but when we look at each error in detail we can see that they don't all
> occur at the same stage in the request cycle. Some occur on "transmit http
> request", some on "open network connection", some on "wait for server
> response", but all within the same second.
>

It'd be interesting to run (say) Wireshark and sniff the TCP connections.
In particular, that sounds like TCP RSTs coming off the server but it would
be good to verify that and to see at which points in the negotiation they
happen.
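
Something along these lines would capture just the SYNs and RSTs for later inspection in Wireshark (an untested sketch; substitute the real interface name and connector port):

# untested sketch: record SYN/RST segments on the Tomcat port for Wireshark
tcpdump -n -i em0 -w resets.pcap 'port 8080 and tcp[tcpflags] & (tcp-rst|tcp-syn) != 0'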

>
> The application logs show nothing unusual. The access logs show nothing
> unusual. We've included the session ids in the tomcat logs and the failing
> urls doesn't show up in the access log at all for the given session id
> (cookies are shown in the error report).
>

That's interesting; I'll leave better-qualified people to comment on what
code paths this eliminates.


> We've been able to match some of the errors with full collections reported
> by the flags "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" but
> some errors occur where there are no full GC occuring.
>

How long do your full GCs take?


> # Memory limits (we've tried both higher and lower values here)
> JAVA_OPTS="${JAVA_OPTS} -XX:MaxPermSize=192m -Xmx1800m -Xms1800m"
>

That's a small part of a 32G machine, but you're seeing no out of memory
errors so it's a sign of good design and coding ;-).

FreeBSD 8.0-RELEASE-p2 with diablo-jdk1.6.0 (we also tried openjdk6).


Can you tell us *exactly* which versions of the JDKs?
http://www.freebsd.org/java/ tells me 1.6.0-7 is current - sorry, I'm not as
well up on FreeBSD Java versions as some other OSs.

- Peter