Posted to users@tomcat.apache.org by Lars Engholm Johansen <la...@gmail.com> on 2014/06/16 11:40:48 UTC

Connection count explosion due to thread http-nio-80-ClientPoller-x death

Our company is running several high-volume Tomcat 7.0.52 production servers
on Ubuntu 12.04.
We are using Tomcat's WebSocket (JSR-356) implementation heavily, with about
100M text messages (100 GiB) per day.

We monitor web server health by measuring several key parameters every
minute, including the Tomcat connection count, using:

    mBeanServer.getAttribute(threadPool, "connectionCount");  // threadPool is the MBean of type=ThreadPool
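
For context, a minimal standalone sketch of such a poll. This assumes an
in-process JMX lookup; the "Catalina" domain is the Tomcat default, and the
class and variable names are only illustrative:

    import java.lang.management.ManagementFactory;
    import java.util.Set;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class ConnectionCountProbe {
        public static void main(String[] args) throws Exception {
            MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
            // Find every ThreadPool MBean registered by Tomcat (one per connector).
            Set<ObjectName> pools =
                    mBeanServer.queryNames(new ObjectName("Catalina:type=ThreadPool,*"), null);
            for (ObjectName threadPool : pools) {
                Object count = mBeanServer.getAttribute(threadPool, "connectionCount");
                System.out.println(threadPool.getKeyProperty("name") + " connectionCount=" + count);
            }
        }
    }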

______The problem______

Our sites can run for days without problems, but once in a while the tomcat
connection count suddenly starts growing abnormally fast.
See this graph:  http://imgur.com/s4fOUte
netstat shows these TCP connections to be mostly in CLOSE_WAIT state.

Our sites still function normally, with no CPU spikes, during this build-up
until around 60,000 connections, but then the server refuses further
connections and a manual Tomcat restart is required.

We get no output in the Tomcat logs or our own logs at the time this event
occurs. The only sign is found by comparing a full Java thread dump with a
dump from a newly launched Tomcat:
One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1  is
missing/has died.
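
A rough sketch of how that thread-dump check could be automated from inside
the JVM. The watchdog class below is hypothetical; the thread-name prefix
matches our connector name, and a healthy NIO connector should report both
poller threads:

    public class PollerWatchdog {
        // Count live ClientPoller threads for the http-nio-80 connector.
        // Would have to run inside the Tomcat JVM, e.g. from a scheduled task in a webapp.
        public static int countLivePollers() {
            int pollers = 0;
            for (Thread t : Thread.getAllStackTraces().keySet()) {
                if (t.isAlive() && t.getName().startsWith("http-nio-80-ClientPoller")) {
                    pollers++;
                }
            }
            return pollers;  // expected to be 2; fewer means a poller has died
        }
    }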

We have observed this problem at least since Tomcat 7.0.48 and cannot find
any indication in the Tomcat 7.0.x changelog that it has been fixed in
newer releases.

Any help or advice is appreciated,
Best regards,
Lars Engholm Johansen


Our connector configuration:
<Connector
    acceptCount="1500"
    acceptorThreadCount="4"
    asyncTimeout="100000"
    connectionTimeout="60000"
    connectionUploadTimeout="120000"
    disableUploadTimeout="false"
    enableLookups="false"
    keepAliveTimeout="120000"
    maxConnections="100000"
    maxPostSize="3000000"
    maxThreads="300"
    port="80"
    protocol="org.apache.coyote.http11.Http11NioProtocol"
    socket.soKeepAlive="true"
/>

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Lars Engholm Johansen <la...@gmail.com>.
Thanks for looking further into this, Mark,

We are running:

  java version "1.7.0_65"
  OpenJDK Runtime Environment (IcedTea 2.5.1)
(7u65-2.5.1-4ubuntu1~0.12.04.2)
  OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

  Linux 3.11.0-15-generic #25~precise1-Ubuntu SMP Thu Jan 30 17:39:31 UTC
2014 x86_64 x86_64 x86_64 GNU/Linux
  Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz

The bug happens only about once a week, which is very seldom compared to the
100M WebSocket text messages a day, so I guess it could be an external
problem such as a JVM bug.

I already added a temporary fix to our Tomcat installation as mentioned
earlier, but other users could also suffer from this.
How do we proceed?

Best regards,
Lars Engholm Johansen

On Wed, Nov 12, 2014 at 12:15 PM, Mark Thomas <ma...@apache.org> wrote:

> On 10/11/2014 09:57, Lars Engholm Johansen wrote:
> > Hi Mark,
> >
> > I looked into our javax.websocket.Endpoint implementation and found the
> > following suspicious code:
> >
> > When we need to close the WebSocket session already in .onOpen() method
> > (rejecting a connection), we are calling session.close() asynchronously
> > after 1 second via a java.util.Timer task.
> > This was due to bug
> https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
> > that I can see should be fixed long time ago (thanks).
> >
> > Can this cause the selector's keyset to be accessed by more than one
> thread?
>
> I don't see how.
>
> I've just double checked the NIO Poller code and the only places the
> keyset is used is in Poller.run() and Poller.timeout() - both of which
> are only ever accessed from the Poller thread.
>
> I've also looked over the run() and timeout() methods and haven't yet
> found anything that could trigger this.
>
> There are multiple Pollers but each Poller has a distinct set of sockets
> to manage.
>
> I'm beginning to wonder if there is a JVM bug here. Which JVM are you
> using?
>
> Mark
>
> >
> > Best regards,
> > Lars Engholm Johansen
> >
> > On Mon, Oct 6, 2014 at 2:14 PM, Mark Thomas <ma...@apache.org> wrote:
> >
> >> On 06/10/2014 10:11, Lars Engholm Johansen wrote:
> >>> Hi all,
> >>>
> >>> I have good news as I have identified the reason for the devastating
> >>> NioEndpoint.Poller thread death:
> >>>
> >>> In rare circumstances a ConcurrentModification can occur in the
> Poller's
> >>> connection timeout handling called from OUTSIDE the
> try-catch(Throwable)
> >> of
> >>> Poller.run()
> >>>
> >>> java.util.ConcurrentModificationException
> >>>         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
> >>>         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
> >>>         at
> >>>
> >>
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
> >>>         at
> >>>
> >>
> org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
> >>>         at
> >>>
> org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
> >>>         at java.lang.Thread.run(Thread.java:745)
> >>>
> >>> Somehow the Poller's Selector object gets modified from another thread.
> >>
> >> Any idea how? I've been looking through that code for some time now
> >> (this stack trace appears to be from 7.0.55 for those that want to look
> >> at this themselves) and I can't see anywhere where the selector's keyset
> >> is accessed by more than one thread.
> >>
> >>> As a remedy until fixed properly by the Tomcat team, I have added a
> >>> try-catch(ConcurrentModificationException) surrounding the for loop in
> >>> Poller.timeout().
> >>> That way, in case of the rare problem, a full iteration of the Selector
> >>> will be retried in the next call to Poller.timeout().
> >>
> >> That seems like a reasonable work-around but before we start making
> >> changes to the Tomcat code I'd really like to understand the root
> >> cause(s) of the issue else we might not be fixing the actual issue and
> >> could make it worse for some folks.
> >>
> >> Mark
> >>
> >>
> >>>
> >>> I am really happy now as all our production servers have been rock
> stable
> >>> for two weeks now.
> >>>
> >>> Best regards to all,
> >>> Lars Engholm Johansen
> >>>
> >>>
> >>> On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:
> >>>
> >>>> Thanks Lars, if you are indeed experiencing a non caught error, let us
> >> know
> >>>> what it is.
> >>>>
> >>>> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <
> >> larsjo@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks guys for all the feedback.
> >>>>>
> >>>>> I have tried the following suggested tasks:
> >>>>>
> >>>>>    - Upgrading Tomcat to the newest 7.0.55 on all our servers ->
> >> Problem
> >>>>>    still persists
> >>>>>    - Force a System.gc() when connection count is on the loose ->
> >>>>>    Connection count is not dropping
> >>>>>    - Lowering the log level of NioEndpoint class that contains the
> >> Poller
> >>>>>    code -> No info about why the poller thread exits in any tomcat
> logs
> >>>>>    - Reverting the JVM stack size per thread to the default is
> >> discussed
> >>>>>    previously -> Problem still persists
> >>>>>
> >>>>> I have now checked out the NioEndpoint source code and recompiled it
> >>>> with a
> >>>>> logging try-catch surrounding the whole of the Poller.run()
> >>>> implementation
> >>>>> as I noticed that the outer try-catch here only catches OOME.
> >>>>> I will report back with my findings as soon as the problem arises
> >> again.
> >>>>>
> >>>>> /Lars
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
> >>>>> chris@christopherschultz.net> wrote:
> >>>>>
> >>> Filip,
> >>>
> >>> On 6/27/14, 11:36 AM, Filip Hanik wrote:
> >>>>>>>> Are there any log entries that would indicate that the poller
> >>>>>>>> thread has died? This/these thread/s start when Tomcat starts. and
> >>>>>>>> a stack over flow on a processing thread should never affect the
> >>>>>>>> poller thread.
> >>>
> >>> OP reported in the initial post that the thread had disappeared:
> >>>
> >>> On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> >>>>>>>> We have no output in tomcat or our logs at the time when this
> event
> >>>>>>>>  occurs. The only sign is when comparing full java thread dump
> with
> >>>>>>>> a dump from a newly launched Tomcat:
> >>>>>>>>
> >>>>>>>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> >>>>>>>> is missing/has died.
> >>>
> >>> -chris

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Mark Thomas <ma...@apache.org>.
On 10/11/2014 09:57, Lars Engholm Johansen wrote:
> Hi Mark,
> 
> I looked into our javax.websocket.Endpoint implementation and found the
> following suspicious code:
> 
> When we need to close the WebSocket session already in .onOpen() method
> (rejecting a connection), we are calling session.close() asynchronously
> after 1 second via a java.util.Timer task.
> This was due to bug https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
> that I can see should be fixed long time ago (thanks).
> 
> Can this cause the selector's keyset to be accessed by more than one thread?

I don't see how.

I've just double-checked the NIO Poller code, and the only places the
keyset is used are Poller.run() and Poller.timeout() - both of which
are only ever accessed from the Poller thread.

I've also looked over the run() and timeout() methods and haven't yet
found anything that could trigger this.

There are multiple Pollers but each Poller has a distinct set of sockets
to manage.

I'm beginning to wonder if there is a JVM bug here. Which JVM are you using?

Mark

> 
> Best regards,
> Lars Engholm Johansen
> 
> On Mon, Oct 6, 2014 at 2:14 PM, Mark Thomas <ma...@apache.org> wrote:
> 
>> On 06/10/2014 10:11, Lars Engholm Johansen wrote:
>>> Hi all,
>>>
>>> I have good news as I have identified the reason for the devastating
>>> NioEndpoint.Poller thread death:
>>>
>>> In rare circumstances a ConcurrentModification can occur in the Poller's
>>> connection timeout handling called from OUTSIDE the try-catch(Throwable)
>> of
>>> Poller.run()
>>>
>>> java.util.ConcurrentModificationException
>>>         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
>>>         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
>>>         at
>>>
>> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
>>>         at
>>>
>> org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
>>>         at
>>> org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
>>>         at java.lang.Thread.run(Thread.java:745)
>>>
>>> Somehow the Poller's Selector object gets modified from another thread.
>>
>> Any idea how? I've been looking through that code for some time now
>> (this stack trace appears to be from 7.0.55 for those that want to look
>> at this themselves) and I can't see anywhere where the selector's keyset
>> is accessed by more than one thread.
>>
>>> As a remedy until fixed properly by the Tomcat team, I have added a
>>> try-catch(ConcurrentModificationException) surrounding the for loop in
>>> Poller.timeout().
>>> That way, in case of the rare problem, a full iteration of the Selector
>>> will be retried in the next call to Poller.timeout().
>>
>> That seems like a reasonable work-around but before we start making
>> changes to the Tomcat code I'd really like to understand the root
>> cause(s) of the issue else we might not be fixing the actual issue and
>> could make it worse for some folks.
>>
>> Mark
>>
>>
>>>
>>> I am really happy now as all our production servers have been rock stable
>>> for two weeks now.
>>>
>>> Best regards to all,
>>> Lars Engholm Johansen
>>>
>>>
>>> On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:
>>>
>>>> Thanks Lars, if you are indeed experiencing a non caught error, let us
>> know
>>>> what it is.
>>>>
>>>> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <
>> larsjo@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks guys for all the feedback.
>>>>>
>>>>> I have tried the following suggested tasks:
>>>>>
>>>>>    - Upgrading Tomcat to the newest 7.0.55 on all our servers ->
>> Problem
>>>>>    still persists
>>>>>    - Force a System.gc() when connection count is on the loose ->
>>>>>    Connection count is not dropping
>>>>>    - Lowering the log level of NioEndpoint class that contains the
>> Poller
>>>>>    code -> No info about why the poller thread exits in any tomcat logs
>>>>>    - Reverting the JVM stack size per thread to the default is
>> discussed
>>>>>    previously -> Problem still persists
>>>>>
>>>>> I have now checked out the NioEndpoint source code and recompiled it
>>>> with a
>>>>> logging try-catch surrounding the whole of the Poller.run()
>>>> implementation
>>>>> as I noticed that the outer try-catch here only catches OOME.
>>>>> I will report back with my findings as soon as the problem arises
>> again.
>>>>>
>>>>> /Lars
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
>>>>> chris@christopherschultz.net> wrote:
>>>>>
>>> Filip,
>>>
>>> On 6/27/14, 11:36 AM, Filip Hanik wrote:
>>>>>>>> Are there any log entries that would indicate that the poller
>>>>>>>> thread has died? This/these thread/s start when Tomcat starts. and
>>>>>>>> a stack over flow on a processing thread should never affect the
>>>>>>>> poller thread.
>>>
>>> OP reported in the initial post that the thread had disappeared:
>>>
>>> On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
>>>>>>>> We have no output in tomcat or our logs at the time when this event
>>>>>>>>  occurs. The only sign is when comparing full java thread dump with
>>>>>>>> a dump from a newly launched Tomcat:
>>>>>>>>
>>>>>>>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
>>>>>>>> is missing/has died.
>>>
>>> -chris


Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Lars Engholm Johansen <la...@gmail.com>.
Hi Mark,

I looked into our javax.websocket.Endpoint implementation and found the
following suspicious code:

When we need to close the WebSocket session already in the .onOpen() method
(when rejecting a connection), we call session.close() asynchronously
after 1 second via a java.util.Timer task.
This was a work-around for bug https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
which I can see was fixed a long time ago (thanks).
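
Roughly, the rejection path looks like this. This is a simplified sketch, not
our exact code; the accept() check and the close reason are placeholders:

    import java.io.IOException;
    import java.util.Timer;
    import java.util.TimerTask;
    import javax.websocket.CloseReason;
    import javax.websocket.Endpoint;
    import javax.websocket.EndpointConfig;
    import javax.websocket.Session;

    public class RejectingEndpoint extends Endpoint {
        private static final Timer TIMER = new Timer("ws-deferred-close", true);

        @Override
        public void onOpen(final Session session, EndpointConfig config) {
            if (!accept(session)) {
                // Work-around for bug 54716: close one second later from a Timer
                // thread instead of calling session.close() directly in onOpen().
                TIMER.schedule(new TimerTask() {
                    @Override
                    public void run() {
                        try {
                            session.close(new CloseReason(
                                    CloseReason.CloseCodes.VIOLATED_POLICY, "rejected"));
                        } catch (IOException e) {
                            // Ignore - the connection is being dropped anyway.
                        }
                    }
                }, 1000L);
            }
        }

        private boolean accept(Session session) {
            return false;  // placeholder for the real admission check
        }
    }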

Can this cause the selector's keyset to be accessed by more than one thread?

Best regards,
Lars Engholm Johansen

On Mon, Oct 6, 2014 at 2:14 PM, Mark Thomas <ma...@apache.org> wrote:

> On 06/10/2014 10:11, Lars Engholm Johansen wrote:
> > Hi all,
> >
> > I have good news as I have identified the reason for the devastating
> > NioEndpoint.Poller thread death:
> >
> > In rare circumstances a ConcurrentModification can occur in the Poller's
> > connection timeout handling called from OUTSIDE the try-catch(Throwable)
> of
> > Poller.run()
> >
> > java.util.ConcurrentModificationException
> >         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
> >         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
> >         at
> >
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
> >         at
> >
> org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
> >         at
> > org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > Somehow the Poller's Selector object gets modified from another thread.
>
> Any idea how? I've been looking through that code for some time now
> (this stack trace appears to be from 7.0.55 for those that want to look
> at this themselves) and I can't see anywhere where the selector's keyset
> is accessed by more than one thread.
>
> > As a remedy until fixed properly by the Tomcat team, I have added a
> > try-catch(ConcurrentModificationException) surrounding the for loop in
> > Poller.timeout().
> > That way, in case of the rare problem, a full iteration of the Selector
> > will be retried in the next call to Poller.timeout().
>
> That seems like a reasonable work-around but before we start making
> changes to the Tomcat code I'd really like to understand the root
> cause(s) of the issue else we might not be fixing the actual issue and
> could make it worse for some folks.
>
> Mark
>
>
> >
> > I am really happy now as all our production servers have been rock stable
> > for two weeks now.
> >
> > Best regards to all,
> > Lars Engholm Johansen
> >
> >
> > On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:
> >
> >> Thanks Lars, if you are indeed experiencing a non caught error, let us
> know
> >> what it is.
> >>
> >> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <
> larsjo@gmail.com>
> >> wrote:
> >>
> >>> Thanks guys for all the feedback.
> >>>
> >>> I have tried the following suggested tasks:
> >>>
> >>>    - Upgrading Tomcat to the newest 7.0.55 on all our servers ->
> Problem
> >>>    still persists
> >>>    - Force a System.gc() when connection count is on the loose ->
> >>>    Connection count is not dropping
> >>>    - Lowering the log level of NioEndpoint class that contains the
> Poller
> >>>    code -> No info about why the poller thread exits in any tomcat logs
> >>>    - Reverting the JVM stack size per thread to the default is
> discussed
> >>>    previously -> Problem still persists
> >>>
> >>> I have now checked out the NioEndpoint source code and recompiled it
> >> with a
> >>> logging try-catch surrounding the whole of the Poller.run()
> >> implementation
> >>> as I noticed that the outer try-catch here only catches OOME.
> >>> I will report back with my findings as soon as the problem arises
> again.
> >>>
> >>> /Lars
> >>>
> >>>
> >>>
> >>> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
> >>> chris@christopherschultz.net> wrote:
> >>>
> > Filip,
> >
> > On 6/27/14, 11:36 AM, Filip Hanik wrote:
> >>>>>> Are there any log entries that would indicate that the poller
> >>>>>> thread has died? This/these thread/s start when Tomcat starts. and
> >>>>>> a stack over flow on a processing thread should never affect the
> >>>>>> poller thread.
> >
> > OP reported in the initial post that the thread had disappeared:
> >
> > On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> >>>>>> We have no output in tomcat or our logs at the time when this event
> >>>>>>  occurs. The only sign is when comparing full java thread dump with
> >>>>>> a dump from a newly launched Tomcat:
> >>>>>>
> >>>>>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> >>>>>> is missing/has died.
> >
> > -chris

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Mark Thomas <ma...@apache.org>.
On 06/10/2014 10:11, Lars Engholm Johansen wrote:
> Hi all,
> 
> I have good news as I have identified the reason for the devastating
> NioEndpoint.Poller thread death:
> 
> In rare circumstances a ConcurrentModification can occur in the Poller's
> connection timeout handling called from OUTSIDE the try-catch(Throwable) of
> Poller.run()
> 
> java.util.ConcurrentModificationException
>         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
>         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
>         at
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
>         at
> org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
>         at
> org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
>         at java.lang.Thread.run(Thread.java:745)
> 
> Somehow the Poller's Selector object gets modified from another thread.

Any idea how? I've been looking through that code for some time now
(this stack trace appears to be from 7.0.55 for those that want to look
at this themselves) and I can't see anywhere where the selector's keyset
is accessed by more than one thread.

> As a remedy until fixed properly by the Tomcat team, I have added a
> try-catch(ConcurrentModificationException) surrounding the for loop in
> Poller.timeout().
> That way, in case of the rare problem, a full iteration of the Selector
> will be retried in the next call to Poller.timeout().

That seems like a reasonable work-around, but before we start making
changes to the Tomcat code I'd really like to understand the root
cause(s) of the issue; otherwise we might not be fixing the actual issue
and could make it worse for some folks.

Mark


> 
> I am really happy now as all our production servers have been rock stable
> for two weeks now.
> 
> Best regards to all,
> Lars Engholm Johansen
> 
> 
> On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:
> 
>> Thanks Lars, if you are indeed experiencing a non caught error, let us know
>> what it is.
>>
>> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <la...@gmail.com>
>> wrote:
>>
>>> Thanks guys for all the feedback.
>>>
>>> I have tried the following suggested tasks:
>>>
>>>    - Upgrading Tomcat to the newest 7.0.55 on all our servers -> Problem
>>>    still persists
>>>    - Force a System.gc() when connection count is on the loose ->
>>>    Connection count is not dropping
>>>    - Lowering the log level of NioEndpoint class that contains the Poller
>>>    code -> No info about why the poller thread exits in any tomcat logs
>>>    - Reverting the JVM stack size per thread to the default is discussed
>>>    previously -> Problem still persists
>>>
>>> I have now checked out the NioEndpoint source code and recompiled it
>> with a
>>> logging try-catch surrounding the whole of the Poller.run()
>> implementation
>>> as I noticed that the outer try-catch here only catches OOME.
>>> I will report back with my findings as soon as the problem arises again.
>>>
>>> /Lars
>>>
>>>
>>>
>>> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
>>> chris@christopherschultz.net> wrote:
>>>
> Filip,
> 
> On 6/27/14, 11:36 AM, Filip Hanik wrote:
>>>>>> Are there any log entries that would indicate that the poller
>>>>>> thread has died? This/these thread/s start when Tomcat starts. and
>>>>>> a stack over flow on a processing thread should never affect the
>>>>>> poller thread.
> 
> OP reported in the initial post that the thread had disappeared:
> 
> On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
>>>>>> We have no output in tomcat or our logs at the time when this event
>>>>>>  occurs. The only sign is when comparing full java thread dump with
>>>>>> a dump from a newly launched Tomcat:
>>>>>>
>>>>>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
>>>>>> is missing/has died.
> 
> -chris


Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Lars Engholm Johansen <la...@gmail.com>.
Hi all,

I have good news: I have identified the reason for the devastating
NioEndpoint.Poller thread death.

In rare circumstances a ConcurrentModificationException can occur in the
Poller's connection timeout handling, which is called from OUTSIDE the
try-catch(Throwable) of Poller.run():

java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
        at java.util.HashMap$KeyIterator.next(HashMap.java:956)
        at
java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
        at
org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
        at
org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
        at java.lang.Thread.run(Thread.java:745)

Somehow the Poller's Selector object gets modified from another thread.

As a remedy until this is fixed properly by the Tomcat team, I have added a
try-catch(ConcurrentModificationException) surrounding the for loop in
Poller.timeout().
That way, if the rare problem occurs, the full iteration of the Selector's
keys is simply retried in the next call to Poller.timeout().
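
The change is essentially of this shape. This is an illustrative sketch only,
not the actual NioEndpoint source; method and field names are approximations:

    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.ConcurrentModificationException;

    public class PollerTimeoutSketch {
        private final Selector selector;

        public PollerTimeoutSketch(Selector selector) {
            this.selector = selector;
        }

        void timeout() {
            try {
                for (SelectionKey key : selector.keys()) {
                    // ... per-key timeout handling as in Poller.timeout() ...
                }
            } catch (ConcurrentModificationException cme) {
                // The keyset was modified underneath the iteration: abandon this pass.
                // The next scheduled call to timeout() redoes the full iteration,
                // so no connection is permanently skipped.
            }
        }
    }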

I am really happy now, as all our production servers have been rock stable
for two weeks.

Best regards to all,
Lars Engholm Johansen


On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:

> Thanks Lars, if you are indeed experiencing a non caught error, let us know
> what it is.
>
> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <la...@gmail.com>
> wrote:
>
> > Thanks guys for all the feedback.
> >
> > I have tried the following suggested tasks:
> >
> >    - Upgrading Tomcat to the newest 7.0.55 on all our servers -> Problem
> >    still persists
> >    - Force a System.gc() when connection count is on the loose ->
> >    Connection count is not dropping
> >    - Lowering the log level of NioEndpoint class that contains the Poller
> >    code -> No info about why the poller thread exits in any tomcat logs
> >    - Reverting the JVM stack size per thread to the default is discussed
> >    previously -> Problem still persists
> >
> > I have now checked out the NioEndpoint source code and recompiled it
> with a
> > logging try-catch surrounding the whole of the Poller.run()
> implementation
> > as I noticed that the outer try-catch here only catches OOME.
> > I will report back with my findings as soon as the problem arises again.
> >
> > /Lars
> >
> >
> >
> > On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
> > chris@christopherschultz.net> wrote:
> >
> > >
> > > Filip,
> > >
> > > On 6/27/14, 11:36 AM, Filip Hanik wrote:
> > > > Are there any log entries that would indicate that the poller
> > > > thread has died? This/these thread/s start when Tomcat starts. and
> > > > a stack over flow on a processing thread should never affect the
> > > > poller thread.
> > >
> > > OP reported in the initial post that the thread had disappeared:
> > >
> > > On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> > > > We have no output in tomcat or our logs at the time when this event
> > > >  occurs. The only sign is when comparing full java thread dump with
> > > > a dump from a newly launched Tomcat:
> > > >
> > > > One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> > > > is missing/has died.
> > >
> > > - -chris

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Filip Hanik <fi...@hanik.com>.
Thanks Lars. If you are indeed experiencing an uncaught error, let us know
what it is.

On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <la...@gmail.com>
wrote:

> Thanks guys for all the feedback.
>
> I have tried the following suggested tasks:
>
>    - Upgrading Tomcat to the newest 7.0.55 on all our servers -> Problem
>    still persists
>    - Force a System.gc() when connection count is on the loose ->
>    Connection count is not dropping
>    - Lowering the log level of NioEndpoint class that contains the Poller
>    code -> No info about why the poller thread exits in any tomcat logs
>    - Reverting the JVM stack size per thread to the default is discussed
>    previously -> Problem still persists
>
> I have now checked out the NioEndpoint source code and recompiled it with a
> logging try-catch surrounding the whole of the Poller.run() implementation
> as I noticed that the outer try-catch here only catches OOME.
> I will report back with my findings as soon as the problem arises again.
>
> /Lars
>
>
>
> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
> chris@christopherschultz.net> wrote:
>
> >
> > Filip,
> >
> > On 6/27/14, 11:36 AM, Filip Hanik wrote:
> > > Are there any log entries that would indicate that the poller
> > > thread has died? This/these thread/s start when Tomcat starts. and
> > > a stack over flow on a processing thread should never affect the
> > > poller thread.
> >
> > OP reported in the initial post that the thread had disappeared:
> >
> > On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> > > We have no output in tomcat or our logs at the time when this event
> > >  occurs. The only sign is when comparing full java thread dump with
> > > a dump from a newly launched Tomcat:
> > >
> > > One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> > > is missing/has died.
> >
> > - -chris

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Lars Engholm Johansen <la...@gmail.com>.
Thanks guys for all the feedback.

I have tried the following suggested tasks:

   - Upgrading Tomcat to the newest 7.0.55 on all our servers -> problem
   still persists
   - Forcing a System.gc() when the connection count starts running away ->
   the connection count does not drop
   - Lowering the log level of the NioEndpoint class that contains the Poller
   code -> no info in any Tomcat log about why the poller thread exits
   - Reverting the JVM stack size per thread to the default, as discussed
   previously -> problem still persists

I have now checked out the NioEndpoint source code and recompiled it with a
logging try-catch surrounding the whole of the Poller.run() implementation,
as I noticed that the outer try-catch there only catches OOME.
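
In outline the instrumentation looks like this. This is a sketch only, since
the real change is a recompile of NioEndpoint itself; the wrapper class here
is purely illustrative:

    public class LoggingPollerRunnable implements Runnable {
        private final Runnable pollerBody;  // stands in for the original Poller.run() logic

        public LoggingPollerRunnable(Runnable pollerBody) {
            this.pollerBody = pollerBody;
        }

        @Override
        public void run() {
            try {
                pollerBody.run();  // original select loop, event processing, timeout() calls
            } catch (Throwable t) {
                // The stock outer catch only handles OOME, so any other Throwable
                // kills the poller thread silently. Make sure it at least gets logged.
                System.err.println("Poller thread is dying:");
                t.printStackTrace();
            }
        }
    }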
I will report back with my findings as soon as the problem arises again.

/Lars



On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
chris@christopherschultz.net> wrote:

>
> Filip,
>
> On 6/27/14, 11:36 AM, Filip Hanik wrote:
> > Are there any log entries that would indicate that the poller
> > thread has died? This/these thread/s start when Tomcat starts. and
> > a stack over flow on a processing thread should never affect the
> > poller thread.
>
> OP reported in the initial post that the thread had disappeared:
>
> On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> > We have no output in tomcat or our logs at the time when this event
> >  occurs. The only sign is when comparing full java thread dump with
> > a dump from a newly launched Tomcat:
> >
> > One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> > is missing/has died.
>
> - -chris

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Filip,

On 6/27/14, 11:36 AM, Filip Hanik wrote:
> Are there any log entries that would indicate that the poller
> thread has died? This/these thread/s start when Tomcat starts. and
> a stack over flow on a processing thread should never affect the
> poller thread.

OP reported in the initial post that the thread had disappeared:

On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> We have no output in tomcat or our logs at the time when this event
>  occurs. The only sign is when comparing full java thread dump with
> a dump from a newly launched Tomcat:
> 
> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> is missing/has died.

-chris


Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Filip Hanik <fi...@hanik.com>.
Are there any log entries that would indicate that the poller thread has
died?
This thread (or these threads) starts when Tomcat starts, and a stack
overflow on a processing thread should never affect the poller thread.

Filip



On Thu, Jun 26, 2014 at 4:01 PM, Christopher Schultz <
chris@christopherschultz.net> wrote:

>
> André,
>
> On 6/26/14, 11:09 AM, André Warnier wrote:
> > Lars Engholm Johansen wrote:
> >> Thanks for all the replies guys.
> >>
> >> Have you observed a performance increase by setting
> >>> acceptorThreadCount to 4 instead of a lower number? I'm just
> >>> curious.
> >>
> >>
> >> No, but this was the consensus after elongated discussions in my
> >> team. We have 12 cpu cores - better save than sorry. I know that
> >> the official docs reads "although you would never really need
> >> more than 2" :-)
> >>
> >> The GC that Andre suggested was to get rid of some of CLOSE_WAIT
> >>> connections in netstat output, in case if those are owned by
> >>> some abandoned and non properly closed I/O classes that are
> >>> still present in JVM memory.
> >>
> >>
> >> Please check out the "open connections" graph at
> >> http://imgur.com/s4fOUte As far as I interpret, we only have a
> >> slight connection count growth during the days until the poller
> >> thread die. These may or may not disappear by forcing a GC, but
> >> the amount is not problematic until we hit the
> >> http-nio-80-ClientPoller-x thread death.
> >
> > Just to make sure : what kind of connections does this graph
> > actually show ? in which TCP state ? does it count only the
> > "established", or also the "FIN_WAIT", "CLOSE_WAIT", "LISTEN" etc..
> > ?
>
> I think the state of the connections is a red herring: Tomcat will
> hold those connections forever because the poller thread has died.
> Nothing else matters.
>
> Even if the CLOSE_WAIT connections were somehow cleared, Tomcat would
> never respond properly to another request, ever. A Tomcat restart is
> required if the poller thread dies.
>
> One could argue that the poller threads should maybe "try harder" not
> to die, but sometimes you can't stop thread death.
>
> - -chris

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Christopher Schultz <ch...@christopherschultz.net>.

André,

On 6/26/14, 11:09 AM, André Warnier wrote:
> Lars Engholm Johansen wrote:
>> Thanks for all the replies guys.
>> 
>> Have you observed a performance increase by setting
>>> acceptorThreadCount to 4 instead of a lower number? I'm just
>>> curious.
>> 
>> 
>> No, but this was the consensus after elongated discussions in my
>> team. We have 12 cpu cores - better save than sorry. I know that
>> the official docs reads "although you would never really need
>> more than 2" :-)
>> 
>> The GC that Andre suggested was to get rid of some of CLOSE_WAIT
>>> connections in netstat output, in case if those are owned by
>>> some abandoned and non properly closed I/O classes that are
>>> still present in JVM memory.
>> 
>> 
>> Please check out the "open connections" graph at
>> http://imgur.com/s4fOUte As far as I interpret, we only have a
>> slight connection count growth during the days until the poller
>> thread die. These may or may not disappear by forcing a GC, but
>> the amount is not problematic until we hit the 
>> http-nio-80-ClientPoller-x thread death.
> 
> Just to make sure : what kind of connections does this graph
> actually show ? in which TCP state ? does it count only the
> "established", or also the "FIN_WAIT", "CLOSE_WAIT", "LISTEN" etc..
> ?

I think the state of the connections is a red herring: Tomcat will
hold those connections forever because the poller thread has died.
Nothing else matters.

Even if the CLOSE_WAIT connections were somehow cleared, Tomcat would
never respond properly to another request, ever. A Tomcat restart is
required if the poller thread dies.

One could argue that the poller threads should maybe "try harder" not
to die, but sometimes you can't stop thread death.

-chris


Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by André Warnier <aw...@ice-sa.com>.
Lars Engholm Johansen wrote:
> Thanks for all the replies guys.
> 
> Have you observed a performance increase by setting
>> acceptorThreadCount to 4 instead of a lower number? I'm just curious.
> 
> 
> No, but this was the consensus after elongated discussions in my team. We
> have 12 cpu cores - better save than sorry. I know that the official docs
> reads "although you would never really need more than 2" :-)
> 
> The GC that Andre suggested was to get rid of some of CLOSE_WAIT
>> connections in netstat output, in case if those are owned by some
>> abandoned and non properly closed I/O classes that are still present
>> in JVM memory.
> 
> 
> Please check out the "open connections" graph at http://imgur.com/s4fOUte
> As far as I interpret, we only have a slight connection count growth during
> the days until the poller thread die. These may or may not disappear by
> forcing a GC, but the amount is not problematic until we hit the
> http-nio-80-ClientPoller-x
> thread death.

Just to make sure: what kind of connections does this graph actually show? In
which TCP state? Does it count only "ESTABLISHED", or also "FIN_WAIT",
"CLOSE_WAIT", "LISTEN", etc.?

> 
> The insidious part is that everything may look fine for a long time (apart
>> from an occasional long list of CLOSE_WAIT connections).  A GC will happen
>> from time to time (*), which will get rid of these connections.  And those
>> CLOSE_WAIT connections do not consume a lot of resources, so you'll never
>> notice.
>> Until at some point, the number of these CLOSE_WAIT connections gets just
>> at the point where the OS can't swallow any more of them, and then you have
>> a big problem.
>> (*) and this is the "insidious squared" part : the smaller the Heap, the
>> more often a GC will happen, so the sooner these CLOSE_WAIT connections
>> will disappear.  Conversely, by increasing the Heap size, you leave more
>> time between GCs, and make the problem more likely to happen.
> 
> 
> You are correct. The bigger the Heap size the rarer a GC will happen - and
> we have set aside 32GiB of ram. But again, referring to my "connection
> count" graph, a missing close in the code does not seem to be the culprit.
> 
> A critical error (java.lang.ThreadDeath,
>> java.lang.VirtualMachineError) will cause death of a thread.
>> A subtype of the latter is java.lang.OutOfMemoryError.
> 
> 
> I just realized that StackOverflowError is also a subclass of
> VirtualMachineError,
> and remembered that we due to company historical reasons had configured the
> JVM stack size to 256KiB (down from the default 1GiB on 64 bit machines).
> This was to support a huge number of threads on limited memory in the past.
> I have now removed the -Xss jvm parameter and are exited if this solves our
> poller thread problems.
> Thanks for the hint, Konstantin.
> 
> I promise to report back to you guys :-)
> 
> 
> 
> On Fri, Jun 20, 2014 at 2:49 AM, Filip Hanik <fi...@hanik.com> wrote:
> 
>> "Our sites still functions normally with no cpu spikes during this build up
>> until around 60,000 connections, but then the server refuses further
>> connections and a manual Tomcat restart is required."
>>
>> yes, the connection limit is a 16 bit short count minus some reserved
>> addresses. So your system should become unresponsive, you've run out of
>> ports (the 16 bit value in a TCP connection).
>>
>> netstat -na should give you your connection state when this happens, and
>> that is helpful debug information.
>>
>> Filip
>>
>>
>>
>>
>> On Thu, Jun 19, 2014 at 2:44 PM, André Warnier <aw...@ice-sa.com> wrote:
>>
>>> Konstantin Kolinko wrote:
>>>
>>>> 2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen <la...@gmail.com>:
>>>>
>>>>> I will try to force a GC next time I am at the console about to
>> restart a
>>>>> Tomcat where one of the http-nio-80-ClientPoller-x threads have died
>> and
>>>>> connection count is exploding.
>>>>>
>>>>> But I do not see this as a solution - can you somehow deduct why this
>>>>> thread died from the outcome from a GC?
>>>>>
>>>> Nobody said that a thread died because of GC.
>>>>
>>>> The GC that Andre suggested was to get rid of some of CLOSE_WAIT
>>>> connections in netstat output, in case if those are owned by some
>>>> abandoned and non properly closed I/O classes that are still present
>>>> in JVM memory.
>>>>
>>> Exactly, thanks Konstantin for clarifying.
>>>
>>> I was going per the following in the original post :
>>>
>>> "Our sites still functions normally with no cpu spikes during this build
>> up
>>> until around 60,000 connections, but then the server refuses further
>>> connections and a manual Tomcat restart is required."
>>>
>>> CLOSE_WAIT is a normal state for a TCP connection, but it should not
>>> normally last long.
>>> It indicates basically that the other side has closed the connection, and
>>> that this side should do the same. But it doesn't, and as long as it
>>> doesn't the connection remains in the CLOSE_WAIT state.  It's like
>>> "half-closed", but not entirely, and as long as it isn't, the OS cannot
>> get
>>> rid of it.
>>> For a more precise explanation, Google for "TCP CLOSE_WAIT state".
>>>
>>> I have noticed in the past, with some Linux versions, that when the
>> number
>>> of such CLOSE_WAIT connections goes above a certain level (several
>>> hundred), the TCP/IP stack can become totally unresponsive and not accept
>>> any new connections at all, on any port.
>>> In my case, this was due to the following kind of scenario :
>>> Some class Xconnection instantiates an object, and upon creation this
>>> object opens a TCP connection to something. This object is now used as an
>>> "alias" for this connection.  Time passes, and finally the object goes
>> out
>>> of scope (e.g. the reference to it is set to "null"), and one may believe
>>> that the underlying connection gets closed as a side-effect.  But it
>>> doesn't, not as long as this object is not actually garbage-collected,
>>> which triggers the actual object destruction and the closing of the
>>> underlying connection.
>>> Forcing a GC is a way to provoke this (and restarting Tomcat another, but
>>> more drastic).
>>>
>>> If a forced GC gets rid of your many CLOSE_WAIT connections and makes
>> your
>>> Tomcat operative again, that would be a sign that something similar to
>> the
>>> above is occurring; and then you would need to look in your application
>> for
>>> the oversight. (e.g. the class should have a "close" method (closing the
>>> underlying connection), which should be invoked before letting the object
>>> go out of scope).
>>>
>>> The insidious part is that everything may look fine for a long time
>> (apart
>>> from an occasional long list of CLOSE_WAIT connections).  A GC will
>> happen
>>> from time to time (*), which will get rid of these connections.  And
>> those
>>> CLOSE_WAIT connections do not consume a lot of resources, so you'll never
>>> notice.
>>> Until at some point, the number of these CLOSE_WAIT connections gets just
>>> at the point where the OS can't swallow any more of them, and then you
>> have
>>> a big problem.
>>>
>>> That sounds a bit like your case, doesn't it ?
>>>
>>> (*) and this is the "insidious squared" part : the smaller the Heap, the
>>> more often a GC will happen, so the sooner these CLOSE_WAIT connections
>>> will disappear.  Conversely, by increasing the Heap size, you leave more
>>> time between GCs, and make the problem more likely to happen.
>>>
>>>
>>> I believe that the rest below may be either a consequence, or a red
>>> herring, and I would first eliminate the above as a cause.
>>>
>>>
>>>
>>>>  And could an Exception/Error in Tomcat thread
>>  http-nio-80-ClientPoller-0
>>>>>  or  http-nio-80-ClientPoller-1  make the thread die with no Stacktrace
>>>>> in
>>>>> the Tomcat logs?
>>>>>
>>>>>
>>>> A critical error (java.lang.ThreadDeath,
>>>> java.lang.VirtualMachineError) will cause death of a thread.
>>>>
>>>> A subtype of the latter is java.lang.OutOfMemoryError.
>>>>
>>>> As of now, such errors are passed through and are not logged by
>>>> Tomcat, but are logged by java.lang.ThreadGroup.uncaughtException().
>>>> ThreadGroup prints them to System.err (catalina.out).
>>>>
>>>>
>>>> Best regards,
>>>> Konstantin Kolinko


Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Lars,

On 6/26/14, 9:56 AM, Lars Engholm Johansen wrote:
> Thanks for all the replies guys.
> 
> Have you observed a performance increase by setting
>> acceptorThreadCount to 4 instead of a lower number? I'm just
>> curious.
> 
> 
> No, but this was the consensus after elongated discussions in my
> team. We have 12 cpu cores - better save than sorry. I know that
> the official docs reads "although you would never really need more
> than 2" :-)

Okay. You might want to do some actual benchmarking. You may find that
more contention for an exclusive lock actually /decreases/ performance.

>> The GC that Andre suggested was to get rid of some of CLOSE_WAIT 
>> connections in netstat output, in case if those are owned by
>> some abandoned and non properly closed I/O classes that are still
>> present in JVM memory.
> 
> 
> Please check out the "open connections" graph at
> http://imgur.com/s4fOUte As far as I interpret, we only have a
> slight connection count growth during the days until the poller
> thread die. These may or may not disappear by forcing a GC, but the
> amount is not problematic until we hit the 
> http-nio-80-ClientPoller-x thread death.

Like I said, when the poller thread(s) die, you are totally screwed.

> The insidious part is that everything may look fine for a long time
> (apart
>> from an occasional long list of CLOSE_WAIT connections).  A GC
>> will happen from time to time (*), which will get rid of these
>> connections.  And those CLOSE_WAIT connections do not consume a
>> lot of resources, so you'll never notice. Until at some point,
>> the number of these CLOSE_WAIT connections gets just at the point
>> where the OS can't swallow any more of them, and then you have a
>> big problem. (*) and this is the "insidious squared" part : the
>> smaller the Heap, the more often a GC will happen, so the sooner
>> these CLOSE_WAIT connections will disappear.  Conversely, by
>> increasing the Heap size, you leave more time between GCs, and
>> make the problem more likely to happen.
> 
> 
> You are correct. The bigger the Heap size the rarer a GC will
> happen - and we have set aside 32GiB of ram. But again, referring
> to my "connection count" graph, a missing close in the code does
> not seem to be the culprit.
> 
> A critical error (java.lang.ThreadDeath,
>> java.lang.VirtualMachineError) will cause death of a thread. A
>> subtype of the latter is java.lang.OutOfMemoryError.
> 
> 
> I just realized that StackOverflowError is also a subclass of 
> VirtualMachineError, and remembered that we due to company
> historical reasons had configured the JVM stack size to 256KiB
> (down from the default 1GiB on 64 bit machines). This was to
> support a huge number of threads on limited memory in the past. I
> have now removed the -Xss jvm parameter and are exited if this
> solves our poller thread problems. Thanks for the hint,
> Konstantin.

Definitely let us know. A StackOverflowError should be relatively
rare, but if you have set your stack size to something very low, this
can happen.

Remember, since you are using the NIO connector, you don't need a huge
number of threads to support a huge number of connections. The total
stack memory you commit scales with the number of threads you keep active.

Your connector sets maxThreads="300", so you can have up to 300
request-processor threads (the default would have been 200).

The default stack size for Oracle's HotSpot JVM is 1MiB, not 1GiB.
A few hundred threads at 1MiB each is only a few hundred MiB of stack
space in a 64-bit process, which shouldn't be too much for your JVM,
and going back to the default will hopefully cut down on your stack
problems.

Do you have a lot of recursive algorithms, or anything that calls
very-deep stacks?
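
If you want to see how much head-room a given -Xss leaves you, a throwaway
probe along these lines (just a sketch, not Tomcat code) makes the effect
visible; run it with -Xss256k and again with the default and compare the
depths reached:

    // Throwaway probe: recurse until the stack blows and report the depth.
    public class StackDepthProbe {

        private static int depth = 0;

        private static void recurse() {
            depth++;
            recurse();
        }

        public static void main(String[] args) {
            try {
                recurse();
            } catch (StackOverflowError expected) {
                // e.g. "java -Xss256k StackDepthProbe" vs. plain "java StackDepthProbe"
                System.out.println("Stack overflowed at depth " + depth);
            }
        }
    }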

-chris



Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Lars Engholm Johansen <la...@gmail.com>.
Thanks for all the replies guys.

Have you observed a performance increase by setting
> acceptorThreadCount to 4 instead of a lower number? I'm just curious.


No, but this was the consensus after lengthy discussions in my team. We
have 12 CPU cores - better safe than sorry. I know that the official docs
read "although you would never really need more than 2" :-)

The GC that Andre suggested was to get rid of some of CLOSE_WAIT
> connections in netstat output, in case if those are owned by some
> abandoned and non properly closed I/O classes that are still present
> in JVM memory.


Please check out the "open connections" graph at http://imgur.com/s4fOUte
As far as I interpret, we only have a slight connection count growth during
the days until the poller thread dies. These may or may not disappear by
forcing a GC, but the amount is not problematic until we hit the
http-nio-80-ClientPoller-x
thread death.
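
For reference, the per-minute sample behind that graph is essentially a loop
over the connectionCount attribute of the ThreadPool MBeans - something along
the lines of this sketch, where the ObjectName pattern and the alert threshold
are only illustrative and the code is assumed to run inside the Tomcat JVM:

    import java.lang.management.ManagementFactory;
    import java.util.Set;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Sample the connectionCount attribute of every ThreadPool MBean and
    // complain when it looks abnormal. The 20000 threshold is a placeholder.
    public class ConnectionCountCheck {

        // Call this once a minute from inside the Tomcat JVM.
        public static void check() throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            Set<ObjectName> pools =
                    server.queryNames(new ObjectName("*:type=ThreadPool,*"), null);
            for (ObjectName pool : pools) {
                long count = ((Number) server.getAttribute(pool, "connectionCount")).longValue();
                if (count > 20000) {
                    System.err.println("Suspicious connection count on " + pool + ": " + count);
                }
            }
        }
    }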

The insidious part is that everything may look fine for a long time (apart
> from an occasional long list of CLOSE_WAIT connections).  A GC will happen
> from time to time (*), which will get rid of these connections.  And those
> CLOSE_WAIT connections do not consume a lot of resources, so you'll never
> notice.
> Until at some point, the number of these CLOSE_WAIT connections gets just
> at the point where the OS can't swallow any more of them, and then you have
> a big problem.
> (*) and this is the "insidious squared" part : the smaller the Heap, the
> more often a GC will happen, so the sooner these CLOSE_WAIT connections
> will disappear.  Conversely, by increasing the Heap size, you leave more
> time between GCs, and make the problem more likely to happen.


You are correct. The bigger the Heap size, the less often a GC will happen - and
we have set aside 32GiB of RAM. But again, referring to my "connection
count" graph, a missing close in the code does not seem to be the culprit.

A critical error (java.lang.ThreadDeath,
> java.lang.VirtualMachineError) will cause death of a thread.
> A subtype of the latter is java.lang.OutOfMemoryError.


I just realized that StackOverflowError is also a subclass of
VirtualMachineError, and remembered that, for historical reasons at our
company, we had configured the JVM stack size to 256KiB (down from the
default 1GiB on 64 bit machines). This was to support a huge number of
threads on limited memory in the past.
I have now removed the -Xss JVM parameter and am excited to see if this
solves our poller thread problems.
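
Should we ever need the small stacks again for our own bulk worker threads, I
believe it can be done per thread instead of via a global -Xss, roughly as in
the sketch below - the 256KiB figure and the thread name are only illustrative:

    // Give only our own mass-produced worker threads a small stack, instead of
    // lowering -Xss for the whole JVM (which also shrinks Tomcat's poller and
    // request threads). The stackSize argument is only a hint to the JVM.
    public class SmallStackWorker {

        public static void main(String[] args) {
            Runnable task = new Runnable() {
                public void run() {
                    // worker logic here
                }
            };
            Thread worker = new Thread(null, task, "small-stack-worker", 256 * 1024);
            worker.start();
        }
    }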
Thanks for the hint, Konstantin.

I promise to report back to you guys :-)



On Fri, Jun 20, 2014 at 2:49 AM, Filip Hanik <fi...@hanik.com> wrote:

> "Our sites still functions normally with no cpu spikes during this build up
> until around 60,000 connections, but then the server refuses further
> connections and a manual Tomcat restart is required."
>
> yes, the connection limit is a 16 bit short count minus some reserved
> addresses. So your system should become unresponsive, you've run out of
> ports (the 16 bit value in a TCP connection).
>
> netstat -na should give you your connection state when this happens, and
> that is helpful debug information.
>
> Filip
>
>
>
>
> On Thu, Jun 19, 2014 at 2:44 PM, André Warnier <aw...@ice-sa.com> wrote:
>
> > Konstantin Kolinko wrote:
> >
> >> 2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen <la...@gmail.com>:
> >>
> >>> I will try to force a GC next time I am at the console about to
> restart a
> >>> Tomcat where one of the http-nio-80-ClientPoller-x threads have died
> and
> >>> connection count is exploding.
> >>>
> >>> But I do not see this as a solution - can you somehow deduct why this
> >>> thread died from the outcome from a GC?
> >>>
> >>
> >> Nobody said that a thread died because of GC.
> >>
> >> The GC that Andre suggested was to get rid of some of CLOSE_WAIT
> >> connections in netstat output, in case if those are owned by some
> >> abandoned and non properly closed I/O classes that are still present
> >> in JVM memory.
> >>
> >
> > Exactly, thanks Konstantin for clarifying.
> >
> > I was going per the following in the original post :
> >
> > "Our sites still functions normally with no cpu spikes during this build
> up
> > until around 60,000 connections, but then the server refuses further
> > connections and a manual Tomcat restart is required."
> >
> > CLOSE_WAIT is a normal state for a TCP connection, but it should not
> > normally last long.
> > It indicates basically that the other side has closed the connection, and
> > that this side should do the same. But it doesn't, and as long as it
> > doesn't the connection remains in the CLOSE_WAIT state.  It's like
> > "half-closed", but not entirely, and as long as it isn't, the OS cannot
> get
> > rid of it.
> > For a more precise explanation, Google for "TCP CLOSE_WAIT state".
> >
> > I have noticed in the past, with some Linux versions, that when the
> number
> > of such CLOSE_WAIT connections goes above a certain level (several
> > hundred), the TCP/IP stack can become totally unresponsive and not accept
> > any new connections at all, on any port.
> > In my case, this was due to the following kind of scenario :
> > Some class Xconnection instantiates an object, and upon creation this
> > object opens a TCP connection to something. This object is now used as an
> > "alias" for this connection.  Time passes, and finally the object goes
> out
> > of scope (e.g. the reference to it is set to "null"), and one may believe
> > that the underlying connection gets closed as a side-effect.  But it
> > doesn't, not as long as this object is not actually garbage-collected,
> > which triggers the actual object destruction and the closing of the
> > underlying connection.
> > Forcing a GC is a way to provoke this (and restarting Tomcat another, but
> > more drastic).
> >
> > If a forced GC gets rid of your many CLOSE_WAIT connections and makes
> your
> > Tomcat operative again, that would be a sign that something similar to
> the
> > above is occurring; and then you would need to look in your application
> for
> > the oversight. (e.g. the class should have a "close" method (closing the
> > underlying connection), which should be invoked before letting the object
> > go out of scope).
> >
> > The insidious part is that everything may look fine for a long time
> (apart
> > from an occasional long list of CLOSE_WAIT connections).  A GC will
> happen
> > from time to time (*), which will get rid of these connections.  And
> those
> > CLOSE_WAIT connections do not consume a lot of resources, so you'll never
> > notice.
> > Until at some point, the number of these CLOSE_WAIT connections gets just
> > at the point where the OS can't swallow any more of them, and then you
> have
> > a big problem.
> >
> > That sounds a bit like your case, doesn't it ?
> >
> > (*) and this is the "insidious squared" part : the smaller the Heap, the
> > more often a GC will happen, so the sooner these CLOSE_WAIT connections
> > will disappear.  Conversely, by increasing the Heap size, you leave more
> > time between GCs, and make the problem more likely to happen.
> >
> >
> > I believe that the rest below may be either a consequence, or a red
> > herring, and I would first eliminate the above as a cause.
> >
> >
> >
> >>  And could an Exception/Error in Tomcat thread
>  http-nio-80-ClientPoller-0
> >>>  or  http-nio-80-ClientPoller-1  make the thread die with no Stacktrace
> >>> in
> >>> the Tomcat logs?
> >>>
> >>>
> >> A critical error (java.lang.ThreadDeath,
> >> java.lang.VirtualMachineError) will cause death of a thread.
> >>
> >> A subtype of the latter is java.lang.OutOfMemoryError.
> >>
> >> As of now, such errors are passed through and are not logged by
> >> Tomcat, but are logged by java.lang.ThreadGroup.uncaughtException().
> >> ThreadGroup prints them to System.err (catalina.out).
> >>
> >>
> >> Best regards,
> >> Konstantin Kolinko
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> >> For additional commands, e-mail: users-help@tomcat.apache.org
> >>
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> > For additional commands, e-mail: users-help@tomcat.apache.org
> >
> >
>

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Filip Hanik <fi...@hanik.com>.
"Our sites still functions normally with no cpu spikes during this build up
until around 60,000 connections, but then the server refuses further
connections and a manual Tomcat restart is required."

Yes, the connection limit is a 16-bit port count (65,536) minus some reserved
ports, so it makes sense that your system becomes unresponsive: you have run
out of ports (the port number is a 16-bit field in a TCP connection).

netstat -na should give you your connection state when this happens, and
that is helpful debug information.
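
If you want a quick summary rather than the raw listing, a rough sketch like
the one below (it just shells out to netstat and tallies the last column of
the tcp lines, assuming a Linux netstat) can be run when the build-up starts:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    // Quick-and-dirty tally of TCP connection states from "netstat -na".
    public class NetstatStateCount {

        public static void main(String[] args) throws Exception {
            Process p = new ProcessBuilder("netstat", "-na").start();
            BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
            Map<String, Integer> counts = new HashMap<String, Integer>();
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.startsWith("tcp")) {
                    continue;   // skip headers, udp and unix sockets
                }
                String[] fields = line.trim().split("\\s+");
                String state = fields[fields.length - 1];   // e.g. ESTABLISHED, CLOSE_WAIT
                Integer n = counts.get(state);
                counts.put(state, n == null ? 1 : n + 1);
            }
            p.waitFor();
            System.out.println(counts);
        }
    }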

Filip




On Thu, Jun 19, 2014 at 2:44 PM, André Warnier <aw...@ice-sa.com> wrote:

> Konstantin Kolinko wrote:
>
>> 2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen <la...@gmail.com>:
>>
>>> I will try to force a GC next time I am at the console about to restart a
>>> Tomcat where one of the http-nio-80-ClientPoller-x threads have died and
>>> connection count is exploding.
>>>
>>> But I do not see this as a solution - can you somehow deduct why this
>>> thread died from the outcome from a GC?
>>>
>>
>> Nobody said that a thread died because of GC.
>>
>> The GC that Andre suggested was to get rid of some of CLOSE_WAIT
>> connections in netstat output, in case if those are owned by some
>> abandoned and non properly closed I/O classes that are still present
>> in JVM memory.
>>
>
> Exactly, thanks Konstantin for clarifying.
>
> I was going per the following in the original post :
>
> "Our sites still functions normally with no cpu spikes during this build up
> until around 60,000 connections, but then the server refuses further
> connections and a manual Tomcat restart is required."
>
> CLOSE_WAIT is a normal state for a TCP connection, but it should not
> normally last long.
> It indicates basically that the other side has closed the connection, and
> that this side should do the same. But it doesn't, and as long as it
> doesn't the connection remains in the CLOSE_WAIT state.  It's like
> "half-closed", but not entirely, and as long as it isn't, the OS cannot get
> rid of it.
> For a more precise explanation, Google for "TCP CLOSE_WAIT state".
>
> I have noticed in the past, with some Linux versions, that when the number
> of such CLOSE_WAIT connections goes above a certain level (several
> hundred), the TCP/IP stack can become totally unresponsive and not accept
> any new connections at all, on any port.
> In my case, this was due to the following kind of scenario :
> Some class Xconnection instantiates an object, and upon creation this
> object opens a TCP connection to something. This object is now used as an
> "alias" for this connection.  Time passes, and finally the object goes out
> of scope (e.g. the reference to it is set to "null"), and one may believe
> that the underlying connection gets closed as a side-effect.  But it
> doesn't, not as long as this object is not actually garbage-collected,
> which triggers the actual object destruction and the closing of the
> underlying connection.
> Forcing a GC is a way to provoke this (and restarting Tomcat another, but
> more drastic).
>
> If a forced GC gets rid of your many CLOSE_WAIT connections and makes your
> Tomcat operative again, that would be a sign that something similar to the
> above is occurring; and then you would need to look in your application for
> the oversight. (e.g. the class should have a "close" method (closing the
> underlying connection), which should be invoked before letting the object
> go out of scope).
>
> The insidious part is that everything may look fine for a long time (apart
> from an occasional long list of CLOSE_WAIT connections).  A GC will happen
> from time to time (*), which will get rid of these connections.  And those
> CLOSE_WAIT connections do not consume a lot of resources, so you'll never
> notice.
> Until at some point, the number of these CLOSE_WAIT connections gets just
> at the point where the OS can't swallow any more of them, and then you have
> a big problem.
>
> That sounds a bit like your case, doesn't it ?
>
> (*) and this is the "insidious squared" part : the smaller the Heap, the
> more often a GC will happen, so the sooner these CLOSE_WAIT connections
> will disappear.  Conversely, by increasing the Heap size, you leave more
> time between GCs, and make the problem more likely to happen.
>
>
> I believe that the rest below may be either a consequence, or a red
> herring, and I would first eliminate the above as a cause.
>
>
>
>>  And could an Exception/Error in Tomcat thread  http-nio-80-ClientPoller-0
>>>  or  http-nio-80-ClientPoller-1  make the thread die with no Stacktrace
>>> in
>>> the Tomcat logs?
>>>
>>>
>> A critical error (java.lang.ThreadDeath,
>> java.lang.VirtualMachineError) will cause death of a thread.
>>
>> A subtype of the latter is java.lang.OutOfMemoryError.
>>
>> As of now, such errors are passed through and are not logged by
>> Tomcat, but are logged by java.lang.ThreadGroup.uncaughtException().
>> ThreadGroup prints them to System.err (catalina.out).
>>
>>
>> Best regards,
>> Konstantin Kolinko
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
>> For additional commands, e-mail: users-help@tomcat.apache.org
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
>
>

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by André Warnier <aw...@ice-sa.com>.
Konstantin Kolinko wrote:
> 2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen <la...@gmail.com>:
>> I will try to force a GC next time I am at the console about to restart a
>> Tomcat where one of the http-nio-80-ClientPoller-x threads have died and
>> connection count is exploding.
>>
>> But I do not see this as a solution - can you somehow deduct why this
>> thread died from the outcome from a GC?
> 
> Nobody said that a thread died because of GC.
> 
> The GC that Andre suggested was to get rid of some of CLOSE_WAIT
> connections in netstat output, in case if those are owned by some
> abandoned and non properly closed I/O classes that are still present
> in JVM memory.

Exactly, thanks Konstantin for clarifying.

I was going per the following in the original post :
"Our sites still functions normally with no cpu spikes during this build up
until around 60,000 connections, but then the server refuses further
connections and a manual Tomcat restart is required."

CLOSE_WAIT is a normal state for a TCP connection, but it should not normally last long.
It indicates basically that the other side has closed the connection, and that this side 
should do the same. But it doesn't, and as long as it doesn't the connection remains in 
the CLOSE_WAIT state.  It's like "half-closed", but not entirely, and as long as it isn't, 
the OS cannot get rid of it.
For a more precise explanation, Google for "TCP CLOSE_WAIT state".

I have noticed in the past, with some Linux versions, that when the number of such 
CLOSE_WAIT connections goes above a certain level (several hundred), the TCP/IP stack can 
become totally unresponsive and not accept any new connections at all, on any port.
In my case, this was due to the following kind of scenario :
Some class - call it Xconnection - opens a TCP connection to something when it is
instantiated, and the resulting object is then used as an "alias" for this connection.  Time
passes, and finally the object goes out of scope (e.g. the reference to it is set to
"null"), and one may believe that the underlying connection gets closed as a side-effect.
But it doesn't, not as long as this object is not actually garbage-collected, which
triggers the actual object destruction and the closing of the underlying connection.
Forcing a GC is a way to provoke this (and restarting Tomcat another, but more drastic).

If a forced GC gets rid of your many CLOSE_WAIT connections and makes your Tomcat 
operative again, that would be a sign that something similar to the above is occurring; 
and then you would need to look in your application for the oversight. (e.g. the class 
should have a "close" method (closing the underlying connection), which should be invoked 
before letting the object go out of scope).
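
Schematically, the difference between the leaky pattern and the fix looks
something like the sketch below - Xconnection and the other names are purely
illustrative:

    import java.io.Closeable;
    import java.io.IOException;
    import java.net.Socket;

    // Illustrative only: a wrapper class that owns a TCP connection.
    class Xconnection implements Closeable {

        private final Socket socket;

        Xconnection(String host, int port) throws IOException {
            this.socket = new Socket(host, port);
        }

        // The important part: an explicit close that releases the socket.
        public void close() throws IOException {
            socket.close();
        }
    }

    class Caller {

        void leaky() throws IOException {
            Xconnection c = new Xconnection("example.org", 80);
            // ... use c ...
            c = null;  // if the peer already closed, this socket sits in CLOSE_WAIT
                       // until a GC finalizes it
        }

        void correct() throws IOException {
            Xconnection c = new Xconnection("example.org", 80);
            try {
                // ... use c ...
            } finally {
                c.close();  // released immediately, nothing left for the GC to clean up
            }
        }
    }

The point is only that the close happens deterministically, instead of whenever
the collector gets around to the object.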

The insidious part is that everything may look fine for a long time (apart from an 
occasional long list of CLOSE_WAIT connections).  A GC will happen from time to time (*), 
which will get rid of these connections.  And those CLOSE_WAIT connections do not consume 
a lot of resources, so you'll never notice.
Until at some point, the number of these CLOSE_WAIT connections gets just at the point 
where the OS can't swallow any more of them, and then you have a big problem.

That sounds a bit like your case, doesn't it ?

(*) and this is the "insidious squared" part : the smaller the Heap, the more often a GC 
will happen, so the sooner these CLOSE_WAIT connections will disappear.  Conversely, by 
increasing the Heap size, you leave more time between GCs, and make the problem more 
likely to happen.


I believe that the rest below may be either a consequence, or a red herring, and I would 
first eliminate the above as a cause.

> 
>> And could an Exception/Error in Tomcat thread  http-nio-80-ClientPoller-0
>>  or  http-nio-80-ClientPoller-1  make the thread die with no Stacktrace in
>> the Tomcat logs?
>>
> 
> A critical error (java.lang.ThreadDeath,
> java.lang.VirtualMachineError) will cause death of a thread.
> 
> A subtype of the latter is java.lang.OutOfMemoryError.
> 
> As of now, such errors are passed through and are not logged by
> Tomcat, but are logged by java.lang.ThreadGroup.uncaughtException().
> ThreadGroup prints them to System.err (catalina.out).
> 
> 
> Best regards,
> Konstantin Kolinko
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
> 
> 




Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Konstantin Kolinko <kn...@gmail.com>.
2014-06-19 17:10 GMT+04:00 Lars Engholm Johansen <la...@gmail.com>:
> I will try to force a GC next time I am at the console about to restart a
> Tomcat where one of the http-nio-80-ClientPoller-x threads has died and
> connection count is exploding.
>
> But I do not see this as a solution - can you somehow deduce why this
> thread died from the outcome of a GC?

Nobody said that a thread died because of GC.

The GC that Andre suggested was to get rid of some of CLOSE_WAIT
connections in the netstat output, in case those are owned by some
abandoned and not properly closed I/O objects that are still present
in JVM memory.

> And could an Exception/Error in Tomcat thread  http-nio-80-ClientPoller-0
>  or  http-nio-80-ClientPoller-1  make the thread die with no stack trace in
> the Tomcat logs?
>

A critical error (java.lang.ThreadDeath,
java.lang.VirtualMachineError) will cause death of a thread.

A subtype of the latter is java.lang.OutOfMemoryError.

As of now, such errors are passed through and are not logged by
Tomcat, but are logged by java.lang.ThreadGroup.uncaughtException().
ThreadGroup prints them to System.err (catalina.out).
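
If you want such errors in your own log rather than only in catalina.out,
installing a default handler early (for example from a ServletContextListener)
should work, since ThreadGroup.uncaughtException() falls back to it when one
is set. A minimal sketch:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Route otherwise-silent thread deaths (OutOfMemoryError,
    // StackOverflowError, ...) through our own logger. ThreadGroup's
    // uncaughtException() delegates to this handler when it is installed.
    public class ThreadDeathLogger implements Thread.UncaughtExceptionHandler {

        private static final Logger LOG = Logger.getLogger("thread-death");

        public void uncaughtException(Thread t, Throwable e) {
            LOG.log(Level.SEVERE, "Thread " + t.getName() + " died", e);
        }

        public static void install() {
            Thread.setDefaultUncaughtExceptionHandler(new ThreadDeathLogger());
        }
    }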


Best regards,
Konstantin Kolinko



Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Lars Engholm Johansen <la...@gmail.com>.
I will try to force a GC next time I am at the console about to restart a
Tomcat where one of the http-nio-80-ClientPoller-x threads has died and the
connection count is exploding.

But I do not see this as a solution - can you somehow deduce why this
thread died from the outcome of a GC?
And could an Exception/Error in Tomcat thread http-nio-80-ClientPoller-0
or http-nio-80-ClientPoller-1 make the thread die with no stack trace in
the Tomcat logs?

(This problem is quite critical for our production environment)
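
For the record, forcing the GC without a console session should also be
possible over JMX - roughly as in the sketch below, assuming remote JMX is
enabled on the target JVM; the port and URL are placeholders:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Trigger a full GC in a remote Tomcat via its Memory MXBean.
    // Assumes the JVM was started with com.sun.management.jmxremote.port=9999.
    public class RemoteGc {

        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                        conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
                memory.gc();   // equivalent to calling System.gc() inside the target JVM
            } finally {
                connector.close();
            }
        }
    }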


On Mon, Jun 16, 2014 at 5:56 PM, André Warnier <aw...@ice-sa.com> wrote:

> Lars Engholm Johansen wrote:
>
>> Our company are running several Tomcat 7.0.52 high volume Ubuntu 12.04
>> production servers.
>> We are using Tomcat WebSockets (JSR356 implementation) heavily with 100M
>> text messages (100GiB) per day.
>>
>> We monitor webserver health by measuring several key parameters every
>> minute, including tomcat connection count using:
>>
>>     mBeanServer.getAttribute(threadPool, "connectionCount");     //
>> threadPool is MBean of "type=ThreadPool"
>>
>> ______The problem______
>>
>> Our sites can run for days without problems, but once in a while the
>> tomcat
>> connection count suddenly starts growing abnormally fast.
>> See this graph:  http://imgur.com/s4fOUte
>> netstat shows these TCP connections to be mostly in CLOSE_WAIT state.
>>
>
> And if at that moment, you force the JVM that runs Tomcat to do a Garbage
> Collection, do you still have these numerous connections in CLOSE_WAIT
> state after the GC completed ?
>
>
>
>> Our sites still functions normally with no cpu spikes during this build up
>> until around 60,000 connections, but then the server refuses further
>> connections and a manual Tomcat restart is required.
>>
>> We have no output in tomcat or our logs at the time when this event
>> occurs.
>> The only sign is when comparing full java thread dump with a dump from a
>> newly launched Tomcat:
>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1  is
>> missing/has died.
>>
>> We have observed this problem at least since Tomcat 7.0.48 and can not
>> find
>> indications in Tomcat 7.0.x change logs that it should have been fixed in
>> newer releases.
>>
>> Any help or advises are appreciated,
>> Best regards,
>> Lars Engholm Johansen
>>
>>
>> Our connector configuration:
>> <Connector
>> acceptCount="1500"
>> acceptorThreadCount="4"
>> asyncTimeout="100000"
>> connectionTimeout="60000"
>> connectionUploadTimeout="120000"
>> disableUploadTimeout="false"
>> enableLookups="false"
>> keepAliveTimeout="120000"
>> maxConnections="100000"
>> maxPostSize="3000000"
>> maxThreads="300"
>> port="80"
>> protocol="org.apache.coyote.http11.Http11NioProtocol"
>> socket.soKeepAlive="true"
>> />
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
>
>

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by André Warnier <aw...@ice-sa.com>.
Lars Engholm Johansen wrote:
> Our company are running several Tomcat 7.0.52 high volume Ubuntu 12.04
> production servers.
> We are using Tomcat WebSockets (JSR356 implementation) heavily with 100M
> text messages (100GiB) per day.
> 
> We monitor webserver health by measuring several key parameters every
> minute, including tomcat connection count using:
> 
>     mBeanServer.getAttribute(threadPool, "connectionCount");     //
> threadPool is MBean of "type=ThreadPool"
> 
> ______The problem______
> 
> Our sites can run for days without problems, but once in a while the tomcat
> connection count suddenly starts growing abnormally fast.
> See this graph:  http://imgur.com/s4fOUte
> netstat shows these TCP connections to be mostly in CLOSE_WAIT state.

And if at that moment, you force the JVM that runs Tomcat to do a Garbage Collection, do 
you still have these numerous connections in CLOSE_WAIT state after the GC completed ?

> 
> Our sites still functions normally with no cpu spikes during this build up
> until around 60,000 connections, but then the server refuses further
> connections and a manual Tomcat restart is required.
> 
> We have no output in tomcat or our logs at the time when this event occurs.
> The only sign is when comparing full java thread dump with a dump from a
> newly launched Tomcat:
> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1  is
> missing/has died.
> 
> We have observed this problem at least since Tomcat 7.0.48 and can not find
> indications in Tomcat 7.0.x change logs that it should have been fixed in
> newer releases.
> 
> Any help or advises are appreciated,
> Best regards,
> Lars Engholm Johansen
> 
> 
> Our connector configuration:
> <Connector
> acceptCount="1500"
> acceptorThreadCount="4"
> asyncTimeout="100000"
> connectionTimeout="60000"
> connectionUploadTimeout="120000"
> disableUploadTimeout="false"
> enableLookups="false"
> keepAliveTimeout="120000"
> maxConnections="100000"
> maxPostSize="3000000"
> maxThreads="300"
> port="80"
> protocol="org.apache.coyote.http11.Http11NioProtocol"
> socket.soKeepAlive="true"
> />
> 




Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Lars,

On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> Our sites can run for days without problems, but once in a while
> the tomcat connection count suddenly starts growing abnormally
> fast. See this graph:  http://imgur.com/s4fOUte netstat shows these
> TCP connections to be mostly in CLOSE_WAIT state.
> 
> Our sites still functions normally with no cpu spikes during this
> build up until around 60,000 connections, but then the server
> refuses further connections and a manual Tomcat restart is
> required.
> 
> We have no output in tomcat or our logs at the time when this event
> occurs. The only sign is when comparing full java thread dump with
> a dump from a newly launched Tomcat: One of
> http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1  is 
> missing/has died.

If the poller thread has died, you are completely SOL. Is there no
indication in any log why the poller thread died? Particularly, check
catalina.out for OOME or an exception that might kill such a thread.

Can you configure your logger to increase the log level for the poller
component? Actually, the NIO Poller only contains log.error messages
except for 2 @ debug that indicate encountering JDK 1.5 bugs, so
changing the log level probably won't help. The only other thing that
looks like it might cause complications would be an OOME, which should
definitely show up in the logs.
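
As a stop-gap, you could also watch for the condition directly by checking
that the poller threads are still alive - a rough sketch, assuming the default
thread naming of the NIO connector on port 80, run inside the Tomcat JVM
(e.g. from a scheduled task in the webapp):

    // Crude liveness check: the NIO connector on port 80 normally has two
    // poller threads named http-nio-80-ClientPoller-0/-1. If one is gone,
    // raise an alarm before the connection count explodes.
    public class PollerWatch {

        // Call this periodically from inside the Tomcat JVM.
        public static void check() {
            int pollers = 0;
            for (Thread t : Thread.getAllStackTraces().keySet()) {
                if (t.getName().startsWith("http-nio-80-ClientPoller")) {
                    pollers++;
                }
            }
            if (pollers < 2) {
                System.err.println("Only " + pollers + " ClientPoller thread(s) alive!");
            }
        }
    }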

> Our connector configuration: <Connector acceptCount="1500" 
> acceptorThreadCount="4"

Have you observed a performance increase by setting
acceptorThreadCount to 4 instead of a lower number? I'm just curious.

-chris
