Posted to users@tomcat.apache.org by Lars Engholm Johansen <la...@gmail.com> on 2014/10/06 11:11:33 UTC

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Hi all,

I have good news as I have identified the reason for the devastating
NioEndpoint.Poller thread death:

In rare circumstances a ConcurrentModificationException can occur in the Poller's
connection timeout handling called from OUTSIDE the try-catch(Throwable) of
Poller.run()

java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
        at java.util.HashMap$KeyIterator.next(HashMap.java:956)
        at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
        at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
        at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
        at java.lang.Thread.run(Thread.java:745)

Somehow the Poller's Selector object gets modified from another thread.
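
Looking closer at the trace, selector.keys() hands back an unmodifiable view
over a HashSet-backed key set (note the HashMap$HashIterator frame), and those
iterators fail fast. Here is a deterministic toy illustration of the same
failure, nothing Tomcat-specific (class and element names are made up):

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    public class KeySetCmeDemo {
        public static void main(String[] args) {
            Set<String> backing = new HashSet<String>();
            backing.add("key-1");
            backing.add("key-2");

            // The same shape as selector.keys(): an unmodifiable wrapper
            // whose iterator delegates to the backing HashSet's iterator.
            Set<String> view = Collections.unmodifiableSet(backing);

            for (String ignored : view) {
                // Any structural change to the backing set during iteration
                // (in the Poller's case, from another thread) fails fast:
                backing.add("key-3"); // -> ConcurrentModificationException
            }
        }
    }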

As a remedy until fixed properly by the Tomcat team, I have added a
try-catch(ConcurrentModificationException) surrounding the for loop in
Poller.timeout().
That way, in case of the rare problem, a full iteration of the Selector
will be retried in the next call to Poller.timeout().
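
The patch itself is tiny. Roughly (quoting from memory and eliding the
unchanged per-key handling; the signature is that of the 7.0.55 NioEndpoint):

    // Poller.timeout(), with the whole key-set walk guarded so that a rare
    // ConcurrentModificationException can no longer escape into run() and
    // kill the poller thread.
    protected void timeout(int keyCount, boolean hasEvents) {
        try {
            for (SelectionKey key : selector.keys()) {
                // ... unchanged per-key timeout handling, including the
                // existing catch of CancelledKeyException ...
            }
        } catch (ConcurrentModificationException cme) {
            // Abandon this pass; every key is examined again on the next
            // call, so at worst a timeout fires one pass late.
            log.warn("CME in Poller.timeout(), retrying on next pass", cme);
        }
    }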

I am really happy now as all our production servers have been rock stable
for two weeks now.
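
For completeness, the instrumentation that surfaced the exception in the
first place, i.e. the logging try-catch around the whole of Poller.run()
mentioned in the quoted mails below, was along these lines (a reconstructed
sketch; runLoop() is just a stand-in for the original method body):

    @Override
    public void run() {
        try {
            runLoop(); // stand-in for the unmodified body of Poller.run()
        } catch (Throwable t) {
            // The stock outer guard only catches OOME, so anything else
            // used to kill the poller thread without a line in the logs.
            log.error("Poller thread exiting on uncaught throwable", t);
        }
    }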

Best regards to all,
Lars Engholm Johansen


On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:

> Thanks Lars, if you are indeed experiencing an uncaught error, let us know
> what it is.
>
> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <la...@gmail.com>
> wrote:
>
> > Thanks guys for all the feedback.
> >
> > I have tried the following suggested tasks:
> >
> >    - Upgrading Tomcat to the newest 7.0.55 on all our servers -> Problem
> >      still persists
> >    - Forcing a System.gc() when the connection count runs away ->
> >      Connection count does not drop
> >    - Lowering the log level of the NioEndpoint class that contains the
> >      Poller code -> No info in any Tomcat logs about why the poller thread
> >      exits
> >    - Reverting the JVM stack size per thread to the default, as discussed
> >      previously -> Problem still persists
> >
> > I have now checked out the NioEndpoint source code and recompiled it
> with a
> > logging try-catch surrounding the whole of the Poller.run()
> implementation
> > as I noticed that the outer try-catch here only catches OOME.
> > I will report back with my findings as soon as the problem arises again.
> >
> > /Lars
> >
> >
> >
> > On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
> > chris@christopherschultz.net> wrote:
> >
> > > Filip,
> > >
> > > On 6/27/14, 11:36 AM, Filip Hanik wrote:
> > > > Are there any log entries that would indicate that the poller
> > > > thread has died? This/these thread/s start when Tomcat starts, and
> > > > a stack overflow on a processing thread should never affect the
> > > > poller thread.
> > >
> > > OP reported in the initial post that the thread had disappeared:
> > >
> > > On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> > > > We have no output in tomcat or our logs at the time when this event
> > > >  occurs. The only sign is when comparing full java thread dump with
> > > > a dump from a newly launched Tomcat:
> > > >
> > > > One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> > > > is missing/has died.
> > >
> > > -chris
> > >
> > >
> > >
> >
>

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Lars Engholm Johansen <la...@gmail.com>.
Thanks for looking further into this, Mark,

We are running:

  java version "1.7.0_65"
  OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.12.04.2)
  OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

  Linux 3.11.0-15-generic #25~precise1-Ubuntu SMP Thu Jan 30 17:39:31 UTC 2014
  x86_64 x86_64 x86_64 GNU/Linux
  Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz

The bug happened very seldom, about once a week, against the roughly 100M
text WebSocket messages we handle a day (one failure per several hundred
million messages), so I guess it could be an external problem like the JVM.

I already added a temporary fix to our Tomcat installation as mentioned
earlier, but other users could also suffer from this.
How do we proceed?

Best regards,
Lars Engholm Johansen

On Wed, Nov 12, 2014 at 12:15 PM, Mark Thomas <ma...@apache.org> wrote:

> On 10/11/2014 09:57, Lars Engholm Johansen wrote:
> > Hi Mark,
> >
> > I looked into our javax.websocket.Endpoint implementation and found the
> > following suspicious code:
> >
> > When we need to close the WebSocket session already in .onOpen() method
> > (rejecting a connection), we are calling session.close() asynchronously
> > after 1 second via a java.util.Timer task.
> > This was due to bug
> > https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
> > which I can see was fixed a long time ago (thanks).
> >
> > Can this cause the selector's keyset to be accessed by more than one
> thread?
>
> I don't see how.
>
> I've just double checked the NIO Poller code and the only places the
> keyset is used is in Poller.run() and Poller.timeout() - both of which
> are only ever accessed from the Poller thread.
>
> I've also looked over the run() and timeout() methods and haven't yet
> found anything that could trigger this.
>
> There are multiple Pollers but each Poller has a distinct set of sockets
> to manage.
>
> I'm beginning to wonder if there is a JVM bug here. Which JVM are you
> using?
>
> Mark
>
> >
> > Best regards,
> > Lars Engholm Johansen
> >
> > On Mon, Oct 6, 2014 at 2:14 PM, Mark Thomas <ma...@apache.org> wrote:
> >
> >> On 06/10/2014 10:11, Lars Engholm Johansen wrote:
> >>> Hi all,
> >>>
> >>> I have good news as I have identified the reason for the devastating
> >>> NioEndpoint.Poller thread death:
> >>>
> >>> In rare circumstances a ConcurrentModificationException can occur in
> >>> the Poller's connection timeout handling called from OUTSIDE the
> >>> try-catch(Throwable) of Poller.run()
> >>>
> >>> java.util.ConcurrentModificationException
> >>>         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
> >>>         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
> >>>         at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
> >>>         at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
> >>>         at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
> >>>         at java.lang.Thread.run(Thread.java:745)
> >>>
> >>> Somehow the Poller's Selector object gets modified from another thread.
> >>
> >> Any idea how? I've been looking through that code for some time now
> >> (this stack trace appears to be from 7.0.55 for those that want to look
> >> at this themselves) and I can't see anywhere where the selector's keyset
> >> is accessed by more than one thread.
> >>
> >>> As a remedy until fixed properly by the Tomcat team, I have added a
> >>> try-catch(ConcurrentModificationException) surrounding the for loop in
> >>> Poller.timeout().
> >>> That way, in case of the rare problem, a full iteration of the Selector
> >>> will be retried in the next call to Poller.timeout().
> >>
> >> That seems like a reasonable work-around but before we start making
> >> changes to the Tomcat code I'd really like to understand the root
> >> cause(s) of the issue else we might not be fixing the actual issue and
> >> could make it worse for some folks.
> >>
> >> Mark
> >>
> >>
> >>>
> >>> I am really happy now as all our production servers have been rock
> stable
> >>> for two weeks now.
> >>>
> >>> Best regards to all,
> >>> Lars Engholm Johansen
> >>>
> >>>
> >>> On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:
> >>>
> >>>> Thanks Lars, if you are indeed experiencing an uncaught error, let us
> >>>> know what it is.
> >>>>
> >>>> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <
> >> larsjo@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks guys for all the feedback.
> >>>>>
> >>>>> I have tried the following suggested tasks:
> >>>>>
> >>>>>    - Upgrading Tomcat to the newest 7.0.55 on all our servers ->
> >>>>>      Problem still persists
> >>>>>    - Forcing a System.gc() when the connection count runs away ->
> >>>>>      Connection count does not drop
> >>>>>    - Lowering the log level of the NioEndpoint class that contains the
> >>>>>      Poller code -> No info in any Tomcat logs about why the poller
> >>>>>      thread exits
> >>>>>    - Reverting the JVM stack size per thread to the default, as
> >>>>>      discussed previously -> Problem still persists
> >>>>>
> >>>>> I have now checked out the NioEndpoint source code and recompiled it
> >>>> with a
> >>>>> logging try-catch surrounding the whole of the Poller.run()
> >>>> implementation
> >>>>> as I noticed that the outer try-catch here only catches OOME.
> >>>>> I will report back with my findings as soon as the problem arises
> >> again.
> >>>>>
> >>>>> /Lars
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
> >>>>> chris@christopherschultz.net> wrote:
> >>>>>
> >>> Filip,
> >>>
> >>> On 6/27/14, 11:36 AM, Filip Hanik wrote:
> >>>>>>>> Are there any log entries that would indicate that the poller
> >>>>>>>> thread has died? This/these thread/s start when Tomcat starts, and
> >>>>>>>> a stack overflow on a processing thread should never affect the
> >>>>>>>> poller thread.
> >>>
> >>> OP reported in the initial post that the thread had disappeared:
> >>>
> >>> On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> >>>>>>>> We have no output in tomcat or our logs at the time when this
> event
> >>>>>>>>  occurs. The only sign is when comparing full java thread dump
> with
> >>>>>>>> a dump from a newly launched Tomcat:
> >>>>>>>>
> >>>>>>>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> >>>>>>>> is missing/has died.
> >>>
> >>> -chris
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >>
> >>
> >
>
>
>
>

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Mark Thomas <ma...@apache.org>.
On 10/11/2014 09:57, Lars Engholm Johansen wrote:
> Hi Mark,
> 
> I looked into our javax.websocket.Endpoint implementation and found the
> following suspicious code:
> 
> When we need to close the WebSocket session already in .onOpen() method
> (rejecting a connection), we are calling session.close() asynchronously
> after 1 second via a java.util.Timer task.
> This was due to bug https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
> which I can see was fixed a long time ago (thanks).
> 
> Can this cause the selector's keyset to be accessed by more than one thread?

I don't see how.

I've just double checked the NIO Poller code and the only places the
keyset is used is in Poller.run() and Poller.timeout() - both of which
are only ever accessed from the Poller thread.

I've also looked over the run() and timeout() methods and haven't yet
found anything that could trigger this.

There are multiple Pollers but each Poller has a distinct set of sockets
to manage.
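
For reference, the assignment happens once per accepted socket, along these
lines (paraphrasing the 7.0.x NioEndpoint from memory, so treat the details
as approximate):

    // Each accepted socket is registered with exactly one Poller, chosen in
    // round-robin fashion, so each Poller's Selector key set should only
    // ever be touched by that Poller's own thread.
    private Poller[] pollers;
    private AtomicInteger pollerRotater = new AtomicInteger(0);

    public Poller getPoller0() {
        int idx = Math.abs(pollerRotater.incrementAndGet()) % pollers.length;
        return pollers[idx];
    }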

I'm beginning to wonder if there is a JVM bug here. Which JVM are you using?

Mark

> 
> Best regards,
> Lars Engholm Johansen
> 
> On Mon, Oct 6, 2014 at 2:14 PM, Mark Thomas <ma...@apache.org> wrote:
> 
>> On 06/10/2014 10:11, Lars Engholm Johansen wrote:
>>> Hi all,
>>>
>>> I have good news as I have identified the reason for the devastating
>>> NioEndpoint.Poller thread death:
>>>
>>> In rare circumstances a ConcurrentModificationException can occur in the
>>> Poller's connection timeout handling called from OUTSIDE the
>>> try-catch(Throwable) of Poller.run()
>>>
>>> java.util.ConcurrentModificationException
>>>         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
>>>         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
>>>         at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
>>>         at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
>>>         at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
>>>         at java.lang.Thread.run(Thread.java:745)
>>>
>>> Somehow the Poller's Selector object gets modified from another thread.
>>
>> Any idea how? I've been looking through that code for some time now
>> (this stack trace appears to be from 7.0.55 for those that want to look
>> at this themselves) and I can't see anywhere where the selector's keyset
>> is accessed by more than one thread.
>>
>>> As a remedy until fixed properly by the Tomcat team, I have added a
>>> try-catch(ConcurrentModificationException) surrounding the for loop in
>>> Poller.timeout().
>>> That way, in case of the rare problem, a full iteration of the Selector
>>> will be retried in the next call to Poller.timeout().
>>
>> That seems like a reasonable work-around but before we start making
>> changes to the Tomcat code I'd really like to understand the root
>> cause(s) of the issue else we might not be fixing the actual issue and
>> could make it worse for some folks.
>>
>> Mark
>>
>>
>>>
>>> I am really happy now as all our production servers have been rock stable
>>> for two weeks now.
>>>
>>> Best regards to all,
>>> Lars Engholm Johansen
>>>
>>>
>>> On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:
>>>
>>>> Thanks Lars, if you are indeed experiencing an uncaught error, let us
>>>> know what it is.
>>>>
>>>> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <
>> larsjo@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks guys for all the feedback.
>>>>>
>>>>> I have tried the following suggested tasks:
>>>>>
>>>>>    - Upgrading Tomcat to the newest 7.0.55 on all our servers ->
>>>>>      Problem still persists
>>>>>    - Forcing a System.gc() when the connection count runs away ->
>>>>>      Connection count does not drop
>>>>>    - Lowering the log level of the NioEndpoint class that contains the
>>>>>      Poller code -> No info in any Tomcat logs about why the poller
>>>>>      thread exits
>>>>>    - Reverting the JVM stack size per thread to the default, as
>>>>>      discussed previously -> Problem still persists
>>>>>
>>>>> I have now checked out the NioEndpoint source code and recompiled it
>>>> with a
>>>>> logging try-catch surrounding the whole of the Poller.run()
>>>> implementation
>>>>> as I noticed that the outer try-catch here only catches OOME.
>>>>> I will report back with my findings as soon as the problem arises
>> again.
>>>>>
>>>>> /Lars
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
>>>>> chris@christopherschultz.net> wrote:
>>>>>
>>> Filip,
>>>
>>> On 6/27/14, 11:36 AM, Filip Hanik wrote:
>>>>>>>> Are there any log entries that would indicate that the poller
>>>>>>>> thread has died? This/these thread/s start when Tomcat starts, and
>>>>>>>> a stack overflow on a processing thread should never affect the
>>>>>>>> poller thread.
>>>
>>> OP reported in the initial post that the thread had disappeared:
>>>
>>> On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
>>>>>>>> We have no output in tomcat or our logs at the time when this event
>>>>>>>>  occurs. The only sign is when comparing full java thread dump with
>>>>>>>> a dump from a newly launched Tomcat:
>>>>>>>>
>>>>>>>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
>>>>>>>> is missing/has died.
>>>
>>> -chris
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>>
> 




Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Lars Engholm Johansen <la...@gmail.com>.
Hi Mark,

I looked into our javax.websocket.Endpoint implementation and found the
following suspicious code:

When we need to close the WebSocket session already in .onOpen() method
(rejecting a connection), we are calling session.close() asynchronously
after 1 second via a java.util.Timer task.
This was due to bug https://issues.apache.org/bugzilla/show_bug.cgi?id=54716,
which I can see was fixed a long time ago (thanks).
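
In outline, the suspicious part looks like this (simplified from our
production code; shouldReject() stands in for our app-specific check):

    import java.io.IOException;
    import java.util.Timer;
    import java.util.TimerTask;
    import javax.websocket.CloseReason;
    import javax.websocket.Endpoint;
    import javax.websocket.EndpointConfig;
    import javax.websocket.Session;

    public class RejectingEndpoint extends Endpoint {
        // Note: close() ends up running on this Timer thread, not on one
        // of the WebSocket container's own threads.
        private static final Timer timer = new Timer("deferred-close", true);

        @Override
        public void onOpen(final Session session, EndpointConfig config) {
            if (shouldReject(session)) {
                timer.schedule(new TimerTask() {
                    @Override
                    public void run() {
                        try {
                            session.close(new CloseReason(
                                    CloseReason.CloseCodes.VIOLATED_POLICY,
                                    "Connection rejected"));
                        } catch (IOException e) {
                            // Ignore: the connection may already be gone.
                        }
                    }
                }, 1000L); // the 1 s delay was the workaround for bug 54716
            }
        }

        private boolean shouldReject(Session session) {
            return true; // stand-in for the real rejection criteria
        }
    }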

Can this cause the selector's keyset to be accessed by more than one thread?

Best regards,
Lars Engholm Johansen

On Mon, Oct 6, 2014 at 2:14 PM, Mark Thomas <ma...@apache.org> wrote:

> On 06/10/2014 10:11, Lars Engholm Johansen wrote:
> > Hi all,
> >
> > I have good news as I have identified the reason for the devastating
> > NioEndpoint.Poller thread death:
> >
> > In rare circumstances a ConcurrentModificationException can occur in the
> > Poller's connection timeout handling called from OUTSIDE the
> > try-catch(Throwable) of Poller.run()
> >
> > java.util.ConcurrentModificationException
> >         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
> >         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
> >         at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
> >         at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
> >         at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > Somehow the Poller's Selector object gets modified from another thread.
>
> Any idea how? I've been looking through that code for some time now
> (this stack trace appears to be from 7.0.55 for those that want to look
> at this themselves) and I can't see anywhere where the selector's keyset
> is accessed by more than one thread.
>
> > As a remedy until fixed properly by the Tomcat team, I have added a
> > try-catch(ConcurrentModificationException) surrounding the for loop in
> > Poller.timeout().
> > That way, in case of the rare problem, a full iteration of the Selector
> > will be retried in the next call to Poller.timeout().
>
> That seems like a reasonable work-around but before we start making
> changes to the Tomcat code I'd really like to understand the root
> cause(s) of the issue else we might not be fixing the actual issue and
> could make it worse for some folks.
>
> Mark
>
>
> >
> > I am really happy now as all our production servers have been rock stable
> > for two weeks now.
> >
> > Best regards to all,
> > Lars Engholm Johansen
> >
> >
> > On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:
> >
> >> Thanks Lars, if you are indeed experiencing an uncaught error, let us
> >> know what it is.
> >>
> >> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <
> larsjo@gmail.com>
> >> wrote:
> >>
> >>> Thanks guys for all the feedback.
> >>>
> >>> I have tried the following suggested tasks:
> >>>
> >>>    - Upgrading Tomcat to the newest 7.0.55 on all our servers ->
> >>>      Problem still persists
> >>>    - Forcing a System.gc() when the connection count runs away ->
> >>>      Connection count does not drop
> >>>    - Lowering the log level of the NioEndpoint class that contains the
> >>>      Poller code -> No info in any Tomcat logs about why the poller
> >>>      thread exits
> >>>    - Reverting the JVM stack size per thread to the default, as
> >>>      discussed previously -> Problem still persists
> >>>
> >>> I have now checked out the NioEndpoint source code and recompiled it
> >> with a
> >>> logging try-catch surrounding the whole of the Poller.run()
> >> implementation
> >>> as I noticed that the outer try-catch here only catches OOME.
> >>> I will report back with my findings as soon as the problem arises
> again.
> >>>
> >>> /Lars
> >>>
> >>>
> >>>
> >>> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
> >>> chris@christopherschultz.net> wrote:
> >>>
> > Filip,
> >
> > On 6/27/14, 11:36 AM, Filip Hanik wrote:
> >>>>>> Are there any log entries that would indicate that the poller
> >>>>>> thread has died? This/these thread/s start when Tomcat starts, and
> >>>>>> a stack overflow on a processing thread should never affect the
> >>>>>> poller thread.
> >
> > OP reported in the initial post that the thread had disappeared:
> >
> > On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
> >>>>>> We have no output in tomcat or our logs at the time when this event
> >>>>>>  occurs. The only sign is when comparing full java thread dump with
> >>>>>> a dump from a newly launched Tomcat:
> >>>>>>
> >>>>>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
> >>>>>> is missing/has died.
> >
> > -chris
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
>
>
>

Re: Connection count explosion due to thread http-nio-80-ClientPoller-x death

Posted by Mark Thomas <ma...@apache.org>.
On 06/10/2014 10:11, Lars Engholm Johansen wrote:
> Hi all,
> 
> I have good news as I have identified the reason for the devastating
> NioEndpoint.Poller thread death:
> 
> In rare circumstances a ConcurrentModificationException can occur in the Poller's
> connection timeout handling called from OUTSIDE the try-catch(Throwable) of
> Poller.run()
> 
> java.util.ConcurrentModificationException
>         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
>         at java.util.HashMap$KeyIterator.next(HashMap.java:956)
>         at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
>         at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1437)
>         at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1143)
>         at java.lang.Thread.run(Thread.java:745)
> 
> Somehow the Poller's Selector object gets modified from another thread.

Any idea how? I've been looking through that code for some time now
(this stack trace appears to be from 7.0.55 for those that want to look
at this themselves) and I can't see anywhere where the selector's keyset
is accessed by more than one thread.

> As a remedy until fixed properly by the Tomcat team, I have added a
> try-catch(ConcurrentModificationException) surrounding the for loop in
> Poller.timeout().
> That way, in case of the rare problem, a full iteration of the Selector
> will be retried in the next call to Poller.timeout().

That seems like a reasonable work-around but before we start making
changes to the Tomcat code I'd really like to understand the root
cause(s) of the issue else we might not be fixing the actual issue and
could make it worse for some folks.

Mark


> 
> I am really happy now as all our production servers have been rock stable
> for two weeks now.
> 
> Best regards to all,
> Lars Engholm Johansen
> 
> 
> On Thu, Sep 18, 2014 at 7:03 PM, Filip Hanik <fi...@hanik.com> wrote:
> 
>> Thanks Lars, if you are indeed experiencing an uncaught error, let us know
>> what it is.
>>
>> On Thu, Sep 18, 2014 at 2:30 AM, Lars Engholm Johansen <la...@gmail.com>
>> wrote:
>>
>>> Thanks guys for all the feedback.
>>>
>>> I have tried the following suggested tasks:
>>>
>>>    - Upgrading Tomcat to the newest 7.0.55 on all our servers -> Problem
>>>      still persists
>>>    - Forcing a System.gc() when the connection count runs away ->
>>>      Connection count does not drop
>>>    - Lowering the log level of the NioEndpoint class that contains the
>>>      Poller code -> No info in any Tomcat logs about why the poller
>>>      thread exits
>>>    - Reverting the JVM stack size per thread to the default, as discussed
>>>      previously -> Problem still persists
>>>
>>> I have now checked out the NioEndpoint source code and recompiled it
>> with a
>>> logging try-catch surrounding the whole of the Poller.run()
>> implementation
>>> as I noticed that the outer try-catch here only catches OOME.
>>> I will report back with my findings as soon as the problem arises again.
>>>
>>> /Lars
>>>
>>>
>>>
>>> On Fri, Jun 27, 2014 at 9:02 PM, Christopher Schultz <
>>> chris@christopherschultz.net> wrote:
>>>
> Filip,
> 
> On 6/27/14, 11:36 AM, Filip Hanik wrote:
>>>>>> Are there any log entries that would indicate that the poller
>>>>>> thread has died? This/these thread/s start when Tomcat starts, and
>>>>>> a stack overflow on a processing thread should never affect the
>>>>>> poller thread.
> 
> OP reported in the initial post that the thread had disappeared:
> 
> On 6/16/14, 5:40 AM, Lars Engholm Johansen wrote:
>>>>>> We have no output in tomcat or our logs at the time when this event
>>>>>>  occurs. The only sign is when comparing full java thread dump with
>>>>>> a dump from a newly launched Tomcat:
>>>>>>
>>>>>> One of  http-nio-80-ClientPoller-0  or  http-nio-80-ClientPoller-1
>>>>>> is missing/has died.
> 
> -chris
>>>>
>>>>
>>>>
>>>
>>
> 

