You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tomcat.apache.org by bu...@apache.org on 2015/05/22 15:29:47 UTC

[Bug 57943] NioEndpoint, one poller thread died

https://bz.apache.org/bugzilla/show_bug.cgi?id=57943

--- Comment #1 from Mark Thomas <ma...@apache.org> ---
This is the third report we have received about this issue.

The other two are:
http://markmail.org/message/xgxblpk4v4ykgi5y
http://markmail.org/message/eypk42i6gdpztkpy

I have looked through the NIO Poller code again and I still can't see how this
can be happening. I've also dug into the JRE code and I can't see any issues
there either (although I'll not this has been reported on a variety of JVMs and
I haven't checked the specific ones mentions).

I think the best thing to do at this point is catch the error, log a warning
and maybe ask folks to report it if they can repeat it.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


Re: [Bug 57943] NioEndpoint, one poller thread died

Posted by Ian Luo <ia...@gmail.com>.
Hi Chris, Sure, just done.

Thanks,
-Ian.

On Thu, Jul 2, 2015 at 9:55 PM, Christopher Schultz <
chris@christopherschultz.net> wrote:

> Ian,
>
> This is a nice follow-up. Could you attach it directly to the bug report
> in Bugzilla, so the commentary is all in one place?
>
> https://bz.apache.org/bugzilla/show_bug.cgi?id=57943
>
> Thanks,
> -chris
>
> On 7/2/15 6:29 AM, Ian Luo wrote:
> > Hi Mark,
> >
> > I am the colleague of Sun Qi who reported bug57943
> > <https://bz.apache.org/bugzilla/show_bug.cgi?id=57943> one month ago. I
> > believe we managed to find out why ConcurrentModificationException
> happens
> > in our scenario.
> >
> > Here’s the scenario, our application uses async servlet to write the
> > response back to the client. The operation we put in async servlet which
> > run in another thread is quite time-consuming. When the incoming traffic
> is
> > heavy, which usually happens when the instances of tomcat server reboots,
> > we hit this issue very easily.
> >
> > When the ClientPoller thread run into timeout method (in NioEndpoint) and
> > execute the following code:
> >
> > } else if (!ka.isAsync() || ka.getTimeout() > 0) {
> >     // Async requests with a timeout of 0 or less never timeout
> >     long delta = now - ka.getLastAccess();
> >     long timeout =
> >
> (ka.getTimeout()==-1)?((long)socketProperties.getSoTimeout()):(ka.getTimeout());
> >     boolean isTimedout = delta > timeout;
> >     if (isTimedout) {
> >         // Prevent subsequent timeouts if the timeout event takes a
> > while to process
> >         ka.access(Long.MAX_VALUE);
> >         processSocket(ka.getChannel(), SocketStatus.TIMEOUT, true, 6);
> >     }}
> >
> > Eventually it calls processSocket(ka.getChannel()) and puts back the
> > NioChannel instance to the cache nioChannels for the late re-use:
> >
> > if (ka!=null) ka.setComet(false);
> > socket.getPoller().cancelledKey(key, SocketStatus.ERROR, false);if
> > (running && !paused) {
> >     nioChannels.offer(socket);}
> >
> > Unfortunately in our scenarios, when QPS is greater than 1000req/sec,
> async
> > servlet threads have very large chance to start async-complete and begin
> to
> > call processSocket (in Http11NioProcessor) at the same time:
> >
> > } else if (actionCode == ActionCode.ASYNC_COMPLETE) {
> >     if (asyncStateMachine.asyncComplete()) {
> >     ((NioEndpoint)endpoint).processSocket(socketWrapper.getSocket(),
> > SocketStatus.OPEN_READ, true, 10);}
> >
> > This leads to cache pollution in nioChannels since there’re chances to
> > offer the same object into the cache multiple times. When the Acceptor
> > thread polls the object and hands it over to the ClientPollers, the same
> > object could be pass into the different ClientPoller threads, then leads
> to
> > very serious problem.
> >
> > You may notice the code I show here is the code from tomcat 7.0.54, and
> > claim the similar issue has already been addressed, but this is not true.
> > We noticed the similar issue was reported on Bug57340
> > <https://bz.apache.org/bugzilla/show_bug.cgi?id=57340>, and the solution
> > has already been checked into the later release.
> >
> > But this solution is not a complete solution, in our scenario, there’s
> > chance to offer the duplicated object to the cache in else if clause
> where
> > the solution for Bug57340 doesn’t cover:
> >
> > } else if (handshake == -1 ) {
> >     if (key != null) {
> >         socket.getPoller().cancelledKey(key, SocketStatus.DISCONNECT,
> false);
> >     }
> >     nioChannels.offer(socket);
> >     socket = null;
> >     if ( ka!=null ) keyCache.offer(ka);
> >     ka = null;}
> >
> > This is easily to understand. When handshake equals -1, one possibility
> is
> > the key passed into the doRun method is null, and when key is null, it
> > means some other thread has already finished processing on the same
> socket:
> >
> > public void run() {
> >     SelectionKey key = socket.getIOChannel().keyFor(
> >         socket.getPoller().getSelector());
> >
> > We propose the change below in order to address this issue throughly. In
> > this case, I think we can simply drop the object instead of offering it
> > since it looks there’s no other ideal way to not use lock.
> >
> > } else if (handshake == -1 ) {
> >     if (key != null) {
> >         if (socket.getPoller().cancelledKey(key,
> > SocketStatus.DISCONNECT, false) != null) {
> >             nioChannels.offer(socket);
> >             if (ka != null) keyCache.offer(ka);
> >         }
> >     }
> >     socket = null;
> >     ka = null;
> >
> > Best Regards,
> > -Ian.
> >
> > On Fri, May 22, 2015 at 9:29 PM, <bu...@apache.org> wrote:
> >
> >> https://bz.apache.org/bugzilla/show_bug.cgi?id=57943
> >>
> >> --- Comment #1 from Mark Thomas <ma...@apache.org> ---
> >> This is the third report we have received about this issue.
> >>
> >> The other two are:
> >> http://markmail.org/message/xgxblpk4v4ykgi5y
> >> http://markmail.org/message/eypk42i6gdpztkpy
> >>
> >> I have looked through the NIO Poller code again and I still can't see
> how
> >> this
> >> can be happening. I've also dug into the JRE code and I can't see any
> >> issues
> >> there either (although I'll not this has been reported on a variety of
> >> JVMs and
> >> I haven't checked the specific ones mentions).
> >>
> >> I think the best thing to do at this point is catch the error, log a
> >> warning
> >> and maybe ask folks to report it if they can repeat it.
> >>
> >> --
> >> You are receiving this mail because:
> >> You are the assignee for the bug.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
> >> For additional commands, e-mail: dev-help@tomcat.apache.org
> >>
> >>
> >
>
>

Re: [Bug 57943] NioEndpoint, one poller thread died

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Ian,

This is a nice follow-up. Could you attach it directly to the bug report
in Bugzilla, so the commentary is all in one place?

https://bz.apache.org/bugzilla/show_bug.cgi?id=57943

Thanks,
-chris

On 7/2/15 6:29 AM, Ian Luo wrote:
> Hi Mark,
> 
> I am the colleague of Sun Qi who reported bug57943
> <https://bz.apache.org/bugzilla/show_bug.cgi?id=57943> one month ago. I
> believe we managed to find out why ConcurrentModificationException happens
> in our scenario.
> 
> Here’s the scenario, our application uses async servlet to write the
> response back to the client. The operation we put in async servlet which
> run in another thread is quite time-consuming. When the incoming traffic is
> heavy, which usually happens when the instances of tomcat server reboots,
> we hit this issue very easily.
> 
> When the ClientPoller thread run into timeout method (in NioEndpoint) and
> execute the following code:
> 
> } else if (!ka.isAsync() || ka.getTimeout() > 0) {
>     // Async requests with a timeout of 0 or less never timeout
>     long delta = now - ka.getLastAccess();
>     long timeout =
> (ka.getTimeout()==-1)?((long)socketProperties.getSoTimeout()):(ka.getTimeout());
>     boolean isTimedout = delta > timeout;
>     if (isTimedout) {
>         // Prevent subsequent timeouts if the timeout event takes a
> while to process
>         ka.access(Long.MAX_VALUE);
>         processSocket(ka.getChannel(), SocketStatus.TIMEOUT, true, 6);
>     }}
> 
> Eventually it calls processSocket(ka.getChannel()) and puts back the
> NioChannel instance to the cache nioChannels for the late re-use:
> 
> if (ka!=null) ka.setComet(false);
> socket.getPoller().cancelledKey(key, SocketStatus.ERROR, false);if
> (running && !paused) {
>     nioChannels.offer(socket);}
> 
> Unfortunately in our scenarios, when QPS is greater than 1000req/sec, async
> servlet threads have very large chance to start async-complete and begin to
> call processSocket (in Http11NioProcessor) at the same time:
> 
> } else if (actionCode == ActionCode.ASYNC_COMPLETE) {
>     if (asyncStateMachine.asyncComplete()) {
>     ((NioEndpoint)endpoint).processSocket(socketWrapper.getSocket(),
> SocketStatus.OPEN_READ, true, 10);}
> 
> This leads to cache pollution in nioChannels since there’re chances to
> offer the same object into the cache multiple times. When the Acceptor
> thread polls the object and hands it over to the ClientPollers, the same
> object could be pass into the different ClientPoller threads, then leads to
> very serious problem.
> 
> You may notice the code I show here is the code from tomcat 7.0.54, and
> claim the similar issue has already been addressed, but this is not true.
> We noticed the similar issue was reported on Bug57340
> <https://bz.apache.org/bugzilla/show_bug.cgi?id=57340>, and the solution
> has already been checked into the later release.
> 
> But this solution is not a complete solution, in our scenario, there’s
> chance to offer the duplicated object to the cache in else if clause where
> the solution for Bug57340 doesn’t cover:
> 
> } else if (handshake == -1 ) {
>     if (key != null) {
>         socket.getPoller().cancelledKey(key, SocketStatus.DISCONNECT, false);
>     }
>     nioChannels.offer(socket);
>     socket = null;
>     if ( ka!=null ) keyCache.offer(ka);
>     ka = null;}
> 
> This is easily to understand. When handshake equals -1, one possibility is
> the key passed into the doRun method is null, and when key is null, it
> means some other thread has already finished processing on the same socket:
> 
> public void run() {
>     SelectionKey key = socket.getIOChannel().keyFor(
>         socket.getPoller().getSelector());
> 
> We propose the change below in order to address this issue throughly. In
> this case, I think we can simply drop the object instead of offering it
> since it looks there’s no other ideal way to not use lock.
> 
> } else if (handshake == -1 ) {
>     if (key != null) {
>         if (socket.getPoller().cancelledKey(key,
> SocketStatus.DISCONNECT, false) != null) {
>             nioChannels.offer(socket);
>             if (ka != null) keyCache.offer(ka);
>         }
>     }
>     socket = null;
>     ka = null;
> 
> Best Regards,
> -Ian.
> 
> On Fri, May 22, 2015 at 9:29 PM, <bu...@apache.org> wrote:
> 
>> https://bz.apache.org/bugzilla/show_bug.cgi?id=57943
>>
>> --- Comment #1 from Mark Thomas <ma...@apache.org> ---
>> This is the third report we have received about this issue.
>>
>> The other two are:
>> http://markmail.org/message/xgxblpk4v4ykgi5y
>> http://markmail.org/message/eypk42i6gdpztkpy
>>
>> I have looked through the NIO Poller code again and I still can't see how
>> this
>> can be happening. I've also dug into the JRE code and I can't see any
>> issues
>> there either (although I'll not this has been reported on a variety of
>> JVMs and
>> I haven't checked the specific ones mentions).
>>
>> I think the best thing to do at this point is catch the error, log a
>> warning
>> and maybe ask folks to report it if they can repeat it.
>>
>> --
>> You are receiving this mail because:
>> You are the assignee for the bug.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
>> For additional commands, e-mail: dev-help@tomcat.apache.org
>>
>>
> 


Re: [Bug 57943] NioEndpoint, one poller thread died

Posted by Ian Luo <ia...@gmail.com>.
Hi Mark,

I am the colleague of Sun Qi who reported bug57943
<https://bz.apache.org/bugzilla/show_bug.cgi?id=57943> one month ago. I
believe we managed to find out why ConcurrentModificationException happens
in our scenario.

Here’s the scenario, our application uses async servlet to write the
response back to the client. The operation we put in async servlet which
run in another thread is quite time-consuming. When the incoming traffic is
heavy, which usually happens when the instances of tomcat server reboots,
we hit this issue very easily.

When the ClientPoller thread run into timeout method (in NioEndpoint) and
execute the following code:

} else if (!ka.isAsync() || ka.getTimeout() > 0) {
    // Async requests with a timeout of 0 or less never timeout
    long delta = now - ka.getLastAccess();
    long timeout =
(ka.getTimeout()==-1)?((long)socketProperties.getSoTimeout()):(ka.getTimeout());
    boolean isTimedout = delta > timeout;
    if (isTimedout) {
        // Prevent subsequent timeouts if the timeout event takes a
while to process
        ka.access(Long.MAX_VALUE);
        processSocket(ka.getChannel(), SocketStatus.TIMEOUT, true, 6);
    }}

Eventually it calls processSocket(ka.getChannel()) and puts back the
NioChannel instance to the cache nioChannels for the late re-use:

if (ka!=null) ka.setComet(false);
socket.getPoller().cancelledKey(key, SocketStatus.ERROR, false);if
(running && !paused) {
    nioChannels.offer(socket);}

Unfortunately in our scenarios, when QPS is greater than 1000req/sec, async
servlet threads have very large chance to start async-complete and begin to
call processSocket (in Http11NioProcessor) at the same time:

} else if (actionCode == ActionCode.ASYNC_COMPLETE) {
    if (asyncStateMachine.asyncComplete()) {
    ((NioEndpoint)endpoint).processSocket(socketWrapper.getSocket(),
SocketStatus.OPEN_READ, true, 10);}

This leads to cache pollution in nioChannels since there’re chances to
offer the same object into the cache multiple times. When the Acceptor
thread polls the object and hands it over to the ClientPollers, the same
object could be pass into the different ClientPoller threads, then leads to
very serious problem.

You may notice the code I show here is the code from tomcat 7.0.54, and
claim the similar issue has already been addressed, but this is not true.
We noticed the similar issue was reported on Bug57340
<https://bz.apache.org/bugzilla/show_bug.cgi?id=57340>, and the solution
has already been checked into the later release.

But this solution is not a complete solution, in our scenario, there’s
chance to offer the duplicated object to the cache in else if clause where
the solution for Bug57340 doesn’t cover:

} else if (handshake == -1 ) {
    if (key != null) {
        socket.getPoller().cancelledKey(key, SocketStatus.DISCONNECT, false);
    }
    nioChannels.offer(socket);
    socket = null;
    if ( ka!=null ) keyCache.offer(ka);
    ka = null;}

This is easily to understand. When handshake equals -1, one possibility is
the key passed into the doRun method is null, and when key is null, it
means some other thread has already finished processing on the same socket:

public void run() {
    SelectionKey key = socket.getIOChannel().keyFor(
        socket.getPoller().getSelector());

We propose the change below in order to address this issue throughly. In
this case, I think we can simply drop the object instead of offering it
since it looks there’s no other ideal way to not use lock.

} else if (handshake == -1 ) {
    if (key != null) {
        if (socket.getPoller().cancelledKey(key,
SocketStatus.DISCONNECT, false) != null) {
            nioChannels.offer(socket);
            if (ka != null) keyCache.offer(ka);
        }
    }
    socket = null;
    ka = null;

Best Regards,
-Ian.

On Fri, May 22, 2015 at 9:29 PM, <bu...@apache.org> wrote:

> https://bz.apache.org/bugzilla/show_bug.cgi?id=57943
>
> --- Comment #1 from Mark Thomas <ma...@apache.org> ---
> This is the third report we have received about this issue.
>
> The other two are:
> http://markmail.org/message/xgxblpk4v4ykgi5y
> http://markmail.org/message/eypk42i6gdpztkpy
>
> I have looked through the NIO Poller code again and I still can't see how
> this
> can be happening. I've also dug into the JRE code and I can't see any
> issues
> there either (although I'll not this has been reported on a variety of
> JVMs and
> I haven't checked the specific ones mentions).
>
> I think the best thing to do at this point is catch the error, log a
> warning
> and maybe ask folks to report it if they can repeat it.
>
> --
> You are receiving this mail because:
> You are the assignee for the bug.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: dev-help@tomcat.apache.org
>
>