You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tomcat.apache.org by Rainer Jung <ra...@kippdata.de> on 2013/01/21 10:20:27 UTC

TC looping on CPU due to "Critical poller failure" when using APR (7.0.35, on reproducible)

I did another run for "ant test" for 7.0.35 and when using APR I got the
following error the first time (never saw that before):

    [junit] Running org.apache.jasper.compiler.TestParser

...

    [junit] Jan 18, 2013 4:37:56 PM
org.apache.catalina.core.AprLifecycleListener init
    [junit] INFO: Loaded APR based Apache Tomcat Native library 1.1.24
using APR version 1.4.6.
    [junit] Jan 18, 2013 4:37:56 PM
org.apache.catalina.core.AprLifecycleListener init
    [junit] INFO: APR capabilities: IPv6 [true], sendfile [true], accept
filters [false], random [true].
    [junit] Jan 18, 2013 4:37:57 PM
org.apache.catalina.core.AprLifecycleListener initializeSSL
    [junit] INFO: OpenSSL successfully initialized (OpenSSL 1.0.1c 10
May 2012)
    [junit] Jan 18, 2013 4:37:57 PM org.apache.coyote.AbstractProtocol init
    [junit] INFO: Initializing ProtocolHandler ["http-apr-127.0.0.1-auto-2"]
    [junit] Jan 18, 2013 4:37:57 PM
org.apache.catalina.core.StandardService startInternal
    [junit] INFO: Starting service Tomcat
    [junit] Jan 18, 2013 4:37:57 PM
org.apache.catalina.core.StandardEngine startInternal
    [junit] INFO: Starting Servlet Engine: Apache Tomcat/7.0.35
    [junit] Jan 18, 2013 4:37:57 PM
org.apache.catalina.startup.ContextConfig getDefaultWebXmlFragment
    [junit] INFO: No global web.xml found
    [junit] Jan 18, 2013 4:37:57 PM org.apache.coyote.AbstractProtocol start
    [junit] INFO: Starting ProtocolHandler
["http-apr-127.0.0.1-auto-2-42652"]
    [junit] Jan 18, 2013 4:37:59 PM org.apache.coyote.AbstractProtocol pause
    [junit] INFO: Pausing ProtocolHandler
["http-apr-127.0.0.1-auto-2-42652"]
    [junit] Jan 18, 2013 4:37:59 PM
org.apache.catalina.core.StandardService stopInternal
    [junit] INFO: Stopping service Tomcat
    [junit] Jan 18, 2013 4:37:59 PM org.apache.coyote.AbstractProtocol stop
    [junit] INFO: Stopping ProtocolHandler
["http-apr-127.0.0.1-auto-2-42652"]
    [junit] Jan 18, 2013 4:37:59 PM
org.apache.tomcat.util.net.AprEndpoint$Poller doPoll
    [junit] SEVERE: Critical poller failure (restarting poller): [9] Bad
file number

            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


OS is Solaris 10, using Java 1.6.0_37.

The ant and Tomcat processes are still there to investigate, Tomcat is
looping on CPU in Thread 26. Stack is:

-----------------  lwp# 26 / thread# 26  --------------------
 fe024ccc apr_pool_destroy (8a6da0, b667f830, 0, 0, 16ed48, 13040d8) + 184
 fc00d700 * org/apache/tomcat/jni/Pool.destroy(J)V+1478
 fc00d6ac * org/apache/tomcat/jni/Pool.destroy(J)V+0
 fc005ab0 * org/apache/tomcat/util/net/AprEndpoint$Poller.destroy()V+68
(line 2347)
 fc005ab0 * org/apache/tomcat/util/net/AprEndpoint$Poller.doPoll(J)Z+292
(line 2688)
 fc005868 * org/apache/tomcat/util/net/AprEndpoint$Poller.run()V+345
(line 2629)

In the java world it is:

"http-apr-127.0.0.1-auto-2-Poller-0" daemon prio=3 tid=0x00645400
nid=0x1a runnable [0xb667f000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.tomcat.jni.Pool.destroy(Native Method)
        at
org.apache.tomcat.util.net.AprEndpoint$Poller.destroy(AprEndpoint.java:1177)
        at
org.apache.tomcat.util.net.AprEndpoint$Poller.doPoll(AprEndpoint.java:1362)
        - locked <0xe662d6a0> (a
org.apache.tomcat.util.net.AprEndpoint$Poller)
        at
org.apache.tomcat.util.net.AprEndpoint$Poller.run(AprEndpoint.java:1310)


main thread is at:

"main" prio=3 tid=0x00031400 nid=0x2 waiting for monitor entry [0xfe37e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xe662d6a0> (a
org.apache.tomcat.util.net.AprEndpoint$Poller)
        at java.lang.Thread.join(Thread.java:1194)
        - locked <0xe662d6a0> (a
org.apache.tomcat.util.net.AprEndpoint$Poller)
        at
org.apache.tomcat.util.net.AprEndpoint$Poller.destroy(AprEndpoint.java:1183)
        at
org.apache.tomcat.util.net.AprEndpoint.stopInternal(AprEndpoint.java:685)
        at
org.apache.tomcat.util.net.AbstractEndpoint.stop(AbstractEndpoint.java:664)
        at
org.apache.coyote.AbstractProtocol.stop(AbstractProtocol.java:488)
        at
org.apache.catalina.connector.Connector.stopInternal(Connector.java:1037)
        at
org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232)


The code is looping in allocator_free, because it tries to traverse a
pointer chain that is cyclic:

(gdb) print *node
$5 = {next = 0x13040d8, ref = 0x678748, index = 1, free_index = 1,
first_avail = 0x8a6e50 "", endp = 0x8a8d88 ""}
(gdb) print *node->next
$6 = {next = 0x8a6d88, ref = 0x13040d8, index = 1, free_index = 109,
first_avail = 0x1304418 "\001\060D ", endp = 0x13060d8 ""}
(gdb) print *node->next->next
$7 = {next = 0x13040d8, ref = 0x678748, index = 1, free_index = 1,
first_avail = 0x8a6e50 "", endp = 0x8a8d88 ""}

So node->next->next == node.

This is an indication for a synchronization problem in the code using
the pool.

The allocator is:

(gdb) print *allocator
$8 = {max_index = 1, max_free_index = 0, current_free_index = 0, mutex =
0x32f5b8, owner = 0x32f578, free = {0x678748, 0x8a6d88, 0x0 <repeats 18
times>}}


and the pool is:

(gdb) print *pool
$9 = {parent = 0x49c478, child = 0x0, sibling = 0x0, ref = 0xe78d20,
cleanups = 0x0, free_cleanups = 0x0, allocator = 0x16ed48, subprocesses
= 0x0, abort_fn = 0,
  user_data = 0x0, tag = 0x0, active = 0x678748, self = 0x8a6d88,
self_first_avail = 0x8a6de0 "", pre_cleanups = 0x0}

It should be the "serverSockPool" in AprEndpoint$Poller.

Walking up the pool hierarchy:

(gdb) print *pool->parent
$11 = {parent = 0x479830, child = 0x1265170, sibling = 0x0, ref =
0x1018370, cleanups = 0x0, free_cleanups = 0x0, allocator = 0x16ed48,
subprocesses = 0x0, abort_fn = 0,
  user_data = 0x0, tag = 0x0, active = 0x49c460, self = 0x49c460,
self_first_avail = 0x49c4b8 "", pre_cleanups = 0x0}

(gdb) print *pool->parent->parent
$12 = {parent = 0x54ac40, child = 0x1018368, sibling = 0x0, ref =
0x54ac44, cleanups = 0x0, free_cleanups = 0x0, allocator = 0x16ed48,
subprocesses = 0x0, abort_fn = 0,
  user_data = 0x0, tag = 0x0, active = 0x479818, self = 0x479818,
self_first_avail = 0x479870 "", pre_cleanups = 0x0}

(gdb) print *pool->parent->parent->parent
$13 = {parent = 0x32f578, child = 0x479830, sibling = 0xb61a90, ref =
0x32f57c, cleanups = 0x54b4f8, free_cleanups = 0x0, allocator =
0x16ed48, subprocesses = 0x0,
  abort_fn = 0, user_data = 0x0, tag = 0x0, active = 0x54ac28, self =
0x54ac28, self_first_avail = 0x54ac80 "/dev/urandom", pre_cleanups = 0x0}

(gdb) print *pool->parent->parent->parent->parent
$14 = {parent = 0x0, child = 0x54ac40, sibling = 0x0, ref = 0x0,
cleanups = 0x32f5d8, free_cleanups = 0x0, allocator = 0x16ed48,
subprocesses = 0x0, abort_fn = 0,
  user_data = 0x0, tag = 0xfe032a60 "apr_global_pool", active =
0x32f560, self = 0x32f560, self_first_avail = 0x32f5b8 "", pre_cleanups
= 0x0}


Regards,

Rainer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


Re: TC looping on CPU due to "Critical poller failure" when using APR (7.0.35, on reproducible)

Posted by Rainer Jung <ra...@kippdata.de>.
On 21.01.2013 10:41, Konstantin Kolinko wrote:
> 2013/1/21 Rainer Jung <ra...@kippdata.de>:
>> I did another run for "ant test" for 7.0.35 and when using APR I got the
>> following error the first time (never saw that before):
>>
>> (...)
>>     [junit] INFO: Stopping ProtocolHandler
>> ["http-apr-127.0.0.1-auto-2-42652"]
>>     [junit] Jan 18, 2013 4:37:59 PM
>> org.apache.tomcat.util.net.AprEndpoint$Poller doPoll
>>     [junit] SEVERE: Critical poller failure (restarting poller): [9] Bad
>> file number
>>
>>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>>
>> OS is Solaris 10, using Java 1.6.0_37.
>>
>> The ant and Tomcat processes are still there to investigate, Tomcat is
>> looping on CPU in Thread 26. Stack is:
>> (...)
>>
> 
> Well, it happens only during shutdown, so it might be that it is not
> so important...

It looks like the shutdown is part of the reason. See below.

> What is the downside here -- Tomcat hangs due to an infinite loop? If
> so then it is not good.

Yes, it hangs in an infinite loop.

> I wonder why that log message says "restarting poller". Does it try to
> start the poller when everything else shuts down?

The code following the failure first calls destroy(), then init().

>> This is an indication for a synchronization problem in the code using
>> the pool.

I'd speculate:

- the shutdown code somehow triggered the error message.

- then the shutdown code and the code running after the error both enter
"destroy" for the Poller (can be seen from the stack dump). It seems
method Poller.destroy() is not thread safe, because it calls destroy for
the pool without locking. Only the call coming from the error path has a
"synchronized", not the call coming from the shutdown path. That could
corrupt the allocator in the pool and lead to the infinite looping.

> I wonder whether some test case has higher chance of triggering this.
> The test case where it happened is "org.apache.jasper.compiler.TestParser" ?

Yes that was the test where it happened.

Regards,

Rainer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


Re: TC looping on CPU due to "Critical poller failure" when using APR (7.0.35, on reproducible)

Posted by Konstantin Kolinko <kn...@gmail.com>.
2013/1/21 Rainer Jung <ra...@kippdata.de>:
> I did another run for "ant test" for 7.0.35 and when using APR I got the
> following error the first time (never saw that before):
>
> (...)
>     [junit] INFO: Stopping ProtocolHandler
> ["http-apr-127.0.0.1-auto-2-42652"]
>     [junit] Jan 18, 2013 4:37:59 PM
> org.apache.tomcat.util.net.AprEndpoint$Poller doPoll
>     [junit] SEVERE: Critical poller failure (restarting poller): [9] Bad
> file number
>
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>
> OS is Solaris 10, using Java 1.6.0_37.
>
> The ant and Tomcat processes are still there to investigate, Tomcat is
> looping on CPU in Thread 26. Stack is:
> (...)
>

Well, it happens only during shutdown, so it might be that it is not
so important...

What is the downside here -- Tomcat hangs due to an infinite loop? If
so then it is not good.

I wonder why that log message says "restarting poller". Does it try to
start the poller when everything else shuts down?

> This is an indication for a synchronization problem in the code using
> the pool.

I wonder whether some test case has higher chance of triggering this.
The test case where it happened is "org.apache.jasper.compiler.TestParser" ?

Best regards,
Konstantin Kolinko

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org