Posted to dev@tomcat.apache.org by Rainer Jung <ra...@kippdata.de> on 2013/09/22 13:17:15 UTC

Issue in AprEndpoint detected by TestCoyoteAdapter

I debugged around my occasional failures for TestCoyoteAdapter when
using APR.

Error is:

SEVERE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with
error [81] : [File descriptor in bad state]


In the native world it is:

19372/128:      port_associate(6, 4, 0xFFFFFFFF, 0x00000001, 0x0089ABE8)
Err#81 EBADFD

The bad FD here is 0xFFFFFFFF. Normally the FDs put into this slot of
port_associate during the test are small numbers between 6 and 20
(decimal). Here it is -1, so an error must have happened before. What we
would have expected is FD 11. In the truss snippet, thread 2 is the
client opening the connection, thread 130 accepts it and thread 133
starts to work on it:

19372/2:        so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, "",
SOV_DEFAULT) = 8


Connect

19372/2:        connect(8, 0xFDF7DAE0, 16, SOV_DEFAULT)         = 0
19372/2:                AF_INET  name = 127.0.0.1  port = 42932

Accept

19372/130:      accept(5, 0xB397F608, 0xB397F5F4, SOV_DEFAULT)  = 11
19372/130:              AF_INET  name = 127.0.0.1  port = 42936
19372/2:        getsockname(8, 0xFDF7DAE0, 0xFDF7DA2C, SOV_DEFAULT) = 0
19372/2:                AF_INET  name = 127.0.0.1  port = 42936
19372/130:      fcntl(11, F_GETFD, 0x00000003)                  = 0
19372/130:      fcntl(11, F_SETFD, 0x00000001)                  = 0
19372/2:        setsockopt(8, tcp, TCP_NODELAY, 0xFDF7DE10, 4,
SOV_DEFAULT) = 0
19372/130:      lwp_cond_signal(0x0031E3B0)                     = 0
19372/130:              condvar type: USYNC_THREAD

Send request

19372/2:        send(8, " G E T   / f o o ; j s e".., 227, 0)   = 227
19372/13:       lwp_cond_wait(0x0031E3B0, 0x0031E398, 0xB457F278, 0) = 0
19372/13:               condvar type: USYNC_THREAD
19372/13:               mutex type: USYNC_THREAD
19372/13:               timeout: 0.999998773 sec
19372/130:      lwp_cond_signal(0x004215F0)                     = 0
19372/130:              condvar type: USYNC_THREAD
19372/133:      lwp_cond_wait(0x004215F0, 0x004215D8, 0x00000000, 0) = 0
19372/133:              condvar type: USYNC_THREAD
19372/133:              mutex type: USYNC_THREAD
19372/133:      setsockopt(11, tcp, TCP_NODELAY, 0xB3C7F40C, 4,
SOV_DEFAULT) = 0
19372/133:      fcntl(11, F_GETFL)                              = 2
                FWRITE
19372/133:      fcntl(11, F_SETFL, FWRITE|FNONBLOCK)            = 0

19372/133:      lwp_cond_signal(0x0031E3B0)                     = 0
19372/133:              condvar type: USYNC_THREAD
19372/133:      lwp_cond_signal(0x00274448)                     = 0
19372/133:              condvar type: USYNC_THREAD
19372/128:      lwp_cond_wait(0x00274448, 0x00274430, 0xB417F580, 0) = 0
19372/128:              condvar type: USYNC_THREAD
19372/128:              mutex type: USYNC_THREAD
19372/128:              timeout: 9.999997356 sec
19372/128:      lwp_cond_signal(0x0031E3B0)                     = 0
19372/128:              condvar type: USYNC_THREAD
19372/128:      lwp_cond_signal(0x0031E3B0)                     = 0
19372/128:              condvar type: USYNC_THREAD

Adding to the pollset happens with the wrong FD, -1 instead of 11:

19372/128:      port_associate(6, 4, 0xFFFFFFFF, 0x00000001, 0x0089ABE8)
Err#81 EBADFD

I added some Java log statements to track calls to the Poll class
(please ignore that they are at log level WARN):

22-Sep-2013 12:19:12.909 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,218,784]
22-Sep-2013 12:19:12.910 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size 0 socket 2218784
22-Sep-2013 12:19:12.911 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0

Here begins the root cause:

- start polling poller of size 1, containing only socket 2218784:

22-Sep-2013 12:19:12.912 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 of size 1 with timeout 2000

- start removing from the same poller concurrently in the exec thread:

22-Sep-2013 12:19:12.914 WARNING [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removing
from poller number 0 of size 1 socket 2856392

- polling returns with one fd and decreases size to 0:

22-Sep-2013 12:19:12.913 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 returned with 1

- remove returns with success and decreases the size to -1(!):

22-Sep-2013 12:19:12.915 WARNING [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removing
from poller number 0 returned with 0


22-Sep-2013 12:19:12.916 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket
[2,218,784] for event(s) [1]
22-Sep-2013 12:19:12.916 FINE [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint.destroySocket socket [2,856,392],
doIt [true]
22-Sep-2013 12:19:12.929 FINE [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.add socket [2,218,784],
timeout [3,000], flags [1]
22-Sep-2013 12:19:12.930 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,218,784]


Here the poller data structure is corrupt: it says the size is -1!
The value printed is "actualPollerSize - pollerSpace[i]", so
pollerSpace[i] was incremented once too often.


22-Sep-2013 12:19:12.931 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size -1 socket 2218784
22-Sep-2013 12:19:12.931 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0
22-Sep-2013 12:19:22.935 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.maintain Timing out [2,218,784]
22-Sep-2013 12:19:22.937 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint.destroySocket socket [2,218,784],
doIt [true]
22-Sep-2013 12:19:22.940 FINE [http-apr-127.0.0.1-auto-13-Acceptor-0]
org.apache.tomcat.util.net.AprEndpoint.processSocketWithOptions socket
[2,848,192]
22-Sep-2013 12:19:22.947 FINE [http-apr-127.0.0.1-auto-13-exec-2]
org.apache.tomcat.util.net.AprEndpoint$Poller.add socket [2,848,192],
timeout [3,000], flags [1]
22-Sep-2013 12:19:22.948 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,848,192]
22-Sep-2013 12:19:22.949 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size 0 socket 2848192
22-Sep-2013 12:19:22.949 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0
22-Sep-2013 12:19:22.950 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 of size 1 with timeout 2000

Here we get EBADFD (-81 on Solaris):

22-Sep-2013 12:19:22.950 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 returned with -81


Not sure whether the problem is more in the concurrent poll plus remove,
or the following code in poll.c:

TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset,
                                       jlong socket)
{
    apr_pollfd_t fd;
    apr_status_t rv;
    tcn_pollset_t *p = J2P(pollset,  tcn_pollset_t *);
    tcn_socket_t  *s = J2P(socket, tcn_socket_t *);

    UNREFERENCED_STDARGS;
    TCN_ASSERT(socket != 0);

    if (s->pe == NULL) {
        /* Already removed */
        return APR_SUCCESS;
    }

Here we return APR_SUCCESS and the code calling Poll.remove in
AprEndpoint always does:

                    rv = Poll.remove(pollers[i], socket);
                    if (rv != Status.APR_NOTFOUND) {
                        pollerSpace[i]++;
                        connectionCount--;
                        break;
                    }

So pollerSpace is incremented and connectionCount decremented even when
the socket had already been removed.
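
To make the accounting hazard concrete, here is a minimal sketch of the
pattern, with simplified and partly hypothetical names (the real
AprEndpoint keeps one counter per poller and is more involved): if
Poll.remove() reports success for a socket the poll loop has already
consumed, the free-slot counter is incremented a second time and the
derived size goes negative.

// Minimal sketch with simplified, partly hypothetical names; not the
// actual AprEndpoint code.
class PollerAccountingSketch {
    static final int APR_SUCCESS = 0;
    static final int APR_NOTFOUND = -1; // placeholder, not APR's real value

    private final int actualPollerSize = 8192; // capacity of the native pollset
    private int pollerSpace = 8192;            // free slots; size = capacity - free

    // Poller thread: sockets returned by poll() leave the pollset, freeing slots.
    synchronized void onPollReturned(int numReturned) {
        pollerSpace += numReturned;
    }

    // Adding a socket consumes a slot.
    synchronized void addToPoller() {
        pollerSpace--;
    }

    // Exec thread, during error handling. Bug pattern: the native Poll.remove()
    // returns APR_SUCCESS even when the socket was already gone (s->pe == NULL),
    // so the slot is freed a second time and size() goes negative -- the -1
    // seen in the logs above.
    synchronized void removeFromPoller(int rv) {
        if (rv != APR_NOTFOUND) {
            pollerSpace++;
        }
    }

    synchronized int size() {
        return actualPollerSize - pollerSpace;
    }
}

With an APR_NOTFOUND return value for the already-removed case, the
increment is skipped and the bookkeeping stays consistent.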

I don't have a real solution yet, because I'm not sure which of the two
observations is the real culprit, concurrency or "wrong" return code.

Regards,

Rainer



Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Rainer Jung <ra...@kippdata.de>.
On 22.09.2013 15:39, Rainer Jung wrote:
> On 22.09.2013 13:17, Rainer Jung wrote:
>> I debugged around my occasional failures for TestCoyoteAdapter when
>> using APR.
>>
>> Error is:
>>
>> SEVERE [http-apr-127.0.0.1-auto-13-Poller]
>> org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with
>> error [81] : [File descriptor in bad state]
> 
> ...
> 
> The following patch seems to fix it for me, at least 150 test runs for
> TestCoyoteAdapter were successful:
> 
> Index: ../native/branches/1.1.x/native/src/poll.c
> ===================================================================
> --- ../native/branches/1.1.x/native/src/poll.c  (revision 1525348)
> +++ ../native/branches/1.1.x/native/src/poll.c  (working copy)
> @@ -259,7 +259,7 @@
> 
>      if (s->pe == NULL) {
>          /* Already removed */
> -        return APR_SUCCESS;
> +        return APR_NOTFOUND;
>      }
>      fd.desc_type   = APR_POLL_SOCKET;
>      fd.desc.s      = s->sock;
> 
> 
> I'll let the full test suite for APR run a couple of times as a more
> thorough check. I hope the change doesn't break the correctness of the
> connection counting, which is only done after successful return of remove.

I see only failures in TestCoyoteOutputStream when running the full test
suite for APR, all other tests look OK.

When running TestCoyoteOutputStream alone in a loop it also occasionally
fails for bio and nio (2/10), and for apr somewhat more often (4/10), so
the failure seems unrelated to the poll.c change. Actually there are two
types of failures for TestCoyoteOutputStream: mostly a 300 second read
timeout in testNonBlockingWriteNoneBlockingWriteOnce, and once for nio a
"java.io.IOException: Bogus chunk size" in
testNonBlockingWriteOnceBlockingWriteOnce.

Regards,

Rainer



Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Mladen Turk <mt...@apache.org>.
On 09/23/2013 09:54 AM, Rainer Jung wrote:
> On 23.09.2013 07:27, Mladen Turk wrote:
>>>
>>
>> The patch seems fine. I mean any return value should do in theory.
>> The main question is why a particular socket is removed twice
>> from the Poller.
>
> I agree that there's probably another problem further up the stack.

It would be good to check for thread sync issues.
Maintain and socket close must be serialized.
If the socket is closed while the poller is in maintain with the
same socket, it might lead to this issue.

Also, I'm not sure whether we have any code that calls either poll() or
maintain() without the doRemove argument set to true.
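
To illustrate the serialization, a minimal sketch assuming hypothetical
names (the real Poller code is considerably more involved): maintain()
and the close path take the same lock, so a socket cannot be destroyed
while maintain() is iterating over it.

// Minimal sketch, hypothetical names; only the locking discipline matters.
class SerializedPoller {
    private final Object pollerLock = new Object();

    void maintain() {
        synchronized (pollerLock) {
            // scan the timeout list, remove expired sockets from the pollset
        }
    }

    void closeSocket(long socket) {
        synchronized (pollerLock) {
            // remove from the pollset (if present), then destroy the socket
        }
    }
}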


Regards
-- 
^TM



Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Mark Thomas <ma...@apache.org>.
On 26/09/2013 02:47, Rainer Jung wrote:

> The TestCoyoteAdapter now neither fails with tcnative 1.1.28 nor with
> the additional APR_NOTFOUND change. I did 50 runs for each of the two
> versions and also for bio/nio/apr and none failed. Good!

+1

> I'm still a bit nervous about coupling the connection count decrement
> with the successful removal from the poller. I ran TestCoyoteAdapter
> with an additional log message which would tell me whether Poll.remove
> returned APR_NOTFOUND and also whether a remove wasn't successful for
> any reason. During 50 runs for bio/nio/apr each no such log output was
> written, so there were not unsuccessful or unneeded calls to Poll.remove.

I wrote it that way as there were times during testing when Poll.remove
was called twice for the same socket. This only happened rarely. Because
blocking writes in WebSocket require a socket to have concurrent read
and write threads, there were error conditions where both threads would
try to close the socket. Other concurrency fixes may have reduced /
removed the chances of multiple Poll.remove calls but I'd prefer to
leave the check in place.

> The TestCoyoteOutputStream still fails relatively often. This failure
> happens for bio, nio and apr. With the patched tcnative, I get 7/3/7
> failures out of 50 runs each for bio/nio/apr. Except for one failure all
> of them are hangs.

OK. Still some work to do here then. I haven't been able to reproduce
this yet. I want to fix the TLD issue for 8.0.0-RC3 reported on the
users list first, then I'll take a look at these issues. If someone wants
to take a look at fixing these before I get to them - I'm not going to
complain :)

> For apr the hang always happens in
> testNonBlockingWriteNoneBlockingWriteNone, for bio half of them in
> testNonBlockingWriteTwiceBlockingWriteOnce and the other half in
> testNonBlockingWriteNoneBlockingWriteNone, for nio one in
> testNonBlockingWriteOnceBlockingWriteOnce and one in
> testNonBlockingWriteTwiceBlockingWriteOnce; the third nio failure is a
> non-hang with the following failure:
> 
> Testcase: testNonBlockingWriteTwiceBlockingWriteNone took 4.146 sec
>         Caused an ERROR
> Bogus chunk size
> java.io.IOException: Bogus chunk size
>         at
> sun.net.www.http.ChunkedInputStream.processRaw(ChunkedInputStream.java:319)
>         at
> sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:572)
>         at
> sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:609)
>         at
> sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:696)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at
> sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3053)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.FilterInputStream.read(FilterInputStream.java:107)
>         at
> org.apache.catalina.startup.TomcatBaseTest.methodUrl(TomcatBaseTest.java:264)
>         at
> org.apache.catalina.startup.TomcatBaseTest.getUrl(TomcatBaseTest.java:219)
>         at
> org.apache.catalina.connector.TestCoyoteOutputStream.doNonBlockingTest(TestCoyoteOutputStream.java:91)
>         at
> org.apache.catalina.connector.TestCoyoteOutputStream.testNonBlockingWriteTwiceBlockingWriteNone(TestCoyoteOutputStream.java:54)
> 
> 
> 
> The hangs, the common case, always have the following exception:
> 
> Exception in thread "http-apr-127.0.0.1-auto-3-exec-3"
> java.lang.IllegalStateException: Calling [asyncDispatch()] is not valid
> for a request with Async state [READ_WRITE_OP]
>         at
> org.apache.coyote.AsyncStateMachine.asyncDispatch(AsyncStateMachine.java:282)
>         at
> org.apache.coyote.http11.Http11AprProcessor.actionInternal(Http11AprProcessor.java:483)
>         at
> org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp11Processor.java:838)
>         at org.apache.coyote.Request.action(Request.java:373)
>         at
> org.apache.catalina.core.AsyncContextImpl.dispatch(AsyncContextImpl.java:234)
>         at
> org.apache.catalina.core.AsyncContextImpl.dispatch(AsyncContextImpl.java:191)
>         at
> org.apache.catalina.connector.TestCoyoteOutputStream$NonBlockingWriteServlet.doAsyncWrite(TestCoyoteOutputStream.java:143)
>         at
> org.apache.catalina.connector.TestCoyoteOutputStream$NonBlockingWriteServlet.access$000(TestCoyoteOutputStream.java:107)
>         at
> org.apache.catalina.connector.TestCoyoteOutputStream$NonBlockingWriteServlet$AsyncTask.run(TestCoyoteOutputStream.java:163)
>         at
> org.apache.catalina.core.AsyncContextImpl$RunnableWrapper.run(AsyncContextImpl.java:568)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:724)
> 
> 
> The thread dump shows the main thread hanging, waiting for an HTTP
> response:
> 
> "main" prio=3 tid=0x00029800 nid=0x2 runnable [0xfdf7d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         - locked <0xe6355af8> (a java.io.BufferedInputStream)
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
>         at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
>         - locked <0xe6099b20> (a
> sun.net.www.protocol.http.HttpURLConnection)
>         at
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
>         at
> org.apache.catalina.startup.TomcatBaseTest.methodUrl(TomcatBaseTest.java:247)
>         at
> org.apache.catalina.startup.TomcatBaseTest.getUrl(TomcatBaseTest.java:219)
>         at
> org.apache.catalina.connector.TestCoyoteOutputStream.doNonBlockingTest(TestCoyoteOutputStream.java:91)
>         at
> org.apache.catalina.connector.TestCoyoteOutputStream.testNonBlockingWriteNoneBlockingWriteNone(TestCoyoteOutputStream.java:44)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>         at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>         at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>         at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>         at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>         at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>         at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>         at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>         at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>         at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>         at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>         at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>         at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>         at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>         at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>         at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:38)
> 
> 
> Maybe we could set (if not already in place) a connect and a read
> timeout to let the tests proceed (which of course does not fix the failure).
> 
> The pool threads do not seem to work on anything during the hang.
> 
> Regards,
> 
> Rainer
> 




Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Rainer Jung <ra...@kippdata.de>.
Hi Mark,

On 25.09.2013 01:50, Mark Thomas wrote:
> On 24/09/2013 16:19, Mark Thomas wrote:
>> On 24/09/2013 08:16, Mark Thomas wrote:
>>> On 23/09/2013 00:54, Rainer Jung wrote:
>>>
>>>> I agree that there's probably another problem further up the stack.
>>>
>>> I'm looking at this now.
>>
>> I've found a couple of issues. Not sure if either of these are the root
>> cause of the remaining issues we see but I'll look into fixing them.
>>
>> - AprEndpoint.destroySocket() calls Poller.removeFromPoller() but that
>>   call isn't thread safe. I think we need a remove process similar to
>>   the add process.
>> - sockets are not always removed from the timeout list when they are
>>   destroyed
> 
> Fixed.
> 
> Rainer, over to you.

The TestCoyoteAdapter now neither fails with tcnative 1.1.28 nor with
the additional APR_NOTFOUND change. I did 50 runs for each of the two
versions and also for bio/nio/apr and none failed. Good!

I'm still a bit nervous about coupling the connection count decrement
with the successful removal from the poller. I ran TestCoyoteAdapter
with an additional log message which would tell me whether Poll.remove
returned APR_NOTFOUND and also whether a remove wasn't successful for
any reason. During 50 runs for bio/nio/apr each no such log output was
written, so there were not unsuccessful or unneeded calls to Poll.remove.

The TestCoyoteOutputStream still fails relatively often. This failure
happens for bio, nio and apr. With the patched tcnative, I get 7/3/7
failures out of 50 runs each for bio/nio/apr. Except for one failure all
of them are hangs.

For apr the hang always happens in
testNonBlockingWriteNoneBlockingWriteNone, for bio half of them in
testNonBlockingWriteTwiceBlockingWriteOnce and the other half in
testNonBlockingWriteNoneBlockingWriteNone, for nio one in
testNonBlockingWriteOnceBlockingWriteOnce and one in
testNonBlockingWriteTwiceBlockingWriteOnce; the third nio failure is a
non-hang with the following failure:

Testcase: testNonBlockingWriteTwiceBlockingWriteNone took 4.146 sec
        Caused an ERROR
Bogus chunk size
java.io.IOException: Bogus chunk size
        at
sun.net.www.http.ChunkedInputStream.processRaw(ChunkedInputStream.java:319)
        at
sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:572)
        at
sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:609)
        at
sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:696)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at
sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3053)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.FilterInputStream.read(FilterInputStream.java:107)
        at
org.apache.catalina.startup.TomcatBaseTest.methodUrl(TomcatBaseTest.java:264)
        at
org.apache.catalina.startup.TomcatBaseTest.getUrl(TomcatBaseTest.java:219)
        at
org.apache.catalina.connector.TestCoyoteOutputStream.doNonBlockingTest(TestCoyoteOutputStream.java:91)
        at
org.apache.catalina.connector.TestCoyoteOutputStream.testNonBlockingWriteTwiceBlockingWriteNone(TestCoyoteOutputStream.java:54)



The hangs, the common case, always have the following exception:

Exception in thread "http-apr-127.0.0.1-auto-3-exec-3"
java.lang.IllegalStateException: Calling [asyncDispatch()] is not valid
for a request with Async state [READ_WRITE_OP]
        at
org.apache.coyote.AsyncStateMachine.asyncDispatch(AsyncStateMachine.java:282)
        at
org.apache.coyote.http11.Http11AprProcessor.actionInternal(Http11AprProcessor.java:483)
        at
org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp11Processor.java:838)
        at org.apache.coyote.Request.action(Request.java:373)
        at
org.apache.catalina.core.AsyncContextImpl.dispatch(AsyncContextImpl.java:234)
        at
org.apache.catalina.core.AsyncContextImpl.dispatch(AsyncContextImpl.java:191)
        at
org.apache.catalina.connector.TestCoyoteOutputStream$NonBlockingWriteServlet.doAsyncWrite(TestCoyoteOutputStream.java:143)
        at
org.apache.catalina.connector.TestCoyoteOutputStream$NonBlockingWriteServlet.access$000(TestCoyoteOutputStream.java:107)
        at
org.apache.catalina.connector.TestCoyoteOutputStream$NonBlockingWriteServlet$AsyncTask.run(TestCoyoteOutputStream.java:163)
        at
org.apache.catalina.core.AsyncContextImpl$RunnableWrapper.run(AsyncContextImpl.java:568)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)


The thread dump shows the main thread hanging, waiting for an HTTP
response:

"main" prio=3 tid=0x00029800 nid=0x2 runnable [0xfdf7d000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        - locked <0xe6355af8> (a java.io.BufferedInputStream)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
        at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
        - locked <0xe6099b20> (a
sun.net.www.protocol.http.HttpURLConnection)
        at
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
        at
org.apache.catalina.startup.TomcatBaseTest.methodUrl(TomcatBaseTest.java:247)
        at
org.apache.catalina.startup.TomcatBaseTest.getUrl(TomcatBaseTest.java:219)
        at
org.apache.catalina.connector.TestCoyoteOutputStream.doNonBlockingTest(TestCoyoteOutputStream.java:91)
        at
org.apache.catalina.connector.TestCoyoteOutputStream.testNonBlockingWriteNoneBlockingWriteNone(TestCoyoteOutputStream.java:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
        at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
        at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:38)


Maybe we could set (if not already in place) a connect and a read
timeout to let the tests proceed (which of course does not fix the failure).

The pool threads do not seem to work on anything during the hang.

Regards,

Rainer



Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Rainer Jung <ra...@kippdata.de>.
On 25.09.2013 01:50, Mark Thomas wrote:
> On 24/09/2013 16:19, Mark Thomas wrote:
>> On 24/09/2013 08:16, Mark Thomas wrote:
>>> On 23/09/2013 00:54, Rainer Jung wrote:
>>>
>>>> I agree that there's probably another problem further up the stack.
>>>
>>> I'm looking at this now.
>>
>> I've found a couple of issues. Not sure if either of these are the root
>> cause of the remaining issues we see but I'll look into fixing them.
>>
>> - AprEndpoint.destroySocket() calls Poller.removeFromPoller() but that
>>   call isn't thread safe. I think we need a remove process similar to
>>   the add process.
>> - sockets are not always removed from the timeout list when they are
>>   destroyed
> 
> Fixed.
> 
> Rainer, over to you.

ACK, will be able to run test loops this evening/night.

Rainer




Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Mark Thomas <ma...@apache.org>.
On 24/09/2013 16:19, Mark Thomas wrote:
> On 24/09/2013 08:16, Mark Thomas wrote:
>> On 23/09/2013 00:54, Rainer Jung wrote:
>>
>>> I agree that there's probably another problem further up the stack.
>>
>> I'm looking at this now.
> 
> I've found a couple of issues. Not sure if either of these are the root
> cause of the remaining issues we see but I'll look into fixing them.
> 
> - AprEndpoint.destroySocket() calls Poller.removeFromPoller() but that
>   call isn't thread safe. I think we need a remove process similar to
>   the add process.
> - sockets are not always removed from the timeout list when they are
>   destroyed

Fixed.

Rainer, over to you.

Mark




Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Mark Thomas <ma...@apache.org>.
On 24/09/2013 08:16, Mark Thomas wrote:
> On 23/09/2013 00:54, Rainer Jung wrote:
> 
>> I agree that there's probably another problem further up the stack.
> 
> I'm looking at this now.

I've found a couple of issues. Not sure if either of these are the root
cause of the remaining issues we see but I'll look into fixing them.

- AprEndpoint.destroySocket() calls Poller.removeFromPoller() but that
  call isn't thread safe. I think we need a remove process similar to
  the add process (a sketch of that idea follows below).
- sockets are not always removed from the timeout list when they are
  destroyed
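
For the first item, one way to mirror the add process is to let exec
threads only enqueue a remove request and have the poller thread drain
the queue between poll() calls. A minimal sketch, assuming hypothetical
names (the committed fix may well differ):

import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of a poller-thread-only remove process, mirroring the add process.
// Names and structure are illustrative, not the actual Tomcat fix.
class PollerWithRemoveQueue {
    private final ConcurrentLinkedQueue<Long> pendingRemoves =
            new ConcurrentLinkedQueue<>();

    // Called from any exec thread; never touches the native pollset directly.
    void requestRemove(long socket) {
        pendingRemoves.offer(socket);
    }

    // Called only from the poller thread between poll() calls, so the native
    // remove can never race the native poll on the same pollset.
    void processPendingRemoves() {
        Long socket;
        while ((socket = pendingRemoves.poll()) != null) {
            removeFromPoller(socket);
        }
    }

    private void removeFromPoller(long socket) {
        // native Poll.remove(...) plus pollerSpace/connectionCount accounting
    }
}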

Mark



Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Mark Thomas <ma...@apache.org>.
On 23/09/2013 00:54, Rainer Jung wrote:

> I agree that there's probably another problem further up the stack.

I'm looking at this now.

> Since the apr endpoint explicitly uses the return value of remove to
> decide whether something was removed, APR_NOTFOUND seems better. But as
> you said: why is it calling remove for a socket not in the poller? When
> the endpoint called remove for "2856392", the poller had size one and
> consisted only of socket 2218784. Strange.

I suspect that this is because the socket could be in the poller for
write while being processed for read (or the other way around). If the
IO operation fails, the socket needs to be removed from the poller
otherwise other bad things (tm) happen.

This is one of the complications caused by having to implement some
WebSocket writes as blocking.

<snip/>

> - since removeFromPoller() is part of error condition handling, and its
> return code is used to decide whether to adjust the externally tracked
> size of the poller and the connection count, the return code must be
> "correct". It is still open to me whether returning APR_NOTFOUND is a
> problem for the connection count tracking.

I was expecting a APR_NOTFOUND return value if the socket wasn't in the
Poller when I wrote that code.

> - we should check whether a 0 byte read after OPEN_READ really is a
> problem or must be expected even for a non-broken socket.

OPEN_READ should mean there is data available to read without blocking.
Therefore, a read returning 0 bytes looks like an error to me. Could it
be signalling EOF?
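
For what it's worth, at the native layer that is exactly what a 0-byte
result means: POSIX read() returns 0 (with a non-zero count requested)
only at EOF, i.e. after an orderly peer close, while "no data yet" on a
non-blocking socket shows up as -1 with EAGAIN. A small java.nio sketch
of the same distinction (illustrative only, not AprEndpoint code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Illustrative only. Note the difference from the native layer: POSIX
// read() returns 0 at EOF, while java.nio returns -1 at EOF and 0 when
// nothing is currently available.
class ReadEofSketch {
    static void onReadable(SocketChannel channel, ByteBuffer buf)
            throws IOException {
        int n = channel.read(buf);
        if (n == -1) {
            // Orderly peer shutdown (EOF): clean up, not an I/O error.
            channel.close();
        } else if (n == 0) {
            // Nothing to read right now: stay registered for OP_READ.
        } else {
            // n bytes read: process buf.
        }
    }
}

So the read(13, ..., 8192) = 0 in the truss output most likely means the
test client had already closed the connection.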

Mark




Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Rainer Jung <ra...@kippdata.de>.
On 23.09.2013 07:27, Mladen Turk wrote:
> On 09/22/2013 03:39 PM, Rainer Jung wrote:
>> On 22.09.2013 13:17, Rainer Jung wrote:
>>> I debugged around my occasional failures for TestCoyoteAdapter when
>>> using APR.
>>>
>>> Error is:
>>>
>>> SEVERE [http-apr-127.0.0.1-auto-13-Poller]
>>> org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with
>>> error [81] : [File descriptor in bad state]
>>
>> ...
>>
>>
>>> Not sure whether the problem is more in the concurrent poll plus remove,
>>> or the following code in poll.c:
>>>
>>> TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset,
>>>                                         jlong socket)
>>> {
>>>      apr_pollfd_t fd;
>>>      apr_status_t rv;
>>>      tcn_pollset_t *p = J2P(pollset,  tcn_pollset_t *);
>>>      tcn_socket_t  *s = J2P(socket, tcn_socket_t *);
>>>
>>>      UNREFERENCED_STDARGS;
>>>      TCN_ASSERT(socket != 0);
>>>
>>>      if (s->pe == NULL) {
>>>          /* Already removed */
>>>          return APR_SUCCESS;
>>>      }
>>>
>>> Here we return APR_SUCCESS and the code calling Poll.remove in
>>> AprEndpoint always does:
>>>
>>>                      rv = Poll.remove(pollers[i], socket);
>>>                      if (rv != Status.APR_NOTFOUND) {
>>>                          pollerSpace[i]++;
>>>                          connectionCount--;
>>>                          break;
>>>                      }
>>>
>>> So the pollerSpace and connectionCount numbers are (in/de)cremented.
>>
>> The following patch seems to fix it for me, at least 150 test runs for
>> TestCoyoteAdapter were successful:
>>
>> Index: ../native/branches/1.1.x/native/src/poll.c
>> ===================================================================
>> --- ../native/branches/1.1.x/native/src/poll.c  (revision 1525348)
>> +++ ../native/branches/1.1.x/native/src/poll.c  (working copy)
>> @@ -259,7 +259,7 @@
>>
>>       if (s->pe == NULL) {
>>           /* Already removed */
>> -        return APR_SUCCESS;
>> +        return APR_NOTFOUND;
>>       }
>>       fd.desc_type   = APR_POLL_SOCKET;
>>       fd.desc.s      = s->sock;
>>
>>
> 
> The patch seems fine. I mean any return value should do in theory.
> The main question is why a particular socket is removed twice
> from the Poller. This is called directly from Java code,
> so the wrapper seems to call it twice (or more).
> I suspect that the socket is first closed and then the Poller loop
> removes it. Or it can be removed by poll with doRemove == true or
> during pollset maintain. In any case, after a socket has been removed
> either by poll or by maintain, the returned set of removed sockets
> must be invalidated in the pollset so it is not used again in an
> explicit remove.

I agree that there's probably another problem further up the stack.
Since the apr endpoint explicitly uses the return value of remove to
decide whether something was removed, APR_NOTFOUND seems better. But as
you said: why is it calling remove for a socket not in the poller? When
the endpoint called remove for "2856392", the poller had size one and
consisted only of socket 2218784. Strange.

Here's the more complete log snippet:

SOCKET-1 exists, but will be destroyed before the problem happens,
SOCKET-2 is going to be accepted and later produces the problem,
SOCKET-3 comes into the picture only shortly before the problem happens.


12:19:12.844 FINE [13-exec-4] AprEndpoint$Poller.add socket [SOCKET-1],
timeout [3,000], flags [1]

Here socket 2 is accepted

12:19:12.850 FINE [13-Acceptor-0] AprEndpoint.processSocketWithOptions
socket [SOCKET-2]

and added to the poller

12:19:12.853 FINE [13-exec-5] AprEndpoint$Poller.add socket [SOCKET-2],
timeout [3,000], flags [1]
12:19:12.858 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-1]
12:19:12.860 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 0 socket SOCKET-1
12:19:12.861 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Here adding to the poller actually happens

12:19:12.862 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2]
12:19:12.863 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 1 socket SOCKET-2
12:19:12.864 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Now the poller contains Sockets 1 and 2 and gets polled

12:19:12.865 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 of size 2 with timeout 2000
12:19:12.866 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 returned with 2

It returns both sockets: port_getn(6, 0x005D0CF8, 8192, 1, 0xB417F6F8)
 = 2 [0]

12:19:12.867 FINE [13-Poller] AprEndpoint$Poller.run Processing socket
[SOCKET-1] for event(s) [1]
12:19:12.869 FINE [13-Poller] AprEndpoint$Poller.run Processing socket
[SOCKET-2] for event(s) [1]

Socket 1 gets destroyed.

12:19:12.871 FINE [13-exec-6] AprEndpoint.destroySocket socket
[SOCKET-1], doIt [true]

Socket 2 is handled by an exec thread which adds it back to the poller:
port_associate(6, 4, 0x0000000D, 0x00000001, 0x0089AB88) = 0

12:19:12.878 FINE [13-exec-7] AprEndpoint$Poller.add socket [SOCKET-2],
timeout [3,000], flags [1]
12:19:12.880 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2]
12:19:12.881 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 0 socket SOCKET-2
12:19:12.882 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Now again polling the poller that now only contains socket 2

12:19:12.883 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 of size 1 with timeout 2000
12:19:12.884 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 returned with 1

Socket 2 is returned from the polling: port_getn(6, 0x005D0CF8, 8192, 1,
0xB417F6F8)   = 1 [0]

12:19:12.885 FINE [13-Poller] AprEndpoint$Poller.run Processing socket
[SOCKET-2] for event(s) [1]

exec thread 8 handles it and adds it back to the poller

12:19:12.890 FINE [13-exec-8] AprEndpoint$Poller.add socket [SOCKET-2],
timeout [3,000], flags [1]
12:19:12.892 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2]
12:19:12.893 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 0 socket SOCKET-2

Before Socket 2 has been completely added to the poller, another socket
(3) gets accepted: accept(5, 0xB397F608, 0xB397F5F4, SOV_DEFAULT)  = 16

12:19:12.898 FINE [13-Acceptor-0] AprEndpoint.processSocketWithOptions
socket [SOCKET-3]

exec thread 9 wants to add socket 3 to the poller:

12:19:12.901 FINE [13-exec-9] AprEndpoint$Poller.add socket [SOCKET-3],
timeout [3,000], flags [1]
12:19:12.894 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Now poller has size 1 and contains socket 2: port_associate(6, 4,
0x0000000D, 0x00000001, 0x0089ABE8) = 0


The poller gets polled

12:19:12.903 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 of size 1 with timeout 2000
12:19:12.905 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 returned with 1
12:19:12.906 FINE [13-Poller] AprEndpoint$Poller.run Processing socket
[SOCKET-2] for event(s) [1]

polling returns socket 2, size of poller is again 0, socket 3 waits to
be added: port_getn(6, 0x005D0CF8, 8192, 1, 0xB417F6F8)   = 1 [0]

12:19:12.909 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-3]
12:19:12.910 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 0 socket SOCKET-3
12:19:12.911 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Socket 3 is added to the poller, has now size 1: port_associate(6, 4,
0x00000010, 0x00000001, 0x0089AB88) = 0

Poller gets polled

12:19:12.912 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 of size 1 with timeout 2000

exec thread 10 starts to work on socket 2 returned by the earlier poll,
but then immediately calls remove for socket 2. That's the problem: the
socket was just returned by a poll, so why call remove?

12:19:12.914 WARNING [13-exec-10] AprEndpoint$Poller.removeFromPoller
Removing from poller number 0 of size 1 socket SOCKET-2

Here's what happens natively in the exec thread:

19372/141:      pollsys(0xB317EFC0, 1, 0xB317EF58, 0x00000000)  = 1
19372/141:              fd=13 ev=POLLIN rev=POLLIN
19372/141:              timeout: 3.000000000 sec
19372/141:      read(13, 0x00219468, 8192)                      = 0

It polls the socket with a timeout of 3 seconds, then reads from it and
gets 0 bytes. To me this looks like a SocketProcessor is called with
OPEN_READ, which calls handler.process with OPEN_READ. This finally gets
handled in the processor. I don't know what the expected behavior of
process() with starting state OPEN_READ is if the read returns 0 bytes.
Somehow an error condition seems to be triggered, either by the 0 bytes
or by a timeout. But a timeout is unlikely, because everything takes
only a few milliseconds.

The error condition leads to a call of destroySocket() which in turn
calls poller.removeFromPoller().
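
Condensed into a hypothetical sketch (the method names follow the prose
above, not the actual Tomcat source), the chain looks like this:

// Hypothetical condensation of the failure chain described above.
class FailureChainSketch {
    void onSocketReadyForRead(long socket) {
        // The poller returned the socket, so a SocketProcessor runs with
        // OPEN_READ and the processor reads from the socket.
        int n = nativeRead(socket);   // truss: read(13, ..., 8192) = 0
        if (n == 0) {
            // The 0-byte read is treated as an error condition ...
            destroySocket(socket);    // ... which ends in removeFromPoller()
        }
    }

    void destroySocket(long socket) {
        // Remove for a socket the poll has already consumed -> the double
        // accounting shown in the logs.
        removeFromPoller(socket);
        // close the native socket, decrement connectionCount, ...
    }

    int nativeRead(long socket) { return 0; }  // stub for the JNI read call
    void removeFromPoller(long socket) { }     // stub for the Poll.remove path
}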

So:

- since removeFromPoller() is part of error condition handling, and its
return code is used to decide whether to adjust the externally tracked
size of the poller and the connection count, the return code must be
"correct". It is still open to me whether returning APR_NOTFOUND is a
problem for the connection count tracking.

- we should check whether a 0 byte read after OPEN_READ really is a
problem or must be expected even for a non-broken socket.

Regards,

Rainer



Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Mladen Turk <mt...@apache.org>.
On 09/22/2013 03:39 PM, Rainer Jung wrote:
> On 22.09.2013 13:17, Rainer Jung wrote:
>> I debugged around my occasional failures for TestCoyoteAdapter when
>> using APR.
>>
>> Error is:
>>
>> SEVERE [http-apr-127.0.0.1-auto-13-Poller]
>> org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with
>> error [81] : [File descriptor in bad state]
>
> ...
>
>
>> Not sure whether the problem is more in the concurrent poll plus remove,
>> or the following code in poll.c:
>>
>> TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset,
>>                                         jlong socket)
>> {
>>      apr_pollfd_t fd;
>>      apr_status_t rv;
>>      tcn_pollset_t *p = J2P(pollset,  tcn_pollset_t *);
>>      tcn_socket_t  *s = J2P(socket, tcn_socket_t *);
>>
>>      UNREFERENCED_STDARGS;
>>      TCN_ASSERT(socket != 0);
>>
>>      if (s->pe == NULL) {
>>          /* Already removed */
>>          return APR_SUCCESS;
>>      }
>>
>> Here we return APR_SUCCESS and the code calling Poll.remove in
>> AprEndpoint always does:
>>
>>                      rv = Poll.remove(pollers[i], socket);
>>                      if (rv != Status.APR_NOTFOUND) {
>>                          pollerSpace[i]++;
>>                          connectionCount--;
>>                          break;
>>                      }
>>
>> So the pollerSpace and connectionCount numbers are (in/de)cremented.
>
> The following patch seems to fix it for me, at least 150 test runs for
> TestCoyoteAdapter were successful:
>
> Index: ../native/branches/1.1.x/native/src/poll.c
> ===================================================================
> --- ../native/branches/1.1.x/native/src/poll.c  (revision 1525348)
> +++ ../native/branches/1.1.x/native/src/poll.c  (working copy)
> @@ -259,7 +259,7 @@
>
>       if (s->pe == NULL) {
>           /* Already removed */
> -        return APR_SUCCESS;
> +        return APR_NOTFOUND;
>       }
>       fd.desc_type   = APR_POLL_SOCKET;
>       fd.desc.s      = s->sock;
>
>

The patch seems fine. I mean any return value should do in theory.
The main question is why a particular socket is removed twice
from the Poller. This is called directly from Java code,
so the wrapper seems to call it twice (or more).
I suspect that the socket is first closed and then the Poller loop removes it.
Or it can be removed by poll with doRemove == true or during pollset maintain.
In any case, after a socket has been removed either by poll or by maintain,
the returned set of removed sockets must be invalidated in the pollset so it
is not used again in an explicit remove.
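
A sketch of that invalidation idea, with hypothetical names: track which
sockets are currently in the pollset, clear the flag whenever poll() or
maintain() hands a socket back, and let an explicit remove for an
already-returned socket report "not found" instead of success. In the
real code this bookkeeping lives natively (s->pe in poll.c).

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch only; hypothetical names.
class PollsetMembership {
    static final int APR_SUCCESS = 0;
    static final int APR_NOTFOUND = -1; // placeholder, not APR's real value

    private final Set<Long> inPollset =
            Collections.synchronizedSet(new HashSet<Long>());

    void onAdded(long socket)    { inPollset.add(socket); }

    // poll()/maintain() implicitly removed these sockets: invalidate them.
    void onReturned(long socket) { inPollset.remove(socket); }

    int remove(long socket) {
        // An explicit remove for a socket no longer in the pollset must not
        // report success, otherwise the caller adjusts its counters again.
        return inPollset.remove(socket) ? APR_SUCCESS : APR_NOTFOUND;
    }
}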


Regards
-- 
^TM



Re: Issue in AprEndpoint detected by TestCoyoteAdapter

Posted by Rainer Jung <ra...@kippdata.de>.
On 22.09.2013 13:17, Rainer Jung wrote:
> I debugged around my occasional failures for TestCoyoteAdapter when
> using APR.
> 
> Error is:
> 
> SEVERE [http-apr-127.0.0.1-auto-13-Poller]
> org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with
> error [81] : [File descriptor in bad state]

...


> Not sure whether the problem is more in the concurrent poll plus remove,
> or the following code in poll.c:
> 
> TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset,
>                                        jlong socket)
> {
>     apr_pollfd_t fd;
>     apr_status_t rv;
>     tcn_pollset_t *p = J2P(pollset,  tcn_pollset_t *);
>     tcn_socket_t  *s = J2P(socket, tcn_socket_t *);
> 
>     UNREFERENCED_STDARGS;
>     TCN_ASSERT(socket != 0);
> 
>     if (s->pe == NULL) {
>         /* Already removed */
>         return APR_SUCCESS;
>     }
> 
> Here we return APR_SUCCESS and the code calling Poll.remove in
> AprEndpoint always does:
> 
>                     rv = Poll.remove(pollers[i], socket);
>                     if (rv != Status.APR_NOTFOUND) {
>                         pollerSpace[i]++;
>                         connectionCount--;
>                         break;
>                     }
> 
> So the pollerSpace and connectionCount numbers are (in/de)cremented.

The following patch seems to fix it for me, at least 150 test runs for
TestCoyoteAdapter were successful:

Index: ../native/branches/1.1.x/native/src/poll.c
===================================================================
--- ../native/branches/1.1.x/native/src/poll.c  (revision 1525348)
+++ ../native/branches/1.1.x/native/src/poll.c  (working copy)
@@ -259,7 +259,7 @@

     if (s->pe == NULL) {
         /* Already removed */
-        return APR_SUCCESS;
+        return APR_NOTFOUND;
     }
     fd.desc_type   = APR_POLL_SOCKET;
     fd.desc.s      = s->sock;


I'll let the full test suite for APR run a couple of times as a more
thorough check. I hope the change doesn't break the correctness of the
connection counting, which is only done after successful return of remove.

Regards,

Rainer
