Posted to dev@tomcat.apache.org by bu...@apache.org on 2017/11/08 12:21:28 UTC
[Bug 61736] New: NIO BlockPoller may stay at events() method for too long time
https://bz.apache.org/bugzilla/show_bug.cgi?id=61736
Bug ID: 61736
Summary: NIO BlockPoller may stay at events() method for too long time
Product: Tomcat 8
Version: 8.0.47
Hardware: PC
OS: Mac OS X 10.1
Status: NEW
Severity: enhancement
Priority: P2
Component: Connectors
Assignee: dev@tomcat.apache.org
Reporter: szlon@qq.com
Target Milestone: ----
Created attachment 35508
--> https://bz.apache.org/bugzilla/attachment.cgi?id=35508&action=edit
An HTTP client that can send data with a controlled packet size and packet
interval.
When Tomcat runs in NIO mode as a server receiving files, and the files are
transferred over an unstable network (with a noticeable interval between
neighboring TCP packets), the traffic regularly hangs for a long while with a
zero window reported by the server. The following steps reproduce the problem.
1. Configure the server TCP receive window size:
$ cat /proc/sys/net/ipv4/tcp_rmem
1024 2048 8192
2. Put the following JSP into the ROOT webapp (at the path
$CATALINA_HOME/webapps/ROOT/consume.jsp):
<%@ page import="org.apache.commons.io.IOUtils" %><%
    final ServletInputStream inputStream = request.getInputStream();
    byte[] content = IOUtils.toByteArray(inputStream);
%>ok
IOUtils is a class from Apache Commons IO.
3. Start Tomcat in NIO mode with the following settings:
Server version: Apache Tomcat/8.0.46
Server built: Aug 10 2017 10:10:31 UTC
Server number: 8.0.46.0
OS Name: Linux
OS Version: 3.10.0-229.el7.x86_64
Architecture: amd64
Java Home: /usr/jdk1.8.0_121/jre
JVM Version: 1.8.0_121-b13
JVM Vendor: Oracle Corporation
Command line argument: -Xms256m
Command line argument: -Xmx256m
Command line argument: -Xmn128m
Command line argument: -Xss1m
4. Capture the TCP traffic:
tcpdump -i any -w tcp.cap port 8080
5. Start a client that sends data over HTTP in chunks, with the following
settings:
chunk size                    1024 bytes
time to rest between chunks   50 ms
chunk count per request       100
parallel requests             100
total requests                10000
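For illustration, here is a minimal sketch of such a pacing client (the
attached client is the authoritative version; the host, port, path, and the
single sequential request here are simplifications of the settings above):
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sends one chunked POST, pushing fixed-size chunks with a pause between
// them, so neighboring TCP packets arrive with a noticeable interval.
public class PacedChunkClient {
    public static void main(String[] args) throws Exception {
        int chunkSize = 1024;   // chunk size: 1024 bytes
        int chunkCount = 100;   // chunk count per request: 100
        long restMillis = 50;   // time to rest between chunks: 50 ms
        try (Socket socket = new Socket("172.16.1.5", 8080)) {
            OutputStream out = socket.getOutputStream();
            out.write(("POST /consume.jsp HTTP/1.1\r\n"
                    + "Host: 172.16.1.5:8080\r\n"
                    + "Transfer-Encoding: chunked\r\n\r\n")
                    .getBytes(StandardCharsets.US_ASCII));
            byte[] chunk = new byte[chunkSize];
            for (int i = 0; i < chunkCount; i++) {
                out.write((Integer.toHexString(chunkSize) + "\r\n")
                        .getBytes(StandardCharsets.US_ASCII));
                out.write(chunk);
                out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
                out.flush();              // push each chunk on its own
                Thread.sleep(restMillis); // rest between chunks
            }
            // Terminating zero-length chunk ends the request body.
            out.write("0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            socket.getInputStream().read(); // wait for the response to start
        }
    }
}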
6. The capture then shows a lot of zero-window packets, and the stall lasts
several seconds:
No.     Stream  Time             Source      Destination  Ack   Win    Length  Info
469817  3201    15:09:04.175309  172.16.1.4  172.16.1.5   1     29696  57      [TCP segment of a reassembled PDU]
469904  3201    15:09:04.214945  172.16.1.5  172.16.1.4   4238  1024   54      8080→58750 [ACK] Seq=1 Ack=4238 Win=1024 Len=0
470091  3201    15:09:04.437137  172.16.1.4  172.16.1.5   1     29696  1078    [TCP Window Full] [TCP segment of a reassembled PDU]
470092  3201    15:09:04.437142  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
470334  3201    15:09:04.657120  172.16.1.4  172.16.1.5   1     29696  54      [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
470335  3201    15:09:04.657123  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
470620  3201    15:09:05.098135  172.16.1.4  172.16.1.5   1     29696  54      [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
470621  3201    15:09:05.098141  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
471017  3201    15:09:05.979136  172.16.1.4  172.16.1.5   1     29696  54      [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
471018  3201    15:09:05.979140  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
471619  3201    15:09:07.743148  172.16.1.4  172.16.1.5   1     29696  54      [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
471620  3201    15:09:07.743151  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
475765  3201    15:09:08.545625  172.16.1.5  172.16.1.4   5262  3072   54      [TCP Window Update] 8080→58750 [ACK] Seq=1 Ack=5262 Win=3072 Len=0
475781  3201    15:09:08.545815  172.16.1.4  172.16.1.5   1     29696  490     [TCP segment of a reassembled PDU]
475782  3201    15:09:08.545821  172.16.1.5  172.16.1.4   5698  3072   54      8080→58750 [ACK] Seq=1 Ack=5698 Win=3072 Len=0
475784  3201    15:09:08.545825  172.16.1.4  172.16.1.5   1     29696  1514    [TCP segment of a reassembled PDU]
After a lot of study, I found this is caused by a bug in the BlockPoller that
slows down its polling cycle considerably.
The BlockPoller always polls Runnable events and runs them until the events
queue is empty. But a RunnableAdd event may wake the Worker thread, and the
worker thread may then add yet more events to the queue. As a result the queue
can hardly shrink, the events() method can hardly return, and the next
selection is delayed.
// events() method of the BlockPoller thread
public boolean events() {
    boolean result = false;
    Runnable r = null;
    result = (events.size() > 0);
    while ((r = events.poll()) != null) {
        // Running the "r" event may cause more events to be added to the
        // queue; the queue then struggles to shrink and the loop to finish.
        r.run();
        result = true;
    }
    return result;
}
// The e1 and e2 lines in this RunnableAdd event wake the worker thread.
public void run() {
    SelectionKey sk = ch.keyFor(selector);
    try {
        if (sk == null) {
            sk = ch.register(selector, ops, key);
            ref.key = sk;
        } else if (!sk.isValid()) {
            cancel(sk, key, ops); // <-- e1
        } else {
            sk.interestOps(sk.interestOps() | ops);
        }
    } catch (CancelledKeyException cx) {
        cancel(sk, key, ops); // <-- e2
    } catch (ClosedChannelException cx) {
        cancel(null, key, ops);
    }
}
// When the worker is woken at line L3, it starts the next loop iteration
// and may add another event at line L2.
public int read(ByteBuffer buf, NioChannel socket, long readTimeout)
        throws IOException {
    SelectionKey key =
            socket.getIOChannel().keyFor(socket.getPoller().getSelector());
    if (key == null) throw new IOException("Key no longer registered");
    KeyReference reference = keyReferenceStack.pop();
    if (reference == null) {
        reference = new KeyReference();
    }
    NioSocketWrapper att = (NioSocketWrapper) key.attachment();
    int read = 0;
    boolean timedout = false;
    int keycount = 1;
    long time = System.currentTimeMillis();
    try {
        L1: while (!timedout) {
            if (keycount > 0) {
                read = socket.read(buf);
                if (read != 0) {
                    break;
                }
            }
            try {
                if (att.getReadLatch() == null ||
                        att.getReadLatch().getCount() == 0) att.startReadLatch(1);
                L2: poller.add(att, SelectionKey.OP_READ, reference);
                if (readTimeout < 0) {
                    att.awaitReadLatch(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
                } else {
                    L3: att.awaitReadLatch(readTimeout, TimeUnit.MILLISECONDS);
                }
            } catch (InterruptedException ignore) {
                // Ignore
            }
            if (att.getReadLatch() != null && att.getReadLatch().getCount() > 0) {
                keycount = 0;
            } else {
                keycount = 1;
                att.resetReadLatch();
            }
            if (readTimeout >= 0 && (keycount == 0))
                timedout = (System.currentTimeMillis() - time) >= readTimeout;
        } // while
        if (timedout)
            throw new SocketTimeoutException();
    } finally {
        poller.remove(att, SelectionKey.OP_READ);
        if (timedout && reference.key != null) {
            poller.cancelKey(reference.key);
        }
        reference.key = null;
        keyReferenceStack.push(reference);
    }
    return read;
}
In fact, there is a scenario in which exactly this happens:
1. [Worker]: adds an event and waits on the readLatch
2. [BlockPoller]: registers operation interest on the corresponding SelectionKey
3. [BlockPoller]: performs a selection
4. [BlockPoller]: wakes the Worker based on the selection result
5. [Worker]: wakes up and reads data successfully
6. [Worker]: adds a remove event (e1) to the queue and returns from the read
method
7. [Worker]: the caller invokes the read method again
8. [Worker]: reads data and gets nothing
9. [Worker]: adds an event (e2) via the poller.add(att, ops, ref) method
10. [Worker]: waits on the readLatch
11. [BlockPoller]: enters the next loop and executes the events() method,
polling events from the events queue and running them
12. [BlockPoller]: gets e1 and runs it; the SelectionKey turns invalid (until
the next selection, the SelectionKey stays invalid)
13. [BlockPoller]: gets e2 and runs it; since the SelectionKey is invalid, it
wakes the worker thread and discards the event
14. [BlockPoller]: polls other events from the queue and runs them
15. [Worker]: (before the queue is empty) wakes up and reads again, but gets
nothing
16. [Worker]: adds another event (e3) via the poller.add(att, ops, ref) method
17. [Worker]: waits on the readLatch again
18. [BlockPoller]: gets e3, runs it, finds the key invalid again, wakes the
worker thread, and discards the event
19. [Worker]: is woken up again and adds an event again...
Steps 11 to 17 repeat over and over (step 18 and onwards are a replay of the
earlier steps), provided that each time the event added in step 16 reaches the
queue before the queue is drained empty, because the BlockPoller always tries
to empty the queue.
The BlockPoller and the worker then enter a strange loop, in which the
BlockPoller keeps trying to empty the queue while the worker keeps adding more
events to it.
My solution is to stop the BlockPoller from trying to empty the queue before
moving on to the selection step. Instead, it should handle only the events
already in the queue when it starts handling them; events added later are not
handled during this BlockPoller loop but are deferred to the next one. The
selection operation removes all invalid SelectionKeys, so by the time of the
next loop the problem cannot recur, since the SelectionKeys are all fresh and
valid.
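A minimal sketch of this idea, modelled on the events() method quoted above
(the proposed patch referenced in comment 1 below is the authoritative
version):
// Snapshot the queue size first and run at most that many events. Events
// enqueued while the loop runs are deferred to the next poller iteration,
// after the selection has removed the invalid SelectionKeys.
public boolean events() {
    boolean result = false;
    int size = events.size();
    for (int i = 0; i < size; i++) {
        Runnable r = events.poll();
        if (r == null) {
            break;  // queue drained early
        }
        r.run();    // may enqueue new events; they wait for the next loop
        result = true;
    }
    return result;
}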
--- Comment #1 from Mark Thomas <ma...@apache.org> ---
See https://github.com/apache/tomcat/pull/81 for a proposed patch.
ZILONG SONG <sz...@qq.com> changed:
           What |Removed |Added
           --------------------------------
           CC   |        |szlon@qq.com
ZILONG SONG <sz...@qq.com> changed:
           What     |Removed     |Added
           ------------------------------------
           Severity |enhancement |major
--- Comment #2 from Mark Thomas <ma...@apache.org> ---
Thanks for the detailed report. To have such a complicated issue explained so
clearly - and with a patch! - is extremely helpful.
I can see how the problem happens and I can also see how your proposed patch
addresses it. I'm not concerned about delaying new events to the next loop.
The events queue is a custom implementation, so I want to see whether that
makes an alternative fix possible, but I haven't found anything so far (and
probably won't).
--- Comment #3 from ZILONG SONG <sz...@qq.com> ---
Yes, the queue is a custom implementation, but it still follows the rules of a
non-blocking queue. I'm afraid the problem is caused not by how the queue
works, but by how we use it.
Perhaps a more fundamental cause is that cancelled keys are still cached. A
cancelled key is invalid and cannot be used again, so it should not be cached.
If invalid keys were not cached, or if they could recover from the invalid
state, the Worker thread would not be woken while events are handled, and the
problem could be avoided.
However, how the SelectionKey behaves is controlled by the JDK and is out of
our control. I did not find any way other than performing a selection to clear
the cached invalid SelectionKeys. So I think the most effective action is to
stop the BlockPoller from spending too much time in the loop that polls and
handles events, and to let the next selection clear the invalid SelectionKeys
as soon as possible.
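The following self-contained sketch (illustrative, not part of this bug's
attachments) shows the JDK behaviour described above: a cancelled key stays
cached as an invalid key until the next selection operation flushes the
selector's cancelled-key set.
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class CancelledKeyDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);

        key.cancel();
        // Invalid, yet still cached: keyFor() keeps returning the dead key,
        // and re-registering the channel now hits CancelledKeyException.
        System.out.println(key.isValid());                           // false
        System.out.println(pipe.source().keyFor(selector) != null);  // true

        selector.selectNow(); // the selection flushes the cancelled-key set
        System.out.println(pipe.source().keyFor(selector));          // null
    }
}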
--- Comment #4 from Mark Thomas <ma...@apache.org> ---
My point was not that the queue was the root cause. My point was that, because
the queue implementation is a custom one, it *might* offer a cleaner way to
implement the polling of the queue than getting the current size and then
requesting exactly that many items from the queue.
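Purely as an illustration of that alternative, a bounded drain on the queue
side might look like the following; the class and method names are invented
here, and a standard ConcurrentLinkedQueue stands in for Tomcat's custom
queue:
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical queue wrapper: callers ask it to run at most 'limit' events,
// instead of reading the size and looping themselves.
final class BoundedEventQueue {
    private final ConcurrentLinkedQueue<Runnable> events =
            new ConcurrentLinkedQueue<>();

    void offer(Runnable event) {
        events.offer(event);
    }

    // Runs up to 'limit' queued events; anything enqueued meanwhile is left
    // for the next call, matching the behaviour the patch aims for.
    boolean runUpTo(int limit) {
        boolean ran = false;
        for (int i = 0; i < limit; i++) {
            Runnable r = events.poll();
            if (r == null) break;
            r.run();
            ran = true;
        }
        return ran;
    }
}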
Mark Thomas <ma...@apache.org> changed:
           What       |Removed |Added
           ----------------------------------
           Resolution |---     |FIXED
           Status     |NEW     |RESOLVED
--- Comment #5 from Mark Thomas <ma...@apache.org> ---
There didn't seem to be any benefit to implementing the limit in the queue
rather than getting the size and looping, so I went with the patch as proposed.
Fixed in:
- trunk for 9.0.2 onwards
- 8.5.x for 8.5.24 onwards
- 8.0.x for 8.0.48 onwards
- 7.0.x for 7.0.83 onwards
Thanks again for the analysis and the patch.
Christopher Schultz <ch...@christopherschultz.net> changed:
           What |Removed       |Added
           -----------------------------------
           OS   |Mac OS X 10.1 |All