Posted to dev@storm.apache.org by "Frantz Mazoyer (JIRA)" <ji...@apache.org> on 2015/09/01 14:14:45 UTC
[jira] [Created] (STORM-1023) Nimbus server hogs 100% CPU and clients are stuck
Frantz Mazoyer created STORM-1023:
-------------------------------------
Summary: Nimbus server hogs 100% CPU and clients are stuck
Key: STORM-1023
URL: https://issues.apache.org/jira/browse/STORM-1023
Project: Apache Storm
Issue Type: Bug
Affects Versions: 0.9.3, 0.10.0, 0.9.4, 0.11.0, 0.9.5, 0.9.6
Environment: Storm 0.9.5 / thrift 0.7
Reporter: Frantz Mazoyer
The testing environment is Storm 0.9.5 with Thrift Java 0.7.
Test scenario:
Deploy a Storm topology in a loop.
When the Nimbus cleanup timeout is reached, the Thrift server throws an error:
"Exception while invoking ..." ... TException
Test result:
The Thrift Java server in Nimbus goes to 100% CPU in an infinite loop, as seen in:
jstack:
{code}
"Thread-5" prio=10 tid=0x00007fb134aab800 nid=0x6767 runnable [0x00007fb129c9b000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
...
at org.apache.thrift7.server.TNonblockingServer$SelectThread.select(TNonblockingServer.java:284)
{code}
strace:
{code}
epoll_wait(70, {{EPOLLIN, {u32=866, u64=866}}, {EPOLLIN, {u32=876, u64=876}}}, 4096, 4294967295) = 2
{code}
Investigation and tests show that:
Any exception thrown during processor execution bypasses the call to {code} responseReady() {code}, so the decrement {code} readBufferBytesAllocated.addAndGet(-buffer_.array().length); {code} never runs and the counter is not reduced by the size of the request buffer.
After enough failed requests, this counter approaches the maximum value MAX_READ_BUFFER_BYTES. Any subsequent request is then delayed forever, because the following test in {code} read() {code}:
{code} if (readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES) {code} is always true.
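The accounting problem described above can be modelled in isolation. The following is a minimal sketch, not Thrift's actual code: the class name, the handleRequest() and failuresUntilStuck() helpers, and the small limit of 1024 bytes are all assumptions chosen for illustration. Only the counter name, the constant name, and the shape of the admission test come from the report.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of the read-buffer accounting; illustrative only, not Thrift code. */
public class ReadBufferLeakDemo {
    // Hypothetical limit, kept small so the effect shows after a few requests.
    static final long MAX_READ_BUFFER_BYTES = 1024;
    static final AtomicLong readBufferBytesAllocated = new AtomicLong(0);

    /** Same shape as the test quoted from read(): true means the frame must wait. */
    static boolean mustDelay(int frameSize) {
        return readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES;
    }

    /** Handles one request; when the processor throws, the decrement is skipped. */
    static void handleRequest(int frameSize, boolean processorThrows) {
        readBufferBytesAllocated.addAndGet(frameSize); // buffer reserved for the frame
        try {
            if (processorThrows) {
                throw new RuntimeException("Exception while invoking ...");
            }
            // Stands in for the decrement that runs via responseReady().
            readBufferBytesAllocated.addAndGet(-frameSize);
        } catch (RuntimeException e) {
            // The error is swallowed here, but the reserved bytes are never released.
        }
    }

    /** Counts how many failed requests it takes before the next frame is refused. */
    static int failuresUntilStuck(int frameSize) {
        readBufferBytesAllocated.set(0);
        int failures = 0;
        while (!mustDelay(frameSize)) {
            handleRequest(frameSize, true);
            failures++;
        }
        return failures;
    }

    public static void main(String[] args) {
        int failures = failuresUntilStuck(256);
        System.out.println("failed requests before new frames are delayed: " + failures);
        System.out.println("bytes still accounted: " + readBufferBytesAllocated.get());
    }
}
```

In this model a successful request leaves the counter unchanged, while every failed request permanently consumes one frame's worth of the budget, so the admission test eventually stays true with no request in flight.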
In the end, the server thread loops in select(), which wakes up immediately for read() because the content of the socket is never drained.
The thread cycles forever between select() and the read() method above, which pins the server thread at 100% CPU.
Moreover, all client requests are stuck forever.
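The busy loop follows from readiness being reported for as long as unread data is pending. The sketch below illustrates that behaviour with plain java.nio, using a Pipe in place of a client socket. The class and method names are hypothetical, and a one-second select timeout is added only so the demo cannot hang; it is not taken from the Thrift server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

/** Shows that select() keeps reporting a channel as readable while its data is left unread. */
public class UndrainedSelectDemo {

    /** Returns how many of the given select() rounds reported the channel as ready. */
    static int readyRoundsWithoutDraining(int rounds) {
        try {
            Pipe pipe = Pipe.open();
            try (Selector selector = Selector.open();
                 Pipe.SinkChannel sink = pipe.sink();
                 Pipe.SourceChannel source = pipe.source()) {
                source.configureBlocking(false);
                source.register(selector, SelectionKey.OP_READ);
                sink.write(ByteBuffer.wrap(new byte[] {42})); // one pending byte, never read

                int ready = 0;
                for (int i = 0; i < rounds; i++) {
                    // The timeout is a safety net for the demo only.
                    if (selector.select(1000) > 0) {
                        ready++;
                    }
                    // Clear the selected set, as a server loop does after visiting its keys,
                    // but deliberately skip reading from the channel.
                    selector.selectedKeys().clear();
                }
                return ready;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ready rounds: " + readyRoundsWithoutDraining(5));
    }
}
```

Because the byte is never consumed, every round returns without blocking, which mirrors the epoll_wait calls in the strace output returning immediately with EPOLLIN events.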