Posted to dev@storm.apache.org by "Frantz Mazoyer (JIRA)" <ji...@apache.org> on 2015/09/01 14:14:45 UTC

[jira] [Created] (STORM-1023) Nimbus server hogs 100% CPU and clients are stuck

Frantz Mazoyer created STORM-1023:
-------------------------------------

             Summary: Nimbus server hogs 100% CPU and clients are stuck 
                 Key: STORM-1023
                 URL: https://issues.apache.org/jira/browse/STORM-1023
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 0.9.3, 0.10.0, 0.9.4, 0.11.0, 0.9.5, 0.9.6
         Environment: Storm 0.9.5 / thrift 0.7
            Reporter: Frantz Mazoyer


The testing environment is Storm 0.9.5 / Thrift Java 0.7.
Test scenario: 
  Deploy a Storm topology in a loop.
  When the Nimbus cleanup timeout is reached, the Thrift server throws an error: 
  "Exception while invoking ..." ... TException

Test result:
  The Thrift Java server in Nimbus goes to 100% CPU, looping infinitely in:

jstack:
{code}
"Thread-5" prio=10 tid=0x00007fb134aab800 nid=0x6767 runnable [0x00007fb129c9b000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        ...
        at org.apache.thrift7.server.TNonblockingServer$SelectThread.select(TNonblockingServer.java:284)
{code}

strace:
{code}
epoll_wait(70, {{EPOLLIN, {u32=866, u64=866}}, {EPOLLIN, {u32=876, u64=876}}}, 4096, 4294967295) = 2
{code}

Investigation and tests show the following:
Any exception thrown during processor execution bypasses the call to {{responseReady()}}, so the statement {{readBufferBytesAllocated.addAndGet(-buffer_.array().length);}} is never executed and the counter {{readBufferBytesAllocated}} is not decremented by the size of the request buffer.
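
For reference, a simplified sketch of the failure path, paraphrased from the identifiers above rather than copied from the thrift 0.7 sources ({{responseReady()}}, {{readBufferBytesAllocated}} and {{buffer_}} come from this report; other names such as {{processor_}} are illustrative):
{code}
// Simplified sketch of the frame-buffer invocation path (illustrative, not verbatim).
public void invoke() {
  try {
    // If the processor throws here (the "Exception while invoking ..." TException),
    // control jumps straight to the catch block below...
    processor_.process(inProt_, outProt_);
    responseReady();   // ...and this call is skipped,
    return;
  } catch (Exception e) {
    LOGGER.warn("Exception while invoking ...", e);
    // ...so the read-buffer accounting below never runs for this request.
  }
}

public void responseReady() {
  // Give back the bytes reserved for this request's read buffer.
  // Skipped on the exception path, so readBufferBytesAllocated only ever grows.
  readBufferBytesAllocated.addAndGet(-buffer_.array().length);
  // ... then hand the response back to the selector thread ...
}
{code}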

After enough failed requests, this counter gets close to the maximum value {{MAX_READ_BUFFER_BYTES}}, causing every subsequent request to be delayed forever, because the following test in {{read()}} is always true:
{code}
if (readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES)
{code}
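
To make the accounting leak concrete, here is a minimal standalone simulation; the limit and frame size are made-up values, and only the gating condition mirrors the {{read()}} check quoted above:
{code}
import java.util.concurrent.atomic.AtomicLong;

// Standalone simulation of the leak: the counter is charged for each request
// but, as on the exception path above, never credited back.
public class ReadBufferLeakDemo {
  static final long MAX_READ_BUFFER_BYTES = 64 * 1024;        // assumed limit for the demo
  static final AtomicLong readBufferBytesAllocated = new AtomicLong();

  public static void main(String[] args) {
    long frameSize = 8 * 1024;                                 // assumed request frame size
    for (int request = 1; request <= 10; request++) {
      // read() path: refuse the frame if accepting it would exceed the limit.
      if (readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES) {
        System.out.println("request " + request + ": blocked, counter stuck at "
            + readBufferBytesAllocated.get());
        continue;                                              // request is never served
      }
      readBufferBytesAllocated.addAndGet(frameSize);
      // The processor throws, responseReady() is skipped, so the matching
      // addAndGet(-frameSize) never happens: the counter only grows.
    }
  }
}
{code}
Once the counter is within one frame of {{MAX_READ_BUFFER_BYTES}}, every new request hits the "blocked" branch and is never served.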

In the end, the server thread loops in {{select()}}, which immediately wakes up for {{read()}} since the content of the socket was never drained.

The thread loops forever between the {{select()}} and {{read()}} methods above, pinning the server thread at 100% CPU.
Moreover, all client requests are stuck forever.
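
The busy loop itself is a generic NIO pattern: if a channel stays readable but its pending data is never consumed, {{select()}} returns immediately on every call. A minimal standalone illustration (not Storm/thrift code; it just uses {{java.nio.channels.Pipe}} to create a channel with undrained data):
{code}
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Standalone illustration of the busy loop: a readable channel whose data is
// never drained makes select() return immediately on every iteration.
public class SelectBusyLoopDemo {
  public static void main(String[] args) throws Exception {
    Pipe pipe = Pipe.open();
    pipe.sink().write(ByteBuffer.wrap("pending request".getBytes()));

    Selector selector = Selector.open();
    pipe.source().configureBlocking(false);
    pipe.source().register(selector, SelectionKey.OP_READ);

    for (int i = 0; i < 5; i++) {
      long start = System.nanoTime();
      int ready = selector.select();      // returns immediately: data is still pending
      long micros = (System.nanoTime() - start) / 1_000;
      System.out.println("select() returned " + ready + " ready key(s) after " + micros + " us");
      selector.selectedKeys().clear();    // the data is never read(), so the loop spins
    }
  }
}
{code}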



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)