Posted to issues@storm.apache.org by "Milad Fatenejad (JIRA)" <ji...@apache.org> on 2016/09/21 02:57:20 UTC

[jira] [Created] (STORM-2108) Spout unable to recover after worker crashes...continuously see discarding messages errors...

Milad Fatenejad created STORM-2108:
--------------------------------------

             Summary: Spout unable to recover after worker crashes...continuously see discarding messages errors...
                 Key: STORM-2108
                 URL: https://issues.apache.org/jira/browse/STORM-2108
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-core
    Affects Versions: 1.0.2
         Environment: Ubuntu 14.04, Java 8, Docker 1.11.2
            Reporter: Milad Fatenejad


Hello:

We hit a new situation after upgrading to Storm 1.0.2 (from 0.9.2). A worker crashed due to a bug in our code that caused a StackOverflowError. The supervisor detected the issue and restarted the worker as expected.

After the worker crashed, many of the tuples the spout sends out continuously time out and our throughput slows to a crawl. The spout emits tuples until it hits the max spout pending limit; then some of those tuples time out and it sends the next batch.
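For reference, the behavior described above is governed by these topology settings in storm.yaml (the values shown here are illustrative defaults, not necessarily our production values):

```
# Spout stops emitting once this many tuples are pending (unacked).
topology.max.spout.pending: 1000
# Tuples not fully acked within this window are failed and may be replayed.
topology.message.timeout.secs: 30
```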

We saw this exception in the spout log when the worker crashed:

2016-09-21T01:54:32,749 [refresh-connections-timer] [org.apache.storm.messaging.netty.Client] [INFO]> closing Netty Client Netty-Client-/10.103.16.14:31437
2016-09-21T01:54:32,750 [refresh-connections-timer] [org.apache.storm.messaging.netty.Client] [INFO]> waiting up to 600000 ms to send 0 pending messages to Netty-Client-/10.103.16.14:31437
2016-09-21T01:55:35,925 [Netty-server-localhost-31009-worker-1] [org.apache.storm.messaging.netty.StormServerHandler] [ERROR]> server errors in handling the request
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_77]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_77]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_77]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_77]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_77]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-1.0.2.jar:1.0.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]

And now we just continuously see these messages in the spout logs:

2016-09-21T02:03:35,513 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:35,644 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:35,774 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:35,817 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:35,849 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
[... the same "discarding 1 messages" error repeats continuously, several times per second ...]

The worker that died (10.103.16.14:31437) was restarted by the supervisor, but I don't see any log messages indicating that it is receiving any tuples. The "is being closed" messages in the spout logs make me think that Storm has failed to close its connection to the old worker.
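In case it is relevant, the Netty client reconnect behavior is controlled by these storm.yaml settings (key names and values shown are from the 1.0.x defaults.yaml as I understand it; I am not certain all of them still apply to the client close/reconnect path):

```
storm.messaging.netty.max_retries: 300     # reconnect attempts before giving up
storm.messaging.netty.min_wait_ms: 100     # minimum backoff between reconnect attempts
storm.messaging.netty.max_wait_ms: 1000    # maximum backoff between reconnect attempts
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.server_worker_threads: 1
```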

This has happened to us repeatedly since the upgrade. Does anyone have any suggestions about how to fix this issue? I originally thought it might be related to STORM-1560, but I don't see the exception mentioned in that ticket.

Thanks, and I would greatly appreciate any help.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)