Posted to dev@storm.apache.org by "Nico Meyer (JIRA)" <ji...@apache.org> on 2016/06/16 10:56:05 UTC

[jira] [Commented] (STORM-1560) Topology stops processing after Netty catches/swallows Throwable

    [ https://issues.apache.org/jira/browse/STORM-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333566#comment-15333566 ] 

Nico Meyer commented on STORM-1560:
-----------------------------------

Is it certain that this problem affects version 1.0.0? I ask because I observed the same problem in 0.9.6 when workers restarted in quick succession. That often happens when one worker crashes, since nimbus tends to reassign tasks right around the time the supervisor restarts the crashed worker (most likely because both heartbeat timeouts default to 30 seconds).
I had a patch that solved it, but it is no longer necessary for 1.0.0.

> Topology stops processing after Netty catches/swallows Throwable
> ----------------------------------------------------------------
>
>                 Key: STORM-1560
>                 URL: https://issues.apache.org/jira/browse/STORM-1560
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0
>            Reporter: P. Taylor Goetz
>
> In some scenarios, netty connection problems can leave a topology in an unrecoverable state. The likely culprit is the Netty {{HashedWheelTimer}} class that contains the following code:
> {code}
>         public void expire() {
>             if(this.compareAndSetState(0, 2)) {
>                 try {
>                     this.task.run(this);
>                 } catch (Throwable var2) {
>                     if(HashedWheelTimer.logger.isWarnEnabled()) {
>                         HashedWheelTimer.logger.warn("An exception was thrown by " + TimerTask.class.getSimpleName() + '.', var2);
>                     }
>                 }
>             }
>         }
> {code}
> The exception being swallowed can be seen below:
> {code}
> 2016-02-18 08:46:59.116 o.a.s.m.n.Client [INFO] closing Netty Client Netty-Client-/192.168.202.6:6701
> 2016-02-18 08:46:59.173 o.a.s.m.n.Client [INFO] waiting up to 600000 ms to send 0 pending messages to Netty-Client-/192.168.202.6:6701
> 2016-02-18 08:46:59.271 STDIO [ERROR] Feb 18, 2016 8:46:59 AM org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer
> WARNING: An exception was thrown by TimerTask.
> java.lang.RuntimeException: Giving up to scheduleConnect to Netty-Client-/192.168.202.6:6701 after 44 failed attempts. 3 messages were lost
> 	at org.apache.storm.messaging.netty.Client$Connect.run(Client.java:573)
> 	at org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:546)
> 	at org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$Worker.notifyExpiredTimeouts(HashedWheelTimer.java:446)
> 	at org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:395)
> 	at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> The Netty client then never recovers, and the following messages repeat forever:
> {code}
> 2016-02-18 09:42:56.251 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:25.248 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:55.248 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:55.752 o.a.s.m.n.Client [ERROR] discarding 2 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:56.252 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:44:25.249 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> {code}
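
To illustrate why a catch-all like the quoted expire() can hide a fatal failure, here is a minimal, self-contained sketch. It is a toy model, not Storm or Netty code: the class SwallowedThrowableDemo and the methods expireSwallowing and handlerFired are hypothetical names invented for this example. It only shows the general effect that a thread's uncaught-exception handler never sees a Throwable that was already caught and logged further down the stack.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only: none of these names exist in Storm or Netty.
class SwallowedThrowableDemo {

    // Same shape as the quoted expire(): catch Throwable, log it, carry on.
    public static void expireSwallowing(Runnable task) {
        try {
            task.run();
        } catch (Throwable t) {
            System.err.println("WARNING: An exception was thrown by TimerTask. " + t);
        }
    }

    // Runs body on a fresh thread and reports whether that thread's
    // uncaught-exception handler was invoked.
    public static boolean handlerFired(Runnable body) {
        AtomicBoolean fired = new AtomicBoolean(false);
        Thread worker = new Thread(body, "toy-timer-thread");
        worker.setUncaughtExceptionHandler((thread, error) -> fired.set(true));
        worker.start();
        try {
            worker.join(); // the handler runs on the worker before it terminates
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return fired.get();
    }

    public static void main(String[] args) {
        Runnable failingTask = () -> {
            throw new RuntimeException("giving up on reconnect (toy message)");
        };
        // Thrown directly: the exception reaches the thread's handler.
        System.out.println("direct:  " + handlerFired(failingTask));
        // Thrown inside the catch-all wrapper: the handler never sees it.
        System.out.println("wrapped: " + handlerFired(() -> expireSwallowing(failingTask)));
    }
}
```

In this sketch the direct call reports true and the wrapped call reports false. If a process relies on an uncaught-exception handler to shut itself down and get restarted, a failure swallowed this way would leave it running in whatever state the failed task left behind. Whether that is exactly what happens in the worker here is an assumption on my part, not something the logs above prove.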


