Posted to dev@storm.apache.org by "Sean Zhong (JIRA)" <ji...@apache.org> on 2014/11/12 14:16:33 UTC
[jira] [Updated] (STORM-537) A worker reconnects infinitely to another dead worker

     [ https://issues.apache.org/jira/browse/STORM-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Zhong updated STORM-537:
-----------------------------
    Assignee: Sergey Tryuber

> A worker reconnects infinitely to another dead worker
> -----------------------------------------------------
>
>                 Key: STORM-537
>                 URL: https://issues.apache.org/jira/browse/STORM-537
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Sergey Tryuber
>            Assignee: Sergey Tryuber
>
> We're using 0.9.3-rc1. Most probably this wrong behavior was introduced as a side effect of STORM-409. When I kill a worker, another worker starts to print messages like:
> {noformat}
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> ..... so on
> {noformat}
> Then it reaches the default limit of 300 max_retries and starts the cycle again:
> {noformat}
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-<HOST>:4706, [id: 0xec088412, /<HOST>:39795 :> <HOST>:4706]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> {noformat}
> And so on infinitely... 
> The issue is most probably in the backtype.storm.messaging.netty.Client#connect method, in the following place, which determines whether we give up on reconnection:
> {code}
> if (null != channel) {
>     LOG.info("connection established to a remote host " + name() + ", " + channel.toString());
>     channelRef.set(channel);
> } else {
>     close();
>     throw new RuntimeException("Remote address is not reachable. We will close this client " + name());
> }
> {code}
> I guess (not tried yet) that the _channel_ object is not _null_ if this is a real reconnection. So the method returns a _channel_ object, and then reconnection starts again and again.
> This might be fixed by explicitly adding *current = null;* to the following code block of the same method:
> {code}
> if (!future.isSuccess()) {
>     if (null != current) {
>         current.close();
>     }
> }
> {code}
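
The guess in the description can be illustrated with a small standalone model. This is only a sketch of the suspected mechanism, not the Storm source: FakeChannel, the connect(...) signature, the peerAlive and resetOnFailure flags, and the shape of the retry loop are all hypothetical, and the real Client#connect may be structured differently. The assumption being modeled is that a failed attempt still hands back a channel object, which gets closed but stays referenced, so the later null check passes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the suspected behavior; NOT the actual Storm code.
final class ReconnectSketch {

    // Hypothetical stand-in for a Netty channel (not the real Netty API).
    static final class FakeChannel {
        private boolean open = true;
        void close() { open = false; }
        boolean isOpen() { return open; }
    }

    // Default retry limit quoted in the report.
    static final int MAX_RETRIES = 300;

    /**
     * Runs the modeled retry loop.
     * peerAlive      - whether a connection attempt succeeds
     * resetOnFailure - whether to apply the proposed "current = null" fix
     * Returns true if the loop reports a connection as established.
     */
    static boolean connect(boolean peerAlive, boolean resetOnFailure,
                           AtomicReference<FakeChannel> channelRef) {
        FakeChannel current = null;
        for (int tried = 0; tried <= MAX_RETRIES; tried++) {
            // Assumption: even a failed attempt yields a channel object.
            current = new FakeChannel();
            if (peerAlive) {
                break; // attempt succeeded, keep the open channel
            }
            current.close(); // failed attempt: channel is closed...
            if (resetOnFailure) {
                current = null; // ...and, with the proposed fix, forgotten
            }
        }
        if (current != null) {
            // Without the fix, a closed channel from the last failed
            // attempt reaches this branch and is treated as connected.
            channelRef.set(current);
            return true;
        }
        return false; // the real method would close the client and throw here
    }

    public static void main(String[] args) {
        AtomicReference<FakeChannel> ref = new AtomicReference<>();
        System.out.println("without reset: established="
                + connect(false, false, ref));
        ref.set(null);
        System.out.println("with reset:    established="
                + connect(false, true, ref));
    }
}
```

In this model, a dead peer without the reset leaves a closed channel stored as if it were connected, which would match the "connection established" line followed by a fresh reconnect cycle in the logs above. With the reset, the loop gives up after the retry limit. Whether the real method behaves this way still needs to be checked against the actual code.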



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)