Posted to dev@storm.apache.org by "xiajun (JIRA)" <ji...@apache.org> on 2014/09/04 06:08:52 UTC

[jira] [Comment Edited] (STORM-404) Worker on one machine crashes due to a failure of another worker on another machine

    [ https://issues.apache.org/jira/browse/STORM-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120928#comment-14120928 ] 

xiajun edited comment on STORM-404 at 9/4/14 4:08 AM:
------------------------------------------------------

I can reproduce this situation every time in the case below:
In my setup there are 5 machines: 4 run supervisors with 6 worker ports each and 1 runs nimbus. My topology uses 24 workers, which takes up every port on the supervisors. There is one trick, though: the prepare method in my bolt just throws an exception and does nothing else, which makes that worker exit immediately. The other workers then exit with the log that @Itai Frenkel mentioned before.
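
A minimal sketch of the kind of bolt described above, against the 0.9.x backtype.storm API (the class name is mine, only for illustration):

    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    // Illustrative bolt matching the scenario: prepare() throws, so the
    // hosting worker (worker A) dies as soon as it is launched.
    public class CrashingBolt extends BaseRichBolt {
        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            throw new RuntimeException("fail on purpose so the worker exits immediately");
        }

        @Override
        public void execute(Tuple tuple) {
            // never reached
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no output streams
        }
    }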

I read the code and found that, in my situation, when the first worker (call it worker A) exits, the other workers have not yet connected to worker A, so they retry many times and then close their Client; note that the connect logic is asynchronous and is started from the Client constructor. Because the connect and send methods are synchronized, send must wait until connect returns; but by the time send runs, close has already been called by connect and has marked closing as true, so send throws that exception.
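
To make the ordering concrete, here is a much-simplified sketch of the race as I read it. This is NOT the real backtype.storm.messaging.netty.Client; the structure, retry count and names are only illustrative:

    // Sketch of the described race: connect() runs asynchronously from the
    // constructor, gives up after its retries and closes the client, and a
    // send() that was waiting on the same lock can then only throw.
    public class ClientSketch {
        private volatile boolean closing = false;

        public ClientSketch() {
            // the connect logic is async and started from the constructor
            new Thread(this::connect).start();
        }

        private synchronized void connect() {
            for (int attempt = 1; attempt <= 30; attempt++) {
                if (tryConnect()) {
                    return;             // connected; nothing else to do
                }
                sleepQuietly(1000);     // worker A is gone, so every attempt fails
            }
            close();                    // retries exhausted: the client closes itself
        }

        public synchronized void send(Object tuples) {
            // send() was blocked while connect() held the lock; once it gets in,
            // closing is already true and it throws the exception from the log
            if (closing) {
                throw new RuntimeException(
                    "Client is being closed, and does not take requests any more");
            }
            // ... otherwise flush the tuples over the channel ...
        }

        public synchronized void close() {
            closing = true;
            // ... wait for pending batches, then release the channel ...
        }

        private boolean tryConnect() { return false; }  // stands in for the failing Netty connect
        private static void sleepQuietly(long ms) {
            try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }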

But I am still wondering whether this can happen even without the artificial failure: remove the exception from Bolt::prepare and suppose worker A exits for some unknown reason. Worker B's connection to worker A fails after the retries, yet worker B may still have tuples for worker A, so it ends up calling send again, and that exception makes worker B exit as well. You may say that nimbus will notice that worker A exited and tell worker B to stop sending tuples to worker A, but workers and nimbus communicate through ZooKeeper, and a worker reads nimbus's commands only periodically; this is done by mk-refresh-connections. mk-refresh-connections and send share the same read-write lock, and when the machine load is heavy there is a chance that mk-refresh-connections is not called between two sends, where the first send has already closed the connection in the Client. A rough sketch of this timing is below.
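
A similarly rough sketch of the timing window between the transfer path and mk-refresh-connections (reusing ClientSketch from above; the real logic lives in worker.clj, and the names and structure here are only illustrative):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Sketch of the window: the transfer thread sends under the read lock,
    // the refresh thread swaps connections under the write lock. If refresh
    // does not run between two sends, the second send hits the closed client.
    public class TransferSketch {
        private final ReentrantReadWriteLock endpointLock = new ReentrantReadWriteLock();
        private final Map<String, ClientSketch> connections = new ConcurrentHashMap<>();

        // Transfer thread: called for every outgoing batch.
        void sendBatch(String targetWorker, Object tuples) {
            endpointLock.readLock().lock();
            try {
                ClientSketch client = connections.get(targetWorker);
                // If refresh has not run since worker A died, this is still the
                // closed client and send() throws, taking this worker down too.
                client.send(tuples);
            } finally {
                endpointLock.readLock().unlock();
            }
        }

        // Refresh thread: runs periodically, driven by the assignment in ZooKeeper.
        // Under heavy load it may not get the write lock between two sendBatch() calls.
        void refreshConnections(Map<String, ClientSketch> newAssignment) {
            endpointLock.writeLock().lock();
            try {
                connections.clear();
                connections.putAll(newAssignment);
            } finally {
                endpointLock.writeLock().unlock();
            }
        }
    }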



> Worker on one machine crashes due to a failure of another worker on another machine
> -----------------------------------------------------------------------------------
>
>                 Key: STORM-404
>                 URL: https://issues.apache.org/jira/browse/STORM-404
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>            Reporter: Itai Frenkel
>
> I have two workers (one on each machine). The first worker (10.30.206.125) had a problem starting (it could not find the Nimbus host); however, the second worker crashed too because it could not connect to the first worker.
> This looks like a cascading failure, which seems like a bug.
> 2014-07-15 17:43:32 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [17]
> 2014-07-15 17:43:33 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [18]
> 2014-07-15 17:43:34 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [19]
> 2014-07-15 17:43:35 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [20]
> 2014-07-15 17:43:36 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [21]
> 2014-07-15 17:43:37 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [22]
> 2014-07-15 17:43:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [23]
> 2014-07-15 17:43:39 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [24]
> 2014-07-15 17:43:40 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [25]
> 2014-07-15 17:43:41 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [26]
> 2014-07-15 17:43:42 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [27]
> 2014-07-15 17:43:43 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [28]
> 2014-07-15 17:43:44 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [29]
> 2014-07-15 17:43:45 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [30]
> 2014-07-15 17:43:46 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700
> 2014-07-15 17:43:46 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700..., timeout: 600000ms, pendings: 0
> 2014-07-15 17:43:46 b.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
> at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
> Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
> at backtype.storm.messaging.netty.Client.send(Client.java:194) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> ... 6 common frames omitted
> 2014-07-15 17:43:46 b.s.util [INFO] Halting process: ("Async loop died!")


