Posted to issues@storm.apache.org by "Howard Lee (JIRA)" <ji...@apache.org> on 2016/11/02 07:53:58 UTC

[jira] [Commented] (STORM-1022) disconnectiong between workers

    [ https://issues.apache.org/jira/browse/STORM-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15628118#comment-15628118 ] 

Howard Lee commented on STORM-1022:
-----------------------------------

We recently hit the same problem: the Netty client gives up after several connection retries. In our case, however, it is not the supervisors that are affected but the workers themselves.
{quote}
2016-10-24T17:12:05.981+0800 STDIO [ERROR] Oct 24, 2016 5:12:05 PM org.apache.storm.guava.util.concurrent.ExecutionList executeListener
SEVERE: RuntimeException while executing runnable org.apache.storm.guava.util.concurrent.Futures$4@43f31edc with executor org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService@e6f205e
java.lang.RuntimeException: Failed to connect to Netty-Client-xxx/xx.xx.xx.173:6721
        at backtype.storm.messaging.netty.Client.connect(Client.java:308)
        at backtype.storm.messaging.netty.Client.access$1100(Client.java:78)
        at backtype.storm.messaging.netty.Client$2.reconnectAgain(Client.java:297)
        at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:283)
        at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:275)
        at org.apache.storm.guava.util.concurrent.Futures$4.run(Futures.java:1181)
        at org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at org.apache.storm.guava.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
        at org.apache.storm.guava.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
        at org.apache.storm.guava.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91)
        at java.util.concurrent.FutureTask$Sync.innerSet(FutureTask.java:251)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Giving up to connect to Netty-Client-sd-bigdata-storm5-73-173.idc.vip.com/10.208.73.173:6721 after 12 failed attempts
        at backtype.storm.messaging.netty.Client.connect(Client.java:303)
        ... 20 more
{quote}
When workers start, they try to connect to each other. After several retries, one worker fails and throws a *RuntimeException*. The *RuntimeException* is swallowed by Guava, as the stack trace above shows. The relevant code is at ExecutionList.java line 154 in Guava 16.0.1:
{code:java}
private static void executeListener(Runnable runnable, Executor executor) {
  try {
    executor.execute(runnable);
  } catch (RuntimeException e) {
    // Log it and keep going, bad runnable and/or executor.  Don't
    // punish the other runnables if we're given a bad one.  We only
    // catch RuntimeException because we want Errors to propagate up.
    log.log(Level.SEVERE, "RuntimeException while executing runnable "
        + runnable + " with executor " + executor, e);
  }
}
{code}
The RuntimeException is never propagated to the worker's timer thread, so it does not kill the worker, and the worker never tries to reconnect again. The worker should die so that the supervisor can restart it; as long as this bug exists, that does not happen.
Our Storm version is 0.9.4. I can see that Storm 1.0.1 has removed Guava from this code, so the problem is solved in newer versions of Storm.
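To illustrate why the exception never reaches the calling thread, here is a minimal sketch that uses neither Guava nor Storm classes. Its executeListener has the same try/catch shape as the Guava method quoted above. The class name, the swallowed counter, and the failingReconnect and exceptionReachesCaller helpers are hypothetical names made up for this example, and the same-thread executor is only a stand-in for Guava's SameThreadExecutorService.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class SwallowedExceptionDemo {
    private static final Logger log =
            Logger.getLogger(SwallowedExceptionDemo.class.getName());

    // Hypothetical counter: how many exceptions were logged and dropped.
    public static final AtomicInteger swallowed = new AtomicInteger();

    // Same shape as the Guava method quoted above: log and keep going.
    public static void executeListener(Runnable runnable, Executor executor) {
        try {
            executor.execute(runnable);
        } catch (RuntimeException e) {
            swallowed.incrementAndGet();
            log.log(Level.SEVERE, "RuntimeException while executing runnable "
                    + runnable + " with executor " + executor, e);
        }
    }

    // Hypothetical stand-in for the reconnect callback that gives up.
    public static Runnable failingReconnect(final int maxAttempts) {
        return () -> {
            throw new RuntimeException(
                    "Giving up to connect after " + maxAttempts + " failed attempts");
        };
    }

    // Returns true only if the listener's exception propagates to the caller.
    public static boolean exceptionReachesCaller(Runnable listener) {
        Executor sameThread = Runnable::run; // runs the listener on the calling thread
        try {
            executeListener(listener, sameThread);
            return false; // the caller never saw the failure
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        boolean reached = exceptionReachesCaller(failingReconnect(12));
        System.out.println("exception reached caller: " + reached);
        System.out.println("exceptions swallowed: " + swallowed.get());
    }
}
```

Running main logs the SEVERE message and then prints "exception reached caller: false" and "exceptions swallowed: 1". From the caller's point of view the failed reconnect looks like a normal return, which matches what we see in the worker: it stays alive and never reconnects.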

> disconnectiong between workers
> ------------------------------
>
>                 Key: STORM-1022
>                 URL: https://issues.apache.org/jira/browse/STORM-1022
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>            Reporter: Jackson Chung
>
> We upgraded to 0.9.5 and ran into the following exception. The supervisors did go down:
> One caveat about our upgrade: we started a new nimbus without any supervisors attached. Then we deployed topologies (from CI/CD). Next we built new supervisors, and the supervisors start on startup. However, in between, the network service was restarted (because the hostname changed during the build <- chef). Just want to throw this out in case it makes a difference.
> In other words, it could be that the supervisors started, picked up work, and then the network restarted.
> {code}
> SEVERE: RuntimeException while executing runnable org.apache.storm.guava.util.concurrent.Futures$4@445058b with executor org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService@691bc565
> java.lang.RuntimeException: Failed to connect to Netty-Client-usw2b-grunt-drone32-prod.amz.relateiq.com/10.30.103.202:6700
> at backtype.storm.messaging.netty.Client.connect(Client.java:308)
> at backtype.storm.messaging.netty.Client.access$1100(Client.java:78)
> at backtype.storm.messaging.netty.Client$2.reconnectAgain(Client.java:297)
> at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:283)
> at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:275)
> at org.apache.storm.guava.util.concurrent.Futures$4.run(Futures.java:1181)
> at org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
> at org.apache.storm.guava.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
> at org.apache.storm.guava.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
> at org.apache.storm.guava.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91)
> at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:384)
> at java.util.concurrent.FutureTask.set(FutureTask.java:233)
> at java.util.concurrent.FutureTask.run(FutureTask.java:274)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Giving up to connect to Netty-Client-usw2b-grunt-drone32-prod.amz.relateiq.com/10.30.103.202:6700 after 102 failed attempts
> at backtype.storm.messaging.netty.Client.connect(Client.java:303)
> {code}


