Posted to dev@storm.apache.org by "James Xu (JIRA)" <ji...@apache.org> on 2013/12/15 05:59:06 UTC

[jira] [Created] (STORM-128) Topology fails to start if a configured DRPC server is down

James Xu created STORM-128:
------------------------------

             Summary: Topology fails to start if a configured DRPC server is down
                 Key: STORM-128
                 URL: https://issues.apache.org/jira/browse/STORM-128
             Project: Apache Storm (Incubating)
          Issue Type: Bug
            Reporter: James Xu
            Priority: Minor


https://github.com/nathanmarz/storm/issues/696

In our environment we have 3 DRPC servers running. This was done mainly for availability and capacity. However, we noticed that when even one of these servers is down, topologies fail to start with the following exception:

java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.NoRouteToHostException: No route to host
at backtype.storm.drpc.DRPCInvocationsClient.<init>(DRPCInvocationsClient.java:23)
at backtype.storm.drpc.DRPCSpout.open(DRPCSpout.java:65)
at storm.trident.spout.RichSpoutBatchTriggerer.open(RichSpoutBatchTriggerer.java:41)
at backtype.storm.daemon.executor$fn__3985$fn__3997.invoke(executor.clj:460)
at backtype.storm.util$async_loop$fn__465.invoke(util.clj:375)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.thrift7.transport.TTransportException: java.net.NoRouteToHostException: No route to host
at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
at backtype.storm.drpc.DRPCInvocationsClient.connect(DRPCInvocationsClient.java:30)
at backtype.storm.drpc.DRPCInvocationsClient.<init>(DRPCInvocationsClient.java:21)
... 6 more
Caused by: java.net.NoRouteToHostException: No route to host
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at org.apache.thrift7.transport.TSocket.open(TSocket.java:178)
... 9 more

I was wondering if it makes sense to make Storm handle this gracefully instead of failing fast. Otherwise, the DRPC servers become a SPOF.

If a topology is already running when a DRPC server goes down, the topology usually just logs an error message and continues.

----------
dkador: +1 on figuring out how to make the DRPC stuff not a SPOF. I'd be happy to look into it myself but I'm not sure where to start. Any guidance?

----------
rijuk: For reference, the stack trace I see when a DRPC server goes down while a topology is running is the following. In this case, the topology continues to function normally.

[backtype.storm.drpc.DRPCSpout Thread-65]: Failed to fetch DRPC result from DRPC server
org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused
at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
at backtype.storm.drpc.DRPCInvocationsClient.connect(DRPCInvocationsClient.java:30)
at backtype.storm.drpc.DRPCInvocationsClient.fetchRequest(DRPCInvocationsClient.java:53)
at backtype.storm.drpc.DRPCSpout.nextTuple(DRPCSpout.java:89)
at storm.trident.spout.RichSpoutBatchTriggerer.nextTuple(RichSpoutBatchTriggerer.java:68)
at backtype.storm.daemon.executor$fn__3985$fn__3997$fn__4026.invoke(executor.clj:502)
at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at org.apache.thrift7.transport.TSocket.open(TSocket.java:178)
... 9 more

In this case I had the host up, but the DRPC server process was down, hence the ConnectException. The behavior is the same when the host is unreachable, except for the exception type.

@dkador, I'm not sure what the right solution is. One naive solution I can think of is to make the DRPCInvocationsClient constructor rethrow the TException instead of wrapping it in a RuntimeException. Obviously, you'd have to make sure that all callers higher up in the stack handle this exception properly.

Actually, on second thought, that's not a good idea. You probably still want the DRPCInvocationsClient object to be constructed. So maybe you can log an error and just swallow that exception in the constructor. All other methods in that class call "connect" if necessary anyway.
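
To make the "log and swallow" idea concrete, here is a minimal sketch. It is not Storm's actual DRPCInvocationsClient: the class name LazyConnectClient, its methods, and the use of a plain java.net.Socket in place of the Thrift transport are all assumptions made for illustration. It only shows the pattern of a constructor that tolerates a failed connection and leaves reconnecting to later calls.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical stand-in for a client such as DRPCInvocationsClient.
// It uses a plain TCP socket instead of Thrift so the sketch is self-contained.
class LazyConnectClient {
    private final String host;
    private final int port;
    private final int timeoutMs;
    private Socket socket; // null while disconnected

    LazyConnectClient(String host, int port, int timeoutMs) {
        this.host = host;
        this.port = port;
        this.timeoutMs = timeoutMs;
        try {
            connect();
        } catch (IOException e) {
            // Log and swallow: the object is still constructed, and the
            // connection is attempted again by ensureConnected().
            System.err.println("Could not connect to " + host + ":" + port
                    + " (" + e + "); will retry on the next request");
        }
    }

    private void connect() throws IOException {
        Socket s = new Socket();
        try {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
        } catch (IOException e) {
            s.close(); // do not leak the half-opened socket
            throw e;
        }
        socket = s;
    }

    boolean isConnected() {
        return socket != null && socket.isConnected() && !socket.isClosed();
    }

    // Request methods would call this first, mirroring the
    // "call connect if necessary" behavior described above.
    void ensureConnected() throws IOException {
        if (!isConnected()) {
            connect();
        }
    }

    void close() throws IOException {
        if (socket != null) {
            socket.close();
            socket = null;
        }
    }
}
```

With this shape, a caller that builds one client per configured server at startup would get a usable (but disconnected) object for a server that is down, and the failure would surface later as an IOException from the request path, where it can be logged and retried rather than aborting startup.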



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)