Posted to issues@storm.apache.org by "Alexandre Vermeerbergen (JIRA)" <ji...@apache.org> on 2017/08/26 14:02:00 UTC

[jira] [Created] (STORM-2707) Nimbus loops forever with getClusterInfo error if it loses storm.local.dir contents

Alexandre Vermeerbergen created STORM-2707:
----------------------------------------------

             Summary: Nimbus loops forever with getClusterInfo error if it loses storm.local.dir contents
                 Key: STORM-2707
                 URL: https://issues.apache.org/jira/browse/STORM-2707
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-core
    Affects Versions: 1.1.0, 1.1.1
         Environment: Storm deployed with only one Nimbus node (no HA).
Probably unrelated, but in our case we have:
* 3 ZooKeeper VMs and 4 to 8 Supervisor VMs
* Everything runs on Java 8 update 131 (Oracle Server JRE package)
* Everything runs on EC2 instances
            Reporter: Alexandre Vermeerbergen


Hello,

Short issue description:
* Remove the storm.local.dir directory
* Storm UI can no longer query anything from Nimbus
* The Nimbus process logs getClusterInfo exceptions whenever it receives a query from Storm UI or from the "storm" command line
* To fix this, we have to stop all Storm processes, clean up the content of the ZooKeeper nodes, then restart and redeploy our topologies

Expected behavior:
* In such a case, Storm should clean up the content of ZooKeeper and recover in a mode that allows killing and restarting all topologies

More details:
===========
Sometimes we lose the content of storm.local.dir on our single-node Nimbus production cluster.

We haven't yet considered deploying Nimbus in HA because this is a relatively modest deployment with budget constraints on the number of IaaS resources that can be used for this application. So far so good, because in our environment Nimbus and Storm UI (hosted on the same VM) are supervised, and we also have self-healing crons that automatically kill and restart topologies that are blocked in Kafka consumption or have too many failed tuples (because Storm back pressure has some fuzzy limits, so we use this bypass, as approved by Roshan in a past discussion, but that's not the point here).

Our problem is that sometimes we lose the content of storm.local.dir.

When this happens, our supervision detects the issue because it can no longer query the Nimbus REST services exposed by the Storm UI process.

In that case it tries to restart Storm UI, but this doesn't help because queries to Storm UI fail with the following stack trace when it tries to list all topologies:

org.apache.storm.thrift.TApplicationException: Internal error processing getClusterInfo
	at org.apache.storm.thrift.TApplicationException.read(TApplicationException.java:111)
	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
	at org.apache.storm.generated.Nimbus$Client.recv_getClusterInfo(Nimbus.java:1168)
	at org.apache.storm.generated.Nimbus$Client.getClusterInfo(Nimbus.java:1156)
	at org.apache.storm.ui.core$cluster_summary.invoke(core.clj:356)
	at org.apache.storm.ui.core$fn__9556.invoke(core.clj:1113)
	at org.apache.storm.shade.compojure.core$make_route$fn__5976.invoke(core.clj:100)
	at org.apache.storm.shade.compojure.core$if_route$fn__5964.invoke(core.clj:46)
	at org.apache.storm.shade.compojure.core$if_method$fn__5957.invoke(core.clj:31)
	at org.apache.storm.shade.compojure.core$routing$fn__5982.invoke(core.clj:113)
	at clojure.core$some.invoke(core.clj:2570)
	at org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:113)
	at clojure.lang.RestFn.applyTo(RestFn.java:139)
	at clojure.core$apply.invoke(core.clj:632)
	at org.apache.storm.shade.compojure.core$routes$fn__5986.invoke(core.clj:118)
	at org.apache.storm.shade.ring.middleware.cors$wrap_cors$fn__8891.invoke(cors.clj:149)
	at org.apache.storm.shade.ring.middleware.json$wrap_json_params$fn__8838.invoke(json.clj:56)
	at org.apache.storm.shade.ring.middleware.multipart_params$wrap_multipart_params$fn__6618.invoke(multipart_params.clj:118)
	at org.apache.storm.shade.ring.middleware.reload$wrap_reload$fn__7901.invoke(reload.clj:22)
	at org.apache.storm.ui.helpers$requests_middleware$fn__6871.invoke(helpers.clj:50)
	at org.apache.storm.ui.core$catch_errors$fn__9758.invoke(core.clj:1428)
	at org.apache.storm.shade.ring.middleware.keyword_params$wrap_keyword_params$fn__6538.invoke(keyword_params.clj:35)
	at org.apache.storm.shade.ring.middleware.nested_params$wrap_nested_params$fn__6581.invoke(nested_params.clj:84)
	at org.apache.storm.shade.ring.middleware.params$wrap_params$fn__6510.invoke(params.clj:64)
	at org.apache.storm.shade.ring.middleware.multipart_params$wrap_multipart_params$fn__6618.invoke(multipart_params.clj:118)
	at org.apache.storm.shade.ring.middleware.flash$wrap_flash$fn__6833.invoke(flash.clj:35)
	at org.apache.storm.shade.ring.middleware.session$wrap_session$fn__6819.invoke(session.clj:98)
	at org.apache.storm.shade.ring.util.servlet$make_service_method$fn__6368.invoke(servlet.clj:127)
	at org.apache.storm.shade.ring.util.servlet$servlet$fn__6372.invoke(servlet.clj:136)
	at org.apache.storm.shade.ring.util.servlet.proxy$javax.servlet.http.HttpServlet$ff19274a.service(Unknown Source)
	at org.apache.storm.shade.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:654)
	at org.apache.storm.shade.org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1320)
	at org.apache.storm.logging.filters.AccessLoggingFilter.handle(AccessLoggingFilter.java:47)
	at org.apache.storm.logging.filters.AccessLoggingFilter.doFilter(AccessLoggingFilter.java:39)
	at org.apache.storm.shade.org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1291)
	at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
	at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
	at org.apache.storm.ui.helpers$x_frame_options_filter_handler$fn__6964.invoke(helpers.clj:189)
	at org.apache.storm.ui.helpers.proxy$java.lang.Object$Filter$abec9a8f.doFilter(Unknown Source)
	at org.apache.storm.shade.org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1291)
	at org.apache.storm.shade.org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:247)
	at org.apache.storm.shade.org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:210)
	at org.apache.storm.shade.org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1291)
	at org.apache.storm.shade.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:443)
	at org.apache.storm.shade.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1044)
	at org.apache.storm.shade.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:372)
	at org.apache.storm.shade.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:978)
	at org.apache.storm.shade.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.apache.storm.shade.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.apache.storm.shade.org.eclipse.jetty.server.Server.handle(Server.java:369)
	at org.apache.storm.shade.org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:486)
	at org.apache.storm.shade.org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:933)
	at org.apache.storm.shade.org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:995)
	at org.apache.storm.shade.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
	at org.apache.storm.shade.org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.apache.storm.shade.org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
	at org.apache.storm.shade.org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:668)
	at org.apache.storm.shade.org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
	at org.apache.storm.shade.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.apache.storm.shade.org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:748)

In Nimbus.log, we also get this kind of exception each time Storm UI is queried:

org.apache.storm.generated.KeyNotFoundException: null
        at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
        at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_144]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_144]
        at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
        at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
        at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
        at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
        at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
        at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
        at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
        at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]

Even if Storm shut down in such a case, this wouldn't help, because we would still have to clean up the ZooKeeper nodes to bring our Storm cluster back to life.

Ideally, when storm.local.dir is lost, we would like Nimbus to re-create it "blank" and clean up the ZooKeeper nodes (that's the tricky part); we then expect the Supervisors (which are unaffected by this issue when it occurs) to re-register themselves so that Nimbus knows the topologies are running.

Also, restarting topologies from the UI should consistently fail until the topology JARs are re-submitted (please make the error message very clear and easy to "grep" when this occurs).
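The Nimbus trace suggests that one failed blob lookup inside get_cluster_info makes the whole getClusterInfo call fail. The difference between that and a more tolerant behavior can be sketched as follows. This is an illustration only, not Storm's code: every class, method, and marker name is hypothetical, and the blob store is reduced to an in-memory map keyed by topology id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these types are Storm classes. */
class ClusterSummarySketch {

    /** Stand-in for the KeyNotFoundException seen in the Nimbus log. */
    static class MissingBlobException extends Exception {
        MissingBlobException(String key) {
            super("blob key not found: " + key);
        }
    }

    /** In-memory stand-in for a blob store: blob key -> replication count. */
    static class BlobStore {
        private final Map<String, Integer> replication = new HashMap<>();

        void put(String key, int count) {
            replication.put(key, count);
        }

        int getReplication(String key) throws MissingBlobException {
            Integer count = replication.get(key);
            if (count == null) {
                throw new MissingBlobException(key);
            }
            return count;
        }
    }

    /** Marker invented for this sketch so that operators can grep for it. */
    static final String MARKER = "TOPOLOGY_BLOB_MISSING";

    /** Strict variant: the first missing blob aborts the whole summary. */
    static List<String> strictSummary(List<String> topologyIds, BlobStore store)
            throws MissingBlobException {
        List<String> lines = new ArrayList<>();
        for (String id : topologyIds) {
            lines.add(id + " replication=" + store.getReplication(id));
        }
        return lines;
    }

    /** Tolerant variant: a missing blob is reported for that topology only. */
    static List<String> tolerantSummary(List<String> topologyIds, BlobStore store) {
        List<String> lines = new ArrayList<>();
        for (String id : topologyIds) {
            try {
                lines.add(id + " replication=" + store.getReplication(id));
            } catch (MissingBlobException e) {
                lines.add(id + " " + MARKER + " (resubmit the topology jar)");
            }
        }
        return lines;
    }
}
```

With one topology whose blob is present and one whose blob was lost, strictSummary throws, while tolerantSummary still returns a line for each topology and flags the broken one with the marker.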

This is my first JIRA. I hope I have provided everything Storm developers need to dig into this issue; otherwise, please let me know if more information is required: I will be glad to help as much as I can... Storm rocks!

Best regards,
Alexandre Vermeerbergen





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)