Posted to dev@storm.apache.org by Alexandre Vermeerbergen <av...@gmail.com> on 2017/08/13 14:14:50 UTC

Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Hello,

I think it might be of interest to you Storm developers that I currently
have an issue with Storm 1.1.0 which was supposed to be resolved in this
release according to https://issues.apache.org/jira/browse/STORM-1977 ;
I can look for any additional information you'd need to diagnose why this
issue can still happen.

Indeed, I have a Storm UI process which can't get any information about its
Storm cluster, and I see many occurrences of the following exception in nimbus.log:

2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
org.apache.storm.generated.KeyNotFoundException: null
        at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
        at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
        at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
        at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
        at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
        at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
        at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
        at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
        at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
        at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]


What is surprising is that I got this same issue on two Storm clusters running
on different VMs; they only share the same data in their Kafka Broker cluster
(one cluster is the production one, which was quickly fixed, and the other one
is the "backup" cluster to be used for a quick return to production if the
production one fails).

I left one of these clusters in this state because I felt it could be
interesting for Storm developers to have more information on this issue,
should they need it to properly diagnose it.

I can keep this cluster as is for max 2 days.

Is there anything useful I can collect on it to help Storm developers
understand the cause (and hopefully make Storm more robust)?

A few details:

* Storm 1.1.0 cluster with Nimbus & Nimbus UI running on one VM, plus 4
Supervisor VMs and 3 Zookeeper VMs

* Running with Java Server JRE 1.8.0_121
* Running on AWS EC2 instances

* We run about 10 topologies, with automatic self-healing on them (if they
stop consuming Kafka items, our self-healer calls "Kill topology" and then
eventually restarts the topology)

* We have self-healing on Nimbus UI based on calling its REST services.
If it does not respond fast enough, we restart Nimbus UI (a minimal sketch
of this kind of check is shown after this list)
* We noticed the issue because Nimbus UI was being restarted every 2 minutes

* To fix our production server, which had the same symptom, we had to stop
all Storm processes, then stop all Zookeepers, then remove all data in the
Zookeeper "snapshot files", then restart all Zookeepers, then restart all
Storm processes, then re-submit all our topologies
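
For illustration, here is a rough sketch (not our actual script) of this kind
of REST-based check. The cluster summary endpoint is part of the Storm UI REST
API, port 8070 matches our deployment, and the "storm-ui" systemd unit name is
only a placeholder for whatever restart mechanism is in place:

import subprocess
import sys
import urllib.request

# Storm UI REST endpoint for the cluster summary (port 8070 as in our deployment).
UI_URL = "http://localhost:8070/api/v1/cluster/summary"
TIMEOUT_SECONDS = 10

def ui_is_healthy():
    # Consider the UI healthy only if it answers the summary call within the timeout.
    try:
        with urllib.request.urlopen(UI_URL, timeout=TIMEOUT_SECONDS) as response:
            return response.status == 200
    except Exception:
        return False

if __name__ == "__main__":
    if not ui_is_healthy():
        # Placeholder restart action; "storm-ui" is a hypothetical systemd unit name.
        subprocess.call(["systemctl", "restart", "storm-ui"])
        sys.exit(1)

Run from cron every minute, such a script gives the 1-minute self-healing
period mentioned above.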

Please be as specific as possible about which commands we should run to give
you more details if needed.

Best regards,
Alexandre Vermeerbergen

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Jungtaek Lim <ka...@gmail.com>.
I'm not sure. If you are sure about this behavior, please file an issue
with the affected version. It would be great if you can provide a way to
reproduce the issue (though if it comes from a race condition, I guess that
would be hard). Even better if you don't mind spending some time trying to
fix it: that's how open source works.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Aug 29, 2017 at 8:48 PM, Alexandre Vermeerbergen <av...@gmail.com> wrote:

> Hello,
>
> I'm afraid we had a new occurrence of this issue, where storm.local.dir
> directory was still globally available but maybe some file was missing or
> corrupted.
>
> The issue starts again just after Nimbus log says it's "Cleaning inbox":
>
> 2017-08-28 02:51:21.163 o.a.s.d.nimbus pool-14-thread-30 [INFO] Created
> download session for
>
> statefulAlerting_ec2-52-211-132-198-eu-west-1-compute-amazonaws-com_defaultStormTopic-276-1503888674-stormconf.ser
> with id 2594b2d0-bc4c-4e4e-b726-2e6970677f50
> 2017-08-28 02:56:09.549 o.a.s.d.nimbus timer [INFO] Cleaning inbox ...
> deleted: stormjar-e35343c5-b5a7-48ae-ad15-a4cf7e880094.jar
> 2017-08-28 03:06:02.673 o.a.s.t.ProcessFunction pool-14-thread-2 [ERROR]
> Internal error processing getClusterInfo
> org.apache.storm.generated.KeyNotFoundException: null
>     at
>
> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147)
> ~[storm-core-1.1.0.jar:1.1.0]
>     at
>
> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299)
> ~[storm-core-1.1.0.jar:1.1.0]
>     at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source) ~[?:?]
>
> Couldn't it be that there's some race condition between Zookeeper & Storm
> local dir content, i.e., sometimes Storm write some key to Zookeeper before
> it's actually committed to Storm local dir ?
>
> Best regards,
> Alexandre Vermeerbergen
>
>
>
> 2017-08-27 1:24 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
>
> > Glad that you reproduced and fixed the problem. Happy to help!
> >
> > - Jungtaek Lim (HeartSaVioR)
> >
> > 2017년 8월 26일 (토) 오후 11:13, Alexandre Vermeerbergen <
> > avermeerbergen@gmail.com>님이
> > 작성:
> >
> > > Hello Jungtaek,
> > >
> > > Your answers were very useful, because I was able to reproduce the
> issue
> > by
> > > simply deleting storm.local.dir contents, and I found traces that
> indeed
> > > the machine that suffered this issue had lost this directory : it
> > confirms
> > > your diagnostic, thank you very much for this!
> > >
> > > I have filled a JIRA: https://issues.apache.org/jira/browse/STORM-2707
> > > hoping that this will help make Storm even more robust when it'll be
> > fixed.
> > >
> > > In the meantime, I'm changing our storm.local.dir to a place that
> differs
> > > from /tmp in order to reduce the risk of deletion of this directory.
> > >
> > > Best regard,
> > > Alexandre Vermeerbergen
> > >
> > > 2017-08-25 9:44 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > >
> > > > I'm not sure. Topology code can't be restored, so my best bet would
> be
> > > > detecting it (periodically, or react in failure) and give up
> > leadership.
> > > > If my memory is right, leader Nimbus doesn't pull blobs from
> followers,
> > > so
> > > > if it doesn't have any blobs and need to sync, it just needs to
> become
> > > > follower and let other nimbuses to take a chance to be a leader.
> > > >
> > > > It would not help for non-HA Nimbus cluster anyway. What's next?
> Maybe
> > we
> > > > could remove related topology (from ZK), but it feels me a bit
> > dangerous
> > > to
> > > > do it automatically, so I'd rather leaning on CLI (maybe) tool so
> that
> > > > operators can do it manually.
> > > >
> > > > Please file an issue regarding this when you would want to have a
> > > solution.
> > > >
> > > > Best regards,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > 2017년 8월 25일 (금) 오전 6:29, Alexandre Vermeerbergen <
> > > > avermeerbergen@gmail.com>님이
> > > > 작성:
> > > >
> > > > > Hello Jungtaek,
> > > > >
> > > > > I confirm that we currently do not have multiple Nimbus nodes.
> > > > >
> > > > > I want to clarify that Nimbus process never crashed : it keep
> > printing
> > > in
> > > > > its log the error:
> > > > >
> > > > > 2017-08-06 03:44:01.777 o.a.s.t.ProcessFunction pool-14-thread-1
> > > [ERROR]
> > > > > Internal error processing getClusterInfo
> > > > > org.apache.storm.generated.KeyNotFoundException: null
> > > > > ... (rest of the stack trace)
> > > > >
> > > > > However I forgot to mention that:
> > > > >
> > > > > * During this problem our topologies are unaffected
> > > > > * We notice the problem because we have a self-healing check on
> > > Nimbus-UI
> > > > > process which consists in calling Web Services to get a few stats
> on
> > > > > running topologies. But when Nimbus-UI is unresponsive, we kill it
> > and
> > > > > restart it automatically.
> > > > > * The period of our NimbusUI self healing cron is 1 minute
> > > > > * When the getClusterInfo stack trace occurs on Nimbus log, Nimbus
> UI
> > > is
> > > > > unable to restart
> > > > > * Here's the kind of exceptions we see in ui.log  (Nimbus UI's
> > trace):
> > > > >
> > > > > 2017-08-24 21:24:21.777 o.a.s.u.core main [INFO] Starting ui server
> > for
> > > > > storm version '1.1.0'
> > > > > 2017-08-24 21:24:21.788 o.a.s.d.m.MetricsUtils main [INFO] Using
> > > > statistics
> > > > > reporter
> > > > > plugin:org.apache.storm.daemon.metrics.reporters.
> > JmxPreparableReporter
> > > > > 2017-08-24 21:24:21.790 o.a.s.d.m.r.JmxPreparableReporter main
> > [INFO]
> > > > > Preparing...
> > > > > 2017-08-24 21:24:21.801 o.a.s.d.common main [INFO] Started
> statistics
> > > > > report plugin...
> > > > > 2017-08-24 21:24:21.878 o.a.s.s.o.e.j.s.Server main [INFO]
> > > > > jetty-7.x.y-SNAPSHOT
> > > > > 2017-08-24 21:24:21.929 o.a.s.s.o.e.j.s.h.ContextHandler main
> [INFO]
> > > > > started o.a.s.s.o.e.j.s.ServletContextHandler{/,null}
> > > > > 2017-08-24 21:24:21.965 o.a.s.s.o.e.j.s.AbstractConnector main
> > [INFO]
> > > > > Started SelectChannelConnector@0.0.0.0:8070
> > > > > 2017-08-24 21:25:02.446 o.a.s.u.NimbusClient qtp2142536057-19
> [WARN]
> > > > > Ignoring exception while trying to get leader nimbus info from
> > > > > ec2-52-48-59-151.eu-west-1.compute.amazonaws.com. will retry with
> a
> > > > > different seed host.
> > > > > org.apache.storm.thrift.transport.TTransportException: null
> > > > >         at
> > > > >
> > > > > org.apache.storm.thrift.transport.TIOStreamTransport.
> > > > read(TIOStreamTransport.java:132)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > >
> org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > > > > org.apache.storm.thrift.transport.TFramedTransport.
> > > > readFrame(TFramedTransport.java:129)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > > > > org.apache.storm.thrift.transport.TFramedTransport.
> > > > read(TFramedTransport.java:101)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > >
> org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > > > > org.apache.storm.thrift.protocol.TBinaryProtocol.
> > > > readAll(TBinaryProtocol.java:429)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > > > > org.apache.storm.thrift.protocol.TBinaryProtocol.
> > > > readI32(TBinaryProtocol.java:318)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > > > > org.apache.storm.thrift.protocol.TBinaryProtocol.readMessageBegin(
> > > > TBinaryProtocol.java:219)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > > org.apache.storm.thrift.TServiceClient.receiveBase(
> > > > TServiceClient.java:77)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > > org.apache.storm.generated.Nimbus$Client.recv_getLeader(
> > > > Nimbus.java:1193)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> org.apache.storm.generated.Nimbus$Client.getLeader(Nimbus.java:1181)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > > > > org.apache.storm.utils.NimbusClient.getConfiguredClientAs(
> > > > NimbusClient.java:84)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > org.apache.storm.ui.core$cluster_summary.invoke(core.clj:355)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at org.apache.storm.ui.core$fn__9556.invoke(core.clj:1113)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > > > > org.apache.storm.shade.compojure.core$make_route$fn__
> > > > 5976.invoke(core.clj:100)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > > org.apache.storm.shade.compojure.core$if_route$fn__
> > > > 5964.invoke(core.clj:46)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > >
> > > > > org.apache.storm.shade.compojure.core$if_method$fn__
> > > > 5957.invoke(core.clj:31)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at
> > > > > org.apache.storm.shade.compojure.core$routing$fn__
> > > > 5982.invoke(core.clj:113)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at clojure.core$some.invoke(core.clj:2570)
> > > > ~[clojure-1.7.0.jar:?]
> > > > >         at
> > > > >
> org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:113)
> > > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >         at clojure.lang.RestFn.applyTo(RestFn.java:139)
> > > > > ~[clojure-1.7.0.jar:?]
> > > > >         at clojure.core$apply.invoke(core.clj:632)
> > > > ~[clojure-1.7.0.jar:?]
> > > > >
> > > > > So our problem is that Nimbus stays forever in a inconsistent state
> > > > without
> > > > > crashing, and if it would crash, then it would be helpless since we
> > > would
> > > > > have automatic restart of it (we also have a self healing on
> Nimbus)
> > > but
> > > > > again in such case we need to manually cleanup the content in the
> ZKs
> > > to
> > > > be
> > > > > able to restart Nimbus in a clean state.
> > > > >
> > > > > My question is the following one: is there a way to make Nimbus
> more
> > > > > resilient to such issues?
> > > > >
> > > > > Best regards,
> > > > > Alexandre
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > 2017-08-24 23:06 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > > > >
> > > > > > Blob files (meta, data) are in storm local directory. ZK only has
> > > list
> > > > of
> > > > > > blob keys and which alive nimbuses have that file. So if you lose
> > > storm
> > > > > > local directory, you just can't restore blobs, unless other
> > nimbuses
> > > > have
> > > > > > these blobs so current nimbus could pull.
> > > > > > (I guess you have only one nimbus, since replication factor was 1
> > in
> > > > > nimbus
> > > > > > log.)
> > > > > >
> > > > > > If FNF occurs while finding meta file, we just throw
> > > > KeyNotFoundException
> > > > > > because it is just same as not exist.
> > > > > >
> > > > > > Anyway you have to clear up everything because there's
> > inconsistency
> > > > > > between ZK and local, and no way to recover missed blob files so
> no
> > > way
> > > > > to
> > > > > > make it consistent.
> > > > > >
> > > > > > Btw, in this case, my expectation is that after crashing and
> > > > restarting,
> > > > > > nimbus can't be a leader and just waits for missing blobs. If you
> > saw
> > > > > > Nimbus crashed again while relaunching, it could be a bug.
> > > > > >
> > > > > > - Jungtaek Lim (HeartSaVioR)
> > > > > > On Fri, 25 Aug 2017 at 3:35 AM Alexandre Vermeerbergen <
> > > > > > avermeerbergen@gmail.com> wrote:
> > > > > >
> > > > > > > Hello Jungtaek,
> > > > > > >
> > > > > > > I can do what you suggest (ie moving storm local dir to a place
> > > which
> > > > > > isn't
> > > > > > > in /tmp),but since the issue occurs rarely (once per month), I
> > > doubt
> > > > > I'll
> > > > > > > be able to feedback soon.
> > > > > > >
> > > > > > > What is puzzling to me is that in order to recover from such
> > issue,
> > > > we
> > > > > > have
> > > > > > > to stop everything, then clean the Zookeepers, then restarts
> all
> > > > Storm
> > > > > > > processes and finally restart our topologies. We don't clean
> the
> > > /tmp
> > > > > > > directory.
> > > > > > >
> > > > > > > Are you sure there's no relationships between Zookeeper
> contents
> > > and
> > > > > what
> > > > > > > is causing the following message to happen?
> > > > > > > "Internal error processing getClusterInfo"
> > > > > > > Best regards;
> > > > > > > Alexandre
> > > > > > >
> > > > > > > 2017-08-24 13:41 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > > > > > >
> > > > > > > > Sorry I meant reproduce, not replicate. :)
> > > > > > > >
> > > > > > > > 2017년 8월 24일 (목) 오후 8:34, Jungtaek Lim <ka...@gmail.com>님이
> > 작성:
> > > > > > > >
> > > > > > > > > Alexandre,
> > > > > > > > >
> > > > > > > > > I found that your storm local dir is placed to "/tmp/storm"
> > > which
> > > > > > parts
> > > > > > > > or
> > > > > > > > > all could be removed at any time.
> > > > > > > > > Could you move the path to non-temporary place and try to
> > > > > replicate?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jungtaek Lim (HeartSaVioR)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2017년 8월 24일 (목) 오후 6:40, Alexandre Vermeerbergen <
> > > > > > > > > avermeerbergen@gmail.com>님이 작성:
> > > > > > > > >
> > > > > > > > >> Hello Jungtaek,
> > > > > > > > >>
> > > > > > > > >> Thank you very much for your answer.
> > > > > > > > >>
> > > > > > > > >> Please find attached the full Nimbus log (gzipped) related
> > to
> > > > this
> > > > > > > > issue.
> > > > > > > > >>
> > > > > > > > >> Please note that the last ERROR repeats forever until we
> > > > "repair"
> > > > > > > Storm.
> > > > > > > > >>
> > > > > > > > >> From the logs, it could be that the issue began close to
> > when
> > > a
> > > > > > > topology
> > > > > > > > >> was restarted (killed, then started)
> > > > > > > > >>
> > > > > > > > >> Maybe this caused a corruption in Zookeeper. Is there
> > anything
> > > > > > which I
> > > > > > > > >> can collect in our Zookeeper nodes/logs to help analysis?
> > > > > > > > >>
> > > > > > > > >> Best regards,
> > > > > > > > >> Alexandre
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <kabhwan@gmail.com
> >:
> > > > > > > > >>
> > > > > > > > >>> Hi Alexandre, I missed this mail since I was on vacation.
> > > > > > > > >>>
> > > > > > > > >>> I followed the stack trace but hard to analyze without
> > > context.
> > > > > Do
> > > > > > > you
> > > > > > > > >>> mind
> > > > > > > > >>> providing full nimbus log?
> > > > > > > > >>>
> > > > > > > > >>> Thanks,
> > > > > > > > >>> Jungtaek Lim (HeartSaVioR)
> > > > > > > > >>>
> > > > > > > > >>> 2017년 8월 16일 (수) 오전 4:12, Alexandre Vermeerbergen <
> > > > > > > > >>> avermeerbergen@gmail.com>님이
> > > > > > > > >>> 작성:
> > > > > > > > >>>
> > > > > > > > >>> > Hello,
> > > > > > > > >>> >
> > > > > > > > >>> > Tomorrow I will have to restart the cluster on which I
> > have
> > > > > this
> > > > > > > > issue
> > > > > > > > >>> with
> > > > > > > > >>> > Storm 1.1.0.
> > > > > > > > >>> > Is there are anybody interested in my running some
> > commands
> > > > to
> > > > > > get
> > > > > > > > more
> > > > > > > > >>> > logs before I repair this cluster?
> > > > > > > > >>> >
> > > > > > > > >>> > Best regards,
> > > > > > > > >>> > Alexandre Vermeerbergen
> > > > > > > > >>> >

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Alexandre Vermeerbergen <av...@gmail.com>.
Hello,

I'm afraid we had a new occurrence of this issue, where the storm.local.dir
directory was still present as a whole but maybe some file was missing or
corrupted.

The issue started again just after the Nimbus log said it was "Cleaning inbox":

2017-08-28 02:51:21.163 o.a.s.d.nimbus pool-14-thread-30 [INFO] Created download session for statefulAlerting_ec2-52-211-132-198-eu-west-1-compute-amazonaws-com_defaultStormTopic-276-1503888674-stormconf.ser with id 2594b2d0-bc4c-4e4e-b726-2e6970677f50
2017-08-28 02:56:09.549 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-e35343c5-b5a7-48ae-ad15-a4cf7e880094.jar
2017-08-28 03:06:02.673 o.a.s.t.ProcessFunction pool-14-thread-2 [ERROR] Internal error processing getClusterInfo
org.apache.storm.generated.KeyNotFoundException: null
    at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
    at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source) ~[?:?]

Couldn't it be that there's some race condition between Zookeeper and the
Storm local dir content, i.e., that Storm sometimes writes a key to Zookeeper
before it is actually committed to the Storm local dir?
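
To look for that kind of inconsistency, here is a rough sketch that compares
the blob keys registered in Zookeeper with the blob files present on the
Nimbus host. The Zookeeper host names are placeholders, and both the
/storm/blobstore znode path and the <storm.local.dir>/blobs location are
assumptions that may need adjusting to the actual configuration:

import os
from kazoo.client import KazooClient  # third-party Zookeeper client (pip install kazoo)

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"  # placeholder names for our three Zookeeper VMs
BLOBSTORE_ZNODE = "/storm/blobstore"     # assumed: storm.zookeeper.root (default /storm) + blobstore
LOCAL_BLOB_DIR = "/tmp/storm/blobs"      # assumed location under storm.local.dir

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    # Each child znode should correspond to one blob key known to the cluster.
    zk_keys = set(zk.get_children(BLOBSTORE_ZNODE))
finally:
    zk.stop()

# Collect every file and directory name under the local blob store; the exact
# on-disk layout may differ, so only a loose "key appears in a name" match is done.
local_names = []
for _, dirs, files in os.walk(LOCAL_BLOB_DIR):
    local_names.extend(dirs)
    local_names.extend(files)

for key in sorted(zk_keys):
    if not any(key in name for name in local_names):
        print("blob key present in Zookeeper but not found locally:", key)

Any key reported by such a check would match the KeyNotFoundException we keep
seeing from LocalFsBlobStore.getStoredBlobMeta.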

Best regards,
Alexandre Vermeerbergen



2017-08-27 1:24 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:

> Glad that you reproduced and fixed the problem. Happy to help!
>
> - Jungtaek Lim (HeartSaVioR)
>
> 2017년 8월 26일 (토) 오후 11:13, Alexandre Vermeerbergen <
> avermeerbergen@gmail.com>님이
> 작성:
>
> > Hello Jungtaek,
> >
> > Your answers were very useful, because I was able to reproduce the issue
> by
> > simply deleting storm.local.dir contents, and I found traces that indeed
> > the machine that suffered this issue had lost this directory : it
> confirms
> > your diagnostic, thank you very much for this!
> >
> > I have filled a JIRA: https://issues.apache.org/jira/browse/STORM-2707
> > hoping that this will help make Storm even more robust when it'll be
> fixed.
> >
> > In the meantime, I'm changing our storm.local.dir to a place that differs
> > from /tmp in order to reduce the risk of deletion of this directory.
> >
> > Best regard,
> > Alexandre Vermeerbergen
> >
> > 2017-08-25 9:44 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> >
> > > I'm not sure. Topology code can't be restored, so my best bet would be
> > > detecting it (periodically, or react in failure) and give up
> leadership.
> > > If my memory is right, leader Nimbus doesn't pull blobs from followers,
> > so
> > > if it doesn't have any blobs and need to sync, it just needs to become
> > > follower and let other nimbuses to take a chance to be a leader.
> > >
> > > It would not help for non-HA Nimbus cluster anyway. What's next? Maybe
> we
> > > could remove related topology (from ZK), but it feels me a bit
> dangerous
> > to
> > > do it automatically, so I'd rather leaning on CLI (maybe) tool so that
> > > operators can do it manually.
> > >
> > > Please file an issue regarding this when you would want to have a
> > solution.
> > >
> > > Best regards,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > > 2017년 8월 25일 (금) 오전 6:29, Alexandre Vermeerbergen <
> > > avermeerbergen@gmail.com>님이
> > > 작성:
> > >
> > > > Hello Jungtaek,
> > > >
> > > > I confirm that we currently do not have multiple Nimbus nodes.
> > > >
> > > > I want to clarify that Nimbus process never crashed : it keep
> printing
> > in
> > > > its log the error:
> > > >
> > > > 2017-08-06 03:44:01.777 o.a.s.t.ProcessFunction pool-14-thread-1
> > [ERROR]
> > > > Internal error processing getClusterInfo
> > > > org.apache.storm.generated.KeyNotFoundException: null
> > > > ... (rest of the stack trace)
> > > >
> > > > However I forgot to mention that:
> > > >
> > > > * During this problem our topologies are unaffected
> > > > * We notice the problem because we have a self-healing check on
> > Nimbus-UI
> > > > process which consists in calling Web Services to get a few stats on
> > > > running topologies. But when Nimbus-UI is unresponsive, we kill it
> and
> > > > restart it automatically.
> > > > * The period of our NimbusUI self healing cron is 1 minute
> > > > * When the getClusterInfo stack trace occurs on Nimbus log, Nimbus UI
> > is
> > > > unable to restart
> > > > * Here's the kind of exceptions we see in ui.log  (Nimbus UI's
> trace):
> > > >
> > > > 2017-08-24 21:24:21.777 o.a.s.u.core main [INFO] Starting ui server
> for
> > > > storm version '1.1.0'
> > > > 2017-08-24 21:24:21.788 o.a.s.d.m.MetricsUtils main [INFO] Using
> > > statistics
> > > > reporter
> > > > plugin:org.apache.storm.daemon.metrics.reporters.
> JmxPreparableReporter
> > > > 2017-08-24 21:24:21.790 o.a.s.d.m.r.JmxPreparableReporter main
> [INFO]
> > > > Preparing...
> > > > 2017-08-24 21:24:21.801 o.a.s.d.common main [INFO] Started statistics
> > > > report plugin...
> > > > 2017-08-24 21:24:21.878 o.a.s.s.o.e.j.s.Server main [INFO]
> > > > jetty-7.x.y-SNAPSHOT
> > > > 2017-08-24 21:24:21.929 o.a.s.s.o.e.j.s.h.ContextHandler main [INFO]
> > > > started o.a.s.s.o.e.j.s.ServletContextHandler{/,null}
> > > > 2017-08-24 21:24:21.965 o.a.s.s.o.e.j.s.AbstractConnector main
> [INFO]
> > > > Started SelectChannelConnector@0.0.0.0:8070
> > > > 2017-08-24 21:25:02.446 o.a.s.u.NimbusClient qtp2142536057-19 [WARN]
> > > > Ignoring exception while trying to get leader nimbus info from
> > > > ec2-52-48-59-151.eu-west-1.compute.amazonaws.com. will retry with a
> > > > different seed host.
> > > > org.apache.storm.thrift.transport.TTransportException: null
> > > >         at
> > > >
> > > > org.apache.storm.thrift.transport.TIOStreamTransport.
> > > read(TIOStreamTransport.java:132)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > > > org.apache.storm.thrift.transport.TFramedTransport.
> > > readFrame(TFramedTransport.java:129)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > > > org.apache.storm.thrift.transport.TFramedTransport.
> > > read(TFramedTransport.java:101)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > > > org.apache.storm.thrift.protocol.TBinaryProtocol.
> > > readAll(TBinaryProtocol.java:429)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > > > org.apache.storm.thrift.protocol.TBinaryProtocol.
> > > readI32(TBinaryProtocol.java:318)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > > > org.apache.storm.thrift.protocol.TBinaryProtocol.readMessageBegin(
> > > TBinaryProtocol.java:219)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > > org.apache.storm.thrift.TServiceClient.receiveBase(
> > > TServiceClient.java:77)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > > org.apache.storm.generated.Nimbus$Client.recv_getLeader(
> > > Nimbus.java:1193)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > > org.apache.storm.generated.Nimbus$Client.getLeader(Nimbus.java:1181)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > > > org.apache.storm.utils.NimbusClient.getConfiguredClientAs(
> > > NimbusClient.java:84)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > org.apache.storm.ui.core$cluster_summary.invoke(core.clj:355)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at org.apache.storm.ui.core$fn__9556.invoke(core.clj:1113)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > > > org.apache.storm.shade.compojure.core$make_route$fn__
> > > 5976.invoke(core.clj:100)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > > org.apache.storm.shade.compojure.core$if_route$fn__
> > > 5964.invoke(core.clj:46)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > >
> > > > org.apache.storm.shade.compojure.core$if_method$fn__
> > > 5957.invoke(core.clj:31)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at
> > > > org.apache.storm.shade.compojure.core$routing$fn__
> > > 5982.invoke(core.clj:113)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at clojure.core$some.invoke(core.clj:2570)
> > > ~[clojure-1.7.0.jar:?]
> > > >         at
> > > > org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:113)
> > > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >         at clojure.lang.RestFn.applyTo(RestFn.java:139)
> > > > ~[clojure-1.7.0.jar:?]
> > > >         at clojure.core$apply.invoke(core.clj:632)
> > > ~[clojure-1.7.0.jar:?]
> > > >
> > > > So our problem is that Nimbus stays forever in a inconsistent state
> > > without
> > > > crashing, and if it would crash, then it would be helpless since we
> > would
> > > > have automatic restart of it (we also have a self healing on Nimbus)
> > but
> > > > again in such case we need to manually cleanup the content in the ZKs
> > to
> > > be
> > > > able to restart Nimbus in a clean state.
> > > >
> > > > My question is the following one: is there a way to make Nimbus more
> > > > resilient to such issues?
> > > >
> > > > Best regards,
> > > > Alexandre
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > 2017-08-24 23:06 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > > >
> > > > > Blob files (meta, data) are in storm local directory. ZK only has
> > list
> > > of
> > > > > blob keys and which alive nimbuses have that file. So if you lose
> > storm
> > > > > local directory, you just can't restore blobs, unless other
> nimbuses
> > > have
> > > > > these blobs so current nimbus could pull.
> > > > > (I guess you have only one nimbus, since replication factor was 1
> in
> > > > nimbus
> > > > > log.)
> > > > >
> > > > > If FNF occurs while finding meta file, we just throw
> > > KeyNotFoundException
> > > > > because it is just same as not exist.
> > > > >
> > > > > Anyway you have to clear up everything because there's
> inconsistency
> > > > > between ZK and local, and no way to recover missed blob files so no
> > way
> > > > to
> > > > > make it consistent.
> > > > >
> > > > > Btw, in this case, my expectation is that after crashing and
> > > restarting,
> > > > > nimbus can't be a leader and just waits for missing blobs. If you
> saw
> > > > > Nimbus crashed again while relaunching, it could be a bug.
> > > > >
> > > > > - Jungtaek Lim (HeartSaVioR)
> > > > > On Fri, 25 Aug 2017 at 3:35 AM Alexandre Vermeerbergen <
> > > > > avermeerbergen@gmail.com> wrote:
> > > > >
> > > > > > Hello Jungtaek,
> > > > > >
> > > > > > I can do what you suggest (ie moving storm local dir to a place
> > which
> > > > > isn't
> > > > > > in /tmp),but since the issue occurs rarely (once per month), I
> > doubt
> > > > I'll
> > > > > > be able to feedback soon.
> > > > > >
> > > > > > What is puzzling to me is that in order to recover from such
> issue,
> > > we
> > > > > have
> > > > > > to stop everything, then clean the Zookeepers, then restarts all
> > > Storm
> > > > > > processes and finally restart our topologies. We don't clean the
> > /tmp
> > > > > > directory.
> > > > > >
> > > > > > Are you sure there's no relationships between Zookeeper contents
> > and
> > > > what
> > > > > > is causing the following message to happen?
> > > > > > "Internal error processing getClusterInfo"
> > > > > > Best regards;
> > > > > > Alexandre
> > > > > >
> > > > > > 2017-08-24 13:41 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > > > > >
> > > > > > > Sorry I meant reproduce, not replicate. :)
> > > > > > >
> > > > > > > 2017년 8월 24일 (목) 오후 8:34, Jungtaek Lim <ka...@gmail.com>님이
> 작성:
> > > > > > >
> > > > > > > > Alexandre,
> > > > > > > >
> > > > > > > > I found that your storm local dir is placed to "/tmp/storm"
> > which
> > > > > parts
> > > > > > > or
> > > > > > > > all could be removed at any time.
> > > > > > > > Could you move the path to non-temporary place and try to
> > > > replicate?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jungtaek Lim (HeartSaVioR)
> > > > > > > >
> > > > > > > >
> > > > > > > > 2017년 8월 24일 (목) 오후 6:40, Alexandre Vermeerbergen <
> > > > > > > > avermeerbergen@gmail.com>님이 작성:
> > > > > > > >
> > > > > > > >> Hello Jungtaek,
> > > > > > > >>
> > > > > > > >> Thank you very much for your answer.
> > > > > > > >>
> > > > > > > >> Please find attached the full Nimbus log (gzipped) related
> to
> > > this
> > > > > > > issue.
> > > > > > > >>
> > > > > > > >> Please note that the last ERROR repeats forever until we
> > > "repair"
> > > > > > Storm.
> > > > > > > >>
> > > > > > > >> From the logs, it could be that the issue began close to
> when
> > a
> > > > > > topology
> > > > > > > >> was restarted (killed, then started)
> > > > > > > >>
> > > > > > > >> Maybe this caused a corruption in Zookeeper. Is there
> anything
> > > > > which I
> > > > > > > >> can collect in our Zookeeper nodes/logs to help analysis?
> > > > > > > >>
> > > > > > > >> Best regards,
> > > > > > > >> Alexandre
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > > > > > > >>
> > > > > > > >>> Hi Alexandre, I missed this mail since I was on vacation.
> > > > > > > >>>
> > > > > > > >>> I followed the stack trace but hard to analyze without
> > context.
> > > > Do
> > > > > > you
> > > > > > > >>> mind
> > > > > > > >>> providing full nimbus log?
> > > > > > > >>>
> > > > > > > >>> Thanks,
> > > > > > > >>> Jungtaek Lim (HeartSaVioR)
> > > > > > > >>>
> > > > > > > >>> 2017년 8월 16일 (수) 오전 4:12, Alexandre Vermeerbergen <
> > > > > > > >>> avermeerbergen@gmail.com>님이
> > > > > > > >>> 작성:
> > > > > > > >>>
> > > > > > > >>> > Hello,
> > > > > > > >>> >
> > > > > > > >>> > Tomorrow I will have to restart the cluster on which I
> have
> > > > this
> > > > > > > issue
> > > > > > > >>> with
> > > > > > > >>> > Storm 1.1.0.
> > > > > > > >>> > Is there are anybody interested in my running some
> commands
> > > to
> > > > > get
> > > > > > > more
> > > > > > > >>> > logs before I repair this cluster?
> > > > > > > >>> >
> > > > > > > >>> > Best regards,
> > > > > > > >>> > Alexandre Vermeerbergen
> > > > > > > >>> >

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Jungtaek Lim <ka...@gmail.com>.
Glad that you reproduced and fixed the problem. Happy to help!

- Jungtaek Lim (HeartSaVioR)

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Alexandre Vermeerbergen <av...@gmail.com>.
Hello Jungtaek,

Your answers were very useful: I was able to reproduce the issue by simply
deleting the storm.local.dir contents, and I found traces showing that the
machine which suffered this issue had indeed lost this directory. That
confirms your diagnosis, thank you very much for this!
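
For reference, the reproduction boils down to something like the sketch below.
The paths are assumptions based on the default-style /tmp/storm local dir
discussed earlier in the thread, so adjust them to your installation; this is
a sketch of the failure scenario, not a recommended operation.

# With topologies running and Nimbus up, wipe the Nimbus storm.local.dir
# (assumed here to be the default-style /tmp/storm):
rm -rf /tmp/storm/*
# Then watch the Nimbus log (log path is an assumption, adjust to your layout):
# getClusterInfo calls start failing with KeyNotFoundException and Storm UI
# can no longer display the cluster.
tail -f /opt/storm/logs/nimbus.log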

I have filed a JIRA: https://issues.apache.org/jira/browse/STORM-2707
hoping that this will help make Storm even more robust once it is fixed.

In the meantime, I'm moving our storm.local.dir to a location outside /tmp in
order to reduce the risk of this directory being deleted.
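
A minimal sketch of that change, assuming a conventional /var path and a
"storm" service user (both are assumptions; any durable location outside /tmp
works):

# Create a durable local dir for Storm (example path, adjust as needed)
sudo mkdir -p /var/storm/local
sudo chown -R storm:storm /var/storm/local   # assumes Storm runs as user "storm"
# Point storm.yaml at it on every Storm node (Nimbus and supervisors):
#   storm.local.dir: "/var/storm/local"
# Restart the Storm daemons so they pick up the new directory.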

Best regards,
Alexandre Vermeerbergen

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Jungtaek Lim <ka...@gmail.com>.
I'm not sure. Topology code can't be restored, so my best bet would be
detecting it (periodically, or reacting on failure) and giving up leadership.
If my memory is right, the leader Nimbus doesn't pull blobs from followers, so
if it has no blobs and needs to sync, it just needs to become a follower and
let the other nimbuses take a chance to become the leader.

That would not help a non-HA Nimbus cluster anyway. What's next? Maybe we
could remove the related topology (from ZK), but doing that automatically
feels a bit dangerous to me, so I'd rather lean on a CLI tool (maybe) so that
operators can do it manually.
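
As a rough sketch of the kind of manual check an operator could run today (the
znode path and the blobs/ subdirectory are assumptions about the default
layout, not a documented procedure):

# Compare the blob keys ZooKeeper knows about with what this Nimbus has on disk.
# Host names and paths are placeholders.
ZK_SERVER=zk1.example.com:2181        # any member of the ZooKeeper ensemble
STORM_LOCAL_DIR=/var/storm/local      # whatever storm.local.dir points to

# Blob keys registered in ZooKeeper (assuming the default /storm root)
zkCli.sh -server "$ZK_SERVER" ls /storm/blobstore

# Blob files this Nimbus holds locally
ls "$STORM_LOCAL_DIR/blobs"

# A key listed in ZooKeeper but missing from blobs/ is the inconsistency behind
# the KeyNotFoundException in getClusterInfo.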

Please file an issue regarding this if you would like to have a solution.

Best regards,
Jungtaek Lim (HeartSaVioR)

On Fri, 25 Aug 2017 at 6:29 AM Alexandre Vermeerbergen <av...@gmail.com> wrote:

> Hello Jungtaek,
>
> I confirm that we currently do not have multiple Nimbus nodes.
>
> I want to clarify that Nimbus process never crashed : it keep printing in
> its log the error:
>
> 2017-08-06 03:44:01.777 o.a.s.t.ProcessFunction pool-14-thread-1 [ERROR]
> Internal error processing getClusterInfo
> org.apache.storm.generated.KeyNotFoundException: null
> ... (rest of the stack trace)
>
> However I forgot to mention that:
>
> * During this problem our topologies are unaffected
> * We notice the problem because we have a self-healing check on Nimbus-UI
> process which consists in calling Web Services to get a few stats on
> running topologies. But when Nimbus-UI is unresponsive, we kill it and
> restart it automatically.
> * The period of our NimbusUI self healing cron is 1 minute
> * When the getClusterInfo stack trace occurs on Nimbus log, Nimbus UI is
> unable to restart
> * Here's the kind of exceptions we see in ui.log  (Nimbus UI's trace):
>
> 2017-08-24 21:24:21.777 o.a.s.u.core main [INFO] Starting ui server for
> storm version '1.1.0'
> 2017-08-24 21:24:21.788 o.a.s.d.m.MetricsUtils main [INFO] Using statistics
> reporter
> plugin:org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter
> 2017-08-24 21:24:21.790 o.a.s.d.m.r.JmxPreparableReporter main [INFO]
> Preparing...
> 2017-08-24 21:24:21.801 o.a.s.d.common main [INFO] Started statistics
> report plugin...
> 2017-08-24 21:24:21.878 o.a.s.s.o.e.j.s.Server main [INFO]
> jetty-7.x.y-SNAPSHOT
> 2017-08-24 21:24:21.929 o.a.s.s.o.e.j.s.h.ContextHandler main [INFO]
> started o.a.s.s.o.e.j.s.ServletContextHandler{/,null}
> 2017-08-24 21:24:21.965 o.a.s.s.o.e.j.s.AbstractConnector main [INFO]
> Started SelectChannelConnector@0.0.0.0:8070
> 2017-08-24 21:25:02.446 o.a.s.u.NimbusClient qtp2142536057-19 [WARN]
> Ignoring exception while trying to get leader nimbus info from
> ec2-52-48-59-151.eu-west-1.compute.amazonaws.com. will retry with a
> different seed host.
> org.apache.storm.thrift.transport.TTransportException: null
>         at
>
> org.apache.storm.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
> org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
>
> org.apache.storm.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
>
> org.apache.storm.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
> org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
>
> org.apache.storm.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
>
> org.apache.storm.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
>
> org.apache.storm.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
> org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
> org.apache.storm.generated.Nimbus$Client.recv_getLeader(Nimbus.java:1193)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
> org.apache.storm.generated.Nimbus$Client.getLeader(Nimbus.java:1181)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
>
> org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:84)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.ui.core$cluster_summary.invoke(core.clj:355)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.ui.core$fn__9556.invoke(core.clj:1113)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
>
> org.apache.storm.shade.compojure.core$make_route$fn__5976.invoke(core.clj:100)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
> org.apache.storm.shade.compojure.core$if_route$fn__5964.invoke(core.clj:46)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
>
> org.apache.storm.shade.compojure.core$if_method$fn__5957.invoke(core.clj:31)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at
> org.apache.storm.shade.compojure.core$routing$fn__5982.invoke(core.clj:113)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at clojure.core$some.invoke(core.clj:2570) ~[clojure-1.7.0.jar:?]
>         at
> org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:113)
> ~[storm-core-1.1.0.jar:1.1.0]
>         at clojure.lang.RestFn.applyTo(RestFn.java:139)
> ~[clojure-1.7.0.jar:?]
>         at clojure.core$apply.invoke(core.clj:632) ~[clojure-1.7.0.jar:?]
>
> So our problem is that Nimbus stays forever in a inconsistent state without
> crashing, and if it would crash, then it would be helpless since we would
> have automatic restart of it (we also have a self healing on Nimbus) but
> again in such case we need to manually cleanup the content in the ZKs to be
> able to restart Nimbus in a clean state.
>
> My question is the following one: is there a way to make Nimbus more
> resilient to such issues?
>
> Best regards,
> Alexandre
>
>
>
>
>
>
>
> 2017-08-24 23:06 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
>
> > Blob files (meta, data) are in storm local directory. ZK only has list of
> > blob keys and which alive nimbuses have that file. So if you lose storm
> > local directory, you just can't restore blobs, unless other nimbuses have
> > these blobs so current nimbus could pull.
> > (I guess you have only one nimbus, since replication factor was 1 in
> nimbus
> > log.)
> >
> > If FNF occurs while finding meta file, we just throw KeyNotFoundException
> > because it is just same as not exist.
> >
> > Anyway you have to clear up everything because there's inconsistency
> > between ZK and local, and no way to recover missed blob files so no way
> to
> > make it consistent.
> >
> > Btw, in this case, my expectation is that after crashing and restarting,
> > nimbus can't be a leader and just waits for missing blobs. If you saw
> > Nimbus crashed again while relaunching, it could be a bug.
> >
> > - Jungtaek Lim (HeartSaVioR)
> > On Fri, 25 Aug 2017 at 3:35 AM Alexandre Vermeerbergen <
> > avermeerbergen@gmail.com> wrote:
> >
> > > Hello Jungtaek,
> > >
> > > I can do what you suggest (ie moving storm local dir to a place which
> > isn't
> > > in /tmp),but since the issue occurs rarely (once per month), I doubt
> I'll
> > > be able to feedback soon.
> > >
> > > What is puzzling to me is that in order to recover from such issue, we
> > have
> > > to stop everything, then clean the Zookeepers, then restarts all Storm
> > > processes and finally restart our topologies. We don't clean the /tmp
> > > directory.
> > >
> > > Are you sure there's no relationships between Zookeeper contents and
> what
> > > is causing the following message to happen?
> > > "Internal error processing getClusterInfo"
> > > Best regards;
> > > Alexandre
> > >
> > > 2017-08-24 13:41 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > >
> > > > Sorry I meant reproduce, not replicate. :)
> > > >
> > > > On Thu, 24 Aug 2017 at 8:34 PM Jungtaek Lim <ka...@gmail.com> wrote:
> > > >
> > > > > Alexandre,
> > > > >
> > > > > I found that your storm local dir is placed to "/tmp/storm" which
> > parts
> > > > or
> > > > > all could be removed at any time.
> > > > > Could you move the path to non-temporary place and try to
> replicate?
> > > > >
> > > > > Thanks,
> > > > > Jungtaek Lim (HeartSaVioR)
> > > > >
> > > > >
> > > > > On Thu, 24 Aug 2017 at 6:40 PM Alexandre Vermeerbergen <
> > > > > avermeerbergen@gmail.com> wrote:
> > > > >
> > > > >> Hello Jungtaek,
> > > > >>
> > > > >> Thank you very much for your answer.
> > > > >>
> > > > >> Please find attached the full Nimbus log (gzipped) related to this
> > > > issue.
> > > > >>
> > > > >> Please note that the last ERROR repeats forever until we "repair"
> > > Storm.
> > > > >>
> > > > >> From the logs, it could be that the issue began close to when a
> > > topology
> > > > >> was restarted (killed, then started)
> > > > >>
> > > > >> Maybe this caused a corruption in Zookeeper. Is there anything
> > which I
> > > > >> can collect in our Zookeeper nodes/logs to help analysis?
> > > > >>
> > > > >> Best regards,
> > > > >> Alexandre
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > > > >>
> > > > >>> Hi Alexandre, I missed this mail since I was on vacation.
> > > > >>>
> > > > >>> I followed the stack trace but hard to analyze without context.
> Do
> > > you
> > > > >>> mind
> > > > >>> providing full nimbus log?
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Jungtaek Lim (HeartSaVioR)
> > > > >>>
> > > > >>> On Wed, 16 Aug 2017 at 4:12 AM Alexandre Vermeerbergen <
> > > > >>> avermeerbergen@gmail.com> wrote:
> > > > >>>
> > > > >>> > Hello,
> > > > >>> >
> > > > >>> > Tomorrow I will have to restart the cluster on which I have
> this
> > > > issue
> > > > >>> with
> > > > >>> > Storm 1.1.0.
> > > > >>> > Is there are anybody interested in my running some commands to
> > get
> > > > more
> > > > >>> > logs before I repair this cluster?
> > > > >>> >
> > > > >>> > Best regards,
> > > > >>> > Alexandre Vermeerbergen
> > > > >>> >
> > > > >>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <
> > > > >>> > avermeerbergen@gmail.com
> > > > >>> > >:
> > > > >>> >
> > > > >>> > > Hello,
> > > > >>> > >
> > > > >>> > > I think it might be of interest for you Storm developers to
> > learn
> > > > >>> that I
> > > > >>> > > currently have a case of issue with Storm 1.1.0 which was
> > > supposed
> > > > to
> > > > >>> > > resolved in this release according to
> > https://issues.apache.org/
> > > > >>> > > jira/browse/STORM-1977 ; and I can look for any more
> > information
> > > > >>> which
> > > > >>> > > you'd need to make a diagnostic on why this issue still can
> > > happen.
> > > > >>> > >
> > > > >>> > > Indeed, I have a Storm UI process which can't get any
> > information
> > > > on
> > > > >>> its
> > > > >>> > > Storm cluster, and I see many following exception in
> > nimbus.log:
> > > > >>> > >
> > > > >>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21
> [INFO]
> > > > >>> Created
> > > > >>> > > download session for
> > > > >>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
> > > > >>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser
> > with
> > > > id
> > > > >>> > > d5120ad7-a81c-4c39-afc5-a7f876b04c73
> > > > >>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27
> [INFO]
> > > > >>> Created
> > > > >>> > > download session for
> > > > >>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
> > > > >>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser
> > with
> > > > id
> > > > >>> > > aba18011-3258-4023-bbef-14d21a7066e1
> > > > >>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning
> > > inbox
> > > > >>> ...
> > > > >>> > > deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> > > > >>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction
> > pool-14-thread-34
> > > > >>> [ERROR]
> > > > >>> > > Internal error processing getClusterInfo
> > > > >>> > > org.apache.storm.generated.KeyNotFoundException: null
> > > > >>> > >         at
> > > > >>> >
> > > > >>> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(
> > > > LocalFsBlobStore.java:147)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at
> > > > >>> >
> > > > >>> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(
> > > > LocalFsBlobStore.java:299)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at sun.reflect.GeneratedMethodAccessor78.
> > invoke(Unknown
> > > > >>> Source)
> > > > >>> > > ~[?:?]
> > > > >>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > > >>> > > DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
> > > > >>> > >         at java.lang.reflect.Method.invoke(Method.java:498)
> > > > >>> > ~[?:1.8.0_121]
> > > > >>> > >         at
> > > > >>> clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
> > > > >>> > > ~[clojure-1.7.0.jar:?]
> > > > >>> > >         at
> > > > >>> clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
> > > > >>> > > ~[clojure-1.7.0.jar:?]
> > > > >>> > >         at
> > > > >>> >
> > > > >>> org.apache.storm.daemon.nimbus$get_blob_replication_
> > > > count.invoke(nimbus.clj:489)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at org.apache.storm.daemon.
> > nimbus$get_cluster_info$iter__
> > > > >>> > > 10687__10691$fn__10692.invoke(nimbus.clj:1550)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40)
> > > > >>> > > ~[clojure-1.7.0.jar:?]
> > > > >>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49)
> > > > >>> > > ~[clojure-1.7.0.jar:?]
> > > > >>> > >         at clojure.lang.RT.seq(RT.java:507)
> > > ~[clojure-1.7.0.jar:?]
> > > > >>> > >         at clojure.core$seq__4128.invoke(core.clj:137)
> > > > >>> > > ~[clojure-1.7.0.jar:?]
> > > > >>> > >         at clojure.core$dorun.invoke(core.clj:3009)
> > > > >>> > ~[clojure-1.7.0.jar:?]
> > > > >>> > >         at clojure.core$doall.invoke(core.clj:3025)
> > > > >>> > ~[clojure-1.7.0.jar:?]
> > > > >>> > >         at
> > > > >>> > org.apache.storm.daemon.nimbus$get_cluster_info.
> > > > invoke(nimbus.clj:1524)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$
> > > > >>> > > reify__10782.getClusterInfo(nimbus.clj:1971)
> > > > >>> > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at org.apache.storm.generated.Nimbus$Processor$
> > > > >>> > > getClusterInfo.getResult(Nimbus.java:3920)
> > > > >>> ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at org.apache.storm.generated.Nimbus$Processor$
> > > > >>> > > getClusterInfo.getResult(Nimbus.java:3904)
> > > > >>> ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at
> > > > >>> >
> > > > >>> org.apache.storm.thrift.ProcessFunction.process(
> > > > ProcessFunction.java:39)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at
> > > > >>> > org.apache.storm.thrift.TBaseProcessor.process(
> > > > TBaseProcessor.java:39)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at org.apache.storm.security.
> > auth.SimpleTransportPlugin$
> > > > >>> > > SimpleWrapProcessor.process(SimpleTransportPlugin.java:162)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at org.apache.storm.thrift.server.
> > > > AbstractNonblockingServer$
> > > > >>> > > FrameBuffer.invoke(AbstractNonblockingServer.java:518)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at
> > > > >>> > org.apache.storm.thrift.server.Invocation.run(
> > Invocation.java:18)
> > > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > > >>> > >         at
> > > > >>> >
> > > > >>> java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > > ThreadPoolExecutor.java:1142)
> > > > >>> > > [?:1.8.0_121]
> > > > >>> > >         at
> > > > >>> >
> > > > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > > > ThreadPoolExecutor.java:617)
> > > > >>> > > [?:1.8.0_121]
> > > > >>> > >         at java.lang.Thread.run(Thread.java:745)
> [?:1.8.0_121]
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > What is amazing is that I got this same issue on two Storm
> > > clusters
> > > > >>> > > running on different VMs ; just they share the same data in
> > they
> > > > >>> Kafka
> > > > >>> > > Broker cluster (one cluster is the production one, which was
> > > > quickly
> > > > >>> > fixed,
> > > > >>> > > and the other one is the "backup" cluster to be used if the
> > > > >>> production
> > > > >>> > one
> > > > >>> > > fails for quick "back to production")
> > > > >>> > >
> > > > >>> > > If left one of these cluster with this behavior because I
> felt
> > > that
> > > > >>> it
> > > > >>> > > could be interesting for Storm developers to have more
> > > information
> > > > on
> > > > >>> > this
> > > > >>> > > issue, if needed to properly diagnose it.
> > > > >>> > >
> > > > >>> > > I can keep this cluster as is for max 2 days.
> > > > >>> > >
> > > > >>> > > Is there anything useful which I can collect on it to help
> > Storm
> > > > >>> > > developers to understand the cause (and hopefully use it to
> > make
> > > > >>> Storm
> > > > >>> > more
> > > > >>> > > robust) ?
> > > > >>> > >
> > > > >>> > > Few details:
> > > > >>> > >
> > > > >>> > > * Storm 1.1.0 cluster with Nimbus & NimbusUI running on a VM,
> > > and 4
> > > > >>> > > Supervisors VMs + 3 Zookeeper VMs
> > > > >>> > >
> > > > >>> > > * Running with Java Server JRE 1.8.0_121
> > > > >>> > > * Running on AWS EC2 instances
> > > > >>> > >
> > > > >>> > > * We run about 10 topologies, with automatic self-healing on
> > them
> > > > (if
> > > > >>> > they
> > > > >>> > > stop consuming Kafka items, our self-healer call "Kill
> > topology",
> > > > and
> > > > >>> > then
> > > > >>> > > eventually restarts the topology
> > > > >>> > >
> > > > >>> > > * We have a self-healing on Nimbus UI based on calling its
> REST
> > > > >>> services.
> > > > >>> > > If it's not responding fast enough, we restart Nimbus UI
> > > > >>> > > * We figured out the issue because Nimbus UI was restarted
> > every
> > > 2
> > > > >>> > minutes
> > > > >>> > >
> > > > >>> > > * To fix our production server which had the same symptom, we
> > had
> > > > to
> > > > >>> stop
> > > > >>> > > all Storm processes, then stop all Zookeepers, then remove
> all
> > > data
> > > > >>> in
> > > > >>> > > Zookeeper "snapshot files", then restart all Zookeeper, then
> > > > restart
> > > > >>> all
> > > > >>> > > Storm process, then re-submit all our topologies
> > > > >>> > >
> > > > >>> > > Please be as clear as possible about which commands we should
> > run
> > > > to
> > > > >>> give
> > > > >>> > > you more details if needed
> > > > >>> > >
> > > > >>> > > Best regards,
> > > > >>> > > Alexandre Vermeerbergen
> > > > >>> > >
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > > >>
> > > > >>
> > > >
> > >
> >
>

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Alexandre Vermeerbergen <av...@gmail.com>.
Hello Jungtaek,

I confirm that we currently do not have multiple Nimbus nodes.

I want to clarify that the Nimbus process never crashed: it keeps printing
the following error in its log:

2017-08-06 03:44:01.777 o.a.s.t.ProcessFunction pool-14-thread-1 [ERROR]
Internal error processing getClusterInfo
org.apache.storm.generated.KeyNotFoundException: null
... (rest of the stack trace)

However I forgot to mention that:

* During this problem, our topologies are unaffected
* We notice the problem because we have a self-healing check on the Nimbus UI
process, which consists in calling its web services to get a few stats on the
running topologies. When Nimbus UI is unresponsive, we kill it and restart it
automatically (a minimal sketch of such a check follows the ui.log trace below)
* The period of our Nimbus UI self-healing cron is 1 minute
* When the getClusterInfo stack trace appears in the Nimbus log, Nimbus UI is
unable to restart
* Here is the kind of exception we see in ui.log (the Nimbus UI trace):

2017-08-24 21:24:21.777 o.a.s.u.core main [INFO] Starting ui server for
storm version '1.1.0'
2017-08-24 21:24:21.788 o.a.s.d.m.MetricsUtils main [INFO] Using statistics
reporter
plugin:org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter
2017-08-24 21:24:21.790 o.a.s.d.m.r.JmxPreparableReporter main [INFO]
Preparing...
2017-08-24 21:24:21.801 o.a.s.d.common main [INFO] Started statistics
report plugin...
2017-08-24 21:24:21.878 o.a.s.s.o.e.j.s.Server main [INFO]
jetty-7.x.y-SNAPSHOT
2017-08-24 21:24:21.929 o.a.s.s.o.e.j.s.h.ContextHandler main [INFO]
started o.a.s.s.o.e.j.s.ServletContextHandler{/,null}
2017-08-24 21:24:21.965 o.a.s.s.o.e.j.s.AbstractConnector main [INFO]
Started SelectChannelConnector@0.0.0.0:8070
2017-08-24 21:25:02.446 o.a.s.u.NimbusClient qtp2142536057-19 [WARN]
Ignoring exception while trying to get leader nimbus info from
ec2-52-48-59-151.eu-west-1.compute.amazonaws.com. will retry with a
different seed host.
org.apache.storm.thrift.transport.TTransportException: null
        at
org.apache.storm.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.generated.Nimbus$Client.recv_getLeader(Nimbus.java:1193)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.generated.Nimbus$Client.getLeader(Nimbus.java:1181)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:84)
~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.ui.core$cluster_summary.invoke(core.clj:355)
~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.ui.core$fn__9556.invoke(core.clj:1113)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.shade.compojure.core$make_route$fn__5976.invoke(core.clj:100)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.shade.compojure.core$if_route$fn__5964.invoke(core.clj:46)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.shade.compojure.core$if_method$fn__5957.invoke(core.clj:31)
~[storm-core-1.1.0.jar:1.1.0]
        at
org.apache.storm.shade.compojure.core$routing$fn__5982.invoke(core.clj:113)
~[storm-core-1.1.0.jar:1.1.0]
        at clojure.core$some.invoke(core.clj:2570) ~[clojure-1.7.0.jar:?]
        at
org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:113)
~[storm-core-1.1.0.jar:1.1.0]
        at clojure.lang.RestFn.applyTo(RestFn.java:139)
~[clojure-1.7.0.jar:?]
        at clojure.core$apply.invoke(core.clj:632) ~[clojure-1.7.0.jar:?]
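
For reference, our Nimbus UI check boils down to something like the sketch
below (illustrative only: it assumes the Storm 1.x UI REST API at
/api/v1/cluster/summary, the port 8070 from our ui.log, a UI running on
localhost, and a made-up class name):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class NimbusUiHealthCheck {
    public static void main(String[] args) throws IOException {
        // Cluster summary endpoint of the Storm UI REST API (assumed URL).
        URL url = new URL("http://localhost:8070/api/v1/cluster/summary");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5_000);   // fail fast if the UI hangs
        conn.setReadTimeout(10_000);
        int status = conn.getResponseCode();
        conn.disconnect();
        if (status != 200) {
            System.exit(1);   // the cron wrapper then restarts Nimbus UI
        }
    }
}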

So our problem is that Nimbus stays forever in an inconsistent state without
crashing. Even if it did crash, that would not help: it would be restarted
automatically (we also have self-healing on Nimbus), but we would still need
to manually clean up the contents of Zookeeper before Nimbus could come back
in a clean state.
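
Probing Nimbus directly over Thrift shows the same thing. Here is a minimal
sketch (assuming storm-core 1.1.0 on the classpath and a readable storm.yaml;
the class name is just for illustration): while Nimbus is in this state the
call fails with the "Internal error processing getClusterInfo" error shown
above, whereas on a healthy cluster it returns the summary.

import java.util.Map;
import org.apache.storm.generated.ClusterSummary;
import org.apache.storm.utils.NimbusClient;
import org.apache.storm.utils.Utils;

public class GetClusterInfoProbe {
    public static void main(String[] args) throws Exception {
        Map conf = Utils.readStormConfig();  // defaults merged with storm.yaml
        NimbusClient client = NimbusClient.getConfiguredClient(conf);
        try {
            // Same Thrift call the UI makes for its cluster summary page.
            ClusterSummary summary = client.getClient().getClusterInfo();
            System.out.println("topologies=" + summary.get_topologies_size()
                    + ", supervisors=" + summary.get_supervisors_size());
        } finally {
            client.close();
        }
    }
}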

My question is: is there a way to make Nimbus more resilient to this kind of
issue?

Best regards,
Alexandre







2017-08-24 23:06 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:

> Blob files (meta, data) are in storm local directory. ZK only has list of
> blob keys and which alive nimbuses have that file. So if you lose storm
> local directory, you just can't restore blobs, unless other nimbuses have
> these blobs so current nimbus could pull.
> (I guess you have only one nimbus, since replication factor was 1 in nimbus
> log.)
>
> If FNF occurs while finding meta file, we just throw KeyNotFoundException
> because it is just same as not exist.
>
> Anyway you have to clear up everything because there's inconsistency
> between ZK and local, and no way to recover missed blob files so no way to
> make it consistent.
>
> Btw, in this case, my expectation is that after crashing and restarting,
> nimbus can't be a leader and just waits for missing blobs. If you saw
> Nimbus crashed again while relaunching, it could be a bug.
>
> - Jungtaek Lim (HeartSaVioR)
> On Fri, 25 Aug 2017 at 3:35 AM Alexandre Vermeerbergen <
> avermeerbergen@gmail.com> wrote:
>
> > Hello Jungtaek,
> >
> > I can do what you suggest (ie moving storm local dir to a place which
> isn't
> > in /tmp),but since the issue occurs rarely (once per month), I doubt I'll
> > be able to feedback soon.
> >
> > What is puzzling to me is that in order to recover from such issue, we
> have
> > to stop everything, then clean the Zookeepers, then restarts all Storm
> > processes and finally restart our topologies. We don't clean the /tmp
> > directory.
> >
> > Are you sure there's no relationships between Zookeeper contents and what
> > is causing the following message to happen?
> > "Internal error processing getClusterInfo"
> > Best regards;
> > Alexandre
> >
> > 2017-08-24 13:41 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> >
> > > Sorry I meant reproduce, not replicate. :)
> > >
> > > 2017년 8월 24일 (목) 오후 8:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
> > >
> > > > Alexandre,
> > > >
> > > > I found that your storm local dir is placed to "/tmp/storm" which
> parts
> > > or
> > > > all could be removed at any time.
> > > > Could you move the path to non-temporary place and try to replicate?
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > >
> > > > 2017년 8월 24일 (목) 오후 6:40, Alexandre Vermeerbergen <
> > > > avermeerbergen@gmail.com>님이 작성:
> > > >
> > > >> Hello Jungtaek,
> > > >>
> > > >> Thank you very much for your answer.
> > > >>
> > > >> Please find attached the full Nimbus log (gzipped) related to this
> > > issue.
> > > >>
> > > >> Please note that the last ERROR repeats forever until we "repair"
> > Storm.
> > > >>
> > > >> From the logs, it could be that the issue began close to when a
> > topology
> > > >> was restarted (killed, then started)
> > > >>
> > > >> Maybe this caused a corruption in Zookeeper. Is there anything
> which I
> > > >> can collect in our Zookeeper nodes/logs to help analysis?
> > > >>
> > > >> Best regards,
> > > >> Alexandre
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > > >>
> > > >>> Hi Alexandre, I missed this mail since I was on vacation.
> > > >>>
> > > >>> I followed the stack trace but hard to analyze without context. Do
> > you
> > > >>> mind
> > > >>> providing full nimbus log?
> > > >>>
> > > >>> Thanks,
> > > >>> Jungtaek Lim (HeartSaVioR)
> > > >>>
> > > >>> 2017년 8월 16일 (수) 오전 4:12, Alexandre Vermeerbergen <
> > > >>> avermeerbergen@gmail.com>님이
> > > >>> 작성:
> > > >>>
> > > >>> > Hello,
> > > >>> >
> > > >>> > Tomorrow I will have to restart the cluster on which I have this
> > > issue
> > > >>> with
> > > >>> > Storm 1.1.0.
> > > >>> > Is there are anybody interested in my running some commands to
> get
> > > more
> > > >>> > logs before I repair this cluster?
> > > >>> >
> > > >>> > Best regards,
> > > >>> > Alexandre Vermeerbergen
> > > >>> >
> > > >>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <
> > > >>> > avermeerbergen@gmail.com
> > > >>> > >:
> > > >>> >
> > > >>> > > Hello,
> > > >>> > >
> > > >>> > > I think it might be of interest for you Storm developers to
> learn
> > > >>> that I
> > > >>> > > currently have a case of issue with Storm 1.1.0 which was
> > supposed
> > > to
> > > >>> > > resolved in this release according to
> https://issues.apache.org/
> > > >>> > > jira/browse/STORM-1977 ; and I can look for any more
> information
> > > >>> which
> > > >>> > > you'd need to make a diagnostic on why this issue still can
> > happen.
> > > >>> > >
> > > >>> > > Indeed, I have a Storm UI process which can't get any
> information
> > > on
> > > >>> its
> > > >>> > > Storm cluster, and I see many following exception in
> nimbus.log:
> > > >>> > >
> > > >>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO]
> > > >>> Created
> > > >>> > > download session for
> > > >>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
> > > >>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser
> with
> > > id
> > > >>> > > d5120ad7-a81c-4c39-afc5-a7f876b04c73
> > > >>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO]
> > > >>> Created
> > > >>> > > download session for
> > > >>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
> > > >>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser
> with
> > > id
> > > >>> > > aba18011-3258-4023-bbef-14d21a7066e1
> > > >>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning
> > inbox
> > > >>> ...
> > > >>> > > deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> > > >>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction
> pool-14-thread-34
> > > >>> [ERROR]
> > > >>> > > Internal error processing getClusterInfo
> > > >>> > > org.apache.storm.generated.KeyNotFoundException: null
> > > >>> > >         at
> > > >>> >
> > > >>> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(
> > > LocalFsBlobStore.java:147)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at
> > > >>> >
> > > >>> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(
> > > LocalFsBlobStore.java:299)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at sun.reflect.GeneratedMethodAccessor78.
> invoke(Unknown
> > > >>> Source)
> > > >>> > > ~[?:?]
> > > >>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > >>> > > DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
> > > >>> > >         at java.lang.reflect.Method.invoke(Method.java:498)
> > > >>> > ~[?:1.8.0_121]
> > > >>> > >         at
> > > >>> clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
> > > >>> > > ~[clojure-1.7.0.jar:?]
> > > >>> > >         at
> > > >>> clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
> > > >>> > > ~[clojure-1.7.0.jar:?]
> > > >>> > >         at
> > > >>> >
> > > >>> org.apache.storm.daemon.nimbus$get_blob_replication_
> > > count.invoke(nimbus.clj:489)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.daemon.
> nimbus$get_cluster_info$iter__
> > > >>> > > 10687__10691$fn__10692.invoke(nimbus.clj:1550)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40)
> > > >>> > > ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49)
> > > >>> > > ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.lang.RT.seq(RT.java:507)
> > ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.core$seq__4128.invoke(core.clj:137)
> > > >>> > > ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.core$dorun.invoke(core.clj:3009)
> > > >>> > ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.core$doall.invoke(core.clj:3025)
> > > >>> > ~[clojure-1.7.0.jar:?]
> > > >>> > >         at
> > > >>> > org.apache.storm.daemon.nimbus$get_cluster_info.
> > > invoke(nimbus.clj:1524)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$
> > > >>> > > reify__10782.getClusterInfo(nimbus.clj:1971)
> > > >>> > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.generated.Nimbus$Processor$
> > > >>> > > getClusterInfo.getResult(Nimbus.java:3920)
> > > >>> ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.generated.Nimbus$Processor$
> > > >>> > > getClusterInfo.getResult(Nimbus.java:3904)
> > > >>> ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at
> > > >>> >
> > > >>> org.apache.storm.thrift.ProcessFunction.process(
> > > ProcessFunction.java:39)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at
> > > >>> > org.apache.storm.thrift.TBaseProcessor.process(
> > > TBaseProcessor.java:39)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.security.
> auth.SimpleTransportPlugin$
> > > >>> > > SimpleWrapProcessor.process(SimpleTransportPlugin.java:162)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.thrift.server.
> > > AbstractNonblockingServer$
> > > >>> > > FrameBuffer.invoke(AbstractNonblockingServer.java:518)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at
> > > >>> > org.apache.storm.thrift.server.Invocation.run(
> Invocation.java:18)
> > > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at
> > > >>> >
> > > >>> java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > ThreadPoolExecutor.java:1142)
> > > >>> > > [?:1.8.0_121]
> > > >>> > >         at
> > > >>> >
> > > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > > ThreadPoolExecutor.java:617)
> > > >>> > > [?:1.8.0_121]
> > > >>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
> > > >>> > >
> > > >>> > >
> > > >>> > > What is amazing is that I got this same issue on two Storm
> > clusters
> > > >>> > > running on different VMs ; just they share the same data in
> they
> > > >>> Kafka
> > > >>> > > Broker cluster (one cluster is the production one, which was
> > > quickly
> > > >>> > fixed,
> > > >>> > > and the other one is the "backup" cluster to be used if the
> > > >>> production
> > > >>> > one
> > > >>> > > fails for quick "back to production")
> > > >>> > >
> > > >>> > > If left one of these cluster with this behavior because I felt
> > that
> > > >>> it
> > > >>> > > could be interesting for Storm developers to have more
> > information
> > > on
> > > >>> > this
> > > >>> > > issue, if needed to properly diagnose it.
> > > >>> > >
> > > >>> > > I can keep this cluster as is for max 2 days.
> > > >>> > >
> > > >>> > > Is there anything useful which I can collect on it to help
> Storm
> > > >>> > > developers to understand the cause (and hopefully use it to
> make
> > > >>> Storm
> > > >>> > more
> > > >>> > > robust) ?
> > > >>> > >
> > > >>> > > Few details:
> > > >>> > >
> > > >>> > > * Storm 1.1.0 cluster with Nimbus & NimbusUI running on a VM,
> > and 4
> > > >>> > > Supervisors VMs + 3 Zookeeper VMs
> > > >>> > >
> > > >>> > > * Running with Java Server JRE 1.8.0_121
> > > >>> > > * Running on AWS EC2 instances
> > > >>> > >
> > > >>> > > * We run about 10 topologies, with automatic self-healing on
> them
> > > (if
> > > >>> > they
> > > >>> > > stop consuming Kafka items, our self-healer call "Kill
> topology",
> > > and
> > > >>> > then
> > > >>> > > eventually restarts the topology
> > > >>> > >
> > > >>> > > * We have a self-healing on Nimbus UI based on calling its REST
> > > >>> services.
> > > >>> > > If it's not responding fast enough, we restart Nimbus UI
> > > >>> > > * We figured out the issue because Nimbus UI was restarted
> every
> > 2
> > > >>> > minutes
> > > >>> > >
> > > >>> > > * To fix our production server which had the same symptom, we
> had
> > > to
> > > >>> stop
> > > >>> > > all Storm processes, then stop all Zookeepers, then remove all
> > data
> > > >>> in
> > > >>> > > Zookeeper "snapshot files", then restart all Zookeeper, then
> > > restart
> > > >>> all
> > > >>> > > Storm process, then re-submit all our topologies
> > > >>> > >
> > > >>> > > Please be as clear as possible about which commands we should
> run
> > > to
> > > >>> give
> > > >>> > > you more details if needed
> > > >>> > >
> > > >>> > > Best regards,
> > > >>> > > Alexandre Vermeerbergen
> > > >>> > >
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > >
> >
>

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Jungtaek Lim <ka...@gmail.com>.
Blob files (meta and data) live in the Storm local directory. ZK only has the
list of blob keys and which alive Nimbus nodes have each file. So if you lose
the Storm local directory, you just can't restore the blobs, unless other
Nimbus nodes have them so the current Nimbus could pull them.
(I guess you have only one Nimbus, since the replication factor was 1 in the
nimbus log.)

If a FileNotFoundException occurs while reading the meta file, we just throw
KeyNotFoundException, because it is the same as the key not existing.
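
If you want to check a specific key (for example one of the
<topology-id>-stormconf.ser keys from your log), a rough sketch with the
Thrift client would be the following (assuming storm-core 1.1.0 on the
classpath; the class name is made up):

import java.util.Map;
import org.apache.storm.generated.KeyNotFoundException;
import org.apache.storm.utils.NimbusClient;
import org.apache.storm.utils.Utils;

public class BlobMetaCheck {
    public static void main(String[] args) throws Exception {
        // args[0] is a blob key, e.g. "<topology-id>-stormconf.ser"
        Map conf = Utils.readStormConfig();
        NimbusClient client = NimbusClient.getConfiguredClient(conf);
        try {
            client.getClient().getBlobMeta(args[0]);
            System.out.println("blob meta found for " + args[0]);
        } catch (KeyNotFoundException e) {
            // Same condition that breaks getClusterInfo: the meta is gone.
            System.out.println("blob missing: " + args[0]);
        } finally {
            client.close();
        }
    }
}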

Anyway, you have to clean up everything, because there is an inconsistency
between ZK and the local directory, and there is no way to recover the missing
blob files, hence no way to make things consistent again.

Btw, in this case, my expectation is that after crashing and restarting,
Nimbus can't become leader and just waits for the missing blobs. If you saw
Nimbus crash again while relaunching, it could be a bug.

- Jungtaek Lim (HeartSaVioR)
On Fri, 25 Aug 2017 at 3:35 AM Alexandre Vermeerbergen <
avermeerbergen@gmail.com> wrote:

> Hello Jungtaek,
>
> I can do what you suggest (ie moving storm local dir to a place which isn't
> in /tmp),but since the issue occurs rarely (once per month), I doubt I'll
> be able to feedback soon.
>
> What is puzzling to me is that in order to recover from such issue, we have
> to stop everything, then clean the Zookeepers, then restarts all Storm
> processes and finally restart our topologies. We don't clean the /tmp
> directory.
>
> Are you sure there's no relationships between Zookeeper contents and what
> is causing the following message to happen?
> "Internal error processing getClusterInfo"
> Best regards;
> Alexandre
>
> 2017-08-24 13:41 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
>
> > Sorry I meant reproduce, not replicate. :)
> >
> > 2017년 8월 24일 (목) 오후 8:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
> >
> > > Alexandre,
> > >
> > > I found that your storm local dir is placed to "/tmp/storm" which parts
> > or
> > > all could be removed at any time.
> > > Could you move the path to non-temporary place and try to replicate?
> > >
> > > Thanks,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > >
> > > 2017년 8월 24일 (목) 오후 6:40, Alexandre Vermeerbergen <
> > > avermeerbergen@gmail.com>님이 작성:
> > >
> > >> Hello Jungtaek,
> > >>
> > >> Thank you very much for your answer.
> > >>
> > >> Please find attached the full Nimbus log (gzipped) related to this
> > issue.
> > >>
> > >> Please note that the last ERROR repeats forever until we "repair"
> Storm.
> > >>
> > >> From the logs, it could be that the issue began close to when a
> topology
> > >> was restarted (killed, then started)
> > >>
> > >> Maybe this caused a corruption in Zookeeper. Is there anything which I
> > >> can collect in our Zookeeper nodes/logs to help analysis?
> > >>
> > >> Best regards,
> > >> Alexandre
> > >>
> > >>
> > >>
> > >>
> > >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> > >>
> > >>> Hi Alexandre, I missed this mail since I was on vacation.
> > >>>
> > >>> I followed the stack trace but hard to analyze without context. Do
> you
> > >>> mind
> > >>> providing full nimbus log?
> > >>>
> > >>> Thanks,
> > >>> Jungtaek Lim (HeartSaVioR)
> > >>>
> > >>> 2017년 8월 16일 (수) 오전 4:12, Alexandre Vermeerbergen <
> > >>> avermeerbergen@gmail.com>님이
> > >>> 작성:
> > >>>
> > >>> > Hello,
> > >>> >
> > >>> > Tomorrow I will have to restart the cluster on which I have this
> > issue
> > >>> with
> > >>> > Storm 1.1.0.
> > >>> > Is there are anybody interested in my running some commands to get
> > more
> > >>> > logs before I repair this cluster?
> > >>> >
> > >>> > Best regards,
> > >>> > Alexandre Vermeerbergen
> > >>> >
> > >>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <
> > >>> > avermeerbergen@gmail.com
> > >>> > >:
> > >>> >
> > >>> > > Hello,
> > >>> > >
> > >>> > > I think it might be of interest for you Storm developers to learn
> > >>> that I
> > >>> > > currently have a case of issue with Storm 1.1.0 which was
> supposed
> > to
> > >>> > > resolved in this release according to https://issues.apache.org/
> > >>> > > jira/browse/STORM-1977 ; and I can look for any more information
> > >>> which
> > >>> > > you'd need to make a diagnostic on why this issue still can
> happen.
> > >>> > >
> > >>> > > Indeed, I have a Storm UI process which can't get any information
> > on
> > >>> its
> > >>> > > Storm cluster, and I see many following exception in nimbus.log:
> > >>> > >
> > >>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO]
> > >>> Created
> > >>> > > download session for
> > >>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
> > >>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with
> > id
> > >>> > > d5120ad7-a81c-4c39-afc5-a7f876b04c73
> > >>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO]
> > >>> Created
> > >>> > > download session for
> > >>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
> > >>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with
> > id
> > >>> > > aba18011-3258-4023-bbef-14d21a7066e1
> > >>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning
> inbox
> > >>> ...
> > >>> > > deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> > >>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34
> > >>> [ERROR]
> > >>> > > Internal error processing getClusterInfo
> > >>> > > org.apache.storm.generated.KeyNotFoundException: null
> > >>> > >         at
> > >>> >
> > >>> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(
> > LocalFsBlobStore.java:147)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at
> > >>> >
> > >>> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(
> > LocalFsBlobStore.java:299)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown
> > >>> Source)
> > >>> > > ~[?:?]
> > >>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > >>> > > DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
> > >>> > >         at java.lang.reflect.Method.invoke(Method.java:498)
> > >>> > ~[?:1.8.0_121]
> > >>> > >         at
> > >>> clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
> > >>> > > ~[clojure-1.7.0.jar:?]
> > >>> > >         at
> > >>> clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
> > >>> > > ~[clojure-1.7.0.jar:?]
> > >>> > >         at
> > >>> >
> > >>> org.apache.storm.daemon.nimbus$get_blob_replication_
> > count.invoke(nimbus.clj:489)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__
> > >>> > > 10687__10691$fn__10692.invoke(nimbus.clj:1550)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40)
> > >>> > > ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49)
> > >>> > > ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.lang.RT.seq(RT.java:507)
> ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.core$seq__4128.invoke(core.clj:137)
> > >>> > > ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.core$dorun.invoke(core.clj:3009)
> > >>> > ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.core$doall.invoke(core.clj:3025)
> > >>> > ~[clojure-1.7.0.jar:?]
> > >>> > >         at
> > >>> > org.apache.storm.daemon.nimbus$get_cluster_info.
> > invoke(nimbus.clj:1524)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$
> > >>> > > reify__10782.getClusterInfo(nimbus.clj:1971)
> > >>> > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.generated.Nimbus$Processor$
> > >>> > > getClusterInfo.getResult(Nimbus.java:3920)
> > >>> ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.generated.Nimbus$Processor$
> > >>> > > getClusterInfo.getResult(Nimbus.java:3904)
> > >>> ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at
> > >>> >
> > >>> org.apache.storm.thrift.ProcessFunction.process(
> > ProcessFunction.java:39)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at
> > >>> > org.apache.storm.thrift.TBaseProcessor.process(
> > TBaseProcessor.java:39)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$
> > >>> > > SimpleWrapProcessor.process(SimpleTransportPlugin.java:162)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.thrift.server.
> > AbstractNonblockingServer$
> > >>> > > FrameBuffer.invoke(AbstractNonblockingServer.java:518)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at
> > >>> > org.apache.storm.thrift.server.Invocation.run(Invocation.java:18)
> > >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at
> > >>> >
> > >>> java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1142)
> > >>> > > [?:1.8.0_121]
> > >>> > >         at
> > >>> >
> > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:617)
> > >>> > > [?:1.8.0_121]
> > >>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
> > >>> > >
> > >>> > >
> > >>> > > What is amazing is that I got this same issue on two Storm
> clusters
> > >>> > > running on different VMs ; just they share the same data in they
> > >>> Kafka
> > >>> > > Broker cluster (one cluster is the production one, which was
> > quickly
> > >>> > fixed,
> > >>> > > and the other one is the "backup" cluster to be used if the
> > >>> production
> > >>> > one
> > >>> > > fails for quick "back to production")
> > >>> > >
> > >>> > > If left one of these cluster with this behavior because I felt
> that
> > >>> it
> > >>> > > could be interesting for Storm developers to have more
> information
> > on
> > >>> > this
> > >>> > > issue, if needed to properly diagnose it.
> > >>> > >
> > >>> > > I can keep this cluster as is for max 2 days.
> > >>> > >
> > >>> > > Is there anything useful which I can collect on it to help Storm
> > >>> > > developers to understand the cause (and hopefully use it to make
> > >>> Storm
> > >>> > more
> > >>> > > robust) ?
> > >>> > >
> > >>> > > Few details:
> > >>> > >
> > >>> > > * Storm 1.1.0 cluster with Nimbus & NimbusUI running on a VM,
> and 4
> > >>> > > Supervisors VMs + 3 Zookeeper VMs
> > >>> > >
> > >>> > > * Running with Java Server JRE 1.8.0_121
> > >>> > > * Running on AWS EC2 instances
> > >>> > >
> > >>> > > * We run about 10 topologies, with automatic self-healing on them
> > (if
> > >>> > they
> > >>> > > stop consuming Kafka items, our self-healer call "Kill topology",
> > and
> > >>> > then
> > >>> > > eventually restarts the topology
> > >>> > >
> > >>> > > * We have a self-healing on Nimbus UI based on calling its REST
> > >>> services.
> > >>> > > If it's not responding fast enough, we restart Nimbus UI
> > >>> > > * We figured out the issue because Nimbus UI was restarted every
> 2
> > >>> > minutes
> > >>> > >
> > >>> > > * To fix our production server which had the same symptom, we had
> > to
> > >>> stop
> > >>> > > all Storm processes, then stop all Zookeepers, then remove all
> data
> > >>> in
> > >>> > > Zookeeper "snapshot files", then restart all Zookeeper, then
> > restart
> > >>> all
> > >>> > > Storm process, then re-submit all our topologies
> > >>> > >
> > >>> > > Please be as clear as possible about which commands we should run
> > to
> > >>> give
> > >>> > > you more details if needed
> > >>> > >
> > >>> > > Best regards,
> > >>> > > Alexandre Vermeerbergen
> > >>> > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> >
>

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Alexandre Vermeerbergen <av...@gmail.com>.
Hello Jungtaek,

I can do what you suggest (i.e. moving the Storm local dir to a place which
isn't in /tmp), but since the issue occurs rarely (about once per month), I
doubt I'll be able to give feedback soon.

What is puzzling to me is that in order to recover from such an issue, we have
to stop everything, then clean the Zookeepers, then restart all Storm
processes and finally restart our topologies. We don't clean the /tmp
directory.

Are you sure there is no relationship between the Zookeeper contents and what
is causing the following message to happen?
"Internal error processing getClusterInfo"
Best regards,
Alexandre

2017-08-24 13:41 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:

> Sorry I meant reproduce, not replicate. :)
>
> 2017년 8월 24일 (목) 오후 8:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
> > Alexandre,
> >
> > I found that your storm local dir is placed to "/tmp/storm" which parts
> or
> > all could be removed at any time.
> > Could you move the path to non-temporary place and try to replicate?
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> >
> > 2017년 8월 24일 (목) 오후 6:40, Alexandre Vermeerbergen <
> > avermeerbergen@gmail.com>님이 작성:
> >
> >> Hello Jungtaek,
> >>
> >> Thank you very much for your answer.
> >>
> >> Please find attached the full Nimbus log (gzipped) related to this
> issue.
> >>
> >> Please note that the last ERROR repeats forever until we "repair" Storm.
> >>
> >> From the logs, it could be that the issue began close to when a topology
> >> was restarted (killed, then started)
> >>
> >> Maybe this caused a corruption in Zookeeper. Is there anything which I
> >> can collect in our Zookeeper nodes/logs to help analysis?
> >>
> >> Best regards,
> >> Alexandre
> >>
> >>
> >>
> >>
> >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
> >>
> >>> Hi Alexandre, I missed this mail since I was on vacation.
> >>>
> >>> I followed the stack trace but hard to analyze without context. Do you
> >>> mind
> >>> providing full nimbus log?
> >>>
> >>> Thanks,
> >>> Jungtaek Lim (HeartSaVioR)
> >>>
> >>> 2017년 8월 16일 (수) 오전 4:12, Alexandre Vermeerbergen <
> >>> avermeerbergen@gmail.com>님이
> >>> 작성:
> >>>
> >>> > Hello,
> >>> >
> >>> > Tomorrow I will have to restart the cluster on which I have this
> issue
> >>> with
> >>> > Storm 1.1.0.
> >>> > Is there are anybody interested in my running some commands to get
> more
> >>> > logs before I repair this cluster?
> >>> >
> >>> > Best regards,
> >>> > Alexandre Vermeerbergen
> >>> >
> >>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <
> >>> > avermeerbergen@gmail.com
> >>> > >:
> >>> >
> >>> > > Hello,
> >>> > >
> >>> > > I think it might be of interest for you Storm developers to learn
> >>> that I
> >>> > > currently have a case of issue with Storm 1.1.0 which was supposed
> to
> >>> > > resolved in this release according to https://issues.apache.org/
> >>> > > jira/browse/STORM-1977 ; and I can look for any more information
> >>> which
> >>> > > you'd need to make a diagnostic on why this issue still can happen.
> >>> > >
> >>> > > Indeed, I have a Storm UI process which can't get any information
> on
> >>> its
> >>> > > Storm cluster, and I see many following exception in nimbus.log:
> >>> > >
> >>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO]
> >>> Created
> >>> > > download session for
> >>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
> >>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with
> id
> >>> > > d5120ad7-a81c-4c39-afc5-a7f876b04c73
> >>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO]
> >>> Created
> >>> > > download session for
> >>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
> >>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with
> id
> >>> > > aba18011-3258-4023-bbef-14d21a7066e1
> >>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox
> >>> ...
> >>> > > deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> >>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34
> >>> [ERROR]
> >>> > > Internal error processing getClusterInfo
> >>> > > org.apache.storm.generated.KeyNotFoundException: null
> >>> > >         at
> >>> >
> >>> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(
> LocalFsBlobStore.java:147)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at
> >>> >
> >>> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(
> LocalFsBlobStore.java:299)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown
> >>> Source)
> >>> > > ~[?:?]
> >>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> >>> > > DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
> >>> > >         at java.lang.reflect.Method.invoke(Method.java:498)
> >>> > ~[?:1.8.0_121]
> >>> > >         at
> >>> clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
> >>> > > ~[clojure-1.7.0.jar:?]
> >>> > >         at
> >>> clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
> >>> > > ~[clojure-1.7.0.jar:?]
> >>> > >         at
> >>> >
> >>> org.apache.storm.daemon.nimbus$get_blob_replication_
> count.invoke(nimbus.clj:489)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__
> >>> > > 10687__10691$fn__10692.invoke(nimbus.clj:1550)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40)
> >>> > > ~[clojure-1.7.0.jar:?]
> >>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49)
> >>> > > ~[clojure-1.7.0.jar:?]
> >>> > >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
> >>> > >         at clojure.core$seq__4128.invoke(core.clj:137)
> >>> > > ~[clojure-1.7.0.jar:?]
> >>> > >         at clojure.core$dorun.invoke(core.clj:3009)
> >>> > ~[clojure-1.7.0.jar:?]
> >>> > >         at clojure.core$doall.invoke(core.clj:3025)
> >>> > ~[clojure-1.7.0.jar:?]
> >>> > >         at
> >>> > org.apache.storm.daemon.nimbus$get_cluster_info.
> invoke(nimbus.clj:1524)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$
> >>> > > reify__10782.getClusterInfo(nimbus.clj:1971)
> >>> > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at org.apache.storm.generated.Nimbus$Processor$
> >>> > > getClusterInfo.getResult(Nimbus.java:3920)
> >>> ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at org.apache.storm.generated.Nimbus$Processor$
> >>> > > getClusterInfo.getResult(Nimbus.java:3904)
> >>> ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at
> >>> >
> >>> org.apache.storm.thrift.ProcessFunction.process(
> ProcessFunction.java:39)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at
> >>> > org.apache.storm.thrift.TBaseProcessor.process(
> TBaseProcessor.java:39)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$
> >>> > > SimpleWrapProcessor.process(SimpleTransportPlugin.java:162)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at org.apache.storm.thrift.server.
> AbstractNonblockingServer$
> >>> > > FrameBuffer.invoke(AbstractNonblockingServer.java:518)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at
> >>> > org.apache.storm.thrift.server.Invocation.run(Invocation.java:18)
> >>> > > ~[storm-core-1.1.0.jar:1.1.0]
> >>> > >         at
> >>> >
> >>> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> >>> > > [?:1.8.0_121]
> >>> > >         at
> >>> >
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> >>> > > [?:1.8.0_121]
> >>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
> >>> > >
> >>> > >
> >>> > > What is amazing is that I got this same issue on two Storm clusters
> >>> > > running on different VMs ; just they share the same data in they
> >>> Kafka
> >>> > > Broker cluster (one cluster is the production one, which was
> quickly
> >>> > fixed,
> >>> > > and the other one is the "backup" cluster to be used if the
> >>> production
> >>> > one
> >>> > > fails for quick "back to production")
> >>> > >
> >>> > > If left one of these cluster with this behavior because I felt that
> >>> it
> >>> > > could be interesting for Storm developers to have more information
> on
> >>> > this
> >>> > > issue, if needed to properly diagnose it.
> >>> > >
> >>> > > I can keep this cluster as is for max 2 days.
> >>> > >
> >>> > > Is there anything useful which I can collect on it to help Storm
> >>> > > developers to understand the cause (and hopefully use it to make
> >>> Storm
> >>> > more
> >>> > > robust) ?
> >>> > >
> >>> > > Few details:
> >>> > >
> >>> > > * Storm 1.1.0 cluster with Nimbus & NimbusUI running on a VM, and 4
> >>> > > Supervisors VMs + 3 Zookeeper VMs
> >>> > >
> >>> > > * Running with Java Server JRE 1.8.0_121
> >>> > > * Running on AWS EC2 instances
> >>> > >
> >>> > > * We run about 10 topologies, with automatic self-healing on them
> (if
> >>> > they
> >>> > > stop consuming Kafka items, our self-healer call "Kill topology",
> and
> >>> > then
> >>> > > eventually restarts the topology
> >>> > >
> >>> > > * We have a self-healing on Nimbus UI based on calling its REST
> >>> services.
> >>> > > If it's not responding fast enough, we restart Nimbus UI
> >>> > > * We figured out the issue because Nimbus UI was restarted every 2
> >>> > minutes
> >>> > >
> >>> > > * To fix our production server which had the same symptom, we had
> to
> >>> stop
> >>> > > all Storm processes, then stop all Zookeepers, then remove all data
> >>> in
> >>> > > Zookeeper "snapshot files", then restart all Zookeeper, then
> restart
> >>> all
> >>> > > Storm process, then re-submit all our topologies
> >>> > >
> >>> > > Please be as clear as possible about which commands we should run
> to
> >>> give
> >>> > > you more details if needed
> >>> > >
> >>> > > Best regards,
> >>> > > Alexandre Vermeerbergen
> >>> > >
> >>> > >
> >>> >
> >>>
> >>
> >>
>

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Jungtaek Lim <ka...@gmail.com>.
Sorry I meant reproduce, not replicate. :)

On Thu, Aug 24, 2017 at 8:34 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> Alexandre,
>
> I found that your storm local dir is placed to "/tmp/storm" which parts or
> all could be removed at any time.
> Could you move the path to non-temporary place and try to replicate?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> 2017년 8월 24일 (목) 오후 6:40, Alexandre Vermeerbergen <
> avermeerbergen@gmail.com>님이 작성:
>
>> Hello Jungtaek,
>>
>> Thank you very much for your answer.
>>
>> Please find attached the full Nimbus log (gzipped) related to this issue.
>>
>> Please note that the last ERROR repeats forever until we "repair" Storm.
>>
>> From the logs, it could be that the issue began close to when a topology
>> was restarted (killed, then started)
>>
>> Maybe this caused a corruption in Zookeeper. Is there anything which I
>> can collect in our Zookeeper nodes/logs to help analysis?
>>
>> Best regards,
>> Alexandre
>>
>>
>>
>>
>> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
>>
>>> Hi Alexandre, I missed this mail since I was on vacation.
>>>
>>> I followed the stack trace but hard to analyze without context. Do you
>>> mind
>>> providing full nimbus log?
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 2017년 8월 16일 (수) 오전 4:12, Alexandre Vermeerbergen <
>>> avermeerbergen@gmail.com>님이
>>> 작성:
>>>
>>> > Hello,
>>> >
>>> > Tomorrow I will have to restart the cluster on which I have this issue
>>> with
>>> > Storm 1.1.0.
>>> > Is there are anybody interested in my running some commands to get more
>>> > logs before I repair this cluster?
>>> >
>>> > Best regards,
>>> > Alexandre Vermeerbergen
>>> >
>>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <
>>> > avermeerbergen@gmail.com
>>> > >:
>>> >
>>> > > Hello,
>>> > >
>>> > > I think it might be of interest for you Storm developers to learn
>>> that I
>>> > > currently have a case of issue with Storm 1.1.0 which was supposed to
>>> > > resolved in this release according to https://issues.apache.org/
>>> > > jira/browse/STORM-1977 ; and I can look for any more information
>>> which
>>> > > you'd need to make a diagnostic on why this issue still can happen.
>>> > >
>>> > > Indeed, I have a Storm UI process which can't get any information on
>>> its
>>> > > Storm cluster, and I see many following exception in nimbus.log:
>>> > >
>>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO]
>>> Created
>>> > > download session for
>>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
>>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id
>>> > > d5120ad7-a81c-4c39-afc5-a7f876b04c73
>>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO]
>>> Created
>>> > > download session for
>>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
>>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id
>>> > > aba18011-3258-4023-bbef-14d21a7066e1
>>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox
>>> ...
>>> > > deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
>>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34
>>> [ERROR]
>>> > > Internal error processing getClusterInfo
>>> > > org.apache.storm.generated.KeyNotFoundException: null
>>> > >         at
>>> >
>>> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at
>>> >
>>> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown
>>> Source)
>>> > > ~[?:?]
>>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>>> > > DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
>>> > >         at java.lang.reflect.Method.invoke(Method.java:498)
>>> > ~[?:1.8.0_121]
>>> > >         at
>>> clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>>> > > ~[clojure-1.7.0.jar:?]
>>> > >         at
>>> clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
>>> > > ~[clojure-1.7.0.jar:?]
>>> > >         at
>>> >
>>> org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__
>>> > > 10687__10691$fn__10692.invoke(nimbus.clj:1550)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40)
>>> > > ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49)
>>> > > ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.core$seq__4128.invoke(core.clj:137)
>>> > > ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.core$dorun.invoke(core.clj:3009)
>>> > ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.core$doall.invoke(core.clj:3025)
>>> > ~[clojure-1.7.0.jar:?]
>>> > >         at
>>> > org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$
>>> > > reify__10782.getClusterInfo(nimbus.clj:1971)
>>> > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.generated.Nimbus$Processor$
>>> > > getClusterInfo.getResult(Nimbus.java:3920)
>>> ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.generated.Nimbus$Processor$
>>> > > getClusterInfo.getResult(Nimbus.java:3904)
>>> ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at
>>> >
>>> org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at
>>> > org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$
>>> > > SimpleWrapProcessor.process(SimpleTransportPlugin.java:162)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.thrift.server.AbstractNonblockingServer$
>>> > > FrameBuffer.invoke(AbstractNonblockingServer.java:518)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at
>>> > org.apache.storm.thrift.server.Invocation.run(Invocation.java:18)
>>> > > ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at
>>> >
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> > > [?:1.8.0_121]
>>> > >         at
>>> >
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> > > [?:1.8.0_121]
>>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
>>> > >
>>> > >
>>> > > What is amazing is that I got this same issue on two Storm clusters
>>> > > running on different VMs ; just they share the same data in they
>>> Kafka
>>> > > Broker cluster (one cluster is the production one, which was quickly
>>> > fixed,
>>> > > and the other one is the "backup" cluster to be used if the
>>> production
>>> > one
>>> > > fails for quick "back to production")
>>> > >
>>> > > If left one of these cluster with this behavior because I felt that
>>> it
>>> > > could be interesting for Storm developers to have more information on
>>> > this
>>> > > issue, if needed to properly diagnose it.
>>> > >
>>> > > I can keep this cluster as is for max 2 days.
>>> > >
>>> > > Is there anything useful which I can collect on it to help Storm
>>> > > developers to understand the cause (and hopefully use it to make
>>> Storm
>>> > more
>>> > > robust) ?
>>> > >
>>> > > Few details:
>>> > >
>>> > > * Storm 1.1.0 cluster with Nimbus & NimbusUI running on a VM, and 4
>>> > > Supervisors VMs + 3 Zookeeper VMs
>>> > >
>>> > > * Running with Java Server JRE 1.8.0_121
>>> > > * Running on AWS EC2 instances
>>> > >
>>> > > * We run about 10 topologies, with automatic self-healing on them (if
>>> > they
>>> > > stop consuming Kafka items, our self-healer call "Kill topology", and
>>> > then
>>> > > eventually restarts the topology
>>> > >
>>> > > * We have a self-healing on Nimbus UI based on calling its REST
>>> services.
>>> > > If it's not responding fast enough, we restart Nimbus UI
>>> > > * We figured out the issue because Nimbus UI was restarted every 2
>>> > minutes
>>> > >
>>> > > * To fix our production server which had the same symptom, we had to
>>> stop
>>> > > all Storm processes, then stop all Zookeepers, then remove all data
>>> in
>>> > > Zookeeper "snapshot files", then restart all Zookeeper, then restart
>>> all
>>> > > Storm process, then re-submit all our topologies
>>> > >
>>> > > Please be as clear as possible about which commands we should run to
>>> give
>>> > > you more details if needed
>>> > >
>>> > > Best regards,
>>> > > Alexandre Vermeerbergen
>>> > >
>>> > >
>>> >
>>>
>>
>>

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Jungtaek Lim <ka...@gmail.com>.
Alexandre,

I found that your Storm local dir is set to "/tmp/storm", parts or all of
which could be removed at any time.
Could you move the path to a non-temporary place and try to replicate?
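
For example, something like this in storm.yaml on the Nimbus and supervisor
machines (the path below is only an illustration; any persistent directory
writable by the storm user should do):

# example only; pick any non-temporary, persistent directory
storm.local.dir: "/var/storm"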

Thanks,
Jungtaek Lim (HeartSaVioR)


On Thu, Aug 24, 2017 at 6:40 PM, Alexandre Vermeerbergen <av...@gmail.com>
wrote:

> Hello Jungtaek,
>
> Thank you very much for your answer.
>
> Please find attached the full Nimbus log (gzipped) related to this issue.
>
> Please note that the last ERROR repeats forever until we "repair" Storm.
>
> From the logs, it could be that the issue began close to when a topology
> was restarted (killed, then started)
>
> Maybe this caused a corruption in Zookeeper. Is there anything which I can
> collect in our Zookeeper nodes/logs to help analysis?
>
> Best regards,
> Alexandre
>
>
>
>
> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <ka...@gmail.com>:
>
>> Hi Alexandre, I missed this mail since I was on vacation.
>>
>> I followed the stack trace but hard to analyze without context. Do you
>> mind
>> providing full nimbus log?
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 2017년 8월 16일 (수) 오전 4:12, Alexandre Vermeerbergen <
>> avermeerbergen@gmail.com>님이
>> 작성:
>>
>> > Hello,
>> >
>> > Tomorrow I will have to restart the cluster on which I have this issue
>> with
>> > Storm 1.1.0.
>> > Is there are anybody interested in my running some commands to get more
>> > logs before I repair this cluster?
>> >
>> > Best regards,
>> > Alexandre Vermeerbergen
>> >
>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <
>> > avermeerbergen@gmail.com
>> > >:
>> >
>> > > Hello,
>> > >
>> > > I think it might be of interest for you Storm developers to learn
>> that I
>> > > currently have a case of issue with Storm 1.1.0 which was supposed to
>> > > resolved in this release according to https://issues.apache.org/
>> > > jira/browse/STORM-1977 ; and I can look for any more information which
>> > > you'd need to make a diagnostic on why this issue still can happen.
>> > >
>> > > Indeed, I have a Storm UI process which can't get any information on
>> its
>> > > Storm cluster, and I see many following exception in nimbus.log:
>> > >
>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO]
>> Created
>> > > download session for
>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id
>> > > d5120ad7-a81c-4c39-afc5-a7f876b04c73
>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO]
>> Created
>> > > download session for
>> statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-
>> > > amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id
>> > > aba18011-3258-4023-bbef-14d21a7066e1
>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ...
>> > > deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34
>> [ERROR]
>> > > Internal error processing getClusterInfo
>> > > org.apache.storm.generated.KeyNotFoundException: null
>> > >         at
>> >
>> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at
>> >
>> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown
>> Source)
>> > > ~[?:?]
>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> > > DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
>> > >         at java.lang.reflect.Method.invoke(Method.java:498)
>> > ~[?:1.8.0_121]
>> > >         at
>> clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>> > > ~[clojure-1.7.0.jar:?]
>> > >         at
>> clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
>> > > ~[clojure-1.7.0.jar:?]
>> > >         at
>> >
>> org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__
>> > > 10687__10691$fn__10692.invoke(nimbus.clj:1550)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40)
>> > > ~[clojure-1.7.0.jar:?]
>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49)
>> > > ~[clojure-1.7.0.jar:?]
>> > >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
>> > >         at clojure.core$seq__4128.invoke(core.clj:137)
>> > > ~[clojure-1.7.0.jar:?]
>> > >         at clojure.core$dorun.invoke(core.clj:3009)
>> > ~[clojure-1.7.0.jar:?]
>> > >         at clojure.core$doall.invoke(core.clj:3025)
>> > ~[clojure-1.7.0.jar:?]
>> > >         at
>> > org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$
>> > > reify__10782.getClusterInfo(nimbus.clj:1971)
>> > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at org.apache.storm.generated.Nimbus$Processor$
>> > > getClusterInfo.getResult(Nimbus.java:3920)
>> ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at org.apache.storm.generated.Nimbus$Processor$
>> > > getClusterInfo.getResult(Nimbus.java:3904)
>> ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at
>> > org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at
>> > org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$
>> > > SimpleWrapProcessor.process(SimpleTransportPlugin.java:162)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at org.apache.storm.thrift.server.AbstractNonblockingServer$
>> > > FrameBuffer.invoke(AbstractNonblockingServer.java:518)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at
>> > org.apache.storm.thrift.server.Invocation.run(Invocation.java:18)
>> > > ~[storm-core-1.1.0.jar:1.1.0]
>> > >         at
>> >
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> > > [?:1.8.0_121]
>> > >         at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> > > [?:1.8.0_121]
>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
>> > >
>> > >
>> > > What is amazing is that I got this same issue on two Storm clusters
>> > > running on different VMs ; just they share the same data in they Kafka
>> > > Broker cluster (one cluster is the production one, which was quickly
>> > fixed,
>> > > and the other one is the "backup" cluster to be used if the production
>> > one
>> > > fails for quick "back to production")
>> > >
>> > > If left one of these cluster with this behavior because I felt that it
>> > > could be interesting for Storm developers to have more information on
>> > this
>> > > issue, if needed to properly diagnose it.
>> > >
>> > > I can keep this cluster as is for max 2 days.
>> > >
>> > > Is there anything useful which I can collect on it to help Storm
>> > > developers to understand the cause (and hopefully use it to make Storm
>> > more
>> > > robust) ?
>> > >
>> > > Few details:
>> > >
>> > > * Storm 1.1.0 cluster with Nimbus & NimbusUI running on a VM, and 4
>> > > Supervisors VMs + 3 Zookeeper VMs
>> > >
>> > > * Running with Java Server JRE 1.8.0_121
>> > > * Running on AWS EC2 instances
>> > >
>> > > * We run about 10 topologies, with automatic self-healing on them (if
>> > they
>> > > stop consuming Kafka items, our self-healer call "Kill topology", and
>> > then
>> > > eventually restarts the topology
>> > >
>> > > * We have a self-healing on Nimbus UI based on calling its REST
>> services.
>> > > If it's not responding fast enough, we restart Nimbus UI
>> > > * We figured out the issue because Nimbus UI was restarted every 2
>> > minutes
>> > >
>> > > * To fix our production server which had the same symptom, we had to
>> stop
>> > > all Storm processes, then stop all Zookeepers, then remove all data in
>> > > Zookeeper "snapshot files", then restart all Zookeeper, then restart
>> all
>> > > Storm process, then re-submit all our topologies
>> > >
>> > > Please be as clear as possible about which commands we should run to
>> give
>> > > you more details if needed
>> > >
>> > > Best regards,
>> > > Alexandre Vermeerbergen
>> > >
>> > >
>> >
>>
>
>

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Alexandre Vermeerbergen <av...@gmail.com>.
Hello Jungtaek,

Thank you very much for your answer.

Please find attached the full Nimbus log (gzipped) related to this issue.

Please note that the last ERROR repeats forever until we "repair" Storm.

From the logs, it could be that the issue began close to when a topology
was restarted (killed, then started).

Maybe this corrupted some state in Zookeeper. Is there anything I can
collect from our Zookeeper nodes/logs to help with the analysis?
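
A sketch of what could be compared before repairing, assuming the default
storm.zookeeper.root of /storm and the local-filesystem blobstore; paths may
differ on your install:

# on a ZooKeeper node: blob keys and topology ids that Nimbus registered
zkCli.sh -server localhost:2181 ls /storm/blobstore
zkCli.sh -server localhost:2181 ls /storm/storms
# on the nimbus host: blob keys actually present on disk
# (LocalFsBlobStore keeps them under <storm.local.dir>/blobs; please
# double-check the exact subdirectory on 1.1.0)
ls <storm.local.dir>/blobs

A topology still registered in ZooKeeper whose blob files are missing from
the nimbus local dir would match the KeyNotFoundException that nimbus logs
in getClusterInfo.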

Best regards,
Alexandre





Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Jungtaek Lim <ka...@gmail.com>.
Hi Alexandre, I missed this mail since I was on vacation.

I followed the stack trace, but it is hard to analyze without more context.
Would you mind providing the full nimbus log?
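
If it helps, a minimal sketch for collecting it, assuming the default log
location under the Storm installation; adjust if storm.log.dir points
elsewhere:

# on the nimbus host
cd ${STORM_HOME}/logs        # or wherever storm.log.dir points
gzip -c nimbus.log > ~/nimbus.log.gz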

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

Posted by Alexandre Vermeerbergen <av...@gmail.com>.
Hello,

Tomorrow I will have to restart the cluster on which I have this issue with
Storm 1.1.0.
Is anybody interested in my running some commands to get more logs before I
repair this cluster?
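
For the record, a sketch of the kind of state worth capturing before the
repair (storm and zkCli.sh are the standard CLIs; the local-dir path is a
placeholder):

# what Nimbus currently reports; while the bug is active this may fail with
# the same getClusterInfo error, which is itself worth recording
storm list > storm-list.txt 2>&1
# the blob files and metadata Nimbus has on disk
tar czf nimbus-local-dir.tgz <storm.local.dir>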

Best regards,
Alexandre Vermeerbergen
