Posted to user@storm.apache.org by Kevin Conaway <ke...@gmail.com> on 2016/04/14 15:54:02 UTC

Storm Metrics Consumer Not Receiving Tuples

We are using Storm 0.10 with the following configuration:

   - 1 Nimbus node
   - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor has 8
   cores.


Our topology has a KafkaSpout that forwards to a bolt where we transform
the message and insert it into Cassandra.  Our topic has 50 partitions, so
we have configured the number of executors/tasks for the KafkaSpout to be
50.  Our bolt has 150 executors/tasks.
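For reference, the topology is wired up roughly like the sketch below (the
CassandraWriterBolt class name and the Kafka/ZooKeeper settings are
illustrative placeholders, not our exact values):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.TopologyBuilder;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class EventTopology {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper connect string, topic name and consumer id
            BrokerHosts hosts = new ZkHosts("zookeeper-1:2181");
            SpoutConfig spoutConfig = new SpoutConfig(hosts, "events", "/kafka-spout", "event-consumer");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            // 50 spout executors/tasks, one per Kafka partition
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 50);
            // 150 bolt executors/tasks doing the transform + Cassandra insert
            // (CassandraWriterBolt is a placeholder name for our bolt)
            builder.setBolt("cassandra-bolt", new CassandraWriterBolt(), 150)
                   .shuffleGrouping("kafka-spout");

            Config conf = new Config();
            conf.setNumWorkers(12); // 6 supervisors x 2 slots
            StormSubmitter.submitTopology("event-topology", conf, builder.createTopology());
        }
    }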

We have also added the storm-graphite metrics consumer (
https://github.com/verisign/storm-graphite) to our topology so that Storm's
metrics are sent to our Graphite cluster.
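We register the consumer on the topology Config, roughly as sketched below.
The option keys in the argument map are placeholders from memory of the
storm-graphite documentation, so double-check them against its README:

    import backtype.storm.Config;
    import java.util.HashMap;
    import java.util.Map;

    public class MetricsConfig {
        public static Config build() {
            Config conf = new Config();

            // Placeholder option keys; the exact names are defined by storm-graphite.
            Map<String, Object> graphiteArgs = new HashMap<String, Object>();
            graphiteArgs.put("metrics.graphite.host", "graphite.example.com");
            graphiteArgs.put("metrics.graphite.port", "2003");
            graphiteArgs.put("metrics.graphite.prefix", "storm.loadtest");

            // One consumer task; it shows up in Storm UI as
            // __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer
            conf.registerMetricsConsumer(com.verisign.storm.metrics.GraphiteMetricsConsumer.class,
                    graphiteArgs, 1);
            return conf;
        }
    }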

Yesterday we were running a 2000 tuple/sec load test and everything was
fine for a few hours until we noticed that we were no longer receiving
metrics from Storm in Graphite.

I verified that it's not a connectivity issue between Storm and
Graphite.  Looking in Storm UI,
the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
received a single tuple in the prior 10-minute or 3-hour window.

Since the metrics consumer bolt was assigned to one executor, I took thread
dumps of that JVM.  I saw the following stack trace for the metrics
consumer thread:

"Thread-23-__metricscom.verisign.storm.metrics.GraphiteMetricsConsumer" #56
prio=5 os_prio=0 tid=0x00007fb4a13f1000 nid=0xe45 waiting on condition
[0x00007fb3a7af9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000a9ea23e8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
        at
com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:83)
        at
com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:54)
        at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:102)
        at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
        at
backtype.storm.daemon.executor$fn__5694$fn__5707$fn__5758.invoke(executor.clj:819)
        at backtype.storm.util$async_loop$fn__545.invoke(util.clj:479)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

I also saw a number of threads stuck in _waitForFreeSlotAt_ on the
disruptor queue; I'm not sure if that's an issue or not:

"user-timer" #33 daemon prio=10 os_prio=0 tid=0x00007fb4a1579800 nid=0xe1f
runnable [0x00007fb445665000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
        at
com.lmax.disruptor.AbstractMultithreadedClaimStrategy.waitForFreeSlotAt(AbstractMultithreadedClaimStrategy.java:99)
        at
com.lmax.disruptor.AbstractMultithreadedClaimStrategy.incrementAndGet(AbstractMultithreadedClaimStrategy.java:49)
        at com.lmax.disruptor.Sequencer.next(Sequencer.java:127)
        at
backtype.storm.utils.DisruptorQueue.publishDirect(DisruptorQueue.java:181)
        at
backtype.storm.utils.DisruptorQueue.publish(DisruptorQueue.java:174)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:66)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:68)
        at
backtype.storm.daemon.executor$setup_metrics_BANG_$fn__5544.invoke(executor.clj:295)
        at
backtype.storm.timer$schedule_recurring$this__3721.invoke(timer.clj:102)
        at
backtype.storm.timer$mk_timer$fn__3704$fn__3705.invoke(timer.clj:50)
        at backtype.storm.timer$mk_timer$fn__3704.invoke(timer.clj:42)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

"Thread-36-disruptor-executor[49 49]-send-queue" #69 prio=5 os_prio=0
tid=0x00007fb4a0c36800 nid=0xe5a runnable [0x00007fb3a6dec000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
        at
com.lmax.disruptor.AbstractMultithreadedClaimStrategy.waitForFreeSlotAt(AbstractMultithreadedClaimStrategy.java:99)
        at
com.lmax.disruptor.AbstractMultithreadedClaimStrategy.incrementAndGet(AbstractMultithreadedClaimStrategy.java:49)
        at com.lmax.disruptor.Sequencer.next(Sequencer.java:127)
        at
backtype.storm.utils.DisruptorQueue.publishDirect(DisruptorQueue.java:181)
        at
backtype.storm.utils.DisruptorQueue.publish(DisruptorQueue.java:174)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:66)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:68)
        at
backtype.storm.daemon.worker$mk_transfer_fn$transfer_fn__6886.invoke(worker.clj:141)
        at
backtype.storm.daemon.executor$start_batch_transfer__GT_worker_handler_BANG_$fn__5534.invoke(executor.clj:279)
        at
backtype.storm.disruptor$clojure_handler$reify__5189.onEvent(disruptor.clj:58)
        at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:132)
        at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:106)
        at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
        at
backtype.storm.disruptor$consume_loop_STAR_$fn__5202.invoke(disruptor.clj:94)
        at backtype.storm.util$async_loop$fn__545.invoke(util.clj:479)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

What's also interesting is that at the same time the metrics stopped being
emitted, we started seeing a lot of tuples failing at the spout.

Looking at our average CPU usage graph below, you can see that all of the
workers were engaged in doing work and then about half of them went idle.
http://i.imgur.com/Y9kBDFD.png

In Storm UI, the bolt latency did not increase and there were no failures
so it seems like there was an issue getting tuples from the spout to the
bolt.

Some more info about the state of the system before we stopped receiving
metrics:

The average execute / processing latency for the bolt was ~18ms:
http://i.imgur.com/BzmT1gz.png

The average number of spout acks / second was ~2400:
http://i.imgur.com/DTywOyj.png

The average spout lag was under 10 and the average spout latency was 50ms:
http://i.imgur.com/omEy11t.png

There were no errors or warnings in any of the logs, and none of the workers
were restarted during this time.

Any thoughts on what could be causing this and how to diagnose further?

Thank you

-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Kevin Conaway <ke...@gmail.com>.
I tried to reproduce this locally and could not.  I stood up a local Storm
0.10 cluster and deployed my topology with the Graphite metrics consumer
configured.  After it was up and running, I killed my local Kafka broker to
observe what happened.  Although the Kafka spout tasks were printing
errors, metrics were still being sent to Graphite during this time.

However, what I did notice was that every time the metrics were sent, I saw
the following two messages in the worker logs:

2016-04-16 17:37:39.758 b.s.m.n.Server [INFO] Getting metrics for server
on port 6701

2016-04-16 17:37:39.758 b.s.m.n.Client [INFO] Getting metrics for client
connection to Netty-Client-/192.168.1.11:6700

I went back through the worker logs from our load testing cluster and
noticed that those log messages stopped being printed at the exact same
time the metrics stopped being reported to Graphite.  Both of those log
messages are logged in the implementation of *IStatefulObject.getState()*
(in *backtype.storm.messaging.netty.Server* and
*backtype.storm.messaging.netty.Client*), so whatever class is responsible
for invoking that method stopped working.  At first guess, that would
appear to be whatever process is responsible for collecting metrics via
*IMetric.getValueAndReset()*.
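
One way I can think of to confirm whether that collection path is still
running would be to register a throwaway metric whose only job is to log
when getValueAndReset() is called, something like the sketch below (the
class name, metric name, and 60s interval are just illustrative):

    import backtype.storm.metric.api.IMetric;
    import backtype.storm.task.TopologyContext;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class MetricsTickProbe implements IMetric {
        private static final Logger LOG = LoggerFactory.getLogger(MetricsTickProbe.class);
        private long ticks = 0;

        @Override
        public Object getValueAndReset() {
            // Logged every time the executor's metrics timer collects this metric
            ticks++;
            LOG.info("metrics tick #{} collected", ticks);
            return ticks;
        }

        // Call this from a bolt's prepare() method
        public static void register(TopologyContext context) {
            context.registerMetric("metrics-tick-probe", new MetricsTickProbe(), 60);
        }
    }

If those log lines also stop at the same time, the problem would be upstream
of the consumer, in whatever drives the metrics tick.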

Does that provide any further insight into what happened?  I will keep
digging on my end.
Thanks,

Kevin

On Fri, Apr 15, 2016 at 2:17 PM, Kevin Conaway <ke...@gmail.com>
wrote:

> I took thread dumps of the worker where the graphite consumer bolt
> executor was running but I didn't see any BLOCKED threads or anything out
> of the ordinary.  This is the thread dump for the graphite metrics consumer
> bolt:
>
> "Thread-23-__metricscom.verisign.storm.metrics.GraphiteMetricsConsumer"
> #56 prio=5 os_prio=0 tid=0x00007f0b8555c800 nid=0x9a2 waiting on condition
> [0x00007f0abaeed000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at
> backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713)
>         at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477)
>         at clojure.lang.AFn.run(AFn.java:22)
>         at java.lang.Thread.run(Thread.java:745)
>
> Would a "stuck" bolt on some other worker JVM have the same effect?
>
>
> On Fri, Apr 15, 2016 at 2:10 PM, Abhishek Agarwal <ab...@gmail.com>
> wrote:
>
>> You might want to check the thread dump and verify if some bolt is stuck
>> somewhere
>>
>> Excuse typos
>> On Apr 15, 2016 11:08 PM, "Kevin Conaway" <ke...@gmail.com>
>> wrote:
>>
>>> Was the bolt really "stuck" though given that the failure was at the
>>> spout level (because the spout couldn't connect to the Kafka broker)?
>>>
>>> Additionally, we restarted the Kafka broker and it seemed like the spout
>>> was able to reconnect but we never saw messages from through on the metric
>>> consumer until we killed and restarted the topology.
>>>
>>> On Fri, Apr 15, 2016 at 1:31 PM, Abhishek Agarwal <ab...@gmail.com>
>>> wrote:
>>>
>>>> Kevin,
>>>> That would explain it. A stuck bolt will stall the whole topology.
>>>> MetricConsumer runs as a bolt so it will be blocked as well
>>>>
>>>> Excuse typos
>>>> On Apr 15, 2016 10:29 PM, "Kevin Conaway" <ke...@gmail.com>
>>>> wrote:
>>>>
>>>>> Two more data points on this:
>>>>>
>>>>> 1.) We are registering the graphite MetricsConsumer on our Topology
>>>>> Config, not globally in storm.yaml.  I don't know if this makes a
>>>>> difference.
>>>>>
>>>>> 2.) We re-ran another test last night and it ran fine for about 6
>>>>> hours until the Kafka brokers ran out of disk space (oops) which halted the
>>>>> test.  This exact time also coincided with when the Graphite instance
>>>>> stopped receiving metrics from Storm.  Given that we weren't processing any
>>>>> tuples while storm was down, I understand why we didn't get those metrics
>>>>> but shouldn't the __system metrics (like heap size, gc time) still have
>>>>> been sent?
>>>>>
>>>>> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <
>>>>> kevin.a.conaway@gmail.com> wrote:
>>>>>
>>>>>> Thank you for taking the time to respond.
>>>>>>
>>>>>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
>>>>>> track the latency of individual operations in the bolt).  The metric
>>>>>> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS
>>>>>> which we have set at 60s
>>>>>>
>>>>>> The topology did not hang completely but it did degrade severely.
>>>>>> Without metrics it was hard to tell but it looked like some of the tasks
>>>>>> for certain kafka partitions either stopped emitting tuples or never got
>>>>>> acknowledgements for the tuples they did emit.  Some tuples were definitely
>>>>>> making it through though because data was continuously being inserted in to
>>>>>> Cassandra.  After I killed and resubmitted the topology, there were still
>>>>>> messages left over in the topic but only for certain partitions.
>>>>>>
>>>>>> What queue configuration are you looking for?
>>>>>>
>>>>>> I don't believe that the case was that the graphite metrics consumer
>>>>>> wasn't "keeping up".  In storm UI, the processing latency was very low for
>>>>>> that pseudo-bolt, as was the capacity.  Storm UI just showed that no tuples
>>>>>> were being delivered to the bolt.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <ka...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Kevin,
>>>>>>>
>>>>>>> Do you register custom metrics? If then how long / vary is their
>>>>>>> intervals?
>>>>>>> Did your topology not working completely? (I mean did all tuples
>>>>>>> become failing after that time?)
>>>>>>> And could you share your queue configuration?
>>>>>>>
>>>>>>> And you can replace storm-graphite to LoggingMetricsConsumer and see
>>>>>>> it helps. If changing consumer resolves the issue, we can guess
>>>>>>> storm-graphite cannot keep up the metrics.
>>>>>>>
>>>>>>> Btw, I'm addressing metrics consumer issues (asynchronous, filter).
>>>>>>> You can track the progress here:
>>>>>>> https://issues.apache.org/jira/browse/STORM-1699
>>>>>>>
>>>>>>> I'm afraid they may be not ported to 0.10.x, but asynchronous
>>>>>>> metrics consumer bolt
>>>>>>> <https://issues.apache.org/jira/browse/STORM-1698> is a simple
>>>>>>> patch so you can apply and build custom 0.10.0, and give it a try.
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <dd...@norsys.fr> wrote:
>>>>>>>
>>>>>>>> Hi Kevin,
>>>>>>>>
>>>>>>>> I have a similar issue with storm 0.9.6 (see the following topic
>>>>>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>>>>>>> ).
>>>>>>>>
>>>>>>>> It is still open. So, please, keep me informed on your progress.
>>>>>>>>
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On 14/04/2016 15:54, Kevin Conaway wrote:
>>>>>>>>
>>>>>>>> We are using Storm 0.10 with the following configuration:
>>>>>>>>
>>>>>>>>    - 1 Nimbus node
>>>>>>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each
>>>>>>>>    supervisor has 8 cores.
>>>>>>>>
>>>>>>>>
>>>>>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>>>>>> transform the message and insert it in to Cassandra.  Our topic has 50
>>>>>>>> partitions so we have configured the number of executors/tasks for the
>>>>>>>> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
>>>>>>>>
>>>>>>>> We have also added the storm-graphite metrics consumer (
>>>>>>>> <https://github.com/verisign/storm-graphite>
>>>>>>>> https://github.com/verisign/storm-graphite) to our topology so
>>>>>>>> that storms metrics are sent to our graphite cluster.
>>>>>>>>
>>>>>>>> Yesterday we were running a 2000 tuple/sec load test and everything
>>>>>>>> was fine for a few hours until we noticed that we were no longer receiving
>>>>>>>> metrics from Storm in graphite.
>>>>>>>>
>>>>>>>> I verified that its not a connectivity issue between the Storm and
>>>>>>>> Graphite.  Looking in Storm UI,
>>>>>>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>>>>>>> received a single tuple in the prior 10 minute or 3 hour window.
>>>>>>>>
>>>>>>>> Since the metrics consumer bolt was assigned to one executor, I
>>>>>>>> took thread dumps of that JVM.  I saw the following stack trace for the
>>>>>>>> metrics consumer thread:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Kevin Conaway
>>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>>> https://github.com/kevinconaway
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kevin Conaway
>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>> https://github.com/kevinconaway
>>>>>
>>>>
>>>
>>>
>>> --
>>> Kevin Conaway
>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>> https://github.com/kevinconaway
>>>
>>
>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>



-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Kevin Conaway <ke...@gmail.com>.
I took thread dumps of the worker where the Graphite consumer bolt executor
was running, but I didn't see any BLOCKED threads or anything out of the
ordinary.  This is the thread dump for the Graphite metrics consumer bolt:

"Thread-23-__metricscom.verisign.storm.metrics.GraphiteMetricsConsumer" #56
prio=5 os_prio=0 tid=0x00007f0b8555c800 nid=0x9a2 waiting on condition
[0x00007f0abaeed000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at
backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713)
        at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

Would a "stuck" bolt on some other worker JVM have the same effect?


On Fri, Apr 15, 2016 at 2:10 PM, Abhishek Agarwal <ab...@gmail.com>
wrote:

> You might want to check the thread dump and verify if some bolt is stuck
> somewhere
>
> Excuse typos
> On Apr 15, 2016 11:08 PM, "Kevin Conaway" <ke...@gmail.com>
> wrote:
>
>> Was the bolt really "stuck" though given that the failure was at the
>> spout level (because the spout couldn't connect to the Kafka broker)?
>>
>> Additionally, we restarted the Kafka broker and it seemed like the spout
>> was able to reconnect but we never saw messages from through on the metric
>> consumer until we killed and restarted the topology.
>>
>> On Fri, Apr 15, 2016 at 1:31 PM, Abhishek Agarwal <ab...@gmail.com>
>> wrote:
>>
>>> Kevin,
>>> That would explain it. A stuck bolt will stall the whole topology.
>>> MetricConsumer runs as a bolt so it will be blocked as well
>>>
>>> Excuse typos
>>> On Apr 15, 2016 10:29 PM, "Kevin Conaway" <ke...@gmail.com>
>>> wrote:
>>>
>>>> Two more data points on this:
>>>>
>>>> 1.) We are registering the graphite MetricsConsumer on our Topology
>>>> Config, not globally in storm.yaml.  I don't know if this makes a
>>>> difference.
>>>>
>>>> 2.) We re-ran another test last night and it ran fine for about 6 hours
>>>> until the Kafka brokers ran out of disk space (oops) which halted the
>>>> test.  This exact time also coincided with when the Graphite instance
>>>> stopped receiving metrics from Storm.  Given that we weren't processing any
>>>> tuples while storm was down, I understand why we didn't get those metrics
>>>> but shouldn't the __system metrics (like heap size, gc time) still have
>>>> been sent?
>>>>
>>>> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <
>>>> kevin.a.conaway@gmail.com> wrote:
>>>>
>>>>> Thank you for taking the time to respond.
>>>>>
>>>>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
>>>>> track the latency of individual operations in the bolt).  The metric
>>>>> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS
>>>>> which we have set at 60s
>>>>>
>>>>> The topology did not hang completely but it did degrade severely.
>>>>> Without metrics it was hard to tell but it looked like some of the tasks
>>>>> for certain kafka partitions either stopped emitting tuples or never got
>>>>> acknowledgements for the tuples they did emit.  Some tuples were definitely
>>>>> making it through though because data was continuously being inserted in to
>>>>> Cassandra.  After I killed and resubmitted the topology, there were still
>>>>> messages left over in the topic but only for certain partitions.
>>>>>
>>>>> What queue configuration are you looking for?
>>>>>
>>>>> I don't believe that the case was that the graphite metrics consumer
>>>>> wasn't "keeping up".  In storm UI, the processing latency was very low for
>>>>> that pseudo-bolt, as was the capacity.  Storm UI just showed that no tuples
>>>>> were being delivered to the bolt.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Kevin,
>>>>>>
>>>>>> Do you register custom metrics? If then how long / vary is their
>>>>>> intervals?
>>>>>> Did your topology not working completely? (I mean did all tuples
>>>>>> become failing after that time?)
>>>>>> And could you share your queue configuration?
>>>>>>
>>>>>> And you can replace storm-graphite to LoggingMetricsConsumer and see
>>>>>> it helps. If changing consumer resolves the issue, we can guess
>>>>>> storm-graphite cannot keep up the metrics.
>>>>>>
>>>>>> Btw, I'm addressing metrics consumer issues (asynchronous, filter).
>>>>>> You can track the progress here:
>>>>>> https://issues.apache.org/jira/browse/STORM-1699
>>>>>>
>>>>>> I'm afraid they may be not ported to 0.10.x, but asynchronous
>>>>>> metrics consumer bolt
>>>>>> <https://issues.apache.org/jira/browse/STORM-1698> is a simple patch
>>>>>> so you can apply and build custom 0.10.0, and give it a try.
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <dd...@norsys.fr> wrote:
>>>>>>
>>>>>>> Hi Kevin,
>>>>>>>
>>>>>>> I have a similar issue with storm 0.9.6 (see the following topic
>>>>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>>>>>> ).
>>>>>>>
>>>>>>> It is still open. So, please, keep me informed on your progress.
>>>>>>>
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On 14/04/2016 15:54, Kevin Conaway wrote:
>>>>>>>
>>>>>>> We are using Storm 0.10 with the following configuration:
>>>>>>>
>>>>>>>    - 1 Nimbus node
>>>>>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor
>>>>>>>    has 8 cores.
>>>>>>>
>>>>>>>
>>>>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>>>>> transform the message and insert it in to Cassandra.  Our topic has 50
>>>>>>> partitions so we have configured the number of executors/tasks for the
>>>>>>> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
>>>>>>>
>>>>>>> We have also added the storm-graphite metrics consumer (
>>>>>>> <https://github.com/verisign/storm-graphite>
>>>>>>> https://github.com/verisign/storm-graphite) to our topology so that
>>>>>>> storms metrics are sent to our graphite cluster.
>>>>>>>
>>>>>>> Yesterday we were running a 2000 tuple/sec load test and everything
>>>>>>> was fine for a few hours until we noticed that we were no longer receiving
>>>>>>> metrics from Storm in graphite.
>>>>>>>
>>>>>>> I verified that its not a connectivity issue between the Storm and
>>>>>>> Graphite.  Looking in Storm UI,
>>>>>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>>>>>> received a single tuple in the prior 10 minute or 3 hour window.
>>>>>>>
>>>>>>> Since the metrics consumer bolt was assigned to one executor, I took
>>>>>>> thread dumps of that JVM.  I saw the following stack trace for the metrics
>>>>>>> consumer thread:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kevin Conaway
>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>> https://github.com/kevinconaway
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Kevin Conaway
>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>> https://github.com/kevinconaway
>>>>
>>>
>>
>>
>> --
>> Kevin Conaway
>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>> https://github.com/kevinconaway
>>
>


-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Abhishek Agarwal <ab...@gmail.com>.
You might want to check the thread dump and verify if some bolt is stuck
somewhere

Excuse typos
On Apr 15, 2016 11:08 PM, "Kevin Conaway" <ke...@gmail.com> wrote:

> Was the bolt really "stuck" though given that the failure was at the spout
> level (because the spout couldn't connect to the Kafka broker)?
>
> Additionally, we restarted the Kafka broker and it seemed like the spout
> was able to reconnect but we never saw messages from through on the metric
> consumer until we killed and restarted the topology.
>
> On Fri, Apr 15, 2016 at 1:31 PM, Abhishek Agarwal <ab...@gmail.com>
> wrote:
>
>> Kevin,
>> That would explain it. A stuck bolt will stall the whole topology.
>> MetricConsumer runs as a bolt so it will be blocked as well
>>
>> Excuse typos
>> On Apr 15, 2016 10:29 PM, "Kevin Conaway" <ke...@gmail.com>
>> wrote:
>>
>>> Two more data points on this:
>>>
>>> 1.) We are registering the graphite MetricsConsumer on our Topology
>>> Config, not globally in storm.yaml.  I don't know if this makes a
>>> difference.
>>>
>>> 2.) We re-ran another test last night and it ran fine for about 6 hours
>>> until the Kafka brokers ran out of disk space (oops) which halted the
>>> test.  This exact time also coincided with when the Graphite instance
>>> stopped receiving metrics from Storm.  Given that we weren't processing any
>>> tuples while storm was down, I understand why we didn't get those metrics
>>> but shouldn't the __system metrics (like heap size, gc time) still have
>>> been sent?
>>>
>>> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <
>>> kevin.a.conaway@gmail.com> wrote:
>>>
>>>> Thank you for taking the time to respond.
>>>>
>>>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
>>>> track the latency of individual operations in the bolt).  The metric
>>>> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS
>>>> which we have set at 60s
>>>>
>>>> The topology did not hang completely but it did degrade severely.
>>>> Without metrics it was hard to tell but it looked like some of the tasks
>>>> for certain kafka partitions either stopped emitting tuples or never got
>>>> acknowledgements for the tuples they did emit.  Some tuples were definitely
>>>> making it through though because data was continuously being inserted in to
>>>> Cassandra.  After I killed and resubmitted the topology, there were still
>>>> messages left over in the topic but only for certain partitions.
>>>>
>>>> What queue configuration are you looking for?
>>>>
>>>> I don't believe that the case was that the graphite metrics consumer
>>>> wasn't "keeping up".  In storm UI, the processing latency was very low for
>>>> that pseudo-bolt, as was the capacity.  Storm UI just showed that no tuples
>>>> were being delivered to the bolt.
>>>>
>>>> Thanks!
>>>>
>>>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> Kevin,
>>>>>
>>>>> Do you register custom metrics? If then how long / vary is their
>>>>> intervals?
>>>>> Did your topology not working completely? (I mean did all tuples
>>>>> become failing after that time?)
>>>>> And could you share your queue configuration?
>>>>>
>>>>> And you can replace storm-graphite to LoggingMetricsConsumer and see
>>>>> it helps. If changing consumer resolves the issue, we can guess
>>>>> storm-graphite cannot keep up the metrics.
>>>>>
>>>>> Btw, I'm addressing metrics consumer issues (asynchronous, filter).
>>>>> You can track the progress here:
>>>>> https://issues.apache.org/jira/browse/STORM-1699
>>>>>
>>>>> I'm afraid they may be not ported to 0.10.x, but asynchronous metrics
>>>>> consumer bolt <https://issues.apache.org/jira/browse/STORM-1698> is a
>>>>> simple patch so you can apply and build custom 0.10.0, and give it a try.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>>
>>>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <dd...@norsys.fr> wrote:
>>>>>
>>>>>> Hi Kevin,
>>>>>>
>>>>>> I have a similar issue with storm 0.9.6 (see the following topic
>>>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>>>>> ).
>>>>>>
>>>>>> It is still open. So, please, keep me informed on your progress.
>>>>>>
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On 14/04/2016 15:54, Kevin Conaway wrote:
>>>>>>
>>>>>> We are using Storm 0.10 with the following configuration:
>>>>>>
>>>>>>    - 1 Nimbus node
>>>>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor
>>>>>>    has 8 cores.
>>>>>>
>>>>>>
>>>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>>>> transform the message and insert it in to Cassandra.  Our topic has 50
>>>>>> partitions so we have configured the number of executors/tasks for the
>>>>>> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
>>>>>>
>>>>>> We have also added the storm-graphite metrics consumer (
>>>>>> <https://github.com/verisign/storm-graphite>
>>>>>> https://github.com/verisign/storm-graphite) to our topology so that
>>>>>> storms metrics are sent to our graphite cluster.
>>>>>>
>>>>>> Yesterday we were running a 2000 tuple/sec load test and everything
>>>>>> was fine for a few hours until we noticed that we were no longer receiving
>>>>>> metrics from Storm in graphite.
>>>>>>
>>>>>> I verified that its not a connectivity issue between the Storm and
>>>>>> Graphite.  Looking in Storm UI,
>>>>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>>>>> received a single tuple in the prior 10 minute or 3 hour window.
>>>>>>
>>>>>> Since the metrics consumer bolt was assigned to one executor, I took
>>>>>> thread dumps of that JVM.  I saw the following stack trace for the metrics
>>>>>> consumer thread:
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Kevin Conaway
>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>> https://github.com/kevinconaway
>>>>
>>>
>>>
>>>
>>> --
>>> Kevin Conaway
>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>> https://github.com/kevinconaway
>>>
>>
>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Kevin Conaway <ke...@gmail.com>.
Was the bolt really "stuck" though given that the failure was at the spout
level (because the spout couldn't connect to the Kafka broker)?

Additionally, we restarted the Kafka broker and it seemed like the spout
was able to reconnect, but we never saw messages come through on the metrics
consumer until we killed and restarted the topology.

On Fri, Apr 15, 2016 at 1:31 PM, Abhishek Agarwal <ab...@gmail.com>
wrote:

> Kevin,
> That would explain it. A stuck bolt will stall the whole topology.
> MetricConsumer runs as a bolt so it will be blocked as well
>
> Excuse typos
> On Apr 15, 2016 10:29 PM, "Kevin Conaway" <ke...@gmail.com>
> wrote:
>
>> Two more data points on this:
>>
>> 1.) We are registering the graphite MetricsConsumer on our Topology
>> Config, not globally in storm.yaml.  I don't know if this makes a
>> difference.
>>
>> 2.) We re-ran another test last night and it ran fine for about 6 hours
>> until the Kafka brokers ran out of disk space (oops) which halted the
>> test.  This exact time also coincided with when the Graphite instance
>> stopped receiving metrics from Storm.  Given that we weren't processing any
>> tuples while storm was down, I understand why we didn't get those metrics
>> but shouldn't the __system metrics (like heap size, gc time) still have
>> been sent?
>>
>> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <
>> kevin.a.conaway@gmail.com> wrote:
>>
>>> Thank you for taking the time to respond.
>>>
>>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
>>> track the latency of individual operations in the bolt).  The metric
>>> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS
>>> which we have set at 60s
>>>
>>> The topology did not hang completely but it did degrade severely.
>>> Without metrics it was hard to tell but it looked like some of the tasks
>>> for certain kafka partitions either stopped emitting tuples or never got
>>> acknowledgements for the tuples they did emit.  Some tuples were definitely
>>> making it through though because data was continuously being inserted in to
>>> Cassandra.  After I killed and resubmitted the topology, there were still
>>> messages left over in the topic but only for certain partitions.
>>>
>>> What queue configuration are you looking for?
>>>
>>> I don't believe that the case was that the graphite metrics consumer
>>> wasn't "keeping up".  In storm UI, the processing latency was very low for
>>> that pseudo-bolt, as was the capacity.  Storm UI just showed that no tuples
>>> were being delivered to the bolt.
>>>
>>> Thanks!
>>>
>>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>>>
>>>> Kevin,
>>>>
>>>> Do you register custom metrics? If then how long / vary is their
>>>> intervals?
>>>> Did your topology not working completely? (I mean did all tuples become
>>>> failing after that time?)
>>>> And could you share your queue configuration?
>>>>
>>>> And you can replace storm-graphite to LoggingMetricsConsumer and see it
>>>> helps. If changing consumer resolves the issue, we can guess storm-graphite
>>>> cannot keep up the metrics.
>>>>
>>>> Btw, I'm addressing metrics consumer issues (asynchronous, filter).
>>>> You can track the progress here:
>>>> https://issues.apache.org/jira/browse/STORM-1699
>>>>
>>>> I'm afraid they may be not ported to 0.10.x, but asynchronous metrics
>>>> consumer bolt <https://issues.apache.org/jira/browse/STORM-1698> is a
>>>> simple patch so you can apply and build custom 0.10.0, and give it a try.
>>>>
>>>> Hope this helps.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>>
>>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <dd...@norsys.fr> wrote:
>>>>
>>>>> Hi Kevin,
>>>>>
>>>>> I have a similar issue with storm 0.9.6 (see the following topic
>>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>>>> ).
>>>>>
>>>>> It is still open. So, please, keep me informed on your progress.
>>>>>
>>>>> Denis
>>>>>
>>>>>
>>>>> On 14/04/2016 15:54, Kevin Conaway wrote:
>>>>>
>>>>> We are using Storm 0.10 with the following configuration:
>>>>>
>>>>>    - 1 Nimbus node
>>>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor
>>>>>    has 8 cores.
>>>>>
>>>>>
>>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>>> transform the message and insert it in to Cassandra.  Our topic has 50
>>>>> partitions so we have configured the number of executors/tasks for the
>>>>> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
>>>>>
>>>>> We have also added the storm-graphite metrics consumer (
>>>>> <https://github.com/verisign/storm-graphite>
>>>>> https://github.com/verisign/storm-graphite) to our topology so that
>>>>> storms metrics are sent to our graphite cluster.
>>>>>
>>>>> Yesterday we were running a 2000 tuple/sec load test and everything
>>>>> was fine for a few hours until we noticed that we were no longer receiving
>>>>> metrics from Storm in graphite.
>>>>>
>>>>> I verified that its not a connectivity issue between the Storm and
>>>>> Graphite.  Looking in Storm UI,
>>>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>>>> received a single tuple in the prior 10 minute or 3 hour window.
>>>>>
>>>>> Since the metrics consumer bolt was assigned to one executor, I took
>>>>> thread dumps of that JVM.  I saw the following stack trace for the metrics
>>>>> consumer thread:
>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Kevin Conaway
>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>> https://github.com/kevinconaway
>>>
>>
>>
>>
>> --
>> Kevin Conaway
>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>> https://github.com/kevinconaway
>>
>


-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Abhishek Agarwal <ab...@gmail.com>.
Kevin,
That would explain it. A stuck bolt will stall the whole topology.
The metrics consumer runs as a bolt, so it will be blocked as well.

Excuse typos
On Apr 15, 2016 10:29 PM, "Kevin Conaway" <ke...@gmail.com> wrote:

> Two more data points on this:
>
> 1.) We are registering the graphite MetricsConsumer on our Topology
> Config, not globally in storm.yaml.  I don't know if this makes a
> difference.
>
> 2.) We re-ran another test last night and it ran fine for about 6 hours
> until the Kafka brokers ran out of disk space (oops) which halted the
> test.  This exact time also coincided with when the Graphite instance
> stopped receiving metrics from Storm.  Given that we weren't processing any
> tuples while storm was down, I understand why we didn't get those metrics
> but shouldn't the __system metrics (like heap size, gc time) still have
> been sent?
>
> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <kevin.a.conaway@gmail.com
> > wrote:
>
>> Thank you for taking the time to respond.
>>
>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
>> track the latency of individual operations in the bolt).  The metric
>> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS
>> which we have set at 60s
>>
>> The topology did not hang completely but it did degrade severely.
>> Without metrics it was hard to tell but it looked like some of the tasks
>> for certain kafka partitions either stopped emitting tuples or never got
>> acknowledgements for the tuples they did emit.  Some tuples were definitely
>> making it through though because data was continuously being inserted in to
>> Cassandra.  After I killed and resubmitted the topology, there were still
>> messages left over in the topic but only for certain partitions.
>>
>> What queue configuration are you looking for?
>>
>> I don't believe that the case was that the graphite metrics consumer
>> wasn't "keeping up".  In storm UI, the processing latency was very low for
>> that pseudo-bolt, as was the capacity.  Storm UI just showed that no tuples
>> were being delivered to the bolt.
>>
>> Thanks!
>>
>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>>
>>> Kevin,
>>>
>>> Do you register custom metrics? If then how long / vary is their
>>> intervals?
>>> Did your topology not working completely? (I mean did all tuples become
>>> failing after that time?)
>>> And could you share your queue configuration?
>>>
>>> And you can replace storm-graphite to LoggingMetricsConsumer and see it
>>> helps. If changing consumer resolves the issue, we can guess storm-graphite
>>> cannot keep up the metrics.
>>>
>>> Btw, I'm addressing metrics consumer issues (asynchronous, filter).
>>> You can track the progress here:
>>> https://issues.apache.org/jira/browse/STORM-1699
>>>
>>> I'm afraid they may be not ported to 0.10.x, but asynchronous metrics
>>> consumer bolt <https://issues.apache.org/jira/browse/STORM-1698> is a
>>> simple patch so you can apply and build custom 0.10.0, and give it a try.
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <dd...@norsys.fr> wrote:
>>>
>>>> Hi Kevin,
>>>>
>>>> I have a similar issue with storm 0.9.6 (see the following topic
>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>>> ).
>>>>
>>>> It is still open. So, please, keep me informed on your progress.
>>>>
>>>> Denis
>>>>
>>>>
>>>> On 14/04/2016 15:54, Kevin Conaway wrote:
>>>>
>>>> We are using Storm 0.10 with the following configuration:
>>>>
>>>>    - 1 Nimbus node
>>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor
>>>>    has 8 cores.
>>>>
>>>>
>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>> transform the message and insert it in to Cassandra.  Our topic has 50
>>>> partitions so we have configured the number of executors/tasks for the
>>>> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
>>>>
>>>> We have also added the storm-graphite metrics consumer (
>>>> <https://github.com/verisign/storm-graphite>
>>>> https://github.com/verisign/storm-graphite) to our topology so that
>>>> storms metrics are sent to our graphite cluster.
>>>>
>>>> Yesterday we were running a 2000 tuple/sec load test and everything was
>>>> fine for a few hours until we noticed that we were no longer receiving
>>>> metrics from Storm in graphite.
>>>>
>>>> I verified that its not a connectivity issue between the Storm and
>>>> Graphite.  Looking in Storm UI,
>>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>>> received a single tuple in the prior 10 minute or 3 hour window.
>>>>
>>>> Since the metrics consumer bolt was assigned to one executor, I took
>>>> thread dumps of that JVM.  I saw the following stack trace for the metrics
>>>> consumer thread:
>>>>
>>>>
>>>>
>>
>>
>> --
>> Kevin Conaway
>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>> https://github.com/kevinconaway
>>
>
>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Kevin Conaway <ke...@gmail.com>.
Two more data points on this:

1.) We are registering the Graphite MetricsConsumer on our topology Config,
not globally in storm.yaml.  I don't know if this makes a difference.

2.) We re-ran another test last night and it ran fine for about 6 hours
until the Kafka brokers ran out of disk space (oops), which halted the
test.  That exact time also coincided with when the Graphite instance
stopped receiving metrics from Storm.  Given that we weren't processing any
tuples while Kafka was down, I understand why we didn't get those metrics,
but shouldn't the __system metrics (like heap size and GC time) still have
been sent?

On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <ke...@gmail.com>
wrote:

> Thank you for taking the time to respond.
>
> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
> track the latency of individual operations in the bolt).  The metric
> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS
> which we have set at 60s
>
> The topology did not hang completely but it did degrade severely.  Without
> metrics it was hard to tell but it looked like some of the tasks for
> certain kafka partitions either stopped emitting tuples or never got
> acknowledgements for the tuples they did emit.  Some tuples were definitely
> making it through though because data was continuously being inserted in to
> Cassandra.  After I killed and resubmitted the topology, there were still
> messages left over in the topic but only for certain partitions.
>
> What queue configuration are you looking for?
>
> I don't believe that the case was that the graphite metrics consumer
> wasn't "keeping up".  In storm UI, the processing latency was very low for
> that pseudo-bolt, as was the capacity.  Storm UI just showed that no tuples
> were being delivered to the bolt.
>
> Thanks!
>
> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>
>> Kevin,
>>
>> Do you register custom metrics? If then how long / vary is their
>> intervals?
>> Did your topology not working completely? (I mean did all tuples become
>> failing after that time?)
>> And could you share your queue configuration?
>>
>> And you can replace storm-graphite to LoggingMetricsConsumer and see it
>> helps. If changing consumer resolves the issue, we can guess storm-graphite
>> cannot keep up the metrics.
>>
>> Btw, I'm addressing metrics consumer issues (asynchronous, filter).
>> You can track the progress here:
>> https://issues.apache.org/jira/browse/STORM-1699
>>
>> I'm afraid they may be not ported to 0.10.x, but asynchronous metrics
>> consumer bolt <https://issues.apache.org/jira/browse/STORM-1698> is a
>> simple patch so you can apply and build custom 0.10.0, and give it a try.
>>
>> Hope this helps.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <dd...@norsys.fr> wrote:
>>
>>> Hi Kevin,
>>>
>>> I have a similar issue with storm 0.9.6 (see the following topic
>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>> ).
>>>
>>> It is still open. So, please, keep me informed on your progress.
>>>
>>> Denis
>>>
>>>
>>> On 14/04/2016 15:54, Kevin Conaway wrote:
>>>
>>> We are using Storm 0.10 with the following configuration:
>>>
>>>    - 1 Nimbus node
>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor has
>>>    8 cores.
>>>
>>>
>>> Our topology has a KafkaSpout that forwards to a bolt where we transform
>>> the message and insert it in to Cassandra.  Our topic has 50 partitions so
>>> we have configured the number of executors/tasks for the KafkaSpout to be
>>> 50.  Our bolt has 150 executors/tasks.
>>>
>>> We have also added the storm-graphite metrics consumer (
>>> <https://github.com/verisign/storm-graphite>
>>> https://github.com/verisign/storm-graphite) to our topology so that
>>> storms metrics are sent to our graphite cluster.
>>>
>>> Yesterday we were running a 2000 tuple/sec load test and everything was
>>> fine for a few hours until we noticed that we were no longer receiving
>>> metrics from Storm in graphite.
>>>
>>> I verified that its not a connectivity issue between the Storm and
>>> Graphite.  Looking in Storm UI,
>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>> received a single tuple in the prior 10 minute or 3 hour window.
>>>
>>> Since the metrics consumer bolt was assigned to one executor, I took
>>> thread dumps of that JVM.  I saw the following stack trace for the metrics
>>> consumer thread:
>>>
>>>
>>>
>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>



-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Kevin Conaway <ke...@gmail.com>.
Thank you for taking the time to respond.

In my bolt I am registering 3 custom metrics (each a ReducedMetric to track
the latency of individual operations in the bolt).  The metric interval for
each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, which we have
set at 60s.
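
For context, the registration looks roughly like the sketch below (the class
and metric names and the writeToCassandra() helper are illustrative, not our
actual code):

    import backtype.storm.metric.api.MeanReducer;
    import backtype.storm.metric.api.ReducedMetric;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;
    import java.util.Map;

    public abstract class InstrumentedCassandraBolt extends BaseRichBolt {
        private transient OutputCollector collector;
        private transient ReducedMetric writeLatency;

        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // 60s bucket, same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS in our config
            writeLatency = context.registerMetric("cassandra-write-latency",
                    new ReducedMetric(new MeanReducer()), 60);
        }

        @Override
        public void execute(Tuple tuple) {
            long start = System.currentTimeMillis();
            writeToCassandra(tuple);  // illustrative placeholder for the real work
            writeLatency.update(System.currentTimeMillis() - start);
            collector.ack(tuple);
        }

        protected abstract void writeToCassandra(Tuple tuple);
    }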

The topology did not hang completely but it did degrade severely.
Without metrics it was hard to tell, but it looked like some of the tasks
for certain Kafka partitions either stopped emitting tuples or never got
acknowledgements for the tuples they did emit.  Some tuples were definitely
making it through, though, because data was continuously being inserted into
Cassandra.  After I killed and resubmitted the topology, there were still
messages left over in the topic, but only for certain partitions.

What queue configuration are you looking for?

I don't believe it was the case that the Graphite metrics consumer wasn't
"keeping up".  In Storm UI, the processing latency was very low for that
pseudo-bolt, as was the capacity.  Storm UI just showed that no tuples were
being delivered to the bolt.

Thanks!

On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> Kevin,
>
> Do you register custom metrics? If then how long / vary is their intervals?
> Did your topology not working completely? (I mean did all tuples become
> failing after that time?)
> And could you share your queue configuration?
>
> And you can replace storm-graphite to LoggingMetricsConsumer and see it
> helps. If changing consumer resolves the issue, we can guess storm-graphite
> cannot keep up the metrics.
>
> Btw, I'm addressing metrics consumer issues (asynchronous, filter).
> You can track the progress here:
> https://issues.apache.org/jira/browse/STORM-1699
>
> I'm afraid they may be not ported to 0.10.x, but asynchronous metrics
> consumer bolt <https://issues.apache.org/jira/browse/STORM-1698> is a
> simple patch so you can apply and build custom 0.10.0, and give it a try.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <dd...@norsys.fr> wrote:
>
>> Hi Kevin,
>>
>> I have a similar issue with storm 0.9.6 (see the following topic
>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>> ).
>>
>> It is still open. So, please, keep me informed on your progress.
>>
>> Denis
>>
>>
>> On 14/04/2016 15:54, Kevin Conaway wrote:
>>
>> We are using Storm 0.10 with the following configuration:
>>
>>    - 1 Nimbus node
>>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor has
>>    8 cores.
>>
>>
>> Our topology has a KafkaSpout that forwards to a bolt where we transform
>> the message and insert it in to Cassandra.  Our topic has 50 partitions so
>> we have configured the number of executors/tasks for the KafkaSpout to be
>> 50.  Our bolt has 150 executors/tasks.
>>
>> We have also added the storm-graphite metrics consumer (
>> <https://github.com/verisign/storm-graphite>
>> https://github.com/verisign/storm-graphite) to our topology so that
>> storms metrics are sent to our graphite cluster.
>>
>> Yesterday we were running a 2000 tuple/sec load test and everything was
>> fine for a few hours until we noticed that we were no longer receiving
>> metrics from Storm in graphite.
>>
>> I verified that its not a connectivity issue between the Storm and
>> Graphite.  Looking in Storm UI,
>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>> received a single tuple in the prior 10 minute or 3 hour window.
>>
>> Since the metrics consumer bolt was assigned to one executor, I took
>> thread dumps of that JVM.  I saw the following stack trace for the metrics
>> consumer thread:
>>
>>
>>


-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Jungtaek Lim <ka...@gmail.com>.
Kevin,

Do you register custom metrics? If so, how long (and how varied) are their
intervals?
Did your topology stop working completely? (I mean, did all tuples start
failing after that time?)
And could you share your queue configuration?

Also, you could replace storm-graphite with LoggingMetricsConsumer and see
if that helps. If changing the consumer resolves the issue, we can guess
that storm-graphite cannot keep up with the metrics.
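
The swap is roughly a one-line change in your topology setup (the output
should end up in the workers' metrics log, depending on your logback
configuration):

    import backtype.storm.Config;
    import backtype.storm.metric.LoggingMetricsConsumer;

    // ... in your topology setup, instead of registering the Graphite consumer:
    Config conf = new Config();
    conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);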

Btw, I'm addressing metrics consumer issues (asynchronous, filter).
You can track the progress here:
https://issues.apache.org/jira/browse/STORM-1699

I'm afraid they may not be ported to 0.10.x, but the asynchronous metrics
consumer bolt <https://issues.apache.org/jira/browse/STORM-1698> is a
simple patch, so you could apply it, build a custom 0.10.0, and give it a try.

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <dd...@norsys.fr> wrote:

> Hi Kevin,
>
> I have a similar issue with storm 0.9.6 (see the following topic
> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser).
>
> It is still open. So, please, keep me informed on your progress.
>
> Denis
>
>
> On 14/04/2016 15:54, Kevin Conaway wrote:
>
> We are using Storm 0.10 with the following configuration:
>
>    - 1 Nimbus node
>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor has 8
>    cores.
>
>
> Our topology has a KafkaSpout that forwards to a bolt where we transform
> the message and insert it in to Cassandra.  Our topic has 50 partitions so
> we have configured the number of executors/tasks for the KafkaSpout to be
> 50.  Our bolt has 150 executors/tasks.
>
> We have also added the storm-graphite metrics consumer (
> https://github.com/verisign/storm-graphite) to our topology so that
> storms metrics are sent to our graphite cluster.
>
> Yesterday we were running a 2000 tuple/sec load test and everything was
> fine for a few hours until we noticed that we were no longer receiving
> metrics from Storm in graphite.
>
> I verified that its not a connectivity issue between the Storm and
> Graphite.  Looking in Storm UI,
> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
> received a single tuple in the prior 10 minute or 3 hour window.
>
> Since the metrics consumer bolt was assigned to one executor, I took
> thread dumps of that JVM.  I saw the following stack trace for the metrics
> consumer thread:
>
>
>

Re: Storm Metrics Consumer Not Receiving Tuples

Posted by Denis DEBARBIEUX <dd...@norsys.fr>.
Hi Kevin,

I have a similar issue with Storm 0.9.6 (see the following thread:
https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser).

It is still open. So, please, keep me informed on your progress.

Denis

On 14/04/2016 15:54, Kevin Conaway wrote:
> We are using Storm 0.10 with the following configuration:
>
>   * 1 Nimbus node
>   * 6 Supervisor nodes, each with 2 worker slots.  Each supervisor has
>     8 cores.
>
>
> Our topology has a KafkaSpout that forwards to a bolt where we 
> transform the message and insert it in to Cassandra.  Our topic has 50 
> partitions so we have configured the number of executors/tasks for the 
> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
>
> We have also added the storm-graphite metrics consumer 
> (https://github.com/verisign/storm-graphite) to our topology so that 
> storms metrics are sent to our graphite cluster.
>
> Yesterday we were running a 2000 tuple/sec load test and everything 
> was fine for a few hours until we noticed that we were no longer 
> receiving metrics from Storm in graphite.
>
> I verified that its not a connectivity issue between the Storm and 
> Graphite.  Looking in Storm UI, 
> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't 
> received a single tuple in the prior 10 minute or 3 hour window.
>
> Since the metrics consumer bolt was assigned to one executor, I took 
> thread dumps of that JVM.  I saw the following stack trace for the 
> metrics consumer thread:
>


