Posted to user@storm.apache.org by Tom Raney <to...@urbanairship.com> on 2017/10/18 19:31:34 UTC

inter-topology contention?

 Is there a good way to measure how contended a cluster is in terms of
inbound/outbound queues?

I'm using 1.0.2 and have noticed that at times tuples flowing through a
topology slow down considerably.

Load for each of the 5 nodes in the cluster is low and network doesn't
appear bottlenecked.  Sometimes, if I redeploy or rebalance the topology,
throughput increases dramatically for a day or so.

I'm using topology.max.spout.pending set to 30 with 8 spouts feeding 40
"writer" bolts.  The capacity metric for the busiest bolt is around 0.780,
which seems to indicate that the bolts aren't the bottleneck.

topology.message.timeout.secs is set to 120 seconds, but I'm not seeing
failures.
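
For context, the topology is wired up roughly like the sketch below
(EventSpout and WriterBolt are placeholders for our actual classes):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class WriterTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // 8 spout executors feeding 40 "writer" bolt executors
            builder.setSpout("events", new EventSpout(), 8);   // placeholder spout
            builder.setBolt("writer", new WriterBolt(), 40)    // placeholder bolt
                   .shuffleGrouping("events");

            Config conf = new Config();
            conf.setMaxSpoutPending(30);      // topology.max.spout.pending
            conf.setMessageTimeoutSecs(120);  // topology.message.timeout.secs

            StormSubmitter.submitTopology("writer-topology", conf,
                                          builder.createTopology());
        }
    }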

Additionally, I'm using tick tuples to flush the accumulated data at each
bolt to the database every 5 minutes.  Between those cycles, the bolt
accumulates aggregated data and only writes if cache misses occur.  But,
the cache hit rate is almost always 100%.
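
The flush logic is roughly the following sketch (flushToDatabase stands in
for our real write path, and the aggregation is simplified):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.utils.TupleUtils;

    public class WriterBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public Map<String, Object> getComponentConfiguration() {
            // Ask Storm to send this bolt a tick tuple every 5 minutes
            Map<String, Object> conf = new HashMap<>();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 300);
            return conf;
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            if (TupleUtils.isTick(tuple)) {
                flushToDatabase(counts);  // flush the 5-minute aggregate
                counts.clear();
            } else {
                // Aggregate in memory between ticks
                counts.merge(tuple.getString(0), 1L, Long::sum);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }

        private void flushToDatabase(Map<String, Long> batch) { /* write batch */ }
    }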

-Tom

Re: inter-topology contention?

Posted by Bobby Evans <bo...@apache.org>.
There is no simple way to do this across an entire cluster.  We do not
report these queue metrics in the metrics that go to the UI, so it is not
that simple to aggregate them across more than one topology.  They are
available in the topology-specific metrics.

http://storm.apache.org/releases/1.0.4/Metrics.html

That page describes some of this.  Sadly, not all of the metrics we support
are listed there.  I would suggest you install the logging metrics consumer
and start looking around at what we have there.  The ones I think you want
are __receive, __sendqueue, and __transfer; each of these is a queue metric.
Some of the fields you should look at are population, overflow, and
sojourn_time_ms.  population is the number of entries in the queue itself;
if the queue fills up, there is an overflow buffer where there may be more
entries.  sojourn_time_ms is an estimate of the number of milliseconds it
will take a tuple to flow through the queue.  In my experience this is a
very noisy number and is not always that accurate, but it can give you an
idea of whether there is a problem (i.e., if the number is large).
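
Registering the consumer when you build the topology conf looks something
like this (the parallelism hint of 1 is just an example; you can also
register it cluster-wide in storm.yaml):

    import org.apache.storm.Config;
    import org.apache.storm.metric.LoggingMetricsConsumer;

    Config conf = new Config();
    // Sends all built-in metrics, including the queue metrics above,
    // to the metrics log on each worker
    conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);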

- Bobby
