Posted to dev@storm.apache.org by Erik Weathers <ew...@groupon.com.INVALID> on 2016/12/23 04:00:45 UTC

ever seen the netty message_queue grow (seemingly) infinitely?

We're debugging a topology's infinite memory growth for a worker process
that is running a metrics consumer bolt, and we just noticed that the netty
Server.java's message_queue
<https://github.com/apache/storm/blob/v0.9.6/storm-core/src/jvm/backtype/storm/messaging/netty/Server.java#L97>
is growing forever (at least it goes up to ~5GB before it hits heap limits
and leads to heavy GCing).  (We found this by using Eclipse's Memory
Analysis Tool on a heap dump obtained via jmap.)

We're running storm-0.9.6, and this is happening with a topology that is
processing 200K+ tuples per second and producing a lot of metrics.

I'm a bit surprised that this queue would grow forever; I assumed there
would be some sort of limit.  I'm still pretty naive about how netty's
message receiving system ties into the Storm executors, though.  I'm
guessing the behavior could be a result of backpressure / slowness from
our downstream monitoring system, but Storm provides no visibility into
what's happening with these messages in the netty queues (none that I
have been able to ferret out, at least!).
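
To illustrate the failure mode I'm worried about, here's a toy sketch
using only plain JDK classes (it is *not* Storm's actual messaging code):
an unbounded queue whose producer outruns its consumer just keeps growing
until the heap gives out.

import java.util.concurrent.LinkedBlockingQueue;

public class UnboundedQueueDemo {
    public static void main(String[] args) throws Exception {
        // No capacity argument, so the queue is bounded only by the heap.
        LinkedBlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

        // Producer: roughly 10 messages per millisecond.
        Thread producer = new Thread(() -> {
            try {
                while (true) {
                    for (int i = 0; i < 10; i++) {
                        queue.offer(new byte[1024]);  // never blocks, never drops
                    }
                    Thread.sleep(1);
                }
            } catch (InterruptedException e) { /* exit */ }
        });

        // Consumer: roughly 1 message per millisecond, i.e. 10x slower.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    queue.take();
                    Thread.sleep(1);
                }
            } catch (InterruptedException e) { /* exit */ }
        });

        producer.start();
        consumer.start();
        while (true) {
            Thread.sleep(1000);
            // Depth (and heap usage) climbs steadily until GC thrashing / OOM.
            System.out.println("queue size = " + queue.size());
        }
    }
}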

Thanks for any input you might be able to provide!

- Erik

Re: ever seen the netty message_queue grow (seemingly) infinitely?

Posted by Erik Weathers <ew...@groupon.com.INVALID>.
Thanks for the response, Jungtaek.  There are definitely a ton of executors
in this topology, and it's processing a ton of tuples.

Also of note is that the issue I mentioned happens only when custom
app-level metrics are enabled.  When only the storm-level internal metrics
are enabled, the problem goes away.
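
For context, by "custom app-level metrics" I mean metrics our bolts
register through Storm's metrics API, roughly like the sketch below (the
class and metric names are made up; the API is per storm-0.9.x):

import java.util.Map;

import backtype.storm.metric.api.CountMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class ExampleInstrumentedBolt extends BaseRichBolt {
    private transient CountMetric processedCount;
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Every task that registers metrics contributes metrics tuples, and
        // they are all routed to the worker(s) hosting the metrics consumer.
        processedCount = context.registerMetric("processed_count", new CountMetric(), 60);
    }

    @Override
    public void execute(Tuple tuple) {
        processedCount.incr();
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this sketch emits nothing
    }
}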

Unfortunately, we cannot upgrade to storm-1.x+ for a long while yet,
mostly because of the logback to log4j 2 change.  That change will require
significant effort, since our client topologies are owned by dozens of
teams that will all have to make modifications.  We haven't even found
time yet to work out *how* they will have to change (since we have a
logback-based library our clients use for logging).

This issue is also blocking me from looking into the new metrics stuff and
the exposure of the queue metrics into the UI (which we discussed
separately).

So we'll keep plugging away; I'll update the thread with any concrete
findings.

Thanks!

- Erik

On Wed, Jan 4, 2017 at 8:05 AM, Jungtaek Lim <ka...@gmail.com> wrote:

Re: ever seen the netty message_queue grow (seemingly) infinitely?

Posted by Jungtaek Lim <ka...@gmail.com>.
You may already know this, but please note that the count of metrics
tuples is linear in the overall task count.  Higher parallelism puts more
pressure on the metrics consumer bolt.
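
To make that concrete with purely made-up numbers: a topology with 2,000
tasks, each flushing its metrics once per 60-second bucket, sends roughly
2,000 metrics tuples (each carrying many data points) per minute to the
metrics consumer; double the parallelism of the other components and that
consumer has to absorb roughly twice the load.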

I believe Taylor and Alessandro have been working on metrics v2.  Until
metrics v2 is finished, we can only reduce the load with the metrics
whitelist / blacklist and the asynchronous metrics consumer bolt in the
upcoming Storm 1.1.0.  (Before that, you might want to try migrating to
1.x, say 1.0.2, for now.)

- Jungtaek Lim (HeartSaVioR)

On Thu, Jan 5, 2017 at 12:42 AM, Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:

Re: ever seen the netty message_queue grow (seemingly) infinitely?

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.
Yes, you are right that will not help.  The best you can do now is to
increase the number of MetricsConsumer instances that you have.  You can
do this when you register the metrics consumer:

conf.registerMetricsConsumer(NoOpMetricsConsumer.class, 3);

The default is 1, but we have seen that very large topologies, or ones
that output a lot of metrics, can sometimes get bogged down.
You could also try profiling that worker to see what is taking so long.
If a NoOp consumer is also showing the same signs, it would be
interesting to see why.  It could be the number of events coming in, or
it could be the size of the metrics being sent making deserialization
costly. - Bobby
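
As a minimal sketch of that registration when building the topology (the
topology wiring itself is elided, and LoggingMetricsConsumer is just a
stand-in for whatever consumer class you actually register):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.metric.LoggingMetricsConsumer;
import backtype.storm.topology.TopologyBuilder;

public class RegisterMetricsConsumerExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... setSpout() / setBolt() calls for the actual topology go here ...

        Config conf = new Config();
        // A parallelism hint of 3 spreads the metrics tuples across three
        // consumer tasks instead of funneling everything into one.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 3);

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}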

On Tuesday, January 3, 2017 2:05 PM, Erik Weathers <ew...@groupon.com.INVALID> wrote:

Re: ever seen the netty message_queue grow (seemingly) infinitely?

Posted by Erik Weathers <ew...@groupon.com.INVALID>.
Thanks for the response, Bobby!

I think I might have failed to sufficiently emphasize & explain something
in my earlier description of the issue:  this is happening *only* in the
worker process that is hosting a bolt that implements the
*IMetricsConsumer* interface.  The other 24 worker processes are working
just fine; their netty queues do not grow forever.  Every worker process
has the same number and type of executors, except for the one worker that
is hosting the metrics consumer bolt.

So the netty queue is growing unbounded because of an influx of metrics.
The acking and max spout pending configs wouldn't seem to directly
influence the filling of the netty queue with custom metrics.

Notably, this "choking" behavior happens even with a "NoOpMetricsConsumer"
bolt, which is the same as Storm's LoggingMetricsConsumer but with
handleDataPoints() doing *nothing*.  Interesting, right?
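
For reference, the no-op consumer is essentially just this (a minimal
sketch against the 0.9.x IMetricsConsumer interface, not our exact class):

import java.util.Collection;
import java.util.Map;

import backtype.storm.metric.api.IMetricsConsumer;
import backtype.storm.task.IErrorReporter;
import backtype.storm.task.TopologyContext;

public class NoOpMetricsConsumer implements IMetricsConsumer {

    @Override
    public void prepare(Map stormConf, Object registrationArgument,
                        TopologyContext context, IErrorReporter errorReporter) {
        // nothing to set up
    }

    @Override
    public void handleDataPoints(TaskInfo taskInfo, Collection<DataPoint> dataPoints) {
        // intentionally a no-op: the data points are received and then dropped
    }

    @Override
    public void cleanup() {
        // nothing to tear down
    }
}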

- Erik

On Tue, Jan 3, 2017 at 7:06 AM, Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:

Re: ever seen the netty message_queue grow (seemingly) infinitely?

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.
Storm does not have back pressure by default.  Also, because storm
supports loops in a topology, the message queues can grow unbounded.  We
have put in a number of fixes in newer versions of storm, also for the
messaging side of things.  But the simplest way to avoid this is to have
acking enabled and have max spout pending set to a reasonable number.
This will typically be caused by one of the executors in your worker not
being able to keep up with the load coming in.  There is also the
possibility that a single thread cannot keep up with the incoming message
load.  In the former case you should be able to see the capacity go very
high on some of the executors.  In the latter case you will not see that,
and may need to add more workers to your topology.  - Bobby
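
Concretely, that looks something like the following when setting up the
topology config (the values are illustrative; max spout pending only has
an effect when the spout emits tuples with message ids so they can be
acked):

import backtype.storm.Config;

public class AckingBackpressureConfig {
    public static Config baseConf() {
        Config conf = new Config();
        conf.setNumAckers(2);           // > 0 keeps the acking framework enabled
        conf.setMaxSpoutPending(1000);  // cap on un-acked tuples per spout task
        return conf;
    }
}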

On Thursday, December 22, 2016 10:01 PM, Erik Weathers <ew...@groupon.com.INVALID> wrote: