Posted to user@storm.apache.org by Fabrice Gabolde <fg...@weborama.com> on 2016/03/10 18:49:46 UTC

Topology stuck, doesn't call nextTuple

Hello list,

I have a 0.9.6 topology with 16 spout executors (a custom RMQ spout)
and around 80 executors on various bolts, taking up 8 slots on our
18-slot cluster. It's a straightforward affair with only
BaseBasicBolts.

This morning it just stopped processing any tuples at 10:40 AM, on all
workers within the same minute, despite the RMQ queues having
thousands of messages ready. Logs do not contain any errors. We have
the logs configured to emit on INFO+, except for logs from our org
which are configured to emit on TRACE+.

The *very* first thing the nextTuple method does is emit a log
message. Grepping for this message shows that it stops appearing at
10:40, so exactly at the same time that the topology hangs. (The rest
of the threads keep processing the tuples they already have, then they
stop when starved.)

netstat shows the RMQ connections are still up, and the RMQ cluster
agrees (sees 16 consumers on the queue).

Looking at the metrics, the internal queues are empty (population=0 or
1 everywhere on every worker). A thread dump shows this for the spout:

"Thread-44-rabbit" #76 prio=5 os_prio=0 tid=0x00007fcacc129000
nid=0x1a9e waiting on condition [0x00007fca3d1d4000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at backtype.storm.spout.SleepSpoutWaitStrategy.emptyEmit(SleepSpoutWaitStrategy.java:36)
    at backtype.storm.daemon.executor$fn__3371$fn__3386$fn__3415.invoke(executor.clj:582)
    at backtype.storm.util$async_loop$fn__460.invoke(util.clj:463)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:745)

That *looks* like it's just running the spout wait strategy
(configured for 5000ms), but then the log message should appear, so it
really seems nextTuple is not called.

Killing an RMQ connection from the admin interface triggers an
exception from the RMQ client, and then nothing. The client is
designed to reconnect the next time its receive method is called,
which it isn't (since nextTuple is not called, presumably).

We used to have this type of problem as well when using Zaqar as a
message queue. This makes the problem even harder to debug, since
Zaqar was at the time purely HTTP polling (running GET /messages in a
loop, basically, so no blocking). Restarting the topology works but
I'd rather that not become a permanent thing...

Does anybody have an idea where I should start looking?

-- 
Fabrice Gabolde

Re: Can't deploy a a storm topology

Posted by Boris Prochazka <bo...@viaplay.com>.
As I posted, I had some problems deploying a topology from OS X running a virtualised Docker host in VirtualBox. The problem is that the client connects to a NATed nimbus. The contacted nimbus replies with the nimbus leader's IP address and asks the client to use it instead. The returned address is the private IP and port, which can never be reached from the Storm client.

The workaround I found is to deploy the topology from the Docker host itself or directly from the nimbus. The Docker host is hard to modify and all modifications are lost after a restart, so the only reasonable way to deploy topologies is to copy the jar to the nimbus and deploy it there.

Boris Prochazka
boris.prochazka@viaplay.com
Phone: +46 70 5125122
Skypeto: boris.prochazka
--------------------------------------------------------------------
"Perfection is achieved, not when there is nothing more to add,
 but when there is nothing left to take away."
 - Antoine de Saint-Exupery (1900 - 1944)


Re: Topology stuck, doesn't call nextTuple

Posted by Larry Akah <la...@gmail.com>.
unsubscribe

2016-03-11 9:53 GMT+01:00 Fabrice Gabolde <fg...@weborama.com>:

> On Mar 10, 2016 19:33, "Abhishek Agarwal" <ab...@gmail.com> wrote:
> >
> > Are these tuples being acked by bolts?
>
> All our bolts are BaseBasicBolts, so I would assume they are.
>
> How would I check for this?
>
> --
> Fabrice Gabolde
>



-- 
*Akah Larry N.H*

*Android Platform Engineer*
*Founder IceTeck*
*www.iceteck.com*

Developing technologies for emergence and sustainable development.

Re: Topology stuck, doesn't call nextTuple

Posted by Abhishek Agarwal <ab...@gmail.com>.
The acked count should be consistent with the emitted count. Acking is done
only after the bolt's execute method returns. If your bolt is stuck, the
topology will be stuck as well, so you may want to check that too.
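
To illustrate why unacked tuples could produce exactly the thread dump in the
original post, here is a rough sketch in plain Java. It is not Storm code. It
is a simplified model of how I understand the spout executor loop to behave,
and it assumes topology.max.spout.pending is set, which the original post
does not say. The names SpoutLoopSketch, Result and simulate are made up for
this example. In a real topology, tuples that time out are failed and leave
the pending set; the sketch does not model that.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical, simplified model of a spout executor loop with a cap on
 * pending tuples. Illustration only; this is NOT Storm's implementation.
 */
class SpoutLoopSketch {

    /** Counters collected over one simulated run. */
    static final class Result {
        final int nextTupleCalls;
        final int emptyEmitCalls;

        Result(int nextTupleCalls, int emptyEmitCalls) {
            this.nextTupleCalls = nextTupleCalls;
            this.emptyEmitCalls = emptyEmitCalls;
        }
    }

    /**
     * Runs the loop for a fixed number of iterations.
     *
     * @param maxSpoutPending cap on emitted-but-unacked tuples (assumed to be set)
     * @param iterations      number of loop iterations to simulate
     * @param boltsAck        whether downstream bolts ack the tuples they receive
     */
    static Result simulate(int maxSpoutPending, int iterations, boolean boltsAck) {
        Deque<Integer> pending = new ArrayDeque<>(); // emitted, not yet acked
        int nextTupleCalls = 0;
        int emptyEmitCalls = 0;
        int nextId = 0;

        for (int i = 0; i < iterations; i++) {
            // Model one ack arriving per iteration when bolts are healthy.
            if (boltsAck && !pending.isEmpty()) {
                pending.removeFirst();
            }
            if (pending.size() < maxSpoutPending) {
                // Below the cap: nextTuple is called and emits one tuple.
                nextTupleCalls++;
                pending.addLast(nextId++);
            } else {
                // At the cap: nextTuple is skipped and the wait strategy
                // (a sleep, in the thread dump above) runs instead.
                emptyEmitCalls++;
            }
        }
        return new Result(nextTupleCalls, emptyEmitCalls);
    }

    public static void main(String[] args) {
        Result healthy = simulate(4, 100, true);
        Result stuck = simulate(4, 100, false);
        System.out.println("acking:     nextTuple=" + healthy.nextTupleCalls
                + " emptyEmit=" + healthy.emptyEmitCalls);
        System.out.println("not acking: nextTuple=" + stuck.nextTupleCalls
                + " emptyEmit=" + stuck.emptyEmitCalls);
    }
}
```

With a cap of 4 and 100 iterations, the acking run calls nextTuple 100 times
and never sleeps, while the non-acking run calls nextTuple 4 times and then
sleeps for the remaining 96 iterations. If the real loop behaves like this,
a spout thread parked in the wait strategy with a silent nextTuple is what you
would expect to see once the pending set is full.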

On Fri, Mar 11, 2016 at 2:23 PM, Fabrice Gabolde <fg...@weborama.com>
wrote:

> On Mar 10, 2016 19:33, "Abhishek Agarwal" <ab...@gmail.com> wrote:
> >
> > Are these tuples being acked by bolts?
>
> All our bolts are BaseBasicBolts, so I would assume they are.
>
> How would I check for this?
>
> --
> Fabrice Gabolde
>



-- 
Regards,
Abhishek Agarwal

Re: Topology stuck, doesn't call nextTuple

Posted by Fabrice Gabolde <fg...@weborama.com>.
On Mar 10, 2016 19:33, "Abhishek Agarwal" <ab...@gmail.com> wrote:
>
> Are these tuples being acked by bolts?

All our bolts are BaseBasicBolts, so I would assume they are.

How would I check for this?

-- 
Fabrice Gabolde

Re: Topology stuck, doesn't call nextTuple

Posted by Abhishek Agarwal <ab...@gmail.com>.
Are these tuples being acked by bolts?

Excuse typos