Posted to user@storm.apache.org by Yashwant Ganti <ya...@gmail.com> on 2014/11/14 19:11:33 UTC

Why would Zookeeper Timeouts lead to worker restarts?

Hello All,

We recently had a problem with our ZooKeeper infrastructure, and the
components in our topology were hitting timeouts when trying to establish a
connection. After a prolonged period of failed connection attempts and
thrown exceptions, I noticed that some of the workers were restarted. One of
the other devs mentioned that this was because the topology had to be
'rebalanced'. I did not quite understand this and was wondering if someone
could shed light on why Storm decided to restart the workers.

Thanks,
Yash

Re: Why would Zookeeper Timeouts lead to worker restarts?

Posted by Yashwant Ganti <ya...@gmail.com>.
Thanks for the info, guys. Hardening our ZK infrastructure and adding more
capacity to the cluster (to reduce the load on each bolt instance)
alleviated the problem. One clue was that, even while the ZK infrastructure
was having trouble, topologies with lightly loaded bolts were performing
much better than ours.
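For concreteness, a minimal sketch (not from this thread) of what "adding
capacity" can look like in topology code, assuming a 2014-era Storm API
(backtype.storm). The component names and the WordSpout/CountBolt classes
are hypothetical placeholders; the real point is the bolt parallelism hint
and the worker count.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class ScaledTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 2);   // hypothetical spout
        // Raise the bolt's parallelism hint so each executor handles fewer
        // tuples and has CPU headroom left for heartbeating.
        builder.setBolt("counter", new CountBolt(), 16)  // hypothetical bolt
               .shuffleGrouping("words");

        Config conf = new Config();
        conf.setNumWorkers(8); // more worker JVMs -> less load per process
        StormSubmitter.submitTopology("scaled-topology", conf,
                                      builder.createTopology());
    }
}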


Re: Why would Zookeeper Timeouts lead to worker restarts?

Posted by Itai Frenkel <It...@forter.com>.
Hello Yash,


Please note that Devang and Bobby described two different root causes of the same problem you are facing.


FYI, the problem Bobby described in ZooKeeper is discussed in
https://issues.apache.org/jira/browse/ZOOKEEPER-885

Itai



Re: Why would Zookeeper Timeouts lead to worker restarts?

Posted by Devang Shah <de...@gmail.com>.
We faced similar issues and identified the cause as heavy processing in the
bolts. GC was hogging the CPU cycles, so the workers didn't get time to
heartbeat to ZooKeeper, which eventually ended in worker restarts. Try
using the concurrent mark-sweep (CMS) GC algorithm.
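A minimal sketch of one way to do that, assuming a 2014-era Storm
(backtype.storm): CMS can be requested for the worker JVMs through
topology.worker.childopts (or cluster-wide via worker.childopts in
storm.yaml). The heap size and extra flags below are illustrative, not
prescriptive.

import backtype.storm.Config;

public class GcTuning {
    // Build a Config that asks the worker JVMs to run with the CMS
    // collector, trading some throughput for shorter GC pauses so the
    // heartbeat threads keep getting scheduled.
    public static Config withCms() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                 "-Xmx2g -XX:+UseConcMarkSweepGC"
                 + " -XX:+CMSParallelRemarkEnabled -verbose:gc");
        return conf;
    }
}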

Thanks and Regards,
Devang

Re: Why would Zookeeper Timeouts lead to worker restarts?

Posted by Bobby Evans <ev...@yahoo-inc.com>.
Workers communicate with Nimbus through ZooKeeper.  If ZK is under a lot of load and is not able to handle all of the write requests, Nimbus can think that the workers are not heartbeating, even when they are.  If this happens it will reschedule them, possibly on a different node or slot.  If that happens the Supervisors will dutifully obey and change things around to match the new scheduling.

If you rebalance your topology, Nimbus lies to the scheduler, telling it that none of the workers for your topology are running, and lets it pick all new places for them to run.  Then the Supervisors shoot and relaunch workers to make things match the new scheduling.
 - Bobby
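
The timeouts in play here are all configurable. A sketch of the main knobs,
assuming a 2014-era Storm; these are normally set in storm.yaml on the
cluster nodes, and the constants below only name the keys (defaults cited
from that era's defaults.yaml):

import backtype.storm.Config;

public class HeartbeatKnobs {
    public static void main(String[] args) {
        // ZooKeeper session timeout for Storm's ZK clients (default
        // 20000 ms): how long ZK waits before expiring a slow or
        // GC-paused client.
        System.out.println(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT);
        // How long Nimbus tolerates missing executor heartbeats before
        // declaring them dead and rescheduling (default 30 s).
        System.out.println(Config.NIMBUS_TASK_TIMEOUT_SECS);
        // How long a Supervisor tolerates a silent local worker before
        // shooting and restarting it (default 30 s).
        System.out.println(Config.SUPERVISOR_WORKER_TIMEOUT_SECS);
    }
}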
 

