Posted to user@storm.apache.org by Yury Ruchin <yu...@gmail.com> on 2015/12/13 13:22:13 UTC

Cascading "not alive" in topology with Storm 0.9.5

Hello,

I'm running a large topology on Storm 0.9.5. I have 2.5K executors
distributed over 60 workers, 4-5 workers per node. The topology consumes
data from Kafka via a Kafka spout.
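
For context, here is a minimal sketch of how a topology at this scale gets
wired up and submitted (the ZooKeeper hosts, topic name, and ProcessingBolt
below are placeholders, not the real ones):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.ZkHosts;

    public class SubmitTopology {

        // Stand-in for the real processing bolts.
        public static class ProcessingBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                // real processing happens here
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt: no output fields
            }
        }

        public static void main(String[] args) throws Exception {
            // Kafka spout reading a placeholder topic via the brokers' ZooKeeper.
            ZkHosts zkHosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
            SpoutConfig spoutConfig =
                    new SpoutConfig(zkHosts, "events", "/kafka-spout", "my-topology");

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 60);
            // Parallelism adds up to roughly 2.5K executors in total.
            builder.setBolt("process", new ProcessingBolt(), 2400)
                   .shuffleGrouping("kafka-spout");

            Config conf = new Config();
            conf.setNumWorkers(60); // ends up as 4-5 worker processes per node
            StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
        }
    }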

I regularly observe Nimbus declaring topology workers dead due to heartbeat
timeouts. It then moves their executors to other workers, but soon another
worker times out. Nimbus moves its executors, and so on. The sequence repeats
over and over - in effect, there are cascading worker timeouts in the topology
that it cannot recover from. The topology itself looks alive but stops
consuming from Kafka and, as a result, stops processing altogether.

I didn't see any obvious issues with the network, so initially I assumed there
might be worker process failures caused by exceptions/errors inside the
process, e.g. an OOME. Nothing appeared in the worker logs. I then found that
the processes were actually alive when Nimbus declared them dead - it seems
they simply stopped sending heartbeats for some reason.
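
A quick way to check when each worker last wrote its heartbeat is to look at
the heartbeat znodes in ZooKeeper - a rough sketch, assuming the default
storm.zookeeper.root of /storm (the connect string and topology id are
placeholders):

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class HeartbeatAge {
        public static void main(String[] args) throws Exception {
            // Connect to the same ZooKeeper ensemble the Storm cluster uses.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> { });
            String root = "/storm/workerbeats/" + args[0]; // args[0] = topology id
            List<String> workers = zk.getChildren(root, false);
            for (String worker : workers) {
                Stat stat = zk.exists(root + "/" + worker, false);
                long ageMs = System.currentTimeMillis() - stat.getMtime();
                // A healthy worker rewrites its heartbeat every few seconds,
                // so a large age here means the heartbeats really stopped.
                System.out.println(worker + ": last heartbeat " + ageMs + " ms ago");
            }
            zk.close();
        }
    }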

I looked for Java fatal error logs, on the assumption that the failure might
be caused by something nasty happening at a low level - but found nothing.

I suspected high CPU usage, but it turned out that user + system CPU on the
nodes never went above 50-60% at peak. The regular load was even lower.

I observed the same issue with Storm 0.9.3, then upgraded to Storm 0.9.5
hoping that the fixes for https://issues.apache.org/jira/browse/STORM-329
and https://issues.apache.org/jira/browse/STORM-404 would help. They haven't.

Strangely enough, I can only reproduce the issue in this large setup. Small
test setups with 2 workers do not expose it - even after killing all worker
processes with kill -9 they recover seamlessly.

My other guess is that the large number of workers causes significant overhead
in establishing Netty connections during worker startup, which somehow
prevents heartbeats from being sent. Maybe this is something similar to
https://issues.apache.org/jira/browse/STORM-763 and it's worth upgrading to
0.9.6 - but I don't know how to verify that.

Any help is appreciated.

Re: Cascading "not alive" in topology with Storm 0.9.5

Posted by Kyle Nusbaum <kn...@yahoo-inc.com>.
The issue that we were seeing was that the heartbeats were so large and numerous that ZooKeeper was bogged down with writes, and it took Nimbus a long time to read them. By the time it finished reading them from ZooKeeper, they were old enough to cause a timeout.
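
A rough way to see this is to time how long a full read of the heartbeat
znodes takes, which is approximately what Nimbus does on each pass - a sketch,
assuming the default /storm root and a placeholder connect string; if the
sweep takes on the order of the timeout, you are hitting exactly this:

    import org.apache.zookeeper.ZooKeeper;

    public class HeartbeatSweepTimer {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> { });
            long start = System.currentTimeMillis();
            int count = 0;
            // Read every worker heartbeat of every topology, roughly what Nimbus
            // has to do before it can decide which workers are still alive.
            for (String topology : zk.getChildren("/storm/workerbeats", false)) {
                String topoPath = "/storm/workerbeats/" + topology;
                for (String worker : zk.getChildren(topoPath, false)) {
                    zk.getData(topoPath + "/" + worker, false, null);
                    count++;
                }
            }
            System.out.println("Read " + count + " heartbeats in "
                    + (System.currentTimeMillis() - start) + " ms");
            zk.close();
        }
    }
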
 -- Kyle 


Re: Cascading "not alive" in topology with Storm 0.9.5

Posted by Yury Ruchin <yu...@gmail.com>.
Hi Ravi, Kyle, thanks for the input!

I tried increasing the task timeout from 30 to 60 seconds - and still observed
the same issue. Increasing the timeout further does not look reasonable,
since it would affect Nimbus's ability to detect real crashes.

I was looking at ZooKeeper metrics and haven't noticed any anomalies - no
load spikes around the time of the heartbeat timeouts. I will double-check,
however.

Kyle,

Could you elaborate a bit on what the ZooKeeper issue looked like in your
case? Was it simply that a write call to ZooKeeper at times blocked for
longer than nimbus.task.timeout.secs?

2015-12-16 21:53 GMT+03:00 Kyle Nusbaum <kn...@yahoo-inc.com>:

> Yes, I would check Zookeeper.
> We've seen the exact same thing in large clusters, which is what this was
> designed to help solve: https://issues.apache.org/jira/browse/STORM-885
>
> -- Kyle

Re: Cascading "not alive" in topology with Storm 0.9.5

Posted by Kyle Nusbaum <kn...@yahoo-inc.com>.
Yes, I would check ZooKeeper. We've seen the exact same thing in large clusters, which is what this was designed to help solve: https://issues.apache.org/jira/browse/STORM-885
 -- Kyle 



RE: Cascading "not alive" in topology with Storm 0.9.5

Posted by Ravi Tandon <Ra...@microsoft.com>.
Try the following:


·       Increase the value of "nimbus.monitor.freq.secs" to 120. This will make Nimbus wait longer before declaring a worker dead. Also check other configs like "supervisor.worker.timeout.secs" that allow the system to wait longer before re-assigning/re-launching workers. (See the sketch after these bullets for a way to double-check which values the cluster actually picked up.)

·       Check the write load on the ZooKeeper nodes too; they may be the bottleneck of your cluster and its coordination rather than the worker nodes themselves. You can add ZK nodes or provide better-spec machines for the quorum.
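
As a quick way to double-check which values the daemons actually picked up,
something like this should work when run with the cluster's storm.yaml on the
classpath (a sketch, not production code):

    import java.util.Map;
    import backtype.storm.Config;
    import backtype.storm.utils.Utils;

    public class ShowTimeouts {
        public static void main(String[] args) {
            // Merges defaults.yaml with the storm.yaml found on the classpath.
            Map conf = Utils.readStormConfig();
            System.out.println("nimbus.monitor.freq.secs       = " + conf.get(Config.NIMBUS_MONITOR_FREQ_SECS));
            System.out.println("nimbus.task.timeout.secs       = " + conf.get(Config.NIMBUS_TASK_TIMEOUT_SECS));
            System.out.println("supervisor.worker.timeout.secs = " + conf.get(Config.SUPERVISOR_WORKER_TIMEOUT_SECS));
        }
    }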

-Ravi
