Posted to user@storm.apache.org by Vinay Pothnis <vi...@gmail.com> on 2014/08/06 02:13:54 UTC

High CPU utilization after storm node failover

[Storm Version: 0.9.2-incubating]

Hello,

I am trying to test failover scenarios with my storm cluster. The following
are the details of the cluster:

* 4 nodes
* Each node with 2 slots
* Topology with around 600 spouts and bolts
* Num. Workers for the topology = 4

I am running a test that generates a constant load. The cluster handles this
load fairly well, and the CPU utilization at this point is below 50% on all
the nodes. One slot is occupied on each of the nodes.
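
For reference, this is roughly how the topology is configured and submitted
(a simplified sketch; the class and component names below are made up, and
the 2 slots per node come from supervisor.slots.ports in storm.yaml rather
than from this code):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.TopologyBuilder;

public class FailoverTestTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Stand-in component; the real topology registers ~600 spouts and bolts.
        builder.setSpout("test-spout", new TestWordSpout(), 4);

        Config conf = new Config();
        // One worker (JVM) per node on the 4-node cluster.
        conf.setNumWorkers(4);

        StormSubmitter.submitTopology("failover-test", conf, builder.createTopology());
    }
}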

I then bring down one of the nodes (killing the supervisor and the worker
processes on it). After this, another worker is created on one of the
remaining nodes, but the CPU utilization jumps to 100%. At this point,
nimbus cannot communicate with the supervisor on that node and keeps
killing and restarting workers.
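
For context, my understanding is that the killing/restarting is driven by
heartbeat timeouts; here is a small sketch (using the standard 0.9.x config
key constants) to print the values in effect on a node:

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.utils.Utils;

public class PrintTimeouts {
    public static void main(String[] args) {
        // Reads defaults.yaml plus the local storm.yaml, the same way the daemons do.
        Map conf = Utils.readStormConfig();
        // Nimbus reassigns executors whose heartbeats are older than this.
        System.out.println("nimbus.task.timeout.secs = "
                + conf.get(Config.NIMBUS_TASK_TIMEOUT_SECS));
        // Nimbus stops scheduling onto supervisors it has not heard from for this long.
        System.out.println("nimbus.supervisor.timeout.secs = "
                + conf.get(Config.NIMBUS_SUPERVISOR_TIMEOUT_SECS));
        // The supervisor kills workers whose local heartbeats are older than this.
        System.out.println("supervisor.worker.timeout.secs = "
                + conf.get(Config.SUPERVISOR_WORKER_TIMEOUT_SECS));
    }
}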

The CPU utilization remains pegged at 100% as long as the load is on. If I
stop the test and restart it after a while, the same setup with just 3
nodes works perfectly fine with lower CPU utilization.

Any pointers on how to figure out the reason for the high CPU utilization
during the failover?

Thanks
Vinay

Re: High CPU utilization after storm node failover

Posted by Srinath C <sr...@gmail.com>.
Thanks Vinay. That seemed to work fine for me, but let me re-test it.

Taylor, I'm away from lab resources right now, but once I'm able to, I'll
run some tests and report back with debug logs for nimbus, supervisor, and
worker on a sample topology. In my case, tuples were being processed at a
rate of around 15k per second.
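
The sample topology will be along these lines (a rough sketch using the
stock test spout; daemon-side debug logging still has to be enabled
separately in Storm's logging configuration):

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.TopologyBuilder;

public class DebugSampleTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);

        Config conf = new Config();
        conf.setDebug(true);  // topology.debug: workers log every emitted tuple

        // LocalCluster is just for a quick sanity check; the failover runs will
        // go through StormSubmitter against the real cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("debug-sample", conf, builder.createTopology());
        Thread.sleep(30000);
        cluster.shutdown();
    }
}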



On Thu, Aug 7, 2014 at 1:15 AM, Vinay Pothnis <vi...@gmail.com>
wrote:

> Update:
>
> I tested the scenario that Srinath proposed.
>
> * 4 nodes (lets say A, B, C, D) - each node with 1 slot/worker each.
> * Topology started with 3 workers. (Lets say A, B, C)
> * 1 node was spare. (D)
>
> Once the load was stabilized, I killed one of the nodes on which the
> topology was running (node A). The following things happened:
>
> * New worker was created on the previously spare node D.
> * The events from the load test continued to get processed.
> * But CPU utilization on nodes B and C (from the original set) shot up to
> 100%
> * Interesting thing was that it was the "system cpu time" that went up as
> opposed to "user cpu time"
> * It stayed at 100% for a few minutes. Then the workers on nodes B and C
> died.
> * The supervisor restarted the workers and the CPU utilization dropped
> back to normal.
> * Nodes B & C both had this exception (see below) when the worker died.
> Looks like they were trying to contact Node A that was brought down.
> * Once the new workers were started, the CPU came down to normal.
>
> Any ideas as to what might be happening?
>
> Thanks
> Vinay
>
> *ERROR 2014-08-06 18:59:45,373 [b.s.util
> Thread-427-disruptor-worker-transfer-queue] Async loop died!*
> *java.lang.RuntimeException: java.lang.RuntimeException: Remote address is
> not reachable. We will close this client Netty-Client-*
> *NODE-A/NODE-A-IP:6700*
> * at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at
> backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at
> backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at
> backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]*
> * at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]*
> *Caused by: java.lang.RuntimeException: Remote address is not reachable.
> We will close this client Netty-Client-NODE-A/NODE-A-IP:6700*
> * at backtype.storm.messaging.netty.Client.connect(Client.java:166)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at backtype.storm.messaging.netty.Client.send(Client.java:203)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at
> backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]*
> * ... 6 common frames omitted*
>
>
>
>
>
>
>
> On Wed, Aug 6, 2014 at 9:02 AM, Vinay Pothnis <vi...@gmail.com>
> wrote:
>
>> Hmm, I am trying to figure out what I can share to reproduce this.
>>
>> I will try this with a simple topology and see if this can be reproduced.
>> I will also try Srinath's approach of having only one worker/slot per node
>> and having a spare. If that works, I would have a somewhat "launchable"
>> scenario and I will have more time to investigate the high cpu utilization
>> after failover.
>>
>> Thanks
>> Vinay
>>
>>
>> On Wed, Aug 6, 2014 at 8:24 AM, Bobby Evans <ev...@yahoo-inc.com> wrote:
>>
>>>  +1 for failure testing.  We have used other similar tools in the past
>>> to simulate different situations like network cuts, high packet loss, etc.
>>>  I would love to see more of this happen, and the scheduler get smart
>>> enough to detect these situations and deal with them.
>>>
>>>  - Bobby
>>>
>>>   From: "P. Taylor Goetz" <pt...@gmail.com>
>>> Reply-To: "user@storm.incubator.apache.org" <
>>> user@storm.incubator.apache.org>
>>> Date: Tuesday, August 5, 2014 at 8:15 PM
>>> To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
>>> Cc: "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
>>> Subject: Re: High CPU utilization after storm node failover
>>>
>>>   + dev@storm
>>>
>>>  Vinyasa/Srinath,
>>>
>>>  Anything you can share to make this reproducible would be very helpful.
>>>
>>>  I would love to see a network partition simulation framework for Storm
>>> along the lines of what Kyle Kingsbury has done with Jepsen [1]. It
>>> basically sets up a virtual cluster then simulates network partitions by
>>> manipulating iptables.
>>>
>>>  Jepsen [2] is written in clojure and Kyle is a strong proponent.
>>>
>>>  I think it is worth a look.
>>>
>>>  -Taylor
>>>
>>>  [1]
>>> http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
>>> [2] https://github.com/aphyr/jepsen
>>>
>>> On Aug 5, 2014, at 8:39 PM, Srinath C <sr...@gmail.com> wrote:
>>>
>>>   I have seen this behaviour too using 0.9.2-incubating.
>>> The failover works better when there is a redundant node available.
>>> Maybe 1 slot per node is the best approach.
>>>  Eager to know if there are any steps to further diagnose.
>>>
>>>
>>>
>>>
>>> On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <vi...@gmail.com>
>>> wrote:
>>>
>>>>  [Storm Version: 0.9.2-incubating]
>>>>
>>>>  Hello,
>>>>
>>>>  I am trying to test failover scenarios with my storm cluster. The
>>>> following are the details of the cluster:
>>>>
>>>>  * 4 nodes
>>>> * Each node with 2 slots
>>>> * Topology with around 600 spouts and bolts
>>>> * Num. Workers for the topology = 4
>>>>
>>>>  I am running a test that generating a constant load. The cluster is
>>>> able to handle this load fairly well and the CPU utilization at this point
>>>> is below 50% on all the nodes. 1 slot is occupied on each of the nodes.
>>>>
>>>>  I then bring down one of the nodes (kill the supervisor and the
>>>> worker processes on a node). After this, another worker is created on one
>>>> of the remaining nodes. But the CPU utilization jumps up to 100%. At this
>>>> point, nimbus cannot communicate with the supervisor on the node and keeps
>>>> killing and restarting workers.
>>>>
>>>>  The CPU utilization remains pegged at 100% as long as the load is on.
>>>> If I stop the tests and restart the test after a while, the same set up
>>>> with just 3 nodes works perfectly fine with less CPU utilization.
>>>>
>>>>  Any pointers to how to figure out the reason for the high CPU
>>>> utilization during the failover?
>>>>
>>>>  Thanks
>>>>  Vinay
>>>>
>>>>
>>>>
>>>
>>
>

Re: High CPU utilization after storm node failover

Posted by Vinay Pothnis <vi...@gmail.com>.
Update:

I tested the scenario that Srinath proposed.

* 4 nodes (let's say A, B, C, D), each with 1 slot/worker.
* Topology started with 3 workers (on nodes A, B, C).
* 1 node was spare (D).

Once the load had stabilized, I killed one of the nodes on which the
topology was running (node A). The following things happened:

* A new worker was created on the previously spare node D.
* The events from the load test continued to get processed.
* But CPU utilization on nodes B and C (from the original set) shot up to
100%.
* Interestingly, it was "system CPU time" that went up, as opposed to
"user CPU time".
* It stayed at 100% for a few minutes. Then the workers on nodes B and C
died.
* The supervisor restarted the workers and the CPU utilization dropped back
to normal.
* Nodes B and C both had the exception below when the worker died. It looks
like they were still trying to contact node A, which had been brought down.
* Once the new workers were started, the CPU came down to normal.

Any ideas as to what might be happening?

Thanks
Vinay

ERROR 2014-08-06 18:59:45,373 [b.s.util Thread-427-disruptor-worker-transfer-queue] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Remote address is not reachable. We will close this client Netty-Client-NODE-A/NODE-A-IP:6700
        at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
Caused by: java.lang.RuntimeException: Remote address is not reachable. We will close this client Netty-Client-NODE-A/NODE-A-IP:6700
        at backtype.storm.messaging.netty.Client.connect(Client.java:166) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.messaging.netty.Client.send(Client.java:203) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
        ... 6 common frames omitted
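
For completeness: the retry loop behind that "Remote address is not
reachable" error is governed by the Netty transport settings sketched below.
This is only something I may experiment with, not a confirmed fix; the
values are arbitrary, and these keys normally live in storm.yaml.

import backtype.storm.Config;

public class NettyRetryTuning {
    public static Config withBoundedNettyRetries() {
        Config conf = new Config();
        conf.put("storm.messaging.netty.max_retries", 10);    // give up on a dead peer sooner
        conf.put("storm.messaging.netty.min_wait_ms", 100);   // reconnect backoff lower bound
        conf.put("storm.messaging.netty.max_wait_ms", 2000);  // reconnect backoff upper bound
        return conf;
    }
}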







On Wed, Aug 6, 2014 at 9:02 AM, Vinay Pothnis <vi...@gmail.com>
wrote:

> Hmm, I am trying to figure out what I can share to reproduce this.
>
> I will try this with a simple topology and see if this can be reproduced.
> I will also try Srinath's approach of having only one worker/slot per node
> and having a spare. If that works, I would have a somewhat "launchable"
> scenario and I will have more time to investigate the high cpu utilization
> after failover.
>
> Thanks
> Vinay
>
>
> On Wed, Aug 6, 2014 at 8:24 AM, Bobby Evans <ev...@yahoo-inc.com> wrote:
>
>>  +1 for failure testing.  We have used other similar tools in the past
>> to simulate different situations like network cuts, high packet loss, etc.
>>  I would love to see more of this happen, and the scheduler get smart
>> enough to detect these situations and deal with them.
>>
>>  - Bobby
>>
>>   From: "P. Taylor Goetz" <pt...@gmail.com>
>> Reply-To: "user@storm.incubator.apache.org" <
>> user@storm.incubator.apache.org>
>> Date: Tuesday, August 5, 2014 at 8:15 PM
>> To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
>> Cc: "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
>> Subject: Re: High CPU utilization after storm node failover
>>
>>   + dev@storm
>>
>>  Vinyasa/Srinath,
>>
>>  Anything you can share to make this reproducible would be very helpful.
>>
>>  I would love to see a network partition simulation framework for Storm
>> along the lines of what Kyle Kingsbury has done with Jepsen [1]. It
>> basically sets up a virtual cluster then simulates network partitions by
>> manipulating iptables.
>>
>>  Jepsen [2] is written in clojure and Kyle is a strong proponent.
>>
>>  I think it is worth a look.
>>
>>  -Taylor
>>
>>  [1]
>> http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
>> [2] https://github.com/aphyr/jepsen
>>
>> On Aug 5, 2014, at 8:39 PM, Srinath C <sr...@gmail.com> wrote:
>>
>>   I have seen this behaviour too using 0.9.2-incubating.
>> The failover works better when there is a redundant node available. Maybe
>> 1 slot per node is the best approach.
>>  Eager to know if there are any steps to further diagnose.
>>
>>
>>
>>
>> On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <vi...@gmail.com>
>> wrote:
>>
>>>  [Storm Version: 0.9.2-incubating]
>>>
>>>  Hello,
>>>
>>>  I am trying to test failover scenarios with my storm cluster. The
>>> following are the details of the cluster:
>>>
>>>  * 4 nodes
>>> * Each node with 2 slots
>>> * Topology with around 600 spouts and bolts
>>> * Num. Workers for the topology = 4
>>>
>>>  I am running a test that generating a constant load. The cluster is
>>> able to handle this load fairly well and the CPU utilization at this point
>>> is below 50% on all the nodes. 1 slot is occupied on each of the nodes.
>>>
>>>  I then bring down one of the nodes (kill the supervisor and the worker
>>> processes on a node). After this, another worker is created on one of the
>>> remaining nodes. But the CPU utilization jumps up to 100%. At this point,
>>> nimbus cannot communicate with the supervisor on the node and keeps killing
>>> and restarting workers.
>>>
>>>  The CPU utilization remains pegged at 100% as long as the load is on.
>>> If I stop the tests and restart the test after a while, the same set up
>>> with just 3 nodes works perfectly fine with less CPU utilization.
>>>
>>>  Any pointers to how to figure out the reason for the high CPU
>>> utilization during the failover?
>>>
>>>  Thanks
>>>  Vinay
>>>
>>>
>>>
>>
>

Re: High CPU utilization after storm node failover

Posted by Vinay Pothnis <vi...@gmail.com>.
Hmm, I am trying to figure out what I can share to reproduce this.

I will try this with a simple topology and see if it can be reproduced. I
will also try Srinath's approach of having only one worker/slot per node
and having a spare. If that works, I would have a somewhat "launchable"
scenario and more time to investigate the high CPU utilization after
failover.

Thanks
Vinay


On Wed, Aug 6, 2014 at 8:24 AM, Bobby Evans <ev...@yahoo-inc.com> wrote:

>  +1 for failure testing.  We have used other similar tools in the past to
> simulate different situations like network cuts, high packet loss, etc.  I
> would love to see more of this happen, and the scheduler get smart enough
> to detect these situations and deal with them.
>
>  - Bobby
>
>   From: "P. Taylor Goetz" <pt...@gmail.com>
> Reply-To: "user@storm.incubator.apache.org" <
> user@storm.incubator.apache.org>
> Date: Tuesday, August 5, 2014 at 8:15 PM
> To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
> Cc: "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
> Subject: Re: High CPU utilization after storm node failover
>
>   + dev@storm
>
>  Vinyasa/Srinath,
>
>  Anything you can share to make this reproducible would be very helpful.
>
>  I would love to see a network partition simulation framework for Storm
> along the lines of what Kyle Kingsbury has done with Jepsen [1]. It
> basically sets up a virtual cluster then simulates network partitions by
> manipulating iptables.
>
>  Jepsen [2] is written in clojure and Kyle is a strong proponent.
>
>  I think it is worth a look.
>
>  -Taylor
>
>  [1]
> http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
> [2] https://github.com/aphyr/jepsen
>
> On Aug 5, 2014, at 8:39 PM, Srinath C <sr...@gmail.com> wrote:
>
>   I have seen this behaviour too using 0.9.2-incubating.
> The failover works better when there is a redundant node available. Maybe
> 1 slot per node is the best approach.
>  Eager to know if there are any steps to further diagnose.
>
>
>
>
> On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <vi...@gmail.com>
> wrote:
>
>>  [Storm Version: 0.9.2-incubating]
>>
>>  Hello,
>>
>>  I am trying to test failover scenarios with my storm cluster. The
>> following are the details of the cluster:
>>
>>  * 4 nodes
>> * Each node with 2 slots
>> * Topology with around 600 spouts and bolts
>> * Num. Workers for the topology = 4
>>
>>  I am running a test that generating a constant load. The cluster is
>> able to handle this load fairly well and the CPU utilization at this point
>> is below 50% on all the nodes. 1 slot is occupied on each of the nodes.
>>
>>  I then bring down one of the nodes (kill the supervisor and the worker
>> processes on a node). After this, another worker is created on one of the
>> remaining nodes. But the CPU utilization jumps up to 100%. At this point,
>> nimbus cannot communicate with the supervisor on the node and keeps killing
>> and restarting workers.
>>
>>  The CPU utilization remains pegged at 100% as long as the load is on.
>> If I stop the tests and restart the test after a while, the same set up
>> with just 3 nodes works perfectly fine with less CPU utilization.
>>
>>  Any pointers to how to figure out the reason for the high CPU
>> utilization during the failover?
>>
>>  Thanks
>>  Vinay
>>
>>
>>
>

Re: High CPU utilization after storm node failover

Posted by Bobby Evans <ev...@yahoo-inc.com>.
+1 for failure testing.  We have used similar tools in the past to simulate situations like network cuts, high packet loss, etc.  I would love to see more of this happen, and to see the scheduler get smart enough to detect these situations and deal with them.

- Bobby

From: "P. Taylor Goetz" <pt...@gmail.com>>
Reply-To: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
Date: Tuesday, August 5, 2014 at 8:15 PM
To: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
Cc: "dev@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <de...@storm.incubator.apache.org>>
Subject: Re: High CPU utilization after storm node failover

+ dev@storm

Vinyasa/Srinath,

Anything you can share to make this reproducible would be very helpful.

I would love to see a network partition simulation framework for Storm along the lines of what Kyle Kingsbury has done with Jepsen [1]. It basically sets up a virtual cluster then simulates network partitions by manipulating iptables.

Jepsen [2] is written in clojure and Kyle is a strong proponent.

I think it is worth a look.

-Taylor

[1] http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
[2] https://github.com/aphyr/jepsen

On Aug 5, 2014, at 8:39 PM, Srinath C <sr...@gmail.com> wrote:

I have seen this behaviour too using 0.9.2-incubating.
The failover works better when there is a redundant node available. Maybe 1 slot per node is the best approach.
Eager to know if there are any steps to further diagnose.




On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <vi...@gmail.com> wrote:
[Storm Version: 0.9.2-incubating]

Hello,

I am trying to test failover scenarios with my storm cluster. The following are the details of the cluster:

* 4 nodes
* Each node with 2 slots
* Topology with around 600 spouts and bolts
* Num. Workers for the topology = 4

I am running a test that generating a constant load. The cluster is able to handle this load fairly well and the CPU utilization at this point is below 50% on all the nodes. 1 slot is occupied on each of the nodes.

I then bring down one of the nodes (kill the supervisor and the worker processes on a node). After this, another worker is created on one of the remaining nodes. But the CPU utilization jumps up to 100%. At this point, nimbus cannot communicate with the supervisor on the node and keeps killing and restarting workers.

The CPU utilization remains pegged at 100% as long as the load is on. If I stop the tests and restart the test after a while, the same set up with just 3 nodes works perfectly fine with less CPU utilization.

Any pointers to how to figure out the reason for the high CPU utilization during the failover?

Thanks
Vinay




Re: High CPU utilization after storm node failover

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
+ dev@storm

Vinay/Srinath,

Anything you can share to make this reproducible would be very helpful.

I would love to see a network partition simulation framework for Storm along the lines of what Kyle Kingsbury has done with Jepsen [1]. It basically sets up a virtual cluster and then simulates network partitions by manipulating iptables.

Jepsen [2] is written in Clojure and Kyle is a strong proponent.

I think it is worth a look.

-Taylor

[1] http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
[2] https://github.com/aphyr/jepsen

> On Aug 5, 2014, at 8:39 PM, Srinath C <sr...@gmail.com> wrote:
> 
> I have seen this behaviour too using 0.9.2-incubating.
> The failover works better when there is a redundant node available. Maybe 1 slot per node is the best approach.
> Eager to know if there are any steps to further diagnose.
> 
> 
> 
> 
>> On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <vi...@gmail.com> wrote:
>> [Storm Version: 0.9.2-incubating]
>> 
>> Hello, 
>> 
>> I am trying to test failover scenarios with my storm cluster. The following are the details of the cluster:
>> 
>> * 4 nodes
>> * Each node with 2 slots
>> * Topology with around 600 spouts and bolts
>> * Num. Workers for the topology = 4
>> 
>> I am running a test that generating a constant load. The cluster is able to handle this load fairly well and the CPU utilization at this point is below 50% on all the nodes. 1 slot is occupied on each of the nodes. 
>> 
>> I then bring down one of the nodes (kill the supervisor and the worker processes on a node). After this, another worker is created on one of the remaining nodes. But the CPU utilization jumps up to 100%. At this point, nimbus cannot communicate with the supervisor on the node and keeps killing and restarting workers. 
>> 
>> The CPU utilization remains pegged at 100% as long as the load is on. If I stop the tests and restart the test after a while, the same set up with just 3 nodes works perfectly fine with less CPU utilization. 
>> 
>> Any pointers to how to figure out the reason for the high CPU utilization during the failover? 
>> 
>> Thanks
>> Vinay
> 

Re: High CPU utilization after storm node failover

Posted by Srinath C <sr...@gmail.com>.
I have seen this behaviour too with 0.9.2-incubating.
Failover works better when there is a redundant node available. Maybe 1
slot per node is the best approach.
I'm eager to know if there are any steps to diagnose this further.
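
One way to watch the reassignment during such a test would be to ask nimbus
for its cluster summary, e.g. something like this (a rough sketch against
the 0.9.2 Thrift API, with class and method names written from memory):

import java.util.Map;
import backtype.storm.generated.ClusterSummary;
import backtype.storm.generated.SupervisorSummary;
import backtype.storm.utils.NimbusClient;
import backtype.storm.utils.Utils;

public class SlotUsage {
    public static void main(String[] args) throws Exception {
        Map conf = Utils.readStormConfig();
        // Ask nimbus which supervisors it can see and how many slots are in use.
        NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
        ClusterSummary summary = nimbus.getClient().getClusterInfo();
        for (SupervisorSummary s : summary.get_supervisors()) {
            System.out.println(s.get_host() + ": "
                    + s.get_num_used_workers() + "/" + s.get_num_workers()
                    + " slots used");
        }
    }
}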




On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <vi...@gmail.com>
wrote:

> [Storm Version: 0.9.2-incubating]
>
> Hello,
>
> I am trying to test failover scenarios with my storm cluster. The
> following are the details of the cluster:
>
> * 4 nodes
> * Each node with 2 slots
> * Topology with around 600 spouts and bolts
> * Num. Workers for the topology = 4
>
> I am running a test that generating a constant load. The cluster is able
> to handle this load fairly well and the CPU utilization at this point is
> below 50% on all the nodes. 1 slot is occupied on each of the nodes.
>
> I then bring down one of the nodes (kill the supervisor and the worker
> processes on a node). After this, another worker is created on one of the
> remaining nodes. But the CPU utilization jumps up to 100%. At this point,
> nimbus cannot communicate with the supervisor on the node and keeps killing
> and restarting workers.
>
> The CPU utilization remains pegged at 100% as long as the load is on. If I
> stop the tests and restart the test after a while, the same set up with
> just 3 nodes works perfectly fine with less CPU utilization.
>
> Any pointers to how to figure out the reason for the high CPU utilization
> during the failover?
>
> Thanks
> Vinay
>
>
>