Posted to dev@storm.apache.org by STEPHENS Conor <Co...@murex.com> on 2020/12/17 10:14:32 UTC

Slowness in recovery after upgrade to storm 2

Hi developers,

We have been users of Apache Storm for a number of years. Earlier this year we tried to upgrade from Storm 1.2.1 to Storm 2.

While validating this upgrade we noticed a high level of randomness in our chaos monkey test; this randomness is currently blocking our upgrade to Storm 2.

Test context

In our test we have 3 ZooKeeper instances, 2 Storm Nimbus instances, 2 Storm supervisors and 1 Storm UI instance. These instances are distributed across three separate VMs.

The test execution can be seen below.

1. Start 1 worker
2. Insert events and verify they have been processed
3. Begin inserting more events
4. Kill zookeeper 1, nimbus 1, supervisor 1
5. Restart zookeeper 1, nimbus 1, supervisor 1
6. Kill the worker
7. Kill zookeeper 2, nimbus 2, supervisor 2
8. Restart zookeeper 2, nimbus 2, supervisor 2
9. We repeat steps 4-8 until all the events inserted in step 3 are processed (a rough sketch of this loop follows the list)
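
For reference, the kill/restart loop in steps 4-9 amounts to something like the following Java driver. This is only a sketch: the hostnames, systemd service names, and the two placeholder helpers are stand-ins, not our actual harness.

    import java.io.IOException;

    public class ChaosLoop {
        // Run a command on a remote host over ssh and wait for it to finish.
        static void ssh(String host, String cmd) throws IOException, InterruptedException {
            new ProcessBuilder("ssh", host, cmd).inheritIO().start().waitFor();
        }

        // Kill and restart the given services on one host (steps 4-5 / 7-8).
        static void bounce(String host, String... services) throws Exception {
            for (String s : services) ssh(host, "sudo systemctl stop " + s);
            for (String s : services) ssh(host, "sudo systemctl start " + s);
        }

        public static void main(String[] args) throws Exception {
            while (!allEventsProcessed()) {                                     // step 9
                bounce("vm1", "zookeeper", "storm-nimbus", "storm-supervisor"); // steps 4-5
                killWorker();                                                   // step 6
                bounce("vm2", "zookeeper", "storm-nimbus", "storm-supervisor"); // steps 7-8
            }
        }

        // Placeholders for our real verification and worker-kill steps.
        static boolean allEventsProcessed() { return false; }
        static void killWorker() {}
    }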

The randomness we are seeing occurs in step 9: this test case usually takes ~6 minutes, but in some instances it takes up to ~20 minutes. When that happens the test times out after 15 minutes; after raising the timeout to 30 minutes, the test passes consistently.

We have done a considerable amount of analysis to try to understand this slowness but have not found the root cause, and would appreciate any advice you can offer. See below for the observations from our analysis; I can provide specific logs if that will help.

Analysis

The issue seems to be that while the Nimbus instances and supervisors are being killed and restarted, something happens that causes the supervisors to fail to find a Nimbus leader. This brings all processing to a stop until a Nimbus leader is found. Eventually the Nimbus leader is found and the workers resume processing.

During the wait period, the following exceptions are repeatedly logged in the supervisor logs (a sketch of the retry loop they imply follows the excerpt):

org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
o.a.s.l.AsyncLocalizer AsyncLocalizer Task Executor - 1 [ERROR] AsyncLocalizer cleanup failure
org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [dell998srv.fr.murex.com, mx28860vm.fr.murex.com]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?
o.a.s.u.NimbusClient AsyncLocalizer Task Executor - 1 [WARN] Ignoring exception while trying to get leader nimbus info from dell998srv.fr.murex.com. will retry with a different seed host.
o.a.s.u.NimbusClient timer [WARN] Ignoring exception while trying to get leader nimbus info from dell998srv.fr.murex.com. will retry with a different seed host.
o.a.s.d.s.t.ReportWorkerHeartbeats timer [ERROR] Send worker heartbeats to master exception
o.a.s.d.s.t.SynchronizeAssignments Thread-3 [ERROR] Get assignments from master exception
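
Reading these lines together, the supervisor-side discovery appears to iterate over the nimbus.seeds hosts, ignore per-host failures, and only give up when no seed reports an elected leader. A minimal sketch of that loop as we understand it (askForLeader() is a hypothetical stand-in for the underlying Thrift call; this is not the actual NimbusClient source):

    import java.util.List;

    public class LeaderDiscoverySketch {
        static String findLeader(List<String> seeds) {
            for (String seed : seeds) {
                try {
                    // Hypothetical stand-in for the Thrift leader-info request
                    // that the WARN lines show being retried per seed host.
                    return askForLeader(seed);
                } catch (Exception e) {
                    // "Ignoring exception ... will retry with a different seed host."
                }
            }
            throw new RuntimeException(
                "Could not find leader nimbus from seed hosts " + seeds);
        }

        static String askForLeader(String seed) throws Exception {
            throw new Exception("Connection refused");   // placeholder
        }

        public static void main(String[] args) {
            System.out.println(findLeader(List.of("nimbus-1", "nimbus-2")));
        }
    }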

The issue occurs at least 50% of the time (3 out of every 5 or 6 runs).

From the nimbus logs, we see that leadership switches from the first leader to the second Nimbus when the leader dies, so there is always a Nimbus leader even during the period when the supervisor is waiting for one.

Probably the most important observation is that the supervisors only seem to find the Nimbus leader when leadership returns to the original leader. For example, if Nimbus-1 gains leadership first, is killed, and Nimbus-2 gains leadership, the supervisors are unable to find the Nimbus leader while Nimbus-2 is the leader, and are able to find it again when Nimbus-2 dies and Nimbus-1 regains leadership.
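
To narrow down whether this is a discovery problem on the supervisor side or stale leader information on the nimbus side, one thing we could try while the supervisors are stuck is asking each nimbus directly who it believes the leader is. A sketch, assuming the Thrift getLeader() call exposed through NimbusClient (default port and error handling simplified):

    import java.util.Map;
    import org.apache.storm.generated.NimbusSummary;
    import org.apache.storm.utils.NimbusClient;
    import org.apache.storm.utils.Utils;

    public class LeaderCheck {
        public static void main(String[] args) throws Exception {
            Map<String, Object> conf = Utils.readStormConfig();
            for (String host : args) {   // pass the two nimbus hostnames
                // Connect to each candidate nimbus and ask for its leader view.
                try (NimbusClient client = new NimbusClient(conf, host, 6627)) {
                    NimbusSummary leader = client.getClient().getLeader();
                    System.out.println(host + " reports leader: " + leader.get_host());
                } catch (Exception e) {
                    System.out.println(host + " unreachable: " + e.getMessage());
                }
            }
        }
    }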

Kind regards,
Conor

Re: Slowness in recovery after upgrade to storm 2

Posted by Aaron Gresch <ag...@gmail.com>.
I would file a JIRA with these details for tracking.

If your observation about your test is correct, it might be useful to
simplify your test to isolate the problem. Could you simply kill nimbus 1
and leave it down, and then see whether the supervisor is able to recover?
If that works, add a second variable such as the supervisor restart, and so
on; a skeleton of that first step is sketched below.
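
Something like the following, with each helper a placeholder for your environment-specific ssh/log-checking steps:

    public class SingleFailureTest {
        public static void main(String[] args) throws Exception {
            killNimbus1();            // one variable only: nimbus 1 down, and it stays down
            Thread.sleep(60_000);     // allow time for leader election and rediscovery
            if (!supervisorsFoundLeader()) {
                throw new AssertionError("supervisors never found the new leader");
            }
            // If this passes reliably, add the next variable (supervisor restart,
            // zookeeper kill, ...) one at a time until the failure reproduces.
        }

        // Placeholders: e.g. ssh + kill -9, and a check of supervisor logs or the UI.
        static void killNimbus1() {}
        static boolean supervisorsFoundLeader() { return true; }
    }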
