Posted to user@storm.apache.org by pwozniak <pw...@man.poznan.pl> on 2020/02/18 18:23:11 UTC
Some Workers are restarted all the time
Hi,
We are using Apache Storm 1.1.3 in our application. Several hours ago
Nimbus started restarting some (just some) of our workers over and over.
We are now struggling to work out what exactly is happening and what
the cause of the problem is.
Here is what we were able to find out:
1. Nimbus restarts some workers every 150 seconds, which equals our
config value for 'worker.heartbeat.frequency.secs' and
'task.heartbeat.frequency.secs';
2. We assumed that, for some reason, the executors (JVMs) were not able
to update their status in ZooKeeper. But when we checked the
corresponding zNodes, the last update dates of all of them were older
than 150 seconds (including the topologies that are running smoothly).
In that case, all the topologies should have been restarted by Nimbus.
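The staleness check we did by hand could be sketched roughly like this
(a hypothetical helper: the znode paths, the way the mtimes were
collected, e.g. via zkCli.sh 'stat', and the 150 s window are our
assumptions, not the actual layout of the cluster):

```python
import time

# Assumed window, matching worker.heartbeat.frequency.secs in our config.
HEARTBEAT_WINDOW_SECS = 150

def stale_workers(mtimes, now=None, window=HEARTBEAT_WINDOW_SECS):
    """Return znode paths whose mtime (epoch seconds) is older than `window`."""
    now = time.time() if now is None else now
    return [path for path, mtime in mtimes.items() if now - mtime > window]

# Made-up znode paths and timestamps, purely for illustration:
beats = {
    "/storm/workerbeats/enrichment_topology-6/6700": 1_000_000,
    "/storm/workerbeats/xslt_topology-21/6703": 1_000_350,
}
print(stale_workers(beats, now=1_000_400))
```

By this logic every znode older than the window should show up, which is
what confused us: the "healthy" topologies were just as stale as the
restarted ones.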
3. Then we realized that we also use Pacemaker, so maybe the dates in
ZooKeeper are actually as expected. But we found nothing in the
Pacemaker logs.
4. We searched for anything interesting in the Nimbus, supervisor, and
worker logs.
In the Nimbus log we only get a number of entries like:
2020-02-11 20:21:56.087 o.a.s.d.nimbus timer [INFO] Executor
enrichment_topology-3-1581414651:[24 24] not alive
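To confirm the ~150 s cadence, one thing we could do is group these "not
alive" entries by executor and measure the gap between consecutive
occurrences. A minimal sketch (the regex is tuned to the log line format
shown above and is our assumption about how every such entry looks):

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches lines like:
# 2020-02-11 20:21:56.087 o.a.s.d.nimbus timer [INFO] Executor \
#     enrichment_topology-3-1581414651:[24 24] not alive
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*"
    r"Executor\s+(\S+:\[\d+ \d+\]) not alive"
)

def not_alive_gaps(lines):
    """Map executor id -> seconds between consecutive 'not alive' entries."""
    seen = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            seen[m.group(2)].append(ts)
    return {
        ex: [int((b - a).total_seconds()) for a, b in zip(ts, ts[1:])]
        for ex, ts in seen.items()
    }
```

If the gaps come out at a steady 150 s for exactly the restarted
executors, that would at least narrow the problem to the heartbeat path
for those workers.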
In the supervisor logs:
2020-02-18 14:34:52.729 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING
msInState: 150043 topo:enrichment_topology-6-1581951947
worker:32de0a97-4a50-4967-873c-df189eb8ec94 -> RUNNING msInState: 150043
topo:enrichment_topology-6-1581951947
worker:32de0a97-4a50-4967-873c-df189eb8ec94
2020-02-18 14:34:52.730 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE RUNNING
2020-02-18 14:34:52.730 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700:
Assignment Changed from
LocalAssignment(topology_id:enrichment_topology-6-1581951947,
executors:[ExecutorInfo(task_start:51, task_end:51),
ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:35,
task_end:35), ExecutorInfo(task_start:19, task_end:19),
ExecutorInfo(task_start:67, task_end:67), ExecutorInfo(task_start:115,
task_end:115), ExecutorInfo(task_start:83, task_end:83),
ExecutorInfo(task_start:99, task_end:99), ExecutorInfo(task_start:85,
task_end:85), ExecutorInfo(task_start:53, task_end:53),
ExecutorInfo(task_start:101, task_end:101), ExecutorInfo(task_start:69,
task_end:69), ExecutorInfo(task_start:21, task_end:21),
ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:37,
task_end:37), ExecutorInfo(task_start:117, task_end:117),
ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:119,
task_end:119), ExecutorInfo(task_start:103, task_end:103),
ExecutorInfo(task_start:39, task_end:39), ExecutorInfo(task_start:71,
task_end:71), ExecutorInfo(task_start:23, task_end:23),
ExecutorInfo(task_start:55, task_end:55), ExecutorInfo(task_start:87,
task_end:87), ExecutorInfo(task_start:41, task_end:41),
ExecutorInfo(task_start:73, task_end:73), ExecutorInfo(task_start:105,
task_end:105), ExecutorInfo(task_start:9, task_end:9),
ExecutorInfo(task_start:57, task_end:57), ExecutorInfo(task_start:121,
task_end:121), ExecutorInfo(task_start:89, task_end:89),
ExecutorInfo(task_start:25, task_end:25), ExecutorInfo(task_start:75,
task_end:75), ExecutorInfo(task_start:107, task_end:107),
ExecutorInfo(task_start:123, task_end:123), ExecutorInfo(task_start:43,
task_end:43), ExecutorInfo(task_start:11, task_end:11),
ExecutorInfo(task_start:91, task_end:91), ExecutorInfo(task_start:27,
task_end:27), ExecutorInfo(task_start:59, task_end:59),
ExecutorInfo(task_start:125, task_end:125), ExecutorInfo(task_start:109,
task_end:109), ExecutorInfo(task_start:93, task_end:93),
ExecutorInfo(task_start:61, task_end:61), ExecutorInfo(task_start:13,
task_end:13), ExecutorInfo(task_start:29, task_end:29),
ExecutorInfo(task_start:45, task_end:45), ExecutorInfo(task_start:77,
task_end:77), ExecutorInfo(task_start:47, task_end:47),
ExecutorInfo(task_start:111, task_end:111), ExecutorInfo(task_start:127,
task_end:127), ExecutorInfo(task_start:63, task_end:63),
ExecutorInfo(task_start:31, task_end:31), ExecutorInfo(task_start:79,
task_end:79), ExecutorInfo(task_start:95, task_end:95),
ExecutorInfo(task_start:15, task_end:15), ExecutorInfo(task_start:81,
task_end:81), ExecutorInfo(task_start:97, task_end:97),
ExecutorInfo(task_start:33, task_end:33), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:65, task_end:65),
ExecutorInfo(task_start:129, task_end:129), ExecutorInfo(task_start:49,
task_end:49), ExecutorInfo(task_start:113, task_end:113),
ExecutorInfo(task_start:17, task_end:17)],
resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0),
owner:storm) to null
2020-02-18 14:34:52.730 o.a.s.d.s.Container SLOT_6700 [INFO] Killing
6654f8f2-6972-4238-af0b-65e196c8fb70:32de0a97-4a50-4967-873c-df189eb8ec94
which also does not tell us what the cause of the problem is.
And in the worker logs:
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Shutting down
worker xslt_topology-21-1580129826 6654f8f2-6972-4238-af0b-65e196c8fb70 6703
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Terminating
messaging context
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Shutting down
executors
5. We used JVisualVM to check whether, for some reason, the workers were
occupied with something else and unable to send heartbeats to
ZooKeeper/Pacemaker, but we found nothing interesting there.
All of this was without success.
Is there anything more we can do to find out the reason for this
behavior?
Additional questions:
1. Are there any other ways to debug and investigate the causes of
similar problems?
2. Maybe Pacemaker is the problem here:
- Where does it store its data?
- Can we read it somehow, as is possible for ZooKeeper (zkCli.sh)?
- Maybe some of the requests are not reaching Pacemaker; is there a
way to verify that?
Regards,
Pawel