Posted to user@storm.apache.org by pwozniak <pw...@man.poznan.pl> on 2020/02/18 18:23:11 UTC
Some Workers are restarted all the time
Hi,
We are using Apache Storm 1.1.3 in our application. Several hours ago
Nimbus started restarting some (just some) of our workers over and over.
We are now struggling to work out what exactly is happening and what
the cause of the problem is.
Here is what we were able to find out:
1. Nimbus restarts some workers every 150 seconds, which equals our
config value for 'worker.heartbeat.frequency.secs' and
'task.heartbeat.frequency.secs';
2. We assumed that, for some reason, the executors (JVMs) were not able
to update their status in ZooKeeper. But when we checked the
corresponding zNodes, the last update dates of all of them were older
than 150 seconds (including the topologies that are running smoothly).
In that case, all the topologies should have been restarted by Nimbus.
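The staleness check we did by hand could be sketched roughly like this
(a hypothetical helper: the znode paths, the way the mtimes were
collected, e.g. via zkCli.sh 'stat', and the 150 s window are our
assumptions, not the actual layout of the cluster):

```python
import time

# Assumed window, matching worker.heartbeat.frequency.secs in our config.
HEARTBEAT_WINDOW_SECS = 150

def stale_workers(mtimes, now=None, window=HEARTBEAT_WINDOW_SECS):
    """Return znode paths whose mtime (epoch seconds) is older than `window`."""
    now = time.time() if now is None else now
    return [path for path, mtime in mtimes.items() if now - mtime > window]

# Made-up znode paths and timestamps, purely for illustration:
beats = {
    "/storm/workerbeats/enrichment_topology-6/6700": 1_000_000,
    "/storm/workerbeats/xslt_topology-21/6703": 1_000_350,
}
print(stale_workers(beats, now=1_000_400))
```

By this logic every znode older than the window should show up, which is
what confused us: the "healthy" topologies were just as stale as the
restarted ones.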
3. Then we realized that we also use Pacemaker, so maybe the dates in
ZooKeeper are actually as expected. But we found nothing in the
Pacemaker logs.
4. We searched for anything interesting in the Nimbus, supervisor, and
worker logs.
In the Nimbus log we only get a number of entries like:
2020-02-11 20:21:56.087 o.a.s.d.nimbus timer [INFO] Executor
enrichment_topology-3-1581414651:[24 24] not alive
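To confirm the ~150 s cadence, one thing we could do is group these "not
alive" entries by executor and measure the gap between consecutive
occurrences. A minimal sketch (the regex is tuned to the log line format
shown above and is our assumption about how every such entry looks):

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches lines like:
# 2020-02-11 20:21:56.087 o.a.s.d.nimbus timer [INFO] Executor \
#     enrichment_topology-3-1581414651:[24 24] not alive
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*"
    r"Executor\s+(\S+:\[\d+ \d+\]) not alive"
)

def not_alive_gaps(lines):
    """Map executor id -> seconds between consecutive 'not alive' entries."""
    seen = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            seen[m.group(2)].append(ts)
    return {
        ex: [int((b - a).total_seconds()) for a, b in zip(ts, ts[1:])]
        for ex, ts in seen.items()
    }
```

If the gaps come out at a steady 150 s for exactly the restarted
executors, that would at least narrow the problem to the heartbeat path
for those workers.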
In the supervisor logs:
2020-02-18 14:34:52.729 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING
msInState: 150043 topo:enrichment_topology-6-1581951947
worker:32de0a97-4a50-4967-873c-df189eb8ec94 -> RUNNING msInState: 150043
topo:enrichment_topology-6-1581951947
worker:32de0a97-4a50-4967-873c-df189eb8ec94
2020-02-18 14:34:52.730 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE RUNNING
2020-02-18 14:34:52.730 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700:
Assignment Changed from
LocalAssignment(topology_id:enrichment_topology-6-1581951947,
executors:[ExecutorInfo(task_start:51, task_end:51),
ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:35,
task_end:35), ExecutorInfo(task_start:19, task_end:19),
ExecutorInfo(task_start:67, task_end:67), ExecutorInfo(task_start:115,
task_end:115), ExecutorInfo(task_start:83, task_end:83),
ExecutorInfo(task_start:99, task_end:99), ExecutorInfo(task_start:85,
task_end:85), ExecutorInfo(task_start:53, task_end:53),
ExecutorInfo(task_start:101, task_end:101), ExecutorInfo(task_start:69,
task_end:69), ExecutorInfo(task_start:21, task_end:21),
ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:37,
task_end:37), ExecutorInfo(task_start:117, task_end:117),
ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:119,
task_end:119), ExecutorInfo(task_start:103, task_end:103),
ExecutorInfo(task_start:39, task_end:39), ExecutorInfo(task_start:71,
task_end:71), ExecutorInfo(task_start:23, task_end:23),
ExecutorInfo(task_start:55, task_end:55), ExecutorInfo(task_start:87,
task_end:87), ExecutorInfo(task_start:41, task_end:41),
ExecutorInfo(task_start:73, task_end:73), ExecutorInfo(task_start:105,
task_end:105), ExecutorInfo(task_start:9, task_end:9),
ExecutorInfo(task_start:57, task_end:57), ExecutorInfo(task_start:121,
task_end:121), ExecutorInfo(task_start:89, task_end:89),
ExecutorInfo(task_start:25, task_end:25), ExecutorInfo(task_start:75,
task_end:75), ExecutorInfo(task_start:107, task_end:107),
ExecutorInfo(task_start:123, task_end:123), ExecutorInfo(task_start:43,
task_end:43), ExecutorInfo(task_start:11, task_end:11),
ExecutorInfo(task_start:91, task_end:91), ExecutorInfo(task_start:27,
task_end:27), ExecutorInfo(task_start:59, task_end:59),
ExecutorInfo(task_start:125, task_end:125), ExecutorInfo(task_start:109,
task_end:109), ExecutorInfo(task_start:93, task_end:93),
ExecutorInfo(task_start:61, task_end:61), ExecutorInfo(task_start:13,
task_end:13), ExecutorInfo(task_start:29, task_end:29),
ExecutorInfo(task_start:45, task_end:45), ExecutorInfo(task_start:77,
task_end:77), ExecutorInfo(task_start:47, task_end:47),
ExecutorInfo(task_start:111, task_end:111), ExecutorInfo(task_start:127,
task_end:127), ExecutorInfo(task_start:63, task_end:63),
ExecutorInfo(task_start:31, task_end:31), ExecutorInfo(task_start:79,
task_end:79), ExecutorInfo(task_start:95, task_end:95),
ExecutorInfo(task_start:15, task_end:15), ExecutorInfo(task_start:81,
task_end:81), ExecutorInfo(task_start:97, task_end:97),
ExecutorInfo(task_start:33, task_end:33), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:65, task_end:65),
ExecutorInfo(task_start:129, task_end:129), ExecutorInfo(task_start:49,
task_end:49), ExecutorInfo(task_start:113, task_end:113),
ExecutorInfo(task_start:17, task_end:17)],
resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0),
owner:storm) to null
2020-02-18 14:34:52.730 o.a.s.d.s.Container SLOT_6700 [INFO] Killing
6654f8f2-6972-4238-af0b-65e196c8fb70:32de0a97-4a50-4967-873c-df189eb8ec94
which also does not tell us what the cause of the problem is.
And in the worker logs:
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Shutting down
worker xslt_topology-21-1580129826 6654f8f2-6972-4238-af0b-65e196c8fb70 6703
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Terminating
messaging context
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Shutting down
executors
5. We used JVisualVM to check whether, for some reason, the workers were
occupied with something else and unable to send heartbeats to
ZooKeeper/Pacemaker, but we found nothing interesting there.
All of this was without success.
Is there anything more we can do to find out the reason for this
behavior?
Additional questions:
1. Are there any other ways to debug and investigate the causes of
similar problems?
2. Maybe Pacemaker is the problem here:
- Where does it store its data?
- Can we read it somehow, as is possible for ZooKeeper (zkCli.sh)?
- Maybe some of the requests are not reaching Pacemaker; is there a
way to verify that?
Regards,
Pawel