Posted to user@storm.apache.org by DashengJu <da...@gmail.com> on 2014/08/20 12:26:18 UTC

worker always timeout to heartbeat and was restarted by supervisor

hi, all

In our production environment, we have a topology named
logparser_mobile_nginx. It has 50 workers; the spout has 48 executors,
bolt_parser has 1000 executors, and bolt_saver has 50 executors.

The topology runs normally most of the time, but 1~5 workers are restarted
every 1~2 hours. When we check the supervisor and worker logs, we find that
1) the worker has no error or exception; 2) the supervisor says the worker
did not heartbeat and timed out.

Because the worker has no log, I do not know why it did not heartbeat.
Does anyone have any ideas on how to investigate?
0) Is it caused by the worker process exiting?
1) Is it related to a GC problem?
2) Is it related to a memory problem? If so, I would expect the JVM to
report an out-of-memory error in the worker log.

By the way, some small topologies work well in the same environment.
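To check the GC angle (question 1) without changing anything else, one
option is to turn on GC logging for the workers and see whether long pauses
line up with the supervisor timeouts. A minimal sketch against storm.yaml,
assuming the existing 1 GB heap; the flag set is illustrative, and %ID% is
the per-worker placeholder Storm substitutes into worker.childopts (it also
appears in the fix later in this thread):

    # storm.yaml on the supervisor nodes -- illustrative sketch, not the actual cluster file
    worker.childopts: "-Xmx1024m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:logs/gc-worker-%ID%.log"

If the GC log shows stop-the-world pauses approaching the supervisor worker
timeout (supervisor.worker.timeout.secs, 30 seconds in the stock
configuration) right before a worker is declared :timed-out, the heap is the
likely culprit.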

below is the supervisor log:
----------------------------------------------------------------------------------------------
2014-08-20 15:51:33 b.s.d.supervisor [INFO] 90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started

2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901. Process is probably already dead.
2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921. Process is probably already dead.
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de

2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_nginx-265-1408521077", :executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068. Process is probably already dead.
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1

2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976. Process is probably already dead.
2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986. Process is probably already dead.
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757

2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "app_upload_urls-218-1408503096", :executors #{[8 8] [40 40] [24 24] [-1 -1]}, :port 9713}
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177. Process is probably already dead.
2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba

-- 
dashengju
+86 13810875910
dashengju@gmail.com

Re: worker always timeout to heartbeat and was restarted by supervisor

Posted by DashengJu <da...@gmail.com>.
Thanks for your reply.

We have solved the problem. I think the main cause was GC.
We changed the worker.childopts params and the problem has not happened since.

------------------------------------------------
before config: -Xmx1024m
------------------------------------------------
after config: -Xmx2600m -Xms2600m -Xss256k -XX:MaxPermSize=128m
-XX:PermSize=96m -XX:NewSize=1000m -XX:MaxNewSize=1000m
-XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=6 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-server -XX:+AggressiveOpts -XX:+UseCompressedOops -Djava.awt.headless=true
-Djava.net.preferIPv4Stack=true -Xloggc:logs/gc-worker-%ID%.log -verbose:gc
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10m
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps
-XX:+PrintClassHistogram -XX:+PrintTenuringDistribution
-XX:-PrintGCApplicationStoppedTime -XX:-PrintGCApplicationConcurrentTime
-XX:+PrintCommandLineFlags -XX:+PrintFlagsFinal
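For readers who want to apply the same change: these options live under
worker.childopts in storm.yaml on the supervisor hosts, and the supervisors
typically need a restart before newly launched workers pick them up. A
sketch only, abbreviated to the flags that matter most here; the values are
the ones quoted above, while the YAML layout is an assumption:

    # storm.yaml (supervisor nodes) -- sketch of the fix described above
    worker.childopts: >-
      -Xmx2600m -Xms2600m -Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m
      -XX:NewSize=1000m -XX:MaxNewSize=1000m -XX:MaxTenuringThreshold=1
      -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
      -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
      -Xloggc:logs/gc-worker-%ID%.log -verbose:gc -XX:+PrintGCDetails
      -XX:+PrintGCTimeStamps

The >- folded scalar simply joins the indented lines into one option string.
Note that -Xms2600m reserves the full heap up front, so each supervisor host
needs roughly 2.6 GB of RAM per worker slot it runs.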


On Thu, Aug 21, 2014 at 6:16 AM, Derek Dagit <de...@yahoo-inc.com> wrote:

> 1) Is it related to a GC problem?
>>
>
> This is usually the cause.  As the worker consumes more and more of the
> heap, gc takes longer and longer each time.  Eventually it takes so long
> the heartbeats to the supervisor do not happen.
>
> There could be a spike or skew in your data such that one or more workers
> cannot handle it with their heap settings.
>
> --
> Derek
>


-- 
dashengju
+86 13810875910
dashengju@gmail.com

Re: worker always timeout to heartbeat and was restarted by supervisor

Posted by Derek Dagit <de...@yahoo-inc.com>.
> 1) Is it related to a GC problem?

This is usually the cause.  As the worker consumes more and more of the heap, gc takes longer and longer each time.  Eventually it takes so long the heartbeats to the supervisor do not happen.

There could be a spike or skew in your data such that one or more workers cannot handle it with their heap settings.
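Since the smaller topologies on the same cluster behave fine, one way to act
on this without raising the heap for every worker is a per-topology override.
This is a sketch only, assuming the 0.9.x behaviour where
topology.worker.childopts is appended to the cluster-wide worker.childopts;
the 2 GB figure is arbitrary:

    # supplied with the heavy topology (or as a default in storm.yaml) -- sketch
    topology.worker.childopts: "-Xmx2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

Together with the GC logs from the affected workers, this makes it easy to
see whether the pauses shrink once the heap matches the data spike.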

-- 
Derek


Re: worker always timeout to heartbeat and was restarted by supervisor

Posted by DashengJu <da...@gmail.com>.
Thanks for your reply.

We have solved the problem. I think the main cause was GC.
We changed the worker.childopts params and the problem has not happened since.

------------------------------------------------
before config: -Xmx1024m
------------------------------------------------
after config: -Xmx2600m -Xms2600m -Xss256k -XX:MaxPermSize=128m
-XX:PermSize=96m -XX:NewSize=1000m -XX:MaxNewSize=1000m
-XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=6 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-server -XX:+AggressiveOpts -XX:+UseCompressedOops -Djava.awt.headless=true
-Djava.net.preferIPv4Stack=true -Xloggc:logs/gc-worker-%ID%.log -verbose:gc
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10m
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps
-XX:+PrintClassHistogram -XX:+PrintTenuringDistribution
-XX:-PrintGCApplicationStoppedTime -XX:-PrintGCApplicationConcurrentTime
-XX:+PrintCommandLineFlags -XX:+PrintFlagsFinal


On Thu, Aug 28, 2014 at 10:28 PM, Bobby Evans <ev...@yahoo-inc.com.invalid>
wrote:

> Sorry I have not responded sooner, I am trying to catch up on the mailing
> list.
>
> In my experience there are only two times when a worker times out to the
> supervisor.  One is when it has died, and the other is when it is stuck
> doing GC and cannot get enough cpu to write out a small file.  I have seen
> another really rare time when the disk filled up and a lot of bad things
> started to happen.
>
> Because the supervisor/kill complain that the process is already dead, it
> indicates to me that the process crashed for some reason, and you need to
> look at the worker logs to understand why it crashed.
>
> - Bobby
>


-- 
dashengju
+86 13810875910
dashengju@gmail.com

Re: worker always timeout to heartbeat and was restarted by supervisor

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.
Yes you are correct, I was assuming that streaming was not being used.

On 8/29/14, 4:13 AM, "DashengJu" <da...@gmail.com> wrote:

>“Because the supervisor/kill complain that the process is already dead, it
>indicates to me that the process crashed for some reason, and you need to
>look at the worker logs to understand why it crashed.”
>
>--------------------------------------------------------------------------
>@Evans, this does not necessarily indicate that the process crashed. The
>worker spawns some Python subprocesses; when the supervisor kills the
>worker, it may kill the Java process first, and the Python subprocesses
>exit at the same time, so when the supervisor then tries to kill the
>subprocess PIDs, it finds those processes have already exited.


Re: worker always timeout to heartbeat and was restarted by supervisor

Posted by DashengJu <da...@gmail.com>.
“Because the supervisor/kill complain that the process is already dead, it
indicates to me that the process crashed for some reason, and you need to
look at the worker logs to understand why it crashed.”

----------------------------------------------------------------------------------------
@Evans, this does not necessarily indicate that the process crashed. The
worker spawns some Python subprocesses; when the supervisor kills the
worker, it may kill the Java process first, and the Python subprocesses
exit at the same time, so when the supervisor then tries to kill the
subprocess PIDs, it finds those processes have already exited.


On Thu, Aug 28, 2014 at 10:28 PM, Bobby Evans <ev...@yahoo-inc.com.invalid>
wrote:

> Sorry I have not responded sooner, I am trying to catch up on the mailing
> list.
>
> In my experience there are only two times when a worker times out to the
> supervisor.  One is when it has died, and the other is when it is stuck
> doing GC and cannot get enough cpu to write out a small file.  I have seen
> another really rare time when the disk filled up and a lot of bad things
> started to happen.
>
> Because the supervisor/kill complain that the process is already dead, it
> indicates to me that the process crashed for some reason, and you need to
> look at the worker logs to understand why it crashed.
>
> - Bobby


-- 
dashengju
+86 13810875910
dashengju@gmail.com

Re: worker always timeout to heartbeat and was restarted by supervisor

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.
Sorry I have not responded sooner, I am trying to catch up on the mailing
list.

In my experience there are only two times when a worker times out to the
supervisor.  One is when it has died, and the other is when it is stuck
doing GC and cannot get enough cpu to write out a small file.  I have seen
another really rare time when the disk filled up and a lot of bad things
started to happen.
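For context, these are the knobs that tie the heartbeat and the timeout
together; a sketch with what I believe are the stock defaults (check
defaults.yaml for your version):

    # storm.yaml heartbeat/timeout settings -- defaults as I recall them, sketch only
    worker.heartbeat.frequency.secs: 1         # worker writes its local heartbeat every second
    supervisor.worker.timeout.secs: 30         # supervisor declares :timed-out after 30 s of silence
    supervisor.worker.start.timeout.secs: 120  # grace period behind the "still hasn't started" line

A single stop-the-world GC pause longer than supervisor.worker.timeout.secs
is therefore enough to produce the :timed-out entries shown in the
supervisor log earlier in the thread.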

Because the supervisor/kill complain that the process is already dead, it
indicates to me that the process crashed for some reason, and you need to
look at the worker logs to understand why it crashed.

- Bobby


