Posted to dev@storm.apache.org by Zhechao Ma <ma...@gmail.com> on 2016/10/21 02:44:42 UTC

ShellBolt raise subprocess heartbeat timeout Exception

I made an issue (STORM-2150
<https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone can
help?

I've got a simple topology running with Storm 1.0.1. The topology consists
of a KafkaSpout and several Python multilang ShellBolts. I frequently get
the following exception.

java.lang.RuntimeException: subprocess heartbeat timeout
        at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

More information:
1. The topology runs with ACK mode.
2. The topology has 40 workers.
3. The topology emits about 10 million tuples every 10 minutes.

Every time the subprocess heartbeat times out, workers restart and the Python
processes exit with exitCode:-1, which hurts the processing capacity and
stability of the topology.

I've checked some related issues in the Storm Jira. I first found STORM-1946
<https://issues.apache.org/jira/browse/STORM-1946>, which reported a bug related
to this problem and said the bug had been fixed in Storm 1.0.2. However, I got
the same exception even after I upgraded Storm to 1.0.2.

I then checked other related issues to trace the history of this problem.
DashengJu first reported it for Non-ACK mode in STORM-738
<https://issues.apache.org/jira/browse/STORM-738>. STORM-742
<https://issues.apache.org/jira/browse/STORM-742> discussed an approach for
the ACK-mode case, and that bug appeared to be fixed in
0.10.0. I don't know whether that patch is included in the storm-1.x branch. In
short, the problem still exists in the latest stable version.
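
For context, this is roughly how such a topology is wired on the Java side. It is only a
sketch of the setup described above: the class name, script name, output field, and Kafka
settings are hypothetical placeholders following the usual storm-starter pattern for
multilang bolts, not the actual topology code.

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.task.ShellBolt;
    import org.apache.storm.topology.IRichBolt;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class NormalizeLogTopology {

        // Java shim for a Python multilang bolt: the Java class only names the
        // command to run; the actual tuple handling lives in the Python script.
        public static class NormalizeLogBolt extends ShellBolt implements IRichBolt {
            public NormalizeLogBolt() {
                super("python", "normalize_log.py");   // hypothetical script name
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("line"));
            }
            @Override
            public Map<String, Object> getComponentConfiguration() {
                return null;
            }
        }

        public static void main(String[] args) throws Exception {
            // storm-kafka 1.x spout; hosts, topic and zkRoot are placeholders.
            SpoutConfig spoutConf = new SpoutConfig(
                    new ZkHosts("zkhost:2181"), "log-topic", "/kafka-spout", "normalize-log-spout");

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 4);
            builder.setBolt("normalize-log", new NormalizeLogBolt(), 40)
                   .shuffleGrouping("kafka-spout");

            Config conf = new Config();
            conf.setNumWorkers(40);   // 40 workers, as described above
            StormSubmitter.submitTopology("normalize-log-topology", conf, builder.createTopology());
        }
    }

Each ShellBolt instance launches one Python subprocess, and the heartbeat timeout discussed
in this thread applies to each of those subprocesses.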

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Zhechao Ma <ma...@gmail.com>.
Thanks for your reply; it's much clearer to me now.

2016-10-25 21:35 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:

> First of all, ShellBolt communicates with the subprocess over stdin/stdout, which are not
> thread-safe, so everything is written with this restriction in mind.
>
> setHeartbeat() is called whenever the subprocess sends any message, which shows
> that the subprocess is still working. That's why the subprocess has to send some
> message (whether it responds to the heartbeat ping, emits tuples, or syncs)
> within the heartbeat timeout.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Tue, 25 Oct 2016 at 8:31 PM Zhechao Ma <ma...@gmail.com>
> wrote:
>
>> Besides, I have another question.
>>
>> In ShellBolt.java, the method setHeartbeat() is called in method run() of
>> class BoltReaderRunnable, but why not called in the same place in class
>> BoltWriterRunnable? In other words, when should setHeartbeat() be called?
>>
>> 2016-10-24 18:32 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>>
>> Really, there's timeout ! Last heartbeat doesn't update properly. I'll
>> try to adjust worker log format to determine which thread.
>> Thanks.
>>
>> 2016-10-24 17:14 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>
>> Sorry, but you seem to run multiple ShellBolts in one worker, so I can't
>> see which thread prints out which log line.
>> Could you add %t to the worker log format? You can just copy the content from
>> the link below into your worker.xml.
>> https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
>>
>> And there is an actual heartbeat timeout:
>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time :
>> 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>
>> 1477297083673 - 1477296962746 = 120927 > 120000
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <ma...@gmail.com> wrote:
>>
>> I find something strange. Looking at a worker.log, there are three subprocess
>> heartbeats, at the 673 ms, 678 ms, and 850 ms offsets. Everything is OK until 16:18:03.
>>
>>
>> ...
>> ...
>> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>>
>>
>> And then the heartbeat timed out at 16:18:03.675, just before the third
>> heartbeat check (850 ms) was logged. However, the timeout is set to 120000 ms, or 120 s.
>> How could that happen?
>>
>> ...
>> ...
>> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>         at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
>> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>         at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>>         ... 6 more
>> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>         at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>         ... 1 more
>> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>         ... 1 more
>> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>
>>
>>
>> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>>
>> I enabled DEBUG-level logging and can see the BOLT heartbeat information; the timeout
>> is 30000 ms, and everything looks OK.
>>
>> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>>
>> I will try to do this and reply later. Thanks.
>>
>> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>
>> Could you change your log level to DEBUG and look at the worker's log? If you use
>> Storm 1.x you can change the log level from the UI on the fly.
>> ShellBolt logs the subprocess heartbeat, but at
>> DEBUG level since it could produce lots of log output.
>>
>> Two lines:
>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>> - BOLT - sending heartbeat request to subprocess
>>
>> These two lines are logged about once per second. Please check that the logs
>> exist, that 'last heartbeat' is updated properly, and that the worker timeout
>> is set properly.
>>
>> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <ma...@gmail.com> wrote:
>>
>> I do not set "topology.subprocess.timeout.secs", so
>> "supervisor.worker.timeout.secs" will be used according to STORM-1314,
>> which is set to 30 for my cluster.
>> 30 seconds is a very big value; it will never take more than 30
>> seconds to process one of my tuples.
>> I think the problem must be somewhere else.
>>
>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>
>> There are many situations in which ShellBolt can hit the heartbeat issue, and at
>> least STORM-1946 is not the case here.
>>
>> How long does your tuple take to be processed? You need to set the subprocess
>> timeout ("topology.subprocess.timeout.secs") higher than the maximum
>> time needed to process a tuple. You can even set it to a fairly big value so that the
>> subprocess heartbeat issue will not happen.
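
A minimal sketch of raising this timeout when submitting a topology; the key string is the
one quoted above, while the 120-second value, the topology name, and the builder (from the
earlier sketch) are placeholders:

    Config conf = new Config();
    conf.setNumWorkers(40);
    // Without this key, the subprocess timeout falls back to
    // supervisor.worker.timeout.secs (see STORM-1314, mentioned earlier in this thread).
    conf.put("topology.subprocess.timeout.secs", 120);
    StormSubmitter.submitTopology("normalize-log-topology", conf, builder.createTopology());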
>>
>>
>> ShellBolt requires that each tuple is handled and acked within the heartbeat
>> timeout. I struggled to change this behavior so that the subprocess periodically
>> sends a heartbeat, but had no luck because of the GIL - global interpreter lock (same
>> for Ruby). We need to choose one: stick with this restriction, or disable
>> the subprocess heartbeat.
>>
>> I hope that we can resolve this issue cleanly, but I guess a multi-threaded
>> approach doesn't work in Python, Ruby, or any language that uses a GIL, and
>> I have no idea about alternatives.
>>
>> - Jungtaek Lim (HeartSaVioR).
>>
>> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <ma...@gmail.com> wrote:
>>
>> I made an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone
>> can
>> help?
>>
>> I've got a simple topology running with Storm 1.0.1. The topology consists
>> of a KafkaSpout and several python multilang ShellBolt. I frequently got
>> the following exceptions.
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout
>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> More information here:
>> 1. Topology run with ACK mode.
>> 2. Topology had 40 workers.
>> 3. Topology emitted about 10 million tuples every 10 minutes.
>>
>> Every time subprocess heartbeat timeout, workers would restart and python
>> processes exited with exitCode:-1, which affected processing capacity and
>> stability of the topology.
>>
>> I've checked some related issues from Storm Jira. I first found STORM-1946
>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
>> to this problem and said bug had been fixed in Storm 1.0.2. However I got
>> the same exception even after I upgraded Storm to 1.0.2.
>>
>> I checked other related issues. Let's look at history of this problem.
>> DashengJu first reported this problem with Non-ACK mode in STORM-738
>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
>> of
>> this problem with ACK mode, and it seemed that bug had been fixed in
>> 0.10.0. I don't know whether this patch is included in storm-1.x branch.
>> In
>> a word, this problem still exists in the latest stable version.
>>
>>
>>
>>
>>
>>
>> --
>> Thanks
>> Zhechao Ma
>>
>>
>>
>>
>> --
>> Thanks
>> Zhechao Ma
>>
>>
>>
>>
>> --
>> Thanks
>> Zhechao Ma
>>
>>
>>
>>
>> --
>> Thanks
>> Zhechao Ma
>>
>


-- 
Thanks
Zhechao Ma

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Jungtaek Lim <ka...@gmail.com>.
First of all, ShellBolt communicates with the subprocess over stdin/stdout, which are not
thread-safe, so everything is written with this restriction in mind.

setHeartbeat() is called whenever the subprocess sends any message, which shows
that the subprocess is still working. That's why the subprocess has to send some
message (whether it responds to the heartbeat ping, emits tuples, or syncs)
within the heartbeat timeout.
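
To make this concrete, here is a simplified, hypothetical paraphrase in Java of what the
heartbeat timer does, reconstructed from the DEBUG log lines and the stack trace earlier in
this thread; it is not the actual storm-core source.

    // Hypothetical, simplified paraphrase of ShellBolt's subprocess heartbeat handling.
    class SubprocessHeartbeat implements Runnable {
        private final long timeoutMs;              // topology.subprocess.timeout.secs * 1000
        private final Runnable onTimeout;          // what to do when the subprocess looks dead
        private volatile long lastHeartbeatMs = System.currentTimeMillis();

        SubprocessHeartbeat(long timeoutMs, Runnable onTimeout) {
            this.timeoutMs = timeoutMs;
            this.onTimeout = onTimeout;
        }

        // The reader thread calls this for every message that comes back from the
        // subprocess (heartbeat response, emit, ack, log, ...), which is why *any*
        // message resets the timer.
        void setHeartbeat() {
            lastHeartbeatMs = System.currentTimeMillis();
        }

        // Scheduled roughly once per second.
        @Override
        public void run() {
            long now = System.currentTimeMillis();
            // "BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}"
            if (now - lastHeartbeatMs > timeoutMs) {
                // surfaces as "java.lang.RuntimeException: subprocess heartbeat timeout"
                onTimeout.run();
                return;
            }
            // "BOLT - sending heartbeat request to subprocess"
        }
    }

Because only messages coming back from the subprocess reset the timer, a subprocess that
spends longer than the timeout on a single tuple (and therefore sends nothing back) will
trip the check.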

- Jungtaek Lim (HeartSaVioR)
On Tue, 25 Oct 2016 at 8:31 PM Zhechao Ma <ma...@gmail.com>
wrote:

> Besides, I have another question.
>
> In ShellBolt.java, the method setHeartbeat() is called in method run() of
> class BoltReaderRunnable, but why not called in the same place in class
> BoltWriterRunnable? In other words, when should setHeartbeat() be called?
>
> 2016-10-24 18:32 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>
> Really, there's timeout ! Last heartbeat doesn't update properly. I'll try
> to adjust worker log format to determine which thread.
> Thanks.
>
> 2016-10-24 17:14 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>
> Sorry but you seem to run multiple ShellBolts in one worker, so I can't
> see which threads print out which log.
> Could you add %t to worker log format? You can just copy the content from
> below link to your worker.xml.
> https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
>
> And there's actual heartbeat timeout occurred:
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time :
> 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>
> 1477297083673 - 1477296962746 = 120927 > 120000
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <ma...@gmail.com> wrote:
>
> I find something strange. I look at a worker.log, there three subprocess
> heartbeat, 673ms, 678ms, and 850ms. Everything is OK until 16:18:03.
>
>
> ...
> ...
> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>
>
> And then, heartbeat timeout just at 16:18:03.675, before the third
> heartbeat (850ms) sends out. However, timeout is set to 120000 ms, or 120s.
> How could that happen?
>
> ...
> ...
> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>         at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>         ... 6 more
> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>         ... 1 more
> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>         ... 1 more
> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>
>
>
> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>
> I open DEBUG-level log, and see the BOLT heartbeat information, timeout is
> 30000ms, everything looks OK.
>
> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>
> I will try to do this, and reply later. Thanks.
>
> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>
> Could you modify your log level to DEBUG and see worker's log? If you use
> Storm 1.x you can modify log level from UI on the fly.
> ShellBolt writes log regarding subprocess heartbeat but its level is DEBUG
> since it could produce lots of logs.
>
> Two lines:
> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
> - BOLT - sending heartbeat request to subprocess
>
> Two lines will be logged to each 1 second. Please check logs are existing,
> and 'last heartbeat' is updated properly, and also worker timeout is set
> properly.
>
> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <ma...@gmail.com> wrote:
>
> I do not set "topology.subprocess.timeout.secs", so "
> supervisor.worker.timeout.secs" will be used according to STORM-1314,
> which is set 30 for my cluster.
> 30 seconds is a very very very big value, it will never take more than 30
> seconds processing my tuple.
> I think there must be problem somewhere else.
>
> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>
> There're many situations for ShellBolt to trigger heartbeat issue, and at
> least STORM-1946 is not the case.
>
> How long does your tuple take to be processed? You need to set subprocess
> timeout seconds ("topology.subprocess.timeout.secs") to higher than max
> time to process. You can even set it fairly big value so that subprocess
> heartbeat issue will not happen.
>
>
> ShellBolt requires that each tuple is handled and acked within heartbeat
> timeout. I struggled to change this behavior for subprocess to periodically
> sends heartbeat, but no luck because of GIL - global interpreter lock (same
> for Ruby). We need to choose one: stick this restriction, or disable
> subprocess heartbeat.
>
> I hope that we can resolve this issue clearly, but I guess multi-thread
> approach doesn't work on Python, Ruby, and any language which uses GIL, and
> I have no idea on alternatives
>
> - Jungtaek Lim (HeartSaVioR).
>
> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <ma...@gmail.com> wrote:
>
> I made an issue (STORM-2150
> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone can
> help?
>
> I've got a simple topology running with Storm 1.0.1. The topology consists
> of a KafkaSpout and several python multilang ShellBolt. I frequently got
> the following exceptions.
>
> java.lang.RuntimeException: subprocess heartbeat timeout
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> More information here:
> 1. Topology run with ACK mode.
> 2. Topology had 40 workers.
> 3. Topology emitted about 10 million tuples every 10 minutes.
>
> Every time subprocess heartbeat timeout, workers would restart and python
> processes exited with exitCode:-1, which affected processing capacity and
> stability of the topology.
>
> I've checked some related issues from Storm Jira. I first found STORM-1946
> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
> to this problem and said bug had been fixed in Storm 1.0.2. However I got
> the same exception even after I upgraded Storm to 1.0.2.
>
> I checked other related issues. Let's look at history of this problem.
> DashengJu first reported this problem with Non-ACK mode in STORM-738
> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
> of
> this problem with ACK mode, and it seemed that bug had been fixed in
> 0.10.0. I don't know whether this patch is included in storm-1.x branch. In
> a word, this problem still exists in the latest stable version.
>
>
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
>
>
>
> --
> Thanks
> Zhechao Ma
>

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Zhechao Ma <ma...@gmail.com>.
Besides, I have another question.

In ShellBolt.java, the method setHeartbeat() is called in the run() method of
class BoltReaderRunnable, but why is it not called in the same place in class
BoltWriterRunnable? In other words, when should setHeartbeat() be called?

2016-10-24 18:32 GMT+08:00 Zhechao Ma <ma...@gmail.com>:

> Really, there's timeout ! Last heartbeat doesn't update properly. I'll try
> to adjust worker log format to determine which thread.
> Thanks.
>
> 2016-10-24 17:14 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>
>> Sorry but you seem to run multiple ShellBolts in one worker, so I can't
>> see which threads print out which log.
>> Could you add %t to worker log format? You can just copy the content from
>> below link to your worker.xml.
>> https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
>>
>> And there's actual heartbeat timeout occurred:
>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time :
>> 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>
>> 1477297083673 - 1477296962746 = 120927 > 120000
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <ma...@gmail.com> wrote:
>>
>>> I find something strange. I look at a worker.log, there three subprocess
>>> heartbeat, 673ms, 678ms, and 850ms. Everything is OK until 16:18:03.
>>>
>>>
>>> ...
>>> ...
>>> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
>>> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
>>> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>>>
>>>
>>> And then, heartbeat timeout just at 16:18:03.675, before the third
>>> heartbeat (850ms) sends out. However, timeout is set to 120000 ms, or 120s.
>>> How could that happen?
>>>
>>> ...
>>> ...
>>> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
>>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>>>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>>         at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
>>> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>>         at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>>>         ... 6 more
>>> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>>         at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>>         ... 1 more
>>> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>>         ... 1 more
>>> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
>>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>>
>>>
>>>
>>> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>>>
>>> I open DEBUG-level log, and see the BOLT heartbeat information, timeout
>>> is 30000ms, everything looks OK.
>>>
>>> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>>>
>>> I will try to do this, and reply later. Thanks.
>>>
>>> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>>
>>> Could you modify your log level to DEBUG and see worker's log? If you
>>> use Storm 1.x you can modify log level from UI on the fly.
>>> ShellBolt writes log regarding subprocess heartbeat but its level is
>>> DEBUG since it could produce lots of logs.
>>>
>>> Two lines:
>>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>>> - BOLT - sending heartbeat request to subprocess
>>>
>>> Two lines will be logged to each 1 second. Please check logs are
>>> existing, and 'last heartbeat' is updated properly, and also worker timeout
>>> is set properly.
>>>
>>> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <ma...@gmail.com> wrote:
>>>
>>> I do not set "topology.subprocess.timeout.secs", so "
>>> supervisor.worker.timeout.secs" will be used according to STORM-1314,
>>> which is set 30 for my cluster.
>>> 30 seconds is a very very very big value, it will never take more than
>>> 30 seconds processing my tuple.
>>> I think there must be problem somewhere else.
>>>
>>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>>
>>> There're many situations for ShellBolt to trigger heartbeat issue, and
>>> at least STORM-1946 is not the case.
>>>
>>> How long does your tuple take to be processed? You need to set
>>> subprocess timeout seconds ("topology.subprocess.timeout.secs") to
>>> higher than max time to process. You can even set it fairly big value so
>>> that subprocess heartbeat issue will not happen.
>>>
>>>
>>> ShellBolt requires that each tuple is handled and acked within heartbeat
>>> timeout. I struggled to change this behavior for subprocess to periodically
>>> sends heartbeat, but no luck because of GIL - global interpreter lock (same
>>> for Ruby). We need to choose one: stick this restriction, or disable
>>> subprocess heartbeat.
>>>
>>> I hope that we can resolve this issue clearly, but I guess multi-thread
>>> approach doesn't work on Python, Ruby, and any language which uses GIL, and
>>> I have no idea on alternatives
>>>
>>> - Jungtaek Lim (HeartSaVioR).
>>>
>>> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <ma...@gmail.com> wrote:
>>>
>>> I made an issue (STORM-2150
>>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone
>>> can
>>> help?
>>>
>>> I've got a simple topology running with Storm 1.0.1. The topology
>>> consists
>>> of a KafkaSpout and several python multilang ShellBolt. I frequently got
>>> the following exceptions.
>>>
>>> java.lang.RuntimeException: subprocess heartbeat timeout
>>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>         at java.lang.Thread.run(Thread.java:745)
>>>
>>> More information here:
>>> 1. Topology run with ACK mode.
>>> 2. Topology had 40 workers.
>>> 3. Topology emitted about 10 million tuples every 10 minutes.
>>>
>>> Every time subprocess heartbeat timeout, workers would restart and python
>>> processes exited with exitCode:-1, which affected processing capacity and
>>> stability of the topology.
>>>
>>> I've checked some related issues from Storm Jira. I first found
>>> STORM-1946
>>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug
>>> related
>>> to this problem and said bug had been fixed in Storm 1.0.2. However I got
>>> the same exception even after I upgraded Storm to 1.0.2.
>>>
>>> I checked other related issues. Let's look at history of this problem.
>>> DashengJu first reported this problem with Non-ACK mode in STORM-738
>>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>>> <https://issues.apache.org/jira/browse/STORM-742> discussed the
>>> approach of
>>> this problem with ACK mode, and it seemed that bug had been fixed in
>>> 0.10.0. I don't know whether this patch is included in storm-1.x branch.
>>> In
>>> a word, this problem still exists in the latest stable version.
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Thanks
>>> Zhechao Ma
>>>
>>>
>>>
>>>
>>> --
>>> Thanks
>>> Zhechao Ma
>>>
>>
>
>
> --
> Thanks
> Zhechao Ma
>



-- 
Thanks
Zhechao Ma

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Zhechao Ma <ma...@gmail.com>.
Really, there is a timeout! The last heartbeat doesn't update properly. I'll try
to adjust the worker log format to determine which thread it is.
Thanks.

2016-10-24 17:14 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:

> Sorry but you seem to run multiple ShellBolts in one worker, so I can't
> see which threads print out which log.
> Could you add %t to worker log format? You can just copy the content from
> below link to your worker.xml.
> https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
>
> And there's actual heartbeat timeout occurred:
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time :
> 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>
> 1477297083673 - 1477296962746 = 120927 > 120000
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <ma...@gmail.com> wrote:
>
>> I find something strange. I look at a worker.log, there three subprocess
>> heartbeat, 673ms, 678ms, and 850ms. Everything is OK until 16:18:03.
>>
>>
>> ...
>> ...
>> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>>
>>
>> And then, heartbeat timeout just at 16:18:03.675, before the third
>> heartbeat (850ms) sends out. However, timeout is set to 120000 ms, or 120s.
>> How could that happen?
>>
>> ...
>> ...
>> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>         at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
>> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>         at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>>         ... 6 more
>> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>         at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>         ... 1 more
>> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>         ... 1 more
>> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>
>>
>>
>> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>>
>> I open DEBUG-level log, and see the BOLT heartbeat information, timeout
>> is 30000ms, everything looks OK.
>>
>> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>>
>> I will try to do this, and reply later. Thanks.
>>
>> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>
>> Could you modify your log level to DEBUG and see worker's log? If you use
>> Storm 1.x you can modify log level from UI on the fly.
>> ShellBolt writes log regarding subprocess heartbeat but its level is
>> DEBUG since it could produce lots of logs.
>>
>> Two lines:
>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>> - BOLT - sending heartbeat request to subprocess
>>
>> Two lines will be logged to each 1 second. Please check logs are
>> existing, and 'last heartbeat' is updated properly, and also worker timeout
>> is set properly.
>>
>> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <ma...@gmail.com> wrote:
>>
>> I do not set "topology.subprocess.timeout.secs", so "
>> supervisor.worker.timeout.secs" will be used according to STORM-1314,
>> which is set 30 for my cluster.
>> 30 seconds is a very very very big value, it will never take more than 30
>> seconds processing my tuple.
>> I think there must be problem somewhere else.
>>
>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>
>> There're many situations for ShellBolt to trigger heartbeat issue, and at
>> least STORM-1946 is not the case.
>>
>> How long does your tuple take to be processed? You need to set subprocess
>> timeout seconds ("topology.subprocess.timeout.secs") to higher than max
>> time to process. You can even set it fairly big value so that subprocess
>> heartbeat issue will not happen.
>>
>>
>> ShellBolt requires that each tuple is handled and acked within heartbeat
>> timeout. I struggled to change this behavior for subprocess to periodically
>> sends heartbeat, but no luck because of GIL - global interpreter lock (same
>> for Ruby). We need to choose one: stick this restriction, or disable
>> subprocess heartbeat.
>>
>> I hope that we can resolve this issue clearly, but I guess multi-thread
>> approach doesn't work on Python, Ruby, and any language which uses GIL, and
>> I have no idea on alternatives
>>
>> - Jungtaek Lim (HeartSaVioR).
>>
>> 2016년 10월 21일 (금) 오전 11:44, Zhechao Ma <ma...@gmail.com>님이
>> 작성:
>>
>> I made an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone
>> can
>> help?
>>
>> I've got a simple topology running with Storm 1.0.1. The topology consists
>> of a KafkaSpout and several python multilang ShellBolt. I frequently got
>> the following exceptions.
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout at
>> org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(
>> ShellBolt.java:322)
>> at java.util.concurrent.Executors$RunnableAdapter.
>> call(Executors.java:471)
>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at
>> java.util.concurrent.ScheduledThreadPoolExecutor$
>> ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>> at java.util.concurrent.ScheduledThreadPoolExecutor$
>> ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> More information here:
>> 1. Topology run with ACK mode.
>> 2. Topology had 40 workers.
>> 3. Topology emitted about 10 milliom tuples every 10 minutes.
>>
>> Every time subprocess heartbeat timeout, workers would restart and python
>> processes exited with exitCode:-1, which affected processing capacity and
>> stability of the topology.
>>
>> I've checked some related issues from Storm Jira. I first found STORM-1946
>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
>> to this problem and said bug had been fixed in Storm 1.0.2. However I got
>> the same exception even after I upgraded Storm to 1.0.2.
>>
>> I checked other related issues. Let's look at history of this problem.
>> DashengJu first reported this problem with Non-ACK mode in STORM-738
>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
>> of
>> this problem with ACK mode, and it seemed that bug had been fixed in
>> 0.10.0. I don't know whether this patch is included in storm-1.x branch.
>> In
>> a word, this problem still exists in the latest stable version.
>>
>>
>>
>>
>>
>>
>> --
>> Thanks
>> Zhechao Ma
>>
>>
>>
>>
>> --
>> Thanks
>> Zhechao Ma
>>
>


-- 
Thanks
Zhechao Ma

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Jungtaek Lim <ka...@gmail.com>.
Sorry, but you seem to run multiple ShellBolts in one worker, so I can't tell
which thread prints out which log line.
Could you add %t to the worker log format? You can just copy the content from
the link below into your worker.xml.
https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
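
For reference, a minimal sketch of what such a pattern could look like once %t
(the thread name) is added; this is only an illustration, not the exact content
of that file:

    <PatternLayout>
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %c{1.} %t [%p] %msg%n</pattern>
    </PatternLayout>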

And there is an actual heartbeat timeout here:
2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time :
1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000

1477297083673 - 1477296962746 = 120927 > 120000
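
To make the failure concrete, here is a simplified sketch in Java of the kind of
check the heartbeat timer task performs about once per second; names and values
are illustrative, and this is not the actual ShellBolt source:

    import java.util.concurrent.atomic.AtomicLong;

    // Simplified sketch of the per-subprocess heartbeat check (illustrative only,
    // not the real org.apache.storm.task.ShellBolt implementation).
    class HeartbeatCheckSketch {
        private final AtomicLong lastHeartbeatMillis =
                new AtomicLong(System.currentTimeMillis());
        private final long workerTimeoutMillis = 120_000L;  // "worker timeout (ms)" in the logs

        // Called whenever the subprocess sends any message back (ack, emit, sync, ...).
        void onSubprocessMessage() {
            lastHeartbeatMillis.set(System.currentTimeMillis());
        }

        // Called periodically by the heartbeat timer task.
        void checkHeartbeat() {
            long now = System.currentTimeMillis();
            if (now - lastHeartbeatMillis.get() > workerTimeoutMillis) {
                // This is the path that ends in the
                // "subprocess heartbeat timeout" RuntimeException shown above.
                throw new RuntimeException("subprocess heartbeat timeout");
            }
        }
    }

With several ShellBolt executors in one worker there are several such timers, each
with its own last heartbeat, which is why the %t suggestion above helps tell their
log lines apart.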

Thanks,
Jungtaek Lim (HeartSaVioR)


On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <ma...@gmail.com> wrote:

> I find something strange. I look at a worker.log, there three subprocess
> heartbeat, 673ms, 678ms, and 850ms. Everything is OK until 16:18:03.
>
>
> ...
> ...
> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>
>
> And then, heartbaet timeout just at 16:18:03.675, before the third
> heartbeat (850ms) sends out. However, timeout is set to 120000 ms, or 120s.
> How could that happen?
>
> ...
> ...
> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>         at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>         ... 6 more
> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>         ... 1 more
> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>         ... 1 more
> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>
>
>
> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>
> I open DEBUG-level log, and see the BOLT heartbeat information, timeout is
> 30000ms, everything looks OK.
>
> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>
> I will try to do this, and reply latter. Thanks.
>
> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>
> Could you modify your log level to DEBUG and see worker's log? If you use
> Storm 1.x you can modify log level from UI on the fly.
> ShellBolt writes log regarding subprocess heartbeat but its level is DEBUG
> since it could produce lots of logs.
>
> Two lines:
> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
> - BOLT - sending heartbeat request to subprocess
>
> Two lines will be logged to each 1 second. Please check logs are existing,
> and 'last heartbeat' is updated properly, and also worker timeout is set
> properly.
>
> 2016년 10월 21일 (금) 오후 1:59, Zhechao Ma <ma...@gmail.com>님이 작성:
>
> I do not set "topology.subprocess.timeout.secs", so "
> supervisor.worker.timeout.secs" will be used according to STORM-1314,
> which is set 30 for my cluster.
> 30 seconds is a very very very big value, it will never take more than 30
> seconds processing my tuple.
> I think there must be problem somewhere else.
>
> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>
> There're many situations for ShellBolt to trigger heartbeat issue, and at
> least STORM-1946 is not the case.
>
> How long does your tuple take to be processed? You need to set subprocess
> timeout seconds ("topology.subprocess.timeout.secs") to higher than max
> time to process. You can even set it fairly big value so that subprocess
> heartbeat issue will not happen.
>
>
> ShellBolt requires that each tuple is handled and acked within heartbeat
> timeout. I struggled to change this behavior for subprocess to periodically
> sends heartbeat, but no luck because of GIL - global interpreter lock (same
> for Ruby). We need to choose one: stick this restriction, or disable
> subprocess heartbeat.
>
> I hope that we can resolve this issue clearly, but I guess multi-thread
> approach doesn't work on Python, Ruby, and any language which uses GIL, and
> I have no idea on alternatives
>
> - Jungtaek Lim (HeartSaVioR).
>
> 2016년 10월 21일 (금) 오전 11:44, Zhechao Ma <ma...@gmail.com>님이 작성:
>
> I made an issue (STORM-2150
> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone can
> help?
>
> I've got a simple topology running with Storm 1.0.1. The topology consists
> of a KafkaSpout and several python multilang ShellBolt. I frequently got
> the following exceptions.
>
> java.lang.RuntimeException: subprocess heartbeat timeout at
>
> org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> More information here:
> 1. Topology run with ACK mode.
> 2. Topology had 40 workers.
> 3. Topology emitted about 10 milliom tuples every 10 minutes.
>
> Every time subprocess heartbeat timeout, workers would restart and python
> processes exited with exitCode:-1, which affected processing capacity and
> stability of the topology.
>
> I've checked some related issues from Storm Jira. I first found STORM-1946
> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
> to this problem and said bug had been fixed in Storm 1.0.2. However I got
> the same exception even after I upgraded Storm to 1.0.2.
>
> I checked other related issues. Let's look at history of this problem.
> DashengJu first reported this problem with Non-ACK mode in STORM-738
> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
> of
> this problem with ACK mode, and it seemed that bug had been fixed in
> 0.10.0. I don't know whether this patch is included in storm-1.x branch. In
> a word, this problem still exists in the latest stable version.
>
>
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
>
>
>
> --
> Thanks
> Zhechao Ma
>

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Zhechao Ma <ma...@gmail.com>.
I find something strange. Looking at a worker.log, there are three subprocess
heartbeats (673ms, 678ms, and 850ms). Everything is OK until 16:18:03.


...
...
2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) :
120000
2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) :
120000
2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) :
120000
2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) :
120000
2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) :
120000
2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) :
120000
2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) :
120000
2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) :
120000
2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) :
120000
2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) :
120000
2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) :
120000
2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) :
120000
2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending
heartbeat request to subprocess
2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) :
120000
2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time
: 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) :
120000


And then the heartbeat times out at 16:18:03.675, before the third
heartbeat (850ms) is sent out. However, the timeout is set to 120000 ms, or 120s.
How could that happen?

...
...
2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1,
errorString:
        at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484)
[storm-core-1.0.1.jar:1.0.1]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
        at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException:
pid:7243, name:normalizeLog exitCode:-1, errorString:
        at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439)
~[storm-core-1.0.1.jar:1.0.1]
        ... 6 more
Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog
exitCode:-1, errorString:
        at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
~[storm-core-1.0.1.jar:1.0.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
~[?:1.7.0_79]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
~[?:1.7.0_79]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
~[?:1.7.0_79]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
~[?:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
~[?:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
~[?:1.7.0_79]
        ... 1 more
Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
        at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
~[storm-core-1.0.1.jar:1.0.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
~[?:1.7.0_79]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
~[?:1.7.0_79]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
~[?:1.7.0_79]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
~[?:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
~[?:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
~[?:1.7.0_79]
        ... 1 more
2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1,
errorString:
        at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73)
~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847)
~[storm-core-1.0.1.jar:1.0.1]



2016-10-24 15:23 GMT+08:00 Zhechao Ma <ma...@gmail.com>:

> I open DEBUG-level log, and see the BOLT heartbeat information, timeout is
> 30000ms, everything looks OK.
>
> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <ma...@gmail.com>:
>
>> I will try to do this, and reply latter. Thanks.
>>
>> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>
>>> Could you modify your log level to DEBUG and see worker's log? If you
>>> use Storm 1.x you can modify log level from UI on the fly.
>>> ShellBolt writes log regarding subprocess heartbeat but its level is
>>> DEBUG since it could produce lots of logs.
>>>
>>> Two lines:
>>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>>> - BOLT - sending heartbeat request to subprocess
>>>
>>> Two lines will be logged to each 1 second. Please check logs are
>>> existing, and 'last heartbeat' is updated properly, and also worker timeout
>>> is set properly.
>>>
>>> 2016년 10월 21일 (금) 오후 1:59, Zhechao Ma <ma...@gmail.com>님이
>>> 작성:
>>>
>>>> I do not set "topology.subprocess.timeout.secs", so "
>>>> supervisor.worker.timeout.secs" will be used according to STORM-1314,
>>>> which is set 30 for my cluster.
>>>> 30 seconds is a very very very big value, it will never take more than
>>>> 30 seconds processing my tuple.
>>>> I think there must be problem somewhere else.
>>>>
>>>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>>>
>>>> There're many situations for ShellBolt to trigger heartbeat issue, and
>>>> at least STORM-1946 is not the case.
>>>>
>>>> How long does your tuple take to be processed? You need to set
>>>> subprocess timeout seconds ("topology.subprocess.timeout.secs") to
>>>> higher than max time to process. You can even set it fairly big value so
>>>> that subprocess heartbeat issue will not happen.
>>>>
>>>>
>>>> ShellBolt requires that each tuple is handled and acked within
>>>> heartbeat timeout. I struggled to change this behavior for subprocess to
>>>> periodically sends heartbeat, but no luck because of GIL - global
>>>> interpreter lock (same for Ruby). We need to choose one: stick this
>>>> restriction, or disable subprocess heartbeat.
>>>>
>>>> I hope that we can resolve this issue clearly, but I guess multi-thread
>>>> approach doesn't work on Python, Ruby, and any language which uses GIL, and
>>>> I have no idea on alternatives
>>>>
>>>> - Jungtaek Lim (HeartSaVioR).
>>>>
>>>> 2016년 10월 21일 (금) 오전 11:44, Zhechao Ma <ma...@gmail.com>님이
>>>> 작성:
>>>>
>>>> I made an issue (STORM-2150
>>>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone
>>>> can
>>>> help?
>>>>
>>>> I've got a simple topology running with Storm 1.0.1. The topology
>>>> consists
>>>> of a KafkaSpout and several python multilang ShellBolt. I frequently got
>>>> the following exceptions.
>>>>
>>>> java.lang.RuntimeException: subprocess heartbeat timeout at
>>>> org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(S
>>>> hellBolt.java:322)
>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executor
>>>> s.java:471)
>>>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at
>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>>> tureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>>> tureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>>> Executor.java:1145)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>>>> lExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> More information here:
>>>> 1. Topology run with ACK mode.
>>>> 2. Topology had 40 workers.
>>>> 3. Topology emitted about 10 milliom tuples every 10 minutes.
>>>>
>>>> Every time subprocess heartbeat timeout, workers would restart and
>>>> python
>>>> processes exited with exitCode:-1, which affected processing capacity
>>>> and
>>>> stability of the topology.
>>>>
>>>> I've checked some related issues from Storm Jira. I first found
>>>> STORM-1946
>>>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug
>>>> related
>>>> to this problem and said bug had been fixed in Storm 1.0.2. However I
>>>> got
>>>> the same exception even after I upgraded Storm to 1.0.2.
>>>>
>>>> I checked other related issues. Let's look at history of this problem.
>>>> DashengJu first reported this problem with Non-ACK mode in STORM-738
>>>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>>>> <https://issues.apache.org/jira/browse/STORM-742> discussed the
>>>> approach of
>>>> this problem with ACK mode, and it seemed that bug had been fixed in
>>>> 0.10.0. I don't know whether this patch is included in storm-1.x
>>>> branch. In
>>>> a word, this problem still exists in the latest stable version.
>>>>
>>>>
>>>>
>>
>
>
> --
> Thanks
> Zhechao Ma
>



-- 
Thanks
Zhechao Ma

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Zhechao Ma <ma...@gmail.com>.
I opened the DEBUG-level log and see the BOLT heartbeat information; the timeout
is 30000ms, and everything looks OK.

2016-10-21 13:14 GMT+08:00 Zhechao Ma <ma...@gmail.com>:

> I will try to do this, and reply latter. Thanks.
>
> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>
>> Could you modify your log level to DEBUG and see worker's log? If you use
>> Storm 1.x you can modify log level from UI on the fly.
>> ShellBolt writes log regarding subprocess heartbeat but its level is
>> DEBUG since it could produce lots of logs.
>>
>> Two lines:
>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>> - BOLT - sending heartbeat request to subprocess
>>
>> Two lines will be logged to each 1 second. Please check logs are
>> existing, and 'last heartbeat' is updated properly, and also worker timeout
>> is set properly.
>>
>> 2016년 10월 21일 (금) 오후 1:59, Zhechao Ma <ma...@gmail.com>님이 작성:
>>
>>> I do not set "topology.subprocess.timeout.secs", so "
>>> supervisor.worker.timeout.secs" will be used according to STORM-1314,
>>> which is set 30 for my cluster.
>>> 30 seconds is a very very very big value, it will never take more than
>>> 30 seconds processing my tuple.
>>> I think there must be problem somewhere else.
>>>
>>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>>
>>> There're many situations for ShellBolt to trigger heartbeat issue, and
>>> at least STORM-1946 is not the case.
>>>
>>> How long does your tuple take to be processed? You need to set
>>> subprocess timeout seconds ("topology.subprocess.timeout.secs") to
>>> higher than max time to process. You can even set it fairly big value so
>>> that subprocess heartbeat issue will not happen.
>>>
>>>
>>> ShellBolt requires that each tuple is handled and acked within heartbeat
>>> timeout. I struggled to change this behavior for subprocess to periodically
>>> sends heartbeat, but no luck because of GIL - global interpreter lock (same
>>> for Ruby). We need to choose one: stick this restriction, or disable
>>> subprocess heartbeat.
>>>
>>> I hope that we can resolve this issue clearly, but I guess multi-thread
>>> approach doesn't work on Python, Ruby, and any language which uses GIL, and
>>> I have no idea on alternatives
>>>
>>> - Jungtaek Lim (HeartSaVioR).
>>>
>>> 2016년 10월 21일 (금) 오전 11:44, Zhechao Ma <ma...@gmail.com>님이
>>> 작성:
>>>
>>> I made an issue (STORM-2150
>>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone
>>> can
>>> help?
>>>
>>> I've got a simple topology running with Storm 1.0.1. The topology
>>> consists
>>> of a KafkaSpout and several python multilang ShellBolt. I frequently got
>>> the following exceptions.
>>>
>>> java.lang.RuntimeException: subprocess heartbeat timeout at
>>> org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(S
>>> hellBolt.java:322)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(
>>> Executors.java:471)
>>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>> tureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>> tureTask.run(ScheduledThreadPoolExecutor.java:293)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>> Executor.java:1145)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>>> lExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> More information here:
>>> 1. Topology run with ACK mode.
>>> 2. Topology had 40 workers.
>>> 3. Topology emitted about 10 milliom tuples every 10 minutes.
>>>
>>> Every time subprocess heartbeat timeout, workers would restart and python
>>> processes exited with exitCode:-1, which affected processing capacity and
>>> stability of the topology.
>>>
>>> I've checked some related issues from Storm Jira. I first found
>>> STORM-1946
>>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug
>>> related
>>> to this problem and said bug had been fixed in Storm 1.0.2. However I got
>>> the same exception even after I upgraded Storm to 1.0.2.
>>>
>>> I checked other related issues. Let's look at history of this problem.
>>> DashengJu first reported this problem with Non-ACK mode in STORM-738
>>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>>> <https://issues.apache.org/jira/browse/STORM-742> discussed the
>>> approach of
>>> this problem with ACK mode, and it seemed that bug had been fixed in
>>> 0.10.0. I don't know whether this patch is included in storm-1.x branch.
>>> In
>>> a word, this problem still exists in the latest stable version.
>>>
>>>
>>>
>


-- 
Thanks
Zhechao Ma

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Zhechao Ma <ma...@gmail.com>.
I will try to do this and reply later. Thanks.

2016-10-21 13:09 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:

> Could you modify your log level to DEBUG and see worker's log? If you use
> Storm 1.x you can modify log level from UI on the fly.
> ShellBolt writes log regarding subprocess heartbeat but its level is DEBUG
> since it could produce lots of logs.
>
> Two lines:
> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
> - BOLT - sending heartbeat request to subprocess
>
> Two lines will be logged to each 1 second. Please check logs are existing,
> and 'last heartbeat' is updated properly, and also worker timeout is set
> properly.
>
> 2016년 10월 21일 (금) 오후 1:59, Zhechao Ma <ma...@gmail.com>님이 작성:
>
>> I do not set "topology.subprocess.timeout.secs", so "
>> supervisor.worker.timeout.secs" will be used according to STORM-1314,
>> which is set 30 for my cluster.
>> 30 seconds is a very very very big value, it will never take more than 30
>> seconds processing my tuple.
>> I think there must be problem somewhere else.
>>
>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>>
>> There're many situations for ShellBolt to trigger heartbeat issue, and at
>> least STORM-1946 is not the case.
>>
>> How long does your tuple take to be processed? You need to set subprocess
>> timeout seconds ("topology.subprocess.timeout.secs") to higher than max
>> time to process. You can even set it fairly big value so that subprocess
>> heartbeat issue will not happen.
>>
>>
>> ShellBolt requires that each tuple is handled and acked within heartbeat
>> timeout. I struggled to change this behavior for subprocess to periodically
>> sends heartbeat, but no luck because of GIL - global interpreter lock (same
>> for Ruby). We need to choose one: stick this restriction, or disable
>> subprocess heartbeat.
>>
>> I hope that we can resolve this issue clearly, but I guess multi-thread
>> approach doesn't work on Python, Ruby, and any language which uses GIL, and
>> I have no idea on alternatives
>>
>> - Jungtaek Lim (HeartSaVioR).
>>
>> 2016년 10월 21일 (금) 오전 11:44, Zhechao Ma <ma...@gmail.com>님이
>> 작성:
>>
>> I made an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone
>> can
>> help?
>>
>> I've got a simple topology running with Storm 1.0.1. The topology consists
>> of a KafkaSpout and several python multilang ShellBolt. I frequently got
>> the following exceptions.
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout at
>> org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(
>> ShellBolt.java:322)
>> at java.util.concurrent.Executors$RunnableAdapter.
>> call(Executors.java:471)
>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at
>> java.util.concurrent.ScheduledThreadPoolExecutor$
>> ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>> at java.util.concurrent.ScheduledThreadPoolExecutor$
>> ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> More information here:
>> 1. Topology run with ACK mode.
>> 2. Topology had 40 workers.
>> 3. Topology emitted about 10 milliom tuples every 10 minutes.
>>
>> Every time subprocess heartbeat timeout, workers would restart and python
>> processes exited with exitCode:-1, which affected processing capacity and
>> stability of the topology.
>>
>> I've checked some related issues from Storm Jira. I first found STORM-1946
>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
>> to this problem and said bug had been fixed in Storm 1.0.2. However I got
>> the same exception even after I upgraded Storm to 1.0.2.
>>
>> I checked other related issues. Let's look at history of this problem.
>> DashengJu first reported this problem with Non-ACK mode in STORM-738
>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
>> of
>> this problem with ACK mode, and it seemed that bug had been fixed in
>> 0.10.0. I don't know whether this patch is included in storm-1.x branch.
>> In
>> a word, this problem still exists in the latest stable version.
>>
>>
>>

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Jungtaek Lim <ka...@gmail.com>.
Could you change your log level to DEBUG and check the worker's log? If you use
Storm 1.x you can change the log level from the UI on the fly.
ShellBolt writes logs about the subprocess heartbeat, but at DEBUG level
since it could produce lots of logs.

Two lines:
- BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
- BOLT - sending heartbeat request to subprocess

These two lines are logged about once per second. Please check that the logs exist,
that 'last heartbeat' is updated properly, and that the worker timeout is set
properly.
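
If the UI is not convenient, the log level can also be changed from the CLI; the
exact syntax below is from memory, so please double-check it against the Storm
documentation before relying on it:

    # Raise ShellBolt logging to DEBUG for topology "my_topology" for 900 seconds
    storm set_log_level my_topology -l org.apache.storm.task.ShellBolt=DEBUG:900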

On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <ma...@gmail.com> wrote:

> I do not set "topology.subprocess.timeout.secs", so "
> supervisor.worker.timeout.secs" will be used according to STORM-1314,
> which is set 30 for my cluster.
> 30 seconds is a very very very big value, it will never take more than 30
> seconds processing my tuple.
> I think there must be problem somewhere else.
>
> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:
>
> There're many situations for ShellBolt to trigger heartbeat issue, and at
> least STORM-1946 is not the case.
>
> How long does your tuple take to be processed? You need to set subprocess
> timeout seconds ("topology.subprocess.timeout.secs") to higher than max
> time to process. You can even set it fairly big value so that subprocess
> heartbeat issue will not happen.
>
>
> ShellBolt requires that each tuple is handled and acked within heartbeat
> timeout. I struggled to change this behavior for subprocess to periodically
> sends heartbeat, but no luck because of GIL - global interpreter lock (same
> for Ruby). We need to choose one: stick this restriction, or disable
> subprocess heartbeat.
>
> I hope that we can resolve this issue clearly, but I guess multi-thread
> approach doesn't work on Python, Ruby, and any language which uses GIL, and
> I have no idea on alternatives
>
> - Jungtaek Lim (HeartSaVioR).
>
> 2016년 10월 21일 (금) 오전 11:44, Zhechao Ma <ma...@gmail.com>님이 작성:
>
> I made an issue (STORM-2150
> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone can
> help?
>
> I've got a simple topology running with Storm 1.0.1. The topology consists
> of a KafkaSpout and several python multilang ShellBolt. I frequently got
> the following exceptions.
>
> java.lang.RuntimeException: subprocess heartbeat timeout at
>
> org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> More information here:
> 1. Topology run with ACK mode.
> 2. Topology had 40 workers.
> 3. Topology emitted about 10 milliom tuples every 10 minutes.
>
> Every time subprocess heartbeat timeout, workers would restart and python
> processes exited with exitCode:-1, which affected processing capacity and
> stability of the topology.
>
> I've checked some related issues from Storm Jira. I first found STORM-1946
> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
> to this problem and said bug had been fixed in Storm 1.0.2. However I got
> the same exception even after I upgraded Storm to 1.0.2.
>
> I checked other related issues. Let's look at history of this problem.
> DashengJu first reported this problem with Non-ACK mode in STORM-738
> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
> of
> this problem with ACK mode, and it seemed that bug had been fixed in
> 0.10.0. I don't know whether this patch is included in storm-1.x branch. In
> a word, this problem still exists in the latest stable version.
>
>
>

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Zhechao Ma <ma...@gmail.com>.
I do not set "topology.subprocess.timeout.secs", so "
supervisor.worker.timeout.secs" will be used according to STORM-1314, which
is set 30 for my cluster.
30 seconds is a very very very big value, it will never take more than 30
seconds processing my tuple.
I think there must be problem somewhere else.

2016-10-21 11:11 GMT+08:00 Jungtaek Lim <ka...@gmail.com>:

> There're many situations for ShellBolt to trigger heartbeat issue, and at
> least STORM-1946 is not the case.
>
> How long does your tuple take to be processed? You need to set subprocess
> timeout seconds ("topology.subprocess.timeout.secs") to higher than max
> time to process. You can even set it fairly big value so that subprocess
> heartbeat issue will not happen.
>
>
> ShellBolt requires that each tuple is handled and acked within heartbeat
> timeout. I struggled to change this behavior for subprocess to periodically
> sends heartbeat, but no luck because of GIL - global interpreter lock (same
> for Ruby). We need to choose one: stick this restriction, or disable
> subprocess heartbeat.
>
> I hope that we can resolve this issue clearly, but I guess multi-thread
> approach doesn't work on Python, Ruby, and any language which uses GIL, and
> I have no idea on alternatives
>
> - Jungtaek Lim (HeartSaVioR).
>
> 2016년 10월 21일 (금) 오전 11:44, Zhechao Ma <ma...@gmail.com>님이 작성:
>
>> I made an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone
>> can
>> help?
>>
>> I've got a simple topology running with Storm 1.0.1. The topology consists
>> of a KafkaSpout and several python multilang ShellBolt. I frequently got
>> the following exceptions.
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout at
>> org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(
>> ShellBolt.java:322)
>> at java.util.concurrent.Executors$RunnableAdapter.
>> call(Executors.java:471)
>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at
>> java.util.concurrent.ScheduledThreadPoolExecutor$
>> ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>> at java.util.concurrent.ScheduledThreadPoolExecutor$
>> ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> More information here:
>> 1. Topology run with ACK mode.
>> 2. Topology had 40 workers.
>> 3. Topology emitted about 10 milliom tuples every 10 minutes.
>>
>> Every time subprocess heartbeat timeout, workers would restart and python
>> processes exited with exitCode:-1, which affected processing capacity and
>> stability of the topology.
>>
>> I've checked some related issues from Storm Jira. I first found STORM-1946
>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
>> to this problem and said bug had been fixed in Storm 1.0.2. However I got
>> the same exception even after I upgraded Storm to 1.0.2.
>>
>> I checked other related issues. Let's look at history of this problem.
>> DashengJu first reported this problem with Non-ACK mode in STORM-738
>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
>> of
>> this problem with ACK mode, and it seemed that bug had been fixed in
>> 0.10.0. I don't know whether this patch is included in storm-1.x branch.
>> In
>> a word, this problem still exists in the latest stable version.
>>
>

Re: ShellBolt raise subprocess heartbeat timeout Exception

Posted by Jungtaek Lim <ka...@gmail.com>.
There are many situations in which ShellBolt can hit the heartbeat issue, and at
least STORM-1946 is not this case.

How long does your tuple take to be processed? You need to set the subprocess
timeout ("topology.subprocess.timeout.secs") higher than the maximum
processing time. You can even set it to a fairly big value so that the subprocess
heartbeat issue will not happen.
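
As a minimal sketch (assuming a plain Java topology submission; the value 600 is
only an example, not a recommendation), the timeout could be raised like this:

    import org.apache.storm.Config;

    // Build a topology Config with a larger subprocess heartbeat timeout.
    // Config extends Map<String, Object>, so the raw key from this thread can be used.
    public class SubprocessTimeoutConfigSketch {
        public static Config buildConfig() {
            Config conf = new Config();
            conf.put("topology.subprocess.timeout.secs", 600);  // example value
            return conf;
        }
    }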


ShellBolt requires that each tuple is handled and acked within the heartbeat
timeout. I struggled to change this behavior so that the subprocess periodically
sends heartbeats, but had no luck because of the GIL - global interpreter lock (same
for Ruby). We need to choose one: stick with this restriction, or disable the
subprocess heartbeat.

I hope that we can resolve this issue cleanly, but I guess a multi-threaded
approach doesn't work for Python, Ruby, or any other language which uses a GIL, and
I have no idea about alternatives.

- Jungtaek Lim (HeartSaVioR).

On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <ma...@gmail.com> wrote:

> I made an issue (STORM-2150
> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago, anyone can
> help?
>
> I've got a simple topology running with Storm 1.0.1. The topology consists
> of a KafkaSpout and several python multilang ShellBolt. I frequently got
> the following exceptions.
>
> java.lang.RuntimeException: subprocess heartbeat timeout at
>
> org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> More information here:
> 1. Topology run with ACK mode.
> 2. Topology had 40 workers.
> 3. Topology emitted about 10 milliom tuples every 10 minutes.
>
> Every time subprocess heartbeat timeout, workers would restart and python
> processes exited with exitCode:-1, which affected processing capacity and
> stability of the topology.
>
> I've checked some related issues from Storm Jira. I first found STORM-1946
> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
> to this problem and said bug had been fixed in Storm 1.0.2. However I got
> the same exception even after I upgraded Storm to 1.0.2.
>
> I checked other related issues. Let's look at history of this problem.
> DashengJu first reported this problem with Non-ACK mode in STORM-738
> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
> of
> this problem with ACK mode, and it seemed that bug had been fixed in
> 0.10.0. I don't know whether this patch is included in storm-1.x branch. In
> a word, this problem still exists in the latest stable version.
>
