Posted to user@flink.apache.org by Gyula Fóra <gy...@gmail.com> on 2017/08/10 11:31:49 UTC
Delay between job manager recovery and job recovery (1.3.2)
Hi!
In some cases it seems to take a long time for the ZooKeeper-based job recovery to start after recovering from a JM failure.
Looking at the logs, there is a 2-minute gap between when the last recovered TM started successfully and the job recovery:
2017-08-10 13:14:06,369 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at splat31 (akka.tcp://flink@splat31.sto.midasplayer.com:41821/user/taskmanager) as 2de2376edd4670b48600f29df9e821a1. Current number of registered hosts is 15. Current number of alive task slots is 120.
2017-08-10 13:14:08,198 INFO org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_e22_1495032222473_0939_02_000002 has started.
2017-08-10 13:14:08,198 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at splat19 (akka.tcp://flink@splat19.sto.midasplayer.com:36377/user/taskmanager) as 06ed7098060f4d539150c1c26a32ac07. Current number of registered hosts is 16. Current number of alive task slots is 128.
2017-08-10 13:14:56,710 INFO kafka.client.ClientUtils$ - Fetching metadata from broker id:0,host:kafka10.sto.midasplayer.com,port:9092 with correlation id 0 for 1 topic(s) Set(meter.rbea.log)
2017-08-10 13:14:56,714 INFO kafka.producer.SyncProducer - Connected to kafka10.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,734 INFO kafka.producer.SyncProducer - Disconnecting from kafka10.sto.midasplayer.com:9092
2017-08-10 13:14:56,780 INFO kafka.producer.SyncProducer - Connected to kafka10.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,791 INFO kafka.producer.SyncProducer - Connected to kafka21.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,796 INFO kafka.producer.SyncProducer - Connected to kafka11.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,800 INFO kafka.producer.SyncProducer - Connected to kafka09.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,806 INFO kafka.producer.SyncProducer - Connected to kafka23.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,826 INFO kafka.producer.SyncProducer - Connected to kafka22.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,830 INFO kafka.producer.SyncProducer - Connected to kafka08.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,845 INFO kafka.producer.SyncProducer - Connected to kafka20.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,852 INFO kafka.producer.SyncProducer - Connected to kafka15.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,857 INFO kafka.producer.SyncProducer - Connected to kafka17.sto.midasplayer.com:9092 for producing
2017-08-10 13:14:56,861 INFO kafka.producer.SyncProducer - Connected to kafka18.sto.midasplayer.com:9092 for producing
2017-08-10 13:15:56,814 INFO org.apache.flink.yarn.YarnJobManager - Attempting to recover all jobs.
2017-08-10 13:15:56,820 INFO org.apache.flink.yarn.YarnJobManager - There are 1 jobs to recover. Starting the job recovery.
2017-08-10 13:15:56,821 INFO org.apache.flink.yarn.YarnJobManager - Attempting to recover job fea2eb82ef98849be2a8ac1731379193.
2017-08-10 13:15:56,886 INFO kafka.producer.SyncProducer - Connected to kafka13.sto.midasplayer.com:9092 for producing
2017-08-10 13:15:56,889 INFO kafka.producer.SyncProducer - Connected to kafka14.sto.midasplayer.com:9092 for producing
2017-08-10 13:15:56,943 INFO kafka.producer.SyncProducer - Connected to kafka12.sto.midasplayer.com:9092 for producing
Disregard the Kafka logs; they belong to a metrics reporter.
Any idea what might cause the delay here? What is happening in the background?
Thanks,
Gyula
Re: Delay between job manager recovery and job recovery (1.3.2)
Posted by Gyula Fóra <gy...@gmail.com>.
Sweet, thanks Aljoscha for the quick help.
Gyula
Aljoscha Krettek <al...@apache.org> wrote (on 10 Aug 2017, Thu, at 15:33):
> Don't worry! :-) I found that you can configure this via "high-availability.job.delay"
> (in HighAvailabilityOptions).
>
> Best,
> Aljoscha
Re: Delay between job manager recovery and job recovery (1.3.2)
Posted by Aljoscha Krettek <al...@apache.org>.
Don't worry! :-) I found that you can configure this via "high-availability.job.delay" (in HighAvailabilityOptions).
Best,
Aljoscha
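For anyone reading this in the archive: the option is a normal entry in flink-conf.yaml. A minimal sketch follows; the value shown is purely illustrative, not a recommendation, and the right delay depends on how quickly your TaskManagers re-register after a JobManager failover.

```yaml
# flink-conf.yaml
# How long the JobManager waits after gaining leadership before recovering
# jobs from ZooKeeper ("high-availability.job.delay" in HighAvailabilityOptions).
# "10 s" is an example value only; tune it to your failover behavior.
high-availability.job.delay: 10 s
```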
> On 10. Aug 2017, at 15:13, Gyula Fóra <gy...@gmail.com> wrote:
>
> Here is the whole log, for the relevant parts at least: https://gist.github.com/gyfora/b70dd18c048b862751b194f613514300
>
> Sorry for not pasting it earlier.
>
> Gyula
Re: Delay between job manager recovery and job recovery (1.3.2)
Posted by Gyula Fóra <gy...@gmail.com>.
Here is the whole log, for the relevant parts at least:
https://gist.github.com/gyfora/b70dd18c048b862751b194f613514300
Sorry for not pasting it earlier.
Gyula
Gyula Fóra <gy...@gmail.com> wrote (on 10 Aug 2017, Thu, at 15:04):
> Oh, I found this in the log that seems to explain it: 2017-08-10
> 13:13:56,795 INFO org.apache.flink.yarn.YarnJobManager - Delaying recovery
> of all jobs by 120000 milliseconds.
>
> I wonder why this is...
Re: Delay between job manager recovery and job recovery (1.3.2)
Posted by Gyula Fóra <gy...@gmail.com>.
Oh, I found this in the log that seems to explain it:
2017-08-10 13:13:56,795 INFO org.apache.flink.yarn.YarnJobManager - Delaying recovery of all jobs by 120000 milliseconds.
I wonder why this is...
Aljoscha Krettek <al...@apache.org> wrote (on 10 Aug 2017, Thu, at 14:41):
> Hi,
>
> Let me also investigate that. Did you observe this in 1.3.2 and not in
> 1.3.0 and/or 1.3.1, or did you go directly from 1.2.x to 1.3.2?
>
> Best,
> Aljoscha
>
Re: Delay between job manager recovery and job recovery (1.3.2)
Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
Let me also investigate that. Did you observe this in 1.3.2 and not in 1.3.0 and/or 1.3.1, or did you go directly from 1.2.x to 1.3.2?
Best,
Aljoscha
> On 10. Aug 2017, at 13:31, Gyula Fóra <gy...@gmail.com> wrote:
>
> Hi!
> In some cases it seems to take a long time for the ZooKeeper-based job recovery to start after recovering from a JM failure.
> Looking at the logs, there is a 2-minute gap between when the last recovered TM started successfully and the job recovery:
> Any idea what might cause the delay here? What is happening in the background?
> Thanks,
> Gyula