You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Kashmar, Ali" <Al...@emc.com> on 2015/12/16 17:26:45 UTC
No job recovery after job manager failure
Hi,
I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader:
11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:
11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
Is this a bug?
Thanks,
Ali
Re: No job recovery after job manager failure
Posted by Ufuk Celebi <uc...@apache.org>.
Hey Ali,
can you send me the complete logs?
I don’t think it’s possible via the mailing list. Just send it to my private email uce@apache.org.
– Ufuk
> On 16 Dec 2015, at 17:26, Kashmar, Ali <Al...@emc.com> wrote:
>
> Hi,
>
> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader:
>
> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
>
>
> I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:
>
>
> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>
>
> Is this a bug?
>
> Thanks,
> Ali
Re: No job recovery after job manager failure
Posted by Ufuk Celebi <uc...@apache.org>.
The issue was that 1) local state backend but loss of VM and 2) recovery did not log any Exception.
2) has been addressed in this PR: https://github.com/apache/flink/pull/1472
– Ufuk
> On 17 Dec 2015, at 15:26, Ufuk Celebi <uc...@apache.org> wrote:
>
> As an update: I’m investigating this. Ali sent me the log files.
>
>> On 16 Dec 2015, at 18:15, Ufuk Celebi <uc...@apache.org> wrote:
>>
>> Hey Ali,
>>
>> can you send me the complete logs?
>>
>> I don’t think it’s possible via the mailing list. Just send it to my private email uce@apache.org.
>>
>> – Ufuk
>>
>>> On 16 Dec 2015, at 17:26, Kashmar, Ali <Al...@emc.com> wrote:
>>>
>>> Hi,
>>>
>>> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader:
>>>
>>> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
>>> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
>>> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
>>> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
>>> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
>>>
>>>
>>> I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:
>>>
>>>
>>> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
>>> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
>>> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
>>> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
>>> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
>>> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>>>
>>>
>>> Is this a bug?
>>>
>>> Thanks,
>>> Ali
>>
>
Re: No job recovery after job manager failure
Posted by Ufuk Celebi <uc...@apache.org>.
As an update: I’m investigating this. Ali sent me the log files.
> On 16 Dec 2015, at 18:15, Ufuk Celebi <uc...@apache.org> wrote:
>
> Hey Ali,
>
> can you send me the complete logs?
>
> I don’t think it’s possible via the mailing list. Just send it to my private email uce@apache.org.
>
> – Ufuk
>
>> On 16 Dec 2015, at 17:26, Kashmar, Ali <Al...@emc.com> wrote:
>>
>> Hi,
>>
>> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader:
>>
>> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
>> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
>> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
>> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
>> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
>>
>>
>> I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:
>>
>>
>> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
>> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
>> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
>> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
>> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
>> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>>
>>
>> Is this a bug?
>>
>> Thanks,
>> Ali
>
Re: No job recovery after job manager failure
Posted by Ufuk Celebi <uc...@apache.org>.
Hey Ali,
can you send me the complete logs?
I don’t think it’s possible via the mailing list. Just send it to my private email uce@apache.org.
– Ufuk
> On 16 Dec 2015, at 17:26, Kashmar, Ali <Al...@emc.com> wrote:
>
> Hi,
>
> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader:
>
> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
>
>
> I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:
>
>
> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>
>
> Is this a bug?
>
> Thanks,
> Ali