Posted to user@flink.apache.org by Puneet Duggal <pu...@gmail.com> on 2021/09/08 13:05:40 UTC

Documentation for deep diving into flink (data-streaming) job restart process

Hi,

For the past 2-3 days I have been looking for documentation that explains how Flink restarts a data-streaming job. I know the restart and failover strategies, but I want to understand what role the different components (Job Manager, Task Manager, etc.) play while a Flink streaming job is being restarted.

I am asking because recently, in production, when I restarted a task manager, all the jobs running on it disappeared instead of being restarted. I could not find those jobs under completed jobs in the Flink UI either, and the logs did not give me enough information.

Also, could anyone tell me what the role of the /tmp/executionGraphStore folder on the Job Manager machine is?

Thanks



Re: Documentation for deep diving into flink (data-streaming) job restart process

Posted by Puneet Duggal <pu...@gmail.com>.
Hi Robert,

Any solution or alternative approach to the above issue would be appreciated,
since going live with new jobs will be unreliable if a task manager going
down makes jobs fail instead of restart.


Re: Documentation for deep diving into flink (data-streaming) job restart process

Posted by Puneet Duggal <pu...@gmail.com>.
Hi Robert,

Thanks for taking the time to go through the logs.

Problem:
The reason for restarting all the task managers was to apply an increased JVM metaspace size to each existing task manager. Currently each task manager has 32 slots, but the JVM metaspace size was 256 MB, which used to fill up after deploying 4-5 jobs (irrespective of their parallelism). Since our use case is generic, the worst case is 32 jobs running on a single task manager.

Solution:
The basic solution was to increase the JVM metaspace size to 3 GB to accommodate 32 jobs. This required restarting every task manager JVM with the new setting. We had a total of 10 task managers, of which 7 were completely empty. In slot terms, there were a total of 320 slots, of which around 240 were available at restart time. First we targeted the task managers that were completely empty. Once those had restarted, we targeted the task managers where jobs were up and running.
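
For reference, the slot arithmetic above can be checked with a minimal Python sketch. The figures come from this message; the count of about 80 occupied slots is inferred from "320 total, around 240 available", and the helper names are hypothetical, not part of Flink. It only shows that, on these numbers, free slot capacity should not have prevented the jobs from being rescheduled.

```python
# Figures from the message; FREE_SLOTS_AT_RESTART is approximate ("around 240").
TASK_MANAGERS = 10
SLOTS_PER_TASK_MANAGER = 32
FREE_SLOTS_AT_RESTART = 240


def total_slots(task_managers, slots_per_tm):
    """Total number of slots in the cluster."""
    return task_managers * slots_per_tm


def can_absorb_restart(free_slots, used_on_restarted_tm, slots_per_tm):
    """True if the other task managers have enough free slots to take over
    the slots in use on the task manager being restarted."""
    # Free slots on the restarted task manager vanish together with it.
    free_on_restarted_tm = slots_per_tm - used_on_restarted_tm
    free_elsewhere = free_slots - free_on_restarted_tm
    return free_elsewhere >= used_on_restarted_tm


total = total_slots(TASK_MANAGERS, SLOTS_PER_TASK_MANAGER)
used = total - FREE_SLOTS_AT_RESTART
print(total, used)  # 320 80

# Worst case: the restarted task manager had all 32 of its slots in use.
print(can_absorb_restart(FREE_SLOTS_AT_RESTART, 32, SLOTS_PER_TASK_MANAGER))  # True
```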

Issue Faced:
On the first task manager we targeted, I faced the above-mentioned issue: the jobs failed instead of going into the restart phase and being respawned on other task managers. But these failed jobs were not even listed in the completed jobs section of the Flink UI, which is why I used the term "disappeared". In my prior experience, any job with a terminal status gets listed under completed jobs.

Thanks



Re: Documentation for deep diving into flink (data-streaming) job restart process

Posted by Robert Metzger <rm...@apache.org>.
Thanks for the log.

From the partial log that you shared with me, my assumption is that some
external resource manager is shutting down your cluster. Multiple
TaskManagers are disconnecting, and finally the job is switching into
failed state.
It seems that you are stopping not just one TaskManager, but all of them.

Why are you restarting a TaskManager?
How are you deploying Flink?


Re: Documentation for deep diving into flink (data-streaming) job restart process

Posted by Puneet Duggal <pu...@gmail.com>.
Hi,

Please find attached the log file for the job that was not restarted on
another task manager after its existing task manager was restarted.

Just FYI: we are using the fixed-delay restart strategy (5 attempts, 10 s delay).
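
As a rough illustration of what a fixed-delay setting with 5 attempts and a 10 s delay implies, here is a minimal Python sketch. It is not Flink code: the class and function names are hypothetical, and it assumes the attempt counter is cumulative and never reset between failures. Under that assumption, a job survives five failures, and the sixth one moves it to a terminal failed state.

```python
from dataclasses import dataclass


@dataclass
class FixedDelayRestart:
    """Simplified model of a fixed-delay restart policy (hypothetical, not Flink's class)."""
    max_attempts: int = 5        # "5 times" from the message
    delay_seconds: float = 10.0  # "10s delay" from the message
    attempts_used: int = 0

    def on_failure(self):
        """Record one failure. Return the delay before the next restart,
        or None once the restart attempts are exhausted."""
        self.attempts_used += 1
        if self.attempts_used > self.max_attempts:
            return None
        return self.delay_seconds


def simulate(failures, strategy):
    """Feed a number of consecutive failures into the strategy and report the outcome."""
    for i in range(1, failures + 1):
        if strategy.on_failure() is None:
            return f"FAILED after failure #{i}"
    return "RUNNING"


print(simulate(5, FixedDelayRestart()))  # RUNNING
print(simulate(6, FixedDelayRestart()))  # FAILED after failure #6
```

In this model, several task managers disconnecting in quick succession could use up all five attempts within a single maintenance window.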


Re: Documentation for deep diving into flink (data-streaming) job restart process

Posted by Robert Metzger <rm...@apache.org>.
Hi Puneet,

Can you provide us with the JobManager logs of this incident? Jobs should
not disappear; they should restart on other TaskManagers.
