Posted to user@flink.apache.org by Navneeth Krishnan <re...@gmail.com> on 2018/03/29 18:02:04 UTC

Job restart hook

Hi,

Is there a way for a script to be called whenever a job gets restarted? My
scenario: let's say there are 20 slots and the job runs on all 20 of them.
After a while a task manager goes down, leaving only 14 slots, and
I need to readjust the parallelism of my job so that it keeps running until
the lost TM comes up again. It would be great to know how others are
handling this situation.

Thanks,
Navneeth

Re: Job restart hook

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Navneeth,

I am sending the answer to the user mailing list so that we keep the discussion public.
There may also be other users interested in the question.

To answer your question: you cannot restart from an externalized checkpoint
with a different parallelism. To be able to do so, you have to take a savepoint.
You can find more on this in [1].
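
For illustration only, here is a minimal sketch of the savepoint route written as a small Python helper that builds the Flink CLI invocations (`flink savepoint <jobId> <targetDirectory>` and `flink run -s <savepointPath> -p <parallelism> <jar>`). The job ID, savepoint paths and jar name are hypothetical placeholders, and the helper names are my own; check the CLI syntax against the documentation of your Flink version before relying on it.

```python
def savepoint_command(job_id, target_dir):
    """Build the CLI call that triggers a savepoint for a running job."""
    return ["flink", "savepoint", job_id, target_dir]


def rescale_command(savepoint_path, parallelism, jar_path):
    """Build the CLI call that resubmits the job from a savepoint
    with a new default parallelism (-p)."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return ["flink", "run", "-s", savepoint_path, "-p", str(parallelism), jar_path]


if __name__ == "__main__":
    # Hypothetical values: replace with your own job ID, paths and jar.
    print(" ".join(savepoint_command("<jobId>", "hdfs:///flink/savepoints")))
    print(" ".join(rescale_command("hdfs:///flink/savepoints/<savepoint>", 16, "my-job.jar")))
```

The lists can be passed to `subprocess.run` once the placeholders are filled in. The running job still has to be stopped between taking the savepoint and resubmitting.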

Thanks,
Kostas

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html

> On Apr 3, 2018, at 7:40 PM, Navneeth Krishnan <re...@gmail.com> wrote:
> 
> Thanks a lot Kostas. The issue we are facing is that it sometimes takes a very long time to bring up the TM, and we don't want to stall the entire job until the TM is back up. That's why we wanted to explore this option and see if it works. One small related question: can we restore from checkpoints with a different parallelism?
> 
> On Tue, Apr 3, 2018 at 2:48 AM, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:
> Hi Navneeth,
> 
> If I understand correctly, you have a job with parallelism p=20, a TM goes down (e.g. one with 4 slots),
> and until that TM comes back up you want to run the job with p=16, then rerun it with p=20
> once the TM is up again.
> 
> If this is the case, one important thing to keep in mind is that when a TM fails, the whole job restarts,
> and not only the tasks that were running on that TM.
> 
> Given this, and assuming that the lost TM will not take long until it comes up, I am not sure
> if you save anything by starting a job with parallelism = 20, then restarting it with parallelism
> of 16 (in your example) until the TM comes up, and then taking a savepoint, stopping it and
> restarting it with parallelism 20 again.
> 
> If you still want to do it, one way is to use the REST API to get the necessary
> information about your cluster and the state of your job, and to write a script that takes the necessary
> actions, e.g. resubmitting the job with a different parallelism.
> 
> I hope this helps,
> Kostas
> 


Re: Job restart hook

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Navneeth,

If I understand correctly, you have a job with parallelism p=20, a TM goes down (e.g. one with 4 slots),
and until that TM comes back up you want to run the job with p=16, then rerun it with p=20
once the TM is up again.

If this is the case, one important thing to keep in mind is that when a TM fails, the whole job restarts, 
and not only the tasks that were running on that TM. 

Given this, and assuming that the lost TM will not take long until it comes up, I am not sure 
if you save anything by starting a job with parallelism = 20, then restarting it with parallelism 
of 16 (in your example) until the TM comes up, and then taking a savepoint, stopping it and 
restarting it with parallelism 20 again.

If you still want to do it, one way is to use the REST API to get the necessary
information about your cluster and the state of your job, and to write a script that takes the necessary
actions, e.g. resubmitting the job with a different parallelism.
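
As a rough sketch of such a script (not a tested solution), the Python below reads the cluster overview from the REST API and decides which parallelism the job could run with. It assumes the `/overview` endpoint reports a `slots-total` field; the function names and the decision rule (use as many slots as exist, up to the desired parallelism) are my own assumptions for illustration.

```python
import json
from urllib.request import urlopen


def fetch_overview(base_url):
    """Fetch the cluster overview from the JobManager REST endpoint."""
    with urlopen(base_url.rstrip("/") + "/overview", timeout=10) as resp:
        return json.load(resp)


def plan_parallelism(slots_total, current, desired):
    """Return the parallelism to resubmit with, or None if nothing should change.

    slots_total: slots currently registered in the cluster
    current:     parallelism the job was last submitted with
    desired:     parallelism to use when the cluster is at full size
    """
    target = min(desired, slots_total)
    if target < 1:
        return None  # no slots at all, nothing can be scheduled
    return target if target != current else None


if __name__ == "__main__":
    # Offline demonstration with the numbers from this thread:
    # the job wants 20 slots, but only 16 are left after a TM is lost.
    print(plan_parallelism(slots_total=16, current=20, desired=20))  # prints 16
```

In a real script you would call something like `fetch_overview("http://<jobmanager-host>:<port>")`, pass `overview["slots-total"]` to `plan_parallelism`, and, when it returns a number, take a savepoint and resubmit the job with that parallelism.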

I hope this helps,
Kostas
