Posted to user@flink.apache.org by Zhu Zhu <re...@gmail.com> on 2019/11/11 09:18:23 UTC

Re: Flink disaster recovery test problems

Hi Zhong,

Looks like you are assigning the tasks to different slot sharing groups to force
them not to share the same slot.
So you will need at least 2 slots for the streaming job to start running
successfully.
Killing one of the 2 TaskManagers (one slot each) will leave the job with
insufficient slots, and it will hang at slot allocation.
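
For illustration, here is a minimal sketch of such a job (the group names,
operators, and data are made up, not taken from your actual job). With two
distinct slot sharing groups at parallelism 1, the job needs at least 2 slots:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSharingDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        env.fromElements("a", "b", "c")
           .slotSharingGroup("group-1")       // source pinned to group-1
           .map(String::toUpperCase)          // inherits group-1 from its input
           .map(s -> s + "!")
           .slotSharingGroup("group-2")       // second operator isolated in group-2
           .print();

        // Two slot sharing groups mean at least 2 slots are required; with one
        // slot per TaskManager, killing one of the two TaskManagers leaves the
        // job stuck waiting for a replacement slot.
        env.execute("slot-sharing-demo");
    }
}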

Task state is needed so that unprocessed source data is not skipped, i.e. to
avoid data loss. It is also needed if you want the failed task to recover to
the state it had right before the failure.
Checkpointing is needed to persist the task state. If it is not enabled, the
job will restart from the initial state, i.e. it will consume data from the
very beginning and there can be a big regression in processing progress.
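
As a minimal sketch of enabling this (the interval, checkpoint path, and state
backend below are just assumptions, not values from your setup):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Persist task state every 10 seconds with exactly-once guarantees.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep the last completed checkpoint even if the job is cancelled,
        // so a restarted job can resume from it instead of from the beginning.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Durable storage for the checkpoints (placeholder path).
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        env.fromElements(1, 2, 3).map(i -> i * 2).print();
        env.execute("checkpointing-demo");
    }
}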

Thanks,
Zhu Zhu

钟旭阳 <ye...@mail.ustc.edu.cn> wrote on Tue, Nov 5, 2019 at 3:01 PM:

> hello:
>
>
> I am currently learning Flink. I recently ran into a problem while doing
> disaster recovery testing with Flink. I tried to find an answer on the official
> website and in blogs but failed, so I am turning to the community for help.
>
>
> The current situation is: I have two servers, each with one slot. My
> application has two parallel operators, each with a parallelism of 1, and I
> use the slotSharingGroup function to make them run in these two slots
> respectively.
>
>
> My disaster recovery test is to shut down one of the servers. But is it
> possible that the two parallel operators then compete for the same slot on
> the remaining server? In addition to this, I want to dynamically add or
> remove servers (to simulate power failures, etc.) while Flink is running, but
> I think this must cause loss of streaming data. Is restarting Flink from a
> checkpoint the only way to ensure that data is not lost while the number of
> servers is changed dynamically?
>
>
> Best
> Zhong