Posted to user@flink.apache.org by Avi Levi <av...@bluevoyant.com> on 2019/02/14 00:19:50 UTC
flink restoring from state
Hi,
Any help figuring this out will be highly appreciated. We are running on GC.
After uploading a new jar with an old savepoint (taken the day before), some of
our checkpoints fail with "Checkpoint failed: The assigned slot
container_e02_1550091678485_0001_01_000023_7 was removed". What is the
reason for that? Some used to fail on timeout, but after I increased the
timeout to 15 min, some then crashed with "Checkpoint failed: Checkpoint
Coordinator is suspending". What can cause that, and how can we solve it?
Another question: will recovering the old state cause the consumer to
consume messages starting from that savepoint?
Regards,
Avi
Re: flink restoring from state
Posted by Avi Levi <av...@bluevoyant.com>.
Thank you very much,
Please find attached the job manager log and the task manager log.
Thanks
Avi
On Thu, Feb 14, 2019 at 3:30 AM Congxian Qiu <qc...@gmail.com> wrote:
> Hi Avi,
> I think "Checkpoint failed: The assigned slot
> container_e02_1550091678485_0001_01_000023_7 was removed" (this may be a
> container failure or something else; you could double-check the taskmanager
> log for more information) and "Checkpoint failed: Checkpoint Coordinator is
> suspending" are not the root cause. Could you please share the
> jobmanager log?
>
> Whether the consumer consumes messages from that savepoint after
> recovering from the old state is controlled by the consumer. Restoring just
> restores the offsets, provided they were snapshotted when the savepoint was
> taken.
> Best,
> Congxian
Re: flink restoring from state
Posted by Congxian Qiu <qc...@gmail.com>.
Hi Avi,
I think "Checkpoint failed: The assigned slot
container_e02_1550091678485_0001_01_000023_7 was removed" (this may be a
container failure or something else; you could double-check the taskmanager
log for more information) and "Checkpoint failed: Checkpoint Coordinator is
suspending" are not the root cause. Could you please share the jobmanager
log?
Whether the consumer consumes messages from that savepoint after
recovering from the old state is controlled by the consumer. Restoring just
restores the offsets, provided they were snapshotted when the savepoint was
taken.
Best,
Congxian
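
Editor's sketch (not part of the original thread): the point about offsets
can be illustrated with a small, simplified model. It assumes the consumer is
a Kafka-style source with per-partition offsets, which the thread does not
state. The function name resolve_start_offsets, the partition numbers, and
the offset values are all hypothetical, and this is not Flink's actual code.
It only shows the rule described above: an offset that was snapshotted into
the savepoint is what the consumer resumes from, and anything without a
snapshotted offset falls back to whatever start position the consumer is
configured with.

```python
from typing import Dict, Iterable, Optional


def resolve_start_offsets(
    partitions: Iterable[int],
    restored_offsets: Optional[Dict[int, int]],
    fallback_offset: int,
) -> Dict[int, int]:
    """Choose a start offset for every partition.

    Offsets found in the restored state take precedence. Partitions that
    were not snapshotted use the consumer's configured fallback position.
    """
    restored = restored_offsets or {}
    return {p: restored.get(p, fallback_offset) for p in partitions}


# Made-up offsets, standing in for what a savepoint might contain.
savepoint_offsets = {0: 1200, 1: 980}

# Partition 2 has no snapshotted offset, so it uses the fallback.
start = resolve_start_offsets([0, 1, 2], savepoint_offsets, fallback_offset=0)
print(start)
```

Under this model, restoring from a day-old savepoint means the consumer
re-reads everything after the snapshotted offsets, which is consistent with
the answer above.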