Posted to user@flink.apache.org by Avi Levi <av...@bluevoyant.com> on 2019/02/14 00:19:50 UTC

flink restoring from state

Hi,
Any help figuring this out will be highly appreciated. We are running on GC.
After uploading a new jar with an old savepoint (taken the day before), some
of our checkpoints fail with "*Checkpoint failed: The assigned slot
container_e02_1550091678485_0001_01_000023_7 was removed*." What is the
reason for that? Some used to fail on timeout, but after I increased the
timeout to 15 min, some then failed with "*Checkpoint failed: Checkpoint
Coordinator is suspending*". What can cause that, and how can I solve it?

Another question: will recovering the old state cause the consumer to
consume messages starting from that savepoint?

regards
Avi

Re: flink restoring from state

Posted by Avi Levi <av...@bluevoyant.com>.

Thank you very much,
Please find attached the job manager log and the task manager log.

Thanks
Avi

On Thu, Feb 14, 2019 at 3:30 AM Congxian Qiu <qc...@gmail.com> wrote:

> Hi, Avi
>     I think "*Checkpoint failed: The assigned slot
> container_e02_1550091678485_0001_01_000023_7 was removed*" (this may be a
> container failure or something else; you could double-check the taskmanager
> log for more information) and "*Checkpoint failed: Checkpoint Coordinator is
> suspending*" are not the root cause. Could you please share the
> jobmanager log?
>
>     Whether the consumer consumes messages from that savepoint after
> recovering from the old state is controlled by the consumer; restoring just
> restores the offsets, provided they were snapshotted when the savepoint was
> taken.
> Best,
> Congxian

Re: flink restoring from state

Posted by Congxian Qiu <qc...@gmail.com>.

Hi, Avi
    I think "*Checkpoint failed: The assigned slot
container_e02_1550091678485_0001_01_000023_7 was removed*" (this may be a
container failure or something else; you could double-check the taskmanager
log for more information) and "*Checkpoint failed: Checkpoint Coordinator is
suspending*" are not the root cause. Could you please share the jobmanager
log?

    Whether the consumer consumes messages from that savepoint after
recovering from the old state is controlled by the consumer; restoring just
restores the offsets, provided they were snapshotted when the savepoint was
taken.
Best,
Congxian
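
To illustrate the point about offsets, here is a minimal sketch in plain Java
of how a source might pick its start position. It does not use the Flink API;
the class name, the StartupMode enum, and resolveStartOffset are all
hypothetical and exist only for this illustration. The assumption is a
Kafka-style source with one offset per partition: if an offset was restored
from the savepoint, reading resumes from it, and the configured start position
applies only when nothing was restored.

```java
import java.util.OptionalLong;

public class OffsetRestoreSketch {

    /** Hypothetical configured start position, used only on a fresh start. */
    enum StartupMode { EARLIEST, LATEST }

    /** Assumed offset of the first record in the partition. */
    static final long EARLIEST_OFFSET = 0L;

    /**
     * Picks the offset to start reading from for a single partition.
     * A restored offset (from a savepoint) takes precedence over the
     * configured startup mode; the mode is consulted only if nothing
     * was restored.
     */
    static long resolveStartOffset(OptionalLong restoredOffset,
                                   StartupMode mode,
                                   long latestOffset) {
        if (restoredOffset.isPresent()) {
            return restoredOffset.getAsLong();
        }
        return mode == StartupMode.EARLIEST ? EARLIEST_OFFSET : latestOffset;
    }

    public static void main(String[] args) {
        // Restored from a savepoint taken at offset 1200: resume there,
        // even though the configured mode says LATEST.
        System.out.println(
            resolveStartOffset(OptionalLong.of(1200L), StartupMode.LATEST, 5000L));

        // Fresh start without a savepoint: the configured mode decides.
        System.out.println(
            resolveStartOffset(OptionalLong.empty(), StartupMode.LATEST, 5000L));
    }
}
```

Under these assumptions, a job restored from a day-old savepoint would re-read
everything written after the stored offset, which may be part of why
checkpoints take longer right after the restore.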