Posted to user@flink.apache.org by Paul Lam <pa...@gmail.com> on 2020/09/29 11:52:21 UTC

Savepoint incomplete when job was killed after a cancel timeout

Hi,

We have a Flink job that was stopped erroneously, leaving no usable checkpoint/savepoint to restore from,
and we are looking for some help to narrow down the problem.

How we ran into this problem:

We stopped the job with the cancel-with-savepoint command (due to a compatibility issue), but the command
timed out after 1 minute because of backpressure, so we force-killed the job with the YARN kill command.
Usually this would not cause trouble, because we can still restore the job from the last retained checkpoint.

But this time, the last checkpoint directory had been cleaned up and was empty (the retained checkpoint number was 1).
According to ZooKeeper and the logs, the savepoint finished (the job master logged “Savepoint stored in …”)
right after the cancel timeout. However, the savepoint directory contains only the _metadata file, and the
other state files referenced by the metadata are absent.

Environment & Config (a rough code sketch follows this list):
- Flink 1.11.0
- YARN job cluster
- HA via ZooKeeper
- FsStateBackend
- Aligned, non-incremental checkpoints
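
For illustration, the setup above roughly corresponds to a job wired up like the
sketch below (the paths, the checkpoint interval, and the class name are
placeholders, not our real values):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FsStateBackend: working state lives on the heap, snapshots go to the file system.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // Aligned, non-incremental checkpoints (the default for FsStateBackend).
        env.enableCheckpointing(60_000);

        // The cleanup mode decides whether the last retained checkpoint survives a
        // cancellation: RETAIN_ON_CANCELLATION keeps it, DELETE_ON_CANCELLATION removes
        // it. Shown here only to illustrate the knob, not necessarily our actual value.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // state.checkpoints.num-retained in flink-conf.yaml (1 in our case) caps how
        // many completed checkpoints are kept around.

        // Placeholder pipeline so the sketch runs on its own.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-setup-sketch");
    }
}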

Any comments and suggestions are appreciated! Thanks!

Best,
Paul Lam


Re: Savepoint incomplete when job was killed after a cancel timeout

Posted by Till Rohrmann <tr...@apache.org>.
Glad to hear that your job data was not lost!

Cheers,
Till


Re: Savepoint incomplete when job was killed after a cancel timeout

Posted by Paul Lam <pa...@gmail.com>.
Hi Till,

Thanks a lot for the pointer! I tried to restore the job using the
savepoint in a dry run, and it worked!

I guess I had misunderstood the configuration option and was confused by the
non-existent paths that the metadata contains.

Best,
Paul Lam


Re: Savepoint incomplete when job was killed after a cancel timeout

Posted by Till Rohrmann <tr...@apache.org>.
Thanks for sharing the logs with me. It looks as if the total size of the
savepoint is 335 KB for a job with a parallelism of 60 and a total of 120
tasks. Hence, the average state size per task is between 2.5 KB and 5 KB.
I think the state size threshold refers to the size of the per-task state,
so I believe the _metadata file should contain all of your state.
Have you tried restoring from this savepoint?
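
As a rough sketch (not taken from this thread) of how such a check could be done
offline with the State Processor API (the flink-state-processor-api module): load
the savepoint and read one keyed state back. The savepoint path, the operator uid
"my-operator", and the "count" state below are placeholders that would have to
match the job's real uids and state names.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class SavepointDryRun {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Load the savepoint metadata (placeholder paths).
        ExistingSavepoint savepoint = Savepoint.load(
                env,
                "hdfs:///flink/savepoints/savepoint-xyz",
                new FsStateBackend("hdfs:///flink/tmp"));

        // Read one keyed state back out and count the keys that come back.
        DataSet<Long> values = savepoint.readKeyedState("my-operator", new CountReader());
        System.out.println("keys read back from savepoint: " + values.count());
    }

    /** Reads a Long-valued keyed state registered under the name "count". */
    static class CountReader extends KeyedStateReaderFunction<String, Long> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Types.LONG));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<Long> out) throws Exception {
            out.collect(count.value());
        }
    }
}

If the expected keys come back, the state inlined in _metadata is intact.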

Cheers,
Till

On Tue, Sep 29, 2020 at 3:47 PM Paul Lam <pa...@gmail.com> wrote:

> Hi Till,
>
> Thanks for your quick reply.
>
> The checkpoint/savepoint size would be around 2MB, which is larger than
> `state.backend.fs.memory-threshold`.
>
> The JobManager logs are attached; they look normal to me.
>
> Thanks again!
>
> Best,
> Paul Lam

Re: Savepoint incomplete when job was killed after a cancel timeout

Posted by Till Rohrmann <tr...@apache.org>.
Hi Paul,

Could you share the JobManager logs with us? They might help us
better understand the order in which the operations occurred.

How big do you expect the state to be? If it is smaller than
state.backend.fs.memory-threshold, the state data will be stored
in the _metadata file.
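
For illustration, that threshold is state.backend.fs.memory-threshold in
flink-conf.yaml; it can also be set per backend, as in the sketch below (the
value and the path are made up):

import java.net.URI;

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MemoryThresholdSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Any per-task state handle smaller than this threshold is serialized
        // inline into the checkpoint/savepoint _metadata file instead of being
        // written out as a separate state file, which is why a small savepoint
        // can legitimately consist of _metadata alone. 64 KB is a made-up value.
        int fileStateSizeThreshold = 64 * 1024;
        env.setStateBackend(new FsStateBackend(
                new URI("hdfs:///flink/checkpoints"), fileStateSizeThreshold));

        // ... rest of the job ...
        env.fromElements(1, 2, 3).print();
        env.execute("memory-threshold-sketch");
    }
}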

Cheers,
Till
