Posted to user@flink.apache.org by Konstantin Knauf <ko...@tngtech.com> on 2017/02/28 15:06:52 UTC

Flink 1.2 Jobmanager OOME - CheckpointCoordinators

Hi everyone,

I am currently running a small Flink job locally, which checkpoints
every 100ms.

After a few minutes the JM crashes with an OOME. In the heap dump I can
see that a TimerTask holds references to all completed
CheckpointCoordinators. I assume this task is supposed to clean these
checkpoints up eventually.

First, is this the expected behaviour? Second, is there a configuration
option to trigger this cleanup timer earlier?

Cheers,

Konstantin

-- 
Konstantin Knauf * konstantin.knauf@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082

Re: Flink 1.2 Jobmanager OOME - CheckpointCoordinators

Posted by Konstantin Knauf <ko...@tngtech.com>.

Hi Ufuk,

thanks for looking into it. I have shared the heap dump with you (link
in a separate e-mail). Additionally, I have attached two screenshots of
the dump.

I was actually wrong in my original e-mail: I overlooked the "$1" in the
class name. It really seems that it's just the TimerTasks created in
CheckpointCoordinator:453. With a checkpoint interval of 100ms this
means 600 checkpoints per minute, so with the default timeout of 10
minutes 6000 checkpoints accumulate in the JobManager until the first
TimerTasks (which hold a reference to the checkpoint) expire. After
roughly 4500 checkpoints the OOME happens.
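
As a rough check of the arithmetic above, here is a minimal sketch that
estimates how many timeout timers are alive at once if none of them is
cancelled before it expires. The class and method names are hypothetical
and only illustrate the calculation; the 10 minute value is the default
timeout mentioned in Ufuk's reply.

```java
import java.time.Duration;

// Back-of-the-envelope estimate; class and method names are hypothetical.
final class PendingTimerEstimate {

    // One timer is created per checkpoint and, if never cancelled, lives for
    // the full timeout. So about (window / interval) timers exist at once.
    static long timersAlive(Duration checkpointInterval, Duration window) {
        return window.toMillis() / checkpointInterval.toMillis();
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMillis(100); // interval from the report
        Duration timeout = Duration.ofMinutes(10);  // default timeout per the thread

        // Checkpoints triggered per minute.
        System.out.println(timersAlive(interval, Duration.ofMinutes(1))); // 600
        // Timers accumulated before the first one expires.
        System.out.println(timersAlive(interval, timeout));               // 6000
    }
}
```

The observed OOME at roughly 4500 checkpoints falls below the 6000 mark,
which is consistent with no timer having expired yet when the heap ran
out.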

From my understanding, this timer should be deleted as soon as the
checkpoint is completed.

Cheers,

Konstantin


On 28.02.2017 18:16, Ufuk Celebi wrote:
> @Konstantin: Could you share a relevant part of the heap dump just to
> get a second look?
> 
> The timer tasks are responsible for aborting the checkpoint if a
> checkpoint timeout occurs. You can decrease the timeout via the
> CheckpointConfig
> (env.getCheckpointConfig().setCheckpointTimeout(long)); the current
> default is 10 mins.
> 
> On a first skim of the checkpoint coordinator code I didn't see
> anything that cancels these tasks when the checkpoint is fully ack'd.
> @Stephan: I think we should do that. What do you think?
> 

Re: Flink 1.2 Jobmanager OOME - CheckpointCoordinators

Posted by Ufuk Celebi <uc...@apache.org>.

@Konstantin: Could you share a relevant part of the heap dump just to
get a second look?

The timer tasks are responsible for aborting the checkpoint if a
checkpoint timeout occurs. You can decrease the timeout via the
CheckpointConfig
(env.getCheckpointConfig().setCheckpointTimeout(long)); the current
default is 10 mins.
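
For illustration, a minimal configuration sketch of lowering the timeout
with the call named above. It assumes the Flink streaming API is on the
classpath; the class name and the 10 second value are made up for the
example and are not a recommendation.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job skeleton; only the two checkpoint settings matter here.
final class ShortCheckpointTimeout {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 100 ms, as in the original report.
        env.enableCheckpointing(100L);

        // Abort checkpoints that have not completed after 10 s
        // (illustrative value, in milliseconds).
        env.getCheckpointConfig().setCheckpointTimeout(10_000L);

        // ... define sources, operators and sinks, then call env.execute(...)
    }
}
```

With these numbers roughly 10 000 / 100 = 100 timeout tasks would be
alive at any time instead of 6000. The trade-off is that any checkpoint
taking longer than the timeout is aborted.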

On a first skim of the checkpoint coordinator code I didn't see
anything that cancels these tasks when the checkpoint is fully ack'd.
@Stephan: I think we should do that. What do you think?
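
The idea of cancelling the timeout task on completion could be sketched
roughly as follows. This is not Flink's CheckpointCoordinator code; the
class TimeoutTracker and its methods are hypothetical, and it only uses
java.util.Timer and java.util.TimerTask from the JDK.

```java
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a coordinator; not Flink's actual implementation.
final class TimeoutTracker {
    private final Timer timer = new Timer("checkpoint-timeouts", true); // daemon
    private final Map<Long, TimerTask> cancellers = new ConcurrentHashMap<>();

    // Schedule the abort-on-timeout task when a checkpoint is triggered.
    void onTriggered(long checkpointId, long timeoutMillis, Runnable abortAction) {
        TimerTask canceller = new TimerTask() {
            @Override
            public void run() {
                // Abort only if the checkpoint is still pending.
                if (cancellers.remove(checkpointId) != null) {
                    abortAction.run();
                }
            }
        };
        cancellers.put(checkpointId, canceller);
        timer.schedule(canceller, timeoutMillis);
    }

    // Cancel the timeout task once the checkpoint is fully acknowledged.
    // Returns true if a still-pending timeout task was cancelled.
    boolean onCompleted(long checkpointId) {
        TimerTask canceller = cancellers.remove(checkpointId);
        if (canceller == null) {
            return false; // unknown id, already completed, or already timed out
        }
        boolean cancelled = canceller.cancel();
        // cancel() only marks the task; purge() removes it from the timer's
        // queue so the task and whatever it references can be collected.
        timer.purge();
        return cancelled;
    }

    int pendingTimeouts() {
        return cancellers.size();
    }

    void shutdown() {
        timer.cancel();
    }
}
```

Calling onTriggered(1, 600_000, ...) followed by onCompleted(1) leaves no
pending timeout behind. Note that TimerTask.cancel() alone keeps the task
in the Timer's queue until its scheduled time, which is why the sketch
also calls purge(). Purging on every completion is linear in the queue
size, so a real implementation might purge less often or use a
ScheduledThreadPoolExecutor with setRemoveOnCancelPolicy(true) instead.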
