You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Alberto Mancini <ab...@gmail.com> on 2018/05/25 11:41:37 UTC

Timers and Checkpoints

Hello,
I think we are experiencing this issue:
https://issues.apache.org/jira/browse/FLINK-6291

In fact we have a long running job that is unable to complete a checkpoint
and so we are unable to create a savepoint.

I do not really understand from 6291 how the timer service has been
removed in my job and mostly i do not find how i can let my job to create a
savepoint.
We are using flink 1.3.2.

Thanks,
   Alberto.

Re: Timers and Checkpoints

Posted by Andrea Spina <an...@radicalbit.io>.
Hi everybody,
I think I'm in the same issue above described in
https://issues.apache.org/jira/browse/FLINK-6291 . Flink1-6.4
I have had this savepoint with a timer service belonging to a process
function. When I restore a new job w/o the former process function ti fails
in the following way.
What is a valuable workaround for this?

        at
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:641)
        at
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:586)
        at
org.apache.flink.streaming.runtime.io.BarrierBuffer.notifyCheckpoint(BarrierBuffer.java:396)
        at
org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:292)
        at
org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:200)
        at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:209)
        at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
        at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
        at
org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.snapshotTimersForKeyGroup(InternalTimerServiceImpl.java:264)
        at
org.apache.flink.streaming.api.operators.InternalTimerServiceSerializationProxy.write(InternalTimerServiceSerializationProxy.java:90)
        at
org.apache.flink.streaming.api.operators.InternalTimeServiceManager.snapshotStateForKeyGroup(InternalTimeServiceManager.java:139)
        at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:452)
        ... 15 more

Really thank you,
Andrea

Il giorno sab 26 mag 2018 alle ore 00:11 Alberto Mancini <
ab.mancini@gmail.com> ha scritto:

> Hello Timo,
> we found that the problem was not related to a timer but to an hardware
> issue in the production system.
> On the other hand the NPE exception in HeapInternalTimerService in the
> testing system was due to
> the fact the savepoint was created by a different version of the
> application; definitely not our day.
>
> BTW, the application used to create the savepoint uses actually a
> processFunction
> (with timers) replacing a flatMap with the same uid so makes sense that
> the got the same result of FLINK-6291
> <https://issues.apache.org/jira/browse/FLINK-6291>.
>
>
> Thanks,
>   A.
>
>
>
> On Fri, May 25, 2018 at 3:19 PM Alberto Mancini <ab...@gmail.com>
> wrote:
>
>> Hello Timo,
>> thanks for the response.
>>
>> We are still investigating in the production system but in test we get
>> now this exception that seems  very much related to the issue 6291.
>>
>>
>> java.lang.Exception: Could not perform checkpoint 13468 for operator Aggregator -> Sink: HBase (1/1).
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:552)
>> 	at org.apache.flink.streaming.runtime.io.BarrierBuffer.notifyCheckpoint(BarrierBuffer.java:378)
>> 	at org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:281)
>> 	at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:183)
>> 	at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:277)
>> 	at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:91)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>> 	at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.Exception: Could not complete snapshot 13468 for operator Aggregator -> Sink: HBase (1/1).
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:407)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1162)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1094)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:654)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:590)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:543)
>> 	... 8 more
>> Caused by: java.lang.Exception: Could not write timer service of Aggregator -> Sink: HBase (1/1) to checkpoint state stream.
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:438)
>> 	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:98)
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:385)
>> 	... 13 more
>> Caused by: java.lang.NullPointerException
>> 	at org.apache.flink.streaming.api.operators.HeapInternalTimerService.snapshotTimersForKeyGroup(HeapInternalTimerService.java:304)
>> 	at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.snapshotStateForKeyGroup(InternalTimeServiceManager.java:121)
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:434)
>> 	... 15 more
>>
>>
>>
>> On Fri, May 25, 2018 at 3:11 PM Timo Walther <tw...@apache.org> wrote:
>>
>>> Hi Alberto,
>>>
>>> do you get exactly the same exception? Maybe you can share some logs
>>> with us?
>>>
>>> Regards,
>>> Timo
>>>
>>> Am 25.05.18 um 13:41 schrieb Alberto Mancini:
>>> > Hello,
>>> > I think we are experiencing this issue:
>>> > https://issues.apache.org/jira/browse/FLINK-6291
>>> >
>>> > In fact we have a long running job that is unable to complete a
>>> > checkpoint and so we are unable to create a savepoint.
>>> >
>>> > I do not really understand from 6291 how the timer service has been
>>> > removed in my job and mostly i do not find how i can let my job to
>>> > create a savepoint.
>>> > We are using flink 1.3.2.
>>> >
>>> > Thanks,
>>> >    Alberto.
>>> >
>>>
>>>

-- 
*Andrea Spina*
Head of R&D @ Radicalbit Srl
Via Giovanni Battista Pirelli 11, 20124, Milano - IT

Re: Timers and Checkpoints

Posted by Alberto Mancini <ab...@gmail.com>.
Hello Timo,
we found that the problem was not related to a timer but to an hardware
issue in the production system.
On the other hand the NPE exception in HeapInternalTimerService in the
testing system was due to
the fact the savepoint was created by a different version of the
application; definitely not our day.

BTW, the application used to create the savepoint uses actually a
processFunction
(with timers) replacing a flatMap with the same uid so makes sense that the
got the same result of FLINK-6291
<https://issues.apache.org/jira/browse/FLINK-6291>.


Thanks,
  A.



On Fri, May 25, 2018 at 3:19 PM Alberto Mancini <ab...@gmail.com>
wrote:

> Hello Timo,
> thanks for the response.
>
> We are still investigating in the production system but in test we get now
> this exception that seems  very much related to the issue 6291.
>
>
> java.lang.Exception: Could not perform checkpoint 13468 for operator Aggregator -> Sink: HBase (1/1).
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:552)
> 	at org.apache.flink.streaming.runtime.io.BarrierBuffer.notifyCheckpoint(BarrierBuffer.java:378)
> 	at org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:281)
> 	at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:183)
> 	at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:277)
> 	at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:91)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Could not complete snapshot 13468 for operator Aggregator -> Sink: HBase (1/1).
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:407)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1162)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1094)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:654)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:590)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:543)
> 	... 8 more
> Caused by: java.lang.Exception: Could not write timer service of Aggregator -> Sink: HBase (1/1) to checkpoint state stream.
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:438)
> 	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:98)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:385)
> 	... 13 more
> Caused by: java.lang.NullPointerException
> 	at org.apache.flink.streaming.api.operators.HeapInternalTimerService.snapshotTimersForKeyGroup(HeapInternalTimerService.java:304)
> 	at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.snapshotStateForKeyGroup(InternalTimeServiceManager.java:121)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:434)
> 	... 15 more
>
>
>
> On Fri, May 25, 2018 at 3:11 PM Timo Walther <tw...@apache.org> wrote:
>
>> Hi Alberto,
>>
>> do you get exactly the same exception? Maybe you can share some logs
>> with us?
>>
>> Regards,
>> Timo
>>
>> Am 25.05.18 um 13:41 schrieb Alberto Mancini:
>> > Hello,
>> > I think we are experiencing this issue:
>> > https://issues.apache.org/jira/browse/FLINK-6291
>> >
>> > In fact we have a long running job that is unable to complete a
>> > checkpoint and so we are unable to create a savepoint.
>> >
>> > I do not really understand from 6291 how the timer service has been
>> > removed in my job and mostly i do not find how i can let my job to
>> > create a savepoint.
>> > We are using flink 1.3.2.
>> >
>> > Thanks,
>> >    Alberto.
>> >
>>
>>

Re: Timers and Checkpoints

Posted by Alberto Mancini <ab...@gmail.com>.
Hello Timo,
thanks for the response.

We are still investigating in the production system but in test we get now
this exception that seems  very much related to the issue 6291.


java.lang.Exception: Could not perform checkpoint 13468 for operator
Aggregator -> Sink: HBase (1/1).
	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:552)
	at org.apache.flink.streaming.runtime.io.BarrierBuffer.notifyCheckpoint(BarrierBuffer.java:378)
	at org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:281)
	at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:183)
	at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:277)
	at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:91)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Could not complete snapshot 13468 for
operator Aggregator -> Sink: HBase (1/1).
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:407)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1162)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1094)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:654)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:590)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:543)
	... 8 more
Caused by: java.lang.Exception: Could not write timer service of
Aggregator -> Sink: HBase (1/1) to checkpoint state stream.
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:438)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:98)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:385)
	... 13 more
Caused by: java.lang.NullPointerException
	at org.apache.flink.streaming.api.operators.HeapInternalTimerService.snapshotTimersForKeyGroup(HeapInternalTimerService.java:304)
	at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.snapshotStateForKeyGroup(InternalTimeServiceManager.java:121)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:434)
	... 15 more



On Fri, May 25, 2018 at 3:11 PM Timo Walther <tw...@apache.org> wrote:

> Hi Alberto,
>
> do you get exactly the same exception? Maybe you can share some logs
> with us?
>
> Regards,
> Timo
>
> Am 25.05.18 um 13:41 schrieb Alberto Mancini:
> > Hello,
> > I think we are experiencing this issue:
> > https://issues.apache.org/jira/browse/FLINK-6291
> >
> > In fact we have a long running job that is unable to complete a
> > checkpoint and so we are unable to create a savepoint.
> >
> > I do not really understand from 6291 how the timer service has been
> > removed in my job and mostly i do not find how i can let my job to
> > create a savepoint.
> > We are using flink 1.3.2.
> >
> > Thanks,
> >    Alberto.
> >
>
>

Re: Timers and Checkpoints

Posted by Timo Walther <tw...@apache.org>.
Hi Alberto,

do you get exactly the same exception? Maybe you can share some logs 
with us?

Regards,
Timo

Am 25.05.18 um 13:41 schrieb Alberto Mancini:
> Hello,
> I think we are experiencing this issue:
> https://issues.apache.org/jira/browse/FLINK-6291
>
> In fact we have a long running job that is unable to complete a 
> checkpoint and so we are unable to create a savepoint.
>
> I do not really understand from 6291 how the timer service has been  
> removed in my job and mostly i do not find how i can let my job to 
> create a savepoint.
> We are using flink 1.3.2.
>
> Thanks,
>    Alberto.
>