You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Binh Nguyen Van <bi...@gmail.com> on 2020/10/05 06:53:51 UTC

NPE when checkpointing

Hi,

I have a streaming job that is written in Apache Beam and uses Flink as its
runner. The job is working as expected for about 15 hours and then it
started to have checkpointing error. The error message looks like this

java.lang.Exception: Could not perform checkpoint 910 for operator
Source: <source-name> (8/60).
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
    ... 11 more

When this happened, I have to stop the job and then start it again, and
then 15 hours later the issue happens again.

Here are some additional information

   - Flink version is 1.10.1
   - Job reads data from Kafka, transform, and then writes to Kafka
   - There are 6 tasks with the parallelism of 60 each (each task reads
   from 1 Kafka topic)
   - The job is deployed to run on YARN with 60 task managers and each task
   manager has 1 slot
   - The State backend is filesystem and HDFS is the storage (Doesn’t seem
   to related to the type of state backend since the issue also happened when
   I use memory as the state backend)
   - The checkpointing interval is 60 seconds (The longest duration of the
   normal checkpoint as shown in Flink UI is 14 seconds)
   - The minimum pause between checkpoints is 30 seconds
   - Hadoop cluster is Kerberized but Kafka is not. Keytab and principal
   are set in the Flink configuration file

Can someone please help?

Thanks
-Binh

Re: NPE when checkpointing

Posted by Piotr Nowojski <pn...@apache.org>.

No worries, thanks for the update! It's good to hear that it worked for you.

Best regards,
Piotrek

wt., 13 paź 2020 o 22:43 Binh Nguyen Van <bi...@gmail.com> napisał(a):

> Hi,
>
> Sorry for the late reply. It took me quite a while to change the JDK
> version to reproduce the issue. I confirmed that if I upgrade to a newer
> JDK version (I tried with JDK 1.8.0_265) the issue doesn’t happen.
>
> Thank you for helping
> -Binh
>
> On Fri, Oct 9, 2020 at 11:36 AM Piotr Nowojski <pn...@apache.org>
> wrote:
>
>> Hi Binh,
>>
>> Could you try upgrading Flink's Java runtime? It was previously reported
>> that upgrading to jdk1.8.0_251 was solving the problem.
>>
>> Piotrek
>>
>> pt., 9 paź 2020 o 19:41 Binh Nguyen Van <bi...@gmail.com> napisał(a):
>>
>>> Hi,
>>>
>>> Thank you for helping me!
>>> The code is compiled on
>>>
>>> java version "1.8.0_161"
>>> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
>>>
>>> But I just checked our Hadoop and its Java version is
>>>
>>> java version "1.8.0_77"
>>> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
>>>
>>> Thanks
>>> -Binh
>>>
>>> On Fri, Oct 9, 2020 at 10:23 AM Piotr Nowojski <pn...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> One more thing. It looks like it's not a Flink issue, but some JDK bug.
>>>> Others reported that upgrading JDK version (for example to  jdk1.8.0_251)
>>>> seemed to be solving this problem. What JDK version are you using?
>>>>
>>>> Piotrek
>>>>
>>>> pt., 9 paź 2020 o 17:59 Piotr Nowojski <pn...@apache.org>
>>>> napisał(a):
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting the problem. I think this is a known issue [1] on
>>>>> which we are working to fix.
>>>>>
>>>>> Piotrek
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-18196
>>>>>
>>>>> pon., 5 paź 2020 o 08:54 Binh Nguyen Van <bi...@gmail.com>
>>>>> napisał(a):
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a streaming job that is written in Apache Beam and uses Flink
>>>>>> as its runner. The job is working as expected for about 15 hours and then
>>>>>> it started to have checkpointing error. The error message looks like this
>>>>>>
>>>>>> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>>>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: java.lang.NullPointerException
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>>>>>     ... 11 more
>>>>>>
>>>>>> When this happened, I have to stop the job and then start it again,
>>>>>> and then 15 hours later the issue happens again.
>>>>>>
>>>>>> Here are some additional information
>>>>>>
>>>>>>    - Flink version is 1.10.1
>>>>>>    - Job reads data from Kafka, transform, and then writes to Kafka
>>>>>>    - There are 6 tasks with the parallelism of 60 each (each task
>>>>>>    reads from 1 Kafka topic)
>>>>>>    - The job is deployed to run on YARN with 60 task managers and
>>>>>>    each task manager has 1 slot
>>>>>>    - The State backend is filesystem and HDFS is the storage
>>>>>>    (Doesn’t seem to related to the type of state backend since the issue also
>>>>>>    happened when I use memory as the state backend)
>>>>>>    - The checkpointing interval is 60 seconds (The longest duration
>>>>>>    of the normal checkpoint as shown in Flink UI is 14 seconds)
>>>>>>    - The minimum pause between checkpoints is 30 seconds
>>>>>>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and
>>>>>>    principal are set in the Flink configuration file
>>>>>>
>>>>>> Can someone please help?
>>>>>>
>>>>>> Thanks
>>>>>> -Binh
>>>>>>
>>>>>

Re: NPE when checkpointing

Posted by Binh Nguyen Van <bi...@gmail.com>.

Hi,

Sorry for the late reply. It took me quite a while to change the JDK
version to reproduce the issue. I confirmed that if I upgrade to a newer
JDK version (I tried with JDK 1.8.0_265) the issue doesn’t happen.

Thank you for helping
-Binh

On Fri, Oct 9, 2020 at 11:36 AM Piotr Nowojski <pn...@apache.org> wrote:

> Hi Binh,
>
> Could you try upgrading Flink's Java runtime? It was previously reported
> that upgrading to jdk1.8.0_251 was solving the problem.
>
> Piotrek
>
> pt., 9 paź 2020 o 19:41 Binh Nguyen Van <bi...@gmail.com> napisał(a):
>
>> Hi,
>>
>> Thank you for helping me!
>> The code is compiled on
>>
>> java version "1.8.0_161"
>> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
>>
>> But I just checked our Hadoop and its Java version is
>>
>> java version "1.8.0_77"
>> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
>>
>> Thanks
>> -Binh
>>
>> On Fri, Oct 9, 2020 at 10:23 AM Piotr Nowojski <pn...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> One more thing. It looks like it's not a Flink issue, but some JDK bug.
>>> Others reported that upgrading JDK version (for example to  jdk1.8.0_251)
>>> seemed to be solving this problem. What JDK version are you using?
>>>
>>> Piotrek
>>>
>>> pt., 9 paź 2020 o 17:59 Piotr Nowojski <pn...@apache.org>
>>> napisał(a):
>>>
>>>> Hi,
>>>>
>>>> Thanks for reporting the problem. I think this is a known issue [1] on
>>>> which we are working to fix.
>>>>
>>>> Piotrek
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-18196
>>>>
>>>> pon., 5 paź 2020 o 08:54 Binh Nguyen Van <bi...@gmail.com>
>>>> napisał(a):
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a streaming job that is written in Apache Beam and uses Flink
>>>>> as its runner. The job is working as expected for about 15 hours and then
>>>>> it started to have checkpointing error. The error message looks like this
>>>>>
>>>>> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>> Caused by: java.lang.NullPointerException
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>>>>     ... 11 more
>>>>>
>>>>> When this happened, I have to stop the job and then start it again,
>>>>> and then 15 hours later the issue happens again.
>>>>>
>>>>> Here are some additional information
>>>>>
>>>>>    - Flink version is 1.10.1
>>>>>    - Job reads data from Kafka, transform, and then writes to Kafka
>>>>>    - There are 6 tasks with the parallelism of 60 each (each task
>>>>>    reads from 1 Kafka topic)
>>>>>    - The job is deployed to run on YARN with 60 task managers and
>>>>>    each task manager has 1 slot
>>>>>    - The State backend is filesystem and HDFS is the storage (Doesn’t
>>>>>    seem to related to the type of state backend since the issue also happened
>>>>>    when I use memory as the state backend)
>>>>>    - The checkpointing interval is 60 seconds (The longest duration
>>>>>    of the normal checkpoint as shown in Flink UI is 14 seconds)
>>>>>    - The minimum pause between checkpoints is 30 seconds
>>>>>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and
>>>>>    principal are set in the Flink configuration file
>>>>>
>>>>> Can someone please help?
>>>>>
>>>>> Thanks
>>>>> -Binh
>>>>>
>>>>

Re: NPE when checkpointing

Posted by Piotr Nowojski <pn...@apache.org>.

Hi Binh,

Could you try upgrading Flink's Java runtime? It was previously reported
that upgrading to jdk1.8.0_251 was solving the problem.

Piotrek

pt., 9 paź 2020 o 19:41 Binh Nguyen Van <bi...@gmail.com> napisał(a):

> Hi,
>
> Thank you for helping me!
> The code is compiled on
>
> java version "1.8.0_161"
> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
>
> But I just checked our Hadoop and its Java version is
>
> java version "1.8.0_77"
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
>
> Thanks
> -Binh
>
> On Fri, Oct 9, 2020 at 10:23 AM Piotr Nowojski <pn...@apache.org>
> wrote:
>
>> Hi,
>>
>> One more thing. It looks like it's not a Flink issue, but some JDK bug.
>> Others reported that upgrading JDK version (for example to  jdk1.8.0_251)
>> seemed to be solving this problem. What JDK version are you using?
>>
>> Piotrek
>>
>> pt., 9 paź 2020 o 17:59 Piotr Nowojski <pn...@apache.org> napisał(a):
>>
>>> Hi,
>>>
>>> Thanks for reporting the problem. I think this is a known issue [1] on
>>> which we are working to fix.
>>>
>>> Piotrek
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-18196
>>>
>>> pon., 5 paź 2020 o 08:54 Binh Nguyen Van <bi...@gmail.com>
>>> napisał(a):
>>>
>>>> Hi,
>>>>
>>>> I have a streaming job that is written in Apache Beam and uses Flink as
>>>> its runner. The job is working as expected for about 15 hours and then it
>>>> started to have checkpointing error. The error message looks like this
>>>>
>>>> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> Caused by: java.lang.NullPointerException
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>>>     ... 11 more
>>>>
>>>> When this happened, I have to stop the job and then start it again, and
>>>> then 15 hours later the issue happens again.
>>>>
>>>> Here are some additional information
>>>>
>>>>    - Flink version is 1.10.1
>>>>    - Job reads data from Kafka, transform, and then writes to Kafka
>>>>    - There are 6 tasks with the parallelism of 60 each (each task
>>>>    reads from 1 Kafka topic)
>>>>    - The job is deployed to run on YARN with 60 task managers and each
>>>>    task manager has 1 slot
>>>>    - The State backend is filesystem and HDFS is the storage (Doesn’t
>>>>    seem to related to the type of state backend since the issue also happened
>>>>    when I use memory as the state backend)
>>>>    - The checkpointing interval is 60 seconds (The longest duration of
>>>>    the normal checkpoint as shown in Flink UI is 14 seconds)
>>>>    - The minimum pause between checkpoints is 30 seconds
>>>>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and
>>>>    principal are set in the Flink configuration file
>>>>
>>>> Can someone please help?
>>>>
>>>> Thanks
>>>> -Binh
>>>>
>>>

Re: NPE when checkpointing

Posted by Binh Nguyen Van <bi...@gmail.com>.

Hi,

Thank you for helping me!
The code is compiled on

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

But I just checked our Hadoop and its Java version is

java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)

Thanks
-Binh

On Fri, Oct 9, 2020 at 10:23 AM Piotr Nowojski <pn...@apache.org> wrote:

> Hi,
>
> One more thing. It looks like it's not a Flink issue, but some JDK bug.
> Others reported that upgrading JDK version (for example to  jdk1.8.0_251)
> seemed to be solving this problem. What JDK version are you using?
>
> Piotrek
>
> pt., 9 paź 2020 o 17:59 Piotr Nowojski <pn...@apache.org> napisał(a):
>
>> Hi,
>>
>> Thanks for reporting the problem. I think this is a known issue [1] on
>> which we are working to fix.
>>
>> Piotrek
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-18196
>>
>> pon., 5 paź 2020 o 08:54 Binh Nguyen Van <bi...@gmail.com> napisał(a):
>>
>>> Hi,
>>>
>>> I have a streaming job that is written in Apache Beam and uses Flink as
>>> its runner. The job is working as expected for about 15 hours and then it
>>> started to have checkpointing error. The error message looks like this
>>>
>>> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.lang.NullPointerException
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>>     ... 11 more
>>>
>>> When this happened, I have to stop the job and then start it again, and
>>> then 15 hours later the issue happens again.
>>>
>>> Here are some additional information
>>>
>>>    - Flink version is 1.10.1
>>>    - Job reads data from Kafka, transform, and then writes to Kafka
>>>    - There are 6 tasks with the parallelism of 60 each (each task reads
>>>    from 1 Kafka topic)
>>>    - The job is deployed to run on YARN with 60 task managers and each
>>>    task manager has 1 slot
>>>    - The State backend is filesystem and HDFS is the storage (Doesn’t
>>>    seem to related to the type of state backend since the issue also happened
>>>    when I use memory as the state backend)
>>>    - The checkpointing interval is 60 seconds (The longest duration of
>>>    the normal checkpoint as shown in Flink UI is 14 seconds)
>>>    - The minimum pause between checkpoints is 30 seconds
>>>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and
>>>    principal are set in the Flink configuration file
>>>
>>> Can someone please help?
>>>
>>> Thanks
>>> -Binh
>>>
>>

Re: NPE when checkpointing

Posted by Piotr Nowojski <pn...@apache.org>.

Hi,

One more thing. It looks like it's not a Flink issue, but some JDK bug.
Others reported that upgrading JDK version (for example to  jdk1.8.0_251)
seemed to be solving this problem. What JDK version are you using?

Piotrek

pt., 9 paź 2020 o 17:59 Piotr Nowojski <pn...@apache.org> napisał(a):

> Hi,
>
> Thanks for reporting the problem. I think this is a known issue [1] on
> which we are working to fix.
>
> Piotrek
>
> [1] https://issues.apache.org/jira/browse/FLINK-18196
>
> pon., 5 paź 2020 o 08:54 Binh Nguyen Van <bi...@gmail.com> napisał(a):
>
>> Hi,
>>
>> I have a streaming job that is written in Apache Beam and uses Flink as
>> its runner. The job is working as expected for about 15 hours and then it
>> started to have checkpointing error. The error message looks like this
>>
>> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.NullPointerException
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>     ... 11 more
>>
>> When this happened, I have to stop the job and then start it again, and
>> then 15 hours later the issue happens again.
>>
>> Here are some additional information
>>
>>    - Flink version is 1.10.1
>>    - Job reads data from Kafka, transform, and then writes to Kafka
>>    - There are 6 tasks with the parallelism of 60 each (each task reads
>>    from 1 Kafka topic)
>>    - The job is deployed to run on YARN with 60 task managers and each
>>    task manager has 1 slot
>>    - The State backend is filesystem and HDFS is the storage (Doesn’t
>>    seem to related to the type of state backend since the issue also happened
>>    when I use memory as the state backend)
>>    - The checkpointing interval is 60 seconds (The longest duration of
>>    the normal checkpoint as shown in Flink UI is 14 seconds)
>>    - The minimum pause between checkpoints is 30 seconds
>>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and principal
>>    are set in the Flink configuration file
>>
>> Can someone please help?
>>
>> Thanks
>> -Binh
>>
>

Re: NPE when checkpointing

Posted by Piotr Nowojski <pn...@apache.org>.

Hi,

Thanks for reporting the problem. I think this is a known issue [1] on
which we are working to fix.

Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-18196

pon., 5 paź 2020 o 08:54 Binh Nguyen Van <bi...@gmail.com> napisał(a):

> Hi,
>
> I have a streaming job that is written in Apache Beam and uses Flink as
> its runner. The job is working as expected for about 15 hours and then it
> started to have checkpointing error. The error message looks like this
>
> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>     ... 11 more
>
> When this happened, I have to stop the job and then start it again, and
> then 15 hours later the issue happens again.
>
> Here are some additional information
>
>    - Flink version is 1.10.1
>    - Job reads data from Kafka, transform, and then writes to Kafka
>    - There are 6 tasks with the parallelism of 60 each (each task reads
>    from 1 Kafka topic)
>    - The job is deployed to run on YARN with 60 task managers and each
>    task manager has 1 slot
>    - The State backend is filesystem and HDFS is the storage (Doesn’t
>    seem to related to the type of state backend since the issue also happened
>    when I use memory as the state backend)
>    - The checkpointing interval is 60 seconds (The longest duration of
>    the normal checkpoint as shown in Flink UI is 14 seconds)
>    - The minimum pause between checkpoints is 30 seconds
>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and principal
>    are set in the Flink configuration file
>
> Can someone please help?
>
> Thanks
> -Binh
>