Posted to user@flink.apache.org by Navneeth Krishnan <re...@gmail.com> on 2020/01/03 00:23:09 UTC

Checkpoints issue and job failing

Hi All,

We are running into checkpoint timeouts more and more frequently in production,
and we also see the exception below. We are running Flink 1.4.0, and the
checkpoints are saved on NFS. Can someone suggest how to overcome this?

[image: image.png]

java.lang.IllegalStateException: Could not initialize operator state backend.
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: /mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01 (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)


Thanks

Re: Checkpoints issue and job failing

Posted by Navneeth Krishnan <re...@gmail.com>.
Thanks Vino & Piotr,

Sure, I will upgrade the Flink version and monitor it to see if the problem
still exists.

Thanks

On Mon, Jan 6, 2020 at 12:39 AM Piotr Nowojski <pi...@ververica.com> wrote:

> Hi,
>
> Off the top of my head I don’t remember anything in particular; however,
> release 1.4.0 came with quite a lot of deep changes, which had their fair
> share of bugs that were subsequently fixed in later releases.
>
> Because the 1.4.x tree is no longer supported, I would strongly recommend
> first upgrading to a more recent Flink version. If that’s not possible, I
> would at least upgrade to the latest release from the 1.4.x tree (1.4.2).
>
> Piotrek

Re: Checkpoints issue and job failing

Posted by Piotr Nowojski <pi...@ververica.com>.
Hi,

Off the top of my head I don’t remember anything in particular; however, release 1.4.0 came with quite a lot of deep changes, which had their fair share of bugs that were subsequently fixed in later releases.

Because the 1.4.x tree is no longer supported, I would strongly recommend first upgrading to a more recent Flink version. If that’s not possible, I would at least upgrade to the latest release from the 1.4.x tree (1.4.2).

Piotrek

> On 6 Jan 2020, at 07:25, vino yang <ya...@gmail.com> wrote:
> 
> Hi Navneeth,
> 
> Since the file still exists, this exception is very strange.
> 
> Does it happen only occasionally, or frequently?
> 
> Another concern is that version 1.4 is quite old, so maintenance and responses for it are not as timely as for recent versions. I personally recommend upgrading as soon as possible.
> 
> I can ping @Piotr Nowojski <pi...@ververica.com> to see whether the cause of this problem can be explained.
> 
> Best,
> Vino


Re: Checkpoints issue and job failing

Posted by vino yang <ya...@gmail.com>.
Hi Navneeth,

Since the file still exists, this exception is very strange.

Does it happen only occasionally, or frequently?

Another concern is that version 1.4 is quite old, so maintenance and
responses for it are not as timely as for recent versions. I
personally recommend upgrading as soon as possible.

I can ping @Piotr Nowojski <pi...@ververica.com> to see whether the cause
of this problem can be explained.

Best,
Vino

Navneeth Krishnan <re...@gmail.com> wrote on Sat, Jan 4, 2020 at 1:03 AM:

> Thanks Congxian & Vino.
>
> Yes, the file does exist, and I don't see any problem accessing it.
>
> Regarding Flink 1.9, we haven't migrated yet, but we are planning to.
> Since we have to test it, it might take some time.
>
> Thanks

Re: Checkpoints issue and job failing

Posted by Navneeth Krishnan <re...@gmail.com>.
Thanks Congxian & Vino.

Yes, the file does exist, and I don't see any problem accessing it.

Regarding Flink 1.9, we haven't migrated yet, but we are planning to.
Since we have to test it, it might take some time.

Thanks

On Fri, Jan 3, 2020 at 2:14 AM Congxian Qiu <qc...@gmail.com> wrote:

> Hi
>
> Have you checked whether this problem also exists on Flink 1.9?
>
> Best,
> Congxian

Re: Checkpoints issue and job failing

Posted by Congxian Qiu <qc...@gmail.com>.
Hi

Have you checked whether this problem also exists on Flink 1.9?

Best,
Congxian


vino yang <ya...@gmail.com> wrote on Fri, Jan 3, 2020 at 3:54 PM:

> Hi Navneeth,
>
> Did you check whether the path contained in the exception really cannot
> be found?
>
> Best,
> Vino

Re: Checkpoints issue and job failing

Posted by vino yang <ya...@gmail.com>.
Hi Navneeth,

Did you check whether the path contained in the exception really cannot
be found?

Best,
Vino
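
The check asked about above could be scripted roughly as follows. This is only a sketch and not something posted in the thread: the class name CheckpointPathProbe and its Result values are hypothetical, and it uses nothing but the JDK. The stack trace shows the read going through LocalDataInputStream, so the file has to be visible as a local path on whichever machine runs the restoring task. One possible explanation, which the thread does not confirm, is that the file is visible on one host but not on another, so it may be worth running the probe on every TaskManager host rather than on a single machine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reports whether a checkpoint file can be opened
// from the machine this program runs on.
public class CheckpointPathProbe {

    /** Outcome of probing one file from the local machine. */
    public enum Result { READABLE, MISSING, NOT_READABLE }

    public static Result probe(Path file) {
        // Files.exists reports what this machine sees, which may differ
        // from what another host sees on a shared mount.
        if (!Files.exists(file)) {
            return Result.MISSING;
        }
        // Open the file and read one byte, so that a file that is listed
        // but cannot actually be read is reported separately.
        try (InputStream in = Files.newInputStream(file)) {
            in.read();
            return Result.READABLE;
        } catch (IOException e) {
            return Result.NOT_READABLE;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java CheckpointPathProbe <file> [<file> ...]");
            return;
        }
        for (String arg : args) {
            Path file = Paths.get(arg);
            System.out.println(probe(file) + "  " + file);
        }
    }
}
```

Passing the path from the FileNotFoundException as the argument on each host gives one line per file, which makes it easy to compare hosts side by side.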
