Posted to user@flink.apache.org by Vishal Santoshi <vi...@gmail.com> on 2018/07/09 17:15:46 UTC

1.5 some thing weird

drwxr-xr-x   - root hadoop          0 2018-07-09 12:33
/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-123
drwxr-xr-x   - root hadoop          0 2018-07-09 12:35
/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-124
drwxr-xr-x   - root hadoop          0 2018-07-09 12:51
/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-126
drwxr-xr-x   - root hadoop          0 2018-07-09 12:53
/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-127
drwxr-xr-x   - root hadoop          0 2018-07-09 12:55
/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-128

See the missing chk-125

So I see the above checkpoints for a job. At 2018-07-09 12:38:43 this
exception was thrown:


chk-125 is missing from HDFS, and yet the job complains about it:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e (inode 1987098987): File does not exist. Holder DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.

At about the same time

ID: 125
Failure Time: 12:38:23
Cause: Checkpoint expired before completing.


Is this some race condition? A checkpoint had to be taken, and that was
chk-125; it took longer than the configured time (1 minute). It aborted
the pipeline. Should it have? It did not even create chk-125,
but then refers to it and aborts the pipeline.
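
(For context, a minimal sketch of how such a 1-minute checkpoint interval and timeout are typically configured via the CheckpointConfig; the values below are illustrative assumptions, not this job's actual settings:)

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Trigger a checkpoint once a minute (assumed value, matching the interval described above).
env.enableCheckpointing(60_000);
// Mark a checkpoint as expired if it does not complete within a minute
// (assumed value, matching the "Checkpoint expired before completing" failure above).
env.getCheckpointConfig().setCheckpointTimeout(60_000);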

This is the full exception.

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 125 for operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).}
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1154)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:948)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:885)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 125 for operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:943)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e in order to obtain the stream state handle
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
	at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:47)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:854)
	... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e in order to obtain the stream state handle
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:325)
	at org.apache.flink.runtime.state.CheckpointStreamWithResultProvider$PrimaryStreamOnly.closeAndFinalizeCheckpointStreamResult(CheckpointStreamWithResultProvider.java:77)
	at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:705)
	at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:641)
	at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e (inode 1987098987): File does not exist. Holder DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.

Re: 1.5 some thing weird

Posted by Vishal Santoshi <vi...@gmail.com>.
Will try the setting out. I do not want to push it, but the exception could
be much more descriptive :)

Thanks much

On Tue, Jul 10, 2018 at 7:48 AM, Till Rohrmann <tr...@apache.org> wrote:

> Whether a Flink task should fail in case of a checkpoint error or not can
> be configured via the CheckpointConfig which you can access via the
> StreamExecutionEnvironment. You have to call
> `CheckpointConfig#setFailOnCheckpointingErrors(false)` to deactivate the
> default behaviour where the task always fails in case of a checkpoint error.
>
> Cheers,
> Till

Re: 1.5 some thing weird

Posted by Till Rohrmann <tr...@apache.org>.
Whether a Flink task should fail in case of a checkpoint error or not can
be configured via the CheckpointConfig which you can access via the
StreamExecutionEnvironment. You have to call
`CheckpointConfig#setFailOnCheckpointingErrors(false)` to deactivate the
default behaviour where the task always fails in case of a checkpoint error.
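
For reference, a minimal sketch of where that call sits (the one-minute
checkpoint interval here is only an illustrative assumption, not something
taken from this thread):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Illustrative checkpoint interval of one minute.
env.enableCheckpointing(60_000);
// Decline a failed checkpoint instead of failing the task, so the job keeps
// running and a later checkpoint attempt can still succeed.
env.getCheckpointConfig().setFailOnCheckpointingErrors(false);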

Cheers,
Till

On Tue, Jul 10, 2018 at 10:50 AM Vishal Santoshi <vi...@gmail.com>
wrote:

> That makes sense, what does not make sense is that the pipeline restarted.
> I would have imagined that an aborted chk point would not abort the
> pipeline.

Re: 1.5 some thing weird

Posted by Vishal Santoshi <vi...@gmail.com>.
That makes sense; what does not make sense is that the pipeline restarted.
I would have imagined that an aborted checkpoint would not abort the
pipeline.

On Tue, Jul 10, 2018 at 3:16 AM, Till Rohrmann <tr...@apache.org> wrote:

> Hi Vishal,
>
> it looks as if the flushing of the checkpoint data to HDFS failed due to
> some expired lease on the checkpoint file. Therefore, Flink aborted the
> checkpoint `chk-125` and removed it. This is the normal behaviour if Flink
> cannot complete a checkpoint. As you can see, afterwards, the checkpoints
> are again successful.
>
> Cheers,
> Till

Re: 1.5 some thing weird

Posted by Till Rohrmann <tr...@apache.org>.
Hi Vishal,

it looks as if the flushing of the checkpoint data to HDFS failed due to
some expired lease on the checkpoint file. Therefore, Flink aborted the
checkpoint `chk-125` and removed it. This is the normal behaviour if Flink
cannot complete a checkpoint. As you can see, afterwards, the checkpoints
are again successful.

Cheers,
Till
