Posted to user-zh@flink.apache.org by Congxian Qiu <qc...@gmail.com> on 2022/09/07 09:24:33 UTC

Re: Flink job fails to create a savepoint

Hi

If you have the JobManager log from the time the savepoint/checkpoint failed, along with the
TaskManager log for the failed tasks, please post them so everyone can help take a look.

Best,
Congxian


On Tue, Aug 30, 2022 at 23:18, Xuyang <xy...@163.com> wrote:

>
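> Hi, it looks like this error means the file used for output could not be found. You could try adding the "taskmanager.log.path" configuration and running again, to find the root cause of the task timeouts.
> You could also use a flame graph or jstack to see which method those tasks are stuck in when they time out.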
>
> --
>
>     Best!
>     Xuyang
>
> At 2022-08-29 16:19:15, "casel.chen" <ca...@126.com> wrote:
>
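> >A production Flink job failed when we manually triggered a savepoint. The job has two operators, one reading from Kafka and one writing to MongoDB, both with a parallelism of 48. After the failure, we saw that of the 48 tasks of the MongoDB-writing operator, 45 had completed and the other 3 had timed out (the timeout is set to 3 minutes). Normally a checkpoint completes in 4 seconds, and the state size is only 23.7 KB. The job log after the failure is shown below. After the savepoint failed, the job's periodic checkpoints also all failed (with 3 tasks timing out in each operator). We use FileStateBackend, and the DFS is Alibaba Cloud OSS. What could be causing this?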
> >
> >
> >2022-08-29 15:38:32,617 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerStdoutFileHandler [] - Failed to transfer file from TaskExecutor sqrc-session-prod-taskmanager-1-30.
> >java.util.concurrent.CompletionException: org.apache.flink.util.FlinkException: The file STDOUT does not exist on the TaskExecutor.
> >    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$requestFileUploadByFilePath$24(TaskExecutor.java:2064) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
> >    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_312]
> >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_312]
> >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_312]
> >    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312]
> >Caused by: org.apache.flink.util.FlinkException: The file STDOUT does not exist on the TaskExecutor.
> >    ... 5 more
> >2022-08-29 15:38:32,617 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerStdoutFileHandler [] - Unhandled exception.
> >org.apache.flink.util.FlinkException: The file STDOUT does not exist on the TaskExecutor.
> >    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$requestFileUploadByFilePath$24(TaskExecutor.java:2064) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
> >    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_312]
> >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_312]
> >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_312]
> >    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312]
>