Posted to user@flink.apache.org by "Yan Zhou [FDS Science]" <yz...@coupang.com> on 2018/05/30 06:04:52 UTC

Task did not exit gracefully and lost TaskManager

Hi,

When I stop my Flink application in a standalone cluster, one of the tasks does NOT exit gracefully, and the task managers are then lost (or detached?). I can no longer see them in the web UI, yet the TaskManager processes are still running on the slave servers.


What could be the possible cause? My application runs an over-window aggregation on a datastream table, and the results are written to a custom MySQL sink that uses org.apache.commons.dbcp.BasicDataSource. The close methods are called for the PreparedStatement, Connection, and BasicDataSource within AbstractRichFunction::close().


Could it be that the MySQL JDBC driver doesn't handle interrupts properly? Should I call PreparedStatement::cancel()? I found a similar issue here [1]. Thank you for your help.



[1] : https://stackoverflow.com/questions/40127228/flink-cannot-cancel-a-running-job-streaming
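On the PreparedStatement::cancel() idea: MySQL Connector/J can sit in a blocking socket read that ignores Thread.interrupt(), so close() alone may hang the task thread; cancelling the statement first is the usual remedy. A minimal sketch of that close ordering, assuming a hypothetical helper class (JdbcCloser and its method name are not part of any Flink or JDBC API):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public final class JdbcCloser {

    // Cancel a possibly in-flight statement before closing it. A driver
    // blocked in a socket read may not react to Thread.interrupt(), whereas
    // Statement.cancel() asks the server to abort the query and unblocks
    // the caller. Each step is best-effort so one failure does not prevent
    // the remaining resources (and, afterwards, the pooled BasicDataSource)
    // from being closed.
    public static void cancelAndClose(PreparedStatement stmt, Connection conn) {
        if (stmt != null) {
            try { stmt.cancel(); } catch (SQLException ignored) { /* best effort */ }
            try { stmt.close(); } catch (SQLException ignored) { /* best effort */ }
        }
        if (conn != null) {
            try { conn.close(); } catch (SQLException ignored) { /* best effort */ }
        }
    }

    public static void main(String[] args) {
        cancelAndClose(null, null); // null-safe no-op
        System.out.println("closed");
    }
}
```

Calling something like this from AbstractRichFunction::close() preserves the cancel-before-close ordering; whether cancel() actually interrupts a running query depends on the driver and server, so treat that as driver-specific behavior to verify.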


Best

Yan




Re: Task did not exit gracefully and lost TaskManager

Posted by "Yan Zhou [FDS Science]" <yz...@coupang.com>.
Here is the exception and error log:


2018-05-29 14:41:04,762 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'over: (PARTITION BY: uid, ORDER BY: proctime, RANGEBETWEEN 86400000 PRECEDING AND CURRENT ROW, select: (id, uid, proctime, group_concat($7) AS w0$o0)) -> select: (id, uid, proctime, w0$o0 AS EXPR$3) -> to: Row -> Flat Map -> Filter -> Sink: Unnamed (10/15)' did not react to cancelling signal for 30 seconds, but is stuck in method:
 org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:199)
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:103)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
java.lang.Thread.run(Thread.java:748)

...
...

2018-05-29 14:43:34,663 ERROR org.apache.flink.runtime.taskmanager.Task                     - Task did not exit gracefully within 180 + seconds.
2018-05-29 14:43:34,663 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor            - Task did not exit gracefully within 180 + seconds.
2018-05-29 14:43:34,663 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error occurred while executing the TaskManager. Shutting it down...
2018-05-29 14:43:34,666 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping TaskExecutor akka.tcp://flink@fds-hadoop-prod04-mp:35187/user/taskmanager_0.
2018-05-29 14:43:34,669 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.

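The 30-second and 180-second figures in the log correspond to Flink's task cancellation watchdog: after each `task.cancellation.interval` the TaskManager re-interrupts the stuck task and logs its stack trace, and after `task.cancellation.timeout` it treats the task as fatally stuck and kills the whole TaskManager process, which is why the TaskManagers disappear from the web UI. As a stopgap while debugging the sink, these can be adjusted in flink-conf.yaml; a sketch, assuming the default values shown:

```yaml
# flink-conf.yaml: task cancellation watchdog settings
task.cancellation.interval: 30000    # ms between repeated interrupt attempts (default)
task.cancellation.timeout: 180000    # ms before the TaskManager is killed; 0 disables the watchdog (default 180000)
```

Raising or disabling the timeout only hides the symptom; the stuck close() in the sink is still the underlying problem.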


Re: Task did not exit gracefully and lost TaskManager

Posted by makeyang <ri...@hotmail.com>.
I met the same problem in 1.4: when I cancel the job, one of the TaskManagers keeps logging the exception.


