Posted to user@flink.apache.org by John Smith <ja...@gmail.com> on 2022/09/23 15:26:12 UTC

Why is task manager shutting down?

Hi, I have attached the logs here:

https://www.dropbox.com/s/12gwlps52lvxdhz/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0

1- It looks like a timeout issue. Can someone confirm?
2- The task manager is restarted because I have restart-on-failure configured
in systemd, but after a few restarts it is no longer restarted. Does that mean
systemd keeps an internal count of how many times it will restart a service
before it gives up?

Re: Why is task manager shutting down?

Posted by Congxian Qiu <qc...@gmail.com>.
Hi,

You can increase the timeout with the configuration key
`task.cancellation.timeout`[1]. The code that implements this logic is here[2].

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#task-cancellation-timeout
[2]
https://github.com/apache/flink/blob/f543b8ac690b1dee58bc3cb345a1c8ad0db0941e/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1775
Best,
Congxian
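
For context, the watchdog behind the "did not exit gracefully" message can be pictured roughly as below. This is a simplified, JDK-only sketch and not Flink's actual code: the names `CancellationWatchdogSketch`, `exitedWithin` and `startTask` are made up for illustration, and the 500 ms value stands in for `task.cancellation.timeout` (180 seconds in the logs in this thread). The idea is that the canceller waits for the task thread up to the timeout and, if the thread is still alive afterwards, escalates to a fatal error.

```java
import java.util.concurrent.TimeUnit;

public class CancellationWatchdogSketch {

    /** Waits up to timeoutMillis for the thread to finish; returns true if it exited in time. */
    static boolean exitedWithin(Thread taskThread, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (taskThread.isAlive()) {
            long leftMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (leftMillis <= 0) {
                break; // timeout used up; note that join(0) would wait forever
            }
            taskThread.join(leftMillis);
        }
        return !taskThread.isAlive();
    }

    /** Hypothetical stand-in for a task: a daemon thread that is busy for runMillis. */
    static Thread startTask(long runMillis) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(runMillis);
            } catch (InterruptedException ignored) {
                // a well-behaved task would stop here
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        long timeoutMillis = 500; // stand-in for task.cancellation.timeout
        System.out.println("short task exited in time: " + exitedWithin(startTask(50), timeoutMillis));
        if (!exitedWithin(startTask(60_000), timeoutMillis)) {
            // This is the point at which the real watchdog reports a fatal error.
            System.out.println("Task did not exit within " + timeoutMillis + " ms.");
        }
    }
}
```

Raising the timeout only lengthens this wait. A task that never returns from a blocking call would still run into it eventually.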



Re: Why is task manager shutting down?

Posted by John Smith <ja...@gmail.com>.
Sorry, I meant the 180 seconds. Where does Flink decide that 180 seconds is
the cutoff point, and can I increase it?


Re: Why is task manager shutting down?

Posted by John Smith <ja...@gmail.com>.
Is there a way to increase the 30 seconds to 60? Where is that 30-second
timeout set?

I have a JDBC query timeout set, but at some point during the night the insert
takes a bit longer because of index rebuilding.


Re: Why is task manager shutting down?

Posted by Congxian Qiu <qc...@gmail.com>.
Hi John,

Yes, the whole TaskManager exited because the task did not react to the
cancelling signal in time:

```
2022-08-30 09:14:22,138 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Task did not exit gracefully within 180 + seconds.
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
	at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791) [flink-dist_2.12-1.14.4.jar:1.14.4]
	at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
2022-08-30 09:14:22,139 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - Fatal error occurred while executing the TaskManager. Shutting it down...
```


The stack of the sink task was logged as follows while it was being cancelled:

```
2022-08-30 09:14:22,135 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - Task 'Sink: jdbc (1/1)#359' did not react to cancelling signal - notifying TM; it is stuck for 180 seconds in method:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2023)
com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6418)
com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7579)
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:592)
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2979)
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.execute(SQLServerPreparedStatement.java:505)
com.xxxxxx.common.flink.connectors.jdbc.xxxxxxJdbcJsonOutputFormat.flush(xxxxxxJdbcJsonOutputFormat.java:111)
com.xxxxxx.common.flink.connectors.jdbc.xxxxxxJdbcJsonSink.snapshotState(xxxxxxJdbcJsonSink.java:33)
```


Best,
Congxian
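
The stack above ends in `SocketInputStream.socketRead0`, which suggests the sink thread was blocked in a socket read, waiting for the database to answer. One way to keep such a read from blocking indefinitely is a socket-level read timeout. The sketch below shows only the mechanism, using plain `java.net` sockets. It does not use the SQL Server driver, and `BoundedSocketRead` and `readTimesOut` are hypothetical names. Whether and how a given JDBC driver exposes an equivalent setting is something to check in that driver's documentation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class BoundedSocketRead {

    /** Returns true if the read gave up with a SocketTimeoutException. */
    static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        // Local server socket that lets the client connect but never sends any data,
        // standing in for a database that is slow to respond.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.setSoTimeout(soTimeoutMillis); // bound every blocking read on this socket
            InputStream in = client.getInputStream();
            try {
                in.read(); // without SO_TIMEOUT this would block until data arrives
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the read returned control to the caller
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a bound like this, the blocked thread gets control back and can notice a cancellation, instead of staying inside the read until the remote side answers.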



Re: Why is task manager shutting down?

Posted by John Smith <ja...@gmail.com>.
Sorry, new file:
https://www.dropbox.com/s/mm9521crwvevzgl/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0
