Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2018/11/18 12:39:10 UTC

[jira] [Updated] (FLINK-9228) log details about task fail/task manager is shutting down

     [ https://issues.apache.org/jira/browse/FLINK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9228:
---------------------------------
    Fix Version/s: 1.8.0

> log details about task fail/task manager is shutting down
> ---------------------------------------------------------
>
>                 Key: FLINK-9228
>                 URL: https://issues.apache.org/jira/browse/FLINK-9228
>             Project: Flink
>          Issue Type: Improvement
>          Components: Logging
>    Affects Versions: 1.4.2
>            Reporter: makeyang
>            Assignee: makeyang
>            Priority: Minor
>             Fix For: 1.6.3, 1.7.0, 1.8.0
>
>
> condition:
> flink version:1.4.2
> jdk version:1.8.0.20
> linux version:3.10.0
> problem description:
> One of my task managers dropped out of the cluster. I checked its log and
> found the following:
> 2018-04-19 22:34:47,441 INFO  org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44).
> 2018-04-19 22:34:47,441 INFO  org.apache.flink.runtime.taskmanager.Task - Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44) switched from RUNNING to FAILED.
> java.lang.Exception: TaskManager is shutting down.
>         at org.apache.flink.runtime.taskmanager.TaskManager.postStop(TaskManager.scala:220)
>         at akka.actor.Actor$class.aroundPostStop(Actor.scala:515)
>         at org.apache.flink.runtime.taskmanager.TaskManager.aroundPostStop(TaskManager.scala:121)
>         at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>         at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>         at akka.actor.ActorCell.terminate(ActorCell.scala:374)
>         at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:467)
>         at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)
>         at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>  
> suggestion:
>  # short term suggestion:
>  ## log the reason why a task failed: did it receive an event from the job manager, lose its connection to the job manager, or hit an operator exception? The more detail the better.
>  ## log the reason why the task manager is shutting down: did it receive an event from the job manager, lose its connection to the job manager, or hit an operator exception it could not recover from?
>  # long term suggestion:
>  ## define the state machine of a Flink node clearly. If no event occurs, the node should stay in its current state: a node that is processing events should keep processing events. In other words, if its state changes from processing events to canceled, an event must have caused it.
>  ## define clearly the events that can change a node's state, such as user cancel, operator exception, heartbeat timeout, etc.
>  ## log each state change, together with the event that caused it, clearly in the logs
>  ## show event details (time, node, event, state change, etc.) in the web UI
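
The long-term suggestion above (an explicit state machine, a fixed set of events, and a log line for every state change naming the event that caused it) could be sketched roughly as follows. This is only an illustration: NodeStateMachine, its State and Event enums, and the transition table are all hypothetical and are not Flink classes or Flink's actual task lifecycle. A real implementation would also record a timestamp, which is left out here to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not part of Flink: a node changes state only when
// an event arrives, and every change is logged together with its cause.
final class NodeStateMachine {

    // Assumed, simplified states; Flink's real lifecycle has more.
    public enum State { RUNNING, CANCELING, CANCELED, FAILED }

    // Assumed events, taken from the examples named in the issue.
    public enum Event { USER_CANCEL, OPERATOR_EXCEPTION, HEARTBEAT_TIMEOUT, CANCEL_COMPLETE }

    private final String node;
    private final List<String> log = new ArrayList<>();
    private State state = State.RUNNING;

    public NodeStateMachine(String node) {
        this.node = node;
    }

    // Pure transition table: returns the current state when the event
    // does not apply, so "nothing happens" means "nothing changes".
    public static State transition(State current, Event event) {
        switch (current) {
            case RUNNING:
                switch (event) {
                    case USER_CANCEL:
                        return State.CANCELING;
                    case OPERATOR_EXCEPTION:
                    case HEARTBEAT_TIMEOUT:
                        return State.FAILED;
                    default:
                        return current;
                }
            case CANCELING:
                switch (event) {
                    case CANCEL_COMPLETE:
                        return State.CANCELED;
                    case OPERATOR_EXCEPTION:
                    case HEARTBEAT_TIMEOUT:
                        return State.FAILED;
                    default:
                        return current;
                }
            default:
                // CANCELED and FAILED are terminal in this sketch.
                return current;
        }
    }

    // Applies an event; logs old state, new state, event, and detail
    // only when the state actually changes.
    public State handle(Event event, String detail) {
        State next = transition(state, event);
        if (next != state) {
            log.add(String.format("%s: %s -> %s caused by %s (%s)",
                    node, state, next, event, detail));
            state = next;
        }
        return state;
    }

    public List<String> getLog() {
        return Collections.unmodifiableList(log);
    }

    public static void main(String[] args) {
        NodeStateMachine tm = new NodeStateMachine("tm-1");
        tm.handle(Event.HEARTBEAT_TIMEOUT, "job manager unreachable");
        tm.getLog().forEach(System.out::println);
    }
}
```

With a log line of this shape, the "TaskManager is shutting down" entry in the report would have carried the triggering event next to the RUNNING to FAILED switch, which is the information the reporter was missing.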



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)