Posted to issues@flink.apache.org by "makeyang (JIRA)" <ji...@apache.org> on 2018/04/20 13:50:00 UTC
[jira] [Created] (FLINK-9228) log details about task fail/task manager is shutting down

makeyang created FLINK-9228:
-------------------------------

             Summary: log details about task fail/task manager is shutting down
                 Key: FLINK-9228
                 URL: https://issues.apache.org/jira/browse/FLINK-9228
             Project: Flink
          Issue Type: Improvement
          Components: Logging
    Affects Versions: 1.4.2
            Reporter: makeyang
            Assignee: makeyang
             Fix For: 1.4.3, 1.5.1


Environment:

Flink version: 1.4.2

JDK version: 1.8.0.20

Linux version: 3.10.0

Problem description:

One of my task managers dropped out of the cluster. I checked its log and found the following:
2018-04-19 22:34:47,441 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44).
2018-04-19 22:34:47,441 INFO  org.apache.flink.runtime.taskmanager.Task                     - Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager is shutting down.
        at org.apache.flink.runtime.taskmanager.TaskManager.postStop(TaskManager.scala:220)
        at akka.actor.Actor$class.aroundPostStop(Actor.scala:515)
        at org.apache.flink.runtime.taskmanager.TaskManager.aroundPostStop(TaskManager.scala:121)
        at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
        at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
        at akka.actor.ActorCell.terminate(ActorCell.scala:374)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:467)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

Suggestions:
 # Short-term suggestions:
 ## Log the reason why a task fails. Did it receive an event from the job manager? Could it not connect to the job manager? Was there an operator exception? The clearer, the better.
 ## Log the reason why a task manager is shutting down. Did it receive an event from the job manager? Could it not connect to the job manager? Was there an operator exception that could not be recovered from?
 # Long-term suggestions:
 ## Define the state machine of a Flink node clearly. If no event occurs, the node should stay in its current state: a node that is processing events should keep processing them. In other words, if its state changes from processing events to canceled, some event must have occurred.
 ## Clearly define the events that can cause a node's state to change, such as user cancel, operator exception, heartbeat timeout, etc.
 ## Clearly log each state change and the event that caused it.
 ## Show event details (time, node, event, state change, etc.) in the web UI.
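
To make the long-term suggestions more concrete, here is a minimal sketch of what an explicit, event-driven node state machine with logged transitions might look like. It is only an illustration: NodeStateMachine, State, Event, Transition and the transition table are hypothetical names invented for this example and are not existing Flink classes or APIs. The sketch assumes a node starts in RUNNING and that each listed event moves it to exactly one terminal state.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical sketch, not Flink code: state changes only happen in response to events. */
final class NodeStateMachine {
    enum State { RUNNING, CANCELED, FAILED }
    enum Event { USER_CANCEL, OPERATOR_EXCEPTION, HEARTBEAT_TIMEOUT }

    /** One recorded state change, carrying the details a log line or web UI could show. */
    static final class Transition {
        final Instant time;
        final String node;
        final Event event;
        final String detail;
        final State from;
        final State to;

        Transition(Instant time, String node, Event event, String detail, State from, State to) {
            this.time = time;
            this.node = node;
            this.event = event;
            this.detail = detail;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return String.format("node=%s time=%s event=%s detail=\"%s\" state %s -> %s",
                    node, time, event, detail, from, to);
        }
    }

    private static final Logger LOG = Logger.getLogger(NodeStateMachine.class.getName());

    // Assumed transition table: which state each event leads to from RUNNING.
    private static final Map<Event, State> TARGETS = new EnumMap<>(Event.class);
    static {
        TARGETS.put(Event.USER_CANCEL, State.CANCELED);
        TARGETS.put(Event.OPERATOR_EXCEPTION, State.FAILED);
        TARGETS.put(Event.HEARTBEAT_TIMEOUT, State.FAILED);
    }

    private final String node;
    private final List<Transition> history = new ArrayList<>();
    private State state = State.RUNNING;

    NodeStateMachine(String node) {
        this.node = node;
    }

    synchronized State state() {
        return state;
    }

    /** Returns a copy of all recorded transitions, oldest first. */
    synchronized List<Transition> history() {
        return new ArrayList<>(history);
    }

    /** Applies an event. Returns true if the state changed, false if the event was ignored. */
    synchronized boolean handle(Event event, String detail) {
        if (state != State.RUNNING) {
            // Terminal states are final in this sketch; still log the event so nothing is silent.
            LOG.info(String.format("node=%s ignored event=%s in state %s", node, event, state));
            return false;
        }
        Transition t = new Transition(Instant.now(), node, event, detail, state, TARGETS.get(event));
        history.add(t);
        state = t.to;
        LOG.info(t.toString()); // the log line names both the state change and its cause
        return true;
    }

    public static void main(String[] args) {
        NodeStateMachine tm = new NodeStateMachine("taskmanager-1");
        tm.handle(Event.HEARTBEAT_TIMEOUT, "no heartbeat from job manager");
        System.out.println("final state: " + tm.state());
    }
}
```

With a structure like this, a message such as "TaskManager is shutting down" would always be accompanied by the event that triggered it, and the recorded history could back an event view in the web UI.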



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)