You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Yuan Mei (Jira)" <ji...@apache.org> on 2020/12/14 04:27:00 UTC

[jira] [Comment Edited] (FLINK-18983) Job doesn't changed to failed if close function has blocked

    [ https://issues.apache.org/jira/browse/FLINK-18983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248736#comment-17248736 ] 

Yuan Mei edited comment on FLINK-18983 at 12/14/20, 4:26 AM:
-------------------------------------------------------------

Hey [~liuyufei], thanks for reporting! just to confirm some more details:
 # In the case where the task blocks on the close function, if the task fails, the failed task will block until the close function finishes, and then report the failure to the JM. Then JM fails the job accordingly in stead of *"job will block at close action and never change to FAILED."*
 # It is kind of the contract that the close function is called and finished before exiting, hence I think the close function, in general, is not expected to do blocking work. May I ask why you would like to block the close function? And if you indeed want to do this, would you be able to do the timeout in the close function yourself? 
 # Cancelation depends on many different things, so it is reasonable to have a timeout bound.


was (Author: ym):
Hey [~liuyufei], thanks for reporting! just to confirm some more details:
 # In the case where the task blocks on the close function, if the task fails, the failed task will block until the close function finishes, and then report the failure to the JM. Then JM fails the job accordingly.  
 # It is kind of the contract that the close function is called and finished before exiting, hence I think the close function, in general, is not expected to do blocking work. May I ask why you would like to block the close function? And if you indeed want to do this, would you be able to do the timeout in the close function yourself? 
 # Cancelation depends on many different things, so it is reasonable to have a timeout bound.

> Job doesn't changed to failed if close function has blocked
> -----------------------------------------------------------
>
>                 Key: FLINK-18983
>                 URL: https://issues.apache.org/jira/browse/FLINK-18983
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> If a operator throw a exception, it will break process loop and dispose all operator. But state will never switch to FAILED if block in Function.close, and JobMaster can't know the final state and do restart.
> Task have {{TaskCancelerWatchDog}} to kill process if cancellation timeout, but it doesn't work for FAILED task.TAskThread will allways hang at:
> org.apache.flink.streaming.runtime.tasks.StreamTask#cleanUpInvoke
> Test case:
> {code:java}
> Configuration configuration = new Configuration();
> configuration.set(TaskManagerOptions.TASK_CANCELLATION_TIMEOUT, 10000L);
> StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(2, configuration);
> env.addSource(...)
> 	.process(new ProcessFunction<String, String>() {
> 		@Override
> 		public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
> 			if (getRuntimeContext().getIndexOfThisSubtask() == 0) {
> 				throw new RuntimeException();
> 			}
> 		}
> 		@Override
> 		public void close() throws Exception {
> 			if (getRuntimeContext().getIndexOfThisSubtask() == 0) {
> 				Thread.sleep(10000000);
> 			}
> 		}
> 	}).setParallelism(2)
> 	.print();
> {code}
> In this case, job will block at close action and never change to FAILED.
> If change thread which subtaskIndex == 1 to sleep, TM will exit after TASK_CANCELLATION_TIMEOUT.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)