Posted to issues@flink.apache.org by "BoWang (JIRA)" <ji...@apache.org> on 2019/04/17 06:50:00 UTC
[jira] [Comment Edited] (FLINK-10644) Batch Job: Speculative execution

    [ https://issues.apache.org/jira/browse/FLINK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819774#comment-16819774 ] 

BoWang edited comment on FLINK-10644 at 4/17/19 6:49 AM:
---------------------------------------------------------

Thanks [~greghogan] for the comments.

1) We use several rules to identify long-tail tasks. The process slow ratio is one of the terms considered; its default value is 2.0, based on production experience.

2) Currently the shuffle data indeed cannot be consumed multiple times. We are working on a fix for this in FLINK-12070, after which all the tasks of the job could be speculatively executed.


was (Author: eaglewatcher):
Thanks [~greghogan] for the comments.

1) we use several rules to judge the long tail tasks, and the process slow ratio one of the concerned terms, the default value of which is 2.0 according to the production experience.

2) Currently the shuffle data indeed can not be consumed multiple times, we are working on this to fix in [FLINK-12070|https://issues.apache.org/jira/browse/FLINK-12070], and all the tasks of the job could be speculatively executed.

> Batch Job: Speculative execution
> --------------------------------
>
>                 Key: FLINK-10644
>                 URL: https://issues.apache.org/jira/browse/FLINK-10644
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: JIN SUN
>            Assignee: BoWang
>            Priority: Major
>
> Stragglers/outliers are tasks that run slower than most of the other tasks in a batch job. They increase job latency, because a straggler usually sits on the critical path of the job and becomes the bottleneck.
> Tasks may be slow for various reasons, including hardware degradation, software misconfiguration, or noisy neighbors. It is hard for the JM to predict the runtime.
> To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative execution_*. Speculative execution is a health-check procedure that looks for tasks to speculate, i.e. tasks in an ExecutionJobVertex that run slower than the median of all successfully completed tasks in that EJV. Such slow tasks are re-submitted to another TM. The slow task is not stopped; a new copy runs in parallel, and once one of the copies completes, the others are killed.
> This JIRA is an umbrella for applying this idea in Flink. Details will be appended later.
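
For illustration only, the median-based check from the description, combined with the slow ratio mentioned in the comment, could be sketched as below. This is a minimal sketch under assumptions, not the actual Flink implementation: the class and method names (StragglerDetector, median, findStragglers) are hypothetical, it reads the slow ratio as "elapsed time of a running task divided by the median duration of completed tasks", and it covers only this one rule out of the several the comment mentions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Flink's API.
final class StragglerDetector {

    // Median duration of the successfully completed tasks of one vertex.
    static double median(long[] durations) {
        if (durations.length == 0) {
            throw new IllegalArgumentException("no completed tasks to compare against");
        }
        long[] sorted = durations.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Returns the indices of running tasks whose elapsed time exceeds
    // slowRatio * median(completed durations). The interpretation of the
    // slow ratio is an assumption made for this sketch.
    static List<Integer> findStragglers(long[] completedMillis,
                                        long[] runningElapsedMillis,
                                        double slowRatio) {
        double threshold = slowRatio * median(completedMillis);
        List<Integer> stragglers = new ArrayList<>();
        for (int i = 0; i < runningElapsedMillis.length; i++) {
            if (runningElapsedMillis[i] > threshold) {
                stragglers.add(i);
            }
        }
        return stragglers;
    }
}
```

With completed durations of 90, 100 and 110 ms and a slow ratio of 2.0, the threshold is 200 ms, so a running task at 150 ms is left alone while one at 250 ms would be a candidate for a speculative copy.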



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)