Posted to dev@flink.apache.org by 刘建刚 <li...@gmail.com> on 2021/12/13 03:37:54 UTC

Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

Any progress on this feature? We have the same requirement in our company.
Since the software and hardware environments can be complex, it is common to see a
slow task that determines the overall execution time of the Flink job.

<wa...@sina.cn> wrote on Sun, Jun 20, 2021 at 22:35:

> Hi everyone,
>
> I would like to kick off a discussion on speculative execution for batch
> jobs.
> I have created FLIP-168 [1], which clarifies our motivation to do this and
> proposes some improvements for the new design.
> It would be great to resolve the problem of long-tail tasks in batch jobs.
> Please let me know your thoughts. Thanks.
>   Regards,
> wangwj
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

Posted by Zhu Zhu <re...@gmail.com>.
Hi everyone,

Thank you for all the feedback on this FLIP!
I will open a vote for it since there are no more concerns.

Thanks,
Zhu

Zhu Zhu <re...@gmail.com> wrote on Wed, May 11, 2022 at 12:29:
>
> Hi everyone,
>
> According to the discussion and updates of the blocklist
> mechanism[1] (FLIP-224), I have updated FLIP-168 to make
> decision on itself to block identified slow nodes. A new
> configuration is also added to control how long a slow
> node should be blocked.
>
> [1] https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h
>
> Thanks,
> Zhu
>
> Zhu Zhu <re...@gmail.com> 于2022年4月29日周五 14:36写道:
> >
> > Thank you for all the feedback!
> >
> > @Guowei Ma
> > Here's my thoughts for your questions:
> > >> 1. How to judge whether the Execution Vertex belongs to a slow task.
> > If a slow task fails and gets restarted, it may not be a slow task
> > anymore. Especially given that the nodes of the slow task may have been
> > blacklisted and the new task will be deployed to a new node. I think we
> > should again go through the slow task detection process to determine
> > whether it is a slow task. I agree that it is not ideal to take another
> > 59 mins to identify a slow task. To solve this problem, one idea is to
> > introduce a slow task detection strategy which identifies slow tasks
> > according to the throughput. This approach needs more thoughts and
> > experiments so we now target it to a future time.
> >
> > >> 2. The fault tolerance strategy and the Slow task detection strategy are coupled
> > I don't think the fault tolerance and slow task detecting are coupled.
> > If a task fails while the ExecutionVertex still has a task in progress,
> > there is no need to start new executions for the vertex in the perspective
> > of fault tolerance. If the remaining task is slow, in the next slow task
> > detecting, a speculative execution will be created and deployed for it.
> > This, however, is a normal speculative execution process rather than a
> > failure recovery process. In this way, the fault tolerance and slow task
> > detecting work without knowing each other and the job can still recover
> > from failures and guarantee there are speculative executions for slow tasks.
> >
> > >> 3. Default value of `slow-task-detector.execution-time.baseline-lower-bound` is too small
> > From what I see in production and knowing from users, there are many
> > batch jobs of a relatively small scale (a few terabytes, hundreds of
> > gigabytes). Tasks of these jobs can finish in minutes, so that a
> > `1 min` lowbound is large enough. Besides that, I think the out-of-box
> > experience is more important for users running small scale jobs.
> >
> > Thanks,
> > Zhu
> >
> > Guowei Ma <gu...@gmail.com> 于2022年4月28日周四 17:55写道:
> >>
> >> Hi, zhu
> >>
> >> Many thanks to zhuzhu for initiating the FLIP discussion. Overall I think
> >> it's ok, I just have 3 small questions
> >>
> >> 1. How to judge whether the Execution Vertex belongs to a slow task.
> >> The current calculation method is: the current timestamp minus the
> >> timestamp of the execution deployment. If the execution time of this
> >> execution exceeds the baseline, then it is judged as a slow task. Normally
> >> this is no problem. But if an execution fails, the time may not be
> >> accurate. For example, the baseline is 59 minutes, and a task fails after
> >> 56 minutes of execution. In the worst case, it may take an additional 59
> >> minutes to discover that the task is a slow task.
> >>
> >> 2. Speculative Scheduler's fault tolerance strategy.
> >> The strategy in FLIP is: if the Execution Vertex can be executed, even if
> >> the execution fails, the fault tolerance strategy will not be adopted.
> >> Although currently `ExecutionTimeBasedSlowTaskDetector` can restart an
> >> execution. But isn't this dependency a bit too strong? To some extent, the
> >> fault tolerance strategy and the Slow task detection strategy are coupled
> >> together.
> >>
> >>
> >> 3. The value of the default configuration
> >> IMHO, prediction execution should only be required for relatively
> >> large-scale, very time-consuming and long-term jobs.
> >> If `slow-task-detector.execution-time.baseline-lower-bound` is too small,
> >> is it possible for the system to always start some additional tasks that
> >> have little effect? In the end, the user needs to reset this default
> >> configuration. Is it possible to consider a larger configuration. Of
> >> course, this part is best to listen to the suggestions of other community
> >> users.
> >>
> >> Best,
> >> Guowei
> >>
> >>
> >> On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <li...@gmail.com>
> >> wrote:
> >>
> >> > +1 for the feature.
> >> >
> >> > Mang Zhang <zh...@163.com> 于2022年4月28日周四 11:36写道:
> >> >
> >> > > Hi zhu:
> >> > >
> >> > >
> >> > >     This sounds like a great job! Thanks for your great job.
> >> > >     In our company, there are already some jobs using Flink Batch,
> >> > >     but everyone knows that the offline cluster has a lot more load than
> >> > > the online cluster, and the failure rate of the machine is also much
> >> > higher.
> >> > >     If this work is done, we'd love to use it, it's simply awesome for
> >> > our
> >> > > flink users.
> >> > >     thanks again!
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > > Best regards,
> >> > > Mang Zhang
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
> >> > > >Hi everyone,
> >> > > >
> >> > > >More and more users are running their batch jobs on Flink nowadays.
> >> > > >One major problem they encounter is slow tasks running on hot/bad
> >> > > >nodes, resulting in very long and uncontrollable execution time of
> >> > > >batch jobs. This problem is a pain or even unacceptable in
> >> > > >production. Many users have been asking for a solution for it.
> >> > > >
> >> > > >Therefore, I'd like to revive the discussion of speculative
> >> > > >execution to solve this problem.
> >> > > >
> >> > > >Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
> >> > > >discussions to refine the design[1]. We also implemented a PoC[2]
> >> > > >and verified it using TPC-DS benchmarks and production jobs.
> >> > > >
> >> > > >Looking forward to your feedback!
> >> > > >
> >> > > >Thanks,
> >> > > >Zhu
> >> > > >
> >> > > >[1]
> >> > > >
> >> > >
> >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >> > > >[2]
> >> > > https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
> >> > > >
> >> > > >
> >> > > >刘建刚 <li...@gmail.com> 于2021年12月13日周一 11:38写道:
> >> > > >
> >> > > >> Any progress on the feature? We have the same requirement in our
> >> > > company.
> >> > > >> Since the soft and hard environment can be complex, it is normal to
> >> > see
> >> > > a
> >> > > >> slow task which determines the execution time of the flink job.
> >> > > >>
> >> > > >> <wa...@sina.cn> 于2021年6月20日周日 22:35写道:
> >> > > >>
> >> > > >> > Hi everyone,
> >> > > >> >
> >> > > >> > I would like to kick off a discussion on speculative execution for
> >> > > batch
> >> > > >> > job.
> >> > > >> > I have created FLIP-168 [1] that clarifies our motivation to do this
> >> > > and
> >> > > >> > some improvement proposals for the new design.
> >> > > >> > It would be great to resolve the problem of long tail task in batch
> >> > > job.
> >> > > >> > Please let me know your thoughts. Thanks.
> >> > > >> >   Regards,
> >> > > >> > wangwj
> >> > > >> > [1]
> >> > > >> >
> >> > > >>
> >> > >
> >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >> > > >> >
> >> > > >>
> >> > >
> >> >

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

Posted by Zhu Zhu <re...@gmail.com>.
Hi everyone,

Following the discussion and updates of the blocklist
mechanism [1] (FLIP-224), I have updated FLIP-168 so that the
scheduler decides on its own to block identified slow nodes.
A new configuration option is also added to control how long a
slow node should stay blocked.

[1] https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h
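For illustration, here is a minimal sketch of that behavior. The class and
method names are hypothetical and not taken from FLIP-168 or FLIP-224; it
only shows the idea of blocking an identified slow node for a configurable
duration.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    /** Sketch only: the scheduler blocks slow nodes for a configurable duration. */
    class SlowNodeBlockingSketch {

        private final Duration blockDuration; // controlled by the new configuration option
        private final Map<String, Instant> blockedUntil = new HashMap<>();

        SlowNodeBlockingSketch(Duration blockDuration) {
            this.blockDuration = blockDuration;
        }

        /** Called when slow task detection identifies the nodes hosting slow tasks. */
        void blockSlowNodes(Collection<String> slowNodes, Instant now) {
            for (String node : slowNodes) {
                blockedUntil.put(node, now.plus(blockDuration));
            }
        }

        /** New executions should avoid nodes that are still blocked. */
        boolean isBlocked(String node, Instant now) {
            Instant until = blockedUntil.get(node);
            return until != null && !now.isAfter(until);
        }
    }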

Thanks,
Zhu

Zhu Zhu <re...@gmail.com> wrote on Fri, Apr 29, 2022 at 14:36:
>
> Thank you for all the feedback!
>
> @Guowei Ma
> Here's my thoughts for your questions:
> >> 1. How to judge whether the Execution Vertex belongs to a slow task.
> If a slow task fails and gets restarted, it may not be a slow task
> anymore. Especially given that the nodes of the slow task may have been
> blacklisted and the new task will be deployed to a new node. I think we
> should again go through the slow task detection process to determine
> whether it is a slow task. I agree that it is not ideal to take another
> 59 mins to identify a slow task. To solve this problem, one idea is to
> introduce a slow task detection strategy which identifies slow tasks
> according to the throughput. This approach needs more thoughts and
> experiments so we now target it to a future time.
>
> >> 2. The fault tolerance strategy and the Slow task detection strategy are coupled
> I don't think the fault tolerance and slow task detecting are coupled.
> If a task fails while the ExecutionVertex still has a task in progress,
> there is no need to start new executions for the vertex in the perspective
> of fault tolerance. If the remaining task is slow, in the next slow task
> detecting, a speculative execution will be created and deployed for it.
> This, however, is a normal speculative execution process rather than a
> failure recovery process. In this way, the fault tolerance and slow task
> detecting work without knowing each other and the job can still recover
> from failures and guarantee there are speculative executions for slow tasks.
>
> >> 3. Default value of `slow-task-detector.execution-time.baseline-lower-bound` is too small
> From what I see in production and knowing from users, there are many
> batch jobs of a relatively small scale (a few terabytes, hundreds of
> gigabytes). Tasks of these jobs can finish in minutes, so that a
> `1 min` lowbound is large enough. Besides that, I think the out-of-box
> experience is more important for users running small scale jobs.
>
> Thanks,
> Zhu
>
> Guowei Ma <gu...@gmail.com> 于2022年4月28日周四 17:55写道:
>>
>> Hi, zhu
>>
>> Many thanks to zhuzhu for initiating the FLIP discussion. Overall I think
>> it's ok, I just have 3 small questions
>>
>> 1. How to judge whether the Execution Vertex belongs to a slow task.
>> The current calculation method is: the current timestamp minus the
>> timestamp of the execution deployment. If the execution time of this
>> execution exceeds the baseline, then it is judged as a slow task. Normally
>> this is no problem. But if an execution fails, the time may not be
>> accurate. For example, the baseline is 59 minutes, and a task fails after
>> 56 minutes of execution. In the worst case, it may take an additional 59
>> minutes to discover that the task is a slow task.
>>
>> 2. Speculative Scheduler's fault tolerance strategy.
>> The strategy in FLIP is: if the Execution Vertex can be executed, even if
>> the execution fails, the fault tolerance strategy will not be adopted.
>> Although currently `ExecutionTimeBasedSlowTaskDetector` can restart an
>> execution. But isn't this dependency a bit too strong? To some extent, the
>> fault tolerance strategy and the Slow task detection strategy are coupled
>> together.
>>
>>
>> 3. The value of the default configuration
>> IMHO, prediction execution should only be required for relatively
>> large-scale, very time-consuming and long-term jobs.
>> If `slow-task-detector.execution-time.baseline-lower-bound` is too small,
>> is it possible for the system to always start some additional tasks that
>> have little effect? In the end, the user needs to reset this default
>> configuration. Is it possible to consider a larger configuration. Of
>> course, this part is best to listen to the suggestions of other community
>> users.
>>
>> Best,
>> Guowei
>>
>>
>> On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <li...@gmail.com>
>> wrote:
>>
>> > +1 for the feature.
>> >
>> > Mang Zhang <zh...@163.com> 于2022年4月28日周四 11:36写道:
>> >
>> > > Hi zhu:
>> > >
>> > >
>> > >     This sounds like a great job! Thanks for your great job.
>> > >     In our company, there are already some jobs using Flink Batch,
>> > >     but everyone knows that the offline cluster has a lot more load than
>> > > the online cluster, and the failure rate of the machine is also much
>> > higher.
>> > >     If this work is done, we'd love to use it, it's simply awesome for
>> > our
>> > > flink users.
>> > >     thanks again!
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > >
>> > > Best regards,
>> > > Mang Zhang
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
>> > > >Hi everyone,
>> > > >
>> > > >More and more users are running their batch jobs on Flink nowadays.
>> > > >One major problem they encounter is slow tasks running on hot/bad
>> > > >nodes, resulting in very long and uncontrollable execution time of
>> > > >batch jobs. This problem is a pain or even unacceptable in
>> > > >production. Many users have been asking for a solution for it.
>> > > >
>> > > >Therefore, I'd like to revive the discussion of speculative
>> > > >execution to solve this problem.
>> > > >
>> > > >Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
>> > > >discussions to refine the design[1]. We also implemented a PoC[2]
>> > > >and verified it using TPC-DS benchmarks and production jobs.
>> > > >
>> > > >Looking forward to your feedback!
>> > > >
>> > > >Thanks,
>> > > >Zhu
>> > > >
>> > > >[1]
>> > > >
>> > >
>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> > > >[2]
>> > > https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
>> > > >
>> > > >
>> > > >刘建刚 <li...@gmail.com> 于2021年12月13日周一 11:38写道:
>> > > >
>> > > >> Any progress on the feature? We have the same requirement in our
>> > > company.
>> > > >> Since the soft and hard environment can be complex, it is normal to
>> > see
>> > > a
>> > > >> slow task which determines the execution time of the flink job.
>> > > >>
>> > > >> <wa...@sina.cn> 于2021年6月20日周日 22:35写道:
>> > > >>
>> > > >> > Hi everyone,
>> > > >> >
>> > > >> > I would like to kick off a discussion on speculative execution for
>> > > batch
>> > > >> > job.
>> > > >> > I have created FLIP-168 [1] that clarifies our motivation to do this
>> > > and
>> > > >> > some improvement proposals for the new design.
>> > > >> > It would be great to resolve the problem of long tail task in batch
>> > > job.
>> > > >> > Please let me know your thoughts. Thanks.
>> > > >> >   Regards,
>> > > >> > wangwj
>> > > >> > [1]
>> > > >> >
>> > > >>
>> > >
>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> > > >> >
>> > > >>
>> > >
>> >

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

Posted by Zhu Zhu <re...@gmail.com>.
Thank you for all the feedback!

@Guowei Ma
Here are my thoughts on your questions:
>> 1. How to judge whether the Execution Vertex belongs to a slow task.
If a slow task fails and gets restarted, it may not be a slow task
anymore, especially given that the node of the slow task may have been
blacklisted and the new attempt will be deployed to a different node. So I
think the new attempt should again go through the slow task detection
process to determine whether it is slow. I agree that it is not ideal to
take another 59 minutes to identify a slow task. To solve this problem,
one idea is to introduce a slow task detection strategy that identifies
slow tasks according to their throughput. This approach needs more thought
and experiments, so we are targeting it for the future.

>> 2. The fault tolerance strategy and the Slow task detection strategy are
coupled
I don't think fault tolerance and slow task detection are coupled.
If a task fails while the ExecutionVertex still has another task in progress,
there is no need to start new executions for the vertex from the perspective
of fault tolerance. If the remaining task is slow, a speculative execution
will be created and deployed for it in the next round of slow task detection.
This, however, is a normal speculative execution process rather than a
failure recovery process. In this way, fault tolerance and slow task
detection work without knowing about each other, and the job can still recover
from failures while speculative executions are still guaranteed for slow tasks.
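
To make the decoupling concrete, here is a minimal sketch. All names
(Vertex, hasRunningExecution, deploySpeculativeExecution, ...) are
hypothetical and not taken from the FLIP or the Flink code base.

    import java.util.Collection;

    /** Sketch only: fault tolerance and slow task detection stay independent. */
    class SpeculativeSchedulingSketch {

        interface Vertex {
            boolean hasRunningExecution();
        }

        /** Fault tolerance path: restart only if nothing is still in progress. */
        void onTaskFailed(Vertex vertex) {
            if (!vertex.hasRunningExecution()) {
                restart(vertex);
            }
            // Otherwise do nothing here. If the remaining execution turns out to
            // be slow, the next round of slow task detection will handle it.
        }

        /** Slow task detection path: normal speculative execution, not recovery. */
        void onSlowTasksDetected(Collection<Vertex> slowVertices) {
            for (Vertex v : slowVertices) {
                deploySpeculativeExecution(v);
            }
        }

        void restart(Vertex v) { /* omitted */ }
        void deploySpeculativeExecution(Vertex v) { /* omitted */ }
    }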

>> 3. Default value of
`slow-task-detector.execution-time.baseline-lower-bound` is too small
From what I have seen in production and heard from users, there are many
batch jobs of a relatively small scale (a few terabytes or hundreds of
gigabytes). Tasks of these jobs can finish in minutes, so a
`1 min` lower bound is large enough. Besides that, I think the out-of-the-box
experience matters more for users running small-scale jobs.
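
Users who prefer a larger baseline can simply override the option. Below is
a minimal sketch; `slow-task-detector.execution-time.baseline-lower-bound`
is the option discussed in this thread, while
`execution.batch.speculative.enabled` is only an assumed companion switch,
not something stated here.

    import org.apache.flink.configuration.Configuration;

    public class SlowTaskBaselineConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Option discussed in this thread: users running large, long jobs can
            // raise the execution-time baseline lower bound above the 1 min default.
            conf.setString("slow-task-detector.execution-time.baseline-lower-bound", "10 min");
            // Assumed companion switch (not from this thread): speculative execution
            // would be enabled explicitly for batch jobs.
            conf.setString("execution.batch.speculative.enabled", "true");
            System.out.println(
                conf.getString("slow-task-detector.execution-time.baseline-lower-bound", "1 min"));
        }
    }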

Thanks,
Zhu

Guowei Ma <gu...@gmail.com> wrote on Thu, Apr 28, 2022 at 17:55:

> Hi, zhu
>
> Many thanks to zhuzhu for initiating the FLIP discussion. Overall I think
> it's ok, I just have 3 small questions
>
> 1. How to judge whether the Execution Vertex belongs to a slow task.
> The current calculation method is: the current timestamp minus the
> timestamp of the execution deployment. If the execution time of this
> execution exceeds the baseline, then it is judged as a slow task. Normally
> this is no problem. But if an execution fails, the time may not be
> accurate. For example, the baseline is 59 minutes, and a task fails after
> 56 minutes of execution. In the worst case, it may take an additional 59
> minutes to discover that the task is a slow task.
>
> 2. Speculative Scheduler's fault tolerance strategy.
> The strategy in FLIP is: if the Execution Vertex can be executed, even if
> the execution fails, the fault tolerance strategy will not be adopted.
> Although currently `ExecutionTimeBasedSlowTaskDetector` can restart an
> execution. But isn't this dependency a bit too strong? To some extent, the
> fault tolerance strategy and the Slow task detection strategy are coupled
> together.
>
>
> 3. The value of the default configuration
> IMHO, prediction execution should only be required for relatively
> large-scale, very time-consuming and long-term jobs.
> If `slow-task-detector.execution-time.baseline-lower-bound` is too small,
> is it possible for the system to always start some additional tasks that
> have little effect? In the end, the user needs to reset this default
> configuration. Is it possible to consider a larger configuration. Of
> course, this part is best to listen to the suggestions of other community
> users.
>
> Best,
> Guowei
>
>
> On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <li...@gmail.com>
> wrote:
>
> > +1 for the feature.
> >
> > Mang Zhang <zh...@163.com> 于2022年4月28日周四 11:36写道:
> >
> > > Hi zhu:
> > >
> > >
> > >     This sounds like a great job! Thanks for your great job.
> > >     In our company, there are already some jobs using Flink Batch,
> > >     but everyone knows that the offline cluster has a lot more load
> than
> > > the online cluster, and the failure rate of the machine is also much
> > higher.
> > >     If this work is done, we'd love to use it, it's simply awesome for
> > our
> > > flink users.
> > >     thanks again!
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > >
> > > Best regards,
> > > Mang Zhang
> > >
> > >
> > >
> > >
> > >
> > > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
> > > >Hi everyone,
> > > >
> > > >More and more users are running their batch jobs on Flink nowadays.
> > > >One major problem they encounter is slow tasks running on hot/bad
> > > >nodes, resulting in very long and uncontrollable execution time of
> > > >batch jobs. This problem is a pain or even unacceptable in
> > > >production. Many users have been asking for a solution for it.
> > > >
> > > >Therefore, I'd like to revive the discussion of speculative
> > > >execution to solve this problem.
> > > >
> > > >Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
> > > >discussions to refine the design[1]. We also implemented a PoC[2]
> > > >and verified it using TPC-DS benchmarks and production jobs.
> > > >
> > > >Looking forward to your feedback!
> > > >
> > > >Thanks,
> > > >Zhu
> > > >
> > > >[1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > >[2]
> > >
> https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
> > > >
> > > >
> > > >刘建刚 <li...@gmail.com> 于2021年12月13日周一 11:38写道:
> > > >
> > > >> Any progress on the feature? We have the same requirement in our
> > > company.
> > > >> Since the soft and hard environment can be complex, it is normal to
> > see
> > > a
> > > >> slow task which determines the execution time of the flink job.
> > > >>
> > > >> <wa...@sina.cn> 于2021年6月20日周日 22:35写道:
> > > >>
> > > >> > Hi everyone,
> > > >> >
> > > >> > I would like to kick off a discussion on speculative execution for
> > > batch
> > > >> > job.
> > > >> > I have created FLIP-168 [1] that clarifies our motivation to do
> this
> > > and
> > > >> > some improvement proposals for the new design.
> > > >> > It would be great to resolve the problem of long tail task in
> batch
> > > job.
> > > >> > Please let me know your thoughts. Thanks.
> > > >> >   Regards,
> > > >> > wangwj
> > > >> > [1]
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > >> >
> > > >>
> > >
> >
>

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

Posted by Guowei Ma <gu...@gmail.com>.
Hi Zhu,

Many thanks to Zhu Zhu for initiating the FLIP discussion. Overall I think
it's OK; I just have 3 small questions.

1. How to judge whether the Execution Vertex belongs to a slow task.
The current calculation method is: the current timestamp minus the
timestamp of the execution deployment. If the execution time of an
execution exceeds the baseline, it is judged to be a slow task. Normally
this is fine. But if an execution fails, the measured time may not be
accurate. For example, suppose the baseline is 59 minutes and a task fails
after 56 minutes of execution. In the worst case, it may take an additional
59 minutes to discover that the task is slow.
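
For clarity, a minimal sketch of this execution-time check. The helper
names are hypothetical; this is not the actual
`ExecutionTimeBasedSlowTaskDetector` code.

    import java.time.Duration;
    import java.time.Instant;

    /** Sketch only: illustrates the execution-time based check described above. */
    class ExecutionTimeCheckSketch {

        /**
         * An execution is considered slow when (now - deployment timestamp)
         * exceeds the baseline. Note the caveat above: after a failure and
         * restart the deployment timestamp is reset, so in the worst case it
         * can take almost another full baseline to flag the task again.
         */
        static boolean isSlow(Instant deployTime, Instant now, Duration baseline) {
            return Duration.between(deployTime, now).compareTo(baseline) > 0;
        }

        public static void main(String[] args) {
            Instant deployTime = Instant.now().minus(Duration.ofMinutes(60));
            System.out.println(isSlow(deployTime, Instant.now(), Duration.ofMinutes(59))); // true
        }
    }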

2. The speculative scheduler's fault tolerance strategy.
The strategy in the FLIP is: as long as the Execution Vertex can still be
executed, the fault tolerance strategy will not kick in even if one of its
executions fails. Admittedly, `ExecutionTimeBasedSlowTaskDetector` can
currently restart an execution, but isn't this dependency a bit too strong?
To some extent, the fault tolerance strategy and the slow task detection
strategy are coupled together.


3. The value of the default configuration
IMHO, speculative execution should only be needed for relatively
large-scale, time-consuming, long-running jobs.
If `slow-task-detector.execution-time.baseline-lower-bound` is too small,
might the system keep starting additional tasks that have little effect?
In the end, the user would need to change this default configuration
anyway. Could a larger default value be considered? Of course, for this
part it is best to listen to the suggestions of other community users.

Best,
Guowei


On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <li...@gmail.com>
wrote:

> +1 for the feature.
>
> Mang Zhang <zh...@163.com> 于2022年4月28日周四 11:36写道:
>
> > Hi zhu:
> >
> >
> >     This sounds like a great job! Thanks for your great job.
> >     In our company, there are already some jobs using Flink Batch,
> >     but everyone knows that the offline cluster has a lot more load than
> > the online cluster, and the failure rate of the machine is also much
> higher.
> >     If this work is done, we'd love to use it, it's simply awesome for
> our
> > flink users.
> >     thanks again!
> >
> >
> >
> >
> >
> >
> >
> > --
> >
> > Best regards,
> > Mang Zhang
> >
> >
> >
> >
> >
> > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
> > >Hi everyone,
> > >
> > >More and more users are running their batch jobs on Flink nowadays.
> > >One major problem they encounter is slow tasks running on hot/bad
> > >nodes, resulting in very long and uncontrollable execution time of
> > >batch jobs. This problem is a pain or even unacceptable in
> > >production. Many users have been asking for a solution for it.
> > >
> > >Therefore, I'd like to revive the discussion of speculative
> > >execution to solve this problem.
> > >
> > >Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
> > >discussions to refine the design[1]. We also implemented a PoC[2]
> > >and verified it using TPC-DS benchmarks and production jobs.
> > >
> > >Looking forward to your feedback!
> > >
> > >Thanks,
> > >Zhu
> > >
> > >[1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > >[2]
> > https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
> > >
> > >
> > >刘建刚 <li...@gmail.com> 于2021年12月13日周一 11:38写道:
> > >
> > >> Any progress on the feature? We have the same requirement in our
> > company.
> > >> Since the soft and hard environment can be complex, it is normal to
> see
> > a
> > >> slow task which determines the execution time of the flink job.
> > >>
> > >> <wa...@sina.cn> 于2021年6月20日周日 22:35写道:
> > >>
> > >> > Hi everyone,
> > >> >
> > >> > I would like to kick off a discussion on speculative execution for
> > batch
> > >> > job.
> > >> > I have created FLIP-168 [1] that clarifies our motivation to do this
> > and
> > >> > some improvement proposals for the new design.
> > >> > It would be great to resolve the problem of long tail task in batch
> > job.
> > >> > Please let me know your thoughts. Thanks.
> > >> >   Regards,
> > >> > wangwj
> > >> > [1]
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > >> >
> > >>
> >
>

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

Posted by Jiangang Liu <li...@gmail.com>.
+1 for the feature.

Mang Zhang <zh...@163.com> wrote on Thu, Apr 28, 2022 at 11:36:

> Hi zhu:
>
>
>     This sounds like a great job! Thanks for your great job.
>     In our company, there are already some jobs using Flink Batch,
>     but everyone knows that the offline cluster has a lot more load than
> the online cluster, and the failure rate of the machine is also much higher.
>     If this work is done, we'd love to use it, it's simply awesome for our
> flink users.
>     thanks again!
>
>
>
>
>
>
>
> --
>
> Best regards,
> Mang Zhang
>
>
>
>
>
> At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
> >Hi everyone,
> >
> >More and more users are running their batch jobs on Flink nowadays.
> >One major problem they encounter is slow tasks running on hot/bad
> >nodes, resulting in very long and uncontrollable execution time of
> >batch jobs. This problem is a pain or even unacceptable in
> >production. Many users have been asking for a solution for it.
> >
> >Therefore, I'd like to revive the discussion of speculative
> >execution to solve this problem.
> >
> >Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
> >discussions to refine the design[1]. We also implemented a PoC[2]
> >and verified it using TPC-DS benchmarks and production jobs.
> >
> >Looking forward to your feedback!
> >
> >Thanks,
> >Zhu
> >
> >[1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >[2]
> https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
> >
> >
> >刘建刚 <li...@gmail.com> 于2021年12月13日周一 11:38写道:
> >
> >> Any progress on the feature? We have the same requirement in our
> company.
> >> Since the soft and hard environment can be complex, it is normal to see
> a
> >> slow task which determines the execution time of the flink job.
> >>
> >> <wa...@sina.cn> 于2021年6月20日周日 22:35写道:
> >>
> >> > Hi everyone,
> >> >
> >> > I would like to kick off a discussion on speculative execution for
> batch
> >> > job.
> >> > I have created FLIP-168 [1] that clarifies our motivation to do this
> and
> >> > some improvement proposals for the new design.
> >> > It would be great to resolve the problem of long tail task in batch
> job.
> >> > Please let me know your thoughts. Thanks.
> >> >   Regards,
> >> > wangwj
> >> > [1]
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >> >
> >>
>

Re:Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

Posted by Mang Zhang <zh...@163.com>.
Hi zhu:


    This sounds like a great piece of work! Thanks for your effort.
    In our company, there are already some jobs using Flink Batch,
    but everyone knows that the offline cluster carries much more load than the online cluster, and the machine failure rate is also much higher.
    If this work is done, we'd love to use it; it would be simply awesome for our Flink users.
    Thanks again!







--

Best regards,
Mang Zhang





At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
>Hi everyone,
>
>More and more users are running their batch jobs on Flink nowadays.
>One major problem they encounter is slow tasks running on hot/bad
>nodes, resulting in very long and uncontrollable execution time of
>batch jobs. This problem is a pain or even unacceptable in
>production. Many users have been asking for a solution for it.
>
>Therefore, I'd like to revive the discussion of speculative
>execution to solve this problem.
>
>Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
>discussions to refine the design[1]. We also implemented a PoC[2]
>and verified it using TPC-DS benchmarks and production jobs.
>
>Looking forward to your feedback!
>
>Thanks,
>Zhu
>
>[1]
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>[2] https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
>
>
>刘建刚 <li...@gmail.com> 于2021年12月13日周一 11:38写道:
>
>> Any progress on the feature? We have the same requirement in our company.
>> Since the soft and hard environment can be complex, it is normal to see a
>> slow task which determines the execution time of the flink job.
>>
>> <wa...@sina.cn> 于2021年6月20日周日 22:35写道:
>>
>> > Hi everyone,
>> >
>> > I would like to kick off a discussion on speculative execution for batch
>> > job.
>> > I have created FLIP-168 [1] that clarifies our motivation to do this and
>> > some improvement proposals for the new design.
>> > It would be great to resolve the problem of long tail task in batch job.
>> > Please let me know your thoughts. Thanks.
>> >   Regards,
>> > wangwj
>> > [1]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> >
>>

Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

Posted by Zhu Zhu <zh...@apache.org>.
Hi everyone,

More and more users are running their batch jobs on Flink nowadays.
One major problem they encounter is slow tasks running on hot/bad
nodes, resulting in very long and uncontrollable execution times for
batch jobs. This problem is painful, or even unacceptable, in
production. Many users have been asking for a solution to it.

Therefore, I'd like to revive the discussion of speculative
execution to solve this problem.

Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
discussions to refine the design[1]. We also implemented a PoC[2]
and verified it using TPC-DS benchmarks and production jobs.

Looking forward to your feedback!

Thanks,
Zhu

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
[2] https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc


刘建刚 <li...@gmail.com> wrote on Mon, Dec 13, 2021 at 11:38:

> Any progress on the feature? We have the same requirement in our company.
> Since the soft and hard environment can be complex, it is normal to see a
> slow task which determines the execution time of the flink job.
>
> <wa...@sina.cn> 于2021年6月20日周日 22:35写道:
>
> > Hi everyone,
> >
> > I would like to kick off a discussion on speculative execution for batch
> > job.
> > I have created FLIP-168 [1] that clarifies our motivation to do this and
> > some improvement proposals for the new design.
> > It would be great to resolve the problem of long tail task in batch job.
> > Please let me know your thoughts. Thanks.
> >   Regards,
> > wangwj
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >
>