Posted to user@flink.apache.org by Talat Uyarer via user <us...@flink.apache.org> on 2023/04/11 00:37:12 UTC

Task Failure Strategy for Adaptive Scheduler

Hi All,

We use Flink 1.13 with reactive mode for our streaming jobs. When we have
an issue/exception in our pipeline, Flink reschedules all tasks. Is there
any way to reschedule only the task that had the exception?

Thanks

Re: Task Failure Strategy for Adaptive Scheduler

Posted by Talat Uyarer via user <us...@flink.apache.org>.

Hi David,

Yes, we have multiple disjoint DAGs in one job. We want better CPU
utilization. Open source Flink has a scheduling issue with those types of
jobs. I made a fix on 1.13 with the AS, and now we schedule evenly across
all DAGs. However, when we get an exception in one DAG, we don't want to
affect the others; we want to restart only the DAG that hit the exception.
I believe the pipelined region model that the default scheduler has is a
good fit for this.

Do you have anything in mind for handling restarts, other than pipelined
regions?

Thanks




Re: Task Failure Strategy for Adaptive Scheduler

Posted by David Morávek <da...@gmail.com>.

> Our DAG has multiple sources which are not connected to each other.

To double-check, are you saying the job consists of multiple disjoint DAGs?

> Do you think somehow the adaptive scheduler supports region pipelines
> for streaming jobs ?

It's doable but might not be straightforward since the AS recycles
ExecutionGraph during restart. It has been a low priority so far because
it's mainly valuable for batch jobs, but we might reconsider it if there
are enough use cases.

Best,
D.


Re: Task Failure Strategy for Adaptive Scheduler

Posted by Talat Uyarer via user <us...@flink.apache.org>.

Thanks David and others.

Our DAG has multiple sources which are not connected to each other. If one
of them fails, I believe Flink can restart a single region with the default
scheduler, but that is not the case with the adaptive scheduler. Do you
think the adaptive scheduler could somehow support pipelined regions for
streaming jobs? If it can, I believe local recovery would make sense at
that point.

Thanks



Re: Task Failure Strategy for Adaptive Scheduler

Posted by David Morávek <da...@gmail.com>.

Hi Talat,

For most streaming pipelines, we have to restart the whole pipeline no
matter the scheduler used because they're a single pipelined region. One
limitation of AdaptiveScheduler is the lack of support for local recovery.
This will be addressed in Flink 1.18 [1].

[1] https://issues.apache.org/jira/browse/FLINK-21450

Best,
D.
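
Editor's sketch (not part of the original message): the point about
pipelined regions can be illustrated with a small, self-contained Java
example. It assumes that every edge between tasks is pipelined, which is
the usual case for streaming jobs, so a region is simply a connected
component of the job graph. The class name, method names, and task names
below are hypothetical, and this is not Flink's actual failover code; it
only shows why a fully connected streaming job restarts as a whole, while a
job made of disjoint DAGs could in principle restart one DAG at a time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: NOT Flink's failover implementation. */
class RegionFailoverSketch {

    /** Follows parent pointers to the representative task of a region. */
    static String find(Map<String, String> parent, String task) {
        String root = task;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        return root;
    }

    /**
     * Returns every task in the same region as the failed task.
     * Assumption: all edges are pipelined, so tasks connected by any
     * chain of edges belong to one region and restart together.
     */
    static Set<String> tasksToRestart(
            List<String> tasks, List<String[]> edges, String failed) {
        Map<String, String> parent = new HashMap<>();
        for (String task : tasks) {
            parent.put(task, task); // each task starts in its own region
        }
        for (String[] edge : edges) {
            // merge the regions of the two endpoints of the edge
            parent.put(find(parent, edge[0]), find(parent, edge[1]));
        }
        String failedRoot = find(parent, failed);
        Set<String> result = new TreeSet<>();
        for (String task : tasks) {
            if (find(parent, task).equals(failedRoot)) {
                result.add(task);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two disjoint DAGs in one job: A (3 tasks) and B (2 tasks).
        List<String> tasks =
                List.of("sourceA", "mapA", "sinkA", "sourceB", "sinkB");
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"sourceA", "mapA"});
        edges.add(new String[] {"mapA", "sinkA"});
        edges.add(new String[] {"sourceB", "sinkB"});

        // prints [mapA, sinkA, sourceA]
        System.out.println(tasksToRestart(tasks, edges, "mapA"));
        // prints [sinkB, sourceB]
        System.out.println(tasksToRestart(tasks, edges, "sinkB"));
    }
}
```

In this sketch a failure in mapA leaves sourceB and sinkB untouched. If a
single edge connected the two DAGs, both calls would return all five tasks,
which matches the "single pipelined region" case described above.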


Re: Task Failure Strategy for Adaptive Scheduler

Posted by Weihua Hu <hu...@gmail.com>.

Hi,

AFAIK, the reactive mode always restarts the whole pipeline now.

Best,
Weihua
