Posted to user@flink.apache.org by Prabhu Joseph <pr...@gmail.com> on 2023/05/15 01:57:13 UTC

Query on RestartPipelinedRegionFailoverStrategy

Hi, I am testing Flink's Fine-Grained Recovery from Task Failures
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures>
on Flink 1.17 and have run into an issue on which I need some advice. I
have the job graph below with 5 operators; all connections between
operators are pipelined, and the job's parallelism.default is set to 2. I
have configured RestartPipelinedRegionFailoverStrategy with the
exponential-delay restart strategy.

A -> B -> C -> D -> E

There are 10 tasks running in total. The first pipeline (a1 to e1) runs on
one TaskManager (say TM1), and the second pipeline (a2 to e2) runs on
another TaskManager (say TM2).

a1 -> b1 -> c1 -> d1 -> e1
a2 -> b2 -> c2 -> d2 -> e2

When TM1 failed, I expected that only 5 tasks (a1 to e1) would fail and
that they alone would be restarted, but all 10 tasks are restarted. There
is only one pipelined region, which consists of all 10 execution vertices,
and RestartPipelinedRegionFailoverStrategy#getTasksNeedingRestart returns
all 10 tasks. Is this the expected behaviour, or could there be an issue?
Is it possible to restart only the pipeline of the failed task (a1 to e1)
without restarting the other parallel pipelines?

Thanks,
Prabhu Joseph

Re: Query on RestartPipelinedRegionFailoverStrategy

Posted by Shammon FY <zj...@gmail.com>.
Hi Prabhu,

Whether the tasks are in the same region depends on the DistributionPattern
between the upstream and downstream operators. For example, if the
DistributionPattern from A to B is ALL_TO_ALL, all subtasks of A and B
will be in the same region. Otherwise, if the DistributionPattern is
POINTWISE, the JobManager will create independent connections between
subtasks and put only the connected subtasks into one region.
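
To make the rule above concrete, here is a minimal sketch of how the
distribution pattern decides the number of regions. It is a toy model, not
Flink's actual region-building code: the class RegionSketch, its Pattern
enum, and countRegions are hypothetical names, and it assumes a linear chain
of operators that all share the same parallelism, so that POINTWISE connects
subtask i only to subtask i.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of pipelined-region grouping; hypothetical, not Flink's implementation. */
class RegionSketch {
    enum Pattern { POINTWISE, ALL_TO_ALL }

    /** Union-find lookup with path halving. */
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    private static void union(int[] parent, int a, int b) {
        parent[find(parent, a)] = find(parent, b);
    }

    /**
     * Counts regions for a linear chain of operators with equal parallelism,
     * assuming every edge is pipelined. edges[i] is the pattern between
     * operator i and operator i + 1.
     */
    static int countRegions(int parallelism, Pattern[] edges) {
        int operators = edges.length + 1;
        int[] parent = new int[operators * parallelism];
        for (int i = 0; i < parent.length; i++) {
            parent[i] = i;
        }
        for (int op = 0; op < edges.length; op++) {
            for (int up = 0; up < parallelism; up++) {
                for (int down = 0; down < parallelism; down++) {
                    // ALL_TO_ALL links every pair; POINTWISE links only matching indices.
                    if (edges[op] == Pattern.ALL_TO_ALL || up == down) {
                        union(parent, op * parallelism + up, (op + 1) * parallelism + down);
                    }
                }
            }
        }
        Set<Integer> roots = new HashSet<>();
        for (int i = 0; i < parent.length; i++) {
            roots.add(find(parent, i));
        }
        return roots.size();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.POINTWISE;
        Pattern a = Pattern.ALL_TO_ALL;
        // A -> B -> C -> D -> E with parallelism 2.
        System.out.println(countRegions(2, new Pattern[] {p, p, p, p}));
        System.out.println(countRegions(2, new Pattern[] {p, a, p, p}));
    }
}
```

In this model the all-POINTWISE chain yields two regions, while a single
ALL_TO_ALL edge anywhere in the chain merges all 10 tasks into one. In the
DataStream API, keyBy and rebalance produce ALL_TO_ALL edges, whereas forward
connections are POINTWISE, so it may be worth checking the job graph for such
an edge.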

Best,
Shammon FY

On Tue, May 16, 2023 at 6:42 PM Prabhu Joseph <pr...@gmail.com>
wrote:

> Yes, I expected the same. But all the tasks go into one region, and
> RestartPipelinedRegionFailoverStrategy restarts all of them. As far as I
> can see, this strategy makes no difference from RestartAllFailoverStrategy
> in streaming execution mode. It could only help in batch execution mode,
> where the blocking result type is used and multiple regions are formed.
>
> I will come up with a modified strategy that returns two different regions
> and validate whether it works without any adverse impact.

Re: Query on RestartPipelinedRegionFailoverStrategy

Posted by Prabhu Joseph <pr...@gmail.com>.
Yes, I expected the same. But all the tasks go into one region, and
RestartPipelinedRegionFailoverStrategy restarts all of them. As far as I
can see, this strategy makes no difference from RestartAllFailoverStrategy
in streaming execution mode. It could only help in batch execution mode,
where the blocking result type is used and multiple regions are formed.

I will come up with a modified strategy that returns two different regions
and validate whether it works without any adverse impact.
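
The point about blocking results can be illustrated with a small sketch. It
is a toy model under stated assumptions, not Flink's implementation: the
class BlockingEdgeSketch and its countRegions method are hypothetical, every
edge is assumed to be all-to-all, all operators share one parallelism, and
only pipelined edges are allowed to merge tasks into a region.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model: only pipelined edges merge tasks into one region. Hypothetical, not Flink code. */
class BlockingEdgeSketch {
    /** Union-find lookup with path halving. */
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /**
     * Counts regions for a linear chain of operators with equal parallelism
     * where every edge is all-to-all. blocking[i] tells whether the edge
     * between operator i and operator i + 1 is blocking (true) or pipelined.
     */
    static int countRegions(int parallelism, boolean[] blocking) {
        int operators = blocking.length + 1;
        int[] parent = new int[operators * parallelism];
        for (int i = 0; i < parent.length; i++) {
            parent[i] = i;
        }
        for (int op = 0; op < blocking.length; op++) {
            if (blocking[op]) {
                continue; // a blocking edge does not join its two sides
            }
            for (int up = 0; up < parallelism; up++) {
                for (int down = 0; down < parallelism; down++) {
                    int a = find(parent, op * parallelism + up);
                    int b = find(parent, (op + 1) * parallelism + down);
                    parent[a] = b;
                }
            }
        }
        Set<Integer> roots = new HashSet<>();
        for (int i = 0; i < parent.length; i++) {
            roots.add(find(parent, i));
        }
        return roots.size();
    }

    public static void main(String[] args) {
        // A -> B -> C -> D -> E with parallelism 2.
        System.out.println(countRegions(2, new boolean[] {false, false, false, false}));
        System.out.println(countRegions(2, new boolean[] {false, true, false, false}));
    }
}
```

In this model a fully pipelined all-to-all chain is a single region, while
making the B -> C edge blocking splits it into two regions (A and B on one
side, C, D and E on the other).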


On Tue, 16 May, 2023, 8:09 am weijie guo, <gu...@gmail.com> wrote:

> Hi Prabhu,
>
> If the edges a -> b -> c -> d -> e are all point-wise, then in theory
> they should form two regions.
>
> Best regards,
>
> Weijie

Re: Query on RestartPipelinedRegionFailoverStrategy

Posted by weijie guo <gu...@gmail.com>.
Hi Prabhu,

If the edges a -> b -> c -> d -> e are all point-wise, then in theory they
should form two regions.

Best regards,

Weijie

