Posted to user@flink.apache.org by unknown unknown <un...@gmail.com> on 2022/05/26 15:05:43 UTC

Custom restart strategy

Hello Users!

    I would like to notify an external endpoint when a streaming job has
restarted a certain number of times. While I could use a service to
continuously *poll* Flink metrics and identify failing jobs, I would rather
invert the flow and have the job send the notification itself. We have
around 50 streaming jobs, and querying all of them on a continuous basis
gets challenging.

    Looking into [1], the intrusive way would be to perform the action at
[2] (not tested, though). Happy to hear suggestions and alternatives.


[1]
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#restart-strategies


[2]
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FixedDelayRestartBackoffTimeStrategy.java#L68


Thanks
AK.

Re: Custom restart strategy

Posted by Shengkai Fang <fs...@gmail.com>.

Hi.

Maybe the metric reporter [1] is suitable for your case.

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/
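
One way to read the metric reporter suggestion is as a push model. Flink's `MetricReporter` interface is told about metrics as they are registered, and a reporter that also implements `Scheduled` has its `report()` method called periodically inside the Flink process. The reporter can then call out only when a restart counter crosses a threshold. The sketch below shows only the threshold logic and is not wired into Flink. The class name `RestartThresholdNotifier`, the callback, and the idea of feeding it the value of a restart gauge such as `numRestarts` are assumptions for illustration, not something confirmed in this thread.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.BiConsumer;

/** Hypothetical helper: fires a callback once per job when restarts reach a threshold. */
class RestartThresholdNotifier {
    private final long threshold;
    private final BiConsumer<String, Long> onThresholdReached;
    // Jobs that have already triggered a notification, to avoid repeats.
    private final Set<String> alreadyNotified = new HashSet<>();

    RestartThresholdNotifier(long threshold, BiConsumer<String, Long> onThresholdReached) {
        this.threshold = threshold;
        this.onThresholdReached = onThresholdReached;
    }

    /** Feed in the latest restart count; returns true if a notification was sent. */
    synchronized boolean observe(String jobName, long restartCount) {
        if (restartCount < threshold) {
            // Counter is below the threshold (for example after a resubmission): re-arm.
            alreadyNotified.remove(jobName);
            return false;
        }
        if (!alreadyNotified.add(jobName)) {
            return false; // already notified for this job
        }
        onThresholdReached.accept(jobName, restartCount);
        return true;
    }
}
```

In a real reporter, `observe` would be called from the periodic report step with the current value of the job's restart metric, and the callback would perform the HTTP call to the external endpoint.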

On Sat, May 28, 2022 at 12:49, unknown unknown <un...@gmail.com> wrote:

> Thanks Shengkai! Unfortunately, this would require querying the status of
> each job continuously. Given that very few pipelines experience failures,
> and those failures are few and far between, I am looking for a push-based
> model rather than polling.
>
> Thanks
> AK

Re: Custom restart strategy

Posted by unknown unknown <un...@gmail.com>.

Thanks Shengkai! Unfortunately, this would require querying the status of
each job continuously. Given that very few pipelines experience failures,
and those failures are few and far between, I am looking for a push-based
model rather than polling.

Thanks
AK

On Thu, May 26, 2022 at 7:21 PM Shengkai Fang <fs...@gmail.com> wrote:

> Hi.
>
> I think you can use the REST API [1] to fetch the job status from the
> JobManager periodically and detect whether something has happened. The
> REST API currently also supports fetching the exception list for a
> specified job [2].
>
> Best,
> Shengkai
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs
> [2]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-exceptions

Re: Custom restart strategy

Posted by Shengkai Fang <fs...@gmail.com>.

Hi.

I think you can use the REST API [1] to fetch the job status from the
JobManager periodically and detect whether something has happened. The
REST API currently also supports fetching the exception list for a
specified job [2].

Best,
Shengkai

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-exceptions
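
The polling approach described above could look roughly like the following minimal sketch, which uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) to call the `/jobs/:jobid/exceptions` endpoint referenced in [2]. The class name `FlinkExceptionsPoller`, the timeouts, and the base URL passed in by the caller are assumptions for illustration. The response is returned as a raw JSON string and parsing is left out.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper that fetches a job's exception list from the Flink REST API. */
class FlinkExceptionsPoller {
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
    private final String restBaseUrl;

    FlinkExceptionsPoller(String restBaseUrl) {
        // Drop a trailing slash so that paths can be appended uniformly.
        this.restBaseUrl = restBaseUrl.endsWith("/")
                ? restBaseUrl.substring(0, restBaseUrl.length() - 1)
                : restBaseUrl;
    }

    /** Builds the URI of the exceptions endpoint for the given job ID. */
    URI exceptionsUri(String jobId) {
        return URI.create(restBaseUrl + "/jobs/" + jobId + "/exceptions");
    }

    /** Performs one GET request and returns the raw JSON body. */
    String fetchExceptions(String jobId) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(exceptionsUri(jobId))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

To poll periodically, `fetchExceptions` could be scheduled with a `ScheduledExecutorService`, for example once per minute per job. With around 50 jobs this means around 50 requests per interval, which is the cost the original question is trying to avoid.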

On Thu, May 26, 2022 at 23:06, unknown unknown <un...@gmail.com> wrote:

> Hello Users!
>
>     I would like to notify an external endpoint when a streaming job has
> restarted a certain number of times. While I could use a service to
> continuously *poll* Flink metrics and identify failing jobs, I would rather
> invert the flow and have the job send the notification itself. We have
> around 50 streaming jobs, and querying all of them on a continuous basis
> gets challenging.
>
>     Looking into [1], the intrusive way would be to perform the action at
> [2] (not tested, though). Happy to hear suggestions and alternatives.
>
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#restart-strategies
>
>
> [2]
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FixedDelayRestartBackoffTimeStrategy.java#L68
>
>
> Thanks
> AK.
>