Posted to user@flink.apache.org by Eleanore Jin <el...@gmail.com> on 2020/11/30 05:01:21 UTC

pause and resume flink stream job based on certain condition

Hi experts,

Here is my use case: it's a stateless Flink streaming job for message
validation.
1. Read from a Kafka topic.
2. Validate each message, which requires querying an external system.
       2a. Metadata from the external system is cached in memory for 15
minutes.
       2b. Another stream sends updates that refresh the cache if the
metadata changes within those 15 minutes.
3. If the message is valid, publish it to a valid topic.
4. If the message is invalid, publish it to an error topic.
5. If the external system is down, the message is marked as invalid with a
different error code and published to the same error topic (this routing
is sketched below).
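
For reference, a minimal sketch of this topology with the DataStream API
and side outputs. The topic names, Kafka properties, and the isValid()
check are hypothetical placeholders, not the actual job:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ValidationJob {

    // Hypothetical validation against the external system / its cache.
    static boolean isValid(String msg) { return true; }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        Properties kafkaProps = new Properties();  // brokers, group id, ...

        // Side output tag for messages that fail validation.
        final OutputTag<String> errorTag = new OutputTag<String>("errors") {};

        DataStream<String> messages = env.addSource(new FlinkKafkaConsumer<>(
                "input-topic", new SimpleStringSchema(), kafkaProps));

        SingleOutputStreamOperator<String> valid = messages.process(
                new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String msg, Context ctx,
                                               Collector<String> out) {
                        if (isValid(msg)) {
                            out.collect(msg);           // main output -> valid topic
                        } else {
                            ctx.output(errorTag, msg);  // side output -> error topic
                        }
                    }
                });

        valid.addSink(new FlinkKafkaProducer<>(
                "valid-topic", new SimpleStringSchema(), kafkaProps));
        valid.getSideOutput(errorTag).addSink(new FlinkKafkaProducer<>(
                "error-topic", new SimpleStringSchema(), kafkaProps));

        env.execute("message-validation");
    }
}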

Ask:
Messages that fail due to external system failures currently require
manual replay.

Is there a way to pause the job when there is an external system failure,
and resume once the external system is back online?

Or are there any other suggestions to allow automatic retry of such errors?

Thanks a lot!
Eleanore

Re: pause and resume flink stream job based on certain condition

Posted by Eleanore Jin <el...@gmail.com>.
Hi Robert,

Sorry for the late reply. I just did a quick test, and this seems to work:
1. While the thread is blocked, checkpoints can expire, but once the thread
is no longer blocked, checkpointing continues (see the config sketch below).
2. This approach guarantees message ordering.
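
In case it helps others: the expired checkpoints during the blocked window
may be tolerable via the checkpoint config. This is only a sketch with
made-up values, and whether expired checkpoints count toward the tolerated
failure number should be verified against your Flink version:

// env is the job's StreamExecutionEnvironment.
env.enableCheckpointing(60_000);                                   // checkpoint every 60s
env.getCheckpointConfig().setCheckpointTimeout(120_000);           // expire after 2 minutes
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(5);  // tolerate a few failures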

Thanks a lot!
Eleanore

On Tue, Dec 15, 2020 at 10:42 PM Robert Metzger <rm...@apache.org> wrote:

> What you can also do is rely on Flink's backpressure mechanism: [...]

Re: pause and resume flink stream job based on certain condition

Posted by Robert Metzger <rm...@apache.org>.
What you can also do is rely on Flink's backpressure mechanism: If the map
operator that validates the messages detects that the external system is
down, it blocks until the system is up again.
This effectively causes the whole streaming job to pause: the Kafka source
won't read new messages.
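
A minimal sketch of such a blocking validator; isExternalSystemUp() and
validate() are hypothetical stand-ins for your health check and validation
logic:

import org.apache.flink.api.common.functions.RichMapFunction;

public class BlockingValidator extends RichMapFunction<String, String> {

    @Override
    public String map(String msg) throws Exception {
        while (!isExternalSystemUp()) {
            Thread.sleep(5_000);  // back off; backpressure pauses the source meanwhile
        }
        return validate(msg);
    }

    private boolean isExternalSystemUp() { return true; }  // hypothetical health check
    private String validate(String msg) { return msg; }    // hypothetical validation
}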

On Tue, Dec 15, 2020 at 3:07 AM Eleanore Jin <el...@gmail.com> wrote:

> Thanks for the suggestion. I wonder if it would make sense for the
> operator to produce a side-output message telling the source to 'pause'
> [...]

Re: pause and resume flink stream job based on certain condition

Posted by Eleanore Jin <el...@gmail.com>.
Hi Guowei and Arvid,

Thanks for the suggestion. I wonder if it would make sense (and be
possible) for the operator to produce a side-output message telling the
source to 'pause', and to feed that same side output back to the source as
a side input, based on which the source would pause and resume?

Thanks a lot!
Eleanore

On Sun, Nov 29, 2020 at 11:33 PM Arvid Heise <ar...@ververica.com> wrote:

> If the external system is down, you could simply fail the job after a
> given timeout (for example, using async I/O) [...]

Re: pause and resume flink stream job based on certain condition

Posted by Arvid Heise <ar...@ververica.com>.
Hi Eleanore,

If the external system is down, you could simply fail the job after a
given timeout (for example, using async I/O). The job would then restart
according to the configured restart strategy.

If your state is rather small (and recovery time therefore short), you
would get pretty much the desired behavior: the job would stop making
progress until the external system eventually responds again.
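
A sketch of that pattern; validateAgainstExternalSystem() is a hypothetical
blocking call, and the timeout, capacity, and restart delay are example
values. Not overriding AsyncFunction#timeout() means a timed-out request
fails the job, after which the restart strategy takes over:

// messages is the input DataStream<String>; env is the execution environment.
DataStream<String> validated = AsyncDataStream.orderedWait(
        messages,
        new AsyncFunction<String, String>() {
            @Override
            public void asyncInvoke(String msg, ResultFuture<String> result) {
                CompletableFuture
                        .supplyAsync(() -> validateAgainstExternalSystem(msg))
                        .whenComplete((ok, err) -> {
                            if (err != null) {
                                result.completeExceptionally(err);
                            } else {
                                result.complete(Collections.singleton(ok));
                            }
                        });
            }
        },
        30, TimeUnit.SECONDS,  // per-request timeout (example value)
        100);                  // max in-flight requests (example value)

// Restart "forever" with a fixed delay, so the job keeps retrying until
// the external system is healthy again.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        Integer.MAX_VALUE, Time.of(30, TimeUnit.SECONDS)));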

On Mon, Nov 30, 2020 at 7:39 AM Guowei Ma <gu...@gmail.com> wrote:

> 1. AFAIK, only the job itself can "pause". For example, the operator that
> queries the external system could pause when the external system is down.
> [...]

-- 

Arvid Heise | Senior Java Developer

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng

Re: pause and resume flink stream job based on certain condition

Posted by Guowei Ma <gu...@gmail.com>.
Hi, Eleanore

1. AFAIK, only the job itself can "pause". For example, the operator that
queries the external system could pause when the external system is down.
2. If you use the DataStream API, maybe you could try "iterate" and send
the failed messages back for retry (see the sketch below).
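
A rough sketch of option 2; retryTag and lookupSucceeded() are
hypothetical, and note that DataStream iterations come with limited
checkpointing guarantees:

// messages is the input DataStream<String>.
final OutputTag<String> retryTag = new OutputTag<String>("retry") {};

IterativeStream<String> loop = messages.iterate();

// Attempt validation; if the external lookup failed, emit to the retry
// side output instead of sending the message downstream.
SingleOutputStreamOperator<String> attempted = loop.process(
        new ProcessFunction<String, String>() {
            @Override
            public void processElement(String msg, Context ctx,
                                       Collector<String> out) {
                if (lookupSucceeded(msg)) {
                    out.collect(msg);           // validated or marked invalid
                } else {
                    ctx.output(retryTag, msg);  // external system down -> retry
                }
            }
        });

// Failed lookups re-enter the loop; everything else continues downstream.
loop.closeWith(attempted.getSideOutput(retryTag));
DataStream<String> results = attempted;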

Best,
Guowei


On Mon, Nov 30, 2020 at 1:01 PM Eleanore Jin <el...@gmail.com> wrote:

> Is there a way to pause the job when there is an external system failure,
> and resume once the external system is back online? [...]