You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Sanket Agrawal <sa...@infosys.com> on 2021/09/28 13:54:54 UTC

Event is taking a lot of time between the operators

Hi All,

I am new to Flink. While developing a Flink application We observed that our message is taking around 10 seconds between the two Async operators. Below are the details.


  *   Flink Flow: Kinesis Source -> Process -> Async1 -> Async2 -> Process -> Kinesis Sink
  *   Environment: Amazon KDA. 1 Kinesis Processing Unit (1vCore & 4GB ram), and 1 parallelism.
  *   Flink Version: 1.8.0
  *   Backpressure: Flink dashboard shows that backpressure is OK.
  *   Input rate: 60 messages per second.

Any kind of pointers/help will be very useful.

Thanks,
Sanket Agrawal


RE: Event is taking a lot of time between the operators

Posted by Sanket Agrawal <sa...@infosys.com>.
Thank you @Piotr Nowojski<ma...@apache.org> for helping me.

From: Piotr Nowojski <pn...@apache.org>
Sent: Wednesday, September 29, 2021 12:53 PM
To: Sanket Agrawal <sa...@infosys.com>
Cc: Ragini Manjaiah <ra...@gmail.com>; user@flink.apache.org
Subject: Re: Event is taking a lot of time between the operators


[**EXTERNAL EMAIL**]
Hi Sanket,

As I mentioned in the previous email, it's most likely still an issue of backpressure and you can check it as I described in that message. Either your records are stuck in the network buffers between (I) to async operations (if there is a network exchange), and/or inside the `AsyncWaitOperator`'s internal queue (II). If it's causing you problems

I. For the former problem (network buffers) you can:
a) get rid of the network exchange, via removing keyBy/shuffle/rebalance (might not be feasible, depending on your business logic)
b) reduce the amount of the in-flight data. In Flink 1.14 we are adding automatic buffer debloating mechanism, in Flink 1.8 you can not use, but you could manually tweak both amount and the size of the buffers. You can read about it here [1], just ignore the automatic buffer debloating mechanism.
II. You can change the size of the internal queue by adjusting the `capacity` parameter [2]

The more buffered in-flight data you have between operators, the longer the delay between processing the same record by two different operators.

Best,
Piotrek

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/network_mem_tuning/<https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fnightlies.apache.org%2Fflink%2Fflink-docs-master%2Fdocs%2Fdeployment%2Fmemory%2Fnetwork_mem_tuning%2F&data=04%7C01%7Csanket.agrawal%40infosys.com%7C7c0683dcdf29475809d908d9831a36f1%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684970891526908%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=UCwjYQ7ivszHcaINrZ96sop%2BTvTLwpo9epkc6v77drg%3D&reserved=0>
[2] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api<https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fnightlies.apache.org%2Fflink%2Fflink-docs-master%2Fdocs%2Fdev%2Fdatastream%2Foperators%2Fasyncio%2F%23async-io-api&data=04%7C01%7Csanket.agrawal%40infosys.com%7C7c0683dcdf29475809d908d9831a36f1%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684970891536859%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=LZCgZc3bs7u%2ByPtnm8XY%2FTK3S1Y7GPh9kN5Pr7TyOjY%3D&reserved=0>



śr., 29 wrz 2021 o 08:20 Sanket Agrawal <sa...@infosys.com>> napisał(a):
Hi Ragini,

For measuring time in an async we have put a logger as the first and the last statement in asyncInvoke and for measuring time between the asyncs we are simply subtracting the message2's start time and message1's end time. Also, we are using 1 as the parallelism.

Please let me know if you need any other information or if you have any recommendations on improving the approach.

Thanks,
Sanket Agrawal

From: Ragini Manjaiah <ra...@gmail.com>>
Sent: Wednesday, September 29, 2021 11:17 AM
To: Sanket Agrawal <sa...@infosys.com>>
Cc: Piotr Nowojski <pn...@apache.org>>; user@flink.apache.org<ma...@flink.apache.org>
Subject: Re: Event is taking a lot of time between the operators


[**EXTERNAL EMAIL**]
Hi Sanket,
 I have a similar use case. how are you measuring the time for Async1` function to return the result and external api call

On Wed, Sep 29, 2021 at 10:47 AM Sanket Agrawal <sa...@infosys.com>> wrote:
Hi @Piotr Nowojski<ma...@apache.org>,

Thank you for replying back. Yes, first async is taking between 1300-1500 milliseconds but that is called on a CompletableFuture.supplyAsync and the Async Capacity is set to 1000.

Async Code Structure: Inside asyncInvoke we are calling CompletableFuture.supplyAsync and inside supplyAsync we are calling an external API which is taking around 1005ms to 1040ms. Rest of the code for request creation/response validation is also inside the supplyAsync and is taking around 250ms.

This way we tried that the main Async thread(as the async does not uses multiple threads directly) is available for the next message as soon as it calls CompletableFuture.supplyAsync on the current message.

Thanks,
Sanket Agrawal

From: Piotr Nowojski <pn...@apache.org>>
Sent: Tuesday, September 28, 2021 8:02 PM
To: Sanket Agrawal <sa...@infosys.com>>
Cc: user@flink.apache.org<ma...@flink.apache.org>
Subject: Re: Event is taking a lot of time between the operators


[**EXTERNAL EMAIL**]
Hi,

With Flink 1.8.0 I'm not sure how reliable the backpressure status is in the WebUI when it comes to the Async operators. If I remember correctly until around Flink 1.10 (+/- 2 version) backpressure monitoring was checking for thread dumps stuck in requesting Flink's network memory buffers. If in your job AsyncFunction is the source of a backpressure, it would be skipped and not reported. For analysing backpressure I would highly recommend upgrading to Flink 1.13.x as it has greatly improved tooling for that [1]. And in that version AsynFunctions are definitely handled correctly. Since Flink 1.10 I believe you can use the `isBackPressured` metric. In previous versions you would have to rely on buffer usage metrics as described here [2].


[1] https://flink.apache.org/2021/07/07/backpressure.html<https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2021%2F07%2F07%2Fbackpressure.html&data=04%7C01%7Csanket.agrawal%40infosys.com%7C7c0683dcdf29475809d908d9831a36f1%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684970891546816%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=mLbLr99gt4v5unD6BxoXG%2F%2B4N98uVo%2FVwA49e%2F94qU4%3D&reserved=0>
[2] https://flink.apache.org/2019/07/23/flink-network-stack-2.html#network-metrics<https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2019%2F07%2F23%2Fflink-network-stack-2.html%23network-metrics&data=04%7C01%7Csanket.agrawal%40infosys.com%7C7c0683dcdf29475809d908d9831a36f1%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684970891556765%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=4LKUKiT06xIoAPkZEFdFZQny%2FbnzUK%2BRdi7ik5TfFGg%3D&reserved=0>

Apart of the back pressure, part of the problem might be simply how long does it take for `Async1` function to return the result. Have you checked that? Isn't it taking a couple of seconds?

Best,
Piotrek

wt., 28 wrz 2021 o 15:55 Sanket Agrawal <sa...@infosys.com>> napisał(a):
Hi All,

I am new to Flink. While developing a Flink application We observed that our message is taking around 10 seconds between the two Async operators. Below are the details.


  *   Flink Flow: Kinesis Source -> Process -> Async1 -> Async2 -> Process -> Kinesis Sink
  *   Environment: Amazon KDA. 1 Kinesis Processing Unit (1vCore & 4GB ram), and 1 parallelism.
  *   Flink Version: 1.8.0
  *   Backpressure: Flink dashboard shows that backpressure is OK.
  *   Input rate: 60 messages per second.

Any kind of pointers/help will be very useful.

Thanks,
Sanket Agrawal


Re: Event is taking a lot of time between the operators

Posted by Piotr Nowojski <pn...@apache.org>.
Hi Sanket,

As I mentioned in the previous email, it's most likely still an issue of
backpressure and you can check it as I described in that message. Either
your records are stuck in the network buffers between (I) to async
operations (if there is a network exchange), and/or inside the
`AsyncWaitOperator`'s internal queue (II). If it's causing you problems

I. For the former problem (network buffers) you can:
a) get rid of the network exchange, via removing keyBy/shuffle/rebalance
(might not be feasible, depending on your business logic)
b) reduce the amount of the in-flight data. In Flink 1.14 we are adding
automatic buffer debloating mechanism, in Flink 1.8 you can not use, but
you could manually tweak both amount and the size of the buffers. You can
read about it here [1], just ignore the automatic buffer debloating
mechanism.
II. You can change the size of the internal queue by adjusting the
`capacity` parameter [2]

The more buffered in-flight data you have between operators, the longer the
delay between processing the same record by two different operators.

Best,
Piotrek

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/network_mem_tuning/
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api



śr., 29 wrz 2021 o 08:20 Sanket Agrawal <sa...@infosys.com>
napisał(a):

> Hi Ragini,
>
>
>
> For measuring time in an async we have put a logger as the first and the
> last statement in asyncInvoke and for measuring time between the asyncs
> we are simply subtracting the message2’s start time and message1’s end
> time. Also, we are using 1 as the parallelism.
>
>
>
> Please let me know if you need any other information or if you have any
> recommendations on improving the approach.
>
>
>
> Thanks,
>
> Sanket Agrawal
>
>
>
> *From:* Ragini Manjaiah <ra...@gmail.com>
> *Sent:* Wednesday, September 29, 2021 11:17 AM
> *To:* Sanket Agrawal <sa...@infosys.com>
> *Cc:* Piotr Nowojski <pn...@apache.org>; user@flink.apache.org
> *Subject:* Re: Event is taking a lot of time between the operators
>
>
>
> [**EXTERNAL EMAIL**]
>
> Hi Sanket,
>
>  I have a similar use case. how are you measuring the time for Async1`
> function to return the result and external api call
>
>
>
> On Wed, Sep 29, 2021 at 10:47 AM Sanket Agrawal <
> sanket.agrawal@infosys.com> wrote:
>
> Hi @Piotr Nowojski <pn...@apache.org>,
>
>
>
> Thank you for replying back. Yes, first async is taking between 1300-1500
> milliseconds but that is called on a CompletableFuture.*supplyAsync *and
> the Async Capacity is set to 1000.
>
>
>
> *Async Code Structure*: Inside asyncInvoke we are calling
> CompletableFuture.*supplyAsync *and inside* supplyAsync *we are calling
> an external API which is taking around 1005ms to 1040ms. Rest of the code
> for request creation/response validation is also inside the* supplyAsync *and
> is taking around 250ms.
>
>
>
> This way we tried that the main Async thread(as the async does not uses
> multiple threads directly) is available for the next message as soon as it
> calls CompletableFuture.supplyAsync on the current message.
>
>
>
> Thanks,
>
> Sanket Agrawal
>
>
>
> *From:* Piotr Nowojski <pn...@apache.org>
> *Sent:* Tuesday, September 28, 2021 8:02 PM
> *To:* Sanket Agrawal <sa...@infosys.com>
> *Cc:* user@flink.apache.org
> *Subject:* Re: Event is taking a lot of time between the operators
>
>
>
> [**EXTERNAL EMAIL**]
>
> Hi,
>
>
>
> With Flink 1.8.0 I'm not sure how reliable the backpressure status is in
> the WebUI when it comes to the Async operators. If I remember correctly
> until around Flink 1.10 (+/- 2 version) backpressure monitoring was
> checking for thread dumps stuck in requesting Flink's network memory
> buffers. If in your job AsyncFunction is the source of a backpressure, it
> would be skipped and not reported. For analysing backpressure I would
> highly recommend upgrading to Flink 1.13.x as it has greatly improved
> tooling for that [1]. And in that version AsynFunctions are definitely
> handled correctly. Since Flink 1.10 I believe you can use the
> `isBackPressured` metric. In previous versions you would have to rely on
> buffer usage metrics as described here [2].
>
>
>
>
>
> [1] https://flink.apache.org/2021/07/07/backpressure.html
> <https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2021%2F07%2F07%2Fbackpressure.html&data=04%7C01%7Csanket.agrawal%40infosys.com%7C3c523d23c444478d85b908d9830caf3e%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684912775947321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2BzEolZuvsAgudziPqGzqFuDGRZbHR2hu9D%2F9rERLwk8%3D&reserved=0>
>
> [2]
> https://flink.apache.org/2019/07/23/flink-network-stack-2.html#network-metrics
> <https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2019%2F07%2F23%2Fflink-network-stack-2.html%23network-metrics&data=04%7C01%7Csanket.agrawal%40infosys.com%7C3c523d23c444478d85b908d9830caf3e%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684912775957276%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Xi2g%2BQ6V0AtftHMxYrHgELNn7c%2BKL0iXdacAamI%2F3A0%3D&reserved=0>
>
>
>
> Apart of the back pressure, part of the problem might be simply how long
> does it take for `Async1` function to return the result. Have you checked
> that? Isn't it taking a couple of seconds?
>
>
>
> Best,
>
> Piotrek
>
>
>
> wt., 28 wrz 2021 o 15:55 Sanket Agrawal <sa...@infosys.com>
> napisał(a):
>
> Hi All,
>
>
>
> I am new to Flink. While developing a Flink application We observed that
> our message is taking around 10 seconds between the two Async operators.
> Below are the details.
>
>
>
>    - *Flink Flow*: Kinesis Source -> Process -> Async1 -> Async2 ->
>    Process -> Kinesis Sink
>    - *Environment*: Amazon KDA. 1 Kinesis Processing Unit (1vCore & 4GB
>    ram), and 1 parallelism.
>    - *Flink Version*: 1.8.0
>    - *Backpressure*: Flink dashboard shows that backpressure is *OK.*
>    - *Input rate: *60 messages per second.
>
>
>
> Any kind of pointers/help will be very useful.
>
>
>
> Thanks,
>
> Sanket Agrawal
>
>
>
>

RE: Event is taking a lot of time between the operators

Posted by Sanket Agrawal <sa...@infosys.com>.
Hi Ragini,

For measuring time in an async we have put a logger as the first and the last statement in asyncInvoke and for measuring time between the asyncs we are simply subtracting the message2's start time and message1's end time. Also, we are using 1 as the parallelism.

Please let me know if you need any other information or if you have any recommendations on improving the approach.

Thanks,
Sanket Agrawal

From: Ragini Manjaiah <ra...@gmail.com>
Sent: Wednesday, September 29, 2021 11:17 AM
To: Sanket Agrawal <sa...@infosys.com>
Cc: Piotr Nowojski <pn...@apache.org>; user@flink.apache.org
Subject: Re: Event is taking a lot of time between the operators


[**EXTERNAL EMAIL**]
Hi Sanket,
 I have a similar use case. how are you measuring the time for Async1` function to return the result and external api call

On Wed, Sep 29, 2021 at 10:47 AM Sanket Agrawal <sa...@infosys.com>> wrote:
Hi @Piotr Nowojski<ma...@apache.org>,

Thank you for replying back. Yes, first async is taking between 1300-1500 milliseconds but that is called on a CompletableFuture.supplyAsync and the Async Capacity is set to 1000.

Async Code Structure: Inside asyncInvoke we are calling CompletableFuture.supplyAsync and inside supplyAsync we are calling an external API which is taking around 1005ms to 1040ms. Rest of the code for request creation/response validation is also inside the supplyAsync and is taking around 250ms.

This way we tried that the main Async thread(as the async does not uses multiple threads directly) is available for the next message as soon as it calls CompletableFuture.supplyAsync on the current message.

Thanks,
Sanket Agrawal

From: Piotr Nowojski <pn...@apache.org>>
Sent: Tuesday, September 28, 2021 8:02 PM
To: Sanket Agrawal <sa...@infosys.com>>
Cc: user@flink.apache.org<ma...@flink.apache.org>
Subject: Re: Event is taking a lot of time between the operators


[**EXTERNAL EMAIL**]
Hi,

With Flink 1.8.0 I'm not sure how reliable the backpressure status is in the WebUI when it comes to the Async operators. If I remember correctly until around Flink 1.10 (+/- 2 version) backpressure monitoring was checking for thread dumps stuck in requesting Flink's network memory buffers. If in your job AsyncFunction is the source of a backpressure, it would be skipped and not reported. For analysing backpressure I would highly recommend upgrading to Flink 1.13.x as it has greatly improved tooling for that [1]. And in that version AsynFunctions are definitely handled correctly. Since Flink 1.10 I believe you can use the `isBackPressured` metric. In previous versions you would have to rely on buffer usage metrics as described here [2].


[1] https://flink.apache.org/2021/07/07/backpressure.html<https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2021%2F07%2F07%2Fbackpressure.html&data=04%7C01%7Csanket.agrawal%40infosys.com%7C3c523d23c444478d85b908d9830caf3e%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684912775947321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2BzEolZuvsAgudziPqGzqFuDGRZbHR2hu9D%2F9rERLwk8%3D&reserved=0>
[2] https://flink.apache.org/2019/07/23/flink-network-stack-2.html#network-metrics<https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2019%2F07%2F23%2Fflink-network-stack-2.html%23network-metrics&data=04%7C01%7Csanket.agrawal%40infosys.com%7C3c523d23c444478d85b908d9830caf3e%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684912775957276%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Xi2g%2BQ6V0AtftHMxYrHgELNn7c%2BKL0iXdacAamI%2F3A0%3D&reserved=0>

Apart of the back pressure, part of the problem might be simply how long does it take for `Async1` function to return the result. Have you checked that? Isn't it taking a couple of seconds?

Best,
Piotrek

wt., 28 wrz 2021 o 15:55 Sanket Agrawal <sa...@infosys.com>> napisał(a):
Hi All,

I am new to Flink. While developing a Flink application We observed that our message is taking around 10 seconds between the two Async operators. Below are the details.


  *   Flink Flow: Kinesis Source -> Process -> Async1 -> Async2 -> Process -> Kinesis Sink
  *   Environment: Amazon KDA. 1 Kinesis Processing Unit (1vCore & 4GB ram), and 1 parallelism.
  *   Flink Version: 1.8.0
  *   Backpressure: Flink dashboard shows that backpressure is OK.
  *   Input rate: 60 messages per second.

Any kind of pointers/help will be very useful.

Thanks,
Sanket Agrawal


Re: Event is taking a lot of time between the operators

Posted by Ragini Manjaiah <ra...@gmail.com>.
Hi Sanket,
 I have a similar use case. how are you measuring the time for Async1`
function to return the result and external api call

On Wed, Sep 29, 2021 at 10:47 AM Sanket Agrawal <sa...@infosys.com>
wrote:

> Hi @Piotr Nowojski <pn...@apache.org>,
>
>
>
> Thank you for replying back. Yes, first async is taking between 1300-1500
> milliseconds but that is called on a CompletableFuture.*supplyAsync *and
> the Async Capacity is set to 1000.
>
>
>
> *Async Code Structure*: Inside asyncInvoke we are calling
> CompletableFuture.*supplyAsync *and inside* supplyAsync *we are calling
> an external API which is taking around 1005ms to 1040ms. Rest of the code
> for request creation/response validation is also inside the* supplyAsync *and
> is taking around 250ms.
>
>
>
> This way we tried that the main Async thread(as the async does not uses
> multiple threads directly) is available for the next message as soon as it
> calls CompletableFuture.supplyAsync on the current message.
>
>
>
> Thanks,
>
> Sanket Agrawal
>
>
>
> *From:* Piotr Nowojski <pn...@apache.org>
> *Sent:* Tuesday, September 28, 2021 8:02 PM
> *To:* Sanket Agrawal <sa...@infosys.com>
> *Cc:* user@flink.apache.org
> *Subject:* Re: Event is taking a lot of time between the operators
>
>
>
> [**EXTERNAL EMAIL**]
>
> Hi,
>
>
>
> With Flink 1.8.0 I'm not sure how reliable the backpressure status is in
> the WebUI when it comes to the Async operators. If I remember correctly
> until around Flink 1.10 (+/- 2 version) backpressure monitoring was
> checking for thread dumps stuck in requesting Flink's network memory
> buffers. If in your job AsyncFunction is the source of a backpressure, it
> would be skipped and not reported. For analysing backpressure I would
> highly recommend upgrading to Flink 1.13.x as it has greatly improved
> tooling for that [1]. And in that version AsynFunctions are definitely
> handled correctly. Since Flink 1.10 I believe you can use the
> `isBackPressured` metric. In previous versions you would have to rely on
> buffer usage metrics as described here [2].
>
>
>
>
>
> [1] https://flink.apache.org/2021/07/07/backpressure.html
> <https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2021%2F07%2F07%2Fbackpressure.html&data=04%7C01%7Csanket.agrawal%40infosys.com%7C23f3adeda77d49df701e08d9828cf9be%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684364264365935%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=jHbzq5R3ObWWE9XjNpyVFi9bZl9QMIJQp13ZPd%2BMb00%3D&reserved=0>
>
> [2]
> https://flink.apache.org/2019/07/23/flink-network-stack-2.html#network-metrics
> <https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2019%2F07%2F23%2Fflink-network-stack-2.html%23network-metrics&data=04%7C01%7Csanket.agrawal%40infosys.com%7C23f3adeda77d49df701e08d9828cf9be%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684364264365935%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=KIcL8H1ztS67z5U%2FPZnhnYvlwYM8jNhl9K1sD1GYQHg%3D&reserved=0>
>
>
>
> Apart of the back pressure, part of the problem might be simply how long
> does it take for `Async1` function to return the result. Have you checked
> that? Isn't it taking a couple of seconds?
>
>
>
> Best,
>
> Piotrek
>
>
>
> wt., 28 wrz 2021 o 15:55 Sanket Agrawal <sa...@infosys.com>
> napisał(a):
>
> Hi All,
>
>
>
> I am new to Flink. While developing a Flink application We observed that
> our message is taking around 10 seconds between the two Async operators.
> Below are the details.
>
>
>
>    - *Flink Flow*: Kinesis Source -> Process -> Async1 -> Async2 ->
>    Process -> Kinesis Sink
>    - *Environment*: Amazon KDA. 1 Kinesis Processing Unit (1vCore & 4GB
>    ram), and 1 parallelism.
>    - *Flink Version*: 1.8.0
>    - *Backpressure*: Flink dashboard shows that backpressure is *OK.*
>    - *Input rate: *60 messages per second.
>
>
>
> Any kind of pointers/help will be very useful.
>
>
>
> Thanks,
>
> Sanket Agrawal
>
>
>
>

RE: Event is taking a lot of time between the operators

Posted by Sanket Agrawal <sa...@infosys.com>.
Hi @Piotr Nowojski<ma...@apache.org>,

Thank you for replying back. Yes, first async is taking between 1300-1500 milliseconds but that is called on a CompletableFuture.supplyAsync and the Async Capacity is set to 1000.

Async Code Structure: Inside asyncInvoke we are calling CompletableFuture.supplyAsync and inside supplyAsync we are calling an external API which is taking around 1005ms to 1040ms. Rest of the code for request creation/response validation is also inside the supplyAsync and is taking around 250ms.

This way we tried that the main Async thread(as the async does not uses multiple threads directly) is available for the next message as soon as it calls CompletableFuture.supplyAsync on the current message.

Thanks,
Sanket Agrawal

From: Piotr Nowojski <pn...@apache.org>
Sent: Tuesday, September 28, 2021 8:02 PM
To: Sanket Agrawal <sa...@infosys.com>
Cc: user@flink.apache.org
Subject: Re: Event is taking a lot of time between the operators


[**EXTERNAL EMAIL**]
Hi,

With Flink 1.8.0 I'm not sure how reliable the backpressure status is in the WebUI when it comes to the Async operators. If I remember correctly until around Flink 1.10 (+/- 2 version) backpressure monitoring was checking for thread dumps stuck in requesting Flink's network memory buffers. If in your job AsyncFunction is the source of a backpressure, it would be skipped and not reported. For analysing backpressure I would highly recommend upgrading to Flink 1.13.x as it has greatly improved tooling for that [1]. And in that version AsynFunctions are definitely handled correctly. Since Flink 1.10 I believe you can use the `isBackPressured` metric. In previous versions you would have to rely on buffer usage metrics as described here [2].


[1] https://flink.apache.org/2021/07/07/backpressure.html<https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2021%2F07%2F07%2Fbackpressure.html&data=04%7C01%7Csanket.agrawal%40infosys.com%7C23f3adeda77d49df701e08d9828cf9be%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684364264365935%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=jHbzq5R3ObWWE9XjNpyVFi9bZl9QMIJQp13ZPd%2BMb00%3D&reserved=0>
[2] https://flink.apache.org/2019/07/23/flink-network-stack-2.html#network-metrics<https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2F2019%2F07%2F23%2Fflink-network-stack-2.html%23network-metrics&data=04%7C01%7Csanket.agrawal%40infosys.com%7C23f3adeda77d49df701e08d9828cf9be%7C63ce7d592f3e42cda8ccbe764cff5eb6%7C0%7C0%7C637684364264365935%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=KIcL8H1ztS67z5U%2FPZnhnYvlwYM8jNhl9K1sD1GYQHg%3D&reserved=0>

Apart of the back pressure, part of the problem might be simply how long does it take for `Async1` function to return the result. Have you checked that? Isn't it taking a couple of seconds?

Best,
Piotrek

wt., 28 wrz 2021 o 15:55 Sanket Agrawal <sa...@infosys.com>> napisał(a):
Hi All,

I am new to Flink. While developing a Flink application We observed that our message is taking around 10 seconds between the two Async operators. Below are the details.


  *   Flink Flow: Kinesis Source -> Process -> Async1 -> Async2 -> Process -> Kinesis Sink
  *   Environment: Amazon KDA. 1 Kinesis Processing Unit (1vCore & 4GB ram), and 1 parallelism.
  *   Flink Version: 1.8.0
  *   Backpressure: Flink dashboard shows that backpressure is OK.
  *   Input rate: 60 messages per second.

Any kind of pointers/help will be very useful.

Thanks,
Sanket Agrawal


Re: Event is taking a lot of time between the operators

Posted by Piotr Nowojski <pn...@apache.org>.
Hi,

With Flink 1.8.0 I'm not sure how reliable the backpressure status is in
the WebUI when it comes to the Async operators. If I remember correctly
until around Flink 1.10 (+/- 2 version) backpressure monitoring was
checking for thread dumps stuck in requesting Flink's network memory
buffers. If in your job AsyncFunction is the source of a backpressure, it
would be skipped and not reported. For analysing backpressure I would
highly recommend upgrading to Flink 1.13.x as it has greatly improved
tooling for that [1]. And in that version AsynFunctions are definitely
handled correctly. Since Flink 1.10 I believe you can use the
`isBackPressured` metric. In previous versions you would have to rely on
buffer usage metrics as described here [2].


[1] https://flink.apache.org/2021/07/07/backpressure.html
[2]
https://flink.apache.org/2019/07/23/flink-network-stack-2.html#network-metrics

Apart of the back pressure, part of the problem might be simply how long
does it take for `Async1` function to return the result. Have you checked
that? Isn't it taking a couple of seconds?

Best,
Piotrek

wt., 28 wrz 2021 o 15:55 Sanket Agrawal <sa...@infosys.com>
napisał(a):

> Hi All,
>
>
>
> I am new to Flink. While developing a Flink application We observed that
> our message is taking around 10 seconds between the two Async operators.
> Below are the details.
>
>
>
>    - *Flink Flow*: Kinesis Source -> Process -> Async1 -> Async2 ->
>    Process -> Kinesis Sink
>    - *Environment*: Amazon KDA. 1 Kinesis Processing Unit (1vCore & 4GB
>    ram), and 1 parallelism.
>    - *Flink Version*: 1.8.0
>    - *Backpressure*: Flink dashboard shows that backpressure is *OK.*
>    - *Input rate: *60 messages per second.
>
>
>
> Any kind of pointers/help will be very useful.
>
>
>
> Thanks,
>
> Sanket Agrawal
>
>
>