Posted to user@flink.apache.org by Morgan Geldenhuys <mo...@tu-berlin.de> on 2020/02/25 11:31:00 UTC

How to determine average utilization before backpressure kicks in?

Hello community,

I am fairly new to Flink and have a question concerning utilization. I 
was hoping someone could help.

As I understand it, backpressure is essentially the point at which
utilization has reached 100% for a particular streaming pipeline, meaning
that the application cannot "keep up" with the messages coming into the
system.

Assuming a fairly stable input throughput, is there a way of determining
the average utilization as a percentage? Can we determine from metrics
alone how much more capacity each operator has before backpressure kicks
in, e.g. that it is running at 60% of capacity? Since the maximum
throughput of the DSP application is dictated by the slowest part of the
pipeline, we would need to identify the slowest operator and then average
horizontally.

The only method I can see for determining the point at which the system
can no longer keep up is to increase the input throughput slowly until
the backpressure HIGH alarm is shown, at which point the maximum number
of messages/sec is known.
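
The ramp-up procedure described above could be sketched roughly as
follows. This is only an illustration under assumptions:
`set_input_rate` and `is_backpressured` are hypothetical callbacks (for
example, a control on the load generator and a check of the job's
backpressure status). They are not Flink APIs.

```python
from typing import Callable, Optional


def find_max_rate(
    set_input_rate: Callable[[int], None],  # hypothetical: sets the load generator's rate
    is_backpressured: Callable[[], bool],   # hypothetical: True once HIGH backpressure is seen
    start: int,
    step: int,
    limit: int,
) -> Optional[int]:
    """Raise the input rate step by step and return the last rate that did
    not trigger backpressure, or None if even `start` was too much."""
    last_ok = None
    rate = start
    while rate <= limit:
        set_input_rate(rate)
        # In a real test you would wait here until the job has settled
        # at the new rate before checking its status.
        if is_backpressured():
            return last_ok
        last_ok = rate
        rate += step
    return last_ok  # the job never saturated within the tested range
```

The step size bounds the precision of the result: with a step of 10,000
messages/sec the true limit lies somewhere between the returned rate and
the next step.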

Yes, I know this is a gross oversimplification and there are many
factors that need to be taken into account when dealing with
backpressure, but it would be nice to have a general indicator. A rough
estimate is fine.

Thank you in advance.

Regards,
M.

Re: How to determine average utilization before backpressure kicks in?

Posted by Khachatryan Roman <kh...@gmail.com>.

Hi Morgan,

Thanks for your reply.

I think the only possible way to determine this limit is load testing. In
the end, this is what load testing is all about.
I can only suggest testing parts of the system separately to know their
individual limits (e.g. IO, CPU). Ideally, this should be done on a regular
basis.
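
If the parts are tested separately, the overall limit can be estimated
as the smallest of the individual limits. A minimal sketch, assuming
each limit was measured in isolation and the parts do not compete for
the same resource; the stage names and numbers below are made up for
illustration:

```python
from typing import Dict, Tuple


def slowest_stage(limits: Dict[str, float]) -> Tuple[str, float]:
    """Return the name and limit (messages/sec) of the slowest stage."""
    if not limits:
        raise ValueError("at least one stage limit is required")
    name = min(limits, key=lambda stage: limits[stage])
    return name, limits[name]


# Hypothetical per-stage limits from separate load tests (messages/sec).
measured = {
    "source": 200_000,
    "enrichment (IO)": 130_000,
    "aggregation (CPU)": 180_000,
}

name, capacity = slowest_stage(measured)
print(f"bottleneck: {name}, estimated pipeline capacity: {capacity} messages/sec")
```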

Hope this helps.

Regards,
Roman


Re: How to determine average utilization before backpressure kicks in?

Posted by Morgan Geldenhuys <mo...@tu-berlin.de>.

Hi Roman,

Thank you for the reply.

Yes, I am aware that backpressure can be the result of many factors, and
yes, this is an oversimplification of something very complex, so please
bear with me. Let's assume that these factors have been taken into account
and have lowered the threshold at which this status, i.e. HIGH,
permanently comes into effect.

Example: The system is running perfectly fine under normal conditions,
accessing external sources, and processing an average of 100,000
messages/sec. Let's assume the maximum capacity is around 130,000
messages/sec before backpressure starts propagating back up the stream.
Utilization is therefore about 0.77 (100K/130K). Great, but at present we
don't know that 130,000 is the limit.
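
To make the arithmetic in this example explicit, here is a small sketch.
It assumes the maximum rate is already known, which is exactly the value
that is missing in practice. The function name is just for illustration.

```python
def utilization(current_rate: float, max_rate: float) -> float:
    """Fraction of the maximum sustainable throughput currently in use."""
    if max_rate <= 0:
        raise ValueError("max_rate must be positive")
    return current_rate / max_rate


current = 100_000   # messages/sec, from the example above
maximum = 130_000   # messages/sec, the assumed (normally unknown) limit

u = utilization(current, maximum)
headroom = maximum - current
print(f"utilization = {u:.2f}, headroom = {headroom} messages/sec")
# prints "utilization = 0.77, headroom = 30000 messages/sec"
```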

For this example, or for any job, is there a way of finding this maximum
capacity (and hence the utilization) based on the current throughput,
without pushing the system to its limit? Possibly by measuring (as you
say) the saturation of certain buffers (I am looking into this now, but I
am not too familiar with Flink internals)? It doesn't have to be
extremely precise. Any hints would be greatly appreciated.

Regards,
M.

Re: How to determine average utilization before backpressure kicks in?

Posted by Khachatryan Roman <kh...@gmail.com>.

Hi Morgan,

Backpressure can be caused by a number of factors, e.g. writing to an
external system or slow input partitions.

However, if you know that a particular resource is a bottleneck then it
makes sense to monitor its saturation.
It can be done by using Flink metrics. Please see the documentation for
more details:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html
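
As one possible illustration of monitoring saturation, the sketch below
looks at the network buffer metrics `buffers.inPoolUsage` and
`buffers.outPoolUsage` listed on that page. How the values are collected
(metrics reporter, REST API, etc.) is left out, and the task names,
sample values, and thresholds are made up. The rule used here (input
buffers full while output buffers are mostly empty suggests a
bottleneck) is a rough heuristic and an assumption, not an official
formula.

```python
from typing import Dict, List


def likely_bottlenecks(
    usage: Dict[str, Dict[str, float]],
    high: float = 0.8,  # assumed threshold for "buffers nearly full"
    low: float = 0.2,   # assumed threshold for "buffers mostly empty"
) -> List[str]:
    """Return tasks whose input buffers are nearly full while their output
    buffers are mostly empty, i.e. tasks that cannot drain their input."""
    return [
        task
        for task, u in usage.items()
        if u["inPoolUsage"] >= high and u["outPoolUsage"] <= low
    ]


# Made-up snapshot: values of buffers.inPoolUsage / buffers.outPoolUsage
# per task, each between 0.0 and 1.0.
snapshot = {
    "source": {"inPoolUsage": 0.0, "outPoolUsage": 0.95},
    "map": {"inPoolUsage": 0.90, "outPoolUsage": 0.90},
    "sink": {"inPoolUsage": 0.95, "outPoolUsage": 0.0},
}

suspects = likely_bottlenecks(snapshot)
print(suspects)
```

In this made-up snapshot only the sink matches: the source and map tasks
have full output buffers, which suggests they are waiting on something
downstream rather than being the limit themselves.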

Regards,
Roman

