Posted to dev@flink.apache.org by Lu Niu <qq...@gmail.com> on 2021/04/05 21:19:43 UTC

Automatic backpressure detection

Hi, Flink dev

We would like to develop some tools that:
1. Show the backpressured operator without manual operation
2. Provide suggestions to mitigate back pressure after checking data skew,
external service RPC, etc.
3. Show back pressure history

Could anyone share their experience with such tooling?
Also, I notice that backpressure monitoring and detection are described in
multiple places. Could someone explain how these relate to each other?
Maybe some of them are outdated? Thanks!

1. The official docs describe monitoring back pressure through the web UI.
https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
2. https://flink.apache.org/2019/07/23/flink-network-stack-2.html says that
the outPoolUsage and inPoolUsage metrics can be used to determine back
pressure.
3. The latest Flink version introduces a metric called "isBackPressured", but
I couldn't find any documentation on how to use it.

Best
Lu

Re: Automatic backpressure detection

Posted by Lu Niu <qq...@gmail.com>.

Cool. Thanks!

Best
Lu

On Mon, Apr 12, 2021 at 11:27 PM Piotr Nowojski <pn...@apache.org>
wrote:

> Hi,
>
> Yes. Back-pressure from AsyncOperator should be correctly reported via
> the isBackPressured and backPressuredTimeMsPerSecond metrics and, by
> extension, in the WebUI from 1.13.
>
> Piotrek

Re: Automatic backpressure detection

Posted by Piotr Nowojski <pn...@apache.org>.

Hi,

Yes. Back-pressure from AsyncOperator should be correctly reported via
the isBackPressured and backPressuredTimeMsPerSecond metrics and, by
extension, in the WebUI from 1.13.

Piotrek

On Mon, Apr 12, 2021 at 23:17 Lu Niu <qq...@gmail.com> wrote:

> Hi, Piotr
>
> Thanks for your detailed reply! The thread linked below mentions that
> backpressure generated by AsyncOperator cannot be observed in the Flink UI
> in 1.9.1. Has this been fixed in the latest version? Thank you!
>
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Async-Function-Not-Generating-Backpressure-td26766.html
>
> Best
> Lu

Re: Automatic backpressure detection

Posted by Lu Niu <qq...@gmail.com>.

Hi, Piotr

Thanks for your detailed reply! The thread linked below mentions that
backpressure generated by AsyncOperator cannot be observed in the Flink UI
in 1.9.1. Has this been fixed in the latest version? Thank you!
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Async-Function-Not-Generating-Backpressure-td26766.html

Best
Lu

On Tue, Apr 6, 2021 at 11:14 PM Piotr Nowojski <pn...@apache.org> wrote:

> Hi,
>
> Yes, you can use `isBackPressured` to monitor a task's back-pressure.
> However, keep in mind:
> a) You will miss the nicer visualization of this information that is
> available in 1.13's WebUI.
> b) `isBackPressured` is a sampling-based metric. If your job has a varying
> load, for example all windows firing at the same processing time every
> couple of seconds and causing intermittent back-pressure, this metric will
> randomly show `true` or `false`.
> c) `isBackPressured` is slightly less accurate than
> `backPressuredTimeMsPerSecond`. In some corner cases it can briefly return
> `true` while a task is still running, whereas the time-based metrics work
> in a different, much more accurate way.
>
> As for backporting the patches: if you want to create a custom Flink build,
> it should be doable. There will certainly be some conflicts, so you will
> need to understand Flink's code.
>
> Best,
> Piotrek

Re: Automatic backpressure detection

Posted by Piotr Nowojski <pn...@apache.org>.

Hi,

Yes, you can use `isBackPressured` to monitor a task's back-pressure.
However, keep in mind:
a) You will miss the nicer visualization of this information that is
available in 1.13's WebUI.
b) `isBackPressured` is a sampling-based metric. If your job has a varying
load, for example all windows firing at the same processing time every
couple of seconds and causing intermittent back-pressure, this metric will
randomly show `true` or `false`.
c) `isBackPressured` is slightly less accurate than
`backPressuredTimeMsPerSecond`. In some corner cases it can briefly return
`true` while a task is still running, whereas the time-based metrics work
in a different, much more accurate way.
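
To make points (b) and (c) concrete: because a single `isBackPressured`
sample can flip between `true` and `false`, external tooling on 1.11 would
probably want to smooth the samples before alerting. The following is only a
rough sketch of that idea. The class name, the window size and the 0.5
threshold are all assumptions made for illustration; none of them are part of
Flink.

```python
from collections import deque


class BackPressureWindow:
    """Sliding window over sampled isBackPressured values (hypothetical helper)."""

    def __init__(self, size: int = 30) -> None:
        # Only the most recent `size` samples are kept; older ones drop out.
        self._samples = deque(maxlen=size)

    def add(self, is_back_pressured: bool) -> None:
        """Record one sampled value of the isBackPressured metric."""
        self._samples.append(bool(is_back_pressured))

    def ratio(self) -> float:
        """Fraction of samples in the window that reported back-pressure."""
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)

    def is_back_pressured(self, threshold: float = 0.5) -> bool:
        # The 0.5 threshold is an assumption; tune it for your job.
        return self.ratio() >= threshold
```

Feeding one sample per scrape interval into such a window gives a ratio that
behaves a little more like the time-based metrics, although it remains an
approximation built on sampled data.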

As for backporting the patches: if you want to create a custom Flink build,
it should be doable. There will certainly be some conflicts, so you will
need to understand Flink's code.

Best,
Piotrek

On Wed, Apr 7, 2021 at 02:32 Lu Niu <qq...@gmail.com> wrote:

> Hi, Piotr
>
> Thanks for replying!
>
> We don't plan to upgrade to 1.13 in the short term. We are using Flink
> 1.11, and I notice there is a metric called isBackPressured. Is that enough
> to solve 1? If not, would backporting the patches for
> backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
> work? And do you have an estimate of how difficult that would be?
>
>
> Best
> Lu

Re: Automatic backpressure detection

Posted by Lu Niu <qq...@gmail.com>.

Hi, Piotr

Thanks for replying!

We don't plan to upgrade to 1.13 in the short term. We are using Flink
1.11, and I notice there is a metric called isBackPressured. Is that enough
to solve 1? If not, would backporting the patches for
backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
work? And do you have an estimate of how difficult that would be?


Best
Lu



On Tue, Apr 6, 2021 at 12:18 AM Piotr Nowojski <pn...@apache.org> wrote:

> Hi,
>
> We recently overhauled backpressure detection [1], and a screenshot
> preview of that work is attached here [2]. I encourage you to try the
> 1.13 RC0 build and see how the current mechanism works for you [3]. To
> support those WebUI changes we have added a few new metrics:
> backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond.
>
> 1. I believe that solves 1.
> 2. This still requires a bit of manual investigation. Once you locate the
> backpressuring task, you can check the detailed subtask stats to see whether
> all parallel instances are uniformly backpressured/busy. If you would like
> to add a hint such as "it looks like you have a data skew in Task XYZ", I
> believe that could be added to the WebUI.
> 3. The tricky part is how to display this kind of information. Currently I
> would recommend simply exporting/reporting the
> backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
> metrics for every task to an external system and displaying them, for
> example, in Grafana.
>
> The blog post you are referencing is quite outdated, especially with those
> new changes from 1.13. I'm hoping to write a new one pretty soon.
>
> Piotrek
>
> [1] https://issues.apache.org/jira/browse/FLINK-14712
> [2] https://issues.apache.org/jira/browse/FLINK-14814?focusedCommentId=17256926&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256926
> [3] http://mail-archives.apache.org/mod_mbox/flink-user/202104.mbox/%3C1d2412ce-d4d0-ed50-6181-1b610e16d289@apache.org%3E

Re: Automatic backpressure detection

Posted by Piotr Nowojski <pn...@apache.org>.

Hi,

We recently overhauled backpressure detection [1], and a screenshot
preview of that work is attached here [2]. I encourage you to try the
1.13 RC0 build and see how the current mechanism works for you [3]. To
support those WebUI changes we have added a few new metrics:
backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond.
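
As a rough illustration of how external tooling might consume these three
metrics once they have been exported, here is a small sketch. It assumes each
metric arrives as milliseconds per second (0 to 1000) for one subtask. The
function names and the 10% / 50% cut-offs are assumptions chosen for the
example, so treat them as placeholders to tune rather than as Flink behavior.

```python
import math

# Assumed cut-offs (fractions of one second); adjust to your own jobs.
LOW_THRESHOLD = 0.10
HIGH_THRESHOLD = 0.50


def back_pressure_level(back_pressured_ms_per_second: float) -> str:
    """Map a backPressuredTimeMsPerSecond value to a coarse level."""
    if math.isnan(back_pressured_ms_per_second):
        return "UNKNOWN"
    # Clamp to [0, 1] in case a reporter delivers a slightly out-of-range value.
    ratio = min(max(back_pressured_ms_per_second / 1000.0, 0.0), 1.0)
    if ratio <= LOW_THRESHOLD:
        return "OK"
    if ratio <= HIGH_THRESHOLD:
        return "LOW"
    return "HIGH"


def dominant_state(back_pressured_ms: float, busy_ms: float, idle_ms: float) -> str:
    """Return which of the three time metrics is largest for this subtask."""
    states = {
        "backPressured": back_pressured_ms,
        "busy": busy_ms,
        "idle": idle_ms,
    }
    # Treat NaN as "value not available" and ignore it.
    known = {name: ms for name, ms in states.items() if not math.isnan(ms)}
    if not known:
        return "unknown"
    return max(known, key=known.get)
```

A task whose subtasks are mostly "busy" while its upstream tasks are mostly
"backPressured" is a reasonable first candidate for the bottleneck.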

1. I believe that solves 1.
2. This still requires a bit of manual investigation. Once you locate the
backpressuring task, you can check the detailed subtask stats to see whether
all parallel instances are uniformly backpressured/busy. If you would like
to add a hint such as "it looks like you have a data skew in Task XYZ", I
believe that could be added to the WebUI.
3. The tricky part is how to display this kind of information. Currently I
would recommend simply exporting/reporting the
backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
metrics for every task to an external system and displaying them, for
example, in Grafana.
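
The manual check in point 2 could be automated along these lines once the
per-subtask values are available in an external system. This is only a
sketch under stated assumptions: the input is a mapping from subtask index to
busyTimeMsPerSecond, and the function names, the "2x the average" rule and the
200 ms floor are hypothetical choices made for the example, not anything Flink
provides.

```python
from statistics import mean
from typing import Dict, List, Optional


def find_skewed_subtasks(
    busy_ms_by_subtask: Dict[int, float],
    ratio_threshold: float = 2.0,   # assumed: "far above average" means > 2x
    min_busy_ms: float = 200.0,     # assumed: ignore subtasks that are barely busy
) -> List[int]:
    """Return indices of subtasks whose busy time is far above the task average."""
    if not busy_ms_by_subtask:
        return []
    avg = mean(busy_ms_by_subtask.values())
    return sorted(
        idx
        for idx, busy in busy_ms_by_subtask.items()
        if busy >= min_busy_ms and busy > ratio_threshold * avg
    )


def skew_hint(task_name: str, busy_ms_by_subtask: Dict[int, float]) -> Optional[str]:
    """Build a human-readable hint, or None if the load looks uniform."""
    skewed = find_skewed_subtasks(busy_ms_by_subtask)
    if not skewed:
        return None
    return f"It looks like you have a data skew in task {task_name} (subtasks {skewed})"
```

If all parallel instances are uniformly busy, no subtask exceeds twice the
average and no hint is produced, which matches the "uniformly
backpressured/busy or not" distinction described above.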

The blog post you are referencing is quite outdated, especially with those
new changes from 1.13. I'm hoping to write a new one pretty soon.

Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-14712
[2] https://issues.apache.org/jira/browse/FLINK-14814?focusedCommentId=17256926&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256926
[3] http://mail-archives.apache.org/mod_mbox/flink-user/202104.mbox/%3C1d2412ce-d4d0-ed50-6181-1b610e16d289@apache.org%3E

On Mon, Apr 5, 2021 at 23:20 Lu Niu <qq...@gmail.com> wrote:

> Hi, Flink dev
>
> We would like to develop some tools that:
> 1. Show the backpressured operator without manual operation
> 2. Provide suggestions to mitigate back pressure after checking data skew,
> external service RPC, etc.
> 3. Show back pressure history
>
> Could anyone share their experience with such tooling?
> Also, I notice that backpressure monitoring and detection are described in
> multiple places. Could someone explain how these relate to each other?
> Maybe some of them are outdated? Thanks!
>
> 1. The official docs describe monitoring back pressure through the web UI.
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
> 2. https://flink.apache.org/2019/07/23/flink-network-stack-2.html says that
> the outPoolUsage and inPoolUsage metrics can be used to determine back
> pressure.
> 3. The latest Flink version introduces a metric called "isBackPressured", but
> I couldn't find any documentation on how to use it.
>
> Best
> Lu
>