Posted to user@flink.apache.org by Zach Cox <zc...@gmail.com> on 2016/04/05 22:47:36 UTC

Back Pressure details

Hi - I'm trying to identify bottlenecks in my Flink streaming job, and am
curious about the Back Pressure view in the job manager web UI. If there are
already docs for Back Pressure, please feel free to just point me to those. :)

When "Sampling in progress..." is displayed, what exactly is happening?

What do the values in the Ratio column for each Subtask mean exactly?

What do the Status values such as OK, High, etc. mean? Are they determined
from the Ratio values?

If my job graph looks like Source => A => B => Sink, with Back Pressure OK
for Source and Sink, but High for A and B, what does that suggest?

Thanks,
Zach

Re: Back Pressure details

Posted by Zach Cox <zc...@gmail.com>.
Yeah, I don't think that's the case for my setup either :)  I wrote a simple
Flink job that just consumes from Kafka and sinks the events/sec rate to
Graphite. It consumes from Kafka at a rate several orders of magnitude higher
than the job that also sinks to Elasticsearch. As you said, the downstream back
pressure must also be slowing down consumption from Kafka, even though the job
manager UI doesn't show HIGH back pressure on the Kafka source.
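
Roughly, that throughput-test job looks like the sketch below (simplified, not
the exact code; the topic, broker, Graphite host, and metric path are
placeholders, and it assumes the Kafka 0.9 connector):

    import java.io.OutputStreamWriter;
    import java.net.Socket;
    import java.util.Properties;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaThroughputJob {

      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder broker
        props.setProperty("group.id", "throughput-test");

        env.addSource(new FlinkKafkaConsumer09<String>("events", new SimpleStringSchema(), props))
           .addSink(new GraphiteRateSink("graphite", 2003));    // placeholder host/port

        env.execute("kafka-throughput");
      }

      // Counts records and reports an events/sec gauge to Graphite about once per second.
      public static class GraphiteRateSink extends RichSinkFunction<String> {
        private final String host;
        private final int port;
        private transient long count;
        private transient long lastReport;

        GraphiteRateSink(String host, int port) {
          this.host = host;
          this.port = port;
        }

        @Override
        public void open(Configuration parameters) {
          count = 0;
          lastReport = System.currentTimeMillis();
        }

        @Override
        public void invoke(String value) throws Exception {
          count++;
          long now = System.currentTimeMillis();
          if (now - lastReport >= 1000) {
            double rate = count * 1000.0 / (now - lastReport);
            try (Socket socket = new Socket(host, port);
                 OutputStreamWriter out = new OutputStreamWriter(socket.getOutputStream())) {
              // Graphite plaintext protocol: "<path> <value> <timestamp>\n"
              out.write("flink.kafka.events_per_sec " + rate + " " + (now / 1000) + "\n");
            }
            count = 0;
            lastReport = now;
          }
        }
      }
    }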

Thanks for the details!

-Zach


Re: Back Pressure details

Posted by Ufuk Celebi <uc...@apache.org>.
Ah sorry, I forgot to mention that in the docs.

Data is pulled from Kafka in a way that bypasses Flink's task thread: the
topic is consumed in a separate thread, and the task thread is just waiting.
That's why you don't see any back pressure reported for Kafka sources;
otherwise I would expect your Kafka source to show back pressure as well.
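
Schematically, such a source looks something like the sketch below (a
simplified illustration, not the real FlinkKafkaConsumer code; pollKafka() is
a made-up placeholder). The fetcher thread does the polling and emitting, so
any blocking on network buffers happens there, while the task thread that the
back pressure monitor samples is mostly just waiting:

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class ThreadedSourceSketch implements SourceFunction<String> {

      private volatile boolean running = true;

      @Override
      public void run(SourceContext<String> ctx) throws Exception {
        // A separate fetcher thread pulls records and emits them. If the job is
        // back pressured, ctx.collect() blocks on network buffers in *this*
        // thread, not in the task thread.
        Thread fetcher = new Thread(() -> {
          while (running) {
            String record = pollKafka();                 // hypothetical blocking poll
            synchronized (ctx.getCheckpointLock()) {
              ctx.collect(record);
            }
          }
        }, "kafka-fetcher");
        fetcher.start();

        // The task thread sampled by the back pressure monitor just waits here,
        // so its stack traces rarely show buffer requests even under back pressure.
        fetcher.join();
      }

      @Override
      public void cancel() {
        running = false;
      }

      private String pollKafka() {
        return "...";                                    // placeholder for a real poll
      }
    }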

In theory it is possible that the speed at which the source consumes data
"matches" the speed of the back pressured operator downstream. That would
result in a source without back pressure feeding a back pressured downstream
task. But I don't think that's the case for your setup. ;-)


Re: Back Pressure details

Posted by Zach Cox <zc...@gmail.com>.
The new back pressure docs are great, thanks Ufuk! I'm sure those will help
others as well.

In the Source => A => B => Sink example, if A and B show HIGH back pressure,
should Source also show HIGH? In my case it's a Kafka source and an
Elasticsearch sink. I know our Kafka can currently provide data at a much
higher rate than our Elasticsearch can ingest (I'm working on scaling up
Elasticsearch), so I'm just curious why the Kafka source wouldn't also show
HIGH back pressure.

Thanks,
Zach



Re: Back Pressure details

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Zach,

just added some documentation, which will be available in ~ 30 mins
here: https://ci.apache.org/projects/flink/flink-docs-release-1.0/internals/back_pressure_monitoring.html

If you think that something is missing there, I would appreciate some
feedback. :-)

Back pressure is determined by repeatedly calling getStackTrace() on the task
threads executing the job; by default, 100 samples with a 50 ms delay between
calls. If a task thread is stuck in an internal method call requesting buffers
from the network stack, this indicates back pressure.

The ratio you see tells you how many of the stack traces were stuck in that
method (e.g. 1 out of 100), and the status codes just group those ratios in a
(hopefully) reasonable way (<= 0.10 is OK, <= 0.5 is LOW, > 0.5 is HIGH).
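
In pseudo-code the sampling boils down to something like this (a rough sketch
of the idea, not the actual implementation; the buffer pool class and method
names checked below are illustrative, not the exact indicator we look for):

    public class BackPressureSamplerSketch {

      // Sample the task thread's stack 'samples' times with 'delayMillis' between
      // samples; return the fraction stuck requesting a network buffer.
      static double sampleRatio(Thread taskThread, int samples, long delayMillis)
          throws InterruptedException {
        int stuck = 0;
        for (int i = 0; i < samples; i++) {
          for (StackTraceElement frame : taskThread.getStackTrace()) {
            if (frame.getClassName().endsWith("LocalBufferPool")
                && frame.getMethodName().startsWith("requestBuffer")) {
              stuck++;
              break;
            }
          }
          Thread.sleep(delayMillis);
        }
        return (double) stuck / samples;   // e.g. 1 stuck out of 100 samples -> 0.01
      }

      // Same bucketing as the web UI status column.
      static String status(double ratio) {
        if (ratio <= 0.10) return "OK";
        if (ratio <= 0.50) return "LOW";
        return "HIGH";
      }
    }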

If a task shows back pressure, it means it is producing data faster than its
output can be consumed, for example because the downstream operator is slow or
the network can't handle the load. Your Source => A => B => Sink example
suggests that the sink is slowing down / back pressuring B, which in turn is
slowing down / back pressuring A.

Does this help?

Keep in mind though that it is not a rock-solid approach: there is a chance
that we miss the back pressure indicators, or that we always happen to sample
while the task is requesting buffers (which happens all the time anyway). It
often works better at the extremes, e.g. when there is no back pressure at all
or when back pressure is very high.

– Ufuk

