Posted to dev@flink.apache.org by Till Rohrmann <tr...@apache.org> on 2019/08/01 09:39:48 UTC

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Hi David,

thanks for starting this discussion. I like the idea of improving insights
into Flink's execution and I believe that a flame graph could be helpful.

I quickly glanced over your changes and I think they go in a good
direction. One idea could be to share the `StackTraceSample` produced by
the `StackTraceSampleCoordinator` between the different
`StackTraceOperatorTracker` instances so that we don't send multiple
requests for the same operators. That way we would decrease the RPC load a
bit.
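
To illustrate the sharing idea, here is a minimal sketch in Python rather
than Flink's Java. It is not the coordinator's actual API: the class name
`SharedSampleCache`, the `request_sample` callback and the time-to-live
parameter are all hypothetical. Several consumers ask the cache for a
vertex's sample, and only the first call within the time-to-live triggers a
request.

```python
import time


class SharedSampleCache:
    """Hypothetical cache that lets several trackers reuse one sample."""

    def __init__(self, request_sample, ttl_seconds=60.0, clock=time.monotonic):
        # request_sample stands in for the expensive RPC round trip.
        self._request_sample = request_sample
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # vertex_id -> (timestamp, sample)

    def get(self, vertex_id):
        now = self._clock()
        entry = self._entries.get(vertex_id)
        if entry is not None and now - entry[0] < self._ttl:
            # Fresh enough: reuse the stored sample, no new request.
            return entry[1]
        sample = self._request_sample(vertex_id)
        self._entries[vertex_id] = (now, sample)
        return sample
```

A shared sample only helps if all consumers can work with the same depth and
sampling rate, so the cached sample would have to be the most detailed one
any consumer needs.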

Apart from that, I think the next steps would be to find a committer who
could shepherd this effort and help you with merging it.

Cheers,
Till

On Wed, Jul 31, 2019 at 7:05 PM David Morávek <dm...@apache.org> wrote:

> Hello,
>
> While looking into Flink internals, I've noticed that there is already a
> mechanism for stack-trace sampling of a particular job vertex.
>
> I think it may be really useful to allow users to easily render a CPU
> flame graph <http://www.brendangregg.com/flamegraphs.html> in the new UI
> for a selected vertex (new tab next to back pressure) of a running job.
> The back pressure tab already provides a good idea of which vertex causes
> trouble, but it's hard to say what's actually going on.
>
> I've tried to implement a basic REST endpoint
> <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>
> that prepares data for the flame graph rendering, and it seems to be
> providing good insight.
>
> It should be straightforward to render data from the endpoint in the new
> UI using existing <https://github.com/spiermar/d3-flame-graph> JavaScript
> libraries.
>
> WDYT? Is this worth pushing forward?
>
> D.
>

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Posted by boshu Zheng <ki...@163.com>.
Big +1 for this helpful feature :)


On 08/02/2019 13:54, Jark Wu wrote:
Hi David,

The demo looks charming! I think it will definitely help a lot with
performance tuning.
A big +1 for this.

I cc'ed Yadong, who's one of the main contributors to the new Web UI.
Maybe he can give some help on the front end.

Regards,
Jark


Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Posted by Jark Wu <im...@gmail.com>.
Hi David,

The demo looks charming! I think it will definitely help a lot with
performance tuning.
A big +1 for this.

I cc'ed Yadong, who's one of the main contributors to the new Web UI.
Maybe he can give some help on the front end.

Regards,
Jark

On Fri, 2 Aug 2019 at 04:26, David Morávek <da...@gmail.com> wrote:

> Hi Till, thanks for the feedback! These endpoints are only called when the
> vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
> back-pressure, we only sample the top 3 calls of the stack (depth = 3). For
> the flame graph, we want to sample the whole stack trace and we need a
> different sampling rate (longer period, more samples). Those are the main
> reasons to split these into two "trackers", but I may be missing something.
>
> I've prepared a little demo, so others can have a better idea of what I
> have in mind.
>
> https://youtu.be/GUNDehj9z9o
>
> Please note that this is a proof of concept and I'm not a frontend person,
> so it may look a little clumsy :)
>
> D.
>

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Posted by David Morávek <dm...@apache.org>.
I've created FLINK-13550 <https://issues.apache.org/jira/browse/FLINK-13550>
to track the issue.

Is there any committer who'd be willing to "shepherd this effort"? :)

Thanks,
D.

On Fri, Aug 2, 2019 at 10:22 AM David Morávek <dm...@apache.org> wrote:

> Hi Paul, for now I only plan to add the one based on Java stack traces.

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Posted by David Morávek <dm...@apache.org>.
Hi Paul, for now I only plan to add the one based on Java stack traces.

On Fri, Aug 2, 2019 at 9:34 AM Paul Lam <pa...@gmail.com> wrote:

> Hi David,
>
> Thanks for the new feature! I think the flame graph would be a useful tool
> to understand the state of job executions, and it looks good too. +1 for
> this.
>
> And a minor question: do we plan to support multiple kinds of flame
> graphs? It would be great if we had both on-CPU and off-CPU flame graphs.
>
> Best,
> Paul Lam

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Posted by Paul Lam <pa...@gmail.com>.
Hi David,

Thanks for the new feature! I think the flame graph would be a useful tool to understand the state of job executions, and it looks good too. +1 for this.

And a minor question: do we plan to support multiple kinds of flame graphs? It would be great if we had both on-CPU and off-CPU flame graphs.

Best,
Paul Lam

> On Aug 2, 2019, at 04:24, David Morávek <da...@gmail.com> wrote:
>
> Hi Till, thanks for the feedback! These endpoints are only called when the
> vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
> back-pressure, we only sample the top 3 calls of the stack (depth = 3). For
> the flame graph, we want to sample the whole stack trace and we need a
> different sampling rate (longer period, more samples). Those are the main
> reasons to split these into two "trackers", but I may be missing something.
>
> I've prepared a little demo, so others can have a better idea of what I
> have in mind.
>
> https://youtu.be/GUNDehj9z9o
>
> Please note that this is a proof of concept and I'm not a frontend person,
> so it may look a little clumsy :)
> 
> D.


Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Posted by David Morávek <da...@gmail.com>.
Hi Till, thanks for the feedback! These endpoints are only called when the
vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
back-pressure, we only sample the top 3 calls of the stack (depth = 3). For
the flame graph, we want to sample the whole stack trace and we need a
different sampling rate (longer period, more samples). Those are the main
reasons to split these into two "trackers", but I may be missing something.
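
As a rough illustration of the difference between a depth-limited sample and
a whole-stack sample, the following Python sketch folds stack-trace samples
into the nested name/value/children shape that d3-flame-graph documents as
its input. It is not the code from the linked commit: the function name
`build_flame_graph`, the `max_depth` parameter and the frame names in the
usage example are all made up for illustration.

```python
def build_flame_graph(samples, max_depth=None):
    """Fold stack-trace samples into a nested {name, value, children} tree.

    Each sample is a list of frame names ordered from the outermost caller
    to the innermost call (the top of the stack). max_depth is hypothetical
    and keeps only that many top-most frames per sample.
    """
    root = {"name": "root", "value": 0, "children": {}}
    for frames in samples:
        if max_depth is not None:
            # Depth-limited sampling: drop everything below the top frames.
            frames = frames[-max_depth:]
        root["value"] += 1
        node = root
        for frame in frames:
            child = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}})
            # Each node counts the samples that passed through it.
            child["value"] += 1
            node = child
    return _to_lists(root)


def _to_lists(node):
    # Convert the name-keyed child dicts into plain lists for JSON output.
    return {
        "name": node["name"],
        "value": node["value"],
        "children": [_to_lists(c) for c in node["children"].values()],
    }
```

For example, `build_flame_graph([["run", "map"], ["run", "sink"]])` yields a
root with a single `run` child of value 2, which in turn has `map` and `sink`
children of value 1 each. With `max_depth=1` only the top frames survive, so
the call hierarchy that makes a flame graph useful is lost.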

I've prepared a little demo, so others can have a better idea of what I
have in mind.

https://youtu.be/GUNDehj9z9o

Please note that this is a proof of concept and I'm not a frontend person,
so it may look a little clumsy :)

D.

On Thu, Aug 1, 2019 at 11:40 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi David,
>
> thanks for starting this discussion. I like the idea of improving insights
> into Flink's execution and I believe that a flame graph could be helpful.
>
> I quickly glanced over your changes and I think they go in a good
> direction. One idea could be to share the `StackTraceSample` produced by
> the `StackTraceSampleCoordinator` between the different
> `StackTraceOperatorTracker` instances so that we don't send multiple
> requests for the same operators. That way we would decrease the RPC load a
> bit.
>
> Apart from that, I think the next steps would be to find a committer who
> could shepherd this effort and help you with merging it.
>
> Cheers,
> Till