
Posted to dev@flink.apache.org by aalexandrov <al...@gmail.com> on 2014/12/02 14:53:57 UTC

Re: Enhance Flink's monitoring capabilities

Hello Nils,

I am going to work on a similar issue related to tracking some basic
statistics of the intermediate results produced by dataflows during
execution.

I just created a Jira issue here:

https://issues.apache.org/jira/browse/FLINK-1297

If you already have some work done on extending the monitoring capabilities
in a branch, it might be good to sync up the development in order to avoid
duplicated work (e.g. by using the same communication channel to send the
data from the task managers to the job manager).



--
View this message in context: http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Enhance-Flink-s-monitoring-capabilities-tp2573p2713.html
Sent from the Apache Flink (Incubator) Mailing List archive at Nabble.com.

Re: Enhance Flink's monitoring capabilities

Posted by Henry Saputra <he...@gmail.com>.

+1

Its extensibility is one of the reasons it has been used in other projects.

On Sunday, December 7, 2014, Stephan Ewen <se...@apache.org> wrote:

> That actually sounds like a great idea. I discussed a bit with Robert
> offline on Friday, and it seems that Metrics has most of what we talked
> about.
>
> I also like the way they make it extensible, so people can capture their
> own metrics.

Fwd: Enhance Flink's monitoring capabilities

Posted by Alexander Alexandrov <al...@gmail.com>.

I have created an issue for the related dataflow statistics tracking
feature here:

https://issues.apache.org/jira/browse/FLINK-1297

FLINK-456 seems to have some overlap with what I described. I suggest that
we either have three separate issues or at least work on resolving
FLINK-1297 and FLINK-456 in three stages:

1. agree upon a design and implement the basic service architecture and the
model;
2. implement dataflow statistics tracking on top of (1): min, max, count,
count distinct;
3. implement runtime statistics tracking on top of (1): CPU, I/O load;
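
As a rough illustration of stage (2), the four statistics could live in a
small accumulator that each parallel task fills and that is merged centrally
afterwards. The Python sketch below is only an assumption about how this
might look: the class DataflowStats and its methods are hypothetical names,
not existing Flink code, and the exact distinct count keeps every value in
memory, which a real implementation would likely replace with an approximate
counter.

```python
from typing import Any, Optional, Set


class DataflowStats:
    """Hypothetical accumulator: min, max, count, and exact distinct count."""

    def __init__(self) -> None:
        self.count: int = 0
        self.minimum: Optional[Any] = None
        self.maximum: Optional[Any] = None
        self._seen: Set[Any] = set()  # exact distinct tracking; grows with the data

    def add(self, value: Any) -> None:
        """Update all statistics with one record value."""
        self.count += 1
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value
        self._seen.add(value)

    @property
    def count_distinct(self) -> int:
        return len(self._seen)

    def merge(self, other: "DataflowStats") -> "DataflowStats":
        """Combine partial statistics collected by two parallel tasks."""
        merged = DataflowStats()
        merged.count = self.count + other.count
        merged._seen = self._seen | other._seen
        if merged._seen:
            # Every observed value is in the set, so min/max can be read from it.
            merged.minimum = min(merged._seen)
            merged.maximum = max(merged._seen)
        return merged


# Two parallel tasks each see part of an intermediate result.
task_a, task_b = DataflowStats(), DataflowStats()
for v in (3, 1, 3):
    task_a.add(v)
for v in (7, 1):
    task_b.add(v)
total = task_a.merge(task_b)
print(total.count, total.minimum, total.maximum, total.count_distinct)  # 5 1 7 3
```

Making the accumulator mergeable means the per-task partial results could
travel over the same channel as the runtime statistics in (3).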

It makes sense to have a design document (probably Markdown) with some
figures to agree on the scope and implementation aspects of (1), as Henry
proposed in the "Statistics collection for optimization" thread, before we
start with the actual implementation.

Robert's prototype branch (
https://github.com/rmetzger/incubator-flink/tree/flink456) on top of the
latest version of Till's Akka rework seems to be a good starting point to
fork for the actual work on (1). I suggest that after that we somehow
divide and conquer (2) and (3).

Regards,
Alexander

---------- Forwarded message ----------
From: Henry Saputra <he...@gmail.com>
Date: 2014-12-12 6:18 GMT+01:00
Subject: Re: Enhance Flink's monitoring capabilities
To: "dev@flink.incubator.apache.org" <de...@flink.incubator.apache.org>

Thanks Robert, looks like we could use this JIRA to do the work

- Henry

On Thu, Dec 11, 2014 at 9:25 AM, Robert Metzger <rm...@apache.org> wrote:
> I think this (very old) issue describes the feature fairly closely:
> https://issues.apache.org/jira/browse/FLINK-456

Re: Enhance Flink's monitoring capabilities

Posted by Henry Saputra <he...@gmail.com>.

Thanks Robert, looks like we could use this JIRA to do the work

- Henry

On Thu, Dec 11, 2014 at 9:25 AM, Robert Metzger <rm...@apache.org> wrote:
> I think this (very old) issue describes the feature fairly closely:
> https://issues.apache.org/jira/browse/FLINK-456

Re: Enhance Flink's monitoring capabilities

Posted by Robert Metzger <rm...@apache.org>.

I think this (very old) issue describes the feature fairly closely:
https://issues.apache.org/jira/browse/FLINK-456



On Thu, Dec 11, 2014 at 8:32 AM, Henry Saputra <he...@gmail.com>
wrote:

> Just curious, is there any JIRA filed for this or was it just in
> preliminary proposal talk?
>
> - Henry

Re: Enhance Flink's monitoring capabilities

Posted by Henry Saputra <he...@gmail.com>.

Just curious, is there any JIRA filed for this or was it just in
preliminary proposal talk?

- Henry

On Sun, Dec 7, 2014 at 3:36 PM, Stephan Ewen <se...@apache.org> wrote:
> That actually sounds like a great idea. I discussed a bit with Robert
> offline on Friday, and it seems that Metrics has most of what we talked
> about.
>
> I also like the way they make it extensible, so people can capture their
> own metrics.
Re: Enhance Flink's monitoring capabilities

Posted by Stephan Ewen <se...@apache.org>.

That actually sounds like a great idea. I discussed a bit with Robert
offline on Friday, and it seems that Metrics has most of what we talked
about.

I also like the way they make it extensible, so people can capture their
own metrics.
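
To picture the kind of extensibility meant here, the following Python sketch
shows a toy registry in which user code registers its own named metric next
to the built-in ones. It is not the Metrics library's API; ToyMetricRegistry
and the metric names are made up for illustration.

```python
from typing import Callable, Dict, List


class ToyMetricRegistry:
    """Hypothetical registry: each metric is a named callable returning its current value."""

    def __init__(self) -> None:
        self._gauges: Dict[str, Callable[[], float]] = {}

    def register(self, name: str, gauge: Callable[[], float]) -> None:
        """Add a metric; names must be unique so reports stay unambiguous."""
        if name in self._gauges:
            raise ValueError(f"metric already registered: {name}")
        self._gauges[name] = gauge

    def snapshot(self) -> Dict[str, float]:
        """Read every registered metric once."""
        return {name: gauge() for name, gauge in self._gauges.items()}


registry = ToyMetricRegistry()
# Stands in for a metric the system itself would provide.
registry.register("system.task_slots", lambda: 4)

# A user-defined metric is registered through the same call.
records_seen: List[int] = []
registry.register("user.records_seen", lambda: len(records_seen))
records_seen.extend([10, 20, 30])

print(registry.snapshot())  # {'system.task_slots': 4, 'user.records_seen': 3}
```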

On Sun, Dec 7, 2014 at 6:02 AM, Henry Saputra <he...@gmail.com>
wrote:

> Hi Robert,
>
> From what I have seen so far, it is probably better and easier for Flink
> to leverage the Metrics library [1] for metrics collection rather than
> building our own.
>
> Several ASF projects like Spark [2] and Tajo [3] have used it with great
> success.
>
> The main reasons are maintainability and the breadth of metric types that
> could and should be collected.
>
> - Henry
>
> [1] https://dropwizard.github.io/metrics/3.1.0/getting-started/
> [2] https://spark.apache.org/docs/1.0.1/monitoring.html
> [3] https://issues.apache.org/jira/browse/TAJO-333
>
> On Sat, Dec 6, 2014 at 11:13 AM, Robert Metzger <rm...@apache.org>
> wrote:
> > Hey Nils,
> >
> > I have played around a bit with a little prototype. You can find the code
> > here: https://github.com/rmetzger/incubator-flink/tree/flink456 (it's
> > another branch in my repo).
> > You can see the changes that I applied on top of Till's Akka branch here:
> > https://github.com/rmetzger/incubator-flink/compare/tillrohrmann:akka_scala...rmetzger:flink456?expand=1
> >
> > What the code does is collect statistics about each TaskManager in the
> > system. These stats are assembled into a "MetricsReport", which is sent
> > with the periodic heartbeat to the JobManager. The JobManager stores the
> > latest MetricsReport for each TaskManager (in the Instance object for
> > each TM).
> > When the user accesses the TaskManager overview, the latest MetricsReport
> > is sent as a JSONObject to the browser.
> >
> > To test my changes, check out the code and build it:
> >   mvn clean package -DskipTests -Dcheckstyle.skip=true
> > Then go into the build output directory:
> >   cd flink-dist/target/flink-0.8-incubating-SNAPSHOT-bin/flink-0.8-incubating-SNAPSHOT/
> > and start the web interface:
> >   ./bin/start-local.sh
> >
> > Go to localhost:8081; in the "TaskManager" view, you can see some
> > metrics.
> > Here is a screenshot: http://img42.com/eNPve
> >
> > I named my branch after this issue, as it probably best describes what
> > we're working on here: FLINK-456
> > <https://issues.apache.org/jira/browse/FLINK-456>
> >
> > As I said in the beginning, it's really just a prototype. Let me know if
> > you have any further questions.
> > For the "per TaskManager" reports, we should probably integrate some more
> > statistics. Also, the presentation of the numbers is very basic right
> > now. I think there are many good libraries for visualizing these kinds of
> > stats.
> > Also, the numbers currently represent only a "snapshot"; however, some of
> > the numbers can be accumulated (read/write bytes of the I/O manager).
> > Another missing feature is storing a short history of numbers to
> > visualize metrics over time.
> >
> > I'm trying to find time to look into "per job" metrics as well. They will
> > require a bit more infrastructure to distinguish them on the JobManager
> > side and to get them on the TaskManagers.
> >
> >
> > Best,
> > Robert
>

Re: Enhance Flink's monitoring capabilities

Posted by Henry Saputra <he...@gmail.com>.

Hi Robert,

From what I have seen so far, it is probably better and easier for Flink
to leverage the Metrics library [1] for metrics collection rather than
building something of our own organically.

Several ASF projects like Spark [2] and Tajo [3] have used it with great success.

The main reasons are maintainability and the breadth of metric types
that could and should be collected.

- Henry

[1] https://dropwizard.github.io/metrics/3.1.0/getting-started/
[2] https://spark.apache.org/docs/1.0.1/monitoring.html
[3] https://issues.apache.org/jira/browse/TAJO-333

On Sat, Dec 6, 2014 at 11:13 AM, Robert Metzger <rm...@apache.org> wrote:
> Hey Nils,
>
> I have played around a bit with a little prototype. You can find the code
> here: https://github.com/rmetzger/incubator-flink/tree/flink456 (its
> another branch in my repo).
> You can see the changes that I applied on top of Till's Akka branch here:
> https://github.com/rmetzger/incubator-flink/compare/tillrohrmann:akka_scala...rmetzger:flink456?expand=1
>
> What the code does is collecting statistics about each TaskManager in the
> system. These stats are assembled into a "MetricsReport" which is send with
> the periodical heartbeat to the JobManager. The JobManager stores the
> latest MetricsReport for each TaskManager (in the Instance object for each
> TM).
> When the user accesses the TaskManager overview, the latest MetricsReport
> is send as a JSONObject to the browser.
>
> to test my changes, check out the code, build it
>  mvn clean package -DskipTests -Dcheckstyle.skip=true
> go into
> cd
> flink-dist/target/flink-0.8-incubating-SNAPSHOT-bin/flink-0.8-incubating-SNAPSHOT/
> and start the web interface
> /bin/start-local.sh
>
> Go to localhost:8081, in the "TaskManager" view, you can see some metrics.
> Here is a screenshot: http://img42.com/eNPve
>
> I named my branch after this issue, as it is probably describing best what
> we're working on here: FLINK-456
> <https://issues.apache.org/jira/browse/FLINK-456>
>
> As I said in the beginning, its really just a prototype. Let me know if you
> have any further questions.
> For the "per TaskManager" reports, we should probably integrate some more
> statistics. Also, the presentation of the numbers is very very basic right
> now. I think there are many good libraries for visualizing these kinds of
> stats.
> Also, the numbers currently represent only a "snapshot", however, some of
> the numbers can be accumulated (read/write bytes of the io manager).
> Another missing feature is storing a little history of numbers to visualize
> metrics over time.
>
> I'm trying to find time to look into "per job" metrics as well. They will
> require a bit more infrastructure to distinguish them on the JobManager
> side and to get them on the TaskManagers.
>
>
> Best,
> Robert
>
>
>
> On Tue, Dec 2, 2014 at 2:53 PM, aalexandrov <
> alexander.s.alexandrov@gmail.com> wrote:
>
>> Hello Nils,
>>
>> I am going to work on a similar issue related to tracking some basics
>> statistics of the intermediate results produced by dataflows during
>> execution.
>>
>> I just create a Jira issue here:
>>
>> https://issues.apache.org/jira/browse/FLINK-1297
>>
>> If you already have some work done on extending the monitoring capabilities
>> in a branch, it might be good to sync-up the development in order to avoid
>> duplicated work (e.g. using the same communication channel used to send the
>> data from the task managers to the job manager).

Re: Enhance Flink's monitoring capabilities

Posted by Robert Metzger <rm...@apache.org>.

Hey Nils,

I have played around a bit with a little prototype. You can find the code
here: https://github.com/rmetzger/incubator-flink/tree/flink456 (it's
another branch in my repo).
You can see the changes that I applied on top of Till's Akka branch here:
https://github.com/rmetzger/incubator-flink/compare/tillrohrmann:akka_scala...rmetzger:flink456?expand=1

The code collects statistics about each TaskManager in the system. These
stats are assembled into a "MetricsReport", which is sent with the periodic
heartbeat to the JobManager. The JobManager stores the latest MetricsReport
for each TaskManager (in the Instance object for each TM).
When the user accesses the TaskManager overview, the latest MetricsReport
is sent as a JSONObject to the browser.
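
A minimal Java sketch of the flow described above, for readers who want to see
its shape without checking out the branch. It is not the prototype's code: the
class names MetricsReport and JobManagerMetrics, the method names, and the
metric keys are assumptions made for illustration, and the JSON is built by
hand rather than with the JSONObject class the prototype uses.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical snapshot of one TaskManager's statistics (illustrative only). */
final class MetricsReport {
    private final Map<String, Long> values;

    MetricsReport(Map<String, Long> values) {
        // Copy defensively so the report cannot change after it has been sent.
        this.values = Collections.unmodifiableMap(new LinkedHashMap<>(values));
    }

    /** Renders a flat JSON object; assumes metric names need no escaping. */
    String toJson() {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, Long> e : values.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":").append(e.getValue());
        }
        return sb.append('}').toString();
    }
}

/** Hypothetical JobManager-side store: keeps only the latest report per TaskManager. */
final class JobManagerMetrics {
    private final Map<String, MetricsReport> latest = new ConcurrentHashMap<>();

    /** Called whenever a heartbeat carrying a report arrives; replaces the old report. */
    void onHeartbeat(String taskManagerId, MetricsReport report) {
        latest.put(taskManagerId, report);
    }

    /** Called by the web frontend; returns "{}" if no heartbeat has been seen yet. */
    String latestAsJson(String taskManagerId) {
        MetricsReport report = latest.get(taskManagerId);
        return report == null ? "{}" : report.toJson();
    }
}
```

Storing only the most recent report keeps the JobManager's memory use constant
per TaskManager, which matches the "snapshot" behaviour described in this mail.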

To test my changes, check out the code and build it:
 mvn clean package -DskipTests -Dcheckstyle.skip=true
then go into the build output directory:
 cd flink-dist/target/flink-0.8-incubating-SNAPSHOT-bin/flink-0.8-incubating-SNAPSHOT/
and start the web interface:
 ./bin/start-local.sh

Go to localhost:8081; in the "TaskManager" view you can see some metrics.
Here is a screenshot: http://img42.com/eNPve

I named my branch after this issue, as it probably best describes what
we're working on here: FLINK-456
<https://issues.apache.org/jira/browse/FLINK-456>

As I said in the beginning, it's really just a prototype. Let me know if you
have any further questions.
For the "per TaskManager" reports, we should probably integrate some more
statistics. Also, the presentation of the numbers is very basic right
now. I think there are many good libraries for visualizing these kinds of
stats.
Also, the numbers currently represent only a "snapshot"; however, some of
the numbers can be accumulated (read/write bytes of the io manager).
Another missing feature is storing a little history of numbers to visualize
metrics over time.
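
One possible shape for the two missing pieces mentioned above (accumulated
values and a short history) is a bounded buffer per metric. The sketch below
is only an illustration of that idea. MetricHistory, its methods, and the
choice to record one value per heartbeat are assumptions, not part of the
prototype.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded history of one metric, oldest sample first. */
final class MetricHistory {
    private final int capacity;
    private final Deque<Long> samples = new ArrayDeque<>();
    private long total; // running sum, e.g. bytes read since the TaskManager started

    MetricHistory(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records one sample (for example, the value reported with one heartbeat). */
    synchronized void record(long value) {
        if (samples.size() == capacity) {
            samples.removeFirst(); // drop the oldest sample to stay within capacity
        }
        samples.addLast(value);
        total += value; // the total keeps counting even after old samples are dropped
    }

    /** Returns a copy of the retained samples, suitable for plotting over time. */
    synchronized List<Long> snapshot() {
        return new ArrayList<>(samples);
    }

    /** Returns the accumulated sum of every value ever recorded. */
    synchronized long total() {
        return total;
    }
}
```

A fixed capacity bounds the memory needed per metric, while the running total
still covers samples that have already been evicted from the history.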

I'm trying to find time to look into "per job" metrics as well. They will
require a bit more infrastructure to distinguish them on the JobManager
side and to get them on the TaskManagers.

Best,
Robert

On Tue, Dec 2, 2014 at 2:53 PM, aalexandrov <
alexander.s.alexandrov@gmail.com> wrote:

> Hello Nils,
>
> I am going to work on a similar issue related to tracking some basics
> statistics of the intermediate results produced by dataflows during
> execution.
>
> I just create a Jira issue here:
>
> https://issues.apache.org/jira/browse/FLINK-1297
>
> If you already have some work done on extending the monitoring capabilities
> in a branch, it might be good to sync-up the development in order to avoid
> duplicated work (e.g. using the same communication channel used to send the
> data from the task managers to the job manager).