Posted to dev@beam.apache.org by Etienne Chauchot <ec...@apache.org> on 2018/03/26 15:29:21 UTC

Re: [DISCUSSION] Runner agnostic metrics extractor?

Hi guys,
As part of the work below, I need the help of the Google Dataflow engine maintainers:
AFAIK, Dataflow being a cloud-hosted engine, the related runner is very different from the others: it just submits a job
to the cloud-hosted engine, so the runner has no access to the metrics containers, etc. So I think that the MetricsPusher
(the component responsible for merging metrics and pushing them to a sink backend) must not be instantiated in
DataflowRunner, otherwise it would be more a client (driver) piece of code and we would lose all the benefit of being
close to the execution engine (among other things, instrumentation of the execution of the pipelines). 
I think that the MetricsPusher needs to be instantiated in the actual Dataflow engine. 
As a side note, a Java implementation of the MetricsPusher is available in the PR, in runner-core.
We need a serialized (language-agnostic) version of the MetricsPusher that SDKs will generate out of the user
pipelineOptions and that will be sent to runners for instantiation.
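To make that concrete, here is a rough sketch of what such a serialized description could look like on the Java side
(purely illustrative: the class and field names are not the ones from the PR, they just show the kind of information the
SDKs would have to extract from pipelineOptions and ship with the job):

    // Illustrative only: a small serializable description of the pusher that an SDK
    // could derive from the user's pipelineOptions and send along with the job, so
    // that the engine (and not the submitting client) instantiates the pusher.
    public class MetricsPusherConfig implements java.io.Serializable {
      public final long pushPeriodSeconds;  // how often to merge and push the metrics
      public final String sinkClassName;    // which sink implementation to instantiate
      public final String sinkUrl;          // e.g. the HTTP endpoint of the metrics backend

      public MetricsPusherConfig(long pushPeriodSeconds, String sinkClassName, String sinkUrl) {
        this.pushPeriodSeconds = pushPeriodSeconds;
        this.sinkClassName = sinkClassName;
        this.sinkUrl = sinkUrl;
      }
    }

The engine side would then only have to deserialize this description, instantiate the configured sink, and run the
merge-and-push loop at the requested period.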
Are you guys willing to instantiate it in the Dataflow engine?  I think it is important that Dataflow is wired up in
addition to Flink and Spark for this new feature to ramp up (adding it to other runners) in Beam. 
If you agree, can someone from Google help with that? 
Thanks 
Etienne
On Wednesday, January 31, 2018 at 14:01 +0100, Etienne Chauchot wrote:
> Hi all,
> Just to let you know that I have just submitted the PR [1]:
> This PR adds the MetricsPusher discussed in scenario 3.b of this document [2].
> It merges and pushes Beam metrics at a configurable (via pipelineOptions) frequency to a configurable sink. By default
> the sink is a DummySink, which is also useful for tests. There is also an HttpMetricsSink available that pushes the
> JSON-serialized metrics results to an HTTP backend. We could imagine many Beam sinks in the future to write to Ganglia,
> Graphite, etc. Please note that these are not IOs, because the feature needed to be transparent to the user.
> The pusher only supports attempted metrics for now.
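> As an illustration of the sink contract this implies, here is a minimal sketch (hypothetical signatures, not
> necessarily the exact interface from the PR):
> 
>     import java.io.Serializable;
>     import org.apache.beam.sdk.metrics.MetricQueryResults;
> 
>     // Hypothetical sink contract: the pusher periodically hands the merged metrics
>     // snapshot to the sink, which decides where and how to write it.
>     public interface MetricsSink extends Serializable {
>       void writeMetrics(MetricQueryResults metricResults) throws Exception;
>     }
> 
>     // A sink in the spirit of the DummySink mentioned above: it only remembers the
>     // last snapshot so that tests can assert on what was pushed.
>     class TestMetricsSink implements MetricsSink {
>       private static volatile MetricQueryResults lastPushed;
> 
>       @Override
>       public void writeMetrics(MetricQueryResults metricResults) {
>         lastPushed = metricResults;
>       }
> 
>       static MetricQueryResults getLastPushed() {
>         return lastPushed;
>       }
>     }
> 
> An HttpMetricsSink would implement the same contract by serializing the snapshot to JSON and POSTing it to the
> configured backend URL.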
> 
> This feature is hosted in the runner-core module (as discussed in the doc) to be used by all the runners. I have wired
> it up with the Spark and Flink runners (this part was quite painful :)).
> If the PR is merged, I hope that runner experts will wire this feature up in the other runners. Your help is needed,
> guys :)
> Besides, there is nothing related to portability for now in this PR.
> Best!
> Etienne
> [1] https://github.com/apache/beam/pull/4548
> [2] https://s.apache.org/runner_independent_metrics_extraction
> 
> On 11/12/2017 at 17:33, Etienne Chauchot wrote:
> > Hi all, 
> > I sketched a little doc [1] about this subject. It tries to sum up the differences between the runners regarding
> > metrics extraction and proposes some possible designs for a runner-agnostic extraction of the metrics.
> > It is a two-page doc; can you please comment on it and correct it if needed?
> > Thanks
> > Etienne
> > [1]: https://s.apache.org/runner_independent_metrics_extraction
> > 
> > On 27/11/2017 at 18:17, Ben Chambers wrote:
> > > I think discussing a runner agnostic way of configuring how metrics are
> > > extracted is a great idea -- thanks for bringing it up Etienne!
> > > 
> > > Using a thread that polls the pipeline result relies on the program that
> > > created and submitted the pipeline continuing to run (e.g., no machine
> > > faults, network problems, etc.). For many applications, this isn't a good
> > > model (a streaming pipeline may run for weeks, a batch pipeline may be
> > > automatically run every hour, etc.).
> > > 
> > > Etienne's proposal of having something the runner pushes metrics to has
> > > the benefit of running in the same cluster as the pipeline, thus having the
> > > same reliability benefits.
> > > 
> > > As noted, it would require runners to ensure that metrics were pushed into
> > > the extractor, but from there it would allow a general configuration of how
> > > metrics are extracted from the pipeline and exposed to some external
> > > services.
> > > 
> > > Providing something that the runners could push metrics into and have them
> > > automatically exported seems like it would have several benefits:
> > >   1. It would provide a single way to configure how metrics are actually
> > > exported.
> > >   2. It would allow the runners to ensure it was reliably executed.
> > >   3. It would allow the runner to report system metrics directly (e.g., if a
> > > runner wanted to report the watermark, it could push that in directly).
> > > 
> > > -- Ben
> > > 
> > > On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> > > wrote:
> > > 
> > > > Hi all,
> > > > 
> > > > Etienne forgot to mention that we started a PoC about that.
> > > > 
> > > > What I started is to wrap the Pipeline creation to include a thread that
> > > > periodically polls the metrics in the pipeline result (it's what I proposed when
> > > > I compared with Karaf Decanter some time ago).
> > > > Then, this thread marshalls the collected metrics and sends them to a sink. In the
> > > > end, it means that the harvested metrics data will be stored in a backend (for
> > > > instance Elasticsearch).
> > > > 
> > > > The pro of this approach is that it doesn't require any change in the core; it's
> > > > up to the user to use the PipelineWithMetric wrapper.
> > > > 
> > > > The con is that the user needs to explicitly use the PipelineWithMetric
> > > > wrapper.
> > > > 
> > > > IMHO, it's good enough, as the user can decide to poll metrics for some
> > > > pipelines and not for others.
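> > > > To make the idea concrete, here is a rough sketch of such a wrapper (illustrative only: the class and method
> > > > names are not the ones from the PoC, and MetricsSink stands for whatever sink interface the PoC uses):
> > > > 
> > > >     import org.apache.beam.sdk.Pipeline;
> > > >     import org.apache.beam.sdk.PipelineResult;
> > > >     import org.apache.beam.sdk.metrics.MetricsFilter;
> > > > 
> > > >     // Runs the pipeline and forks a daemon thread that periodically queries the
> > > >     // PipelineResult's metrics and hands them to a (hypothetical) MetricsSink.
> > > >     public class PipelineWithMetrics {
> > > >       public static PipelineResult run(Pipeline pipeline, MetricsSink sink, long periodMillis) {
> > > >         PipelineResult result = pipeline.run();
> > > >         Thread poller = new Thread(() -> {
> > > >           while (!Thread.currentThread().isInterrupted()) {
> > > >             try {
> > > >               sink.writeMetrics(result.metrics().queryMetrics(MetricsFilter.builder().build()));
> > > >               Thread.sleep(periodMillis);
> > > >             } catch (Exception e) {
> > > >               break; // stop polling on interruption or sink failure
> > > >             }
> > > >           }
> > > >         });
> > > >         poller.setDaemon(true);
> > > >         poller.start();
> > > >         return result;
> > > >       }
> > > >     }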
> > > > 
> > > > Regards
> > > > JB
> > > > 
> > > > On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
> > > > > Hi all,
> > > > > 
> > > > > I came across this ticket https://issues.apache.org/jira/browse/BEAM-2456. I know
> > > > > that the metrics subject has already been discussed a lot, but I would like to
> > > > > revive the discussion.
> > > > > 
> > > > > The aim in this ticket is to avoid relying on the runner to provide the metrics,
> > > > > because the runners don't all have the same capabilities regarding metrics. The idea
> > > > > in the ticket is to still use the Beam metrics API (and not others like Codahale, as
> > > > > has been discussed some time ago) and provide a way to extract the metrics with
> > > > > a polling thread that would be forked by a PipelineWithMetrics (so, almost
> > > > > invisible to the end user) and then push them to a sink (such as an HTTP REST sink,
> > > > > for example, or a Graphite sink, or anything else...). Nevertheless, a polling
> > > > > thread might not work for all the runners because some might not make the
> > > > > metrics available before the end of the pipeline. Also, forking a thread would
> > > > > be a bit unconventional, so it could be provided as a Beam SDK extension.
> > > > > 
> > > > > Another way, to avoid polling, would be to push metrics values to a sink when
> > > > > they are updated, but I don't know if that is feasible in a runner-independent way.
> > > > > WDYT about the ideas in this ticket?
> > > > > 
> > > > > Best,
> > > > > Etienne
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbonofre@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > > 
> >  
>  

Re: [DISCUSSION] Runner agnostic metrics extractor?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi JB,
I guess you mean adding it on the engine, not on the runner, as the Dataflow runner is more of a client.
On Monday, March 26, 2018 at 17:36 +0200, Jean-Baptiste Onofré wrote:
> Hi Etienne,
> 
> as we might want to keep the runners consistent on such a feature, I think it
> makes sense to have this in the Dataflow runner.
> 
> Especially, if it's not used by end users, there's no impact on the runner.
> 
> So, +1 to add the MetricsPusher to the Dataflow runner.
> 
> My $0.01
> 
> Regards
> JB
> 
> On 03/26/2018 05:29 PM, Etienne Chauchot wrote:
> > 
> > Hi guys,
> > As part of the work below, I need the help of the Google Dataflow engine maintainers:
> > 
> > AFAIK, Dataflow being a cloud-hosted engine, the related runner is very different
> > from the others: it just submits a job to the cloud-hosted engine, so the runner has
> > no access to the metrics containers, etc. So I think that the MetricsPusher
> > (the component responsible for merging metrics and pushing them to a sink backend)
> > must not be instantiated in DataflowRunner, otherwise it would be more a client
> > (driver) piece of code and we would lose all the benefit of being close to the
> > execution engine (among other things, instrumentation of the execution of the
> > pipelines). 
> > *I think that the MetricsPusher needs to be instantiated in the actual Dataflow
> > engine. *
> > 
> > As a side note, a Java implementation of the MetricsPusher is available in the PR
> > in runner-core.
> > 
> > We need a serialized (language-agnostic) version of the MetricsPusher that SDKs
> > will generate out of the user pipelineOptions and that will be sent to runners for
> > instantiation.
> > 
> > *Are you guys willing to instantiate it in the Dataflow engine?  I think it is
> > important that Dataflow is wired up, in addition to Flink and Spark, for this new
> > feature to ramp up (adding it to other runners) in Beam. *
> > *If you agree, can someone from Google help with that? *
> > 
> > Thanks 
> > Etienne
> > 
> > 
> > On Wednesday, January 31, 2018 at 14:01 +0100, Etienne Chauchot wrote:
> > > 
> > > 
> > > Hi all,
> > > 
> > > Just to let you know that I have just submitted the PR [1]:
> > > 
> > > This PR adds the MetricsPusher discussed in scenario 3.b of this document [2].
> > > It merges and pushes Beam metrics at a configurable (via pipelineOptions)
> > > frequency to a configurable sink. By default the sink is a DummySink, which is also
> > > useful for tests. There is also an HttpMetricsSink available that pushes the
> > > JSON-serialized metrics results to an HTTP backend. We could imagine many Beam
> > > sinks in the future to write to Ganglia, Graphite, etc. Please note that
> > > these are not IOs, because the feature needed to be transparent to the user.
> > > 
> > > The pusher only supports attempted metrics for now.
> > > 
> > > This feature is hosted in the runner-core module (as discussed in the doc) to
> > > be used by all the runners. I have wired it up with the Spark and Flink runners
> > > (this part was quite painful :)).
> > > 
> > > If the PR is merged, I hope that runner experts will wire this feature up in
> > > the other runners. Your help is needed, guys :)
> > > 
> > > Besides, there is nothing related to portability for now in this PR.
> > > 
> > > Best!
> > > 
> > > Etienne
> > > 
> > > [1] https://github.com/apache/beam/pull/4548
> > > 
> > > [2] https://s.apache.org/runner_independent_metrics_extraction
> > > 
> > > 
> > > On 11/12/2017 at 17:33, Etienne Chauchot wrote:
> > > > 
> > > > 
> > > > Hi all,
> > > > 
> > > > I sketched a little doc [1] about this subject. It tries to sum up the
> > > > differences between the runners regarding metrics extraction and proposes some
> > > > possible designs for a runner-agnostic extraction of the metrics.
> > > > 
> > > > It is a two-page doc; can you please comment on it and correct it if needed?
> > > > 
> > > > Thanks
> > > > 
> > > > Etienne
> > > > 
> > > > [1]: *https://s.apache.org/runner_independent_metrics_extraction*
> > > > 
> > > > 
> > > > On 27/11/2017 at 18:17, Ben Chambers wrote:
> > > > > 
> > > > > I think discussing a runner agnostic way of configuring how metrics are
> > > > > extracted is a great idea -- thanks for bringing it up Etienne!
> > > > > 
> > > > > Using a thread that polls the pipeline result relies on the program that
> > > > > created and submitted the pipeline continuing to run (e.g., no machine
> > > > > faults, network problems, etc.). For many applications, this isn't a good
> > > > > model (a streaming pipeline may run for weeks, a batch pipeline may be
> > > > > automatically run every hour, etc.).
> > > > > 
> > > > > Etienne's proposal of having something the runner pushes metrics to has
> > > > > the benefit of running in the same cluster as the pipeline, thus having the
> > > > > same reliability benefits.
> > > > > 
> > > > > As noted, it would require runners to ensure that metrics were pushed into
> > > > > the extractor, but from there it would allow a general configuration of how
> > > > > metrics are extracted from the pipeline and exposed to some external
> > > > > services.
> > > > > 
> > > > > Providing something that the runners could push metrics into and have them
> > > > > automatically exported seems like it would have several benefits:
> > > > >   1. It would provide a single way to configure how metrics are actually
> > > > > exported.
> > > > >   2. It would allow the runners to ensure it was reliably executed.
> > > > >   3. It would allow the runner to report system metrics directly (e.g., if a
> > > > > runner wanted to report the watermark, it could push that in directly).
> > > > > 
> > > > > -- Ben
> > > > > 
> > > > > On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> > > > > wrote:
> > > > > 
> > > > > > 
> > > > > > Hi all,
> > > > > > 
> > > > > > Etienne forgot to mention that we started a PoC about that.
> > > > > > 
> > > > > > What I started is to wrap the Pipeline creation to include a thread that
> > > > > > periodically polls the metrics in the pipeline result (it's what I proposed when
> > > > > > I compared with Karaf Decanter some time ago).
> > > > > > Then, this thread marshalls the collected metrics and sends them to a sink. In the
> > > > > > end, it means that the harvested metrics data will be stored in a backend (for
> > > > > > instance Elasticsearch).
> > > > > > 
> > > > > > The pro of this approach is that it doesn't require any change in the core; it's
> > > > > > up to the user to use the PipelineWithMetric wrapper.
> > > > > > 
> > > > > > The con is that the user needs to explicitly use the PipelineWithMetric
> > > > > > wrapper.
> > > > > > 
> > > > > > IMHO, it's good enough, as the user can decide to poll metrics for some
> > > > > > pipelines and not for others.
> > > > > > 
> > > > > > Regards
> > > > > > JB
> > > > > > 
> > > > > > On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
> > > > > > > 
> > > > > > > Hi all,
> > > > > > > 
> > > > > > > I came across this ticket https://issues.apache.org/jira/browse/BEAM-2456. I know
> > > > > > > that the metrics subject has already been discussed a lot, but I would like to
> > > > > > > revive the discussion.
> > > > > > > 
> > > > > > > The aim in this ticket is to avoid relying on the runner to provide the metrics,
> > > > > > > because the runners don't all have the same capabilities regarding metrics. The idea
> > > > > > > in the ticket is to still use the Beam metrics API (and not others like Codahale, as
> > > > > > > has been discussed some time ago) and provide a way to extract the metrics with
> > > > > > > a polling thread that would be forked by a PipelineWithMetrics (so, almost
> > > > > > > invisible to the end user) and then push them to a sink (such as an HTTP REST sink,
> > > > > > > for example, or a Graphite sink, or anything else...). Nevertheless, a polling
> > > > > > > thread might not work for all the runners because some might not make the
> > > > > > > metrics available before the end of the pipeline. Also, forking a thread would
> > > > > > > be a bit unconventional, so it could be provided as a Beam SDK extension.
> > > > > > > 
> > > > > > > Another way, to avoid polling, would be to push metrics values to a sink when
> > > > > > > they are updated, but I don't know if that is feasible in a runner-independent way.
> > > > > > > WDYT about the ideas in this ticket?
> > > > > > > 
> > > > > > > Best,
> > > > > > > Etienne
> > > > > > --
> > > > > > Jean-Baptiste Onofré
> > > > > > jbonofre@apache.org
> > > > > > http://blog.nanthrax.net
> > > > > > Talend - http://www.talend.com
> > > > > > 

Re: [DISCUSSION] Runner agnostic metrics extractor?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Scott,
Thanks for the help and thanks for creating the ticket. I'll add the equivalent tickets for Flink and Spark, but the
integration is already done in the PR you reviewed (https://github.com/apache/beam/pull/4548).
Maybe I'll split the PR by runner coverage (I had already grouped the commits to do so).
Of course, there is also a validates-runner test available in the same PR (see MetricsPusherTest for all the runners
except Spark, and SparkMetricsPusherTest for Spark, as it uses the Spark-specific CreateStream).
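As a rough illustration of what such a test boils down to (hypothetical class names and assertion helper; the real test
in the PR is MetricsPusherTest, and the wiring of the pipeline options to a test sink is assumed):

    import static org.junit.Assert.assertEquals;

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class MetricsPusherSketchTest {
      // In the real test, the pipeline options would also point the pusher at a test
      // sink (a hypothetical TestMetricsSink) with a short push period.
      @Rule public final transient TestPipeline pipeline = TestPipeline.create();

      @Category(ValidatesRunner.class)
      @Test
      public void pushesUserCounterToConfiguredSink() {
        pipeline
            .apply(Create.of(1, 2, 3))
            .apply(ParDo.of(new DoFn<Integer, Integer>() {
              private final Counter elements = Metrics.counter("test", "elements");

              @ProcessElement
              public void processElement(ProcessContext ctx) {
                elements.inc();
                ctx.output(ctx.element());
              }
            }));
        pipeline.run().waitUntilFinish();
        // Assert on what the pusher handed to the (hypothetical) test sink.
        assertEquals(3L, TestMetricsSink.getLastPushedCounterValue("test", "elements"));
      }
    }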
To ease the acceptance of the PR I added everything: the metrics pusher, the sinks, the validates-runner tests, and the
integration for Spark and Flink. 
Etienne
On Monday, March 26, 2018 at 16:59 +0000, Scott Wegner wrote:
> Thanks for keeping this discussion going, Etienne. I can help investigate what it would take to add support for the
> Dataflow runner. I've filed BEAM-3926 to track it [1].
> 
> Is there a @ValidatesRunner integration test [2] that can be used to verify when the functionality has been correctly
> implemented for a new runner?
> 
> [1] https://issues.apache.org/jira/browse/BEAM-3926 
> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java 
> 
> On Mon, Mar 26, 2018 at 8:54 AM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> > Hi Etienne,
> > 
> > as we might want to keep the runners consistent on such a feature, I think it
> > makes sense to have this in the Dataflow runner.
> > 
> > Especially, if it's not used by end users, there's no impact on the runner.
> > 
> > So, +1 to add the MetricsPusher to the Dataflow runner.
> > 
> > My $0.01
> > 
> > Regards
> > JB
> > 
> > On 03/26/2018 05:29 PM, Etienne Chauchot wrote:
> > > Hi guys,
> > > As part of the work below, I need the help of the Google Dataflow engine maintainers:
> > >
> > > AFAIK, Dataflow being a cloud-hosted engine, the related runner is very different
> > > from the others: it just submits a job to the cloud-hosted engine, so the runner has
> > > no access to the metrics containers, etc. So I think that the MetricsPusher
> > > (the component responsible for merging metrics and pushing them to a sink backend)
> > > must not be instantiated in DataflowRunner, otherwise it would be more a client
> > > (driver) piece of code and we would lose all the benefit of being close to the
> > > execution engine (among other things, instrumentation of the execution of the
> > > pipelines). 
> > > *I think that the MetricsPusher needs to be instantiated in the actual Dataflow
> > > engine. *
> > >
> > > As a side note, a Java implementation of the MetricsPusher is available in the PR
> > > in runner-core.
> > >
> > > We need a serialized (language-agnostic) version of the MetricsPusher that SDKs
> > > will generate out of the user pipelineOptions and that will be sent to runners for
> > > instantiation.
> > >
> > > *Are you guys willing to instantiate it in the Dataflow engine?  I think it is
> > > important that Dataflow is wired up, in addition to Flink and Spark, for this new
> > > feature to ramp up (adding it to other runners) in Beam. *
> > > *If you agree, can someone from Google help with that? *
> > >
> > > Thanks 
> > > Etienne
> > >
> > >
> > > On Wednesday, January 31, 2018 at 14:01 +0100, Etienne Chauchot wrote:
> > >>
> > >> Hi all,
> > >>
> > >> Just to let you know that I have just submitted the PR [1]:
> > >>
> > >> This PR adds the MetricsPusher discussed in scenario 3.b of this document [2].
> > >> It merges and pushes Beam metrics at a configurable (via pipelineOptions)
> > >> frequency to a configurable sink. By default the sink is a DummySink, which is also
> > >> useful for tests. There is also an HttpMetricsSink available that pushes the
> > >> JSON-serialized metrics results to an HTTP backend. We could imagine many Beam
> > >> sinks in the future to write to Ganglia, Graphite, etc. Please note that
> > >> these are not IOs, because the feature needed to be transparent to the user.
> > >>
> > >> The pusher only supports attempted metrics for now.
> > >>
> > >> This feature is hosted in the runner-core module (as discussed in the doc) to
> > >> be used by all the runners. I have wired it up with the Spark and Flink runners
> > >> (this part was quite painful :)).
> > >>
> > >> If the PR is merged, I hope that runner experts will wire this feature up in
> > >> the other runners. Your help is needed, guys :)
> > >>
> > >> Besides, there is nothing related to portability for now in this PR.
> > >>
> > >> Best!
> > >>
> > >> Etienne
> > >>
> > >> [1] https://github.com/apache/beam/pull/4548
> > >>
> > >> [2] https://s.apache.org/runner_independent_metrics_extraction
> > >>
> > >>
> > >> On 11/12/2017 at 17:33, Etienne Chauchot wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> I sketched a little doc [1] about this subject. It tries to sum up the
> > >>> differences between the runners regarding metrics extraction and proposes some
> > >>> possible designs for a runner-agnostic extraction of the metrics.
> > >>>
> > >>> It is a two-page doc; can you please comment on it and correct it if needed?
> > >>>
> > >>> Thanks
> > >>>
> > >>> Etienne
> > >>>
> > >>> [1]: *https://s.apache.org/runner_independent_metrics_extraction*
> > >>>
> > >>>
> > >>> On 27/11/2017 at 18:17, Ben Chambers wrote:
> > >>>> I think discussing a runner agnostic way of configuring how metrics are
> > >>>> extracted is a great idea -- thanks for bringing it up Etienne!
> > >>>>
> > >>>> Using a thread that polls the pipeline result relies on the program that
> > >>>> created and submitted the pipeline continuing to run (e.g., no machine
> > >>>> faults, network problems, etc.). For many applications, this isn't a good
> > >>>> model (a streaming pipeline may run for weeks, a batch pipeline may be
> > >>>> automatically run every hour, etc.).
> > >>>>
> > >>>> Etienne's proposal of having something the runner pushes metrics to has
> > >>>> the benefit of running in the same cluster as the pipeline, thus having the
> > >>>> same reliability benefits.
> > >>>>
> > >>>> As noted, it would require runners to ensure that metrics were pushed into
> > >>>> the extractor, but from there it would allow a general configuration of how
> > >>>> metrics are extracted from the pipeline and exposed to some external
> > >>>> services.
> > >>>>
> > >>>> Providing something that the runners could push metrics into and have them
> > >>>> automatically exported seems like it would have several benefits:
> > >>>>   1. It would provide a single way to configure how metrics are actually
> > >>>> exported.
> > >>>>   2. It would allow the runners to ensure it was reliably executed.
> > >>>>   3. It would allow the runner to report system metrics directly (e.g., if a
> > >>>> runner wanted to report the watermark, it could push that in directly).
> > >>>>
> > >>>> -- Ben
> > >>>>
> > >>>> On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> Etienne forgot to mention that we started a PoC about that.
> > >>>>>
> > >>>>> What I started is to wrap the Pipeline creation to include a thread that
> > >>>>> periodically polls the metrics in the pipeline result (it's what I proposed when
> > >>>>> I compared with Karaf Decanter some time ago).
> > >>>>> Then, this thread marshalls the collected metrics and sends them to a sink. In the
> > >>>>> end, it means that the harvested metrics data will be stored in a backend (for
> > >>>>> instance Elasticsearch).
> > >>>>>
> > >>>>> The pro of this approach is that it doesn't require any change in the core; it's
> > >>>>> up to the user to use the PipelineWithMetric wrapper.
> > >>>>>
> > >>>>> The con is that the user needs to explicitly use the PipelineWithMetric
> > >>>>> wrapper.
> > >>>>>
> > >>>>> IMHO, it's good enough, as the user can decide to poll metrics for some
> > >>>>> pipelines and not for others.
> > >>>>>
> > >>>>> Regards
> > >>>>> JB
> > >>>>>
> > >>>>> On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> I came across this ticket https://issues.apache.org/jira/browse/BEAM-2456. I know
> > >>>>>> that the metrics subject has already been discussed a lot, but I would like to
> > >>>>>> revive the discussion.
> > >>>>>>
> > >>>>>> The aim in this ticket is to avoid relying on the runner to provide the metrics,
> > >>>>>> because the runners don't all have the same capabilities regarding metrics. The idea
> > >>>>>> in the ticket is to still use the Beam metrics API (and not others like Codahale, as
> > >>>>>> has been discussed some time ago) and provide a way to extract the metrics with
> > >>>>>> a polling thread that would be forked by a PipelineWithMetrics (so, almost
> > >>>>>> invisible to the end user) and then push them to a sink (such as an HTTP REST sink,
> > >>>>>> for example, or a Graphite sink, or anything else...). Nevertheless, a polling
> > >>>>>> thread might not work for all the runners because some might not make the
> > >>>>>> metrics available before the end of the pipeline. Also, forking a thread would
> > >>>>>> be a bit unconventional, so it could be provided as a Beam SDK extension.
> > >>>>>>
> > >>>>>> Another way, to avoid polling, would be to push metrics values to a sink when
> > >>>>>> they are updated, but I don't know if that is feasible in a runner-independent way.
> > >>>>>> WDYT about the ideas in this ticket?
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Etienne
> > >>>>> --
> > >>>>> Jean-Baptiste Onofré
> > >>>>> jbonofre@apache.org
> > >>>>> http://blog.nanthrax.net
> > >>>>> Talend - http://www.talend.com
> > >>>>>
> > >>>
> > >>
> > 
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> > 
> -- 
> 
> 
> Got feedback? http://go/swegner-feedback

Re: [DISCUSSION] Runner agnostic metrics extractor?

Posted by Scott Wegner <sw...@google.com>.
Thanks for keeping this discussion going, Etienne. I can help investigate
what it would take to add support for the Dataflow runner. I've filed BEAM-3926
to track it [1].

Is there a @ValidatesRunner integration test [2] that can be used to verify
when the functionality has been correctly implemented for a new runner?

[1] https://issues.apache.org/jira/browse/BEAM-3926
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java


On Mon, Mar 26, 2018 at 8:54 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Etienne,
>
> as we might want to keep the runners consistent on such a feature, I think it
> makes sense to have this in the Dataflow runner.
>
> Especially, if it's not used by end users, there's no impact on the runner.
>
> So, +1 to add the MetricsPusher to the Dataflow runner.
>
> My $0.01
>
> Regards
> JB
>
> On 03/26/2018 05:29 PM, Etienne Chauchot wrote:
> > Hi guys,
> > As part of the work below, I need the help of the Google Dataflow engine
> maintainers:
> >
> > AFAIK, Dataflow being a cloud-hosted engine, the related runner is very
> different
> > from the others: it just submits a job to the cloud-hosted engine, so the
> runner has
> > no access to the metrics containers, etc. So I think that the
> MetricsPusher
> > (the component responsible for merging metrics and pushing them to a sink
> backend)
> > must not be instantiated in DataflowRunner, otherwise it would be more a
> client
> > (driver) piece of code and we would lose all the benefit of being close
> to the
> > execution engine (among other things, instrumentation of the execution of
> the
> > pipelines).
> > *I think that the MetricsPusher needs to be instantiated in the actual
> Dataflow
> > engine. *
> >
> > As a side note, a Java implementation of the MetricsPusher is available
> in the PR
> > in runner-core.
> >
> > We need a serialized (language-agnostic) version of the MetricsPusher
> that SDKs
> > will generate out of the user pipelineOptions and that will be sent to
> runners for
> > instantiation.
> >
> > *Are you guys willing to instantiate it in the Dataflow engine?  I think it
> is
> > important that Dataflow is wired up, in addition to Flink and Spark, for
> this new
> > feature to ramp up (adding it to other runners) in Beam. *
> > *If you agree, can someone from Google help with that? *
> >
> > Thanks
> > Etienne
> >
> >
> >> On Wednesday, January 31, 2018 at 14:01 +0100, Etienne Chauchot wrote:
> >>
> >> Hi all,
> >>
> >> Just to let you know that I have just submitted the PR [1]:
> >>
> >> This PR adds the MetricsPusher discussed in scenario 3.b of this
> document [2].
> >> It merges and pushes Beam metrics at a configurable (via
> pipelineOptions)
> >> frequency to a configurable sink. By default the sink is a DummySink,
> which is also
> >> useful for tests. There is also an HttpMetricsSink available that pushes
> the
> >> JSON-serialized metrics results to an HTTP backend. We could imagine many
> Beam
> >> sinks in the future to write to Ganglia, Graphite, etc. Please note
> that
> >> these are not IOs, because the feature needed to be transparent to the user.
> >>
> >> The pusher only supports attempted metrics for now.
> >>
> >> This feature is hosted in the runner-core module (as discussed in the
> doc) to
> >> be used by all the runners. I have wired it up with the Spark and Flink
> runners
> >> (this part was quite painful :)).
> >>
> >> If the PR is merged, I hope that runner experts will wire this feature
> up in
> >> the other runners. Your help is needed, guys :)
> >>
> >> Besides, there is nothing related to portability for now in this PR.
> >>
> >> Best!
> >>
> >> Etienne
> >>
> >> [1] https://github.com/apache/beam/pull/4548
> >>
> >> [2] https://s.apache.org/runner_independent_metrics_extraction
> >>
> >>
> >>> On 11/12/2017 at 17:33, Etienne Chauchot wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I sketched a little doc [1] about this subject. It tries to sum up the
> >>> differences between the runners regarding metrics extraction and proposes
> some
> >>> possible designs for a runner-agnostic extraction of the metrics.
> >>>
> >>> It is a two-page doc; can you please comment on it and correct it if
> needed?
> >>>
> >>> Thanks
> >>>
> >>> Etienne
> >>>
> >>> [1]: *https://s.apache.org/runner_independent_metrics_extraction*
> >>>
> >>>
> >>>> On 27/11/2017 at 18:17, Ben Chambers wrote:
> >>>> I think discussing a runner agnostic way of configuring how metrics
> are
> >>>> extracted is a great idea -- thanks for bringing it up Etienne!
> >>>>
> >>>> Using a thread that polls the pipeline result relies on the program
> that
> >>>> created and submitted the pipeline continuing to run (e.g., no machine
> >>>> faults, network problems, etc.). For many applications, this isn't a
> good
> >>>> model (a streaming pipeline may run for weeks, a batch pipeline may be
> >>>> automatically run every hour, etc.).
> >>>>
> >>>> Etienne's proposal of having something the runner pushes metrics to
> has
> >>>> the benefit of running in the same cluster as the pipeline, thus
> having the
> >>>> same reliability benefits.
> >>>>
> >>>> As noted, it would require runners to ensure that metrics were pushed
> into
> >>>> the extractor, but from there it would allow a general configuration
> of how
> >>>> metrics are extracted from the pipeline and exposed to some external
> >>>> services.
> >>>>
> >>>> Providing something that the runners could push metrics into and have
> them
> >>>> automatically exported seems like it would have several benefits:
> >>>>   1. It would provide a single way to configure how metrics are
> actually
> >>>> exported.
> >>>>   2. It would allow the runners to ensure it was reliably executed.
> >>>>   3. It would allow the runner to report system metrics directly
> (e.g., if a
> >>>> runner wanted to report the watermark, it could push that in
> directly).
> >>>>
> >>>> -- Ben
> >>>>
> >>>> On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> >>>> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Etienne forgot to mention that we started a PoC about that.
> >>>>>
> >>>>> What I started is to wrap the Pipeline creation to include a thread
> that
> >>>>> periodically polls the metrics in the pipeline result (it's what I
> proposed when
> >>>>> I compared with Karaf Decanter some time ago).
> >>>>> Then, this thread marshalls the collected metrics and sends them to a
> sink. In the
> >>>>> end, it means that the harvested metrics data will be stored in a
> backend
> >>>>> (for instance Elasticsearch).
> >>>>>
> >>>>> The pro of this approach is that it doesn't require any change in the
> >>>>> core; it's
> >>>>> up to the user to use the PipelineWithMetric wrapper.
> >>>>>
> >>>>> The con is that the user needs to explicitly use the
> PipelineWithMetric
> >>>>> wrapper.
> >>>>>
> >>>>> IMHO, it's good enough, as the user can decide to poll metrics for some
> >>>>> pipelines and not for others.
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I came across this ticket
> https://issues.apache.org/jira/browse/BEAM-2456.
> >>>>>> I know that the metrics subject has already been discussed a lot, but I
> would
> >>>>>> like to revive the discussion.
> >>>>>>
> >>>>>> The aim in this ticket is to avoid relying on the runner to provide
> the metrics,
> >>>>>> because the runners don't all have the same capabilities regarding metrics.
> The
> >>>>>> idea in the ticket is to still use the Beam metrics API (and not others like
> >>>>>> Codahale, as has been discussed some time ago) and provide a way to extract the
> >>>>>> metrics with a polling thread that would be forked by a PipelineWithMetrics (so,
> >>>>>> almost invisible to the end user) and then push them to a sink (such as an
> HTTP
> >>>>>> REST sink, for example, or a Graphite sink, or anything else...). Nevertheless, a
> >>>>>> polling thread might not work for all the runners because some might not
> make the
> >>>>>> metrics available before the end of the pipeline. Also, forking a
> thread
> >>>>>> would be a bit unconventional, so it could be provided as a Beam SDK
> extension.
> >>>>>>
> >>>>>> Another way, to avoid polling, would be to push metrics values to a
> sink
> >>>>>> when they are updated, but I don't know if that is feasible in a runner-
> >>>>>> independent way.
> >>>>>> WDYT about the ideas in this ticket?
> >>>>>>
> >>>>>> Best,
> >>>>>> Etienne
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> jbonofre@apache.org
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>>
> >>>
> >>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
-- 


Got feedback? http://go/swegner-feedback

Re: [DISCUSSION] Runner agnostic metrics extractor?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Etienne,

as we might want to keep the runners consistent on such a feature, I think it
makes sense to have this in the Dataflow runner.

Especially, if it's not used by end users, there's no impact on the runner.

So, +1 to add the MetricsPusher to the Dataflow runner.

My $0.01

Regards
JB

On 03/26/2018 05:29 PM, Etienne Chauchot wrote:
> Hi guys,
> As part of the work below, I need the help of the Google Dataflow engine maintainers:
> 
> AFAIK, Dataflow being a cloud-hosted engine, the related runner is very different
> from the others: it just submits a job to the cloud-hosted engine, so the runner has
> no access to the metrics containers, etc. So I think that the MetricsPusher
> (the component responsible for merging metrics and pushing them to a sink backend)
> must not be instantiated in DataflowRunner, otherwise it would be more a client
> (driver) piece of code and we would lose all the benefit of being close to the
> execution engine (among other things, instrumentation of the execution of the
> pipelines). 
> *I think that the MetricsPusher needs to be instantiated in the actual Dataflow
> engine. *
> 
> As a side note, a Java implementation of the MetricsPusher is available in the PR
> in runner-core.
> 
> We need a serialized (language-agnostic) version of the MetricsPusher that SDKs
> will generate out of the user pipelineOptions and that will be sent to runners for
> instantiation.
> 
> *Are you guys willing to instantiate it in the Dataflow engine?  I think it is
> important that Dataflow is wired up, in addition to Flink and Spark, for this new
> feature to ramp up (adding it to other runners) in Beam. *
> *If you agree, can someone from Google help with that? *
> 
> Thanks 
> Etienne
> 
> 
>> On Wednesday, January 31, 2018 at 14:01 +0100, Etienne Chauchot wrote:
>>
>> Hi all,
>>
>> Just to let you know that I have just submitted the PR [1]:
>>
>> This PR adds the MetricsPusher discussed in scenario 3.b of this document [2].
>> It merges and pushes Beam metrics at a configurable (via pipelineOptions)
>> frequency to a configurable sink. By default the sink is a DummySink, which is also
>> useful for tests. There is also an HttpMetricsSink available that pushes the
>> JSON-serialized metrics results to an HTTP backend. We could imagine many Beam
>> sinks in the future to write to Ganglia, Graphite, etc. Please note that
>> these are not IOs, because the feature needed to be transparent to the user.
>>
>> The pusher only supports attempted metrics for now.
>>
>> This feature is hosted in the runner-core module (as discussed in the doc) to
>> be used by all the runners. I have wired it up with the Spark and Flink runners
>> (this part was quite painful :)).
>>
>> If the PR is merged, I hope that runner experts will wire this feature up in
>> the other runners. Your help is needed, guys :)
>>
>> Besides, there is nothing related to portability for now in this PR.
>>
>> Best!
>>
>> Etienne
>>
>> [1] https://github.com/apache/beam/pull/4548
>>
>> [2] https://s.apache.org/runner_independent_metrics_extraction
>>
>>
>>> On 11/12/2017 at 17:33, Etienne Chauchot wrote:
>>>
>>> Hi all,
>>>
>>> I sketched a little doc [1] about this subject. It tries to sum up the
>>> differences between the runners regarding metrics extraction and proposes some
>>> possible designs for a runner-agnostic extraction of the metrics.
>>>
>>> It is a two-page doc; can you please comment on it and correct it if needed?
>>>
>>> Thanks
>>>
>>> Etienne
>>>
>>> [1]: *https://s.apache.org/runner_independent_metrics_extraction*
>>>
>>>
>>> On 27/11/2017 at 18:17, Ben Chambers wrote:
>>>> I think discussing a runner agnostic way of configuring how metrics are
>>>> extracted is a great idea -- thanks for bringing it up Etienne!
>>>>
>>>> Using a thread that polls the pipeline result relies on the program that
>>>> created and submitted the pipeline continuing to run (e.g., no machine
>>>> faults, network problems, etc.). For many applications, this isn't a good
>>>> model (a streaming pipeline may run for weeks, a batch pipeline may be
>>>> automatically run every hour, etc.).
>>>>
>>>> Etienne's proposal of having something the runner pushes metrics to has
>>>> the benefit of running in the same cluster as the pipeline, thus having the
>>>> same reliability benefits.
>>>>
>>>> As noted, it would require runners to ensure that metrics were pushed into
>>>> the extractor, but from there it would allow a general configuration of how
>>>> metrics are extracted from the pipeline and exposed to some external
>>>> services.
>>>>
>>>> Providing something that the runners could push metrics into and have them
>>>> automatically exported seems like it would have several benefits:
>>>>   1. It would provide a single way to configure how metrics are actually
>>>> exported.
>>>>   2. It would allow the runners to ensure it was reliably executed.
>>>>   3. It would allow the runner to report system metrics directly (e.g., if a
>>>> runner wanted to report the watermark, it could push that in directly).
>>>>
>>>> -- Ben
>>>>
>>>> On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Etienne forgot to mention that we started a PoC about that.
>>>>>
>>>>> What I started is to wrap the Pipeline creation to include a thread that
>>>>> periodically polls the metrics in the pipeline result (it's what I proposed when
>>>>> I compared with Karaf Decanter some time ago).
>>>>> Then, this thread marshalls the collected metrics and sends them to a sink. In the
>>>>> end, it means that the harvested metrics data will be stored in a backend (for
>>>>> instance Elasticsearch).
>>>>>
>>>>> The pro of this approach is that it doesn't require any change in the core; it's
>>>>> up to the user to use the PipelineWithMetric wrapper.
>>>>>
>>>>> The con is that the user needs to explicitly use the PipelineWithMetric
>>>>> wrapper.
>>>>>
>>>>> IMHO, it's good enough, as the user can decide to poll metrics for some
>>>>> pipelines and not for others.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I came across this ticket https://issues.apache.org/jira/browse/BEAM-2456. I know
>>>>>> that the metrics subject has already been discussed a lot, but I would like to
>>>>>> revive the discussion.
>>>>>>
>>>>>> The aim in this ticket is to avoid relying on the runner to provide the metrics,
>>>>>> because the runners don't all have the same capabilities regarding metrics. The idea
>>>>>> in the ticket is to still use the Beam metrics API (and not others like Codahale, as
>>>>>> has been discussed some time ago) and provide a way to extract the metrics with
>>>>>> a polling thread that would be forked by a PipelineWithMetrics (so, almost
>>>>>> invisible to the end user) and then push them to a sink (such as an HTTP REST sink,
>>>>>> for example, or a Graphite sink, or anything else...). Nevertheless, a polling
>>>>>> thread might not work for all the runners because some might not make the
>>>>>> metrics available before the end of the pipeline. Also, forking a thread would
>>>>>> be a bit unconventional, so it could be provided as a Beam SDK extension.
>>>>>>
>>>>>> Another way, to avoid polling, would be to push metrics values to a sink when
>>>>>> they are updated, but I don't know if that is feasible in a runner-independent way.
>>>>>> WDYT about the ideas in this ticket?
>>>>>>
>>>>>> Best,
>>>>>> Etienne
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbonofre@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>>>
>>>
>>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com