You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Scott Wegner <sc...@apache.org> on 2018/09/21 19:46:01 UTC

Re: Metrics Pusher support on Dataflow

Hi Etienne, sorry for the delay on this. I just got back from leave and
found this discussion.

We haven't started implementing MetricsPusher in the Dataflow runner,
mostly because the Dataflow service has it's own rich Metrics REST API and
we haven't heard a need from Dataflow customers to push metrics to an
external backend. However, it would be nice to have this implemented across
all runners for feature parity.

I read through the discussion in JIRA [1], and the simplest implementation
for Dataflow may be to have a single thread periodically poll the Dataflow
REST API [2] for latest metric values, and push them to a configured sink.
This polling thread could be hosted in a separate docker container, within
the worker process, or perhaps a ParDo with timers that gets injected into
the pipeline during graph translation.

At any rate, I'm not aware of anybody currently working on this. But with
the Dataflow worker code being donated to Beam [3], soon it will be
possible for anybody to contribute.

[1] https://issues.apache.org/jira/browse/BEAM-3926
[2]
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
[3]
https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.org%3E

On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik <lc...@google.com> wrote:

> I forwarded your request to a few people who work on the internal parts of
> Dataflow to see if they could help in some way.
>
> On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot <ec...@apache.org>
> wrote:
>
>> Hi all
>>
>> As we already discussed, it would be good to support Metrics Pusher [1]
>> in Dataflow (in other runners also, of course). Today, only Spark and Flink
>> support it. It requires a modification in C++ Dataflow code, so only Google
>> friends can do it.
>>
>> Is someone interested in doing it ?
>>
>> Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
>>
>> Besides, I wonder if this feature should be added to the capability
>> matrix.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
>>
>> Thanks
>> Etienne
>>
>

-- 

Got feedback? tinyurl.com/swegner-feedback

Re: Metrics Pusher support on Dataflow

Posted by Scott Wegner <sc...@apache.org>.

If we implement MetricsPusher to run from the local job submission JVM for
Dataflow jobs and the JVM dies: the Dataflow job would continue to
completion, but the MetricsPusher would not restart, so the exported
metrics would be stale.

I like that the Spark implementation has the ability to elect and manage a
driver worker in the cluster. Dataflow doesn't have such functionality,
although it would be useful. One way to emulate it would be Reuven's
suggestion to add a synthetic ParDo + timer to the graph.

On Thu, Oct 4, 2018 at 5:28 AM Etienne Chauchot <ec...@apache.org>
wrote:

> Well if we add the code in DataflowPipelineJob, then it would issue REST
> calls to the engine no ? And also, it will be far from the actual execution
> engine. Was is not what we wanted to avoid ?
>
> Regarding the driver, if I do an analogy with spark, even if it is not a
> cloud hosted runner (driver in spark is close to the execution engine): The
> Metrics pusher polling thread indeed lives inside the driver program
> context. This program can be either hosted in the submission client process
> (when pipeline is submitted in spark client mode) or in an elected driver
> worker in the cluster (when pipeline is submitted in spark cluster mode).
> In the latter case, the cluster manager such as yarn or mesos can relaunch
> the driver program if the worker dies.
>
> Etienne
>
> Le mercredi 03 octobre 2018 à 16:49 -0700, Scott Wegner a écrit :
>
> Another point that we discussed at ApacheCon is that a difference between
> Dataflow and other runners is Dataflow is service-based and doesn't need a
> locally executing "driver" program. A local driver context is a good place
> to implement MetricsPusher because it is a singleton process.
>
> In fact, DataflowRunner supports PipelineResult.waitUntilFinish() [1],
> where we do maintain the local JVM context. Currently in this mode the
> runner polls the Dataflow service API for log messages [2]. It would be
> very easy to also poll for metric updates and push them out via
> MetricsPusher.
>
> [1]
> https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L169
> [2]
> https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L291
>
> On Wed, Oct 3, 2018 at 4:44 AM Etienne Chauchot <ec...@apache.org>
> wrote:
>
> Hi Scott,
> Thanks for the update.
> Both solutions look good to me. Though, they both have plus and minus. I
> let the googlers chose which is more appropriate:
>
> - DAG modifcation: less intrusive in Dataflow but the DAG executed and
> shown in the DAG UI in dataflow will contain an extra step that the user
> might wonder about.
> - polling thread: it is exactly what I did for the other runners, it is
> more transparent to the user but requires more infra work (adds a container
> that needs to be resilient)
>
> Best
> Etienne
>
> Le vendredi 21 septembre 2018 à 12:46 -0700, Scott Wegner a écrit :
>
> Hi Etienne, sorry for the delay on this. I just got back from leave and
> found this discussion.
>
> We haven't started implementing MetricsPusher in the Dataflow runner,
> mostly because the Dataflow service has it's own rich Metrics REST API and
> we haven't heard a need from Dataflow customers to push metrics to an
> external backend. However, it would be nice to have this implemented across
> all runners for feature parity.
>
> I read through the discussion in JIRA [1], and the simplest implementation
> for Dataflow may be to have a single thread periodically poll the Dataflow
> REST API [2] for latest metric values, and push them to a configured sink.
> This polling thread could be hosted in a separate docker container, within
> the worker process, or perhaps a ParDo with timers that gets injected into
> the pipeline during graph translation.
>
> At any rate, I'm not aware of anybody currently working on this. But with
> the Dataflow worker code being donated to Beam [3], soon it will be
> possible for anybody to contribute.
>
> [1] https://issues.apache.org/jira/browse/BEAM-3926
> [2]
> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
> [3]
> https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.org%3E
>
> On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik <lc...@google.com> wrote:
>
> I forwarded your request to a few people who work on the internal parts of
> Dataflow to see if they could help in some way.
>
> On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot <ec...@apache.org>
> wrote:
>
> Hi all
>
> As we already discussed, it would be good to support Metrics Pusher [1] in
> Dataflow (in other runners also, of course). Today, only Spark and Flink
> support it. It requires a modification in C++ Dataflow code, so only Google
> friends can do it.
>
> Is someone interested in doing it ?
>
> Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
>
> Besides, I wonder if this feature should be added to the capability matrix.
>
> [1]
> https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
>
> Thanks
> Etienne
>
>
>
>
>
>

-- 




Got feedback? tinyurl.com/swegner-feedback

Re: Metrics Pusher support on Dataflow

Posted by Etienne Chauchot <ec...@apache.org>.

Well if we add the code in DataflowPipelineJob, then it would issue REST calls to the engine no ? And also, it will be
far from the actual execution engine. Was is not what we wanted to avoid ?
Regarding the driver, if I do an analogy with spark, even if it is not a cloud hosted runner (driver in spark is close
to the execution engine): The Metrics pusher polling thread indeed lives inside the driver program context. This program
can be either hosted in the submission client process (when pipeline is submitted in spark client mode) or in an elected
driver worker in the cluster (when pipeline is submitted in spark cluster mode). In the latter case, the cluster manager
such as yarn or mesos can relaunch the driver program if the worker dies.
Etienne
Le mercredi 03 octobre 2018 à 16:49 -0700, Scott Wegner a écrit :
> Another point that we discussed at ApacheCon is that a difference between Dataflow and other runners is Dataflow is
> service-based and doesn't need a locally executing "driver" program. A local driver context is a good place to
> implement MetricsPusher because it is a singleton process.
> In fact, DataflowRunner supports PipelineResult.waitUntilFinish() [1], where we do maintain the local JVM context.
> Currently in this mode the runner polls the Dataflow service API for log messages [2]. It would be very easy to also
> poll for metric updates and push them out via MetricsPusher.
> 
> [1] https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/sr
> c/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L169
> [2] https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/sr
> c/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L291
> 
> On Wed, Oct 3, 2018 at 4:44 AM Etienne Chauchot <ec...@apache.org> wrote:
> > Hi Scott,Thanks for the update. Both solutions look good to me. Though, they both have plus and minus. I let the
> > googlers chose which is more appropriate:
> > - DAG modifcation: less intrusive in Dataflow but the DAG executed and shown in the DAG UI in dataflow will contain
> > an extra step that the user might wonder about.- polling thread: it is exactly what I did for the other runners, it
> > is more transparent to the user but  requires more infra work (adds a container that needs to be resilient)
> > BestEtienne
> > Le vendredi 21 septembre 2018 à 12:46 -0700, Scott Wegner a écrit :
> > > Hi Etienne, sorry for the delay on this. I just got back from leave and found this discussion.
> > > We haven't started implementing MetricsPusher in the Dataflow runner, mostly because the Dataflow service has it's
> > > own rich Metrics REST API and we haven't heard a need from Dataflow customers to push metrics to an external
> > > backend. However, it would be nice to have this implemented across all runners for feature parity.
> > > 
> > > I read through the discussion in JIRA [1], and the simplest implementation for Dataflow may be to have a single
> > > thread periodically poll the Dataflow REST API [2] for latest metric values, and push them to a configured sink.
> > > This polling thread could be hosted in a separate docker container, within the worker process, or perhaps a ParDo
> > > with timers that gets injected into the pipeline during graph translation.
> > > 
> > > At any rate, I'm not aware of anybody currently working on this. But with the Dataflow worker code being donated
> > > to Beam [3], soon it will be possible for anybody to contribute.
> > > 
> > > [1] https://issues.apache.org/jira/browse/BEAM-3926
> > > [2] https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
> > > [3] https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apac
> > > he.org%3E
> > > 
> > > On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik <lc...@google.com> wrote:
> > > > I forwarded your request to a few people who work on the internal parts of Dataflow to see if they could help in
> > > > some way.
> > > > On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot <ec...@apache.org> wrote:
> > > > > Hi all
> > > > > 
> > > > > As we already discussed, it would be good to support Metrics Pusher [1] in Dataflow (in other runners also, of
> > > > > course). Today, only Spark and Flink support it. It requires a modification in C++ Dataflow code, so only
> > > > > Google friends can do it. 
> > > > > 
> > > > > Is someone interested in doing it ? 
> > > > > 
> > > > > Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
> > > > > 
> > > > > Besides, I wonder if this feature should be added to the capability matrix.
> > > > > 
> > > > > [1] https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
> > > > > 
> > > > > Thanks
> > > > > Etienne
> > > 
> > > 
> 
>

Re: Metrics Pusher support on Dataflow

Posted by Reuven Lax <re...@google.com>.

IF the local JVM dies, the job keeps running.

On Thu, Oct 4, 2018 at 12:03 AM Jozef Vilcek <jo...@gmail.com> wrote:

> Just curious here. What happens when this local JVM context dies for any
> reason?  How does it work with DataFlow?
>
> On Thu, Oct 4, 2018 at 1:50 AM Scott Wegner <sc...@apache.org> wrote:
>
>> Another point that we discussed at ApacheCon is that a difference between
>> Dataflow and other runners is Dataflow is service-based and doesn't need a
>> locally executing "driver" program. A local driver context is a good place
>> to implement MetricsPusher because it is a singleton process.
>>
>> In fact, DataflowRunner supports PipelineResult.waitUntilFinish() [1],
>> where we do maintain the local JVM context. Currently in this mode the
>> runner polls the Dataflow service API for log messages [2]. It would be
>> very easy to also poll for metric updates and push them out via
>> MetricsPusher.
>>
>> [1]
>> https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L169
>> [2]
>> https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L291
>>
>> On Wed, Oct 3, 2018 at 4:44 AM Etienne Chauchot <ec...@apache.org>
>> wrote:
>>
>>> Hi Scott,
>>> Thanks for the update.
>>> Both solutions look good to me. Though, they both have plus and minus. I
>>> let the googlers chose which is more appropriate:
>>>
>>> - DAG modifcation: less intrusive in Dataflow but the DAG executed and
>>> shown in the DAG UI in dataflow will contain an extra step that the user
>>> might wonder about.
>>> - polling thread: it is exactly what I did for the other runners, it is
>>> more transparent to the user but requires more infra work (adds a container
>>> that needs to be resilient)
>>>
>>> Best
>>> Etienne
>>>
>>> Le vendredi 21 septembre 2018 à 12:46 -0700, Scott Wegner a écrit :
>>>
>>> Hi Etienne, sorry for the delay on this. I just got back from leave and
>>> found this discussion.
>>>
>>> We haven't started implementing MetricsPusher in the Dataflow runner,
>>> mostly because the Dataflow service has it's own rich Metrics REST API and
>>> we haven't heard a need from Dataflow customers to push metrics to an
>>> external backend. However, it would be nice to have this implemented across
>>> all runners for feature parity.
>>>
>>> I read through the discussion in JIRA [1], and the simplest
>>> implementation for Dataflow may be to have a single thread periodically
>>> poll the Dataflow REST API [2] for latest metric values, and push them to a
>>> configured sink. This polling thread could be hosted in a separate docker
>>> container, within the worker process, or perhaps a ParDo with timers that
>>> gets injected into the pipeline during graph translation.
>>>
>>> At any rate, I'm not aware of anybody currently working on this. But
>>> with the Dataflow worker code being donated to Beam [3], soon it will be
>>> possible for anybody to contribute.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-3926
>>> [2]
>>> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
>>> [3]
>>> https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.org%3E
>>>
>>> On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>> I forwarded your request to a few people who work on the internal parts
>>> of Dataflow to see if they could help in some way.
>>>
>>> On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot <ec...@apache.org>
>>> wrote:
>>>
>>> Hi all
>>>
>>> As we already discussed, it would be good to support Metrics Pusher [1]
>>> in Dataflow (in other runners also, of course). Today, only Spark and Flink
>>> support it. It requires a modification in C++ Dataflow code, so only Google
>>> friends can do it.
>>>
>>> Is someone interested in doing it ?
>>>
>>> Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
>>>
>>> Besides, I wonder if this feature should be added to the capability
>>> matrix.
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
>>>
>>> Thanks
>>> Etienne
>>>
>>>
>>>
>>>
>>
>> --
>>
>>
>>
>>
>> Got feedback? tinyurl.com/swegner-feedback
>>
>

Re: Metrics Pusher support on Dataflow

Posted by Jozef Vilcek <jo...@gmail.com>.

Just curious here. What happens when this local JVM context dies for any
reason?  How does it work with DataFlow?

On Thu, Oct 4, 2018 at 1:50 AM Scott Wegner <sc...@apache.org> wrote:

> Another point that we discussed at ApacheCon is that a difference between
> Dataflow and other runners is Dataflow is service-based and doesn't need a
> locally executing "driver" program. A local driver context is a good place
> to implement MetricsPusher because it is a singleton process.
>
> In fact, DataflowRunner supports PipelineResult.waitUntilFinish() [1],
> where we do maintain the local JVM context. Currently in this mode the
> runner polls the Dataflow service API for log messages [2]. It would be
> very easy to also poll for metric updates and push them out via
> MetricsPusher.
>
> [1]
> https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L169
> [2]
> https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L291
>
> On Wed, Oct 3, 2018 at 4:44 AM Etienne Chauchot <ec...@apache.org>
> wrote:
>
>> Hi Scott,
>> Thanks for the update.
>> Both solutions look good to me. Though, they both have plus and minus. I
>> let the googlers chose which is more appropriate:
>>
>> - DAG modifcation: less intrusive in Dataflow but the DAG executed and
>> shown in the DAG UI in dataflow will contain an extra step that the user
>> might wonder about.
>> - polling thread: it is exactly what I did for the other runners, it is
>> more transparent to the user but requires more infra work (adds a container
>> that needs to be resilient)
>>
>> Best
>> Etienne
>>
>> Le vendredi 21 septembre 2018 à 12:46 -0700, Scott Wegner a écrit :
>>
>> Hi Etienne, sorry for the delay on this. I just got back from leave and
>> found this discussion.
>>
>> We haven't started implementing MetricsPusher in the Dataflow runner,
>> mostly because the Dataflow service has it's own rich Metrics REST API and
>> we haven't heard a need from Dataflow customers to push metrics to an
>> external backend. However, it would be nice to have this implemented across
>> all runners for feature parity.
>>
>> I read through the discussion in JIRA [1], and the simplest
>> implementation for Dataflow may be to have a single thread periodically
>> poll the Dataflow REST API [2] for latest metric values, and push them to a
>> configured sink. This polling thread could be hosted in a separate docker
>> container, within the worker process, or perhaps a ParDo with timers that
>> gets injected into the pipeline during graph translation.
>>
>> At any rate, I'm not aware of anybody currently working on this. But with
>> the Dataflow worker code being donated to Beam [3], soon it will be
>> possible for anybody to contribute.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-3926
>> [2]
>> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
>> [3]
>> https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.org%3E
>>
>> On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>> I forwarded your request to a few people who work on the internal parts
>> of Dataflow to see if they could help in some way.
>>
>> On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot <ec...@apache.org>
>> wrote:
>>
>> Hi all
>>
>> As we already discussed, it would be good to support Metrics Pusher [1]
>> in Dataflow (in other runners also, of course). Today, only Spark and Flink
>> support it. It requires a modification in C++ Dataflow code, so only Google
>> friends can do it.
>>
>> Is someone interested in doing it ?
>>
>> Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
>>
>> Besides, I wonder if this feature should be added to the capability
>> matrix.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
>>
>> Thanks
>> Etienne
>>
>>
>>
>>
>
> --
>
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>

Re: Metrics Pusher support on Dataflow

Posted by Scott Wegner <sc...@apache.org>.

Another point that we discussed at ApacheCon is that a difference between
Dataflow and other runners is Dataflow is service-based and doesn't need a
locally executing "driver" program. A local driver context is a good place
to implement MetricsPusher because it is a singleton process.

In fact, DataflowRunner supports PipelineResult.waitUntilFinish() [1],
where we do maintain the local JVM context. Currently in this mode the
runner polls the Dataflow service API for log messages [2]. It would be
very easy to also poll for metric updates and push them out via
MetricsPusher.

[1]
https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L169
[2]
https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L291

On Wed, Oct 3, 2018 at 4:44 AM Etienne Chauchot <ec...@apache.org>
wrote:

> Hi Scott,
> Thanks for the update.
> Both solutions look good to me. Though, they both have plus and minus. I
> let the googlers chose which is more appropriate:
>
> - DAG modifcation: less intrusive in Dataflow but the DAG executed and
> shown in the DAG UI in dataflow will contain an extra step that the user
> might wonder about.
> - polling thread: it is exactly what I did for the other runners, it is
> more transparent to the user but requires more infra work (adds a container
> that needs to be resilient)
>
> Best
> Etienne
>
> Le vendredi 21 septembre 2018 à 12:46 -0700, Scott Wegner a écrit :
>
> Hi Etienne, sorry for the delay on this. I just got back from leave and
> found this discussion.
>
> We haven't started implementing MetricsPusher in the Dataflow runner,
> mostly because the Dataflow service has it's own rich Metrics REST API and
> we haven't heard a need from Dataflow customers to push metrics to an
> external backend. However, it would be nice to have this implemented across
> all runners for feature parity.
>
> I read through the discussion in JIRA [1], and the simplest implementation
> for Dataflow may be to have a single thread periodically poll the Dataflow
> REST API [2] for latest metric values, and push them to a configured sink.
> This polling thread could be hosted in a separate docker container, within
> the worker process, or perhaps a ParDo with timers that gets injected into
> the pipeline during graph translation.
>
> At any rate, I'm not aware of anybody currently working on this. But with
> the Dataflow worker code being donated to Beam [3], soon it will be
> possible for anybody to contribute.
>
> [1] https://issues.apache.org/jira/browse/BEAM-3926
> [2]
> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
> [3]
> https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.org%3E
>
> On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik <lc...@google.com> wrote:
>
> I forwarded your request to a few people who work on the internal parts of
> Dataflow to see if they could help in some way.
>
> On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot <ec...@apache.org>
> wrote:
>
> Hi all
>
> As we already discussed, it would be good to support Metrics Pusher [1] in
> Dataflow (in other runners also, of course). Today, only Spark and Flink
> support it. It requires a modification in C++ Dataflow code, so only Google
> friends can do it.
>
> Is someone interested in doing it ?
>
> Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
>
> Besides, I wonder if this feature should be added to the capability matrix.
>
> [1]
> https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
>
> Thanks
> Etienne
>
>
>
>

-- 




Got feedback? tinyurl.com/swegner-feedback

Re: Metrics Pusher support on Dataflow

Posted by Etienne Chauchot <ec...@apache.org>.

Hi Scott,Thanks for the update. Both solutions look good to me. Though, they both have plus and minus. I let the
googlers chose which is more appropriate:
- DAG modifcation: less intrusive in Dataflow but the DAG executed and shown in the DAG UI in dataflow will contain an
extra step that the user might wonder about.- polling thread: it is exactly what I did for the other runners, it is more
transparent to the user but  requires more infra work (adds a container that needs to be resilient)
BestEtienne
Le vendredi 21 septembre 2018 à 12:46 -0700, Scott Wegner a écrit :
> Hi Etienne, sorry for the delay on this. I just got back from leave and found this discussion.
> We haven't started implementing MetricsPusher in the Dataflow runner, mostly because the Dataflow service has it's own
> rich Metrics REST API and we haven't heard a need from Dataflow customers to push metrics to an external backend.
> However, it would be nice to have this implemented across all runners for feature parity.
> 
> I read through the discussion in JIRA [1], and the simplest implementation for Dataflow may be to have a single thread
> periodically poll the Dataflow REST API [2] for latest metric values, and push them to a configured sink. This polling
> thread could be hosted in a separate docker container, within the worker process, or perhaps a ParDo with timers that
> gets injected into the pipeline during graph translation.
> 
> At any rate, I'm not aware of anybody currently working on this. But with the Dataflow worker code being donated to
> Beam [3], soon it will be possible for anybody to contribute.
> 
> [1] https://issues.apache.org/jira/browse/BEAM-3926
> [2] https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
> [3] https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.o
> rg%3E
> 
> On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik <lc...@google.com> wrote:
> > I forwarded your request to a few people who work on the internal parts of Dataflow to see if they could help in
> > some way.
> > On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot <ec...@apache.org> wrote:
> > > Hi all
> > > 
> > > As we already discussed, it would be good to support Metrics Pusher [1] in Dataflow (in other runners also, of
> > > course). Today, only Spark and Flink support it. It requires a modification in C++ Dataflow code, so only Google
> > > friends can do it. 
> > > 
> > > Is someone interested in doing it ? 
> > > 
> > > Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
> > > 
> > > Besides, I wonder if this feature should be added to the capability matrix.
> > > 
> > > [1] https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
> > > 
> > > Thanks
> > > Etienne
> 
>