You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by JingsongLee <lz...@aliyun.com> on 2017/06/02 01:52:11 UTC

Re: [DISCUSS] Source Watermark Metrics

@Aviem Zur @Ben Chambers What do you think about the value of METRIC_MAX_SPLITS?

------------------------------------------------------------------From:JingsongLee <lz...@aliyun.com>Time:2017 May 11 (Thu) 16:37To:dev@beam.apache.org <de...@beam.apache.org>Subject:[DISCUSS] Source Watermark Metrics
Hi everyone,

The source watermark metrics show the consumer latency of Source. 
It allows the user to know the health of the job, or it can be used to
 monitor and alarm.
We should have the runner report the watermark metricsrather than
 having the source report it using metrics. This addresses the fact that even
if the source has advanced to 8:00, the runner may still know about buffered
 elements at 7:00, and so not advance the watermark all the way to 8:00. 
The metrics Includes:
1.Source watermark (`min` amongst all splits):
type = Gauge, namespace = io, name = source_watermark
2.Source watermark per split:
type = Gauge, namespace = io.splits, name = <split_id>.source_watermark

Min Source watermark amongst all splits seems difficult to implement since 
some runners(like FlinkRunner) can't access to all the splits to aggregate 
and there is no such AggregatorMetric.

So We could report watermark per split and users could use a `min` 
aggregation on this in their metrics backends. However, as was mentioned 
in the IO metrics proposal by several people this could be problematic in 
sources with many splits.

So we do a check when report metrics to solve the problem of too many splits.
{code} 
if (splitsNum <= METRIC_MAX_SPLITS) {
  // set the sourceWatermarkOfSplit
}
{code}

So I'd like to take a discussion to the implement of source watermark metrics
 and specific how many splits is too many. (the value of METRIC_MAX_SPLITS)

JIRA: 
IO metrics (https://issues.apache.org/jira/browse/BEAM-1919)
Source watermark (https://issues.apache.org/jira/browse/BEAM-1941)

Re: [DISCUSS] Source Watermark Metrics

Posted by JingsongLee <lz...@aliyun.com>.

Hi @Ben Chambers @Aljoscha Krettek @Aviem Zur and other all, 
I've written this up as a proposal found here: 
https://docs.google.com/document/d/1ykjjG97DjVQP73jGbotGRbtK38hGvFbokNEOuNO4DAo/edit?usp=sharing
Feel free to comment/edit it.
Best, JingsongLee


------------------------------------------------------------------
From:Ben Chambers <bc...@google.com.INVALID>
Time:2017 Jun 6 (Tue) 22:46
To:JingsongLee <lz...@aliyun.com>; dev <de...@beam.apache.org>
Subject:Re: [DISCUSS] Source Watermark Metrics

The existing metrics allow a user to report additional values to the
runner. For something like the watermark that the runner already knows
about it doesn't need to fit into the set of metrics. Since the runner
already tracks the low watermark for each operator it can just report that
as it sees fit.

This means it shouldn't matter whether the current metrics tyoes can
express it, since they are for getting user numbers into the runner.

On Sun, Jun 4, 2017, 8:27 PM JingsongLee <lz...@aliyun.com> wrote:

> I feel reporting the current low watermark for each operator is better
> than just reporting the source watermark when I see Flink 1.3 web frontend.
>
> We want the smallest watermark in all splits.  But Some runners, like
> FlinkRunner, don't have a way to get the global smallest watermark,  and
> the metric's type(Counter, Guage, Distribution) can not express it.
>
> Best,JingsongLee
> ------------------------------------------------------------------
> From:Ben Chambers <bc...@google.com.INVALID>
> Time:2017 Jun 2 (Fri) 21:46
> To:dev <de...@beam.apache.org>; JingsongLee <lz...@aliyun.com>
> Cc:Aviem Zur <av...@gmail.com>; Ben Chambers
> <bc...@google.com.invalid>
> Subject:Re: [DISCUSS] Source Watermark Metrics
> I think having runners report important, general properties such as the
> source watermark is great. It is much easier than requiring every source to
> expose it.
>
> I'm not sure how we would require this or do so in a general way. Each
> runner has seperate code for handling the watermark as well as different
> ways information should be reported.
>
> Where would the runner do this? Where would the runner.put these values?
> Maybe this is just part of the documentation about what we would like
> runners to do?
>
> On Fri, Jun 2, 2017, 3:09 AM Aljoscha Krettek <al...@apache.org> wrote:
>
> > Hi,
> >
> > Thanks for reviving this thread. I think having the watermark is very
>
> > good. Some runners, for example Dataflow and Flink have their own internal
> > metric for the watermark but having it cross-runner seems beneficial (if
> > maybe a bit wasteful).
> >
> > Best,
> > Aljoscha
> >
> > > On 2. Jun 2017, at 03:52, JingsongLee <lz...@aliyun.com> wrote:
> > >
> > > @Aviem Zur @Ben Chambers What do you think about the value of
> > METRIC_MAX_SPLITS?
> > >
> > >
>
> > ------------------------------------------------------------------From:JingsongLee
> > <lz...@aliyun.com>Time:2017 May 11 (Thu)
> > 16:37To:dev@beam.apache.org <dev@beam.apache.org
> >Subject:[DISCUSS] Source
> > Watermark Metrics
> > > Hi everyone,
> > >
> > > The source watermark metrics show the consumer latency of Source.
> > > It allows the user to know the health of the job, or it can be used to
> > >  monitor and alarm.
> > > We should have the runner report the watermark metricsrather than
>
> > >  having the source report it using metrics. This addresses the fact that
> > even
> > > if the source has advanced to 8:00, the runner may still know about
> > buffered
>
> > >  elements at 7:00, and so not advance the watermark all the way to 8:00.
> > > The metrics Includes:
> > > 1.Source watermark (`min` amongst all splits):
> > > type = Gauge, namespace = io, name = source_watermark
> > > 2.Source watermark per split:
> > > type = Gauge, namespace = io.splits, name = <split_id>.source_watermark
> > >
> > > Min Source watermark amongst all splits seems difficult to implement
> > since
> > > some runners(like FlinkRunner) can't access to all the splits to
> > aggregate
> > > and there is no such AggregatorMetric.
> > >
> > > So We could report watermark per split and users could use a `min`
>
> > > aggregation on this in their metrics backends. However, as was mentioned
>
> > > in the IO metrics proposal by several people this could be problematic in
> > > sources with many splits.
> > >
> > > So we do a check when report metrics to solve the problem of too many
> > splits.
> > > {code}
> > > if (splitsNum <= METRIC_MAX_SPLITS) {
> > >   // set the sourceWatermarkOfSplit
> > > }
> > > {code}
> > >
> > > So I'd like to take a discussion to the implement of source watermark
> > metrics
> > >  and specific how many splits is too many. (the value of
> > METRIC_MAX_SPLITS)
> > >
> > > JIRA:
> > > IO metrics (https://issues.apache.org/jira/browse/BEAM-1919)
> > > Source watermark (https://issues.apache.org/jira/browse/BEAM-1941)
> > >
> >
> >
>
>

Re: [DISCUSS] Source Watermark Metrics

Posted by Ben Chambers <bc...@google.com.INVALID>.

The existing metrics allow a user to report additional values to the
runner. For something like the watermark that the runner already knows
about it doesn't need to fit into the set of metrics. Since the runner
already tracks the low watermark for each operator it can just report that
as it sees fit.

This means it shouldn't matter whether the current metrics tyoes can
express it, since they are for getting user numbers into the runner.

On Sun, Jun 4, 2017, 8:27 PM JingsongLee <lz...@aliyun.com> wrote:

> I feel reporting the current low watermark for each operator is better
> than just reporting the source watermark when I see Flink 1.3 web frontend.
>
> We want the smallest watermark in all splits.  But Some runners, like
> FlinkRunner, don't have a way to get the global smallest watermark,  and
> the metric's type(Counter, Guage, Distribution) can not express it.
>
> Best,JingsongLee
> ------------------------------------------------------------------
> From:Ben Chambers <bc...@google.com.INVALID>
> Time:2017 Jun 2 (Fri) 21:46
> To:dev <de...@beam.apache.org>; JingsongLee <lz...@aliyun.com>
> Cc:Aviem Zur <av...@gmail.com>; Ben Chambers
> <bc...@google.com.invalid>
> Subject:Re: [DISCUSS] Source Watermark Metrics
> I think having runners report important, general properties such as the
> source watermark is great. It is much easier than requiring every source to
> expose it.
>
> I'm not sure how we would require this or do so in a general way. Each
> runner has seperate code for handling the watermark as well as different
> ways information should be reported.
>
> Where would the runner do this? Where would the runner.put these values?
> Maybe this is just part of the documentation about what we would like
> runners to do?
>
> On Fri, Jun 2, 2017, 3:09 AM Aljoscha Krettek <al...@apache.org> wrote:
>
> > Hi,
> >
> > Thanks for reviving this thread. I think having the watermark is very
>
> > good. Some runners, for example Dataflow and Flink have their own internal
> > metric for the watermark but having it cross-runner seems beneficial (if
> > maybe a bit wasteful).
> >
> > Best,
> > Aljoscha
> >
> > > On 2. Jun 2017, at 03:52, JingsongLee <lz...@aliyun.com> wrote:
> > >
> > > @Aviem Zur @Ben Chambers What do you think about the value of
> > METRIC_MAX_SPLITS?
> > >
> > >
>
> > ------------------------------------------------------------------From:JingsongLee
> > <lz...@aliyun.com>Time:2017 May 11 (Thu)
> > 16:37To:dev@beam.apache.org <dev@beam.apache.org
> >Subject:[DISCUSS] Source
> > Watermark Metrics
> > > Hi everyone,
> > >
> > > The source watermark metrics show the consumer latency of Source.
> > > It allows the user to know the health of the job, or it can be used to
> > >  monitor and alarm.
> > > We should have the runner report the watermark metricsrather than
>
> > >  having the source report it using metrics. This addresses the fact that
> > even
> > > if the source has advanced to 8:00, the runner may still know about
> > buffered
>
> > >  elements at 7:00, and so not advance the watermark all the way to 8:00.
> > > The metrics Includes:
> > > 1.Source watermark (`min` amongst all splits):
> > > type = Gauge, namespace = io, name = source_watermark
> > > 2.Source watermark per split:
> > > type = Gauge, namespace = io.splits, name = <split_id>.source_watermark
> > >
> > > Min Source watermark amongst all splits seems difficult to implement
> > since
> > > some runners(like FlinkRunner) can't access to all the splits to
> > aggregate
> > > and there is no such AggregatorMetric.
> > >
> > > So We could report watermark per split and users could use a `min`
>
> > > aggregation on this in their metrics backends. However, as was mentioned
>
> > > in the IO metrics proposal by several people this could be problematic in
> > > sources with many splits.
> > >
> > > So we do a check when report metrics to solve the problem of too many
> > splits.
> > > {code}
> > > if (splitsNum <= METRIC_MAX_SPLITS) {
> > >   // set the sourceWatermarkOfSplit
> > > }
> > > {code}
> > >
> > > So I'd like to take a discussion to the implement of source watermark
> > metrics
> > >  and specific how many splits is too many. (the value of
> > METRIC_MAX_SPLITS)
> > >
> > > JIRA:
> > > IO metrics (https://issues.apache.org/jira/browse/BEAM-1919)
> > > Source watermark (https://issues.apache.org/jira/browse/BEAM-1941)
> > >
> >
> >
>
>

Re: [DISCUSS] Source Watermark Metrics

Posted by JingsongLee <lz...@aliyun.com>.

I feel reporting the current low watermark for each operator is better than just reporting the source watermark when I see Flink 1.3 web frontend.

We want the smallest watermark in all splits.  But Some runners, like FlinkRunner, don't have a way to get the global smallest watermark,  and the metric's type(Counter, Guage, Distribution) can not express it.

Best,JingsongLee
------------------------------------------------------------------
From:Ben Chambers <bc...@google.com.INVALID>
Time:2017 Jun 2 (Fri) 21:46
To:dev <de...@beam.apache.org>; JingsongLee <lz...@aliyun.com>
Cc:Aviem Zur <av...@gmail.com>; Ben Chambers <bc...@google.com.invalid>
Subject:Re: [DISCUSS] Source Watermark Metrics
I think having runners report important, general properties such as the
source watermark is great. It is much easier than requiring every source to
expose it.

I'm not sure how we would require this or do so in a general way. Each
runner has seperate code for handling the watermark as well as different
ways information should be reported.

Where would the runner do this? Where would the runner.put these values?
Maybe this is just part of the documentation about what we would like
runners to do?

On Fri, Jun 2, 2017, 3:09 AM Aljoscha Krettek <al...@apache.org> wrote:

> Hi,
>
> Thanks for reviving this thread. I think having the watermark is very
> good. Some runners, for example Dataflow and Flink have their own internal
> metric for the watermark but having it cross-runner seems beneficial (if
> maybe a bit wasteful).
>
> Best,
> Aljoscha
>
> > On 2. Jun 2017, at 03:52, JingsongLee <lz...@aliyun.com> wrote:
> >
> > @Aviem Zur @Ben Chambers What do you think about the value of
> METRIC_MAX_SPLITS?
> >
> >
> ------------------------------------------------------------------From:JingsongLee
> <lz...@aliyun.com>Time:2017 May 11 (Thu)
> 16:37To:dev@beam.apache.org <de...@beam.apache.org>Subject:[DISCUSS] Source
> Watermark Metrics
> > Hi everyone,
> >
> > The source watermark metrics show the consumer latency of Source.
> > It allows the user to know the health of the job, or it can be used to
> >  monitor and alarm.
> > We should have the runner report the watermark metricsrather than
> >  having the source report it using metrics. This addresses the fact that
> even
> > if the source has advanced to 8:00, the runner may still know about
> buffered
> >  elements at 7:00, and so not advance the watermark all the way to 8:00.
> > The metrics Includes:
> > 1.Source watermark (`min` amongst all splits):
> > type = Gauge, namespace = io, name = source_watermark
> > 2.Source watermark per split:
> > type = Gauge, namespace = io.splits, name = <split_id>.source_watermark
> >
> > Min Source watermark amongst all splits seems difficult to implement
> since
> > some runners(like FlinkRunner) can't access to all the splits to
> aggregate
> > and there is no such AggregatorMetric.
> >
> > So We could report watermark per split and users could use a `min`
> > aggregation on this in their metrics backends. However, as was mentioned
> > in the IO metrics proposal by several people this could be problematic in
> > sources with many splits.
> >
> > So we do a check when report metrics to solve the problem of too many
> splits.
> > {code}
> > if (splitsNum <= METRIC_MAX_SPLITS) {
> >   // set the sourceWatermarkOfSplit
> > }
> > {code}
> >
> > So I'd like to take a discussion to the implement of source watermark
> metrics
> >  and specific how many splits is too many. (the value of
> METRIC_MAX_SPLITS)
> >
> > JIRA:
> > IO metrics (https://issues.apache.org/jira/browse/BEAM-1919)
> > Source watermark (https://issues.apache.org/jira/browse/BEAM-1941)
> >
>
>

Re: [DISCUSS] Source Watermark Metrics

Posted by Ben Chambers <bc...@google.com.INVALID>.

I think having runners report important, general properties such as the
source watermark is great. It is much easier than requiring every source to
expose it.

I'm not sure how we would require this or do so in a general way. Each
runner has seperate code for handling the watermark as well as different
ways information should be reported.

Where would the runner do this? Where would the runner.put these values?
Maybe this is just part of the documentation about what we would like
runners to do?

On Fri, Jun 2, 2017, 3:09 AM Aljoscha Krettek <al...@apache.org> wrote:

> Hi,
>
> Thanks for reviving this thread. I think having the watermark is very
> good. Some runners, for example Dataflow and Flink have their own internal
> metric for the watermark but having it cross-runner seems beneficial (if
> maybe a bit wasteful).
>
> Best,
> Aljoscha
>
> > On 2. Jun 2017, at 03:52, JingsongLee <lz...@aliyun.com> wrote:
> >
> > @Aviem Zur @Ben Chambers What do you think about the value of
> METRIC_MAX_SPLITS?
> >
> >
> ------------------------------------------------------------------From:JingsongLee
> <lz...@aliyun.com>Time:2017 May 11 (Thu)
> 16:37To:dev@beam.apache.org <de...@beam.apache.org>Subject:[DISCUSS] Source
> Watermark Metrics
> > Hi everyone,
> >
> > The source watermark metrics show the consumer latency of Source.
> > It allows the user to know the health of the job, or it can be used to
> >  monitor and alarm.
> > We should have the runner report the watermark metricsrather than
> >  having the source report it using metrics. This addresses the fact that
> even
> > if the source has advanced to 8:00, the runner may still know about
> buffered
> >  elements at 7:00, and so not advance the watermark all the way to 8:00.
> > The metrics Includes:
> > 1.Source watermark (`min` amongst all splits):
> > type = Gauge, namespace = io, name = source_watermark
> > 2.Source watermark per split:
> > type = Gauge, namespace = io.splits, name = <split_id>.source_watermark
> >
> > Min Source watermark amongst all splits seems difficult to implement
> since
> > some runners(like FlinkRunner) can't access to all the splits to
> aggregate
> > and there is no such AggregatorMetric.
> >
> > So We could report watermark per split and users could use a `min`
> > aggregation on this in their metrics backends. However, as was mentioned
> > in the IO metrics proposal by several people this could be problematic in
> > sources with many splits.
> >
> > So we do a check when report metrics to solve the problem of too many
> splits.
> > {code}
> > if (splitsNum <= METRIC_MAX_SPLITS) {
> >   // set the sourceWatermarkOfSplit
> > }
> > {code}
> >
> > So I'd like to take a discussion to the implement of source watermark
> metrics
> >  and specific how many splits is too many. (the value of
> METRIC_MAX_SPLITS)
> >
> > JIRA:
> > IO metrics (https://issues.apache.org/jira/browse/BEAM-1919)
> > Source watermark (https://issues.apache.org/jira/browse/BEAM-1941)
> >
>
>

Re: [DISCUSS] Source Watermark Metrics

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,

Thanks for reviving this thread. I think having the watermark is very good. Some runners, for example Dataflow and Flink have their own internal metric for the watermark but having it cross-runner seems beneficial (if maybe a bit wasteful).

Best,
Aljoscha

> On 2. Jun 2017, at 03:52, JingsongLee <lz...@aliyun.com> wrote:
> 
> @Aviem Zur @Ben Chambers What do you think about the value of METRIC_MAX_SPLITS?
> 
> ------------------------------------------------------------------From:JingsongLee <lz...@aliyun.com>Time:2017 May 11 (Thu) 16:37To:dev@beam.apache.org <de...@beam.apache.org>Subject:[DISCUSS] Source Watermark Metrics
> Hi everyone,
> 
> The source watermark metrics show the consumer latency of Source. 
> It allows the user to know the health of the job, or it can be used to
>  monitor and alarm.
> We should have the runner report the watermark metricsrather than
>  having the source report it using metrics. This addresses the fact that even
> if the source has advanced to 8:00, the runner may still know about buffered
>  elements at 7:00, and so not advance the watermark all the way to 8:00. 
> The metrics Includes:
> 1.Source watermark (`min` amongst all splits):
> type = Gauge, namespace = io, name = source_watermark
> 2.Source watermark per split:
> type = Gauge, namespace = io.splits, name = <split_id>.source_watermark
> 
> Min Source watermark amongst all splits seems difficult to implement since 
> some runners(like FlinkRunner) can't access to all the splits to aggregate 
> and there is no such AggregatorMetric.
> 
> So We could report watermark per split and users could use a `min` 
> aggregation on this in their metrics backends. However, as was mentioned 
> in the IO metrics proposal by several people this could be problematic in 
> sources with many splits.
> 
> So we do a check when report metrics to solve the problem of too many splits.
> {code} 
> if (splitsNum <= METRIC_MAX_SPLITS) {
>   // set the sourceWatermarkOfSplit
> }
> {code}
> 
> So I'd like to take a discussion to the implement of source watermark metrics
>  and specific how many splits is too many. (the value of METRIC_MAX_SPLITS)
> 
> JIRA: 
> IO metrics (https://issues.apache.org/jira/browse/BEAM-1919)
> Source watermark (https://issues.apache.org/jira/browse/BEAM-1941)
>