Posted to dev@storm.apache.org by Jungtaek Lim <ka...@gmail.com> on 2016/06/01 10:10:09 UTC

Re: [DISCUSSION] opinions on breaking changes on metrics for 1.x

While we are talking about the new metrics feature, I still want to apply the
available fixes for the current metrics to 1.x, since the current metrics feature
has been in use for a long time and these fixes will help reduce the pain.

- STORM-1698 <https://issues.apache.org/jira/browse/STORM-1698>: Asynchronous MetricsConsumerBolt
  - related to STORM-1832 <https://issues.apache.org/jira/browse/STORM-1832>, which is marked as 'Critical'
- STORM-1700 <https://issues.apache.org/jira/browse/STORM-1700>: Introduce 'whitelist' / 'blacklist' option to MetricsConsumer (a rough sketch follows this list)
- STORM-1719 <https://issues.apache.org/jira/browse/STORM-1719>: Introduce REST API: Topology metric stats for stream (only available for 1.x)
- STORM-1723 <https://issues.apache.org/jira/browse/STORM-1723>: Introduce ClusterMetricsConsumer (only available for 1.x)
- STORM-1724 <https://issues.apache.org/jira/browse/STORM-1724>: Fill up lacking contents to Metrics documentation
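
To give a concrete idea of the whitelist option, here is a rough sketch of how a
registration might look from topology code. The "whitelist" key and the regex /
name matching here are only my illustration; the actual shape is whatever lands
from the STORM-1700 review. Config.TOPOLOGY_METRICS_CONSUMER_REGISTER and
LoggingMetricsConsumer are existing pieces.

  // Sketch only (needs java.util.* and org.apache.storm.Config):
  Map<String, Object> registration = new HashMap<>();
  registration.put("class", "org.apache.storm.metric.LoggingMetricsConsumer");
  registration.put("parallelism.hint", 1L);
  // hypothetical key: keep only these built-in metrics (example names)
  registration.put("whitelist", Arrays.asList("__execute-latency", "__execute-count"));

  Config conf = new Config();
  conf.put(Config.TOPOLOGY_METRICS_CONSUMER_REGISTER,
           Collections.singletonList(registration));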

They're already available for review, and they are still helpful ahead of the new
metrics feature, which will take a fair amount of time.
I'd like to include these as an Epic for 1.1.0 if there are no objections. Please
share your opinions, and please review them.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Tue, May 24, 2016 at 3:59 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> Inlined my thoughts, but I totally agree with Bobby. Starting with good
> questions seems more important than trying to measure more parts.
> Thanks for bringing up your opinion.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, May 23, 2016 at 10:36 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:
>
>> From JStorm we got the ability to show most of the metrics on the UI, but
>> I think we missed the ability to turn on and off metrics dynamically.  This
>> is important because there will be metrics that are expensive to gather,
>> but very important when debugging specific problems.
>>
>> I also think we should spend some time thinking about why we are
>> gathering/computing specific metrics, what question we are trying to answer,
>> and whether there is a better way to answer it.
>>
>
>> For example we collect a lot of latency statistics for a bolt to answer a
>> few different questions, but for the most part the latency stats we have
>> are bad at answering either of those questions.
>>
>> 1) How overloaded is a particular bolt? aka capacity.
>> 2) How does this bolt contribute to the overall processing time in a
>> topology?
>>
>> For the first one we subsample because it is expensive to compute the
>> latency every time.  But in practice I have seen many bolts with a capacity
>> well over 1; I have even seen a 35.  How can a bolt be running 3500% of the
>> time? If we use the input queue depth instead, that could give us a much
>> better picture, and be a lot less expensive.
>>
>
> Yes, subsampling could throw off the capacity calculation. (Though I guess
> the 35 seems to be completely off.)
>
> In fact, it only considers execute latency, so even when capacity is
> calculated to be exactly 1, it doesn't mean the executor is keeping up. That
> would only hold true if the executor kept calling execute() with no other
> latency (no latency from reading the receive queue, GC pauses, context
> switches, etc.), which doesn't happen in practice.
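>
> Just to make the math concrete, here is roughly how the capacity number falls
> out (a sketch with made-up numbers, not Storm's actual code):
>
>   // capacity ~= executed * avg execute latency / window length (ms)
>   long executed = 600_000;              // ~1000 tuples/sec over the window
>   double sampledAvgLatencyMs = 35.0;    // average over the *sampled* tuples only
>   long windowMs = 10 * 60 * 1000;       // 10-minute window
>   double capacity = executed * sampledAvgLatencyMs / windowMs;   // = 35.0
>   // If the true average were ~1 ms, the bolt would really be at ~1.0 (fully
>   // busy), but a handful of slow sampled tuples inflates the number to 35.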
>
> Showing queue depth is a good way to expose the current state without any
> sampling (so it's clearer). We can show the used ratio based on the queue
> size, and even alert when it nearly reaches the backpressure threshold. We
> have the backpressure feature but no indicator for this. (Right?)
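>
> For the queue side, what I have in mind is as simple as this (again just a
> sketch; the variable names are made up, and the watermark is only meant to
> echo the existing backpressure high-watermark setting):
>
>   long population = 820;        // items currently sitting in the receive queue
>   long queueSize = 1024;        // configured queue capacity
>   double highWaterMark = 0.9;   // backpressure high watermark (0.9 by default, IIRC)
>   double usedRatio = (double) population / queueSize;      // 0.80
>   boolean nearBackpressure = usedRatio >= highWaterMark;   // not yet, but close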
>
> I've heard that some users want the sojourn time of the queue, and we already
> have it, but it's also calculated from assumptions, and I don't have enough
> knowledge in this area to be sure its value holds true.
>
>
>> For the second one we have talked about adding in more latency
>> measurements (serialization/deserialization time, etc.), but I think we will
>> end up playing whack-a-mole unless we take a holistic approach where we
>> don't just measure small parts of the system, but measure latency from
>> point A to point B and from point B to point C, so there are no gaps.
>>
>
> I'd just like to see how much the things we don't measure contribute to the
> whole latency. We all know that locality is important, but we don't show how
> much the transfer cost, serde, etc. actually amount to.
>
> But I totally agree with what you stated. We can't measure everything, so the
> sum of the latencies of the parts is not the same as the actual latency.
> Sampling and averaging the latency makes it worse.
>
> Having end-to-end latency is great, but its value depends on how we
> calculate it. If points A and B are on different nodes, the precision of the
> clock sync could account for a huge part of the measured latency, especially
> when the latency is small. That's why I rely on an assumption in the more
> accurate complete-latency work (STORM-1742
> <http://issues.apache.org/jira/browse/STORM-1742>).
>
> I just don't think we want to have a simple 1 to 1 translation.
>>
>
>> - Bobby
>>
>>
>> On Monday, May 23, 2016 5:05 AM, Jungtaek Lim <ka...@gmail.com> wrote:
>>
>>
>> Thanks for following up, Abhishek. I agree with you about the consensus.
>>
>> I think you already listed most of my / the community's requirements. I'm
>> adding a few things I have in mind:
>>
>> 7. Aggregation at stream level (STORM-1719 <https://issues.apache.org/jira/browse/STORM-1719>), and at machine level
>> 8. A way to subscribe to cluster metrics (STORM-1723 <https://issues.apache.org/jira/browse/STORM-1723>)
>> 9. Counter stats as non-sampled if it doesn't hurt performance
>> 10. More metrics, like serialization/deserialization latency and queue status
>>
>> I'd also like to see Spout offset information if the Spout implementation can
>> provide it, but that's just my 2 cents.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Mon, May 23, 2016 at 6:33 PM, Abhishek Agarwal <ab...@gmail.com> wrote:
>>
>> So I am assuming that there is a general consensus on adding a new API for
>> metrics and gradually phasing out the old one. If yes, maybe we can work
>> toward the finer details of how to maintain two APIs as well as the design of
>> the new API.
>>
>> Jungtaek, it would be better to summarize the requirements and let others
>> add whatever they feel is missing. Some asks I have seen:
>>
>> 1. Aggregation at component level (Average, Sum, etc.)
>> 2. Blacklist/whitelist
>> 3. Allow only numbers for values
>> 4. Efficient routing of built-in metrics to UI (currently they get tagged
>> along with the executor heartbeat, which puts pressure on ZooKeeper)
>> 5. Worker/JVM level metrics which are not owned by a particular component
>> 6. Percentiles for latency metrics, such as p99, p95, etc.
>>
>> Not all of them may be considered. Please add anything I might have
>> missed.
>>
>>
>>
>> On Fri, May 20, 2016 at 5:57 AM, Jungtaek Lim <ka...@gmail.com> wrote:
>>
>> > Personally, I'm also in favor of maintaining the old API (but deprecated) and
>> > adding a new API.
>> > It's the ideal way, it's what many projects try to do, and it's also what I do
>> > on the other project I maintain.
>> >
>> > And I also prefer that the current metrics feature go away in the next major
>> > release. In general, before each major release we should discuss which
>> > classes/features to drop. I think we forgot this when we released 1.0.0, so
>> > the deprecated things are all still alive.
>> >
>> > I'm also in favor of dropping the support for a custom frequency for each
>> > metric. I guess we don't need to worry about the mapping, since Storm's
>> > internal metrics will be moved to the new metrics feature. While some
>> > behavioral changes could occur (for example, IMetricsConsumer might no
>> > longer receive Storm's built-in metrics), it will not break backward
>> > compatibility on the API side anyway.
>> >
>> > Thanks,
>> > Jungtaek Lim (HeartSaVioR)
>> >
>> > On Fri, May 20, 2016 at 12:57 AM, Abhishek Agarwal <ab...@gmail.com> wrote:
>> >
>> > > Sounds good. Having two separate metric reporters may be confusing, but it
>> > > is better than breaking the client code.
>> > >
>> > > The Codahale library allows the user to specify the frequency per reporter
>> > > instance. Storm, on the other hand, allows a different reporting frequency
>> > > for each metric. How will that mapping work? I am OK with dropping the
>> > > support for a custom frequency per metric. Storm's internal metrics use the
>> > > same reporting frequency anyway.
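>> > >
>> > > For reference, with codahale the interval is attached to the reporter when it
>> > > is started, and the individual metrics carry no interval at all. Standard
>> > > dropwizard-metrics usage looks roughly like:
>> > >
>> > >   // needs com.codahale.metrics.{MetricRegistry, ConsoleReporter}
>> > >   // and java.util.concurrent.TimeUnit
>> > >   MetricRegistry registry = new MetricRegistry();
>> > >   ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
>> > >       .convertRatesTo(TimeUnit.SECONDS)
>> > >       .convertDurationsTo(TimeUnit.MILLISECONDS)
>> > >       .build();
>> > >   reporter.start(10, TimeUnit.SECONDS);  // every reporter picks its own period
>> > >   registry.counter("acked").inc();       // metrics themselves have no period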
>> > >
>> > > On Thu, May 19, 2016 at 9:04 PM, Bobby Evans <evans@yahoo-inc.com.invalid>
>> > > wrote:
>> > >
>> > > > I personally would like to see that change happen differently for the two
>> > > > branches.
>> > > > On 1.x we add a new API for both reporting and collecting metrics, in
>> > > > parallel to the old API.  We leave IMetric and IMetricsConsumer in place
>> > > > but deprecated.  As we move internal metrics over from the old interface
>> > > > to the new one, we either keep versions of the old ones in place or we
>> > > > provide a translation shim going from the new to the old.
>> > > >
>> > > > In 2.x either the old way is gone completely or it is off by default.  I
>> > > > prefer gone completely.
>> > > >
>> > > > If we go off of dropwizard/codahale metrics, or a layer around them as was
>> > > > discussed previously, it seems fairly straightforward to take some of our
>> > > > current metrics that all trigger at the same interval and set up a
>> > > > reporter that can translate them into the format that was reported
>> > > > previously.
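>> > > >
>> > > > Just to illustrate what I mean by a translation shim, a very rough sketch of
>> > > > my own (assuming codahale's ScheduledReporter and the existing
>> > > > IMetricsConsumer API; only gauges and counters are handled, and imports from
>> > > > com.codahale.metrics.*, org.apache.storm.metric.api.*, and java.util.* are
>> > > > omitted):
>> > > >
>> > > >   public class LegacyShimReporter extends ScheduledReporter {
>> > > >     private final IMetricsConsumer consumer;
>> > > >     private final IMetricsConsumer.TaskInfo taskInfo;
>> > > >
>> > > >     LegacyShimReporter(MetricRegistry registry, IMetricsConsumer consumer,
>> > > >                        IMetricsConsumer.TaskInfo taskInfo) {
>> > > >       super(registry, "legacy-shim", MetricFilter.ALL,
>> > > >             TimeUnit.SECONDS, TimeUnit.MILLISECONDS);
>> > > >       this.consumer = consumer;
>> > > >       this.taskInfo = taskInfo;
>> > > >     }
>> > > >
>> > > >     @Override
>> > > >     public void report(SortedMap<String, Gauge> gauges,
>> > > >                        SortedMap<String, Counter> counters,
>> > > >                        SortedMap<String, Histogram> histograms,
>> > > >                        SortedMap<String, Meter> meters,
>> > > >                        SortedMap<String, Timer> timers) {
>> > > >       // re-publish new-style metrics as old-style DataPoints so an existing
>> > > >       // IMetricsConsumer keeps working during the transition
>> > > >       List<IMetricsConsumer.DataPoint> points = new ArrayList<>();
>> > > >       gauges.forEach((name, g) ->
>> > > >           points.add(new IMetricsConsumer.DataPoint(name, g.getValue())));
>> > > >       counters.forEach((name, c) ->
>> > > >           points.add(new IMetricsConsumer.DataPoint(name, c.getCount())));
>> > > >       consumer.handleDataPoints(taskInfo, points);
>> > > >     }
>> > > >   }
>> > > >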
>> > > > In 1.x, to get a full picture of what is happening in your topology, you
>> > > > may need two separate reporters: one for the new metrics and one for the
>> > > > old, but it should only be for a short period of time. - Bobby
>> > > >
>> > > > On Thursday, May 19, 2016 1:00 AM, Cody Innowhere <e.neverme@gmail.com>
>> > > > wrote:
>> > > >
>> > > >
>> > > > If we want to refactor the metrics system, I think we may have to incur
>> > > > breaking changes. We can make it backward compatible, but this means we
>> > > > may have to build an adapter layer on top of the metrics, or a lot of
>> > > > "if...else...", which might be ugly; either way, it might be a pain to
>> > > > maintain the code.
>> > > > So I prefer making breaking changes if we want to build a new metrics
>> > > > system, and I'm OK with moving the JStorm metrics migration phase forward
>> > > > to 1.x, and I'm happy to share our design & experiences.
>> > > >
>> > > > On Thu, May 19, 2016 at 11:12 AM, Jungtaek Lim <ka...@gmail.com> wrote:
>> > > >
>> > > > > Hi devs,
>> > > > >
>> > > > > I'd like to hear our opinions on breaking changes to metrics for 1.x.
>> > > > >
>> > > > > Some background here:
>> > > > >
>> > > > > - As you may have seen, I'm trying to address some places to improve
>> > > > > metrics without breaking backward compatibility, but it's limited due to
>> > > > > the IMetric interface, which is exposed to the public.
>> > > > > - We're working on Storm 2.0.0, and evaluation / adoption of JStorm's
>> > > > > metrics feature is planned for phase 2, but none of us knows the
>> > > > > estimated release date, and I feel it's not in the near future.
>> > > > > - We've just released Storm 1.0.x, so I expect the lifetime of Storm 1.0
>> > > > > (even 0.9) to be months or even years.
>> > > > >
>> > > > > If someone wants to know exactly which issues with the current metrics
>> > > > > feature matter, please let me know so that I can summarize them.
>> > > > >
>> > > > > I have other ideas in mind to relieve some of the problems with the
>> > > > > current metrics, so I'm also OK with postponing the metrics renewal to
>> > > > > 2.0.0 and applying those workaround ideas. But if we're willing to
>> > > > > address metrics in 1.x, IMO we can consider breaking backward
>> > > > > compatibility from 1.x this once.
>> > > > >
>> > > > > What do you think?
>> > > > >
>> > > > > Thanks,
>> > > > > Jungtaek Lim (HeartSaVioR)
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > Abhishek Agarwal
>> > >
>> >
>>
>>
>>
>> --
>> Regards,
>> Abhishek Agarwal
>>