Posted to dev@storm.apache.org by Jungtaek Lim <ka...@gmail.com> on 2016/03/18 15:19:02 UTC

Question on Metrics Server to Alibaba team

Hi,

I have some work to do on metrics, so I've been looking through the pull
requests that address metrics.
In #753 <https://github.com/apache/storm/pull/753> I found Cody saying that
"we" (probably meaning the Alibaba team) are currently working on a Metrics
Server. (I also found a comment saying there was some discussion a while ago
about integrating the Hadoop Timeline Server. It seems no one came up with a
result, and I prefer to avoid a big dependency, so I'm in favor of the
Metrics Server for now.)

I think that would greatly improve the metrics feature of Storm, so I'd
like to see how the work is going, provided there's no problem with you
working in the open. I just want to prevent duplication of work, and would
like to help if needed and possible.

Thanks,
Jungtaek Lim (HeartSaVioR)

Re: Question on Metrics Server to Alibaba team

Posted by Abhishek Agarwal <ab...@gmail.com>.

At Inmobi, we are using Graphite to store metrics, which works great for us.
Metrics can be viewed as time series on Grafana dashboards. Does the
metrics server provide any functionality beyond that?

Excuse typos
On Mar 18, 2016 9:22 PM, "Bobby Evans" <ev...@yahoo-inc.com.invalid> wrote:

> Yes we originally wanted to try and use the Hadoop Timeline Server for
> storm metrics feedback to nimbus + UI + history like server.  But it was
> not stable at the time, so we stopped.  For the sake of playing nicely with
> the rest of the big data ecosystem I would like to see us support it as an
> option for metrics collection/query, but not until the timeline server v2 is
> ready and released.  For me the important thing is that we have a decent
> time series DB that comes with storm by default and is pluggable so we can
> replace it with something else that has similar capabilities in the future.
>  - Bobby
>
>     On Friday, March 18, 2016 10:39 AM, Cody Innowhere <
> e.neverme@gmail.com> wrote:
>
>
>  It's actually in Phase 2 of porting JStorm, but I'm absolutely ok to
> discuss this in advance.
>
> On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <e....@gmail.com>
> wrote:
>
> > Yes it's already in production.
> > The implementation basically follows the design document in
> > https://issues.apache.org/jira/browse/STORM-1329, you can take a look
> > first and feel free to ask questions.

Re: Question on Metrics Server to Alibaba team

Posted by John Fang <xi...@alibaba-inc.com>.


-----Original Message-----
From: John Fang [mailto:xiaojian.fxj@alibaba-inc.com]
Sent: March 23, 2016 14:39
To: dev@storm.apache.org; 'Bobby Evans'
Subject: Re: Question on Metrics Server to Alibaba team

@Bobby Evans: The JStorm code has been through a lot of testing over the past few years, especially for HA and scalability. We have done a lot of optimization on metrics; in my tests the performance is better than Flink's. In my personal opinion, the monitoring in JStorm provides a great deal of information, and it can tell us where the bottleneck is when we run a topology. The performance bottleneck may be in serialization, deserialization, Netty, the executor, and so on. Of course, there are other good monitoring systems in the world, so I hope we can choose the better one before phase 2. I will start studying Atlas. If it is better, I would be pleased to redesign the monitoring on top of Atlas.
  For my part, I think we had better make the monitoring a plugin.
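
To make the "monitoring as a plugin" idea concrete, here is a minimal sketch of what a pluggable backend boundary could look like. It is an illustration only: `MetricsBackend`, `InMemoryBackend`, `BACKENDS` and `load_backend` are hypothetical names made up for this example and are not Storm or JStorm APIs.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# One sample: (metric name, epoch seconds, value).
Point = Tuple[str, int, float]


class MetricsBackend(ABC):
    """Hypothetical plugin interface; not an actual Storm or JStorm API."""

    @abstractmethod
    def write(self, topology_id: str, points: List[Point]) -> None:
        """Persist a batch of points for one topology."""

    @abstractmethod
    def query(self, topology_id: str, name: str, start: int, end: int) -> List[Point]:
        """Return points for `name` with start <= timestamp < end, oldest first."""


class InMemoryBackend(MetricsBackend):
    """Default backend that needs no external service."""

    def __init__(self) -> None:
        self._data: Dict[str, List[Point]] = {}

    def write(self, topology_id: str, points: List[Point]) -> None:
        self._data.setdefault(topology_id, []).extend(points)

    def query(self, topology_id: str, name: str, start: int, end: int) -> List[Point]:
        rows = self._data.get(topology_id, [])
        hits = [p for p in rows if p[0] == name and start <= p[1] < end]
        return sorted(hits, key=lambda p: p[1])


# Registry mapping a configuration value to a backend class (illustrative names).
BACKENDS = {"memory": InMemoryBackend}


def load_backend(kind: str) -> MetricsBackend:
    """Instantiate the backend selected in configuration."""
    try:
        return BACKENDS[kind]()
    except KeyError:
        raise ValueError(f"unknown metrics backend: {kind}") from None
```

With a boundary like this, swapping the default store for Atlas, HBase or anything else would only mean registering another class under a new key; the code that produces metrics would not change.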


Regards
       John Fang


-----Original Message-----
From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
Sent: March 22, 2016 22:36
To: dev@storm.apache.org
Subject: Re: Question on Metrics Server to Alibaba team

My personal opinion is that we should not reinvent the wheel (aka distributed fault tolerant metrics) ourselves.  The local file blobstore with nimbus HA was a big enough pain to write and it is relatively simple in comparison.
If the JStorm code is simple and offers everything we need in terms of HA and scalability then I would be OK with it, but if it doesn't I would lean towards a different compatible open source solution. 

https://github.com/Netflix/atlas
looks very promising as a default option.  It is actively maintained by a group that I think has some of the best monitoring in the world.  And it is both Java and Apache compatible.  It has no histogram support that I could find, but I don't see that as being super critical.  The biggest drawback is that there is little documentation on how to use it, which makes it hard to really evaluate it for our needs. - Bobby

    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <ka...@gmail.com> wrote:
 

 Harsha,

That's why I think the new metrics feature of JStorm looks promising.

According to the design doc at https://issues.apache.org/jira/browse/STORM-1329,
there's no distinction between topology stats (which Apache Storm includes in the worker heartbeat) and built-in metrics (which should be handled by a separate consumer, as you stated).
All metrics are passed to Nimbus and Nimbus caches them, which implies we can treat all metrics the same way, and we can also expose built-in metrics (as well as custom metrics) to users via a REST API.
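
As a rough illustration of "treat all metrics the same way", the sketch below keeps built-in stats and custom metrics in one cache and renders them as the JSON body a REST handler could return. `MetricsCache`, its methods and the JSON layout are assumptions made for this example, not the JStorm design or an existing Storm API.

```python
import json
from collections import defaultdict
from typing import Dict


class MetricsCache:
    """Hypothetical in-memory cache, standing in for what Nimbus would hold."""

    def __init__(self) -> None:
        # topology id -> "component.metric" -> latest value
        self._latest: Dict[str, Dict[str, float]] = defaultdict(dict)

    def report(self, topology_id: str, component: str, metric: str, value: float) -> None:
        """Record a value; built-in stats and custom metrics share this path."""
        self._latest[topology_id][f"{component}.{metric}"] = value

    def to_json(self, topology_id: str) -> str:
        """Build the body a REST handler could return for one topology."""
        metrics = self._latest.get(topology_id, {})
        return json.dumps({"topology": topology_id, "metrics": metrics}, sort_keys=True)
```

Because every value enters through `report`, the UI, a REST endpoint and any uploader would all read from the same structure instead of from separate heartbeat and consumer paths.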

I had thought about a standalone metrics server process that handles all metrics work (maybe TopologyMaster + Nimbus in the design doc), but if the current implementation of the metrics feature in JStorm can take care of what I'm assuming, I guess that's good enough.

Since I don't know much about TopologyMaster, I just wonder whether there are any SPOFs (including soft ones) and how metrics behave when a SPOF component goes down.
Since Cody gave us a starting point to dig into, we can evaluate that feature before phase 2.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:

> One of the goals of this work, which can probably be addressed in a
> separate JIRA, is how the topology metrics reporter works. Today it's a
> bolt that's part of the topology graph, which means it's another node in
> the topology DAG that needs to be tuned for better performance. Some of
> our users took performance hits by deploying a topology metrics reporter
> that sends metrics to Ganglia. Ideally this collection should be
> asynchronous and not be a node in the topology DAG.
>
> Shipping a default metrics server along with a pluggable option for
> users who want to use Graphite or other timeline servers should be the
> goal.
>
> --Harsha
>
>
> On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > @Cody - The design looks good. Does the design allow aggregating
> > metrics at the task/executor level? Basically, the number of distinct
> > metrics is proportional to the number of distinct tasks. Did you
> > ever run into such a use case?
> >
> >
> > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere <e....@gmail.com> wrote:
> >
> > > Also, you can read the code from our latest release JStorm 2.1.1.
> > >
> > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere <e....@gmail.com> wrote:
> > >
> > > > @Jungtaek,
> > > > We ran some tests on Codahale metrics: compared to meters and
> > > > histograms, counters are quite fast. So we mainly focused on
> > > > optimizing meters and histograms (they are indeed very slow),
> > > > including double sampling, changing the clock from ns
> > > > (System.nanoTime) to ms, etc.
> > > > You can take a look at the
> > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" class of
> > > > our sequence-split-merge example code as the client-side entry
> > > > point to metrics.
> > > > After that, you may dig into the TopologyMaster class, which is still
> > > > part of a topology, then into TopologyMetricsRunnable, which is part
> > > > of the nimbus server, and finally into the MetricUploader plugin,
> > > > which is where the metrics interface with our "metrics server". There
> > > > are still some nits in the code, but I think that should be no big
> > > > problem.
> > > >
> > > > I'd also like to point out that our "metrics server" is not strictly
> > > > a real metrics server: since most of the work is done by the nimbus
> > > > server and the topology master, it's more appropriate to call it
> > > > metrics storage. The main reason for this is that we don't want to
> > > > build a heavy-weight metrics server out of JStorm, and this makes it
> > > > very easy for us to maintain (we have teams in Alibaba that
> > > > specifically maintain HBase/OTS, since they're so commonly used in
> > > > production).
> > > >
> > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim <ka...@gmail.com> wrote:
> > > >
> > > >> Thanks Cody and Bobby for the explanation.
> > > >>
> > > >> Cody,
> > > >> I took a look at the design doc and it looks promising, especially
> > > >> that it doesn't do sampling when the metric type is 'counter'. As far
> > > >> as I've heard (I didn't try it), changing the sample rate to 1.0 in
> > > >> Apache Storm causes a huge performance hit.
> > > >> Could you point me to the entry point of the metrics feature in
> > > >> JStorm so I can dig into it?
> > > >>
> > > >> And just out of curiosity, did you consider extracting the metrics
> > > >> feature (which is handled by TopologyMasters and Nimbuses) into a
> > > >> separate component?
> > > >> I understood your mention of a 'metrics server' to mean a separate
> > > >> component, but after reading the design doc, the feature seems to be
> > > >> implemented in Nimbus.
> > > >> Thanks,
> > > >> Jungtaek Lim (HeartSaVioR)
> > > >>
> > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere <e....@gmail.com> wrote:
> > > > >>
> > > > >> > JStorm provides a MetricUploader interface, which is similar to
> > > > >> > IMetricsConsumer in Storm, and the underlying implementation is
> > > > >> > pluggable: you can use HBase, any other KV store that supports
> > > > >> > timeline queries, or even a database (perhaps for a small
> > > > >> > cluster). We provide model classes in jstorm-core; as to what
> > > > >> > kinds of metrics data need to be stored, it's totally up to the
> > > > >> > detailed implementation. Our internal implementation uses OTS,
> > > > >> > which is an Aliyun product (https://www.aliyun.com/product/ots/),
> > > > >> > but it's easy to adapt to other implementations.
> > > >> >
> > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > >> <evans@yahoo-inc.com.invalid
> > > >> > >
> > > >> > wrote:
> > > >> >
> > > >> > > Yes we originally wanted to try and use the Hadoop Timeline
> Server
> > > for
> > > >> > > storm metrics feedback to nimbus + UI + history like server.
> But it
> > > >> was
> > > >> > > not stable at the time, so we stopped.  For the sake of 
> > > >> > > playing
> > > nicely
> > > >> > with
> > > >> > > the rest of the big data ecosystem I would like to see us
> support it
> > > >> as
> > > >> > an
> > > >> > > option for metrics collection/query, but until the timeline
> server
> > > v2
> > > >> is
> > > >> > > ready and released.  For me the important thing is that we 
> > > >> > > have
> a
> > > >> decent
> > > >> > > time series DB that comes with storm by default and is
> pluggable so
> > > we
> > > >> > can
> > > >> > > replace it with something else that has similar 
> > > >> > > capabilities in
> the
> > > >> > future.
> > > >> > >  - Bobby
> > > >> > >
> > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere < 
> > > >> > >e.neverme@gmail.com> wrote:
> > > >> > >
> > > >> > >
> > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm 
> > > >> > >absolutely
> ok
> > > to
> > > >> > > discuss this in advance.
> > > >> > >
> > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
> > > e.neverme@gmail.com
> > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Yes it's already in production.
> > > >> > > > The implementation basically follows the design document 
> > > >> > > > in https://issues.apache.org/jira/browse/STORM-1329, you 
> > > >> > > > can
> take a
> > > >> look
> > > >> > > > first and feel free to ask questions.
> > > >> > > >
> > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <
> kabhwan@gmail.com
> > > >
> > > >> > > wrote:
> > > >> > > >
> > > >> > > >> Hi,
> > > >> > > >>
> > > >> > > >> I got something to do with metrics so I'm seeking the 
> > > >> > > >> pull
> > > requests
> > > >> > > which
> > > >> > > >> addresses metrics.
> > > >> > > >> And at #753 <https://github.com/apache/storm/pull/753> I
> found
> > > >> Cody
> > > >> > > said
> > > >> > > >> we
> > > >> > > >> (maybe it means Alibaba team) are currently working on
> Metrics
> > > >> Server.
> > > >> > > >> (I also found comment which said there was some talk 
> > > >> > > >> while
> ago
> > > >> around
> > > >> > > >> integrating Hadoop timeline server. Seems like no one 
> > > >> > > >> came up
> > > with
> > > >> the
> > > >> > > >> result, and I prefer to avoid big dependency so I'm in 
> > > >> > > >> favor
> of
> > > >> > Metrics
> > > >> > > >> Server for now.)
> > > >> > > >>
> > > >> > > >> I think that would improve metrics feature of Storm much
> better,
> > > so
> > > >> > I'd
> > > >> > > >> like to see how the work is going. Sure it's only when
> there's no
> > > >> > issue
> > > >> > > >> for
> > > >> > > >> you to work transparently. I just would like to prevent
> > > >> duplication of
> > > >> > > >> work, and would like to help if needed and possible.
> > > >> > > >>
> > > >> > > >> Thanks,
> > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Abhishek Agarwal
>

Re: Re: Question on Metrics Server to Alibaba team

Posted by Cody Innowhere <e....@gmail.com>.

@Harsha,
We already use rocksdb to store time-series data rather than the
latest window values.

@Bobby,
I will think about HA and post a detailed document for review (together
with MetricUploader interface) later.
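
The difference between keeping raw time-series points and keeping only windowed values can be sketched as follows. The tiers are the ones floated as a rough idea in the quoted message of March 25 below (1 hour of 1-min windows, 10 hours of 10-min windows, 5 days of 2-hour windows, 30 days of 1-day windows). Averaging per bucket and the `rollup` / `build_tiers` names are assumptions for illustration and do not describe how JStorm actually aggregates or stores data in rocksdb.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (epoch seconds, value)

# (window seconds, retention seconds), taken from the rough idea quoted below.
TIERS = [
    (60, 3600),            # 1-min windows kept for 1 hour
    (600, 10 * 3600),      # 10-min windows kept for 10 hours
    (7200, 5 * 86400),     # 2-hour windows kept for 5 days
    (86400, 30 * 86400),   # 1-day windows kept for 30 days
]


def rollup(points: List[Point], window: int) -> List[Point]:
    """Average raw points into buckets aligned to `window` seconds."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


def build_tiers(points: List[Point], now: int) -> Dict[int, List[Point]]:
    """Return, per window size, the rolled-up series still inside its retention."""
    series = {}
    for window, retention in TIERS:
        # A bucket is kept while its start is newer than now - retention.
        series[window] = [(ts, v) for ts, v in rollup(points, window) if ts > now - retention]
    return series
```

Keeping the raw points and deriving coarser tiers from them bounds local disk usage while still allowing queries over longer ranges at lower resolution.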

On Wed, Mar 30, 2016 at 9:35 AM, Harsha <st...@harsha.io> wrote:

> Another thing to consider is storing time-series data rather than the
> current approach of storing 1 min, 10 min, and 3 hr windows, and
> definitely not depending on external storage such as HDFS.
>
> On Fri, Mar 25, 2016, at 06:43 AM, Bobby Evans wrote:
> > My concern is really around how much time/effort it is to get to a final
> > solution, and to ultimately maintain/support that solution.  If I was
> > doing this from scratch I would probably pull something off of the shelf
> > that is tested and has an entire community supporting it instead of
> > writing something ourselves from scratch.  But in this case we have a
> > solution from JStorm, that we know works.  Because this is the backend
> > that we are talking about we can switch things out later on if we need
> > to.  Like I said before I am fine with using the JStorm code initially.
> > I mostly want to be sure of a few things.
> > 1. The metrics interface we expose to end users is well thought out and
> > can be extended in the future.
> > 2. The interfaces that connect this front end to the back end are thought
> > out and we could replace the back end if needed.
> > 3. The solution offers some level of high availability.  If Nimbus, a
> > worker, etc. crash it is OK to lose some data, but we don't want to
> >  - Bobby
> >
> >     On Friday, March 25, 2016 6:26 AM, Cody Innowhere
> >     <e....@gmail.com> wrote:
> >
> >
> >  Bobby,
> > I understand your concern. Still, I think our metrics design in JStorm
> > can work without any external service: as I mentioned above, we can store
> > metrics in rocksdb on the nimbus server. A rough idea would be to store
> > the latest 1 hour of 1-min window data, 10 hours of 10-min window data,
> > 5 days of 2-hour window data, 30 days of 1-day window data, etc. And if
> > there's a need to sync metrics data between nimbus servers, we can add a
> > sync thread to handle nimbus fail-over; since it's just metrics data that
> > doesn't matter too much, we can use a plain, simple sync model.
> >
> > The external service is another option for end users: if users feel it's
> > important (or maybe their business built on top of storm is very
> > important), they can use this external service to build their own
> > monitoring system, which can be more useful than the original solution
> > shipped with storm.
> >
> > On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:
> >
> > > The problem is that we want something for storm that can work out of
> > > the box, ideally without some other complicated external service (except
> > > zookeeper, which we already have and which is not actually that complex
> > > to set up and run).
> > > If we feel that we must have some external state store that is required
> > > for storm to run, then we need to make the decision carefully and
> > > deliberately.
> > >  - Bobby
> > >
> > >    On Wednesday, March 23, 2016 8:37 AM, John Fang <
> > > xiaojian.fxj@alibaba-inc.com> wrote:
> > >
> > >
> > >  Sorry, I misunderstood. We will add H/A for TopologyMaster, and the
> > > metrics meta will be stored in HDFS, so it won't rely on nimbus. That
> > > can improve the stability of the metrics system.
> > >
> > > -----Original Message-----
> > > From: Cody Innowhere [mailto:e.neverme@gmail.com]
> > > Sent: March 23, 2016 19:59
> > > To: dev@storm.apache.org
> > > Subject: Re: Question on Metrics Server to Alibaba team
> > >
> > > If we don't rely on any external system, our metrics system is still
> > > available but will store metrics meta/data in rocksdb on the nimbus
> > > servers. There will be limits though: for example, we cannot store
> > > metrics data for the whole topology lifecycle, because rocksdb is only a
> > > KV store that may not support efficient scan operations, and too much
> > > data on local disk may bring extra IO overhead. So we may have to store
> > > only the latest 1 hour of m1 data, 6 hours of m10 data, and so on
> > > (currently not implemented in JStorm, but quite easy to do).
> > >
> > > TopologyMaster is merely a channel for registering/computing/uploading
> > > metrics to nimbus, so if a TM goes down, the topology metrics will be
> > > unavailable for a while before it gets brought up somewhere else (for a
> > > normal failover case, this should be very fast), while supervisor/nimbus
> > > metrics are unaffected as they're sent to nimbus via the thrift
> > > interface. As soon as the TM is back, the topology metrics will be
> > > available again.
> > >
> > > Currently JStorm syncs metrics meta, but metrics data is not synced
> > > between multiple nimbus servers. So under a nimbus failure, we may
> > > lose some metrics data.
> > >
> > >
> > > On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:
> > >
> > > > John,
> > > >
> > > > My concern is H/A of metrics in Storm by default. (I'm not 100% sure
> > > > Bobby was pointing out the same thing.)
> > > >
> > > > Apache Storm is used by a wide variety of users, so we can't
> > > > assume that users have knowledge of external systems (including the
> > > > Hadoop ecosystem, in my personal opinion) and can operate them smoothly.
> > > > That reminds me how important it is to keep the default setup in mind.
> > > >
> > > > Therefore, I'm curious whether the new metrics feature of JStorm can
> > > > work smoothly without an external system (HBase / OTS). I'd love to see
> > > > it support H/A without other systems, or else users will have to
> > > > tolerate loss of metrics in some scenarios.
> > > >
> > > > I guess these may be valid questions on H/A (if my understanding
> > > > of the design doc is right): How do metrics work when TopologyMaster is
> > > > down? And how do metrics work when a Nimbus failover occurs?
> > > >
> > > > Personally I don't mind losing metrics for short durations (I just want
> > > > to check the availability of H/A), but a failure shouldn't mess up the
> > > > whole metrics system.
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > 2016년 3월 23일 (수) 오후 3:39, John Fang <xi...@alibaba-inc.com>님이
> 작성:
> > > >
> > > > > @ Bobby Evans Jstorm code has experienced a lot of tests over the
> > > > > past
> > > > few
> > > > > years, espatially HA and scalability. We have done a lot of
> > > > > optimization about Metrics. The performance is better than Flink in
> > > > > my tests. In my personal opinion, the metric in jstorm offers very
> > > > > much informations. And the metric can tell us where is the
> bottleneck
> > > when we run a topology.
> > > > The
> > > > > performance bottleneck maybe serialize/deserialize/netty/executor
> > > > > and so on. Of course, I also has some other good monitoring in the
> > > > > world. So I hope we can choice the better monitoring before phrase
> > > > > 2. And I will
> > > > start
> > > > > study the Alas. If it is better, I am pleasured to redesign the
> > > > > metric by Alas.
> > > > > -----邮件原件-----
> > > > > 发件人: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
> > > > > 发送时间: 2016年3月22日 22:36
> > > > > 收件人: dev@storm.apache.org
> > > > > 主题: Re: Question on Metrics Server to Alibaba team
> > > > >
> > > > > My personal opinion is that we should not reinvent the wheel (aka
> > > > > distributed fault tolerant metrics) ourselves.  The local file
> > > > > blobstore with nimbus HA was a big enough pain to write and it is
> > > > > relatively simple in comparison.
> > > > > If the JStorm code is simple and offers everything we need in terms
> > > > > of HA and scalability then I would be OK with it, but if it doesn't
> > > > > I would
> > > > lean
> > > > > towards a different compatible open source solution.
> > > > >
> > > > > https://github.com/Netflix/atlas
> > > > > looks very promising as a default option.  It is actively
> maintained
> > > > > by a group that I think has some of the best monitoring in the
> > > > > world.  And it
> > > > is
> > > > > both java and apache compatible.  It has no histogram support that
> I
> > > > could
> > > > > find, but that I don't see as being super critical.  The biggest
> > > > > drawback is there is little documentation on how to use it, to
> > > > > really be able to evaluate it for our needs. - Bobby
> > > > >
> > > > >    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim
> > > > > <ka...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >
> > > > >  Harsha,
> > > > >
> > > > > That's why I think new metric feature of JStorm looks promising.
> > > > >
> > > > > According to design doc on
> > > > > https://issues.apache.org/jira/browse/STORM-1329,
> > > > > there's no distinction between topology stat (which Apache Storm
> > > > > includes to worker heartbeat) and built-in metrics (which should be
> > > > > handled with separate consumer, as you stated).
> > > > > All metrics are passed to Nimbus and Nimbus cached metrics, which
> > > > > implies we can treat all metrics as same, and we can also provide
> > > > > built-in
> > > > metrics
> > > > > (including custom metrics) to users via REST API, too.
> > > > >
> > > > > I thought about standalone metrics server process which handles
> > > > > whole metric works (maybe TopologyMaster + Nimbus on design doc),
> > > > > but if
> > > > current
> > > > > implementation of metric feature on JStorm can take care of what
> I'm
> > > > > assuming, I guess it's great enough.
> > > > >
> > > > > Since I don't know about TopologyMaster, I just wonder that
> there're
> > > > > any SPOFs (including soft) and how metrics work when if component
> of
> > > > > SPOF
> > > > goes
> > > > > down.
> > > > > Since Cody gives digging point to take a look at, we can evaluate
> > > > > that feature before phase 2.
> > > > >
> > > > > Thanks,
> > > > > Jungtaek Lim (HeartSaVioR)
> > > > >
> > > > > 2016년 3월 22일 (화) 오전 1:36, Harsha <st...@harsha.io>님이 작성:
> > > > >
> > > > > > One of the goals of this work and probably can be addressed in
> > > > > > separate jira is how the topology metrics reporter works. Today
> > > > > > its a bolt thats part of a topology graph that means its another
> > > > > > node in the Topology DAG that needs be tuned for better
> > > > > > performance. Some of our users took performance hits by deploying
> > > > > > topology metrics reporter that can send metrics to Ganglia.
> > > > > > Ideally this collection should be asynchronous and not be a node
> in
> > > topology DAG.
> > > > > >
> > > > > > Shipping default metrics server and along with pluggable option
> > > > > > for users who wants to graphite or other timeline servers should
> > > > > > be the goal.
> > > > > >
> > > > > > --Harsha
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > > > > @Cody - The design looks good. Does the design allow to
> > > > > > > aggregate metrics at the task/executor level? Basically, number
> > > > > > > of distinct metrics is proportional to the number of distinct
> > > > > > > tasks, did you ever run into such a use case?
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
> > > > > > > <e....@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Also, you can read the code from our latest release JStorm
> 2.1.1.
> > > > > > > >
> > > > > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
> > > > > > > > <e....@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > @Jungtaek,
> > > > > > > > > We did some tests on codahale metrics, compared to
> > > > > > > > > meters/histograms, counters are quite fast. So we mainly
> > > > > > > > > focused on the optimization of
> > > > > > > > meters
> > > > > > > > > and histograms (they are indeed very slow) including double
> > > > > > > > > sampling, changing the clock from ns (System.nanoTime) to
> > > > > > > > > ms,
> > > > etc.
> > > > > > > > > You can take a look at the
> > > > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > > > class of our sequence-split-merge example code, as the
> > > > > > > > > client code entry to
> > > > > > metrics.
> > > > > > > > > After that, you may dig to TopologyMaster class, which is
> > > > > > > > > still part
> > > > > > of a
> > > > > > > > > topology, and then to TopologyMetricsRunnable, which is a
> > > > > > > > > part of
> > > > > > nimbus
> > > > > > > > > server, finally to MetricUploader plugin, this is where the
> > > > > > > > > metrics interfere with our "metrics server". Still,
> there're
> > > > > > > > > some nits in the
> > > > > > > > code,
> > > > > > > > > but I think that should be no big problem.
> > > > > > > > >
> > > > > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > > > > strictly
> > > > > > a
> > > > > > > > > real metrics server, since most of the duty lies on nimbus
> > > > > > > > > server and topology master, it's more appropriate to call
> it
> > > > > metrics storage.
> > > > > > The
> > > > > > > > main
> > > > > > > > > reason for this is that we don't want to make a
> heavy-weight
> > > > > > > > > metrics
> > > > > > > > server
> > > > > > > > > out of JStorm, and this makes us very easy to maintain (we
> > > > > > > > > have teams
> > > > > > > > that
> > > > > > > > > specifically maintain HBase/OTS in Alibaba since they're so
> > > > > > > > > commonly
> > > > > > used
> > > > > > > > > in production).
> > > > > > > > >
> > > > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
> > > > > > > > > <ka...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > > > >>
> > > > > > > > >> Cody,
> > > > > > > > >> I took a look at design doc and looks promising,
> especially
> > > > > > > > >> it
> > > > > > doesn't
> > > > > > > > do
> > > > > > > > >> sampling when metric type is 'counter'. As far as I heard
> > > > > > > > >> (I didn't
> > > > > > try
> > > > > > > > >> it)
> > > > > > > > >> it becomes huge performance hit in Apache Storm when we
> > > > > > > > >> change
> > > > > > sample
> > > > > > > > rate
> > > > > > > > >> to 1.0.
> > > > > > > > >> Could you guide the entry point of metric feature in
> JStorm
> > > > > > > > >> to dig
> > > > > > into?
> > > > > > > > >>
> > > > > > > > >> And just a curiosity, did you consider extracting metric
> > > > > > > > >> feature
> > > > > > (which
> > > > > > > > is
> > > > > > > > >> done with TopologyMasters and Nimbuses) into separate
> > > component?
> > > > > > > > >> I understood your mention to 'metrics server' as separate
> > > > > > component, but
> > > > > > > > >> after seeing design doc, feature seems to be implemented
> on
> > > > > Nimbus.
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > > >>
> > > > > > > > >> 2016년 3월 19일 (토) 오전 1:25, Cody Innowhere
> > > > > > > > >> <e....@gmail.com>님이
> > > > > > 작성:
> > > > > > > > >>
> > > > > > > > >> > JStorm has provided a MetricUploader interface, which is
> > > > > > > > >> > similar
> > > > > > to
> > > > > > > > >> > IMetricsConsumer in storm, and the underlying
> > > > > > > > >> > implementation is
> > > > > > > > >> pluggable,
> > > > > > > > >> > you can use HBase, or any other KV store that supports
> > > > > > > > >> > timeline
> > > > > > > > queries
> > > > > > > > >> or
> > > > > > > > >> > even a database(maybe for it's a small cluster). We
> > > > > > > > >> > provide model
> > > > > > > > >> classes
> > > > > > > > >> > in jstorm-core, as to what kinds of metrics data need to
> > > > > > > > >> > be
> > > > > > stored,
> > > > > > > > it's
> > > > > > > > >> > totally up to the detailed implementation. Our internal
> > > > > > implementation
> > > > > > > > >> uses
> > > > > > > > >> > OTS, which is a product of aliyun (
> > > > > > > > https://www.aliyun.com/product/ots/
> > > > > > > > >> ),
> > > > > > > > >> > but it's easy to adapt to other implementations.
> > > > > > > > >> >
> > > > > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans <evans@yahoo-inc.com.invalid> wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > > Yes we originally wanted to try and use the Hadoop Timeline Server for
> > > > > > > > > >> > > storm metrics feedback to nimbus + UI + history like server.  But it was
> > > > > > > > > >> > > not stable at the time, so we stopped.  For the sake of playing nicely with
> > > > > > > > > >> > > the rest of the big data ecosystem I would like to see us support it as an
> > > > > > > > > >> > > option for metrics collection/query, but until the timeline server v2 is
> > > > > > > > > >> > > ready and released.  For me the important thing is that we have a decent
> > > > > > > > > >> > > time series DB that comes with storm by default and is pluggable so we can
> > > > > > > > > >> > > replace it with something else that has similar capabilities in the future.
> > > > > > > > > >> > >  - Bobby
> > > > > > > > >> > >
> > > > > > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere <e.neverme@gmail.com> wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm absolutely ok to
> > > > > > > > > >> > > discuss this in advance.
> > > > > > > > > >> > >
> > > > > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <e.neverme@gmail.com> wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > > > Yes it's already in production.
> > > > > > > > > >> > > > The implementation basically follows the design document in
> > > > > > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329, you can take a look
> > > > > > > > > >> > > > first and feel free to ask questions.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <kabhwan@gmail.com> wrote:
> > > > > > > > >> > > >
> > > > > > > > > >> > > >> Hi,
> > > > > > > > > >> > > >>
> > > > > > > > > >> > > >> I got something to do with metrics so I'm seeking the pull requests which
> > > > > > > > > >> > > >> addresses metrics.
> > > > > > > > > >> > > >> And at #753 <https://github.com/apache/storm/pull/753> I found Cody said we
> > > > > > > > > >> > > >> (maybe it means Alibaba team) are currently working on Metrics Server.
> > > > > > > > > >> > > >> (I also found comment which said there was some talk while ago around
> > > > > > > > > >> > > >> integrating Hadoop timeline server. Seems like no one came up with the
> > > > > > > > > >> > > >> result, and I prefer to avoid big dependency so I'm in favor of Metrics
> > > > > > > > > >> > > >> Server for now.)
> > > > > > > > > >> > > >>
> > > > > > > > > >> > > >> I think that would improve metrics feature of Storm much better, so I'd
> > > > > > > > > >> > > >> like to see how the work is going. Sure it's only when there's no issue for
> > > > > > > > > >> > > >> you to work transparently. I just would like to prevent duplication of
> > > > > > > > > >> > > >> work, and would like to help if needed and possible.
> > > > > > > > > >> > > >>
> > > > > > > > > >> > > >> Thanks,
> > > > > > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > > >> > > >>
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Abhishek Agarwal

Re: Re: Question on Metrics Server to Alibaba team

Posted by John Fang <xi...@alibaba-inc.com>.

@Harsha If we do not want to depend on external storage such as HDFS, we can depend on RocksDB instead.
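
To make the embedded-store idea a little more concrete, here is a minimal sketch of a key layout that keeps one metric's samples adjacent and in time order in a byte-ordered key-value store. Everything in it is an assumption for illustration: the class name MetricKeyLayout, the key format (metric name, a 0x00 separator, then an 8-byte big-endian timestamp), and the TreeMap that stands in for the store. It does not use the RocksDB API or any JStorm class, and it assumes non-negative timestamps and metric names without a 0x00 byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class MetricKeyLayout {
    // Stand-in for an embedded, byte-ordered KV store (hypothetical; not the RocksDB API).
    private final TreeMap<byte[], Double> store =
            new TreeMap<byte[], Double>((a, b) -> Arrays.compareUnsigned(a, b));

    // Key = metric name bytes + 0x00 separator + 8-byte big-endian timestamp.
    // Big-endian keeps byte order equal to numeric order for non-negative timestamps.
    static byte[] key(String metric, long timestampMs) {
        byte[] name = metric.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(name.length + 1 + Long.BYTES)
                .put(name)
                .put((byte) 0)
                .putLong(timestampMs)
                .array();
    }

    void put(String metric, long timestampMs, double value) {
        store.put(key(metric, timestampMs), value);
    }

    // Range scan over [fromMs, toMs) for a single metric, in time order.
    List<Double> scan(String metric, long fromMs, long toMs) {
        return new ArrayList<>(
                store.subMap(key(metric, fromMs), true, key(metric, toMs), false).values());
    }

    public static void main(String[] args) {
        MetricKeyLayout layout = new MetricKeyLayout();
        layout.put("emitted", 1000L, 1.0);
        layout.put("emitted", 2000L, 2.0);
        layout.put("acked", 1500L, 7.0);
        System.out.println(layout.scan("emitted", 0L, 5000L));
    }
}
```

With a layout like this, a time-range query becomes one seek plus a forward scan over neighbouring keys, which is the access pattern the scan-cost concern raised later in the quoted thread is about.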

-----Original Message-----
From: Harsha [mailto:storm@harsha.io]
Sent: March 30, 2016 9:36
To: dev@storm.apache.org
Subject: Re: Re: Question on Metrics Server to Alibaba team

Another thing to consider is storing the metrics as time series data, rather than the current approach of storing 1min, 10min and 3hr windows, and definitely not depending on external storage such as HDFS.

On Fri, Mar 25, 2016, at 06:43 AM, Bobby Evans wrote:
> My concern is really around how much time/effort it is to get to a 
> final solution, and to ultimately maintain/support that solution.  If 
> I was doing this from scratch I would probably pull something off of 
> the shelf that is tested and has an entire community supporting it 
> instead of writing something ourselves from scratch.  But in this case 
> we have a solution from JStorm, that we know works.  Because this is 
> the backend that we are talking about we can switch things out later 
> on if we need to.  Like I said before I am fine with using the JStorm code initially.
> I mostly want to be sure of a few things.
> 1. The metrics interface we expose to end users is well thought out
> and can be extended in the future.
> 2. The interfaces that connect this front end to the back end are
> thought out and we could replace the back end if needed.
> 3. The solution offers some level of high availability.  If Nimbus,
> a worker, etc. crash it is OK to lose some data, but we don't want to
>  - Bobby
> 
>     On Friday, March 25, 2016 6:26 AM, Cody Innowhere
>     <e....@gmail.com> wrote:
>  
> 
>  Bobby,
> I understand your concern. Still, I think our metrics design in JStorm 
> can work without any external service, as I mentioned above, we can 
> store metrics in rocksdb on nimbus server. A rough thought will be: we 
> store the latest 1 hour of 1-min window data, 10 hours of 10-min 
> window data, 5 days of 2-hour window data, 30 days of 1-day window, 
> etc. And if there's the need to sync metrics data between nimbus 
> servers, we can add a sync thread to handle nimbus fail-over, since 
> it's just metrics data that don't really matter too much, we can use a 
> plain simple sync model.
> 
> The external service is another option to end users, if users feel 
> it's important (or maybe their business built on top of storm is very 
> important), they can use this external service to build their own 
> monitor system which can be more useful than the original solution 
> shipped with storm.
> 
> On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans 
> <ev...@yahoo-inc.com.invalid>
> wrote:
> 
> > The problem is that we want something for storm that can work out of
> > the box, ideally without some other complicated external service
> > (except zookeeper which we already have, and is not actually that
> > complex to setup and run).
> > If we feel that we must have some external state store that is
> > required for storm to run, then we need to make the decision
> > carefully and deliberately.
> >  - Bobby
> >
> >    On Wednesday, March 23, 2016 8:37 AM, John Fang
> > <xiaojian.fxj@alibaba-inc.com> wrote:
> >
> >
> >  Sorry, I misunderstood. We will make TopologyMaster highly available,
> > and the metrics meta will be stored in HDFS, so the metrics meta won't
> > rely on nimbus. This can improve the stability of the metrics system.
> >
> > -----Original Message-----
> > From: Cody Innowhere [mailto:e.neverme@gmail.com]
> > Sent: March 23, 2016 19:59
> > To: dev@storm.apache.org
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > If we don't rely on any external system, our metrics system is still 
> > available but will store metrics meta/data in rocksdb on nimbus servers.
> > There will be limits though, for example, we cannot store metrics 
> > data all through the topology lifecycle, because rocksdb is only a 
> > KV storage, it may not support efficient scan operations and too 
> > much data in local disk may bring in extra IO overhead, so we may 
> > have to store latest 1hour of m1 data, 6 hours of m10 data as such 
> > (currently not implemented in JStorm, but quite easy to do this).
> >
> > TopologyMaster is merely a channel for 
> > registering/computing/uploading metrics to nimbus, so if a TM goes 
> > down, the topology metrics will be unavailable for a while before it 
> > gets pulled up somewhere else(for a normal failover case, this 
> > should be very fast), while supervisor/nimbus metrics are unaffected 
> > as they're sent to nimbus via thrift interface. As long as TM is back, the topology metrics will be available again.
> >
> > Currently JStorm syncs metrics meta between multiple nimbus servers,
> > but metrics data is not synced. So under a nimbus failure we may
> > lose some metrics data.
> >
> >
> > On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:
> >
> > > John,
> > >
> > > My concern is H/A of metrics on Storm by default. (I'm not 100% 
> > > sure Bobby pointed out same things.)
> > >
> > > Since Apache Storm is used by a wide variety of users, we can't
> > > assume that users know external systems (including the Hadoop
> > > ecosystem, in my personal opinion) and can operate them smoothly.
> > > It reminds me of how important it is to keep the default setup in mind.
> > >
> > > Therefore, I'm curious whether the new metrics feature of JStorm can
> > > work smoothly without an external system (HBase / OTS). I'd love to
> > > see it support H/A without other systems; otherwise users have to
> > > tolerate loss of metrics in some scenarios.
> > >
> > > I guess these may be valid questions on H/A (as far as my
> > > understanding of the design doc is right): How do metrics work when
> > > TopologyMaster is down? And how do metrics work when a Nimbus
> > > failover occurs?
> > >
> > > Personally I don't mind losing metrics for short durations (I just
> > > want to check the availability of H/A), but a failure shouldn't mess
> > > up the whole metrics.
> > >
> > > Thanks,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > > On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:
> > >
> > > > @Bobby Evans The JStorm code has gone through a lot of testing over
> > > > the past few years, especially for HA and scalability. We have done
> > > > a lot of optimization on metrics. The performance is better than
> > > > Flink in my tests. In my personal opinion, the metrics in JStorm
> > > > offer a great deal of information, and they can tell us where the
> > > > bottleneck is when we run a topology. The performance bottleneck may
> > > > be serialize/deserialize/netty/executor and so on. Of course, there
> > > > is also other good monitoring in the world, so I hope we can choose
> > > > the better monitoring before phase 2. I will start to study Atlas.
> > > > If it is better, I will be pleased to redesign the metrics on top
> > > > of Atlas.
> > > > -----Original Message-----
> > > > From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
> > > > Sent: March 22, 2016 22:36
> > > > To: dev@storm.apache.org
> > > > Subject: Re: Question on Metrics Server to Alibaba team
> > > >
> > > > My personal opinion is that we should not reinvent the wheel 
> > > > (aka distributed fault tolerant metrics) ourselves.  The local 
> > > > file blobstore with nimbus HA was a big enough pain to write and 
> > > > it is relatively simple in comparison.
> > > > If the JStorm code is simple and offers everything we need in
> > > > terms of HA and scalability then I would be OK with it, but if
> > > > it doesn't I would lean towards a different compatible open
> > > > source solution.
> > > >
> > > > https://github.com/Netflix/atlas looks very promising as a
> > > > default option.  It is actively maintained by a group that I
> > > > think has some of the best monitoring in the world.  And it is
> > > > both java and apache compatible.  It has no histogram support
> > > > that I could find, but that I don't see as being super critical.
> > > > The biggest drawback is there is little documentation on how to
> > > > use it, to really be able to evaluate it for our needs.
> > > >  - Bobby
> > > >
> > > >    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <ka...@gmail.com> wrote:
> > > >
> > > >
> > > >  Harsha,
> > > >
> > > > That's why I think the new metrics feature of JStorm looks promising.
> > > >
> > > > According to the design doc at
> > > > https://issues.apache.org/jira/browse/STORM-1329,
> > > > there's no distinction between topology stats (which Apache Storm
> > > > includes in the worker heartbeat) and built-in metrics (which should
> > > > be handled with a separate consumer, as you stated).
> > > > All metrics are passed to Nimbus and Nimbus caches them, which
> > > > implies we can treat all metrics the same way, and we can also
> > > > provide built-in metrics (including custom metrics) to users via
> > > > the REST API.
> > > >
> > > > I thought about a standalone metrics server process which handles
> > > > all of the metrics work (maybe TopologyMaster + Nimbus in the design
> > > > doc), but if the current implementation of the metrics feature in
> > > > JStorm can take care of what I'm assuming, I guess it's good enough.
> > > >
> > > > Since I don't know about TopologyMaster, I just wonder whether there
> > > > are any SPOFs (including soft ones) and how metrics work when a
> > > > SPOF component goes down.
> > > > Since Cody gave us a starting point to dig into, we can evaluate
> > > > that feature before phase 2.
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
> > > >
> > > > > One of the goals of this work, which can probably be addressed in
> > > > > a separate jira, is how the topology metrics reporter works.
> > > > > Today it's a bolt that's part of the topology graph, which means
> > > > > it's another node in the topology DAG that needs to be tuned for
> > > > > better performance. Some of our users took performance hits by
> > > > > deploying a topology metrics reporter that sends metrics to
> > > > > Ganglia. Ideally this collection should be asynchronous and not
> > > > > be a node in the topology DAG.
> > > > >
> > > > > Shipping a default metrics server, along with a pluggable option
> > > > > for users who want to use graphite or other timeline servers,
> > > > > should be the goal.
> > > > >
> > > > > --Harsha
> > > > >
> > > > >
> > > > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > > > @Cody - The design looks good. Does the design allow to 
> > > > > > aggregate metrics at the task/executor level? Basically, 
> > > > > > number of distinct metrics is proportional to the number of 
> > > > > > distinct tasks, did you ever run into such a use case?
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere 
> > > > > > <e....@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere 
> > > > > > > <e....@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > > @Jungtaek,
> > > > > > > > > We did some tests on codahale metrics: compared to
> > > > > > > > > meters/histograms, counters are quite fast. So we mainly
> > > > > > > > > focused on the optimization of meters and histograms (they
> > > > > > > > > are indeed very slow), including double sampling, changing
> > > > > > > > > the clock from ns (System.nanoTime) to ms, etc.
> > > > > > > > > You can take a look at the
> > > > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > > > class of our sequence-split-merge example code, as the
> > > > > > > > > client code entry point to metrics.
> > > > > > > > > After that, you may dig into the TopologyMaster class, which
> > > > > > > > > is still part of a topology, then into
> > > > > > > > > TopologyMetricsRunnable, which is a part of the nimbus
> > > > > > > > > server, and finally into the MetricUploader plugin, which is
> > > > > > > > > where the metrics interface with our "metrics server".
> > > > > > > > > There are still some nits in the code, but I think that
> > > > > > > > > should be no big problem.
> > > > > > > >
> > > > > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > > > > strictly a real metrics server, since most of the duty lies
> > > > > > > > > on nimbus server and topology master, it's more appropriate
> > > > > > > > > to call it metrics storage. The main reason for this is that
> > > > > > > > > we don't want to make a heavy-weight metrics server out of
> > > > > > > > > JStorm, and this makes it very easy for us to maintain (we
> > > > > > > > > have teams that specifically maintain HBase/OTS in Alibaba
> > > > > > > > > since they're so commonly used in production).
> > > > > > > >
> > > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim 
> > > > > > > > <ka...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > > > >>
> > > > > > > > >> Cody,
> > > > > > > > >> I took a look at the design doc and it looks promising,
> > > > > > > > >> especially that it doesn't do sampling when the metric type
> > > > > > > > >> is 'counter'. As far as I've heard (I didn't try it), there
> > > > > > > > >> is a huge performance hit in Apache Storm when we change the
> > > > > > > > >> sample rate to 1.0.
> > > > > > > > >> Could you point me to the entry point of the metrics feature
> > > > > > > > >> in JStorm to dig into?
> > > > > > > > >>
> > > > > > > > >> And just out of curiosity, did you consider extracting the
> > > > > > > > >> metrics feature (which is done with TopologyMasters and
> > > > > > > > >> Nimbuses) into a separate component?
> > > > > > > > >> I understood your mention of 'metrics server' as a separate
> > > > > > > > >> component, but after seeing the design doc, the feature
> > > > > > > > >> seems to be implemented on Nimbus.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > >>
> > > > > > > >> 2016년 3월 19일 (토) 오전 1:25, Cody Innowhere 
> > > > > > > >> <e....@gmail.com>님이
> > > > > 작성:
> > > > > > > >>
> > > > > > > > >> > JStorm has provided a MetricUploader interface, which is similar to
> > > > > > > > >> > IMetricsConsumer in storm, and the underlying implementation is pluggable,
> > > > > > > > >> > you can use HBase, or any other KV store that supports timeline queries or
> > > > > > > > >> > even a database (maybe if it's a small cluster). We provide model classes
> > > > > > > > >> > in jstorm-core, as to what kinds of metrics data need to be stored, it's
> > > > > > > > >> > totally up to the detailed implementation. Our internal implementation uses
> > > > > > > > >> > OTS, which is a product of aliyun (https://www.aliyun.com/product/ots/),
> > > > > > > > >> > but it's easy to adapt to other implementations.
> > > > > > > >> >
> > > > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans <evans@yahoo-inc.com.invalid> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Yes we originally wanted to try and use the Hadoop Timeline Server for
> > > > > > > > >> > > storm metrics feedback to nimbus + UI + history like server.  But it was
> > > > > > > > >> > > not stable at the time, so we stopped.  For the sake of playing nicely with
> > > > > > > > >> > > the rest of the big data ecosystem I would like to see us support it as an
> > > > > > > > >> > > option for metrics collection/query, but until the timeline server v2 is
> > > > > > > > >> > > ready and released.  For me the important thing is that we have a decent
> > > > > > > > >> > > time series DB that comes with storm by default and is pluggable so we can
> > > > > > > > >> > > replace it with something else that has similar capabilities in the future.
> > > > > > > > >> > >  - Bobby
> > > > > > > >> > >
> > > > > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere <e.neverme@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm absolutely ok to
> > > > > > > > >> > > discuss this in advance.
> > > > > > > > >> > >
> > > > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <e.neverme@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Yes it's already in production.
> > > > > > > > >> > > > The implementation basically follows the design document in
> > > > > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329, you can take a look
> > > > > > > > >> > > > first and feel free to ask questions.
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <kabhwan@gmail.com> wrote:
> > > > > > > >> > > >
> > > > > > > > >> > > >> Hi,
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> I got something to do with metrics so I'm seeking the pull requests which
> > > > > > > > >> > > >> addresses metrics.
> > > > > > > > >> > > >> And at #753 <https://github.com/apache/storm/pull/753> I found Cody said we
> > > > > > > > >> > > >> (maybe it means Alibaba team) are currently working on Metrics Server.
> > > > > > > > >> > > >> (I also found comment which said there was some talk while ago around
> > > > > > > > >> > > >> integrating Hadoop timeline server. Seems like no one came up with the
> > > > > > > > >> > > >> result, and I prefer to avoid big dependency so I'm in favor of Metrics
> > > > > > > > >> > > >> Server for now.)
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> I think that would improve metrics feature of Storm much better, so I'd
> > > > > > > > >> > > >> like to see how the work is going. Sure it's only when there's no issue for
> > > > > > > > >> > > >> you to work transparently. I just would like to prevent duplication of
> > > > > > > > >> > > >> work, and would like to help if needed and possible.
> > > > > > > >> > > >>
> > > > > > > >> > > >> Thanks,
> > > > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > >> > > >>
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Abhishek Agarwal

Re: Re: Question on Metrics Server to Alibaba team

Posted by Harsha <st...@harsha.io>.

Another thing to consider is storing the metrics as time series data,
rather than the current approach of storing 1min, 10min and 3hr windows,
and definitely not depending on external storage such as HDFS.
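
A minimal sketch of the difference, assuming raw points are stored per timestamp and any window size is computed at query time. The class RawSeries and its methods are hypothetical and exist only for illustration; they are not Storm or JStorm APIs, and a TreeMap stands in for whatever store would actually be used.

```java
import java.util.Map;
import java.util.TreeMap;

class RawSeries {
    // Raw points keyed by timestamp in milliseconds (hypothetical in-memory stand-in).
    private final TreeMap<Long, Double> points = new TreeMap<>();

    // Record one sample; samples that share a timestamp are summed.
    void record(long timestampMs, double value) {
        points.merge(timestampMs, value, Double::sum);
    }

    // Sum the raw points in [fromMs, toMs) into buckets of windowMs.
    // The window size is chosen by the caller at query time, not at write time.
    TreeMap<Long, Double> rollup(long fromMs, long toMs, long windowMs) {
        TreeMap<Long, Double> buckets = new TreeMap<>();
        for (Map.Entry<Long, Double> e : points.subMap(fromMs, true, toMs, false).entrySet()) {
            long bucketStart = Math.floorDiv(e.getKey(), windowMs) * windowMs;
            buckets.merge(bucketStart, e.getValue(), Double::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        RawSeries series = new RawSeries();
        series.record(0L, 1.0);
        series.record(30_000L, 2.0);
        series.record(60_000L, 3.0);
        // The same raw data can be viewed as 1-minute or 10-minute windows.
        System.out.println(series.rollup(0L, 600_000L, 60_000L));
        System.out.println(series.rollup(0L, 600_000L, 600_000L));
    }
}
```

The trade-off is that raw points cost more space than pre-aggregated windows, so a real store would still need some retention or compaction policy.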

On Fri, Mar 25, 2016, at 06:43 AM, Bobby Evans wrote:
> My concern is really around how much time/effort it is to get to a final
> solution, and to ultimately maintain/support that solution.  If I was
> doing this from scratch I would probably pull something off of the shelf
> that is tested and has an entire community supporting it instead of
> writing something ourselves from scratch.  But in this case we have a
> solution from JStorm, that we know works.  Because this is the backend
> that we are talking about we can switch things out later on if we need
> to.  Like I said before I am fine with using the JStorm code initially. 
> I mostly want to be sure of a few things.
> 1. The metrics interface we expose to end users is well thought out and
> can be extended in the future.
> 2. The interfaces that connect this front end to the back end are
> thought out and we could replace the back end if needed.
> 3. The solution offers some level of high availability.  If Nimbus,
> a worker, etc. crash it is OK to lose some data, but we don't want to
>  - Bobby
> 
>     On Friday, March 25, 2016 6:26 AM, Cody Innowhere
>     <e....@gmail.com> wrote:
>  
> 
>  Bobby,
> I understand your concern. Still, I think our metrics design in JStorm
> can work without any external service, as I mentioned above, we can store
> metrics in rocksdb on nimbus server. A rough thought will be: we store
> the latest 1 hour of 1-min window data, 10 hours of 10-min window data,
> 5 days of 2-hour window data, 30 days of 1-day window, etc. And if
> there's the need to sync metrics data between nimbus servers, we can add
> a sync thread to handle nimbus fail-over, since it's just metrics data
> that don't really matter too much, we can use a plain simple sync model.
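
The retention scheme in the paragraph above bounds how much data each metric can occupy, and the arithmetic can be checked with a small sketch. The RetentionTiers and Tier classes are hypothetical names used only for illustration, not JStorm code; the four tiers are the ones listed in the message (1-min windows for 1 hour, 10-min for 10 hours, 2-hour for 5 days, 1-day for 30 days).

```java
import java.time.Duration;
import java.util.List;

class RetentionTiers {
    // One tier: windows of width `window` are kept for `retention` (hypothetical type).
    static final class Tier {
        final Duration window;
        final Duration retention;

        Tier(Duration window, Duration retention) {
            this.window = window;
            this.retention = retention;
        }

        // Upper bound on the number of windows this tier holds per metric.
        long maxWindows() {
            return retention.toMillis() / window.toMillis();
        }

        // True once a window that started at windowStartMs is older than the retention.
        boolean isExpired(long windowStartMs, long nowMs) {
            return nowMs - windowStartMs > retention.toMillis();
        }
    }

    // The tiers described in the message above.
    static final List<Tier> TIERS = List.of(
            new Tier(Duration.ofMinutes(1), Duration.ofHours(1)),
            new Tier(Duration.ofMinutes(10), Duration.ofHours(10)),
            new Tier(Duration.ofHours(2), Duration.ofDays(5)),
            new Tier(Duration.ofDays(1), Duration.ofDays(30)));

    // Total number of stored windows per metric across all tiers.
    static long totalWindowsPerMetric() {
        return TIERS.stream().mapToLong(Tier::maxWindows).sum();
    }

    public static void main(String[] args) {
        System.out.println(totalWindowsPerMetric());
    }
}
```

With these four tiers each metric needs at most 60 + 60 + 60 + 30 = 210 stored windows, regardless of how long the topology runs, which is what keeps a local store on the nimbus server bounded.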
> 
> The external service is another option to end users, if users feel it's
> important (or maybe their business built on top of storm is very
> important), they can use this external service to build their own monitor
> system which can be more useful than the original solution shipped with
> storm.
> 
> On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans
> <ev...@yahoo-inc.com.invalid>
> wrote:
> 
> > The problem is that we want something for storm that can work out of the
> > box, ideally without some other complicated external service (except
> > zookeeper which we already have, and is not actually that complex to setup
> > and run).
> > If we feel that we must have some external state store that is required
> > for storm to run, then we need to make the decision carefully and
> > deliberately.
> >  - Bobby
> >
> >    On Wednesday, March 23, 2016 8:37 AM, John Fang <
> > xiaojian.fxj@alibaba-inc.com> wrote:
> >
> >
> >  Sorry, I misunderstood. We will make TopologyMaster highly available,
> > and the metrics meta will be stored in HDFS, so the metrics meta won't
> > rely on nimbus. This can improve the stability of the metrics system.
> >
> > -----Original Message-----
> > From: Cody Innowhere [mailto:e.neverme@gmail.com]
> > Sent: March 23, 2016 19:59
> > To: dev@storm.apache.org
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > If we don't rely on any external system, our metrics system is still
> > available but will store metrics meta/data in rocksdb on nimbus servers.
> > There will be limits though, for example, we cannot store metrics data all
> > through the topology lifecycle, because rocksdb is only a KV storage, it
> > may not support efficient scan operations and too much data in local disk
> > may bring in extra IO overhead, so we may have to store latest 1hour of m1
> > data, 6 hours of m10 data as such (currently not implemented in JStorm, but
> > quite easy to do this).
> >
> > TopologyMaster is merely a channel for registering/computing/uploading
> > metrics to nimbus, so if a TM goes down, the topology metrics will be
> > unavailable for a while before it gets pulled up somewhere else(for a
> > normal failover case, this should be very fast), while supervisor/nimbus
> > metrics are unaffected as they're sent to nimbus via thrift interface. As
> > long as TM is back, the topology metrics will be available again.
> >
> > Currently JStorm syncs metrics meta between multiple nimbus servers,
> > but metrics data is not synced. So under a nimbus failure we may
> > lose some metrics data.
> >
> >
> > On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:
> >
> > > John,
> > >
> > > My concern is H/A of metrics on Storm by default. (I'm not 100% sure
> > > Bobby pointed out same things.)
> > >
> > > Since Apache Storm is used by a wide variety of users, we can't
> > > assume that users know external systems (including the Hadoop
> > > ecosystem, in my personal opinion) and can operate them smoothly.
> > > It reminds me of how important it is to keep the default setup in mind.
> > >
> > > Therefore, I'm curious whether the new metrics feature of JStorm can
> > > work smoothly without an external system (HBase / OTS). I'd love to
> > > see it support H/A without other systems; otherwise users have to
> > > tolerate loss of metrics in some scenarios.
> > >
> > > I guess these may be valid questions on H/A (as far as my
> > > understanding of the design doc is right): How do metrics work when
> > > TopologyMaster is down? And how do metrics work when a Nimbus
> > > failover occurs?
> > >
> > > Personally I don't mind losing metrics for short durations (I just
> > > want to check the availability of H/A), but a failure shouldn't mess
> > > up the whole metrics.
> > >
> > > Thanks,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > > On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:
> > >
> > > > @Bobby Evans The JStorm code has gone through a lot of testing over
> > > > the past few years, especially for HA and scalability. We have done
> > > > a lot of optimization on metrics. The performance is better than
> > > > Flink in my tests. In my personal opinion, the metrics in JStorm
> > > > offer a great deal of information, and they can tell us where the
> > > > bottleneck is when we run a topology. The performance bottleneck may
> > > > be serialize/deserialize/netty/executor and so on. Of course, there
> > > > is also other good monitoring in the world, so I hope we can choose
> > > > the better monitoring before phase 2. I will start to study Atlas.
> > > > If it is better, I will be pleased to redesign the metrics on top
> > > > of Atlas.
> > > > -----Original Message-----
> > > > From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
> > > > Sent: March 22, 2016 22:36
> > > > To: dev@storm.apache.org
> > > > Subject: Re: Question on Metrics Server to Alibaba team
> > > >
> > > > My personal opinion is that we should not reinvent the wheel (aka
> > > > distributed fault tolerant metrics) ourselves.  The local file
> > > > blobstore with nimbus HA was a big enough pain to write and it is
> > > > relatively simple in comparison.
> > > > If the JStorm code is simple and offers everything we need in terms
> > > > of HA and scalability then I would be OK with it, but if it doesn't
> > > > I would lean towards a different compatible open source solution.
> > > >
> > > > https://github.com/Netflix/atlas
> > > > looks very promising as a default option.  It is actively maintained
> > > > by a group that I think has some of the best monitoring in the
> > > > world.  And it is both java and apache compatible.  It has no
> > > > histogram support that I could find, but that I don't see as being
> > > > super critical.  The biggest drawback is there is little
> > > > documentation on how to use it, to really be able to evaluate it
> > > > for our needs.
> > > >  - Bobby
> > > >
> > > >    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim
> > > > <ka...@gmail.com>
> > > > wrote:
> > > >
> > > >
> > > >  Harsha,
> > > >
> > > > That's why I think the new metrics feature of JStorm looks promising.
> > > >
> > > > According to the design doc on
> > > > https://issues.apache.org/jira/browse/STORM-1329,
> > > > there's no distinction between topology stats (which Apache Storm
> > > > includes in the worker heartbeat) and built-in metrics (which should
> > > > be handled by a separate consumer, as you stated).
> > > > All metrics are passed to Nimbus and Nimbus caches them, which
> > > > implies we can treat all metrics the same way, and we can also
> > > > provide built-in metrics (including custom metrics) to users via the
> > > > REST API.
> > > >
> > > > I thought about a standalone metrics server process that handles all
> > > > of the metrics work (maybe TopologyMaster + Nimbus in the design
> > > > doc), but if the current implementation of the metrics feature in
> > > > JStorm can take care of what I'm assuming, I guess it's good enough.
> > > >
> > > > Since I don't know about TopologyMaster, I just wonder whether there
> > > > are any SPOFs (including soft ones) and how metrics work when a SPOF
> > > > component goes down.
> > > > Since Cody gave us a starting point to dig into, we can evaluate
> > > > that feature before phase 2.
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
> > > >
> > > > > One of the goals of this work, which can probably be addressed in
> > > > > a separate JIRA, is how the topology metrics reporter works. Today
> > > > > it's a bolt that's part of the topology graph, which means it's
> > > > > another node in the topology DAG that needs to be tuned for better
> > > > > performance. Some of our users took performance hits by deploying
> > > > > a topology metrics reporter that sends metrics to Ganglia.
> > > > > Ideally this collection should be asynchronous and not be a node
> > > > > in the topology DAG.
> > > > >
> > > > > Shipping a default metrics server, along with a pluggable option
> > > > > for users who want to use Graphite or other timeline servers,
> > > > > should be the goal.
> > > > >
> > > > > --Harsha
> > > > >
> > > > >
> > > > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > > > @Cody - The design looks good. Does the design allow
> > > > > > aggregating metrics at the task/executor level? Basically, the
> > > > > > number of distinct metrics is proportional to the number of
> > > > > > distinct tasks. Did you ever run into such a use case?
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
> > > > > > <e....@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
> > > > > > > <e....@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > @Jungtaek,
> > > > > > > > We did some tests on codahale metrics: compared to
> > > > > > > > meters/histograms, counters are quite fast. So we mainly
> > > > > > > > focused on optimizing meters and histograms (they are indeed
> > > > > > > > very slow), including double sampling, changing the clock
> > > > > > > > from ns (System.nanoTime) to ms, etc.
> > > > > > > > You can take a look at the
> > > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > > class of our sequence-split-merge example code, as the
> > > > > > > > client code entry point to metrics.
> > > > > > > > After that, you may dig into the TopologyMaster class, which
> > > > > > > > is still part of a topology, then into
> > > > > > > > TopologyMetricsRunnable, which is part of the nimbus server,
> > > > > > > > and finally into the MetricUploader plugin, which is where
> > > > > > > > the metrics interface with our "metrics server". There are
> > > > > > > > still some nits in the code, but I think that should be no
> > > > > > > > big problem.
> > > > > > > >
> > > > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > > > strictly a real metrics server. Since most of the work is
> > > > > > > > done by the nimbus server and the topology master, it's more
> > > > > > > > appropriate to call it metrics storage. The main reason for
> > > > > > > > this is that we don't want to build a heavy-weight metrics
> > > > > > > > server out of JStorm, and this makes it very easy for us to
> > > > > > > > maintain (we have teams that specifically maintain HBase/OTS
> > > > > > > > in Alibaba since they're so commonly used in production).
> > > > > > > >
> > > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
> > > > > > > > <ka...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > > >>
> > > > > > > >> Cody,
> > > > > > > > >> I took a look at the design doc and it looks promising,
> > > > > > > > >> especially that it doesn't do sampling when the metric type
> > > > > > > > >> is 'counter'. As far as I heard (I didn't try it), it
> > > > > > > > >> becomes a huge performance hit in Apache Storm when we
> > > > > > > > >> change the sample rate to 1.0.
> > > > > > > > >> Could you point me to the entry point of the metrics feature
> > > > > > > > >> in JStorm to dig into?
> > > > > > > > >>
> > > > > > > > >> And just out of curiosity, did you consider extracting the
> > > > > > > > >> metrics feature (which is done with TopologyMasters and
> > > > > > > > >> Nimbuses) into a separate component?
> > > > > > > > >> I understood your mention of the 'metrics server' as a
> > > > > > > > >> separate component, but after seeing the design doc, the
> > > > > > > > >> feature seems to be implemented on Nimbus.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > >>
> > > > > > > >> 2016년 3월 19일 (토) 오전 1:25, Cody Innowhere
> > > > > > > >> <e....@gmail.com>님이
> > > > > 작성:
> > > > > > > >>
> > > > > > > > >> > JStorm provides a MetricUploader interface, which is similar
> > > > > > > > >> > to IMetricsConsumer in Storm, and the underlying
> > > > > > > > >> > implementation is pluggable: you can use HBase, or any other
> > > > > > > > >> > KV store that supports timeline queries, or even a database
> > > > > > > > >> > (maybe if it's a small cluster). We provide model classes in
> > > > > > > > >> > jstorm-core; as to what kinds of metrics data need to be
> > > > > > > > >> > stored, it's totally up to the detailed implementation. Our
> > > > > > > > >> > internal implementation uses OTS, which is a product of
> > > > > > > > >> > Aliyun (https://www.aliyun.com/product/ots/), but it's easy
> > > > > > > > >> > to adapt to other implementations.
> > > > > > > >> >
> > > > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans <evans@yahoo-inc.com.invalid> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Yes we originally wanted to try and use the Hadoop Timeline Server for
> > > > > > > > >> > > storm metrics feedback to nimbus + UI + history like server.  But it was
> > > > > > > > >> > > not stable at the time, so we stopped.  For the sake of playing nicely with
> > > > > > > > >> > > the rest of the big data ecosystem I would like to see us support it as an
> > > > > > > > >> > > option for metrics collection/query, but until the timeline server v2 is
> > > > > > > > >> > > ready and released.  For me the important thing is that we have a decent
> > > > > > > > >> > > time series DB that comes with storm by default and is pluggable so we can
> > > > > > > > >> > > replace it with something else that has similar capabilities in the future.
> > > > > > > > >> > >  - Bobby
> > > > > > > > >> > >
> > > > > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere <e.neverme@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm absolutely ok to
> > > > > > > > >> > > discuss this in advance.
> > > > > > > > >> > >
> > > > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <e.neverme@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Yes it's already in production.
> > > > > > > > >> > > > The implementation basically follows the design document in
> > > > > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329, you can take a look
> > > > > > > > >> > > > first and feel free to ask questions.
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <kabhwan@gmail.com> wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > >> Hi,
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> I got something to do with metrics so I'm seeking the pull requests which
> > > > > > > > >> > > >> addresses metrics.
> > > > > > > > >> > > >> And at #753 <https://github.com/apache/storm/pull/753> I found Cody said we
> > > > > > > > >> > > >> (maybe it means Alibaba team) are currently working on Metrics Server.
> > > > > > > > >> > > >> (I also found comment which said there was some talk while ago around
> > > > > > > > >> > > >> integrating Hadoop timeline server. Seems like no one came up with the
> > > > > > > > >> > > >> result, and I prefer to avoid big dependency so I'm in favor of Metrics
> > > > > > > > >> > > >> Server for now.)
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> I think that would improve metrics feature of Storm much better, so I'd
> > > > > > > > >> > > >> like to see how the work is going. Sure it's only when there's no issue for
> > > > > > > > >> > > >> you to work transparently. I just would like to prevent duplication of
> > > > > > > > >> > > >> work, and would like to help if needed and possible.
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> Thanks,
> > > > > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Abhishek Agarwal
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
> 
>

Re: Reply: Question on Metrics Server to Alibaba team

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.

What is more, I see pacemaker as a stepping stone towards a better metrics system. If a new metrics system works well and scales well, pacemaker can go away, assuming that the heartbeats themselves have all of the metrics stuff removed from them.
 - Bobby 

    On Tuesday, March 29, 2016 10:34 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
 

 Agreed. Storm should be able to operate with a minimum of external dependencies.

Beyond ZooKeeper, we don't have any strict runtime service dependencies. We should keep it that way. I'm all for sane default/reference API implementations combined with the ability to swap out alternative implementations at runtime.

With pacemaker, the ZooKeeper dependency is even starting to blur.

Keeping Storm as standalone as possible is just as important as integration with resource negotiation frameworks. Users don't want to be forced into a specific architecture.

-Taylor

> On Mar 25, 2016, at 9:43 AM, Bobby Evans <ev...@yahoo-inc.com.INVALID> wrote:
> 
> My concern is really around how much time/effort it is to get to a final solution, and to ultimately maintain/support that solution.  If I was doing this from scratch I would probably pull something off of the shelf that is tested and has an entire community supporting it instead of writing something ourselves from scratch.  But in this case we have a solution from JStorm, that we know works.  Because this is the backend that we are talking about we can switch things out later on if we need to.  Like I said before I am fine with using the JStorm code initially.  I mostly want to be sure of a few things.
> 1. The metrics interface we expose to end users is well thought out and can be extended in the future.
> 2. The interfaces that connect this front end to the back end are thought out and we could replace the back end if needed.
> 3. The solution offers some level of high availability.  If Nimbus, a worker, etc. crash it is OK to lose some data, but we don't want to 
>  - Bobby 
> 
>    On Friday, March 25, 2016 6:26 AM, Cody Innowhere <e....@gmail.com> wrote:
> 
> 
> Bobby,
> I understand your concern. Still, I think our metrics design in JStorm can
> work without any external service, as I mentioned above, we can store
> metrics in rocksdb on nimbus server. A rough thought will be: we store the
> latest 1 hour of 1-min window data, 10 hours of 10-min window data, 5 days
> of 2-hour window data, 30 days of 1-day window, etc. And if there's the
> need to sync metrics data between nimbus servers, we can add a sync thread
> to handle nimbus fail-over, since it's just metrics data that don't really
> matter too much, we can use a plain simple sync model.
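
The tiered retention described in the paragraph above can be sketched as follows. This is only an illustration in Python, not JStorm code: `Tier`, `pick_tier` and `points_per_metric` are hypothetical names, and only the four window/retention pairs are taken from the message.

```python
from dataclasses import dataclass

MIN, HOUR, DAY = 60, 60 * 60, 24 * 60 * 60


@dataclass(frozen=True)
class Tier:
    window_secs: int     # width of one aggregated window (hypothetical field)
    retention_secs: int  # how far back this tier keeps data (hypothetical field)

    @property
    def max_points(self) -> int:
        # Number of windows that fit into the retention period.
        return self.retention_secs // self.window_secs


# The four tiers named in the message, finest first.
TIERS = [
    Tier(1 * MIN, 1 * HOUR),
    Tier(10 * MIN, 10 * HOUR),
    Tier(2 * HOUR, 5 * DAY),
    Tier(1 * DAY, 30 * DAY),
]


def pick_tier(age_secs: int) -> Tier:
    """Return the finest tier that still holds data of the given age."""
    for tier in TIERS:
        if age_secs <= tier.retention_secs:
            return tier
    raise ValueError("data older than the longest retention period")


def points_per_metric() -> int:
    """Upper bound on stored points per metric across all tiers."""
    return sum(tier.max_points for tier in TIERS)
```

Under these assumptions each metric keeps at most 60 + 60 + 60 + 30 = 210 points, so a query for data from 3 days ago would be served from the 2-hour tier rather than from raw 1-minute windows.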
> 
> The external service is another option to end users, if users feel it's
> important (or maybe their business built on top of storm is very
> important), they can use this external service to build their own monitor
> system which can be more useful than the original solution shipped with
> storm.
> 
> On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans <ev...@yahoo-inc.com.invalid>
> wrote:
> 
>> The problem is that we want something for storm that can work out of the
>> box, ideally without some other complicated external service (except
>> zookeeper which we already have, and is not actually that complex to setup
>> and run).
>> If we feel that we must have some external state store that is required
>> for storm to run, then we need to make the decision carefully and
>> deliberately.
>>  - Bobby
>> 
>>    On Wednesday, March 23, 2016 8:37 AM, John Fang <
>> xiaojian.fxj@alibaba-inc.com> wrote:
>> 
>> 
>>  Sorry, I misunderstood. We will make TopologyMaster highly available,
>> and the metrics meta will be stored in HDFS, so it won't rely on
>> nimbus. This can improve the stability of the metrics system.
>> 
>> -----Original Message-----
>> From: Cody Innowhere [mailto:e.neverme@gmail.com]
>> Sent: March 23, 2016 19:59
>> To: dev@storm.apache.org
>> Subject: Re: Question on Metrics Server to Alibaba team
>> 
>> If we don't rely on any external system, our metrics system is still
>> available but will store metrics meta/data in rocksdb on nimbus servers.
>> There will be limits though. For example, we cannot store metrics data
>> for the whole topology lifecycle: rocksdb is only a KV store, it may not
>> support efficient scan operations, and too much data on local disk may
>> bring extra IO overhead. So we may have to store, say, the latest 1 hour
>> of m1 data and 6 hours of m10 data (currently not implemented in JStorm,
>> but quite easy to do).
>> 
>> TopologyMaster is merely a channel for registering/computing/uploading
>> metrics to nimbus, so if a TM goes down, the topology metrics will be
>> unavailable for a while until it gets brought up somewhere else (in a
>> normal failover case, this should be very fast), while supervisor/nimbus
>> metrics are unaffected as they're sent to nimbus via the thrift
>> interface. As soon as the TM is back, the topology metrics will be
>> available again.
>> 
>> Currently JStorm syncs metrics meta between multiple nimbus servers, but
>> metrics data is not synced. So on a nimbus failure, we may lose some
>> metrics data.
>> 
>> 
>>> On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>>> 
>>> John,
>>> 
>>> My concern is H/A of metrics on Storm by default. (I'm not 100% sure
>>> Bobby pointed out the same thing.)
>>> 
>>> Since Apache Storm is used by a wide variety of users, we can't
>>> assume that users have knowledge of external systems (including the
>>> Hadoop ecosystem, in my personal opinion) and can operate them smoothly.
>>> It reminds me of how important the defaults are.
>>> 
>>> Therefore, I'm curious whether the new metrics feature of JStorm can
>>> work smoothly without an external system (HBase / OTS). I'd love to
>>> see it support H/A without other systems; otherwise users have to
>>> tolerate loss of metrics in some scenarios.
>>> 
>>> I guess these may be valid questions on H/A (if my understanding of
>>> the design doc is right): How do metrics work when TopologyMaster is
>>> down? And how do metrics work when Nimbus fails over?
>>> 
>>> Personally I don't mind losing metrics for short durations (I just want
>>> to check whether H/A is available), but a failure shouldn't mess up the
>>> metrics as a whole.
>>> 
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>> 
>>> On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:
>>> 
>>>> @Bobby Evans, the JStorm code has been tested heavily over the past
>>>> few years, especially for HA and scalability. We have done a lot of
>>>> optimization work on metrics, and in my tests the performance is
>>>> better than Flink's. In my personal opinion, the JStorm metrics offer
>>>> a lot of information and can tell us where the bottleneck is when we
>>>> run a topology: serialization, deserialization, Netty, the executor,
>>>> and so on. Of course, there is also other good monitoring in the
>>>> world, so I hope we can choose the better monitoring option before
>>>> phase 2. I will start studying Atlas. If it is better, I would be
>>>> pleased to redesign the metrics on top of Atlas.
>>>> -----Original Message-----
>>>> From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
>>>> Sent: March 22, 2016 22:36
>>>> To: dev@storm.apache.org
>>>> Subject: Re: Question on Metrics Server to Alibaba team
>>>> 
>>>> My personal opinion is that we should not reinvent the wheel (aka
>>>> distributed fault tolerant metrics) ourselves.  The local file
>>>> blobstore with nimbus HA was a big enough pain to write and it is
>>>> relatively simple in comparison.
>>>> If the JStorm code is simple and offers everything we need in terms
>>>> of HA and scalability then I would be OK with it, but if it doesn't
>>>> I would lean towards a different compatible open source solution.
>>>> 
>>>> https://github.com/Netflix/atlas
>>>> looks very promising as a default option.  It is actively maintained
>>>> by a group that I think has some of the best monitoring in the
>>>> world.  And it is both java and apache compatible.  It has no
>>>> histogram support that I could find, but that I don't see as being
>>>> super critical.  The biggest drawback is there is little
>>>> documentation on how to use it, to really be able to evaluate it for
>>>> our needs.
>>>>  - Bobby
>>>> 
>>>>    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim
>>>> <ka...@gmail.com>
>>>> wrote:
>>>> 
>>>> 
>>>>  Harsha,
>>>> 
>>>> That's why I think the new metrics feature of JStorm looks promising.
>>>> 
>>>> According to the design doc on
>>>> https://issues.apache.org/jira/browse/STORM-1329,
>>>> there's no distinction between topology stats (which Apache Storm
>>>> includes in the worker heartbeat) and built-in metrics (which should
>>>> be handled by a separate consumer, as you stated).
>>>> All metrics are passed to Nimbus and Nimbus caches them, which
>>>> implies we can treat all metrics the same way, and we can also
>>>> provide built-in metrics (including custom metrics) to users via the
>>>> REST API.
>>>> 
>>>> I thought about a standalone metrics server process that handles all
>>>> of the metrics work (maybe TopologyMaster + Nimbus in the design
>>>> doc), but if the current implementation of the metrics feature in
>>>> JStorm can take care of what I'm assuming, I guess it's good enough.
>>>> 
>>>> Since I don't know about TopologyMaster, I just wonder whether there
>>>> are any SPOFs (including soft ones) and how metrics work when a SPOF
>>>> component goes down.
>>>> Since Cody gave us a starting point to dig into, we can evaluate
>>>> that feature before phase 2.
>>>> 
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>> 
>>>> On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
>>>> 
>>>>> One of the goals of this work, which can probably be addressed in
>>>>> a separate JIRA, is how the topology metrics reporter works. Today
>>>>> it's a bolt that's part of the topology graph, which means it's
>>>>> another node in the topology DAG that needs to be tuned for better
>>>>> performance. Some of our users took performance hits by deploying
>>>>> a topology metrics reporter that sends metrics to Ganglia.
>>>>> Ideally this collection should be asynchronous and not be a node
>>>>> in the topology DAG.
>>>>> 
>>>>> Shipping a default metrics server, along with a pluggable option
>>>>> for users who want to use Graphite or other timeline servers,
>>>>> should be the goal.
>>>>> 
>>>>> --Harsha
>>>>> 
>>>>> 
>>>>>> On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
>>>>>> @Cody - The design looks good. Does the design allow
>>>>>> aggregating metrics at the task/executor level? Basically, the
>>>>>> number of distinct metrics is proportional to the number of
>>>>>> distinct tasks. Did you ever run into such a use case?
>>>>>> 
>>>>>> 
>>>>>> On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
>>>>>> <e....@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Also, you can read the code from our latest release JStorm 2.1.1.
>>>>>>> 
>>>>>>> On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
>>>>>>> <e....@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> @Jungtaek,
>>>>>>>> We did some tests on codahale metrics: compared to
>>>>>>>> meters/histograms, counters are quite fast. So we mainly
>>>>>>>> focused on optimizing meters and histograms (they are indeed
>>>>>>>> very slow), including double sampling, changing the clock
>>>>>>>> from ns (System.nanoTime) to ms, etc.
>>>>>>>> You can take a look at the
>>>>>>>> "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
>>>>>>>> class of our sequence-split-merge example code, as the
>>>>>>>> client code entry point to metrics.
>>>>>>>> After that, you may dig into the TopologyMaster class, which
>>>>>>>> is still part of a topology, then into
>>>>>>>> TopologyMetricsRunnable, which is part of the nimbus server,
>>>>>>>> and finally into the MetricUploader plugin, which is where
>>>>>>>> the metrics interface with our "metrics server". There are
>>>>>>>> still some nits in the code, but I think that should be no
>>>>>>>> big problem.
>>>>>>>> 
>>>>>>>> I'd also like to point out that our "metrics server" is not
>>>>>>>> strictly a real metrics server. Since most of the work is
>>>>>>>> done by the nimbus server and the topology master, it's more
>>>>>>>> appropriate to call it metrics storage. The main reason for
>>>>>>>> this is that we don't want to build a heavy-weight metrics
>>>>>>>> server out of JStorm, and this makes it very easy for us to
>>>>>>>> maintain (we have teams that specifically maintain HBase/OTS
>>>>>>>> in Alibaba since they're so commonly used in production).
>>>>>>>> 
>>>>>>>> On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
>>>>>>>> <ka...@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Thanks Cody and Bobby for the explanation.
>>>>>>>>> 
>>>>>>>>> Cody,
>>>>>>>>> I took a look at the design doc and it looks promising,
>>>>>>>>> especially that it doesn't do sampling when the metric type
>>>>>>>>> is 'counter'. As far as I heard (I didn't try it), it
>>>>>>>>> becomes a huge performance hit in Apache Storm when we
>>>>>>>>> change the sample rate to 1.0.
>>>>>>>>> Could you point me to the entry point of the metrics feature
>>>>>>>>> in JStorm to dig into?
>>>>>>>>> 
>>>>>>>>> And just out of curiosity, did you consider extracting the
>>>>>>>>> metrics feature (which is done with TopologyMasters and
>>>>>>>>> Nimbuses) into a separate component?
>>>>>>>>> I understood your mention of the 'metrics server' as a
>>>>>>>>> separate component, but after seeing the design doc, the
>>>>>>>>> feature seems to be implemented on Nimbus.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>> 
>>>>>>>>> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere <e....@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> JStorm provides a MetricUploader interface, which is similar
>>>>>>>>>> to IMetricsConsumer in Storm, and the underlying
>>>>>>>>>> implementation is pluggable: you can use HBase, or any other
>>>>>>>>>> KV store that supports timeline queries, or even a database
>>>>>>>>>> (maybe if it's a small cluster). We provide model classes in
>>>>>>>>>> jstorm-core; as to what kinds of metrics data need to be
>>>>>>>>>> stored, it's totally up to the detailed implementation. Our
>>>>>>>>>> internal implementation uses OTS, which is a product of
>>>>>>>>>> Aliyun (https://www.aliyun.com/product/ots/), but it's easy
>>>>>>>>>> to adapt to other implementations.
>>>>>>>>>> 
>>>>>>>>>> On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans <evans@yahoo-inc.com.invalid> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Yes we originally wanted to try and use the Hadoop Timeline Server for
>>>>>>>>>>> storm metrics feedback to nimbus + UI + history like server.  But it was
>>>>>>>>>>> not stable at the time, so we stopped.  For the sake of playing nicely with
>>>>>>>>>>> the rest of the big data ecosystem I would like to see us support it as an
>>>>>>>>>>> option for metrics collection/query, but until the timeline server v2 is
>>>>>>>>>>> ready and released.  For me the important thing is that we have a decent
>>>>>>>>>>> time series DB that comes with storm by default and is pluggable so we can
>>>>>>>>>>> replace it with something else that has similar capabilities in the future.
>>>>>>>>>>>  - Bobby
>>>>>>>>>>> 
>>>>>>>>>>>    On Friday, March 18, 2016 10:39 AM, Cody Innowhere <e.neverme@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>  It's actually in Phase 2 of porting JStorm, but I'm absolutely ok to
>>>>>>>>>>> discuss this in advance.
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <e.neverme@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Yes it's already in production.
>>>>>>>>>>>> The implementation basically follows the design document in
>>>>>>>>>>>> https://issues.apache.org/jira/browse/STORM-1329, you can take a look
>>>>>>>>>>>> first and feel free to ask questions.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <kabhwan@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I got something to do with metrics so I'm seeking the pull requests which
>>>>>>>>>>>>> addresses metrics.
>>>>>>>>>>>>> And at #753 <https://github.com/apache/storm/pull/753> I found Cody said we
>>>>>>>>>>>>> (maybe it means Alibaba team) are currently working on Metrics Server.
>>>>>>>>>>>>> (I also found comment which said there was some talk while ago around
>>>>>>>>>>>>> integrating Hadoop timeline server. Seems like no one came up with the
>>>>>>>>>>>>> result, and I prefer to avoid big dependency so I'm in favor of Metrics
>>>>>>>>>>>>> Server for now.)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think that would improve metrics feature of Storm much better, so I'd
>>>>>>>>>>>>> like to see how the work is going. Sure it's only when there's no issue for
>>>>>>>>>>>>> you to work transparently. I just would like to prevent duplication of
>>>>>>>>>>>>> work, and would like to help if needed and possible.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Regards,
>>>>>> Abhishek Agarwal
>

Re: Reply: Question on Metrics Server to Alibaba team

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Agreed. Storm should be able to operate with a minimum of external dependencies.

Beyond ZooKeeper, we don't have any strict runtime service dependencies. We should keep it that way. I'm all for sane default/reference API implementations combined with the ability to swap out alternative implementations at runtime.

With pacemaker, the ZooKeeper dependency is even starting to blur.

Keeping Storm as standalone as possible is just as important as integration with resource negotiation frameworks. Users don't want to be forced into a specific architecture.

-Taylor
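
A default implementation with swappable alternatives, as described above, might look roughly like the following Python sketch. Everything in it is hypothetical: `MetricStore`, `InMemoryStore`, `register_store`, `create_store` and the `metrics.store` config key are illustrative names, not Storm or JStorm APIs.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class MetricStore(ABC):
    """Hypothetical back-end interface that a metrics front end writes to."""

    @abstractmethod
    def write(self, name: str, timestamp: int, value: float) -> None: ...

    @abstractmethod
    def query(self, name: str, start: int, end: int) -> List[Point]: ...


class InMemoryStore(MetricStore):
    """Reference implementation with no external service dependency."""

    def __init__(self) -> None:
        self._points: Dict[str, List[Point]] = {}

    def write(self, name: str, timestamp: int, value: float) -> None:
        self._points.setdefault(name, []).append((timestamp, value))

    def query(self, name: str, start: int, end: int) -> List[Point]:
        # Return points within [start, end], ordered by timestamp.
        return sorted(p for p in self._points.get(name, []) if start <= p[0] <= end)


# Registry of available back ends; "memory" is the built-in default.
_REGISTRY: Dict[str, Callable[[], MetricStore]] = {"memory": InMemoryStore}


def register_store(kind: str, factory: Callable[[], MetricStore]) -> None:
    """Make an alternative back end selectable by name."""
    _REGISTRY[kind] = factory


def create_store(conf: dict) -> MetricStore:
    """Pick the back end at runtime from a (hypothetical) config key."""
    kind = conf.get("metrics.store", "memory")
    if kind not in _REGISTRY:
        raise ValueError(f"unknown metrics store: {kind}")
    return _REGISTRY[kind]()
```

With an empty config the caller gets the built-in store, while an operator who prefers an external time series database would register a factory for it and select it by name, without the front end changing.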

> On Mar 25, 2016, at 9:43 AM, Bobby Evans <ev...@yahoo-inc.com.INVALID> wrote:
> 
> My concern is really around how much time/effort it is to get to a final solution, and to ultimately maintain/support that solution.  If I was doing this from scratch I would probably pull something off of the shelf that is tested and has an entire community supporting it instead of writing something ourselves from scratch.  But in this case we have a solution from JStorm, that we know works.  Because this is the backend that we are talking about we can switch things out later on if we need to.  Like I said before I am fine with using the JStorm code initially.  I mostly want to be sure of a few things.
> 1. The metrics interface we expose to end users is well thought out and can be extended in the future.
> 2. The interfaces that connect this front end to the back end are thought out and we could replace the back end if needed.
> 3. The solution offers some level of high availability.  If Nimbus, a worker, etc. crash it is OK to lose some data, but we don't want to 
>  - Bobby 
> 
>    On Friday, March 25, 2016 6:26 AM, Cody Innowhere <e....@gmail.com> wrote:
> 
> 
> Bobby,
> I understand your concern. Still, I think our metrics design in JStorm can
> work without any external service, as I mentioned above, we can store
> metrics in rocksdb on nimbus server. A rough thought will be: we store the
> latest 1 hour of 1-min window data, 10 hours of 10-min window data, 5 days
> of 2-hour window data, 30 days of 1-day window, etc. And if there's the
> need to sync metrics data between nimbus servers, we can add a sync thread
> to handle nimbus fail-over, since it's just metrics data that don't really
> matter too much, we can use a plain simple sync model.
> 
> The external service is another option to end users, if users feel it's
> important (or maybe their business built on top of storm is very
> important), they can use this external service to build their own monitor
> system which can be more useful than the original solution shipped with
> storm.
> 
> On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans <ev...@yahoo-inc.com.invalid>
> wrote:
> 
>> The problem is that we want something for storm that can work out of the
>> box, ideally without some other complicated external service (except
>> zookeeper which we already have, and is not actually that complex to setup
>> and run).
>> If we feel that we must have some external state store that is required
>> for storm to run, then we need to make the decision carefully and
>> deliberately.
>>   - Bobby
>> 
>>     On Wednesday, March 23, 2016 8:37 AM, John Fang <
>> xiaojian.fxj@alibaba-inc.com> wrote:
>> 
>> 
>>   Sorry, I misunderstood. We will make TopologyMaster highly available,
>> and metric meta will be stored in HDFS, so the metrics meta won't rely on
>> nimbus. This should improve the stability of the metrics system.
>> 
>> -----Original Message-----
>> From: Cody Innowhere [mailto:e.neverme@gmail.com]
>> Sent: March 23, 2016 19:59
>> To: dev@storm.apache.org
>> Subject: Re: Question on Metrics Server to Alibaba team
>> 
>> If we don't rely on any external system, our metrics system is still
>> available but will store metrics meta/data in rocksdb on nimbus servers.
>> There will be limits though, for example, we cannot store metrics data all
>> through the topology lifecycle, because rocksdb is only a KV storage, it
>> may not support efficient scan operations and too much data in local disk
>> may bring in extra IO overhead, so we may have to store latest 1hour of m1
>> data, 6 hours of m10 data as such (currently not implemented in JStorm, but
>> quite easy to do this).
>> 
>> TopologyMaster is merely a channel for registering/computing/uploading
>> metrics to nimbus, so if a TM goes down, the topology metrics will be
>> unavailable for a while before it gets pulled up somewhere else(for a
>> normal failover case, this should be very fast), while supervisor/nimbus
>> metrics are unaffected as they're sent to nimbus via thrift interface. As
>> long as TM is back, the topology metrics will be available again.
>> 
>> Currently JStorm syncs metrics meta between multiple nimbus servers, but
>> metrics data is not synced. So under a nimbus failure, we may lose some
>> metrics data.
>> 
>> 
>>> On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>>> 
>>> John,
>>> 
>>> My concern is H/A of metrics on Storm by default. (I'm not 100% sure
>>> Bobby was pointing out the same thing.)
>>> 
>>> Apache Storm is used by a wide variety of users, so we can't assume
>>> that they know external systems (including the Hadoop ecosystem, in my
>>> personal opinion) and can operate them smoothly.
>>> That reminds me how important the default setup is.
>>> 
>>> Therefore, I'm curious whether the new metrics feature of JStorm can work
>>> smoothly without an external system (HBase / OTS). I'd love to see it
>>> support H/A without other systems; otherwise users have to tolerate loss
>>> of metrics in some scenarios.
>>> 
>>> I guess these are valid questions on H/A (if my understanding of the
>>> design doc is right): How do metrics work when TopologyMaster is down?
>>> And how do metrics work when Nimbus fails over?
>>> 
>>> Personally I don't mind losing metrics for short durations (I just want
>>> to check the availability of H/A), but a failure shouldn't mess up the
>>> whole set of metrics.
>>> 
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>> 
>>> On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:
>>> 
>>>> @Bobby Evans The JStorm code has gone through a lot of testing over the
>>>> past few years, especially for HA and scalability. We have done a lot of
>>>> optimization on metrics. The performance is better than Flink in
>>>> my tests. In my personal opinion, the metrics in JStorm offer a great
>>>> deal of information, and they can tell us where the bottleneck is
>>>> when we run a topology. The performance bottleneck may be
>>>> serialization, deserialization, netty, the executor, and so on. Of
>>>> course, there are also other good monitoring systems in the world, so I
>>>> hope we can choose the better monitoring solution before phase 2. I will
>>>> start studying Atlas. If it is better, I would be pleased to redesign the
>>>> metrics on top of Atlas.
>>>> -----Original Message-----
>>>> From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
>>>> Sent: March 22, 2016 22:36
>>>> To: dev@storm.apache.org
>>>> Subject: Re: Question on Metrics Server to Alibaba team
>>>> 
>>>> My personal opinion is that we should not reinvent the wheel (aka
>>>> distributed fault tolerant metrics) ourselves.  The local file
>>>> blobstore with nimbus HA was a big enough pain to write and it is
>>>> relatively simple in comparison.
>>>> If the JStorm code is simple and offers everything we need in terms
>>>> of HA and scalability then I would be OK with it, but if it doesn't
>>>> I would
>>> lean
>>>> towards a different compatible open source solution.
>>>> 
>>>> https://github.com/Netflix/atlas
>>>> looks very promising as a default option.  It is actively maintained
>>>> by a group that I think has some of the best monitoring in the
>>>> world.  And it
>>> is
>>>> both java and apache compatible.  It has no histogram support that I
>>> could
>>>> find, but that I don't see as being super critical.  The biggest
>>>> drawback is there is little documentation on how to use it, to
>>>> really be able to evaluate it for our needs. - Bobby
>>>> 
>>>>     On Monday, March 21, 2016 7:29 PM, Jungtaek Lim
>>>> <ka...@gmail.com>
>>>> wrote:
>>>> 
>>>> 
>>>>   Harsha,
>>>> 
>>>> That's why I think new metric feature of JStorm looks promising.
>>>> 
>>>> According to design doc on
>>>> https://issues.apache.org/jira/browse/STORM-1329,
>>>> there's no distinction between topology stat (which Apache Storm
>>>> includes to worker heartbeat) and built-in metrics (which should be
>>>> handled with separate consumer, as you stated).
>>>> All metrics are passed to Nimbus and Nimbus cached metrics, which
>>>> implies we can treat all metrics as same, and we can also provide
>>>> built-in
>>> metrics
>>>> (including custom metrics) to users via REST API, too.
>>>> 
>>>> I thought about standalone metrics server process which handles
>>>> whole metric works (maybe TopologyMaster + Nimbus on design doc),
>>>> but if
>>> current
>>>> implementation of metric feature on JStorm can take care of what I'm
>>>> assuming, I guess it's great enough.
>>>> 
>>>> Since I don't know about TopologyMaster, I just wonder that there're
>>>> any SPOFs (including soft) and how metrics work when if component of
>>>> SPOF
>>> goes
>>>> down.
>>>> Since Cody gives digging point to take a look at, we can evaluate
>>>> that feature before phase 2.
>>>> 
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>> 
>>>> On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
>>>> 
>>>>> One of the goals of this work, which can probably be addressed in a
>>>>> separate JIRA, is how the topology metrics reporter works. Today
>>>>> it's a bolt that's part of the topology graph, which means it's another
>>>>> node in the topology DAG that needs to be tuned for better
>>>>> performance. Some of our users took performance hits by deploying a
>>>>> topology metrics reporter that sends metrics to Ganglia.
>>>>> Ideally this collection should be asynchronous and not be a node in
>>>>> the topology DAG.
>>>>> 
>>>>> Shipping a default metrics server along with a pluggable option
>>>>> for users who want to use Graphite or other timeline servers should
>>>>> be the goal.
>>>>> 
>>>>> --Harsha
>>>>> 
>>>>> 
>>>>>> On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
>>>>>> @Cody - The design looks good. Does the design allow to
>>>>>> aggregate metrics at the task/executor level? Basically, number
>>>>>> of distinct metrics is proportional to the number of distinct
>>>>>> tasks, did you ever run into such a use case?
>>>>>> 
>>>>>> 
>>>>>> On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
>>>>>> <e....@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Also, you can read the code from our latest release JStorm 2.1.1.
>>>>>>> 
>>>>>>> On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
>>>>>>> <e....@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> @Jungtaek,
>>>>>>>> We did some tests on codahale metrics, compared to
>>>>>>>> meters/histograms, counters are quite fast. So we mainly
>>>>>>>> focused on the optimization of
>>>>>>> meters
>>>>>>>> and histograms (they are indeed very slow) including double
>>>>>>>> sampling, changing the clock from ns (System.nanoTime) to
>>>>>>>> ms,
>>> etc.
>>>>>>>> You can take a look at the
>>>>>>>> "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
>>>>>>>> class of our sequence-split-merge example code, as the
>>>>>>>> client code entry to
>>>>> metrics.
>>>>>>>> After that, you may dig to TopologyMaster class, which is
>>>>>>>> still part
>>>>> of a
>>>>>>>> topology, and then to TopologyMetricsRunnable, which is a
>>>>>>>> part of
>>>>> nimbus
>>>>>>>> server, finally to MetricUploader plugin, this is where the
>>>>>>>> metrics interfere with our "metrics server". Still, there're
>>>>>>>> some nits in the
>>>>>>> code,
>>>>>>>> but I think that should be no big problem.
>>>>>>>> 
>>>>>>>> I'd also like to point out that our "metrics server" is not
>>>>>>>> strictly
>>>>> a
>>>>>>>> real metrics server, since most of the duty lies on nimbus
>>>>>>>> server and topology master, it's more appropriate to call it
>>>> metrics storage.
>>>>> The
>>>>>>> main
>>>>>>>> reason for this is that we don't want to make a heavy-weight
>>>>>>>> metrics
>>>>>>> server
>>>>>>>> out of JStorm, and this makes us very easy to maintain (we
>>>>>>>> have teams
>>>>>>> that
>>>>>>>> specifically maintain HBase/OTS in Alibaba since they're so
>>>>>>>> commonly
>>>>> used
>>>>>>>> in production).
>>>>>>>> 
>>>>>>>> On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
>>>>>>>> <ka...@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Thanks Cody and Bobby for the explanation.
>>>>>>>>> 
>>>>>>>>> Cody,
>>>>>>>>> I took a look at design doc and looks promising, especially
>>>>>>>>> it
>>>>> doesn't
>>>>>>> do
>>>>>>>>> sampling when metric type is 'counter'. As far as I heard
>>>>>>>>> (I didn't
>>>>> try
>>>>>>>>> it)
>>>>>>>>> it becomes huge performance hit in Apache Storm when we
>>>>>>>>> change
>>>>> sample
>>>>>>> rate
>>>>>>>>> to 1.0.
>>>>>>>>> Could you guide the entry point of metric feature in JStorm
>>>>>>>>> to dig
>>>>> into?
>>>>>>>>> 
>>>>>>>>> And just a curiosity, did you consider extracting metric
>>>>>>>>> feature
>>>>> (which
>>>>>>> is
>>>>>>>>> done with TopologyMasters and Nimbuses) into separate
>> component?
>>>>>>>>> I understood your mention to 'metrics server' as separate
>>>>> component, but
>>>>>>>>> after seeing design doc, feature seems to be implemented on
>>>> Nimbus.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>> 
>>>>>>>>> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere
>>>>>>>>> <e....@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> JStorm has provided a MetricUploader interface, which is
>>>>>>>>>> similar
>>>>> to
>>>>>>>>>> IMetricsConsumer in storm, and the underlying
>>>>>>>>>> implementation is
>>>>>>>>> pluggable,
>>>>>>>>>> you can use HBase, or any other KV store that supports
>>>>>>>>>> timeline
>>>>>>> queries
>>>>>>>>> or
>>>>>>>>>> even a database(maybe for it's a small cluster). We
>>>>>>>>>> provide model
>>>>>>>>> classes
>>>>>>>>>> in jstorm-core, as to what kinds of metrics data need to
>>>>>>>>>> be
>>>>> stored,
>>>>>>> it's
>>>>>>>>>> totally up to the detailed implementation. Our internal
>>>>> implementation
>>>>>>>>> uses
>>>>>>>>>> OTS, which is a product of aliyun (
>>>>>>> https://www.aliyun.com/product/ots/
>>>>>>>>> ),
>>>>>>>>>> but it's easy to adapt to other implementations.
>>>>>>>>>> 
>>>>>>>>>> On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
>>>>>>>>> <evans@yahoo-inc.com.invalid
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Yes we originally wanted to try and use the Hadoop
>>>>>>>>>>> Timeline
>>>>> Server
>>>>>>> for
>>>>>>>>>>> storm metrics feedback to nimbus + UI + history like
>> server.
>>>>> But it
>>>>>>>>> was
>>>>>>>>>>> not stable at the time, so we stopped.  For the sake of
>>>>>>>>>>> playing
>>>>>>> nicely
>>>>>>>>>> with
>>>>>>>>>>> the rest of the big data ecosystem I would like to see
>>>>>>>>>>> us
>>>>> support it
>>>>>>>>> as
>>>>>>>>>> an
>>>>>>>>>>> option for metrics collection/query, but until the
>>>>>>>>>>> timeline
>>>>> server
>>>>>>> v2
>>>>>>>>> is
>>>>>>>>>>> ready and released.  For me the important thing is that
>>>>>>>>>>> we have
>>>>> a
>>>>>>>>> decent
>>>>>>>>>>> time series DB that comes with storm by default and is
>>>>> pluggable so
>>>>>>> we
>>>>>>>>>> can
>>>>>>>>>>> replace it with something else that has similar
>>>>>>>>>>> capabilities in
>>>>> the
>>>>>>>>>> future.
>>>>>>>>>>>   - Bobby
>>>>>>>>>>> 
>>>>>>>>>>>     On Friday, March 18, 2016 10:39 AM, Cody Innowhere <
>>>>>>>>>>> e.neverme@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>   It's actually in Phase 2 of porting JStorm, but I'm
>>>>>>>>>>> absolutely
>>>>> ok
>>>>>>> to
>>>>>>>>>>> discuss this in advance.
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
>>>>>>> e.neverme@gmail.com
>>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Yes it's already in production.
>>>>>>>>>>>> The implementation basically follows the design
>>>>>>>>>>>> document in
>>>>>>>>>>>> https://issues.apache.org/jira/browse/STORM-1329, you
>>>>>>>>>>>> can
>>>>> take a
>>>>>>>>> look
>>>>>>>>>>>> first and feel free to ask questions.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <
>>>>> kabhwan@gmail.com
>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I got something to do with metrics so I'm seeking
>>>>>>>>>>>>> the pull
>>>>>>> requests
>>>>>>>>>>> which
>>>>>>>>>>>>> addresses metrics.
>>>>>>>>>>>>> And at #753
>>>>>>>>>>>>> <https://github.com/apache/storm/pull/753> I
>>>>> found
>>>>>>>>> Cody
>>>>>>>>>>> said
>>>>>>>>>>>>> we
>>>>>>>>>>>>> (maybe it means Alibaba team) are currently working
>>>>>>>>>>>>> on
>>>>> Metrics
>>>>>>>>> Server.
>>>>>>>>>>>>> (I also found comment which said there was some talk
>>>>>>>>>>>>> while
>>>>> ago
>>>>>>>>> around
>>>>>>>>>>>>> integrating Hadoop timeline server. Seems like no
>>>>>>>>>>>>> one came up
>>>>>>> with
>>>>>>>>> the
>>>>>>>>>>>>> result, and I prefer to avoid big dependency so I'm
>>>>>>>>>>>>> in favor
>>>>> of
>>>>>>>>>> Metrics
>>>>>>>>>>>>> Server for now.)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think that would improve metrics feature of Storm
>>>>>>>>>>>>> much
>>>>> better,
>>>>>>> so
>>>>>>>>>> I'd
>>>>>>>>>>>>> like to see how the work is going. Sure it's only
>>>>>>>>>>>>> when
>>>>> there's no
>>>>>>>>>> issue
>>>>>>>>>>>>> for
>>>>>>>>>>>>> you to work transparently. I just would like to
>>>>>>>>>>>>> prevent
>>>>>>>>> duplication of
>>>>>>>>>>>>> work, and would like to help if needed and possible.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Regards,
>>>>>> Abhishek Agarwal
>

Re: Reply: Question on Metrics Server to Alibaba team

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.

My concern is really about how much time and effort it takes to get to a final solution, and ultimately to maintain and support it. If I were doing this from scratch, I would probably pull something off the shelf that is tested and has an entire community supporting it, rather than writing something ourselves. But in this case we have a solution from JStorm that we know works. Because this is the back end we are talking about, we can switch things out later if we need to. Like I said before, I am fine with using the JStorm code initially. I mostly want to be sure of a few things:
1. The metrics interface we expose to end users is well thought out and can be extended in the future.
2. The interfaces that connect this front end to the back end are thought out, and we could replace the back end if needed.
3. The solution offers some level of high availability. If Nimbus, a worker, etc. crashes, it is OK to lose some data, but we don't want to 
 - Bobby 

    On Friday, March 25, 2016 6:26 AM, Cody Innowhere <e....@gmail.com> wrote:
 

 Bobby,
I understand your concern. Still, I think our metrics design in JStorm can
work without any external service, as I mentioned above, we can store
metrics in rocksdb on nimbus server. A rough thought will be: we store the
latest 1 hour of 1-min window data, 10 hours of 10-min window data, 5 days
of 2-hour window data, 30 days of 1-day window, etc. And if there's the
need to sync metrics data between nimbus servers, we can add a sync thread
to handle nimbus fail-over, since it's just metrics data that don't really
matter too much, we can use a plain simple sync model.
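
To make the proposed retention windows concrete, here is a small sketch of the arithmetic. It is only an illustration under stated assumptions: the `Tier` and `BoundedSeries` classes are hypothetical, nothing here uses RocksDB or reflects JStorm's actual storage code, and it simply treats each tier as a fixed-size buffer whose capacity is retention divided by window size.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: these classes are hypothetical and are not
// taken from JStorm or Storm.
class RetentionTiersSketch {

    /** One retention tier: points of a given window size kept for a given time. */
    static final class Tier {
        final Duration window;
        final Duration retention;

        Tier(Duration window, Duration retention) {
            if (window.isZero() || window.isNegative() || retention.compareTo(window) < 0) {
                throw new IllegalArgumentException("retention must cover at least one window");
            }
            this.window = window;
            this.retention = retention;
        }

        /** Maximum number of points kept per metric in this tier. */
        long maxPoints() {
            return retention.toMillis() / window.toMillis();
        }
    }

    /** The four tiers from the message above. */
    static final List<Tier> TIERS = Arrays.asList(
            new Tier(Duration.ofMinutes(1), Duration.ofHours(1)),    // 60 points
            new Tier(Duration.ofMinutes(10), Duration.ofHours(10)),  // 60 points
            new Tier(Duration.ofHours(2), Duration.ofDays(5)),       // 60 points
            new Tier(Duration.ofDays(1), Duration.ofDays(30)));      // 30 points

    /** Total points kept per metric across all tiers. */
    static long totalPointsPerMetric() {
        long total = 0;
        for (Tier tier : TIERS) {
            total += tier.maxPoints();
        }
        return total;
    }

    /** Fixed-capacity series that drops the oldest point when a tier is full. */
    static final class BoundedSeries {
        private final Deque<Double> points = new ArrayDeque<>();
        private final long capacity;

        BoundedSeries(Tier tier) {
            this.capacity = tier.maxPoints();
        }

        void append(double value) {
            if (points.size() == capacity) {
                points.removeFirst(); // evict the oldest window
            }
            points.addLast(value);
        }

        int size() {
            return points.size();
        }

        double oldest() {
            return points.getFirst();
        }
    }

    public static void main(String[] args) {
        for (Tier tier : TIERS) {
            System.out.println(tier.window + " windows for " + tier.retention
                    + ": " + tier.maxPoints() + " points");
        }
        System.out.println("total per metric: " + totalPointsPerMetric());
    }
}
```

Under these assumptions each metric needs at most 60 + 60 + 60 + 30 = 210 stored points, which is the kind of bound that keeps local disk usage on the nimbus server predictable.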

The external service is another option to end users, if users feel it's
important (or maybe their business built on top of storm is very
important), they can use this external service to build their own monitor
system which can be more useful than the original solution shipped with
storm.

On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans <ev...@yahoo-inc.com.invalid>
wrote:

> The problem is that we want something for storm that can work out of the
> box, ideally without some other complicated external service (except
> zookeeper which we already have, and is not actually that complex to setup
> and run).
> If we feel that we must have some external state store that is required
> for storm to run, then we need to make the decision carefully and
> deliberately.
>  - Bobby
>
>    On Wednesday, March 23, 2016 8:37 AM, John Fang <
> xiaojian.fxj@alibaba-inc.com> wrote:
>
>
>  Sorry, I misunderstood. We will make TopologyMaster highly available,
> and metric meta will be stored in HDFS, so the metrics meta won't rely on
> nimbus. This should improve the stability of the metrics system.
>
> -----Original Message-----
> From: Cody Innowhere [mailto:e.neverme@gmail.com]
> Sent: March 23, 2016 19:59
> To: dev@storm.apache.org
> Subject: Re: Question on Metrics Server to Alibaba team
>
> If we don't rely on any external system, our metrics system is still
> available but will store metrics meta/data in rocksdb on nimbus servers.
> There will be limits though, for example, we cannot store metrics data all
> through the topology lifecycle, because rocksdb is only a KV storage, it
> may not support efficient scan operations and too much data in local disk
> may bring in extra IO overhead, so we may have to store latest 1hour of m1
> data, 6 hours of m10 data as such (currently not implemented in JStorm, but
> quite easy to do this).
>
> TopologyMaster is merely a channel for registering/computing/uploading
> metrics to nimbus, so if a TM goes down, the topology metrics will be
> unavailable for a while before it gets pulled up somewhere else(for a
> normal failover case, this should be very fast), while supervisor/nimbus
> metrics are unaffected as they're sent to nimbus via thrift interface. As
> long as TM is back, the topology metrics will be available again.
>
> Currently JStorm syncs metrics meta between multiple nimbus servers, but
> metrics data is not synced. So under a nimbus failure, we may lose some
> metrics data.
>
>
> On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:
>
> > John,
> >
> > My concern is H/A of metrics on Storm by default. (I'm not 100% sure
> > Bobby was pointing out the same thing.)
> >
> > Apache Storm is used by a wide variety of users, so we can't assume
> > that they know external systems (including the Hadoop ecosystem, in my
> > personal opinion) and can operate them smoothly.
> > That reminds me how important the default setup is.
> >
> > Therefore, I'm curious whether the new metrics feature of JStorm can work
> > smoothly without an external system (HBase / OTS). I'd love to see it
> > support H/A without other systems; otherwise users have to tolerate loss
> > of metrics in some scenarios.
> >
> > I guess these are valid questions on H/A (if my understanding of the
> > design doc is right): How do metrics work when TopologyMaster is down?
> > And how do metrics work when Nimbus fails over?
> >
> > Personally I don't mind losing metrics for short durations (I just want
> > to check the availability of H/A), but a failure shouldn't mess up the
> > whole set of metrics.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:
> >
> > > @Bobby Evans The JStorm code has gone through a lot of testing over the
> > > past few years, especially for HA and scalability. We have done a lot of
> > > optimization on metrics. The performance is better than Flink in
> > > my tests. In my personal opinion, the metrics in JStorm offer a great
> > > deal of information, and they can tell us where the bottleneck is
> > > when we run a topology. The performance bottleneck may be
> > > serialization, deserialization, netty, the executor, and so on. Of
> > > course, there are also other good monitoring systems in the world, so I
> > > hope we can choose the better monitoring solution before phase 2. I will
> > > start studying Atlas. If it is better, I would be pleased to redesign the
> > > metrics on top of Atlas.
> > > -----Original Message-----
> > > From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
> > > Sent: March 22, 2016 22:36
> > > To: dev@storm.apache.org
> > > Subject: Re: Question on Metrics Server to Alibaba team
> > >
> > > My personal opinion is that we should not reinvent the wheel (aka
> > > distributed fault tolerant metrics) ourselves.  The local file
> > > blobstore with nimbus HA was a big enough pain to write and it is
> > > relatively simple in comparison.
> > > If the JStorm code is simple and offers everything we need in terms
> > > of HA and scalability then I would be OK with it, but if it doesn't
> > > I would
> > lean
> > > towards a different compatible open source solution.
> > >
> > > https://github.com/Netflix/atlas
> > > looks very promising as a default option.  It is actively maintained
> > > by a group that I think has some of the best monitoring in the
> > > world.  And it
> > is
> > > both java and apache compatible.  It has no histogram support that I
> > could
> > > find, but that I don't see as being super critical.  The biggest
> > > drawback is there is little documentation on how to use it, to
> > > really be able to evaluate it for our needs. - Bobby
> > >
> > >    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim
> > > <ka...@gmail.com>
> > > wrote:
> > >
> > >
> > >  Harsha,
> > >
> > > That's why I think new metric feature of JStorm looks promising.
> > >
> > > According to design doc on
> > > https://issues.apache.org/jira/browse/STORM-1329,
> > > there's no distinction between topology stat (which Apache Storm
> > > includes to worker heartbeat) and built-in metrics (which should be
> > > handled with separate consumer, as you stated).
> > > All metrics are passed to Nimbus and Nimbus cached metrics, which
> > > implies we can treat all metrics as same, and we can also provide
> > > built-in
> > metrics
> > > (including custom metrics) to users via REST API, too.
> > >
> > > I thought about standalone metrics server process which handles
> > > whole metric works (maybe TopologyMaster + Nimbus on design doc),
> > > but if
> > current
> > > implementation of metric feature on JStorm can take care of what I'm
> > > assuming, I guess it's great enough.
> > >
> > > Since I don't know about TopologyMaster, I just wonder that there're
> > > any SPOFs (including soft) and how metrics work when if component of
> > > SPOF
> > goes
> > > down.
> > > Since Cody gives digging point to take a look at, we can evaluate
> > > that feature before phase 2.
> > >
> > > Thanks,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
> > >
> > > > One of the goals of this work, which can probably be addressed in a
> > > > separate JIRA, is how the topology metrics reporter works. Today
> > > > it's a bolt that's part of the topology graph, which means it's another
> > > > node in the topology DAG that needs to be tuned for better
> > > > performance. Some of our users took performance hits by deploying a
> > > > topology metrics reporter that sends metrics to Ganglia.
> > > > Ideally this collection should be asynchronous and not be a node in
> > > > the topology DAG.
> > > >
> > > > Shipping a default metrics server along with a pluggable option
> > > > for users who want to use Graphite or other timeline servers should
> > > > be the goal.
> > > >
> > > > --Harsha
> > > >
> > > >
> > > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > > @Cody - The design looks good. Does the design allow to
> > > > > aggregate metrics at the task/executor level? Basically, number
> > > > > of distinct metrics is proportional to the number of distinct
> > > > > tasks, did you ever run into such a use case?
> > > > >
> > > > >
> > > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
> > > > > <e....@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
> > > > > > <e....@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > @Jungtaek,
> > > > > > > We did some tests on codahale metrics, compared to
> > > > > > > meters/histograms, counters are quite fast. So we mainly
> > > > > > > focused on the optimization of
> > > > > > meters
> > > > > > > and histograms (they are indeed very slow) including double
> > > > > > > sampling, changing the clock from ns (System.nanoTime) to
> > > > > > > ms,
> > etc.
> > > > > > > You can take a look at the
> > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > class of our sequence-split-merge example code, as the
> > > > > > > client code entry to
> > > > metrics.
> > > > > > > After that, you may dig to TopologyMaster class, which is
> > > > > > > still part
> > > > of a
> > > > > > > topology, and then to TopologyMetricsRunnable, which is a
> > > > > > > part of
> > > > nimbus
> > > > > > > server, finally to MetricUploader plugin, this is where the
> > > > > > > metrics interfere with our "metrics server". Still, there're
> > > > > > > some nits in the
> > > > > > code,
> > > > > > > but I think that should be no big problem.
> > > > > > >
> > > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > > strictly
> > > > a
> > > > > > > real metrics server, since most of the duty lies on nimbus
> > > > > > > server and topology master, it's more appropriate to call it
> > > metrics storage.
> > > > The
> > > > > > main
> > > > > > > reason for this is that we don't want to make a heavy-weight
> > > > > > > metrics
> > > > > > server
> > > > > > > out of JStorm, and this makes us very easy to maintain (we
> > > > > > > have teams
> > > > > > that
> > > > > > > specifically maintain HBase/OTS in Alibaba since they're so
> > > > > > > commonly
> > > > used
> > > > > > > in production).
> > > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
> > > > > > > <ka...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > >>
> > > > > > >> Cody,
> > > > > > >> I took a look at design doc and looks promising, especially
> > > > > > >> it
> > > > doesn't
> > > > > > do
> > > > > > >> sampling when metric type is 'counter'. As far as I heard
> > > > > > >> (I didn't
> > > > try
> > > > > > >> it)
> > > > > > >> it becomes huge performance hit in Apache Storm when we
> > > > > > >> change
> > > > sample
> > > > > > rate
> > > > > > >> to 1.0.
> > > > > > >> Could you guide the entry point of metric feature in JStorm
> > > > > > >> to dig
> > > > into?
> > > > > > >>
> > > > > > >> And just a curiosity, did you consider extracting metric
> > > > > > >> feature
> > > > (which
> > > > > > is
> > > > > > >> done with TopologyMasters and Nimbuses) into separate
> component?
> > > > > > >> I understood your mention to 'metrics server' as separate
> > > > component, but
> > > > > > >> after seeing design doc, feature seems to be implemented on
> > > Nimbus.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > >>
> > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere
> > > > > > >> <e....@gmail.com> wrote:
> > > > > > >>
> > > > > > >> > JStorm has provided a MetricUploader interface, which is
> > > > > > >> > similar
> > > > to
> > > > > > >> > IMetricsConsumer in storm, and the underlying
> > > > > > >> > implementation is
> > > > > > >> pluggable,
> > > > > > >> > you can use HBase, or any other KV store that supports
> > > > > > >> > timeline
> > > > > > queries
> > > > > > >> or
> > > > > > >> > even a database(maybe for it's a small cluster). We
> > > > > > >> > provide model
> > > > > > >> classes
> > > > > > >> > in jstorm-core, as to what kinds of metrics data need to
> > > > > > >> > be
> > > > stored,
> > > > > > it's
> > > > > > >> > totally up to the detailed implementation. Our internal
> > > > implementation
> > > > > > >> uses
> > > > > > >> > OTS, which is a product of aliyun (
> > > > > > https://www.aliyun.com/product/ots/
> > > > > > >> ),
> > > > > > >> > but it's easy to adapt to other implementations.
> > > > > > >> >

Re: RE: Question on Metrics Server to Alibaba team

Posted by Cody Innowhere <e....@gmail.com>.

Bobby,
I understand your concern. Still, I think our metrics design in JStorm can
work without any external service: as I mentioned above, we can store
metrics in RocksDB on the nimbus server. A rough idea: we keep the latest
1 hour of 1-min window data, 10 hours of 10-min window data, 5 days of
2-hour window data, 30 days of 1-day window data, etc. If metrics data
needs to be synced between nimbus servers, we can add a sync thread to
handle nimbus fail-over. Since it's just metrics data, which doesn't matter
too much, a plain, simple sync model will do.
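
The tiered retention described above can be sketched as follows. This is
only an illustration of the window/retention arithmetic, not JStorm code:
the class and method names (RetentionTiers, Tier, maxPoints, isExpired) are
hypothetical, and the tiers are simply the four from the proposal.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not part of JStorm or Storm.
class RetentionTiers {
    /** One rollup tier: points of a given window size are kept for a given period. */
    static final class Tier {
        final Duration window;
        final Duration retention;

        Tier(Duration window, Duration retention) {
            this.window = window;
            this.retention = retention;
        }

        /** Upper bound on the number of points one metric holds in this tier. */
        long maxPoints() {
            return retention.toMillis() / window.toMillis();
        }

        /** True if a point with this timestamp is older than the tier's retention. */
        boolean isExpired(long pointTimeMs, long nowMs) {
            return nowMs - pointTimeMs > retention.toMillis();
        }
    }

    // The four tiers from the proposal: (window size, how long it is kept).
    static final List<Tier> TIERS = Arrays.asList(
        new Tier(Duration.ofMinutes(1), Duration.ofHours(1)),
        new Tier(Duration.ofMinutes(10), Duration.ofHours(10)),
        new Tier(Duration.ofHours(2), Duration.ofDays(5)),
        new Tier(Duration.ofDays(1), Duration.ofDays(30)));

    /** Upper bound on stored points per metric across all tiers. */
    static long totalPointsPerMetric() {
        long total = 0;
        for (Tier t : TIERS) {
            total += t.maxPoints();
        }
        return total;
    }

    public static void main(String[] args) {
        for (Tier t : TIERS) {
            System.out.println(t.window + " window, kept " + t.retention
                + ": at most " + t.maxPoints() + " points");
        }
        System.out.println("per metric, all tiers: " + totalPointsPerMetric());
    }
}
```

With these tiers each metric holds at most 60 + 60 + 60 + 30 = 210 points,
so the local store stays bounded no matter how long a topology runs.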

The external service is another option for end users. If users feel it's
important (or maybe their business built on top of Storm is very
important), they can use this external service to build their own
monitoring system, which can be more useful than the default solution
shipped with Storm.

On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans <ev...@yahoo-inc.com.invalid>
wrote:

> The problem is that we want something for storm that can work out of the
> box, ideally without some other complicated external service (except
> zookeeper, which we already have and which is not actually that complex
> to set up and run).
> If we feel that we must have some external state store that is required
> for storm to run, then we need to make the decision carefully and
> deliberately.
>  - Bobby

Re: RE: Question on Metrics Server to Alibaba team

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.

The problem is that we want something for storm that can work out of the box, ideally without some other complicated external service (except zookeeper, which we already have and which is not actually that complex to set up and run).
If we feel that we must have some external state store that is required for storm to run, then we need to make the decision carefully and deliberately.
 - Bobby 

    On Wednesday, March 23, 2016 8:37 AM, John Fang <xi...@alibaba-inc.com> wrote:
 

 Sorry, I misunderstood. We will add H/A for TopologyMaster, and the metrics meta will be stored in HDFS, so it won't rely on nimbus. That should improve the stability of the metrics system.

-----Original Message-----
From: Cody Innowhere [mailto:e.neverme@gmail.com]
Sent: March 23, 2016 19:59
To: dev@storm.apache.org
Subject: Re: Question on Metrics Server to Alibaba team

If we don't rely on any external system, our metrics system is still available but will store metrics meta/data in RocksDB on the nimbus servers.
There will be limits, though. For example, we cannot keep metrics data for the whole topology lifecycle: RocksDB is only a KV store, so it may not support efficient scan operations, and too much data on local disk may add extra I/O overhead. We may therefore have to keep only the latest 1 hour of m1 data, 6 hours of m10 data, and so on (currently not implemented in JStorm, but quite easy to do).
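
One way to get from m1 data to m10 data is to roll the fine-grained points up into coarser windows before the old ones are dropped. The sketch below assumes that m1 and m10 mean 1-minute and 10-minute windows and that the metric is a counter, so merging is a plain sum; meters and histograms would need a different merge. WindowRollup, Point and rollUp are hypothetical names, not JStorm classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: not part of JStorm or Storm.
class WindowRollup {
    /** A counter sample: start of its window (epoch ms) and the count in that window. */
    static final class Point {
        final long startMs;
        final long count;

        Point(long startMs, long count) {
            this.startMs = startMs;
            this.count = count;
        }
    }

    /**
     * Merges points into windows of targetWindowMs by summing the counts of
     * all points whose start time falls into the same target window.
     * The result is ordered by window start.
     */
    static List<Point> rollUp(List<Point> points, long targetWindowMs) {
        TreeMap<Long, Long> buckets = new TreeMap<>();
        for (Point p : points) {
            // Align the point to the start of its target window.
            long bucketStart = p.startMs - Math.floorMod(p.startMs, targetWindowMs);
            buckets.merge(bucketStart, p.count, Long::sum);
        }
        List<Point> rolled = new ArrayList<>();
        for (Map.Entry<Long, Long> e : buckets.entrySet()) {
            rolled.add(new Point(e.getKey(), e.getValue()));
        }
        return rolled;
    }

    public static void main(String[] args) {
        // Twenty 1-minute points, one event each, starting at time 0.
        List<Point> m1 = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            m1.add(new Point(i * 60_000L, 1L));
        }
        for (Point p : rollUp(m1, 600_000L)) {
            System.out.println("window start " + p.startMs + " ms, count " + p.count);
        }
    }
}
```

Because the rolled-up points are keyed by window start, a KV store only needs point lookups and short range reads per tier rather than scans over the whole history.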

TopologyMaster is merely a channel for registering/computing/uploading metrics to nimbus, so if a TM goes down, the topology metrics will be unavailable for a while until it is brought up somewhere else (for a normal failover case, this should be very fast), while supervisor/nimbus metrics are unaffected as they're sent to nimbus via the thrift interface. Once the TM is back, the topology metrics will be available again.

Currently JStorm syncs metrics meta between multiple nimbus servers, but not metrics data. So on a nimbus failure we may lose some metrics data.


On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> John,
>
> My concern is H/A of metrics on Storm by default. (I'm not 100% sure
> Bobby pointed out the same thing.)
>
> Apache Storm is used by a wide variety of users, so we can't assume that
> they know external systems (including the Hadoop ecosystem, in my personal
> opinion) well enough to operate them smoothly.
> It reminds me how important it is to keep the default setup in mind.
>
> Therefore, I'm curious whether the new metrics feature of JStorm can work
> smoothly without an external system (HBase / OTS). I'd love to see it
> support H/A without other systems; otherwise users have to tolerate loss
> of metrics in some scenarios.
>
> I guess these are valid questions on H/A (if my understanding of the
> design doc is right): How do metrics work when TopologyMaster is down?
> And how do metrics work when a Nimbus failover occurs?
>
> Personally I don't mind losing metrics for short durations (I just want
> to check the availability of H/A), but a failure shouldn't mess up the
> whole metrics system.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:
>
> > @Bobby Evans, the JStorm code has been through a lot of testing over the
> > past few years, especially for HA and scalability. We have done a lot of
> > optimization on metrics. The performance is better than Flink in my
> > tests. In my personal opinion, the metrics in JStorm offer a great deal
> > of information, and they can tell us where the bottleneck is when we run
> > a topology. The performance bottleneck may be in
> > serialization/deserialization, netty, the executor, and so on. Of
> > course, there is also other good monitoring in the world, so I hope we
> > can choose the better monitoring before phase 2. I will start studying
> > Atlas. If it is better, I will be pleased to redesign the metrics with
> > Atlas.
> > -----Original Message-----
> > From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
> > Sent: March 22, 2016 22:36
> > To: dev@storm.apache.org
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > My personal opinion is that we should not reinvent the wheel (aka 
> > distributed fault tolerant metrics) ourselves.  The local file 
> > blobstore with nimbus HA was a big enough pain to write and it is 
> > relatively simple in comparison.
> > If the JStorm code is simple and offers everything we need in terms
> > of HA and scalability then I would be OK with it, but if it doesn't
> > I would lean towards a different compatible open source solution.
> >
> > https://github.com/Netflix/atlas
> > looks very promising as a default option.  It is actively maintained
> > by a group that I think has some of the best monitoring in the
> > world.  And it is both java and apache compatible.  It has no
> > histogram support that I could find, but that I don't see as being
> > super critical.  The biggest drawback is there is little
> > documentation on how to use it, to really be able to evaluate it for
> > our needs. - Bobby
> >
> >    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <ka...@gmail.com>
> > wrote:
> >
> >
> >  Harsha,
> >
> > That's why I think the new metrics feature of JStorm looks promising.
> >
> > According to the design doc on
> > https://issues.apache.org/jira/browse/STORM-1329,
> > there's no distinction between topology stats (which Apache Storm
> > includes in the worker heartbeat) and built-in metrics (which should be
> > handled with a separate consumer, as you stated).
> > All metrics are passed to Nimbus and Nimbus caches them, which implies
> > we can treat all metrics the same way, and we can also provide built-in
> > metrics (including custom metrics) to users via REST API, too.
> >
> > I thought about a standalone metrics server process which handles all
> > the metrics work (maybe TopologyMaster + Nimbus in the design doc), but
> > if the current implementation of the metrics feature in JStorm can take
> > care of what I'm assuming, I guess it's good enough.
> >
> > Since I don't know about TopologyMaster, I just wonder whether there are
> > any SPOFs (including soft ones) and how metrics work when a SPOF
> > component goes down.
> > Since Cody gave us a starting point to dig into, we can evaluate that
> > feature before phase 2.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
> >
> > > One of the goals of this work, which can probably be addressed in a
> > > separate JIRA, is how the topology metrics reporter works. Today
> > > it's a bolt that's part of the topology graph, which means it's
> > > another node in the topology DAG that needs to be tuned for better
> > > performance. Some of our users took performance hits by deploying a
> > > topology metrics reporter that sends metrics to Ganglia.
> > > Ideally this collection should be asynchronous and not be a node in
> > > the topology DAG.
> > >
> > > Shipping a default metrics server along with a pluggable option
> > > for users who want to use Graphite or other timeline servers should
> > > be the goal.
> > >
> > > --Harsha
> > >
> > >
> > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > @Cody - The design looks good. Does the design allow
> > > > aggregating metrics at the task/executor level? Basically, the
> > > > number of distinct metrics is proportional to the number of
> > > > distinct tasks. Did you ever run into such a use case?
> > > >
> > > >
> > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere 
> > > > <e....@gmail.com>
> > > > wrote:
> > > >
> > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > >
> > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere 
> > > > > <e....@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > > @Jungtaek,
> > > > > > > We did some tests on codahale metrics: compared to
> > > > > > > meters/histograms, counters are quite fast. So we mainly
> > > > > > > focused on optimizing meters and histograms (they are indeed
> > > > > > > very slow), including double sampling, changing the clock
> > > > > > > from ns (System.nanoTime) to ms, etc.
> > > > > > > You can take a look at the
> > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > class of our sequence-split-merge example code, as the
> > > > > > > client code entry to metrics.
> > > > > > > After that, you may dig into the TopologyMaster class, which
> > > > > > > is still part of a topology, then into
> > > > > > > TopologyMetricsRunnable, which is part of the nimbus server,
> > > > > > > and finally into the MetricUploader plugin, which is where
> > > > > > > the metrics interface with our "metrics server". There are
> > > > > > > still some nits in the code, but I think that should be no
> > > > > > > big problem.
> > > > > > >
> > > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > > strictly a real metrics server: since most of the duty lies
> > > > > > > on the nimbus server and topology master, it's more
> > > > > > > appropriate to call it metrics storage. The main reason for
> > > > > > > this is that we don't want to make a heavy-weight metrics
> > > > > > > server out of JStorm, and this makes it very easy for us to
> > > > > > > maintain (we have teams that specifically maintain HBase/OTS
> > > > > > > in Alibaba since they're so commonly used in production).
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim 
> > > > > > <ka...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > >>
> > > > > > >> Cody,
> > > > > > >> I took a look at the design doc and it looks promising,
> > > > > > >> especially that it doesn't do sampling when the metric type
> > > > > > >> is 'counter'. As far as I heard (I didn't try it), it becomes
> > > > > > >> a huge performance hit in Apache Storm when we change the
> > > > > > >> sample rate to 1.0.
> > > > > > >> Could you point me to the entry point of the metrics feature
> > > > > > >> in JStorm to dig into?
> > > > > > >>
> > > > > > >> And just out of curiosity, did you consider extracting the
> > > > > > >> metrics feature (which is done with TopologyMasters and
> > > > > > >> Nimbuses) into a separate component?
> > > > > > >> I understood your mention of 'metrics server' as a separate
> > > > > > >> component, but after seeing the design doc, the feature
> > > > > > >> seems to be implemented on Nimbus.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >>
> > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere
> > > > > > >> <e....@gmail.com> wrote:
> > > > > > >>
> > > > > > >> > JStorm provides a MetricUploader interface, which is
> > > > > > >> > similar to IMetricsConsumer in Storm, and the underlying
> > > > > > >> > implementation is pluggable: you can use HBase, or any
> > > > > > >> > other KV store that supports timeline queries, or even a
> > > > > > >> > database (maybe if it's a small cluster). We provide model
> > > > > > >> > classes in jstorm-core; as to what kinds of metrics data
> > > > > > >> > need to be stored, it's totally up to the detailed
> > > > > > >> > implementation. Our internal implementation uses OTS,
> > > > > > >> > which is a product of aliyun
> > > > > > >> > (https://www.aliyun.com/product/ots/), but it's easy to
> > > > > > >> > adapt to other implementations.
> > > > > >> >
> > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > > > >> <evans@yahoo-inc.com.invalid
> > > > > >> > >
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Yes we originally wanted to try and use the Hadoop 
> > > > > >> > > Timeline
> > > Server
> > > > > for
> > > > > >> > > storm metrics feedback to nimbus + UI + history like server.
> > > But it
> > > > > >> was
> > > > > >> > > not stable at the time, so we stopped.  For the sake of 
> > > > > >> > > playing
> > > > > nicely
> > > > > >> > with
> > > > > >> > > the rest of the big data ecosystem I would like to see 
> > > > > >> > > us
> > > support it
> > > > > >> as
> > > > > >> > an
> > > > > >> > > option for metrics collection/query, but until the 
> > > > > >> > > timeline
> > > server
> > > > > v2
> > > > > >> is
> > > > > >> > > ready and released.  For me the important thing is that 
> > > > > >> > > we have
> > > a
> > > > > >> decent
> > > > > >> > > time series DB that comes with storm by default and is
> > > pluggable so
> > > > > we
> > > > > >> > can
> > > > > >> > > replace it with something else that has similar 
> > > > > >> > > capabilities in
> > > the
> > > > > >> > future.
> > > > > >> > >  - Bobby
> > > > > >> > >
> > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere < 
> > > > > >> > >e.neverme@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm 
> > > > > >> > >absolutely
> > > ok
> > > > > to
> > > > > >> > > discuss this in advance.
> > > > > >> > >
> > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
> > > > > e.neverme@gmail.com
> > > > > >> >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Yes it's already in production.
> > > > > >> > > > The implementation basically follows the design 
> > > > > >> > > > document in 
> > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329, you 
> > > > > >> > > > can
> > > take a
> > > > > >> look
> > > > > >> > > > first and feel free to ask questions.
> > > > > >> > > >
> > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <
> > > kabhwan@gmail.com
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > >> Hi,
> > > > > >> > > >>
> > > > > >> > > >> I got something to do with metrics so I'm seeking 
> > > > > >> > > >> the pull
> > > > > requests
> > > > > >> > > which
> > > > > >> > > >> addresses metrics.
> > > > > >> > > >> And at #753 
> > > > > >> > > >> <https://github.com/apache/storm/pull/753> I
> > > found
> > > > > >> Cody
> > > > > >> > > said
> > > > > >> > > >> we
> > > > > >> > > >> (maybe it means Alibaba team) are currently working 
> > > > > >> > > >> on
> > > Metrics
> > > > > >> Server.
> > > > > >> > > >> (I also found comment which said there was some talk 
> > > > > >> > > >> while
> > > ago
> > > > > >> around
> > > > > >> > > >> integrating Hadoop timeline server. Seems like no 
> > > > > >> > > >> one came up
> > > > > with
> > > > > >> the
> > > > > >> > > >> result, and I prefer to avoid big dependency so I'm 
> > > > > >> > > >> in favor
> > > of
> > > > > >> > Metrics
> > > > > >> > > >> Server for now.)
> > > > > >> > > >>
> > > > > >> > > >> I think that would improve metrics feature of Storm 
> > > > > >> > > >> much
> > > better,
> > > > > so
> > > > > >> > I'd
> > > > > >> > > >> like to see how the work is going. Sure it's only 
> > > > > >> > > >> when
> > > there's no
> > > > > >> > issue
> > > > > >> > > >> for
> > > > > >> > > >> you to work transparently. I just would like to 
> > > > > >> > > >> prevent
> > > > > >> duplication of
> > > > > >> > > >> work, and would like to help if needed and possible.
> > > > > >> > > >>
> > > > > >> > > >> Thanks,
> > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >> > > >>
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Abhishek Agarwal
> > >
> >
> >
> >
> >
>

Re: Question on Metrics Server to Alibaba team

Posted by John Fang <xi...@alibaba-inc.com>.

Sorry, I misunderstood. We will make TopologyMaster highly available, and the metrics meta will be stored in HDFS, so it won't depend on nimbus. That should improve the stability of the metrics system.

-----Original Message-----
From: Cody Innowhere [mailto:e.neverme@gmail.com]
Sent: March 23, 2016 19:59
To: dev@storm.apache.org
Subject: Re: Question on Metrics Server to Alibaba team

If we don't rely on any external system, our metrics system is still available, but it will store metrics meta/data in RocksDB on the nimbus servers.
There will be limits, though. For example, we cannot keep metrics data for the whole topology lifecycle: RocksDB is only a KV store, it may not support efficient scan operations, and too much data on local disk may add extra IO overhead. So we may have to keep only the latest 1 hour of m1 data, 6 hours of m10 data, and so on (this is not currently implemented in JStorm, but it would be quite easy to do).

TopologyMaster is merely a channel for registering/computing/uploading metrics to nimbus. If a TM goes down, the topology metrics will be unavailable for a while until the TM is brought up somewhere else (in a normal failover case this should be very fast), while supervisor/nimbus metrics are unaffected because they are sent to nimbus via the thrift interface. Once the TM is back, the topology metrics will be available again.

Currently JStorm syncs metrics meta between multiple nimbus servers, but metrics data is not synced. So under a nimbus failure we may lose some metrics data.


On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> John,
>
> My concern is H/A of metrics on Storm by default. (I'm not 100% sure 
> Bobby pointed out same things.)
>
> Apache Storm is used by a wide variety of users, so we can't assume that
> they know external systems (including the Hadoop ecosystem, in my
> personal opinion) and can operate them smoothly.
> It reminds me how important the default is.
>
> Therefore, I'm curious whether the new metrics feature of JStorm can work
> smoothly without an external system (HBase / OTS). I'd love to see it
> support H/A without other systems; otherwise users have to tolerate loss
> of metrics in some scenarios.
>
> I guess this may be valid questions on H/A (as far as my understanding 
> of design doc is right): How metrics work when TopologyMaster is down? 
> And how metrics work when failover of Nimbus occurs?
>
> Personally I don't mind losing metrics for short durations (just want 
> to check availability of H/A), but failure shouldn't mess up whole metrics.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:
>
> > @Bobby Evans JStorm code has gone through a lot of testing over the
> > past few years, especially for HA and scalability. We have done a lot
> > of optimization on metrics. The performance is better than Flink in my
> > tests. In my personal opinion, the metrics in JStorm offer a great deal
> > of information, and they can tell us where the bottleneck is when we
> > run a topology. The performance bottleneck may be in
> > serialize/deserialize/netty/executor and so on. Of course, there is
> > also other good monitoring in the world, so I hope we can choose the
> > better monitoring before phase 2. I will start studying Atlas. If it is
> > better, I will be pleased to redesign the metrics on top of Atlas.
> > -----Original Message-----
> > From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
> > Sent: March 22, 2016 22:36
> > To: dev@storm.apache.org
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > My personal opinion is that we should not reinvent the wheel (aka 
> > distributed fault tolerant metrics) ourselves.  The local file 
> > blobstore with nimbus HA was a big enough pain to write and it is 
> > relatively simple in comparison.
> > If the JStorm code is simple and offers everything we need in terms 
> > of HA and scalability then I would be OK with it, but if it doesn't 
> > I would
> lean
> > towards a different compatible open source solution.
> >
> > https://github.com/Netflix/atlas
> > looks very promising as a default option.  It is actively maintained 
> > by a group that I think has some of the best monitoring in the 
> > world.  And it
> is
> > both java and apache compatible.  It has no histogram support that I
> could
> > find, but that I don't see as being super critical.  The biggest 
> > drawback is there is little documentation on how to use it, to 
> > really be able to evaluate it for our needs. - Bobby
> >
> >     On Monday, March 21, 2016 7:29 PM, Jungtaek Lim 
> > <ka...@gmail.com>
> > wrote:
> >
> >
> >  Harsha,
> >
> > That's why I think new metric feature of JStorm looks promising.
> >
> > According to design doc on
> > https://issues.apache.org/jira/browse/STORM-1329,
> > there's no distinction between topology stat (which Apache Storm 
> > includes to worker heartbeat) and built-in metrics (which should be 
> > handled with separate consumer, as you stated).
> > All metrics are passed to Nimbus and Nimbus cached metrics, which 
> > implies we can treat all metrics as same, and we can also provide 
> > built-in
> metrics
> > (including custom metrics) to users via REST API, too.
> >
> > I thought about standalone metrics server process which handles 
> > whole metric works (maybe TopologyMaster + Nimbus on design doc), 
> > but if
> current
> > implementation of metric feature on JStorm can take care of what I'm 
> > assuming, I guess it's great enough.
> >
> > Since I don't know about TopologyMaster, I just wonder that there're 
> > any SPOFs (including soft) and how metrics work when if component of 
> > SPOF
> goes
> > down.
> > Since Cody gives digging point to take a look at, we can evaluate 
> > that feature before phase 2.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
> >
> > > One of the goals of this work and probably can be addressed in 
> > > separate jira is how the topology metrics reporter works. Today 
> > > its a bolt thats part of a topology graph that means its another 
> > > node in the Topology DAG that needs be tuned for better 
> > > performance. Some of our users took performance hits by deploying 
> > > topology metrics reporter that can send metrics to Ganglia. 
> > > Ideally this collection should be asynchronous and not be a node in topology DAG.
> > >
> > > Shipping default metrics server and along with pluggable option 
> > > for users who wants to graphite or other timeline servers should 
> > > be the goal.
> > >
> > > --Harsha
> > >
> > >
> > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > @Cody - The design looks good. Does the design allow to 
> > > > aggregate metrics at the task/executor level? Basically, number 
> > > > of distinct metrics is proportional to the number of distinct 
> > > > tasks, did you ever run into such a use case?
> > > >
> > > >
> > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere 
> > > > <e....@gmail.com>
> > > > wrote:
> > > >
> > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > >
> > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere 
> > > > > <e....@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > @Jungtaek,
> > > > > > We did some tests on codahale metrics, compared to 
> > > > > > meters/histograms, counters are quite fast. So we mainly 
> > > > > > focused on the optimization of
> > > > > meters
> > > > > > and histograms (they are indeed very slow) including double 
> > > > > > sampling, changing the clock from ns (System.nanoTime) to 
> > > > > > ms,
> etc.
> > > > > > You can take a look at the
> > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" 
> > > > > > class of our sequence-split-merge example code, as the 
> > > > > > client code entry to
> > > metrics.
> > > > > > After that, you may dig to TopologyMaster class, which is 
> > > > > > still part
> > > of a
> > > > > > topology, and then to TopologyMetricsRunnable, which is a 
> > > > > > part of
> > > nimbus
> > > > > > server, finally to MetricUploader plugin, this is where the 
> > > > > > metrics interfere with our "metrics server". Still, there're 
> > > > > > some nits in the
> > > > > code,
> > > > > > but I think that should be no big problem.
> > > > > >
> > > > > > I'd also like to point out that our "metrics server" is not 
> > > > > > strictly
> > > a
> > > > > > real metrics server, since most of the duty lies on nimbus 
> > > > > > server and topology master, it's more appropriate to call it
> > metrics storage.
> > > The
> > > > > main
> > > > > > reason for this is that we don't want to make a heavy-weight 
> > > > > > metrics
> > > > > server
> > > > > > out of JStorm, and this makes us very easy to maintain (we 
> > > > > > have teams
> > > > > that
> > > > > > specifically maintain HBase/OTS in Alibaba since they're so 
> > > > > > commonly
> > > used
> > > > > > in production).
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim 
> > > > > > <ka...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > >>
> > > > > >> Cody,
> > > > > >> I took a look at design doc and looks promising, especially 
> > > > > >> it
> > > doesn't
> > > > > do
> > > > > >> sampling when metric type is 'counter'. As far as I heard 
> > > > > >> (I didn't
> > > try
> > > > > >> it)
> > > > > >> it becomes huge performance hit in Apache Storm when we 
> > > > > >> change
> > > sample
> > > > > rate
> > > > > >> to 1.0.
> > > > > >> Could you guide the entry point of metric feature in JStorm 
> > > > > >> to dig
> > > into?
> > > > > >>
> > > > > >> And just a curiosity, did you consider extracting metric 
> > > > > >> feature
> > > (which
> > > > > is
> > > > > >> done with TopologyMasters and Nimbuses) into separate component?
> > > > > >> I understood your mention to 'metrics server' as separate
> > > component, but
> > > > > >> after seeing design doc, feature seems to be implemented on
> > Nimbus.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >>
> > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere
> > > > > > >> <e....@gmail.com> wrote:
> > > > > >>
> > > > > >> > JStorm has provided a MetricUploader interface, which is 
> > > > > >> > similar
> > > to
> > > > > >> > IMetricsConsumer in storm, and the underlying 
> > > > > >> > implementation is
> > > > > >> pluggable,
> > > > > >> > you can use HBase, or any other KV store that supports 
> > > > > >> > timeline
> > > > > queries
> > > > > >> or
> > > > > >> > even a database(maybe for it's a small cluster). We 
> > > > > >> > provide model
> > > > > >> classes
> > > > > >> > in jstorm-core, as to what kinds of metrics data need to 
> > > > > >> > be
> > > stored,
> > > > > it's
> > > > > >> > totally up to the detailed implementation. Our internal
> > > implementation
> > > > > >> uses
> > > > > >> > OTS, which is a product of aliyun (
> > > > > https://www.aliyun.com/product/ots/
> > > > > >> ),
> > > > > >> > but it's easy to adapt to other implementations.
> > > > > >> >
> > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > > > >> <evans@yahoo-inc.com.invalid
> > > > > >> > >
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Yes we originally wanted to try and use the Hadoop 
> > > > > >> > > Timeline
> > > Server
> > > > > for
> > > > > >> > > storm metrics feedback to nimbus + UI + history like server.
> > > But it
> > > > > >> was
> > > > > >> > > not stable at the time, so we stopped.  For the sake of 
> > > > > >> > > playing
> > > > > nicely
> > > > > >> > with
> > > > > >> > > the rest of the big data ecosystem I would like to see 
> > > > > >> > > us
> > > support it
> > > > > >> as
> > > > > >> > an
> > > > > >> > > option for metrics collection/query, but until the 
> > > > > >> > > timeline
> > > server
> > > > > v2
> > > > > >> is
> > > > > >> > > ready and released.  For me the important thing is that 
> > > > > >> > > we have
> > > a
> > > > > >> decent
> > > > > >> > > time series DB that comes with storm by default and is
> > > pluggable so
> > > > > we
> > > > > >> > can
> > > > > >> > > replace it with something else that has similar 
> > > > > >> > > capabilities in
> > > the
> > > > > >> > future.
> > > > > >> > >  - Bobby
> > > > > >> > >
> > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere < 
> > > > > >> > >e.neverme@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm 
> > > > > >> > >absolutely
> > > ok
> > > > > to
> > > > > >> > > discuss this in advance.
> > > > > >> > >
> > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
> > > > > e.neverme@gmail.com
> > > > > >> >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Yes it's already in production.
> > > > > >> > > > The implementation basically follows the design 
> > > > > >> > > > document in 
> > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329, you 
> > > > > >> > > > can
> > > take a
> > > > > >> look
> > > > > >> > > > first and feel free to ask questions.
> > > > > >> > > >
> > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <
> > > kabhwan@gmail.com
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > >> Hi,
> > > > > >> > > >>
> > > > > >> > > >> I got something to do with metrics so I'm seeking 
> > > > > >> > > >> the pull
> > > > > requests
> > > > > >> > > which
> > > > > >> > > >> addresses metrics.
> > > > > >> > > >> And at #753 
> > > > > >> > > >> <https://github.com/apache/storm/pull/753> I
> > > found
> > > > > >> Cody
> > > > > >> > > said
> > > > > >> > > >> we
> > > > > >> > > >> (maybe it means Alibaba team) are currently working 
> > > > > >> > > >> on
> > > Metrics
> > > > > >> Server.
> > > > > >> > > >> (I also found comment which said there was some talk 
> > > > > >> > > >> while
> > > ago
> > > > > >> around
> > > > > >> > > >> integrating Hadoop timeline server. Seems like no 
> > > > > >> > > >> one came up
> > > > > with
> > > > > >> the
> > > > > >> > > >> result, and I prefer to avoid big dependency so I'm 
> > > > > >> > > >> in favor
> > > of
> > > > > >> > Metrics
> > > > > >> > > >> Server for now.)
> > > > > >> > > >>
> > > > > >> > > >> I think that would improve metrics feature of Storm 
> > > > > >> > > >> much
> > > better,
> > > > > so
> > > > > >> > I'd
> > > > > >> > > >> like to see how the work is going. Sure it's only 
> > > > > >> > > >> when
> > > there's no
> > > > > >> > issue
> > > > > >> > > >> for
> > > > > >> > > >> you to work transparently. I just would like to 
> > > > > >> > > >> prevent
> > > > > >> duplication of
> > > > > >> > > >> work, and would like to help if needed and possible.
> > > > > >> > > >>
> > > > > >> > > >> Thanks,
> > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >> > > >>
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Abhishek Agarwal
> > >
> >
> >
> >
> >
>

Re: Question on Metrics Server to Alibaba team

Posted by Cody Innowhere <e....@gmail.com>.

If we don't rely on any external system, our metrics system is still
available, but it will store metrics meta/data in RocksDB on the nimbus
servers. There will be limits, though. For example, we cannot keep metrics
data for the whole topology lifecycle: RocksDB is only a KV store, it may
not support efficient scan operations, and too much data on local disk may
add extra IO overhead. So we may have to keep only the latest 1 hour of m1
data, 6 hours of m10 data, and so on (this is not currently implemented in
JStorm, but it would be quite easy to do).
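
A minimal sketch of the retention limit described above, written in Python
for brevity. It assumes m1 and m10 mean 1-minute and 10-minute windows. The
class, the method names, and the in-memory deque standing in for RocksDB are
all hypothetical and are not taken from JStorm.

```python
from collections import deque

# Assumed retention windows, using the figures mentioned above:
# 1 hour of m1 points and 6 hours of m10 points (values in seconds).
RETENTION_SECONDS = {"m1": 60 * 60, "m10": 6 * 60 * 60}


class LocalMetricStore:
    """Toy in-memory stand-in for a local KV store such as RocksDB."""

    def __init__(self, retention=RETENTION_SECONDS):
        self.retention = dict(retention)
        # One time-ordered queue of (timestamp, value) per (window, name).
        self.points = {}

    def put(self, window, name, timestamp, value):
        """Append a point, then drop everything older than the window allows."""
        queue = self.points.setdefault((window, name), deque())
        queue.append((timestamp, value))
        self._evict(window, queue, now=timestamp)

    def _evict(self, window, queue, now):
        cutoff = now - self.retention[window]
        # Points arrive in time order, so expired ones sit at the front.
        while queue and queue[0][0] <= cutoff:
            queue.popleft()

    def scan(self, window, name):
        """Return the retained points for one metric, oldest first."""
        return list(self.points.get((window, name), ()))
```

Writing one m1 point per minute for two hours into this store leaves only
the most recent 60 points, which keeps local disk usage bounded regardless
of how long the topology runs.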

TopologyMaster is merely a channel for registering/computing/uploading
metrics to nimbus. If a TM goes down, the topology metrics will be
unavailable for a while until the TM is brought up somewhere else (in a
normal failover case this should be very fast), while supervisor/nimbus
metrics are unaffected because they are sent to nimbus via the thrift
interface. Once the TM is back, the topology metrics will be available
again.

Currently JStorm syncs metrics meta between multiple nimbus servers, but
metrics data is not synced. So under a nimbus failure we may lose some
metrics data.
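
The failover behaviour above can be illustrated with a toy model, again in
Python. This is only a sketch under stated assumptions: the Nimbus class
and the register/record/failover helpers are hypothetical names invented
for illustration, and the model ignores how JStorm actually replicates meta.

```python
class Nimbus:
    """Toy nimbus node: meta is replicated, data stays local."""

    def __init__(self, name):
        self.name = name
        self.meta = {}  # metric id -> metric name (synced across nodes)
        self.data = {}  # metric id -> list of values (this node only)


def register(nimbuses, metric_id, metric_name):
    # Metrics meta is written to every nimbus, so it survives a failover.
    for nimbus in nimbuses:
        nimbus.meta[metric_id] = metric_name


def record(leader, metric_id, value):
    # Metrics data is written only on the current leader.
    leader.data.setdefault(metric_id, []).append(value)


def failover(nimbuses, failed):
    # Pick the first surviving nimbus as the new leader.
    return next(nimbus for nimbus in nimbuses if nimbus is not failed)
```

After a failover in this model, the new leader still knows every registered
metric, but the values recorded on the failed leader are gone. Pointing the
pluggable uploader at an external store is what avoids that loss.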


On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> John,
>
> My concern is H/A of metrics on Storm by default. (I'm not 100% sure Bobby
> pointed out same things.)
>
> Apache Storm is used by a wide variety of users, so we can't assume that
> they know external systems (including the Hadoop ecosystem, in my
> personal opinion) and can operate them smoothly.
> It reminds me how important the default is.
>
> Therefore, I'm curious whether the new metrics feature of JStorm can work
> smoothly without an external system (HBase / OTS). I'd love to see it
> support H/A without other systems; otherwise users have to tolerate loss
> of metrics in some scenarios.
>
> I guess this may be valid questions on H/A (as far as my understanding of
> design doc is right): How metrics work when TopologyMaster is down? And how
> metrics work when failover of Nimbus occurs?
>
> Personally I don't mind losing metrics for short durations (just want to
> check availability of H/A), but failure shouldn't mess up whole metrics.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:
>
> > @Bobby Evans JStorm code has gone through a lot of testing over the
> > past few years, especially for HA and scalability. We have done a lot
> > of optimization on metrics. The performance is better than Flink in my
> > tests. In my personal opinion, the metrics in JStorm offer a great deal
> > of information, and they can tell us where the bottleneck is when we
> > run a topology. The performance bottleneck may be in
> > serialize/deserialize/netty/executor and so on. Of course, there is
> > also other good monitoring in the world, so I hope we can choose the
> > better monitoring before phase 2. I will start studying Atlas. If it is
> > better, I will be pleased to redesign the metrics on top of Atlas.
> > -----Original Message-----
> > From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
> > Sent: March 22, 2016 22:36
> > To: dev@storm.apache.org
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > My personal opinion is that we should not reinvent the wheel (aka
> > distributed fault tolerant metrics) ourselves.  The local file blobstore
> > with nimbus HA was a big enough pain to write and it is relatively simple
> > in comparison.
> > If the JStorm code is simple and offers everything we need in terms of HA
> > and scalability then I would be OK with it, but if it doesn't I would
> lean
> > towards a different compatible open source solution.
> >
> > https://github.com/Netflix/atlas
> > looks very promising as a default option.  It is actively maintained by a
> > group that I think has some of the best monitoring in the world.  And it
> is
> > both java and apache compatible.  It has no histogram support that I
> could
> > find, but that I don't see as being super critical.  The biggest drawback
> > is there is little documentation on how to use it, to really be able to
> > evaluate it for our needs. - Bobby
> >
> >     On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <ka...@gmail.com>
> > wrote:
> >
> >
> >  Harsha,
> >
> > That's why I think new metric feature of JStorm looks promising.
> >
> > According to design doc on
> > https://issues.apache.org/jira/browse/STORM-1329,
> > there's no distinction between topology stat (which Apache Storm includes
> > to worker heartbeat) and built-in metrics (which should be handled with
> > separate consumer, as you stated).
> > All metrics are passed to Nimbus and Nimbus cached metrics, which implies
> > we can treat all metrics as same, and we can also provide built-in
> metrics
> > (including custom metrics) to users via REST API, too.
> >
> > I thought about standalone metrics server process which handles whole
> > metric works (maybe TopologyMaster + Nimbus on design doc), but if
> current
> > implementation of metric feature on JStorm can take care of what I'm
> > assuming, I guess it's great enough.
> >
> > Since I don't know about TopologyMaster, I just wonder that there're any
> > SPOFs (including soft) and how metrics work when if component of SPOF
> goes
> > down.
> > Since Cody gives digging point to take a look at, we can evaluate that
> > feature before phase 2.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
> >
> > > One of the goals of this work and probably can be addressed in
> > > separate jira is how the topology metrics reporter works. Today its a
> > > bolt thats part of a topology graph that means its another node in the
> > > Topology DAG that needs be tuned for better performance. Some of our
> > > users took performance hits by deploying topology metrics reporter
> > > that can send metrics to Ganglia. Ideally this collection should be
> > > asynchronous and not be a node in topology DAG.
> > >
> > > Shipping default metrics server and along with pluggable option for
> > > users who wants to graphite or other timeline servers should be the
> > > goal.
> > >
> > > --Harsha
> > >
> > >
> > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > @Cody - The design looks good. Does the design allow to aggregate
> > > > metrics at the task/executor level? Basically, number of distinct
> > > > metrics is proportional to the number of distinct tasks, did you
> > > > ever run into such a use case?
> > > >
> > > >
> > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
> > > > <e....@gmail.com>
> > > > wrote:
> > > >
> > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > >
> > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
> > > > > <e....@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > @Jungtaek,
> > > > > > We did some tests on codahale metrics, compared to
> > > > > > meters/histograms, counters are quite fast. So we mainly focused
> > > > > > on the optimization of
> > > > > meters
> > > > > > and histograms (they are indeed very slow) including double
> > > > > > sampling, changing the clock from ns (System.nanoTime) to ms,
> etc.
> > > > > > You can take a look at the
> > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" class of
> > > > > > our sequence-split-merge example code, as the client code entry
> > > > > > to
> > > metrics.
> > > > > > After that, you may dig to TopologyMaster class, which is still
> > > > > > part
> > > of a
> > > > > > topology, and then to TopologyMetricsRunnable, which is a part
> > > > > > of
> > > nimbus
> > > > > > server, finally to MetricUploader plugin, this is where the
> > > > > > metrics interfere with our "metrics server". Still, there're
> > > > > > some nits in the
> > > > > code,
> > > > > > but I think that should be no big problem.
> > > > > >
> > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > strictly
> > > a
> > > > > > real metrics server, since most of the duty lies on nimbus
> > > > > > server and topology master, it's more appropriate to call it
> > metrics storage.
> > > The
> > > > > main
> > > > > > reason for this is that we don't want to make a heavy-weight
> > > > > > metrics
> > > > > server
> > > > > > out of JStorm, and this makes us very easy to maintain (we have
> > > > > > teams
> > > > > that
> > > > > > specifically maintain HBase/OTS in Alibaba since they're so
> > > > > > commonly
> > > used
> > > > > > in production).
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
> > > > > > <ka...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > >>
> > > > > >> Cody,
> > > > > >> I took a look at design doc and looks promising, especially it
> > > doesn't
> > > > > do
> > > > > >> sampling when metric type is 'counter'. As far as I heard (I
> > > > > >> didn't
> > > try
> > > > > >> it)
> > > > > >> it becomes huge performance hit in Apache Storm when we change
> > > sample
> > > > > rate
> > > > > >> to 1.0.
> > > > > >> Could you guide the entry point of metric feature in JStorm to
> > > > > >> dig
> > > into?
> > > > > >>
> > > > > >> And just a curiosity, did you consider extracting metric
> > > > > >> feature
> > > (which
> > > > > is
> > > > > >> done with TopologyMasters and Nimbuses) into separate component?
> > > > > >> I understood your mention to 'metrics server' as separate
> > > component, but
> > > > > >> after seeing design doc, feature seems to be implemented on
> > Nimbus.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >>
> > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere
> > > > > > >> <e....@gmail.com> wrote:
> > > > > >>
> > > > > >> > JStorm has provided a MetricUploader interface, which is
> > > > > >> > similar
> > > to
> > > > > >> > IMetricsConsumer in storm, and the underlying implementation
> > > > > >> > is
> > > > > >> pluggable,
> > > > > >> > you can use HBase, or any other KV store that supports
> > > > > >> > timeline
> > > > > queries
> > > > > >> or
> > > > > >> > even a database(maybe for it's a small cluster). We provide
> > > > > >> > model
> > > > > >> classes
> > > > > >> > in jstorm-core, as to what kinds of metrics data need to be
> > > stored,
> > > > > it's
> > > > > >> > totally up to the detailed implementation. Our internal
> > > implementation
> > > > > >> uses
> > > > > >> > OTS, which is a product of aliyun (
> > > > > https://www.aliyun.com/product/ots/
> > > > > >> ),
> > > > > >> > but it's easy to adapt to other implementations.
> > > > > >> >
> > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > > > >> <evans@yahoo-inc.com.invalid
> > > > > >> > >
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Yes we originally wanted to try and use the Hadoop Timeline
> > > Server
> > > > > for
> > > > > >> > > storm metrics feedback to nimbus + UI + history like server.
> > > But it
> > > > > >> was
> > > > > >> > > not stable at the time, so we stopped.  For the sake of
> > > > > >> > > playing
> > > > > nicely
> > > > > >> > with
> > > > > >> > > the rest of the big data ecosystem I would like to see us
> > > support it
> > > > > >> as
> > > > > >> > an
> > > > > >> > > option for metrics collection/query, but until the timeline
> > > server
> > > > > v2
> > > > > >> is
> > > > > >> > > ready and released.  For me the important thing is that we
> > > > > >> > > have
> > > a
> > > > > >> decent
> > > > > >> > > time series DB that comes with storm by default and is
> > > pluggable so
> > > > > we
> > > > > >> > can
> > > > > >> > > replace it with something else that has similar
> > > > > >> > > capabilities in
> > > the
> > > > > >> > future.
> > > > > >> > >  - Bobby
> > > > > >> > >
> > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere <
> > > > > >> > >e.neverme@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm
> > > > > >> > >absolutely
> > > ok
> > > > > to
> > > > > >> > > discuss this in advance.
> > > > > >> > >
> > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
> > > > > e.neverme@gmail.com
> > > > > >> >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Yes it's already in production.
> > > > > >> > > > The implementation basically follows the design document
> > > > > >> > > > in https://issues.apache.org/jira/browse/STORM-1329, you
> > > > > >> > > > can
> > > take a
> > > > > >> look
> > > > > >> > > > first and feel free to ask questions.
> > > > > >> > > >
> > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <
> > > kabhwan@gmail.com
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > >> Hi,
> > > > > >> > > >>
> > > > > >> > > >> I got something to do with metrics so I'm seeking the
> > > > > >> > > >> pull
> > > > > requests
> > > > > >> > > which
> > > > > >> > > >> addresses metrics.
> > > > > >> > > >> And at #753 <https://github.com/apache/storm/pull/753> I
> > > found
> > > > > >> Cody
> > > > > >> > > said
> > > > > >> > > >> we
> > > > > >> > > >> (maybe it means Alibaba team) are currently working on
> > > Metrics
> > > > > >> Server.
> > > > > >> > > >> (I also found comment which said there was some talk
> > > > > >> > > >> while
> > > ago
> > > > > >> around
> > > > > >> > > >> integrating Hadoop timeline server. Seems like no one
> > > > > >> > > >> came up
> > > > > with
> > > > > >> the
> > > > > >> > > >> result, and I prefer to avoid big dependency so I'm in
> > > > > >> > > >> favor
> > > of
> > > > > >> > Metrics
> > > > > >> > > >> Server for now.)
> > > > > >> > > >>
> > > > > >> > > >> I think that would improve metrics feature of Storm much
> > > better,
> > > > > so
> > > > > >> > I'd
> > > > > >> > > >> like to see how the work is going. Sure it's only when
> > > there's no
> > > > > >> > issue
> > > > > >> > > >> for
> > > > > >> > > >> you to work transparently. I just would like to prevent
> > > > > >> duplication of
> > > > > >> > > >> work, and would like to help if needed and possible.
> > > > > >> > > >>
> > > > > >> > > >> Thanks,
> > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >> > > >>
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Abhishek Agarwal
> > >
> >
> >
> >
> >
>

Re: Question on Metrics Server to Alibaba team

Posted by Jungtaek Lim <ka...@gmail.com>.

John,

My concern is the H/A of metrics in Storm by default. (I'm not 100% sure Bobby
was pointing out the same thing.)

Apache Storm is used by a wide variety of users, so we can't assume that users
know external systems (including the Hadoop ecosystem; this is my personal
opinion) and can operate them smoothly.
It reminds me of how important the defaults are.

Therefore, I'm curious whether the new metrics feature of JStorm can work
smoothly without an external system (HBase / OTS). I'd love to see it support
H/A without other systems; otherwise users have to tolerate loss of metrics in
some scenarios.

I guess these may be valid questions on H/A (if my understanding of the
design doc is right): How do metrics work when TopologyMaster is down? And how
do metrics work when a Nimbus failover occurs?

Personally I don't mind losing metrics for short durations (I just want to
check whether H/A is available), but a failure shouldn't mess up the whole
set of metrics.
Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xi...@alibaba-inc.com> wrote:

> @Bobby Evans The JStorm code has been tested extensively over the past few
> years, especially for HA and scalability. We have done a lot of optimization
> of metrics. Its performance is better than Flink's in my tests. In my
> personal opinion, the metrics in JStorm offer a lot of information, and
> they can tell us where the bottleneck is when we run a topology. The
> performance bottleneck may be in serialization, deserialization, Netty, the
> executor, and so on. Of course, there are also other good monitoring systems
> in the world. So I hope we can choose the better monitoring solution before
> phase 2. I will start studying Atlas. If it is better, I will be pleased to
> redesign the metrics on top of Atlas.

Re: Question on Metrics Server to Alibaba team

Posted by John Fang <xi...@alibaba-inc.com>.

@Bobby Evans The JStorm code has been tested extensively over the past few years, especially for HA and scalability. We have done a lot of optimization of metrics. Its performance is better than Flink's in my tests. In my personal opinion, the metrics in JStorm offer a lot of information, and they can tell us where the bottleneck is when we run a topology. The performance bottleneck may be in serialization, deserialization, Netty, the executor, and so on. Of course, there are also other good monitoring systems in the world. So I hope we can choose the better monitoring solution before phase 2. I will start studying Atlas. If it is better, I will be pleased to redesign the metrics on top of Atlas.
-----Original Message-----
From: Bobby Evans [mailto:evans@yahoo-inc.com.INVALID]
Sent: March 22, 2016 22:36
To: dev@storm.apache.org
Subject: Re: Question on Metrics Server to Alibaba team

My personal opinion is that we should not reinvent the wheel (aka distributed fault tolerant metrics) ourselves.  The local file blobstore with nimbus HA was a big enough pain to write and it is relatively simple in comparison.
If the JStorm code is simple and offers everything we need in terms of HA and scalability then I would be OK with it, but if it doesn't I would lean towards a different compatible open source solution. 

https://github.com/Netflix/atlas
looks very promising as a default option.  It is actively maintained by a group that I think has some of the best monitoring in the world.  And it is both java and apache compatible.  It has no histogram support that I could find, but that I don't see as being super critical.  The biggest drawback is there is little documentation on how to use it, to really be able to evaluate it for our needs. - Bobby 

Re: Question on Metrics Server to Alibaba team

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.

My personal opinion is that we should not reinvent the wheel (aka distributed fault tolerant metrics) ourselves.  The local file blobstore with nimbus HA was a big enough pain to write and it is relatively simple in comparison.
If the JStorm code is simple and offers everything we need in terms of HA and scalability then I would be OK with it, but if it doesn't I would lean towards a different compatible open source solution. 

https://github.com/Netflix/atlas
looks very promising as a default option.  It is actively maintained by a group that I think has some of the best monitoring in the world.  And it is both Java and Apache compatible.  It has no histogram support that I could find, but I don't see that as being super critical.  The biggest drawback is that there is little documentation on how to use it, which makes it hard to really evaluate it for our needs. - Bobby

    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <ka...@gmail.com> wrote:
 

 Harsha,

That's why I think the new metrics feature of JStorm looks promising.

According to the design doc at https://issues.apache.org/jira/browse/STORM-1329,
there's no distinction between topology stats (which Apache Storm includes
in the worker heartbeat) and built-in metrics (which should be handled by a
separate consumer, as you stated).
All metrics are passed to Nimbus, and Nimbus caches them, which implies
we can treat all metrics the same way, and we can also provide built-in metrics
(including custom metrics) to users via the REST API.

I had thought about a standalone metrics server process that handles all the
metrics work (maybe TopologyMaster + Nimbus in the design doc), but if the
current implementation of the metrics feature in JStorm can take care of what
I'm assuming, I guess that's good enough.

Since I don't know about TopologyMaster, I just wonder whether there are any
SPOFs (including soft ones) and how metrics work when a SPOF component goes
down.
Since Cody gave us a starting point to dig into, we can evaluate that
feature before phase 2.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:

> One of the goals of this work, which can probably be addressed in a separate
> JIRA, is how the topology metrics reporter works. Today it's a bolt that's
> part of the topology graph, which means it's another node in the topology DAG
> that needs to be tuned for better performance. Some of our users took
> performance hits by deploying a topology metrics reporter that sends
> metrics to Ganglia. Ideally this collection should be asynchronous and
> not be a node in the topology DAG.
>
> Shipping a default metrics server along with a pluggable option for
> users who want to use Graphite or other timeline servers should be the
> goal.
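
To make the asynchronous collection described above concrete, here is a minimal sketch. It is an illustration only, not Storm or JStorm code: the class name AsyncMetricsReporter, its methods, and the sink callable are all hypothetical, and Python is used just for brevity. The tuple-processing path only enqueues a sample; a background thread does the slow delivery, so the reporter is no longer a node in the DAG.

```python
import queue
import threading


class AsyncMetricsReporter:
    """Hypothetical reporter: callers enqueue, a daemon thread delivers."""

    def __init__(self, sink, max_pending=10000):
        self._sink = sink  # callable(name, value), e.g. a Ganglia or Graphite client
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # samples discarded because the queue was full
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def report(self, name, value):
        """Called on the processing path; never blocks."""
        try:
            self._pending.put_nowait((name, value))
        except queue.Full:
            # Prefer losing a sample to stalling the topology.
            self.dropped += 1

    def _drain(self):
        while True:
            item = self._pending.get()
            if item is None:  # sentinel queued by close()
                break
            self._sink(*item)

    def close(self):
        """Deliver everything queued so far, then stop the worker."""
        self._pending.put(None)
        self._worker.join()


received = []
reporter = AsyncMetricsReporter(lambda name, value: received.append((name, value)))
for i in range(3):
    reporter.report("emitted", i)
reporter.close()
print(received)
```

The bounded queue is a design choice of this sketch: when the sink is slow, samples are dropped and counted rather than applying back-pressure to the topology.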
>
> --Harsha
>
>
> On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > @Cody - The design looks good. Does the design allow aggregating metrics
> > at the task/executor level? Basically, the number of distinct metrics is
> > proportional to the number of distinct tasks. Did you ever run into such
> > a use case?
> >
> >
> > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere <e....@gmail.com>
> > wrote:
> >
> > > Also, you can read the code from our latest release JStorm 2.1.1.
> > >
> > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere <e....@gmail.com>
> > > wrote:
> > >
> > > > > @Jungtaek,
> > > > > We did some tests on codahale metrics: compared to meters/histograms,
> > > > > counters are quite fast. So we mainly focused on optimizing meters
> > > > > and histograms (they are indeed very slow), including double sampling,
> > > > > changing the clock from ns (System.nanoTime) to ms, etc.
> > > > > You can take a look at the
> > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" class of our
> > > > > sequence-split-merge example code as the client-side entry point to
> > > > > metrics.
> > > > > After that, you may dig into the TopologyMaster class, which is still
> > > > > part of a topology, then into TopologyMetricsRunnable, which is part of
> > > > > the nimbus server, and finally into the MetricUploader plugin, which is
> > > > > where the metrics interface with our "metrics server". There are still
> > > > > some nits in the code, but I think that should be no big problem.
> > > > >
> > > > > I'd also like to point out that our "metrics server" is not strictly a
> > > > > real metrics server: since most of the work is done by the nimbus server
> > > > > and the topology master, it's more appropriate to call it metrics
> > > > > storage. The main reason for this is that we don't want to build a
> > > > > heavyweight metrics server out of JStorm, and this makes it very easy
> > > > > for us to maintain (we have teams that specifically maintain HBase/OTS
> > > > > in Alibaba since they're so commonly used in production).
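
A rough sketch of the general idea in the message above: counters are cheap enough to record every event, while histograms record only a sample and use a millisecond clock. This is an illustration under assumptions, not JStorm's implementation; it does not reproduce the "double sampling" mentioned there, and Counter, SampledHistogram, and now_ms are hypothetical names (Python is used for brevity).

```python
import random
import time


class Counter:
    """Exact count: an increment is cheap, so every event is recorded."""

    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        self.value += n


class SampledHistogram:
    """Keeps only a random fraction of observations to make updates cheaper."""

    def __init__(self, sample_rate, rng=None):
        if not 0.0 < sample_rate <= 1.0:
            raise ValueError("sample_rate must be in (0, 1]")
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()
        self.values = []

    def update(self, value):
        # random() is in [0, 1), so a rate of 1.0 records every observation.
        if self._rng.random() < self.sample_rate:
            self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values) if self.values else 0.0


def now_ms():
    """Millisecond wall clock, standing in for the ms clock mentioned above."""
    return int(time.time() * 1000)


processed = Counter()
latency_ms = SampledHistogram(sample_rate=0.1, rng=random.Random(42))
for _ in range(10000):
    start = now_ms()
    processed.inc()
    latency_ms.update(now_ms() - start)

# The counter is exact; the histogram holds roughly a tenth of the observations.
print(processed.value, len(latency_ms.values))
```

With a sample rate of 1.0 the histogram pays the recording cost on every update, which is the kind of overhead the sampling is meant to avoid.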
> > > >
> > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim <ka...@gmail.com>
> > > wrote:
> > > >
> > > > >> Thanks Cody and Bobby for the explanation.
> > > > >>
> > > > >> Cody,
> > > > >> I took a look at the design doc and it looks promising, especially that
> > > > >> it doesn't do sampling when the metric type is 'counter'. As far as I've
> > > > >> heard (I didn't try it), there is a huge performance hit in Apache Storm
> > > > >> when we change the sample rate to 1.0.
> > > > >> Could you point me to the entry point of the metrics feature in JStorm
> > > > >> to dig into?
> > > > >>
> > > > >> And just out of curiosity, did you consider extracting the metrics
> > > > >> feature (which is done with TopologyMasters and Nimbuses) into a
> > > > >> separate component?
> > > > >> I understood your mention of 'metrics server' as a separate component,
> > > > >> but after seeing the design doc, the feature seems to be implemented on
> > > > >> Nimbus.
> > > >>
> > > >> Thanks,
> > > >> Jungtaek Lim (HeartSaVioR)
> > > >>
> > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere <e....@gmail.com> wrote:
> > > >>
> > > > >> > JStorm provides a MetricUploader interface, which is similar to
> > > > >> > IMetricsConsumer in Storm, and the underlying implementation is
> > > > >> > pluggable: you can use HBase, any other KV store that supports
> > > > >> > timeline queries, or even a database (perhaps for a small cluster).
> > > > >> > We provide model classes in jstorm-core; as to what kinds of metrics
> > > > >> > data need to be stored, it's totally up to the detailed
> > > > >> > implementation. Our internal implementation uses OTS, which is a
> > > > >> > product of aliyun (https://www.aliyun.com/product/ots/),
> > > > >> > but it's easy to adapt to other implementations.
> > > >> >
> > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > >> <evans@yahoo-inc.com.invalid
> > > >> > >
> > > >> > wrote:
> > > >> >
> > > >> > > Yes we originally wanted to try and use the Hadoop Timeline
> Server
> > > for
> > > >> > > storm metrics feedback to nimbus + UI + history like server.
> But it
> > > >> was
> > > >> > > not stable at the time, so we stopped.  For the sake of playing
> > > nicely
> > > >> > with
> > > >> > > the rest of the big data ecosystem I would like to see us
> support it
> > > >> as
> > > >> > an
> > > >> > > option for metrics collection/query, but until the timeline
> server
> > > v2
> > > >> is
> > > >> > > ready and released.  For me the important thing is that we have
> a
> > > >> decent
> > > >> > > time series DB that comes with storm by default and is
> pluggable so
> > > we
> > > >> > can
> > > >> > > replace it with something else that has similar capabilities in
> the
> > > >> > future.
> > > >> > >  - Bobby
> > > >> > >
> > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere <
> > > >> > > e.neverme@gmail.com> wrote:
> > > >> > >
> > > >> > >
> > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm absolutely
> ok
> > > to
> > > >> > > discuss this in advance.
> > > >> > >
> > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
> > > e.neverme@gmail.com
> > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Yes it's already in production.
> > > >> > > > The implementation basically follows the design document in
> > > >> > > > https://issues.apache.org/jira/browse/STORM-1329, you can
> take a
> > > >> look
> > > >> > > > first and feel free to ask questions.
> > > >> > > >
> > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <
> kabhwan@gmail.com
> > > >
> > > >> > > wrote:
> > > >> > > >
> > > >> > > >> Hi,
> > > >> > > >>
> > > >> > > >> I got something to do with metrics so I'm seeking the pull
> > > requests
> > > >> > > which
> > > >> > > >> addresses metrics.
> > > >> > > >> And at #753 <https://github.com/apache/storm/pull/753> I
> found
> > > >> Cody
> > > >> > > said
> > > >> > > >> we
> > > >> > > >> (maybe it means Alibaba team) are currently working on
> Metrics
> > > >> Server.
> > > >> > > >> (I also found comment which said there was some talk while
> ago
> > > >> around
> > > >> > > >> integrating Hadoop timeline server. Seems like no one came up
> > > with
> > > >> the
> > > >> > > >> result, and I prefer to avoid big dependency so I'm in favor
> of
> > > >> > Metrics
> > > >> > > >> Server for now.)
> > > >> > > >>
> > > >> > > >> I think that would improve metrics feature of Storm much
> better,
> > > so
> > > >> > I'd
> > > >> > > >> like to see how the work is going. Sure it's only when
> there's no
> > > >> > issue
> > > >> > > >> for
> > > >> > > >> you to work transparently. I just would like to prevent
> > > >> duplication of
> > > >> > > >> work, and would like to help if needed and possible.
> > > >> > > >>
> > > >> > > >> Thanks,
> > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Abhishek Agarwal
>

Re: Question on Metrics Server to Alibaba team

Posted by Jungtaek Lim <ka...@gmail.com>.

Harsha,

That's why I think the new metrics feature of JStorm looks promising.

According to the design doc at https://issues.apache.org/jira/browse/STORM-1329,
there's no distinction between topology stats (which Apache Storm includes
in the worker heartbeat) and built-in metrics (which should be handled by a
separate consumer, as you stated).
All metrics are passed to Nimbus and Nimbus caches them, which implies
we can treat all metrics the same way, and we can also provide built-in metrics
(including custom metrics) to users via the REST API.
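
To make the "Nimbus caches all metrics and treats them the same" idea concrete, here is a minimal sketch. It is not JStorm's or Storm's actual code: the class name NimbusMetricsCacheSketch, the report/snapshot methods, and the metric names are all hypothetical, and it assumes a metric is simply a name mapped to its latest numeric value.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical Nimbus-side cache; an illustration, not a real Storm/JStorm class. */
final class NimbusMetricsCacheSketch {
    // topologyId -> (metricName -> latest reported value).
    // Topology stats, built-in metrics and custom metrics all share this one map.
    private final Map<String, Map<String, Double>> latest = new ConcurrentHashMap<>();

    /** Called whenever a topology reports a batch of metrics. */
    void report(String topologyId, Map<String, Double> batch) {
        latest.computeIfAbsent(topologyId, id -> new ConcurrentHashMap<>()).putAll(batch);
    }

    /** What a REST handler could return for one topology: a sorted copy of the cache. */
    Map<String, Double> snapshot(String topologyId) {
        Map<String, Double> metrics = latest.get(topologyId);
        return metrics == null ? new TreeMap<>() : new TreeMap<>(metrics);
    }
}
```

Because every metric lands in the same structure, a single REST endpoint could serve built-in and custom metrics alike; whether the real design keeps history or only the latest value is something the design doc would have to confirm.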

I thought about a standalone metrics server process that handles all of the
metrics work (maybe TopologyMaster + Nimbus in the design doc), but if the current
implementation of the metrics feature in JStorm can take care of what I'm
assuming, I guess that's good enough.

Since I don't know TopologyMaster well, I just wonder whether there are any
SPOFs (including soft ones) and how metrics behave when a SPOF component goes
down.
Since Cody has given us a starting point to dig into, we can evaluate that
feature before phase 2.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:

> One of the goals of this work, which can probably be addressed in a separate
> JIRA, is how the topology metrics reporter works. Today it's a bolt that's
> part of the topology graph, which means it's another node in the topology DAG
> that needs to be tuned for better performance. Some of our users took
> performance hits by deploying a topology metrics reporter that sends
> metrics to Ganglia. Ideally this collection should be asynchronous and
> not be a node in the topology DAG.
>
> Shipping a default metrics server along with a pluggable option for
> users who want to use Graphite or other timeline servers should be the
> goal.
>
> --Harsha

Re: Question on Metrics Server to Alibaba team

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.

I also want to make sure that Storm can have at least a minimal feedback loop to Nimbus for scheduling purposes.  Having it be pluggable to send metrics to another system is important, but so is the ability for Nimbus to query the data (CPU, memory, process latency, etc.) and adjust scheduling accordingly.  This is really required for automatic elasticity, intelligent resource over-committing, guaranteed SLAs, and many other important features that can differentiate Storm from everything else that is out there.
 - Bobby 
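
As a rough illustration of the feedback loop described above, the sketch below pairs a pluggable store with a query that a scheduler could call. Everything here is an assumption made for the example: the MetricStoreSketch interface, the in-memory implementation, the SchedulerHintSketch helper, the "topology/cpu" key layout and the threshold are hypothetical and do not correspond to an existing Storm API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pluggable store: backends (HBase, Graphite, ...) would implement this. */
interface MetricStoreSketch {
    void put(String key, long timestampMs, double value);

    /** Average of the values with fromMs <= timestamp <= toMs, or NaN if there are none. */
    double average(String key, long fromMs, long toMs);
}

/** Simple default implementation that keeps every point in memory. */
final class InMemoryMetricStoreSketch implements MetricStoreSketch {
    private static final class Point {
        final long timestampMs;
        final double value;

        Point(long timestampMs, double value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private final Map<String, List<Point>> series = new HashMap<>();

    @Override
    public synchronized void put(String key, long timestampMs, double value) {
        series.computeIfAbsent(key, k -> new ArrayList<>()).add(new Point(timestampMs, value));
    }

    @Override
    public synchronized double average(String key, long fromMs, long toMs) {
        double sum = 0.0;
        int count = 0;
        for (Point p : series.getOrDefault(key, Collections.<Point>emptyList())) {
            if (p.timestampMs >= fromMs && p.timestampMs <= toMs) {
                sum += p.value;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}

/** Hypothetical scheduler hook that reads from the store instead of only writing to it. */
final class SchedulerHintSketch {
    static boolean shouldAddWorker(MetricStoreSketch store, String cpuKey,
                                   long fromMs, long toMs, double threshold) {
        double avg = store.average(cpuKey, fromMs, toMs);
        return !Double.isNaN(avg) && avg > threshold;
    }
}
```

The point of the interface is that the scheduler depends only on the query method, so the default store could be swapped for another backend as long as that backend can answer the same time-range query.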

    On Monday, March 21, 2016 11:36 AM, Harsha <st...@harsha.io> wrote:
 

 One of the goals of this work, which can probably be addressed in a separate
JIRA, is how the topology metrics reporter works. Today it's a bolt that's
part of the topology graph, which means it's another node in the topology DAG
that needs to be tuned for better performance. Some of our users took
performance hits by deploying a topology metrics reporter that sends
metrics to Ganglia. Ideally this collection should be asynchronous and
not be a node in the topology DAG.

Shipping a default metrics server along with a pluggable option for
users who want to use Graphite or other timeline servers should be the
goal.

--Harsha



Re: Question on Metrics Server to Alibaba team

Posted by Harsha <st...@harsha.io>.

One of the goals of this work, which can probably be addressed in a separate
JIRA, is how the topology metrics reporter works. Today it's a bolt that's
part of the topology graph, which means it's another node in the topology DAG
that needs to be tuned for better performance. Some of our users took
performance hits by deploying a topology metrics reporter that sends
metrics to Ganglia. Ideally this collection should be asynchronous and
not be a node in the topology DAG.
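
One way to picture "asynchronous and not a node in the DAG" is a reporter that sits beside the tuple-processing thread rather than in its path. The sketch below is only an assumption about how that could look: AsyncMetricsReporterSketch, its record method and the string-encoded data points are hypothetical names, and it uses a plain bounded queue drained by a background thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical asynchronous reporter; not an existing Storm class. */
final class AsyncMetricsReporterSketch implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();
    private final Thread worker;
    private volatile boolean running = true;

    AsyncMetricsReporterSketch(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        // The slow part (sending to Ganglia, Graphite, ...) runs on this thread only.
        this.worker = new Thread(() -> {
            try {
                while (running || !queue.isEmpty()) {
                    String point = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (point != null) {
                        sink.accept(point);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "metrics-reporter");
        worker.setDaemon(true);
        worker.start();
    }

    /** Never blocks the caller: if the queue is full the point is dropped and counted. */
    boolean record(String point) {
        boolean accepted = queue.offer(point);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    long droppedCount() {
        return dropped.get();
    }

    /** Stops accepting work, drains what is already queued, then waits for the worker. */
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The design choice here is to drop data points under back-pressure instead of slowing the topology down; a real implementation might prefer sampling or aggregation before dropping.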

Shipping a default metrics server along with a pluggable option for
users who want to use Graphite or other timeline servers should be the
goal.

--Harsha


On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> @Cody - The design looks good. Does the design allow aggregating metrics
> at the task/executor level? Basically, the number of distinct metrics is
> proportional to the number of distinct tasks. Did you ever run into such a
> use case?

Re: Question on Metrics Server to Alibaba team

Posted by Abhishek Agarwal <ab...@gmail.com>.

@Cody - The design looks good. Does the design allow aggregating metrics
at the task/executor level? Basically, the number of distinct metrics is
proportional to the number of distinct tasks. Did you ever run into such a
use case?
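
To illustrate the cardinality concern, here is a small sketch of rolling task-level values up to the component level. The "component/taskId/metric" key layout and the TaskMetricRollupSketch class are assumptions made up for this example, and summing is only meaningful for counter-style metrics.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical roll-up helper; the key layout is an assumption, not JStorm's format. */
final class TaskMetricRollupSketch {
    /** Input keys look like "component/taskId/metric"; output keys like "component/metric". */
    static Map<String, Long> rollUp(Map<String, Long> perTask) {
        Map<String, Long> perComponent = new TreeMap<>();
        for (Map.Entry<String, Long> entry : perTask.entrySet()) {
            String[] parts = entry.getKey().split("/");
            if (parts.length != 3) {
                continue; // skip keys that do not follow the assumed layout
            }
            // Dropping the task id collapses one series per task into one per component.
            perComponent.merge(parts[0] + "/" + parts[2], entry.getValue(), Long::sum);
        }
        return perComponent;
    }
}
```

With N tasks per component, the per-task map holds N series for each metric while the rolled-up map holds one, which is the trade-off between detail and storage volume raised above.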


On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere <e....@gmail.com> wrote:

> Also, you can read the code from our latest release JStorm 2.1.1.
>
> On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere <e....@gmail.com>
> wrote:
>
> > @Jungtaek,
> > We did some tests on codahale metrics, compared to meters/histograms,
> > counters are quite fast. So we mainly focused on the optimization of
> meters
> > and histograms (they are indeed very slow) including double sampling,
> > changing the clock from ns (System.nanoTime) to ms, etc.
> > You can take a look at the
> > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" class of our
> > sequence-split-merge example code, as the client code entry to metrics.
> > After that, you may dig to TopologyMaster class, which is still part of a
> > topology, and then to TopologyMetricsRunnable, which is a part of nimbus
> > server, finally to MetricUploader plugin, this is where the metrics
> > interface with our "metrics server". Still, there're some nits in the
> code,
> > but I think that should be no big problem.
> >
> > I'd also like to point out that our "metrics server" is not strictly a
> > real metrics server, since most of the duty lies on nimbus server and
> > topology master, it's more appropriate to call it metrics storage. The
> main
> > reason for this is that we don't want to make a heavy-weight metrics
> server
> > out of JStorm, and this makes us very easy to maintain (we have teams
> that
> > specifically maintain HBase/OTS in Alibaba since they're so commonly used
> > in production).
> >
> > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim <ka...@gmail.com>
> wrote:
> >
> >> Thanks Cody and Bobby for the explanation.
> >>
> >> Cody,
> >> I took a look at design doc and looks promising, especially it doesn't
> do
> >> sampling when metric type is 'counter'. As far as I heard (I didn't try
> >> it)
> >> it becomes huge performance hit in Apache Storm when we change sample
> rate
> >> to 1.0.
> >> Could you guide the entry point of metric feature in JStorm to dig into?
> >>
> >> And just a curiosity, did you consider extracting metric feature (which
> is
> >> done with TopologyMasters and Nimbuses) into separate component?
> >> I understood your mention to 'metrics server' as separate component, but
> >> after seeing design doc, feature seems to be implemented on Nimbus.
> >>
> >> Thanks,
> >> Jungtaek Lim (HeartSaVioR)
> >>
> >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere <e....@gmail.com> wrote:
> >>
> >> > JStorm has provided a MetricUploader interface, which is similar to
> >> > IMetricsConsumer in storm, and the underlying implementation is
> >> pluggable,
> >> > you can use HBase, or any other KV store that supports timeline
> queries
> >> or
> >> > even a database(maybe for it's a small cluster). We provide model
> >> classes
> >> > in jstorm-core, as to what kinds of metrics data need to be stored,
> it's
> >> > totally up to the detailed implementation. Our internal implementation
> >> uses
> >> > OTS, which is a product of aliyun (
> https://www.aliyun.com/product/ots/
> >> ),
> >> > but it's easy to adapt to other implementations.
> >> >
> >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> >> <evans@yahoo-inc.com.invalid
> >> > >
> >> > wrote:
> >> >
> >> > > Yes we originally wanted to try and use the Hadoop Timeline Server
> for
> >> > > storm metrics feedback to nimbus + UI + history like server.  But it
> >> was
> >> > > not stable at the time, so we stopped.  For the sake of playing
> nicely
> >> > with
> >> > > the rest of the big data ecosystem I would like to see us support it
> >> as
> >> > an
> >> > > option for metrics collection/query, but until the timeline server
> v2
> >> is
> >> > > ready and released.  For me the important thing is that we have a
> >> decent
> >> > > time series DB that comes with storm by default and is pluggable so
> we
> >> > can
> >> > > replace it with something else that has similar capabilities in the
> >> > future.
> >> > >  - Bobby
> >> > >
> >> > >     On Friday, March 18, 2016 10:39 AM, Cody Innowhere <
> >> > > e.neverme@gmail.com> wrote:
> >> > >
> >> > >
> >> > >  It's actually in Phase 2 of porting JStorm, but I'm absolutely ok
> to
> >> > > discuss this in advance.
> >> > >
> >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
> e.neverme@gmail.com
> >> >
> >> > > wrote:
> >> > >
> >> > > > Yes it's already in production.
> >> > > > The implementation basically follows the design document in
> >> > > > https://issues.apache.org/jira/browse/STORM-1329, you can take a
> >> look
> >> > > > first and feel free to ask questions.
> >> > > >
> >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <kabhwan@gmail.com
> >
> >> > > wrote:
> >> > > >
> >> > > >> Hi,
> >> > > >>
> >> > > >> I got something to do with metrics so I'm seeking the pull
> requests
> >> > > which
> >> > > >> addresses metrics.
> >> > > >> And at #753 <https://github.com/apache/storm/pull/753> I found
> >> Cody
> >> > > said
> >> > > >> we
> >> > > >> (maybe it means Alibaba team) are currently working on Metrics
> >> Server.
> >> > > >> (I also found comment which said there was some talk while ago
> >> around
> >> > > >> integrating Hadoop timeline server. Seems like no one came up
> with
> >> the
> >> > > >> result, and I prefer to avoid big dependency so I'm in favor of
> >> > Metrics
> >> > > >> Server for now.)
> >> > > >>
> >> > > >> I think that would improve metrics feature of Storm much better,
> so
> >> > I'd
> >> > > >> like to see how the work is going. Sure it's only when there's no
> >> > issue
> >> > > >> for
> >> > > >> you to work transparently. I just would like to prevent
> >> duplication of
> >> > > >> work, and would like to help if needed and possible.
> >> > > >>
> >> > > >> Thanks,
> >> > > >> Jungtaek Lim (HeartSaVioR)
> >> > > >>
> >> > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >
> >
>



-- 
Regards,
Abhishek Agarwal

Re: Question on Metrics Server to Alibaba team

Posted by Cody Innowhere <e....@gmail.com>.

Also, you can read the code in our latest release, JStorm 2.1.1.

On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere <e....@gmail.com>
wrote:

> @Jungtaek,
> We ran some tests on codahale metrics: compared to meters and histograms,
> counters are quite fast. So we mainly focused on optimizing meters and
> histograms (they are indeed very slow), including double sampling,
> changing the clock from ns (System.nanoTime) to ms, etc.
> You can take a look at the
> "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" class of our
> sequence-split-merge example code, which is the client-side entry point to
> metrics. After that, you can dig into the TopologyMaster class, which is
> still part of a topology, then TopologyMetricsRunnable, which is part of
> the nimbus server, and finally the MetricUploader plugin, which is where
> the metrics interface with our "metrics server". There are still some nits
> in the code, but I don't think they are a big problem.
>
> I'd also like to point out that our "metrics server" is not strictly a real
> metrics server: since most of the work is done by the nimbus server and the
> topology master, it's more appropriate to call it metrics storage. The main
> reason is that we don't want to make a heavyweight metrics server out of
> JStorm, and this makes it very easy for us to maintain (we have teams at
> Alibaba that specifically maintain HBase/OTS, since they're so commonly
> used in production).

Re: Question on Metrics Server to Alibaba team

Posted by Cody Innowhere <e....@gmail.com>.

@Jungtaek,
We ran some tests on codahale metrics: compared to meters and histograms,
counters are quite fast. So we mainly focused on optimizing meters and
histograms (they are indeed very slow), including double sampling,
changing the clock from ns (System.nanoTime) to ms, etc.
You can take a look at the
"com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" class of our
sequence-split-merge example code, which is the client-side entry point to
metrics. After that, you can dig into the TopologyMaster class, which is
still part of a topology, then TopologyMetricsRunnable, which is part of
the nimbus server, and finally the MetricUploader plugin, which is where
the metrics interface with our "metrics server". There are still some nits
in the code, but I don't think they are a big problem.

I'd also like to point out that our "metrics server" is not strictly a real
metrics server: since most of the work is done by the nimbus server and the
topology master, it's more appropriate to call it metrics storage. The main
reason is that we don't want to make a heavyweight metrics server out of
JStorm, and this makes it very easy for us to maintain (we have teams at
Alibaba that specifically maintain HBase/OTS, since they're so commonly
used in production).
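
To make the counter-versus-histogram point above concrete, here is a
minimal sketch in plain Java. It is not JStorm or codahale code: the class
names ExactCounter and SampledHistogram and the "keep every Nth value"
policy are assumptions made up for illustration, and it does not claim to
reproduce JStorm's double sampling. It only shows the general idea that a
counter can afford to record every event, while a more expensive metric
records a sample and reads a coarse millisecond clock.

```java
import java.util.concurrent.atomic.LongAdder;

public class SampledMetricsSketch {

    /** Exact counter: every event is recorded, no sampling involved. */
    static final class ExactCounter {
        private final LongAdder count = new LongAdder();

        void inc() { count.increment(); }

        long get() { return count.sum(); }
    }

    /** Hypothetical sampled histogram: keeps only every Nth offered value. */
    static final class SampledHistogram {
        private final int sampleEvery;
        private long seen;      // values offered so far
        private long recorded;  // values actually kept
        private long sum;       // sum of the kept values

        SampledHistogram(int sampleEvery) { this.sampleEvery = sampleEvery; }

        synchronized void update(long value) {
            if (seen++ % sampleEvery == 0) {
                recorded++;
                sum += value;
            }
        }

        synchronized long recorded() { return recorded; }

        synchronized double mean() {
            return recorded == 0 ? 0.0 : (double) sum / recorded;
        }
    }

    public static void main(String[] args) {
        ExactCounter emitted = new ExactCounter();
        SampledHistogram latencyMs = new SampledHistogram(10);
        for (int i = 0; i < 1000; i++) {
            long start = System.currentTimeMillis(); // ms clock instead of nanoTime
            emitted.inc();
            latencyMs.update(System.currentTimeMillis() - start);
        }
        System.out.println("emitted=" + emitted.get()
                + " sampledLatencies=" + latencyMs.recorded());
    }
}
```

With 1000 events and a sample-every-10 policy, the counter sees all 1000
events while the histogram keeps 100 values.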

On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> Thanks Cody and Bobby for the explanation.
>
> Cody,
> I took a look at the design doc and it looks promising, especially that it
> doesn't do sampling when the metric type is 'counter'. As far as I've heard
> (I haven't tried it), changing the sample rate to 1.0 causes a huge
> performance hit in Apache Storm.
> Could you point me to the entry point of the metrics feature in JStorm so I
> can dig into it?
>
> And just out of curiosity, did you consider extracting the metrics feature
> (which is handled by the TopologyMasters and Nimbuses) into a separate
> component?
> I understood your mention of a 'metrics server' to mean a separate
> component, but after reading the design doc, the feature seems to be
> implemented on Nimbus.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)

Re: Question on Metrics Server to Alibaba team

Posted by Jungtaek Lim <ka...@gmail.com>.

Thanks Cody and Bobby for the explanation.

Cody,
I took a look at the design doc and it looks promising, especially that it
doesn't do sampling when the metric type is 'counter'. As far as I've heard
(I haven't tried it), changing the sample rate to 1.0 causes a huge
performance hit in Apache Storm.
Could you point me to the entry point of the metrics feature in JStorm so I
can dig into it?

And just out of curiosity, did you consider extracting the metrics feature
(which is handled by the TopologyMasters and Nimbuses) into a separate
component?
I understood your mention of a 'metrics server' to mean a separate
component, but after reading the design doc, the feature seems to be
implemented on Nimbus.

Thanks,
Jungtaek Lim (HeartSaVioR)
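
As a rough illustration of why a sample rate of 1.0 is costly, here is a
small sketch in plain Java. It is not Storm's actual sampler: the class
SampledCounter and its "one event per period, scaled back up" policy are
assumptions for illustration only. The "sample rate" is presumably the
topology.stats.sample.rate setting. The sketch shows that the estimated
total stays the same at any rate, while the number of bookkeeping
operations grows as the rate approaches 1.0.

```java
public class SampleRateSketch {

    /** Hypothetical sampled counter: counts 1 of every (1/rate) events, scaled up. */
    static final class SampledCounter {
        private final int period;     // events per sample, i.e. round(1 / rate)
        private long seen;            // events offered so far
        private long estimate;        // scaled-up estimate of the event count
        private long bookkeepingOps;  // how often the stats path actually ran

        SampledCounter(double rate) {
            if (rate <= 0.0 || rate > 1.0) {
                throw new IllegalArgumentException("rate must be in (0, 1]");
            }
            this.period = (int) Math.round(1.0 / rate);
        }

        void onEvent() {
            if (seen++ % period == 0) {
                estimate += period;   // one sampled event stands in for `period` events
                bookkeepingOps++;
            }
        }

        long estimate() { return estimate; }

        long bookkeepingOps() { return bookkeepingOps; }
    }

    public static void main(String[] args) {
        for (double rate : new double[] {0.05, 1.0}) {
            SampledCounter counter = new SampledCounter(rate);
            for (int i = 0; i < 10_000; i++) {
                counter.onEvent();
            }
            System.out.println("rate=" + rate
                    + " estimate=" + counter.estimate()
                    + " bookkeepingOps=" + counter.bookkeepingOps());
        }
    }
}
```

For 10,000 events, both rates estimate 10,000, but rate 0.05 runs the
bookkeeping 500 times and rate 1.0 runs it 10,000 times.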

On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere <e....@gmail.com> wrote:

> JStorm provides a MetricUploader interface, which is similar to
> IMetricsConsumer in Storm, and the underlying implementation is pluggable:
> you can use HBase, any other KV store that supports timeline queries, or
> even a database (maybe if it's a small cluster). We provide model classes
> in jstorm-core; as for what kinds of metrics data need to be stored, that's
> totally up to the specific implementation. Our internal implementation uses
> OTS, which is an Aliyun product (https://www.aliyun.com/product/ots/),
> but it's easy to adapt to other implementations.

Re: Question on Metrics Server to Alibaba team

Posted by Cody Innowhere <e....@gmail.com>.

JStorm provides a MetricUploader interface, which is similar to
IMetricsConsumer in Storm, and the underlying implementation is pluggable:
you can use HBase, any other KV store that supports timeline queries, or
even a database (maybe if it's a small cluster). We provide model classes
in jstorm-core; as for what kinds of metrics data need to be stored, that's
totally up to the specific implementation. Our internal implementation uses
OTS, which is an Aliyun product (https://www.aliyun.com/product/ots/),
but it's easy to adapt to other implementations.
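
The pluggable-storage idea described above could be sketched roughly as
follows, in plain Java. This is not the real JStorm MetricUploader
signature: the MetricStore interface, the MetricPoint class and the
in-memory implementation are all hypothetical names chosen for
illustration. The point is only that callers depend on a small contract
(upload plus a time-range query), so HBase, OTS or a database could sit
behind it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PluggableMetricStoreSketch {

    /** One metric sample; a stand-in for real model classes. */
    static final class MetricPoint {
        final String name;
        final long timestampMs;
        final double value;

        MetricPoint(String name, long timestampMs, double value) {
            this.name = name;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    /** Hypothetical storage contract; any backend can implement it. */
    interface MetricStore {
        void upload(MetricPoint point);

        /** Returns points for one metric with fromMs <= timestamp <= toMs. */
        List<MetricPoint> query(String name, long fromMs, long toMs);
    }

    /** Toy backend: one sorted map per metric name, kept in memory. */
    static final class InMemoryMetricStore implements MetricStore {
        private final Map<String, TreeMap<Long, MetricPoint>> series = new HashMap<>();

        @Override
        public synchronized void upload(MetricPoint point) {
            series.computeIfAbsent(point.name, k -> new TreeMap<>())
                  .put(point.timestampMs, point);
        }

        @Override
        public synchronized List<MetricPoint> query(String name, long fromMs, long toMs) {
            TreeMap<Long, MetricPoint> points = series.get(name);
            if (points == null) {
                return new ArrayList<>();
            }
            return new ArrayList<>(points.subMap(fromMs, true, toMs, true).values());
        }
    }

    public static void main(String[] args) {
        MetricStore store = new InMemoryMetricStore(); // swap the backend here
        store.upload(new MetricPoint("emitted", 1000L, 5));
        store.upload(new MetricPoint("emitted", 2000L, 9));
        store.upload(new MetricPoint("emitted", 3000L, 12));
        System.out.println(store.query("emitted", 1500L, 3000L).size());
    }
}
```

Only the line that constructs the store names a concrete backend, which is
what makes the storage replaceable.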

On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans <ev...@yahoo-inc.com.invalid>
wrote:

> Yes, we originally wanted to try to use the Hadoop Timeline Server for
> storm metrics feedback to nimbus + UI + history like server.  But it was
> not stable at the time, so we stopped.  For the sake of playing nicely with
> the rest of the big data ecosystem I would like to see us support it as an
> option for metrics collection/query, but not until the timeline server v2
> is ready and released.  For me the important thing is that we have a decent
> time series DB that comes with storm by default and is pluggable, so we can
> replace it with something else that has similar capabilities in the future.
>  - Bobby

Re: Question on Metrics Server to Alibaba team

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.

Yes, we originally wanted to try to use the Hadoop Timeline Server for storm metrics feedback to nimbus + UI + history like server.  But it was not stable at the time, so we stopped.  For the sake of playing nicely with the rest of the big data ecosystem I would like to see us support it as an option for metrics collection/query, but not until the timeline server v2 is ready and released.  For me the important thing is that we have a decent time series DB that comes with storm by default and is pluggable, so we can replace it with something else that has similar capabilities in the future.
 - Bobby 

    On Friday, March 18, 2016 10:39 AM, Cody Innowhere <e....@gmail.com> wrote:
 

It's actually in Phase 2 of porting JStorm, but I'm absolutely OK with
discussing this in advance.


Re: Question on Metrics Server to Alibaba team

Posted by Cody Innowhere <e....@gmail.com>.

It's actually in Phase 2 of porting JStorm, but I'm absolutely OK with
discussing this in advance.

On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <e....@gmail.com>
wrote:

> Yes, it's already in production.
> The implementation basically follows the design document in
> https://issues.apache.org/jira/browse/STORM-1329; take a look first and
> feel free to ask questions.

Re: Question on Metrics Server to Alibaba team

Posted by Cody Innowhere <e....@gmail.com>.

Yes, it's already in production.
The implementation basically follows the design document in
https://issues.apache.org/jira/browse/STORM-1329; take a look first and
feel free to ask questions.

On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <ka...@gmail.com> wrote:

> Hi,
>
> I got something to do with metrics so I'm seeking the pull requests which
> addresses metrics.
> And at #753 <https://github.com/apache/storm/pull/753> I found Cody said
> we
> (maybe it means Alibaba team) are currently working on Metrics Server.
> (I also found comment which said there was some talk while ago around
> integrating Hadoop timeline server. Seems like no one came up with the
> result, and I prefer to avoid big dependency so I'm in favor of Metrics
> Server for now.)
>
> I think that would improve metrics feature of Storm much better, so I'd
> like to see how the work is going. Sure it's only when there's no issue for
> you to work transparently. I just would like to prevent duplication of
> work, and would like to help if needed and possible.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>