Posted to user@eagle.apache.org by "Zhang, Edward (GDI Hadoop)" <yo...@ebay.com> on 2016/01/08 07:53:08 UTC
Re: [Discuss] Hadoop metrics,job,GC monitoring

Please review the latest design for monitoring Hadoop native metrics.

https://cwiki.apache.org/confluence/display/EAG/Hadoop+Native+Metrics+Monitoring


Thanks
Edward

On 12/14/15, 23:48, "Zhang, Edward (GDI Hadoop)" <yo...@ebay.com> wrote:

>Started some documentation on
>https://cwiki.apache.org/confluence/display/EAG/Hadoop+Native+Metrics+Monitoring
>
>Thanks to Hao, Ralph, and others for the offline review and suggestions;
>I will improve the documentation accordingly.
>
>In terms of the question "if a user adds a new metric to monitor, how
>would the processing layer change accordingly?"
>
>I think that if a user adds a new metric, the metric should be added to
>the metadata table, so that the data source layer and the processing
>layer see a consistent list of metrics.
>
>But we still need to bake this design; please comment with whatever
>thoughts you have.
>
>Thanks
>Edward
>
>
>On 12/14/15, 11:04, "Arun Manoharan" <ar...@apache.org> wrote:
>
>>Thanks, Edward, for starting the thread. I think it is important to have
>>job monitoring for MR/Spark workloads, for both cluster performance and
>>availability.
>>
>>But it will be beneficial to have an extensible framework where users can
>>create business rules like "I want an alert when NN is in safemode or RM
>>is flipping", etc.
>>
>>Thanks,
>>Arun
>>
>>On Mon, Dec 14, 2015 at 10:58 AM, Zhang, Edward (GDI Hadoop) <
>>yonzhang@ebay.com> wrote:
>>
>>> Hi Eagle devs/users,
>>>
>>> As proposed in the Apache Eagle incubator proposal, Eagle will start
>>> design/dev to support Hadoop system monitoring in addition to security
>>> monitoring. This includes Hadoop native metrics, job, GC log, etc.
>>>
>>> The community also showed interest in Hadoop system monitoring by Eagle
>>> when we recently talked about the Eagle product at public conferences,
>>> meetups, etc.
>>>
>>> Take Hadoop native metrics as an example: first, those metrics are
>>> very valuable in determining system health status; second, collecting,
>>> visualizing, and alerting on a huge number of metrics is very
>>> challenging. We need to think about declarative collection, dynamic
>>> aggregation, metric storage, a metric query engine, etc.
>>>
>>> Besides technical design, comprehensive policies/rules are also
>>> valuable to share in the community. Those policies/rules represent
>>> real-world best practices for managing large Hadoop clusters.
>>>
>>> Please share any suggestions, whether on engineering design or
>>> business policies/rules.
>>>
>>> Thanks
>>> Edward
>>>
>>>
>