Posted to user@hbase.apache.org by Mark Bonetti <ma...@gmail.com> on 2018/04/06 10:13:54 UTC

Which monitoring metrics to alert on?

Hi,
I'm building a monitoring system for HBase and want to set up default
alerts (threshold or anomaly) on 2-3 key metrics everyone who uses HBase
typically wants to alert on, but I don't yet have production-grade
experience with HBase.

Importantly, alert rules have to be generally useful, so can't be on
metrics whose values vary wildly based on the size of deployment.

In other words, which metrics would be most significant indicators that
something went wrong with your HBase?

I thought the best place to find experienced HBase users, who would find
answering this question trivial, would be here.

Thanks very much,
Mark

Re: Which monitoring metrics to alert on?

Posted by Mark Bonetti <ma...@gmail.com>.

Hubbert, no worries, thanks for the effort regardless.

Sudhir,
thanks for that.
Yes, each server will have a monitoring agent (that sends back metrics)
installed.

On Sun, Apr 8, 2018 at 3:15 AM, sudhir patil <sp...@gmail.com>
wrote:

> A few important things to monitor, off the top of my head:
>
> Compaction queue size, compaction size (size of all files in compaction)
> GC pause time, GC count (highly correlated with compactions)
> IPC read/write call size
> Slow query logs
> Number of failed regions from canary tests
> Replication queue size
>
> It's better to monitor these metrics at the region server level to detect
> issues. E.g., overall cluster GC may be around average while all of the GC
> pauses are happening on a single region server; that is very difficult to
> find unless you track these metrics per region server.
>
> -sudhir

Re: Which monitoring metrics to alert on?

Posted by sudhir patil <sp...@gmail.com>.

A few important things to monitor, off the top of my head:

Compaction queue size, compaction size (size of all files in compaction)
GC pause time, GC count (highly correlated with compactions)
IPC read/write call size
Slow query logs
Number of failed regions from canary tests
Replication queue size

It's better to monitor these metrics at the region server level to detect
issues. E.g., overall cluster GC may be around average while all of the GC
pauses are happening on a single region server; that is very difficult to
find unless you track these metrics per region server.
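
To illustrate the per-region-server point, here is a minimal sketch (not
from the thread) that compares each server's GC pause time against the
cluster median instead of a fixed cluster-wide number. The server names,
sample values, and the ratio/floor_ms knobs are all hypothetical; how the
samples are collected is left to your monitoring agent.

```python
from statistics import median


def flag_outlier_servers(gc_pause_ms, ratio=3.0, floor_ms=100.0):
    """Return region servers whose GC pause time is far above the cluster median.

    gc_pause_ms maps a region server name to its GC pause time (ms) over the
    last collection interval. ratio and floor_ms are hypothetical tuning
    knobs: a server is flagged when it exceeds max(median * ratio, floor_ms).
    """
    if not gc_pause_ms:
        return []
    cluster_median = median(gc_pause_ms.values())
    threshold = max(cluster_median * ratio, floor_ms)
    return sorted(name for name, value in gc_pause_ms.items() if value > threshold)


# Hypothetical samples: one server does almost all of the GC work.
samples = {"rs1": 40.0, "rs2": 55.0, "rs3": 35.0, "rs4": 2400.0, "rs5": 50.0}

cluster_mean = sum(samples.values()) / len(samples)
print(f"cluster mean: {cluster_mean:.0f} ms")
print("outliers:", flag_outlier_servers(samples))
```

Because the threshold is relative to the cluster's own median, the same rule
can be applied to small and large deployments without retuning, which is one
way to meet the "generally useful" requirement from the original question.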

-sudhir

On Fri, 6 Apr 2018 at 11:27 PM, Hubbert Smith <hu...@hubbertsmith.com>
wrote:

> OK, guilty as charged. My imagination got away from me.
> You just wanted to monitor your HBase, not your hardware... OK then.

Re: Which monitoring metrics to alert on?

Posted by Hubbert Smith <hu...@hubbertsmith.com>.

OK, guilty as charged. My imagination got away from me.
You just wanted to monitor your HBase, not your hardware... OK then.

On Fri, Apr 6, 2018 at 4:13 AM, Mark Bonetti <ma...@gmail.com>
wrote:

> Hi,
> I'm building a monitoring system for HBase and want to set up default
> alerts (threshold or anomaly) on 2-3 key metrics everyone who uses HBase
> typically wants to alert on, but I don't yet have production-grade
> experience with HBase.
>
> Importantly, alert rules have to be generally useful, so can't be on
> metrics whose values vary wildly based on the size of deployment.
>
> In other words, which metrics would be most significant indicators that
> something went wrong with your HBase?
>
> I thought the best place to find experienced HBase users, who would find
> answering this question trivial, would be here.
>
> Thanks very much,
> Mark
>



-- 
Hubbert@hubbertsmith.com | 385 321 0757  |   LinkedIN
<http://tinyurl.com/7v5eu2p>
Linkedin Learning: Storage Foundations Cert Prep: SNCP Foundations S10-110
<https://www.linkedin.com/learning/cert-prep-sncp-foundations-s10-110/storage-and-business-and-career-path>

Re: Which monitoring metrics to alert on?

Posted by Hubbert Smith <hu...@hubbertsmith.com>.

I'd suggest storage-related metrics. Storage device failure is sort of a
big deal: storage is where the valuable data sits, and device failures
impact everything. I suggest your goal be to identify which SSDs, HDDs and
servers are reliable, and which are unreliable.

There are tools for this: https://en.wikipedia.org/wiki/S.M.A.R.T.

Start of day: discover the server, server type and server thermals, and
discover the WWID of each HDD/SSD.
MAC address: this is your unique server database attribute.
WWID: each storage device (HDD or SSD) has a world wide ID (WWID), a
unique ID like a MAC address. This is your unique HDD/SSD database
attribute.
Also capture HDD/SSD manufacturer, drive type, total drive capacity, and
drive capacity consumed/free.

Monitor: device uptime, device failures, and device failure rates
(e.g. drive 322 installed 2018-jan-20 and failed 2018-mar-20). Get the idea?
If possible, capture what caused a device failure (sometimes possible if you
capture the errors leading up to the failure).
(Not all HDDs are the same; HDD failure rates vary wildly. Not all SSDs are
the same; SSD failure rates and failure causes vary wildly.)
(Hint: pay attention to SSD failures as compared to HDD failures, and pay
attention to the frequency or infrequency of SSD failures related to write
endurance.)
(Hint: not all server systems are the same. HDDs don't do well in high-heat
or high-vibration environments. Capture system thermals and system types,
and you will soon see a pattern showing well-designed and poorly designed
server systems.)
It's useful to know which HDDs/SSDs fail a lot or fail a little, and which
server systems contribute to HDD failures and which ones don't.
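
As a rough sketch of the bookkeeping described above (not something from the
thread itself), the snippet below keys each drive by WWID and computes
failures per 1000 drive-days for each drive model. The record layout, WWIDs,
and model names are made up for illustration; only the drive 322 dates come
from the example above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical inventory records: (wwid, model, installed, failed or None).
DRIVES = [
    ("wwid-322", "hdd-model-a", date(2018, 1, 20), date(2018, 3, 20)),
    ("wwid-323", "hdd-model-a", date(2018, 1, 20), None),
    ("wwid-400", "ssd-model-b", date(2018, 1, 20), None),
]


def failure_stats(drives, today):
    """Per model: (failures, drive-days in service, failures per 1000 drive-days)."""
    failures = defaultdict(int)
    drive_days = defaultdict(int)
    for _wwid, model, installed, failed in drives:
        # A drive that has not failed is still accumulating service time.
        end = failed if failed is not None else today
        drive_days[model] += (end - installed).days
        if failed is not None:
            failures[model] += 1
    return {
        model: (failures[model], days, 1000.0 * failures[model] / days if days else 0.0)
        for model, days in drive_days.items()
    }


stats = failure_stats(DRIVES, today=date(2018, 4, 6))
for model, (failed, days, rate) in sorted(stats.items()):
    print(f"{model}: {failed} failed, {days} drive-days, {rate:.2f} per 1000 drive-days")
```

Normalising by drive-days rather than counting raw failures keeps the
comparison fair between models that were installed at different times or in
different quantities.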

Just my two cents.

On Fri, Apr 6, 2018 at 4:13 AM, Mark Bonetti <ma...@gmail.com>
wrote:

> Hi,
> I'm building a monitoring system for HBase and want to set up default
> alerts (threshold or anomaly) on 2-3 key metrics everyone who uses HBase
> typically wants to alert on, but I don't yet have production-grade
> experience with HBase.
>
> Importantly, alert rules have to be generally useful, so can't be on
> metrics whose values vary wildly based on the size of deployment.
>
> In other words, which metrics would be most significant indicators that
> something went wrong with your HBase?
>
> I thought the best place to find experienced HBase users, who would find
> answering this question trivial, would be here.
>
> Thanks very much,
> Mark
>
