Posted to issues@ambari.apache.org by "Hari Sekhon (JIRA)" <ji...@apache.org> on 2018/07/12 14:06:00 UTC

[jira] [Updated] (AMBARI-24244) Grafana HBase GC Time graph wrong / misleading - hiding large GC pauses

     [ https://issues.apache.org/jira/browse/AMBARI-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated AMBARI-24244:
---------------------------------
    Summary: Grafana HBase GC Time graph wrong / misleading - hiding large GC pauses  (was: Grafana HBase GC Time graph showing very wrong GC times (off by two dozen secs))

> Grafana HBase GC Time graph wrong / misleading - hiding large GC pauses
> -----------------------------------------------------------------------
>
>                 Key: AMBARI-24244
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24244
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-metrics, metrics
>    Affects Versions: 2.5.2
>            Reporter: Hari Sekhon
>            Priority: Major
>
> Ambari's built-in Grafana "JVM GC Times" graph in the HBase - RegionServers dashboard is very wrong and doesn't reflect the pause times I've grepped out of the HBase RegionServer logs from util.JvmPauseMonitor.
> I've inherited a very heavily loaded HBase + OpenTSDB cluster where RegionServers are being lost to GC pauses of around 30 seconds(!), causing ZooKeeper + HMaster to declare them dead. The Grafana graphs show peaks of only around 70ms because the GC time spent is averaged across all seconds, which smooths out the peaks so that no problem is visible. If you are going to use GCTimeMillis then I believe you need to divide it by GCCount.
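> As a rough illustration of the divide-by-GCCount point, here is a minimal sketch using two made-up successive samples of the cumulative GCTimeMillis / GCCount counters (the values are illustrative, not from this cluster):
> {code:java}
> // Two successive samples of the cumulative GC counters, taken 60s apart (illustrative values).
> long gcTimeMillisPrev = 1_200_000, gcTimeMillisNow = 1_230_000; // +30,000 ms spent in GC
> long gcCountPrev = 4_500, gcCountNow = 4_501;                   // +1 collection
>
> long deltaTimeMs = gcTimeMillisNow - gcTimeMillisPrev; // 30,000 ms
> long deltaCount  = gcCountNow - gcCountPrev;           // 1
>
> // Averaging the GC time over the whole 60s window smooths the spike away:
> double msPerSecond = deltaTimeMs / 60.0;                // 500 ms/s - looks harmless on a graph
>
> // Dividing by the number of collections recovers the actual pause length:
> double msPerGc = deltaCount == 0 ? 0 : (double) deltaTimeMs / deltaCount; // 30,000 ms - the 30s pause
> {code}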
> Otherwise I believe this is actually the wrong metric to be watching; instead the following attribute from the HBase JMX should be monitored using its last value, which does show the significant GC time spent:
> {code:java}
> java.lang:type=GarbageCollector,name=G1 Old Generation -> LastGcInfo -> duration{code}
> Obviously make it use a regex to match whichever garbage collector is in use, whether G1 or CMS etc.:
> {code:java}
> java.lang:type=GarbageCollector,name=.*Old Gen.*  -> LastGcInfo -> duration{code}
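> For reference, a minimal sketch of reading that attribute on a HotSpot JVM (it queries the local platform MBeanServer; a collector hitting a RegionServer would connect to its remote JMX port instead, and the class name here is just illustrative):
> {code:java}
> import java.lang.management.ManagementFactory;
> import java.util.Set;
> import javax.management.MBeanServerConnection;
> import javax.management.ObjectName;
> import javax.management.openmbean.CompositeData;
>
> public class LastGcPause {
>     public static void main(String[] args) throws Exception {
>         MBeanServerConnection mbs = ManagementFactory.getPlatformMBeanServer();
>         // The wildcard matches whichever collectors are in use (G1 Old Generation, ConcurrentMarkSweep, ...).
>         Set<ObjectName> gcBeans =
>                 mbs.queryNames(new ObjectName("java.lang:type=GarbageCollector,name=*"), null);
>         for (ObjectName gc : gcBeans) {
>             // LastGcInfo is a CompositeData whose "duration" field is the length of the most recent GC in ms.
>             CompositeData lastGc = (CompositeData) mbs.getAttribute(gc, "LastGcInfo");
>             if (lastGc != null) {
>                 long durationMs = (Long) lastGc.get("duration");
>                 System.out.println(gc.getKeyProperty("name") + " last GC: " + durationMs + " ms");
>             }
>         }
>     }
> }
> {code}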
> Right now the GC Times graph is worse than useless: it's misleading, as it implies there are no GC issues when there are in fact very large, very severe GC pauses on this cluster.
> This is a vanilla Ambari-deployed Grafana with Ambari Metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)