Posted to issues@ambari.apache.org by "Hari Sekhon (JIRA)" <ji...@apache.org> on 2018/07/19 10:33:00 UTC

[jira] [Updated] (AMBARI-24306) Ambari Metrics + Grafana - add LastGcInfo duration graphs for all server components for all GCs - G1GC Young + Old Gens, CMS and ParallelNew

     [ https://issues.apache.org/jira/browse/AMBARI-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated AMBARI-24306:
---------------------------------
    Description: 
Feature request to add Grafana graphs of the last value (not an average, please) of LastGcInfo duration for all of the major garbage collectors:
 * G1GC Young Generation
 * G1GC Old Generation
 * CMS
 * ParNew (ParallelNew)

CMS and ParNew examples taken from NameNode JMX metrics:
{code:json}
  }, {
    "name" : "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep",
    "modelerType" : "sun.management.GarbageCollectorImpl",
    "LastGcInfo" : {
      "GcThreadCount" : 11,
      "duration" : 5206,
...
  }, {
    "name" : "java.lang:type=GarbageCollector,name=ParNew",
    "modelerType" : "sun.management.GarbageCollectorImpl",
    "LastGcInfo" : {
      "GcThreadCount" : 11,
      "duration" : 6,
{code}
G1GC Young and Old Generation examples taken from RegionServer JMX metrics:
{code:json}
  }, {
    "name" : "java.lang:type=GarbageCollector,name=G1 Young Generation",
    "modelerType" : "sun.management.GarbageCollectorImpl",
    "LastGcInfo" : {
      "GcThreadCount" : 24,
      "duration" : 120,
{code}
{code:json}
  }, {
    "name" : "java.lang:type=GarbageCollector,name=G1 Old Generation",
    "modelerType" : "sun.management.GarbageCollectorImpl",
    "LastGcInfo" : {
      "GcThreadCount" : 24,
      "duration" : 19641,
{code}
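As a rough sketch of where these numbers could be pulled from, the snippet below reads LastGcInfo.duration for every GarbageCollector MBean from a component's /jmx servlet; the host/port and the use of the requests library are assumptions for illustration, not anything Ambari Metrics does today:
{code:python}
# Sketch only: read LastGcInfo.duration (milliseconds) for each
# GarbageCollector MBean from a Hadoop-style /jmx JSON servlet
# (NameNode, RegionServer, etc.).  Host and port are placeholders.
import requests

JMX_URL = "http://namenode.example.com:50070/jmx"  # hypothetical endpoint

def last_gc_durations(url=JMX_URL):
    # The JMX JSON servlet accepts an ObjectName filter via the 'qry' parameter.
    resp = requests.get(url, params={"qry": "java.lang:type=GarbageCollector,*"},
                        timeout=10)
    resp.raise_for_status()
    durations = {}
    for bean in resp.json().get("beans", []):
        last_gc = bean.get("LastGcInfo") or {}   # absent until the first collection
        durations[bean["name"]] = last_gc.get("duration")
    return durations

if __name__ == "__main__":
    for name, duration_ms in last_gc_durations().items():
        print(f"{name}: last GC took {duration_ms} ms")
{code}
Graphing that raw per-collector duration, rather than an average, is exactly the value these panels should expose.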
Yes, this old-gen GC is atrocious, which is why I am here to tune it, but it helps if this is monitored properly in the first place, so the problem is visible before it shows up as random RegionServer deaths due to long GC pauses.

Right now Ambari's Grafana has GCTimeMillis, which would make one think there is no problem, as it only shows an averaged-out ~40 ms per second of GC time; that is not much help for spotting these long GC pauses.
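
As a rough illustration of why an averaged metric hides this (the window length below is an assumption made only for the arithmetic, not how AMS actually aggregates):
{code:python}
# Illustration only: a single long stop-the-world pause nearly disappears
# once GC time is averaged over a wide reporting window.
pause_ms = 19641      # the G1 Old Generation pause from the excerpt above
window_s = 8 * 60     # assumed ~8-minute averaging window, for illustration

print(f"averaged GC time  : {pause_ms / window_s:.0f} ms/s")  # ~41 ms/s, looks benign
print(f"actual worst pause: {pause_ms / 1000:.1f} s")         # ~19.6 s stop-the-world
{code}
A last-value LastGcInfo duration graph would show the 19.6 s pause directly instead of smearing it into a few tens of milliseconds per second.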

> Ambari Metrics + Grafana - add LastGcInfo duration graphs for all server components for all GCs - G1GC Young + Old Gens, CMS and ParallelNew
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-24306
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24306
>             Project: Ambari
>          Issue Type: New Feature
>          Components: ambari-metrics, metrics
>            Reporter: Hari Sekhon
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)