You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Misha Dmitriev (JIRA)" <ji...@apache.org> on 2017/11/01 00:27:01 UTC

[jira] [Commented] (HADOOP-14960) Add GC time percentage monitor/alerter

    [ https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16229368#comment-16229368 ] 

Misha Dmitriev commented on HADOOP-14960:
-----------------------------------------

Thank you for the review, [~xiaochen]. See my responses inline below:

{quote}
High level, have you thought about using 1 data structure to hold both the gcPause and timestamp ring buffers? I think a SortedMap would fit well in our use case. The current way (2 arrays) may be the most efficient, I'd prefer readability here since I assume this would be a very small part of memory consumption for the service running it.
{quote}

I thought, but this makes my performance-oriented soul hurt. After optimizing so many applications storing numerical data in suboptimal format, I can hardly force myself to use an object-oriented data structure for something that I know in advance is just a fixed-size set of ints... And then, I think a ring buffer is not the most sophisticated/unreadable data structure. We could use something like an ArrayDeque here (which employs a ring buffer internally), but will still need to have similar code to move to the first entry that is within the observation window, etc.

{quote}
Suggest to do some input validation in GcTimeMonitor constructor: timestamps should not be negative, and also should not create too big a buffer size (this may go away if we change data structures ). maxGcTimePercentage also only makes sense for a (0, 100) value.
{quote}

Done.

{quote}
This actually does not really discard any buffers...
{quote}

Updated the comment to make this more clear.

{quote}
For the startIdx and endIdx, we should handle integer overflows of (index + 1) % bufSize. Maybe have a method like incrementInRing.
{quote}

I think there is no need for that - where can we get an overflow in this {{index = (index + 1) % bufSize;}} expression? It's designed to keep {{index}} within the {{0 .. bufSize - 1}} interval.

{quote}
We can Preconditions.checkNotNull on the input param in JvmMetrics#setGcTimeMonitor. (I know the current setPauseMonitor doesn't check, let's do better than that )
{quote}

Done.

{quote}
Should we make the methods synchronized so we don't have to worry about the user observes inconsistent values?
{quote}

This is tricky... We can, unfortunately, get inconsistent values from public {{getXXX()}} methods of {{GcTimeMonitor}} in any case, because all these methods are called separately and return one value each. So it can always happens that between a call to say {{getTotalGcTimeMillis()}} and {{getTotalGcCount()}} one of these would be updated. In practice I think this doesn't matter much, since this inconsistency will be just temporary and this information is, after all, expected to be used for monitoring, not for some precise calculations I guess.

If we really want to return all this data in a consistent form, we should, at a minimum, return all these numbers in a single object by a single method (and then they indeed should be written into that object in a synchronized way, and maybe more synchronization will be needed in other places). For now I think this is too much work for too unobvious a benefit. But please let me know if you think differently.

{quote}
Typo in test: 
{quote}

Fixed.

> Add GC time percentage monitor/alerter
> --------------------------------------
>
>                 Key: HADOOP-14960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14960
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>         Attachments: HADOOP-14960.01.patch, HADOOP-14960.02.patch
>
>
> Currently class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides several metrics related to GC. Unfortunately, all these metrics are not as useful as they could be, because they don't answer the first and most important question related to GC and JVM health: what percentage of time my JVM is paused in GC? This percentage, calculated as the sum of the GC pauses over some period, like 1 minute, divided by that period - is the most convenient measure of the GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90% is really bad
> - it allows for easy apple-to-apple comparison between runs, even between different apps
> - when this metric reaches some critical value like 70%, it almost always indicates a "GC death spiral", from which the app can recover only if it drops some task(s) etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give numbers that can be used to rougly estimate this percentage. Thus it is suggested to add a new metric to this class, and possibly allow users to register handlers that will be automatically invoked if this metric reaches the specified threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org