Posted to mapreduce-issues@hadoop.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2012/10/01 22:03:07 UTC

[jira] [Commented] (MAPREDUCE-4229) Intern counter names in the JT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467138#comment-13467138 ] 

Robert Joseph Evans commented on MAPREDUCE-4229:
------------------------------------------------

I have a couple of questions/suggestions for the patch.

Why are we using a strong interner instead of a weak one?  With the strong version I think we get a memory leak in the history server: once someone declares a custom counter name it never goes away, even after that job's jhist file has been deleted from HDFS and the counters are no longer accessible.  This is even worse for the strings we are interning that contain the value of a counter rather than its name; those will probably be different on every call, causing some potentially very large leaks.  That said, I am not entirely sure all of these call sites live in a long-running process.
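To make the distinction concrete, here is a minimal sketch of a weak interner built only on the JDK (the class name and shape are mine, not the patch's; Guava's Interners would be the real choice).  Entries are dropped once nothing else references the interned name, whereas a strong interner, effectively a plain HashMap, pins every custom counter name for the life of the history server:

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical weak interner sketch. WeakHashMap keys are weakly held;
// the value must ALSO be a WeakReference, because a strong value that
// points at the key would pin the entry and defeat the weakness.
final class WeakInterner {
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref == null) ? null : ref.get();
        if (canonical == null) {
            // First time we see this name (or the old canonical copy was
            // collected): this instance becomes the canonical one.
            canonical = s;
            pool.put(s, new WeakReference<>(s));
        }
        return canonical;
    }
}
```

With this, two distinct String objects carrying the same counter name collapse to one instance, but a name used only by a job whose jhist file has been deleted becomes collectible again.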

The placement of the interning calls also seems odd; I doubt some of them save any memory at all.  For example, we store a counter's name inside the counter instance and then intern it only when returning it from the getter.  Even if that intern call did something, it could not free the original string, because the counter instance still points to it.  The only time we really need to intern a string is when it is the result of reading from a stream, so that means RPC calls and anything that parses a string out of a file.  In most other cases, such as strings that come directly from an enum or a quoted literal in the code, the runtime has already interned them, and the extra layer only slows things down and actually uses more memory.
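A small sketch of the "intern at the read site" idea (the method and class names here are illustrative, not from the patch; I use String.intern() only to keep the sketch self-contained, where a weak interner would be preferable in a long-running server).  The point is that readUTF() allocates a brand-new String and char[] every call, so this is the one place where collapsing duplicates pays off:

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical readFields-style helper: intern exactly where the string
// is materialized from the stream, so the freshly allocated duplicate
// becomes garbage immediately and all counters share one copy of the name.
final class CounterRead {
    static String readCounterName(DataInputStream in) throws IOException {
        // Each readUTF() builds a new String for an often-repeated name;
        // interning here is what actually deduplicates the heap.
        return in.readUTF().intern();
    }
}
```

Interning the same name later, from a getter, would come after the counter instance already holds a strong reference to the original copy, so nothing could be collected.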

I think it would be preferable to start by interning only the names of counters and counter groups as they are read from a stream, as in the job history file parser.  Once that is in, we can go back and evaluate whether other paths, like RPC, are using a lot of memory.  I would hold off on RPC, because I am not sure how cleanly this inserts into the protocol buffer bridge code we use.  I think PB plays games with lazy parsing of the data, and if we are not careful it could slow things down or cause more memory usage.
                
> Intern counter names in the JT
> ------------------------------
>
>                 Key: MAPREDUCE-4229
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4229
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 1.0.2
>            Reporter: Todd Lipcon
>         Attachments: MAPREDUCE-4229.patch
>
>
> In our experience, most of the memory in production JTs goes to storing counter names (String objects and character arrays). Since most counter names are reused again and again, it would be a big memory savings to keep a hash set of already-used counter names within a job, and refer to the same object from all tasks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira