Posted to issues@hive.apache.org by "Misha Dmitriev (JIRA)" <ji...@apache.org> on 2018/06/29 02:24:00 UTC

[jira] [Comment Edited] (HIVE-19937) Intern JobConf objects in Spark tasks

    [ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527045#comment-16527045 ] 

Misha Dmitriev edited comment on HIVE-19937 at 6/29/18 2:23 AM:
----------------------------------------------------------------

I took a quick look, and I am not sure this is done correctly. The code below
{code:java}
jobConf.forEach(entry -> {
  StringInternUtils.internIfNotNull(entry.getKey());
  StringInternUtils.internIfNotNull(entry.getValue());
}){code}
goes over each table entry and just invokes {{intern()}} on the key and value. {{intern()}} returns the existing, "canonical" string for a duplicate string, but this code discards the returned strings instead of storing them back into the table, so nothing is actually deduplicated. To intern both keys and values in a hashtable, you typically need to create a new table and "intern and transfer" the contents from the old table into it. Sometimes it is possible to be more creative and create a table with interned contents right away; here that could probably be done by adding custom Kryo deserialization code for such tables, but maybe that's too big an effort.
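To illustrate, here is a minimal sketch of the "intern and transfer" approach. It uses a plain {{Map}} as a stand-in for the configuration object and a hypothetical {{internIfNotNull}} helper that mirrors what {{StringInternUtils.internIfNotNull}} does for a single string; the point is that the interned strings must be put into a table, not just returned and dropped:

```java
import java.util.HashMap;
import java.util.Map;

public class InternDemo {

    // Hypothetical helper mirroring StringInternUtils.internIfNotNull:
    // returns the canonical copy of the string, or null unchanged.
    static String internIfNotNull(String s) {
        return s == null ? null : s.intern();
    }

    // "Intern and transfer": build a new table whose keys and values are
    // the canonical interned strings, and use it in place of the old one.
    // Simply calling intern() on each entry and ignoring the result (as in
    // the patch above) leaves the original duplicate strings in the table.
    static Map<String, String> internContents(Map<String, String> old) {
        Map<String, String> interned = new HashMap<>(old.size());
        for (Map.Entry<String, String> e : old.entrySet()) {
            interned.put(internIfNotNull(e.getKey()),
                         internIfNotNull(e.getValue()));
        }
        return interned;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // new String(...) forces non-canonical duplicates, as after deserialization
        conf.put(new String("mapreduce.job.name"), new String("query-1"));
        Map<String, String> deduped = internContents(conf);
        // The transferred value is now reference-equal to the canonical copy
        System.out.println(deduped.get("mapreduce.job.name") == "query-1".intern());
    }
}
```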

As always, it would be good to measure how much memory was wasted before this change and how much is saved after it. That both guards against errors like the one above and shows how much was actually achieved.

If {{jobConf}} is an instance of {{java.util.Properties}}, and many duplicates of such tables exist, then memory is wasted both by the string contents of these tables and by the tables themselves (each table uses many extra Java objects internally). So you may consider checking the {{org.apache.hadoop.hive.common.CopyOnFirstWriteProperties}} class that I once added for a somewhat similar use case.
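The copy-on-first-write idea can be sketched as follows. This is illustrative only and not the actual {{CopyOnFirstWriteProperties}} API: many readers share one canonical {{Properties}} instance, and a private copy is materialized only when a writer first mutates it, so read-only duplicates cost almost nothing:

```java
import java.util.Properties;

// Minimal copy-on-first-write sketch (not the real Hive class): reads are
// delegated to a shared Properties instance until the first write, at which
// point the shared contents are copied into this object once.
public class CowProps extends Properties {
    private Properties shared;  // canonical instance, shared until first write

    public CowProps(Properties shared) {
        this.shared = shared;
    }

    @Override
    public synchronized Object setProperty(String key, String value) {
        copyOnWrite();
        return super.setProperty(key, value);
    }

    @Override
    public String getProperty(String key) {
        // Before the first write, answer reads from the shared table
        return shared != null ? shared.getProperty(key) : super.getProperty(key);
    }

    private synchronized void copyOnWrite() {
        if (shared != null) {
            super.putAll(shared);  // materialize a private copy exactly once
            shared = null;
        }
    }
}
```

With this scheme, cloning a {{JobConf}}-like table for each task stays cheap as long as most tasks only read it; only a task that writes pays for a full copy, and the shared canonical instance is never mutated.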



> Intern JobConf objects in Spark tasks
> -------------------------------------
>
>                 Key: HIVE-19937
>                 URL: https://issues.apache.org/jira/browse/HIVE-19937
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-19937.1.patch
>
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the {{JobConf}} object to prevent any {{ConcurrentModificationException}} from being thrown. However, cloning comes at the cost of storing a duplicate {{JobConf}} object for each Spark task. These objects can take up a significant amount of memory; we should intern them so that Spark tasks running in the same JVM don't store duplicate copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)