Posted to issues@hive.apache.org by "Misha Dmitriev (JIRA)" <ji...@apache.org> on 2017/11/15 19:59:00 UTC

[jira] [Commented] (HIVE-17684) HoS memory issues with MapJoinMemoryExhaustionHandler

    [ https://issues.apache.org/jira/browse/HIVE-17684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16254068#comment-16254068 ] 

Misha Dmitriev commented on HIVE-17684:
---------------------------------------

The problem with {{MapJoinMemoryExhaustionHandler}} is a large percentage of false alarms about memory exhaustion. Without it, however, Hive may go into a "GC death spiral", where the JVM runs back-to-back full GCs yet doesn't fail with an OOM for a long time. Because user threads are unable to run most of the time, the executor stops responding, and the Spark driver eventually drops it. This results in hard-to-debug failures, because the logs don't make it clear why the executor stopped responding.

I recently added the new https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/GcTimeMonitor.java class to Hadoop, which allows the user to accurately monitor the percentage of time that the JVM spends in GC. When this percentage grows above ~50% over a 1-minute period, it is almost always a signal that the JVM is in the "GC death spiral" described above. Even if it is not, extremely long GC pauses are very bad for performance, and it makes sense to treat them the same way as an OOM, i.e. fail the task and ask the user to increase their executors' heap size.
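
To make this concrete, below is a minimal, self-contained sketch of what GcTimeMonitor measures, written against the standard {{java.lang.management}} API rather than the Hadoop class itself. The 1-minute window and 50% threshold are just the illustrative values from this comment, not defaults of any existing config.

{code:java}
// Sketch only: approximates what GcTimeMonitor computes, using plain
// java.lang.management. The real class keeps a sliding observation window and
// fires an alert callback; this loop simply samples once per window.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeFractionSketch {

  /** Total GC time (ms) accumulated by all collectors since JVM start. */
  static long totalGcTimeMs() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime();
      if (t > 0) {
        total += t;  // getCollectionTime() returns -1 if undefined for a collector
      }
    }
    return total;
  }

  public static void main(String[] args) throws InterruptedException {
    final long windowMs = 60_000;         // 1-minute observation window
    final int maxGcTimePercentage = 50;   // ~50% threshold mentioned above

    long prevGcTimeMs = totalGcTimeMs();
    while (true) {
      Thread.sleep(windowMs);
      long curGcTimeMs = totalGcTimeMs();
      int gcTimePercentage = (int) ((curGcTimeMs - prevGcTimeMs) * 100 / windowMs);
      if (gcTimePercentage > maxGcTimePercentage) {
        System.err.println("JVM spent " + gcTimePercentage
            + "% of the last minute in GC -- likely a GC death spiral");
      }
      prevGcTimeMs = curGcTimeMs;
    }
  }
}
{code}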

I ran some experiments where I replaced MapJoinMemoryExhaustionHandler with a check of the GC time percentage reported by GcTimeMonitor, and it works well. GcTimeMonitor will become available to other projects when Hadoop 3.0.0-GA is released (which, according to Hadoop developers, should happen in a few weeks). Currently Hive depends on Hadoop 3.0.0-beta1, so to use GcTimeMonitor in Hive we will need to change this dependency to the GA release. Are there any objections to:
(a) changing the dependency from Hadoop 3.0.0-beta1 to 3.0.0-GA
(b) replacing MapJoinMemoryExhaustionHandler with GcTimeMonitor?
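
To illustrate (b), the replacement could take roughly the shape below. This is a hedged sketch: it assumes GcTimeMonitor's current trunk API (a constructor taking an observation window, a sleep interval, a max GC time percentage, and an alert callback) and reuses Hive's existing {{MapJoinMemoryExhaustionError}}; the import paths, constructor signature, and the 60% threshold are assumptions to be verified against the actual classes, not a final implementation.

{code:java}
// Hedged sketch of replacing the MemoryMXBean-based check with a GC-time check.
// Assumed APIs: GcTimeMonitor(observationWindowMs, sleepIntervalMs,
// maxGcTimePercentage, alertHandler) from hadoop-common trunk, and Hive's
// MapJoinMemoryExhaustionError(String). Verify both before relying on this.
import org.apache.hadoop.util.GcTimeMonitor;
import org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError;

public class GcBasedMemoryExhaustionCheck {

  private static final int MAX_GC_TIME_PERCENTAGE = 60;  // illustrative threshold

  private volatile boolean gcTimeExceeded = false;

  private final GcTimeMonitor gcTimeMonitor = new GcTimeMonitor(
      60 * 1000,                 // observation window: 1 minute
      1000,                      // sample every second
      MAX_GC_TIME_PERCENTAGE,
      gcData -> { gcTimeExceeded = true; });  // alert callback: just set a flag

  public void start() {
    gcTimeMonitor.start();       // GcTimeMonitor runs as a monitoring thread
  }

  /**
   * Hypothetical hook, called periodically from the hash table loading loop in
   * place of MapJoinMemoryExhaustionHandler's MemoryMXBean-based check.
   */
  public void checkMemoryStatus(long numRows) {
    if (gcTimeExceeded) {
      throw new MapJoinMemoryExhaustionError("JVM spent more than "
          + MAX_GC_TIME_PERCENTAGE + "% of recent time in GC after loading "
          + numRows + " rows; increase executor heap size");
    }
  }
}
{code}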

> HoS memory issues with MapJoinMemoryExhaustionHandler
> -----------------------------------------------------
>
>                 Key: HIVE-17684
>                 URL: https://issues.apache.org/jira/browse/HIVE-17684
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>
> We have seen a number of memory issues due to the {{HashSinkOperator}} use of the {{MapJoinMemoryExhaustionHandler}}. This handler is meant to detect scenarios where the small table is taking too much space in memory, in which case a {{MapJoinMemoryExhaustionError}} is thrown.
> The configs to control this logic are:
> {{hive.mapjoin.localtask.max.memory.usage}} (default 0.90)
> {{hive.mapjoin.followby.gby.localtask.max.memory.usage}} (default 0.55)
> The handler uses the {{MemoryMXBean}} and the following logic to estimate how much memory the {{HashMap}} is consuming: {{MemoryMXBean#getHeapMemoryUsage().getUsed() / MemoryMXBean#getHeapMemoryUsage().getMax()}} (a simplified sketch of this check follows the quoted description).
> The issue is that {{MemoryMXBean#getHeapMemoryUsage().getUsed()}} can be inaccurate: it counts all reachable and unreachable objects on the heap, so it may include a lot of garbage that the JVM simply hasn't taken the time to reclaim yet. This can lead to intermittent failures of this check, even though a simple GC would have reclaimed enough space for the process to continue working.
> We should re-think the usage of {{MapJoinMemoryExhaustionHandler}} for HoS. In Hive-on-MR it probably made sense because every Hive task ran in a dedicated container, so a Hive task could assume it created most of the data on the heap. However, in Hive-on-Spark multiple Hive tasks can run in a single executor, each doing different things.
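
For reference, the MemoryMXBean-based check quoted above boils down to roughly the following (simplified sketch; the real handler also tracks row counts and reads the threshold from the configs listed above):

{code:java}
// Simplified sketch of the heap-usage check described in the quoted issue.
// The 0.90 threshold is the default of hive.mapjoin.localtask.max.memory.usage.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapUsageCheckSketch {
  public static void main(String[] args) {
    double maxMemoryUsage = 0.90;
    MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
    MemoryUsage heap = memoryMXBean.getHeapMemoryUsage();
    // getUsed() counts live objects *and* garbage the JVM hasn't collected yet,
    // which is why this ratio can spike even though a full GC would free space.
    double usedFraction = (double) heap.getUsed() / heap.getMax();
    if (usedFraction > maxMemoryUsage) {
      System.err.println("Hash table appears to use too much memory: " + usedFraction);
    }
  }
}
{code}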



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)