Posted to dev@hive.apache.org by Sean McNamara <Se...@Webtrends.com> on 2014/10/14 07:53:17 UTC

GC/OOM fix when writing large/many columns

Hello-

I’ve found a condition where the MemoryManager will wait too long before notifying writers to check their memory and flush.


This issue affects anyone who is writing a lot of columns, very large columns, or, worst of all, both. I have tested and confirmed this issue on Hive 0.12, 0.13, and trunk.

From some searching, it looks like other folks have been running into this as well. The issue manifests itself as long GC pauses that eventually end in the exception below when writing data. Tuning hive.exec.orc.memory.pool, or any of the other ORC parameters, has no apparent effect when hitting this issue.

java.lang.OutOfMemoryError: Java heap space
        java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
        org.apache.hadoop.hive.ql.io.orc.OutStream.getNewInputBuffer(OutStream.java:107)
        org.apache.hadoop.hive.ql.io.orc.OutStream.spill(OutStream.java:223)
        org.apache.hadoop.hive.ql.io.orc.OutStream.flush(OutStream.java:239)
...

I ran into this issue while generating ORC files, but I believe it affects all storage types. The only workaround at present is to give tasks lots of extra memory.

https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L50

The issue is on line 50: ROWS_BETWEEN_CHECKS = 5000;

For large or many columns it’s easy to hit GC issues or OOM before 5k rows are written.
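
To put a rough number on that, here is a back-of-the-envelope estimate in Java. The workload figures (500 columns, about 1 KiB per value) are illustrative assumptions, not measurements from this thread, and the calculation ignores encoding and compression, so treat it as an order-of-magnitude sketch only.

```java
// Rough estimate only: the workload numbers used in main() are assumptions.
final class BufferedBytesEstimate {

    /** Raw bytes that could be handed to the writers before the first memory check. */
    static long bytesBeforeFirstCheck(long rowsBetweenChecks, int columns, long avgBytesPerValue) {
        return rowsBetweenChecks * columns * avgBytesPerValue;
    }

    public static void main(String[] args) {
        // 5000 rows between checks, 500 columns, ~1 KiB per value (all assumed).
        long bytes = bytesBeforeFirstCheck(5000, 500, 1024);
        System.out.println(bytes / (1024 * 1024) + " MiB written before the first check");
    }
}
```

With those assumed numbers the writers see roughly 2.4 GiB of raw data before the first check, which is more than a typical task heap.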

I believe that rows-between-checks should be made a configuration parameter that can be passed in on the JobConf.
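
A minimal sketch of what that could look like, kept self-contained on purpose. The property name `example.orc.rows.between.checks`, the classes `RowsBetweenChecksConfig` and `SketchMemoryManager`, and the use of a plain `Map` in place of a JobConf are all hypothetical stand-ins for illustration; this is not the actual MemoryManager code.

```java
import java.util.Map;

// Hypothetical helper: reads the check interval from a key/value configuration.
final class RowsBetweenChecksConfig {
    // Assumed property name, invented for this sketch.
    static final String KEY = "example.orc.rows.between.checks";
    // Matches the hard-coded value mentioned in this thread.
    static final int DEFAULT_ROWS_BETWEEN_CHECKS = 5000;

    /** Returns the configured interval, or the default if it is unset or invalid. */
    static int rowsBetweenChecks(Map<String, String> conf) {
        String raw = conf.get(KEY);
        if (raw == null) {
            return DEFAULT_ROWS_BETWEEN_CHECKS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value >= 1 ? value : DEFAULT_ROWS_BETWEEN_CHECKS;
        } catch (NumberFormatException e) {
            return DEFAULT_ROWS_BETWEEN_CHECKS;
        }
    }
}

// Hypothetical stand-in for a memory manager that counts rows between checks.
final class SketchMemoryManager {
    private final int rowsBetweenChecks;
    private int rowsSinceLastCheck = 0;

    SketchMemoryManager(int rowsBetweenChecks) {
        if (rowsBetweenChecks < 1) {
            throw new IllegalArgumentException("rowsBetweenChecks must be >= 1");
        }
        this.rowsBetweenChecks = rowsBetweenChecks;
    }

    /** Call once per written row; returns true when writers should check their memory. */
    boolean addedRow() {
        rowsSinceLastCheck++;
        if (rowsSinceLastCheck >= rowsBetweenChecks) {
            rowsSinceLastCheck = 0;
            return true;
        }
        return false;
    }
}
```

Falling back to 5000 for missing or invalid values keeps existing jobs behaving as they do today, while a job with wide rows could set a much smaller interval.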

Does this suggestion make sense?  If so I can open a Jira ticket and throw some code together.

Thank you,

Sean


Re: GC/OOM fix when writing large/many columns

Posted by Alan Gates <ga...@hortonworks.com>.
A config variable is a good place to start.  It would be even cooler if 
the system could somehow auto-detect the condition and then reduce the 
number of rows between checks.
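
One possible shape for such auto-detection, sketched under assumptions: measure how many bytes were buffered since the last check and pick the next interval so that roughly a fixed byte budget accumulates between checks. The class `AdaptiveCheckInterval`, its method, and the byte-budget idea are hypothetical and not taken from Hive.

```java
// Hypothetical sketch: derive the next check interval from observed row sizes.
final class AdaptiveCheckInterval {
    // Upper bound, matching the fixed value discussed in this thread.
    static final int MAX_ROWS_BETWEEN_CHECKS = 5000;

    /**
     * Picks how many rows to write before the next memory check so that about
     * byteBudgetBetweenChecks bytes accumulate in between.
     */
    static int nextInterval(long bytesBufferedSinceLastCheck,
                            int rowsSinceLastCheck,
                            long byteBudgetBetweenChecks) {
        if (rowsSinceLastCheck <= 0 || bytesBufferedSinceLastCheck <= 0) {
            // Nothing measured yet: fall back to the upper bound.
            return MAX_ROWS_BETWEEN_CHECKS;
        }
        long avgBytesPerRow = Math.max(1, bytesBufferedSinceLastCheck / rowsSinceLastCheck);
        long rows = byteBudgetBetweenChecks / avgBytesPerRow;
        // Clamp to [1, MAX_ROWS_BETWEEN_CHECKS].
        return (int) Math.max(1, Math.min(MAX_ROWS_BETWEEN_CHECKS, rows));
    }
}
```

For example, `nextInterval(5_000_000L, 5000, 100_000L)` sees about 1000 bytes per row and returns 100. A real implementation would also need a small initial interval, since the failure described above can happen before the first measurement is available.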

Alan.

