Posted to user@hive.apache.org by Nathan Bamford <na...@redpoint.net> on 2017/08/01 16:59:46 UTC

hcatreader out of memory error

Hello,

  My company has a product that is a data-processing YARN app. Because we essentially take the place of MapReduce, we use HCatalog for reading and writing Hive tables.

  We implemented our solution using the reader and writer as described here:

https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter

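  For reference, our master-side setup follows the wiki example pretty closely (the database and table names below are placeholders):

    // classes from org.apache.hive.hcatalog.data.transfer
    ReadEntity.Builder builder = new ReadEntity.Builder();
    ReadEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();
    Map<String, String> config = new HashMap<String, String>();
    HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
    // prepareRead() plans the read; this is the call that blows up (see below)
    ReaderContext cntxt = reader.prepareRead();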

  This has worked more or less okay, but there are a couple of issues with it.


  First, some time back (I think in either 0.13 or 0.14), the ReaderContext interface changed so that we are no longer able to retrieve InputSplit objects from the ReaderContext via getSplits(). Now one must call getNumSplits() and retrieve individual splits by an id number.
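  In code, that path now looks something like this (class and method names as I understand the current API; each worker reads its split by number):

    // on the master, after prepareRead():
    int numSplits = cntxt.getNumSplits();

    // on a worker, given its assigned split number:
    HCatReader splitReader = DataTransferFactory.getHCatReader(cntxt, splitNum);
    Iterator<HCatRecord> itr = splitReader.read();
    while (itr.hasNext()) {
        HCatRecord record = itr.next();
        // process the record
    }

  Note there is no longer any public way to ask a split for its location or size.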

  This was a big problem for us, because we have our own load balancing algorithms and need to know the locations and sizes of the splits. I managed to get around this by using reflection to call the internal getSplits(), but of course this is far from a good solution.
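  Roughly like this (and I'm aware how fragile it is, since the method name and visibility are implementation details that could change in any release):

    // reach into the ReaderContext implementation for the InputSplit list
    // the public API no longer exposes (uses java.lang.reflect.Method and
    // org.apache.hadoop.mapreduce.InputSplit; exception handling omitted)
    Method m = cntxt.getClass().getDeclaredMethod("getSplits");
    m.setAccessible(true);
    List<InputSplit> splits = (List<InputSplit>) m.invoke(cntxt);
    for (InputSplit split : splits) {
        String[] locations = split.getLocations(); // where the data lives
        long length = split.getLength();           // how big it is
        // feed locations and length into our own load balancer
    }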

  Recently, we've been getting into some very large clusters with very large Hive tables, in some cases producing tens of thousands of data splits.

  This causes out-of-memory errors from the JVM when this call is made:

ReaderContext cntxt = reader.prepareRead();

    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1078)
    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1105)
    org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:153)
    org.apache.hive.hcatalog.data.transfer.impl.HCatInputFormatReader.prepareRead(HCatInputFormatReader.java:68)

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space


  Sometimes we've been able to deal with it by increasing the JVM heap (although the slowdown in prepareRead is awful), but sometimes we can't seem to provide enough.
  I notice from perusing the code that each InputSplit contains a copy of the table schema, which is enormous in these cases.
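  A crude way to see the duplication: the ReaderContext has to be serializable so it can be shipped to the workers, so serializing it and checking the byte count gives a rough measure of what every process must hold. Something like:

    // uses java.io.ByteArrayOutputStream and java.io.ObjectOutputStream
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
        oos.writeObject(cntxt); // carries every split, each with its own schema copy
    }
    System.out.println("ReaderContext serialized size: " + bos.size() + " bytes");

  With tens of thousands of splits, that product of split count times schema size gets huge fast.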

  My question to the community at large is: Is HCatalog still the recommended way for a YARN app like ours to interface with Hive? HiveServer2 has most of the functionality we need, but no way to get information about the data splits. If HCatalog is the only game in town, how are we meant to deal with these memory errors?

thanks,

Nathan Bamford