Posted to mapreduce-user@hadoop.apache.org by Mohamed Riadh Trad <Mo...@inria.fr> on 2010/03/23 17:00:09 UTC

Small Files as input, Heap Size and garbage Collector

Hi,

I am running Hadoop over a collection of several million small files using CombineFileInputFormat.

However, when generating splits, the job fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.

I disabled the GC overhead limit check with -server -XX:-UseGCOverheadLimit; I then get java.lang.OutOfMemoryError: Java heap space instead, even with -Xmx8192m -server.

Is there any solution to avoid this limit when splitting input?
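For context, the driver for this kind of job looks roughly like the following (a simplified sketch, not my actual job code; MyCombineFormat is a hypothetical CombineFileInputFormat subclass, and the split-size property name differs across Hadoop versions):

    // Simplified driver sketch. MyCombineFormat is a hypothetical
    // CombineFileInputFormat subclass that packs many small files into each
    // split, bounded by the max split size configured below.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallFilesDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Upper bound on the combined split size (128 MB here). The property
        // was "mapred.max.split.size" in the 0.20-era code; newer releases use
        // "mapreduce.input.fileinputformat.split.maxsize".
        conf.setLong("mapred.max.split.size", 128L * 1024 * 1024);
        Job job = new Job(conf, "small-files");
        job.setJarByClass(SmallFilesDriver.class);
        job.setInputFormatClass(MyCombineFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // ... mapper/reducer/output setup elided ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }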

Regards




Re: Small Files as input, Heap Size and garbage Collector

Posted by "Bae, Jae Hyeon" <me...@gmail.com>.
I think there's no way to avoid this limit if you have several
million small files.

As you may know, at least one InputSplit instance is created per
file, so with several million small files you end up with several
million InputSplit instances, which can consume many gigabytes of
memory.

You could tar your small files and implement an input format for
those tar files; a rough sketch of the record-reader side is below.
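Something along these lines, purely as an illustration (not my colleague's actual code); it assumes the new mapreduce API and Apache Commons Compress on the classpath for reading tar streams:

    // Illustrative sketch: emits one (file name, file bytes) pair per entry
    // of a tar file stored on HDFS. Assumes Apache Commons Compress.
    import java.io.IOException;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TarEntryRecordReader extends RecordReader<Text, BytesWritable> {
      private TarArchiveInputStream tarIn;
      private Text currentKey = new Text();
      private BytesWritable currentValue = new BytesWritable();
      private boolean done = false;

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context)
          throws IOException {
        Path path = ((FileSplit) split).getPath();
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        tarIn = new TarArchiveInputStream(fs.open(path));
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        TarArchiveEntry entry;
        // Skip directory entries; stop when the archive is exhausted.
        while ((entry = tarIn.getNextTarEntry()) != null && entry.isDirectory()) {
          // keep scanning
        }
        if (entry == null) {
          done = true;
          return false;
        }
        // Read the current entry's contents (small files, so buffer in memory).
        byte[] buf = new byte[(int) entry.getSize()];
        int off = 0;
        while (off < buf.length) {
          int n = tarIn.read(buf, off, buf.length - off);
          if (n < 0) break;
          off += n;
        }
        currentKey.set(entry.getName());
        currentValue.set(buf, 0, off);
        return true;
      }

      @Override public Text getCurrentKey() { return currentKey; }
      @Override public BytesWritable getCurrentValue() { return currentValue; }
      @Override public float getProgress() { return done ? 1.0f : 0.0f; }
      @Override public void close() throws IOException { tarIn.close(); }
    }

The matching input format would return one FileSplit per tar file, override isSplitable() to return false so an archive is never cut in the middle, and hand back this reader from createRecordReader().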

One of my colleagues had a similar case with a large number of small
files; he tarred the files, wrote an input format for them, and that
solved the problem.

I'd like to suggest that he open-source it, but I am not sure his
manager would permit it :)

2010/3/24 Mohamed Riadh Trad <Mo...@inria.fr>:
> Hi,
>
> I am running Hadoop over a collection of several million small files using CombineFileInputFormat.
>
> However, when generating splits, the job fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
>
> I disabled the GC overhead limit check with -server -XX:-UseGCOverheadLimit; I then get java.lang.OutOfMemoryError: Java heap space instead, even with -Xmx8192m -server.
>
> Is there any solution to avoid this limit when splitting input?
>
> Regards
>
>
>
>

Re: Small Files as input, Heap Size and garbage Collector

Posted by welman Lu <we...@gmail.com>.
I am reading "Hadoop: The Definitive Guide", and on page 71 it says that
when there are too many small files, the NameNode's memory gets eaten
up, since each file's metadata must be kept in the NameNode. The book also
suggests using Hadoop Archives (HAR files) to pack files into HDFS blocks.
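For example (illustrative only; the paths are made up, and the archive would be built beforehand with the hadoop archive command-line tool), a job can then read the archive through its har:// URI:

    // Illustrative only: pointing a job at an existing Hadoop Archive (HAR).
    // The archive itself would be created beforehand with something like:
    //   hadoop archive -archiveName files.har -p /user/trad input /user/trad
    // (made-up example paths).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class HarInputExample {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "har-input");
        job.setJarByClass(HarInputExample.class);
        // The har:// filesystem exposes the archive's contents, so
        // FileInputFormat can read it like a directory of files.
        FileInputFormat.addInputPath(job, new Path("har:///user/trad/files.har"));
        // ... mapper/reducer/output setup elided ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }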

Hope this can help you!


Best Regards
Jiamin Lu

Re: Small Files as input, Heap Size and garbage Collector

Posted by Karthik K <os...@gmail.com>.
On Tue, Mar 23, 2010 at 9:00 AM, Mohamed Riadh Trad
<Mo...@inria.fr> wrote:

> Hi,
>
> I am running Hadoop over a collection of several million small files
> using CombineFileInputFormat.
>
> However, when generating splits, the job fails with
> java.lang.OutOfMemoryError: GC overhead limit exceeded.
>
> I disabled the GC overhead limit check with -server
> -XX:-UseGCOverheadLimit; I then get java.lang.OutOfMemoryError: Java heap
> space instead, even with -Xmx8192m -server.
>
> Is there any solution to avoid this limit when splitting input?
>

You can inherit directly from InputFormat and create your
InputSplits / RecordReaders accordingly.

With millions of small files, you can define a set of custom
InputSplits based on higher-level logic; your record reader becomes more
cumbersome with respect to the nextKeyValue() / getCurrentKey() /
getCurrentValue() implementations, but at least you have better control
over the behavior (see the sketch below).
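Something along these lines, purely as a sketch (MultiFileRecordReader is a placeholder for whatever reader you implement over the packed split):

    // Illustrative sketch: an InputFormat that packs many small files into
    // each split so the number of InputSplit objects stays bounded.
    // MultiFileRecordReader is a hypothetical placeholder.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class PackedSmallFilesInputFormat
        extends FileInputFormat<Text, BytesWritable> {
      private static final int FILES_PER_SPLIT = 10000; // tune for your data

      @Override
      public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<FileStatus> files = listStatus(job);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        List<Path> paths = new ArrayList<Path>();
        List<Long> lengths = new ArrayList<Long>();
        for (FileStatus status : files) {
          paths.add(status.getPath());
          lengths.add(status.getLen());
          if (paths.size() == FILES_PER_SPLIT) {
            splits.add(packSplit(paths, lengths));
            paths.clear();
            lengths.clear();
          }
        }
        if (!paths.isEmpty()) {
          splits.add(packSplit(paths, lengths));
        }
        return splits;
      }

      // Bundle a group of files into a single CombineFileSplit.
      private InputSplit packSplit(List<Path> paths, List<Long> lengths) {
        long[] lens = new long[lengths.size()];
        for (int i = 0; i < lens.length; i++) {
          lens[i] = lengths.get(i);
        }
        return new CombineFileSplit(paths.toArray(new Path[paths.size()]), lens);
      }

      @Override
      public RecordReader<Text, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new MultiFileRecordReader(); // hypothetical reader over the packed split
      }
    }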

But if the data lives on HDFS, you may want to rethink having such a
large number of small files in the first place, or look for archiving
options that keep your InputSplit / RecordReader implementations relatively simple.





>
> Regards
>
>
>
>