Posted to common-user@hadoop.apache.org by Marc Sturlese <ma...@gmail.com> on 2010/11/30 00:26:16 UTC

small files and number of mappers

Hey there,
I am doing some tests and wondering what the best practices are for dealing
with very small files which are continuously being generated (1 MB or even
less).

I see that if I have hundreds of small files in HDFS, Hadoop automatically
will create A LOT of map tasks to consume them. Each map task will take 10
seconds or less... I don't know if it's possible to change the number of map
tasks from Java code using the new API (I know it can be done with the old
one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
This way, fewer map tasks would be instantiated and each would work for
longer.

I have had a look at Hadoop Archives as well but don't think they can help me
here.

Any advice or similar experience?
Thanks in advance.



Re: small files and number of mappers

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Nov 30, 2010 at 3:21 AM, Harsh J <qw...@gmail.com> wrote:
> Hey,
>
> On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese <ma...@gmail.com> wrote:
>>
>> Hey there,
>> I am doing some tests and wondering what the best practices are for dealing
>> with very small files which are continuously being generated (1 MB or even
>> less).
>
> Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
>>
>> I see that if I have hundreds of small files in HDFS, Hadoop automatically
>> will create A LOT of map tasks to consume them. Each map task will take 10
>> seconds or less... I don't know if it's possible to change the number of map
>> tasks from Java code using the new API (I know it can be done with the old
>> one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
>> This way, fewer map tasks would be instantiated and each would work for
>> longer.
>
> Perhaps you need to use MultiFileInputFormat:
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> --
> Harsh J
> www.harshj.com
>

MultiFileInputFormat and CombineFileInputFormat help.
JVM reuse helps.
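
As a minimal sketch of the JVM-reuse knob, assuming the old (mapred) API
that was current at the time; the class and property names are stock Hadoop,
the surrounding class is just illustrative:

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  public static void configure(JobConf conf) {
    // Reuse each task JVM for an unlimited number of tasks of the same job
    // on a node, amortising JVM startup cost across many tiny map tasks.
    conf.setNumTasksToExecutePerJvm(-1);
    // Equivalent to: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
  }
}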

The larger problem is that an average NameNode with 4 GB of RAM will start
hitting long JVM garbage-collection pauses at a relatively low number of
files/blocks, say 10,000,000. 10 million is not a large number when you are
generating thousands of files a day.
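
(Rough arithmetic, using the rule of thumb from the Cloudera post linked
above that every file, directory and block is an object of roughly 150 bytes
in the NameNode's heap: 10,000,000 single-block files is on the order of
20,000,000 objects, or about 3 GB, which is most of a 4 GB heap.)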

We open sourced a tool to deal with this problem.
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

Essentially it takes a pass over a directory and combines multiple
files into one. On 'hourly' directories we run it after the hour is
closed out.
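
To illustrate the idea only (this is not the filecrush code, just a minimal
single-threaded sketch with made-up class names and argument conventions),
combining an hourly directory of small files into one SequenceFile keyed by
filename looks roughly like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class HourlyCrushSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // e.g. an hourly input directory
    Path combined = new Path(args[1]);   // the single output SequenceFile

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, combined, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        if (stat.isDir()) continue;              // skip subdirectories
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(0, buf);                  // files are small, read whole
        } finally {
          in.close();
        }
        // key = original filename, value = raw file contents
        writer.append(new Text(stat.getPath().getName()),
            new BytesWritable(buf));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Once the small files are packed, the originals can be deleted so the
NameNode only has to track the one combined file.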

V2 (which we should throw over the fence in a week or so) uses the same
techniques, but is optimized for dealing with very large directories and/or
subdirectories of varying sizes: it does more intelligent planning and
grouping of which files an individual mapper or reducer will combine.

Re: small files and number of mappers

Posted by Harsh J <qw...@gmail.com>.
Hey,

On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese <ma...@gmail.com> wrote:
>
> Hey there,
> I am doing some tests and wondering what the best practices are for dealing
> with very small files which are continuously being generated (1 MB or even
> less).

Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

>
> I see that if I have hundreds of small files in HDFS, Hadoop automatically
> will create A LOT of map tasks to consume them. Each map task will take 10
> seconds or less... I don't know if it's possible to change the number of map
> tasks from Java code using the new API (I know it can be done with the old
> one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
> This way, fewer map tasks would be instantiated and each would work for
> longer.

Perhaps you need to use MultiFileInputFormat:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
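
For the new (mapreduce) API, later Hadoop releases ship a ready-made
CombineTextInputFormat; a sketch of wiring it up and capping the split size,
assuming such a release is in use (job and path names are made up, mapper
and reducer classes are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "combine-small-files");
    job.setJarByClass(CombineSmallFilesJob.class);

    // Pack many small files into each split, so one map task handles
    // roughly 128 MB of input rather than a single 1 MB file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // Plug in your own Mapper/Reducer classes here.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The effect is roughly what was asked for: fewer map tasks, each one fed
several small files instead of exactly one.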

-- 
Harsh J
www.harshj.com