Posted to common-user@hadoop.apache.org by Stuart Sierra <ma...@stuartsierra.com> on 2008/04/23 17:55:54 UTC

Best practices for handling many small files

Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?
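
A minimal sketch of that packing step (not the code Stuart links later in the thread), using SequenceFile.createWriter with BLOCK compression and the file names as keys; it assumes the tarball has already been unpacked to a local directory, the paths are placeholders, and error handling is omitted:

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Packs every file in a local directory into one BLOCK-compressed
 *  SequenceFile on HDFS, keyed by file name. Paths are placeholders. */
public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/user/stuart/docs.seq");          // hypothetical output path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (File f : new File("/tmp/extracted-corpus").listFiles()) {  // hypothetical input dir
        if (!f.isFile()) {
          continue;                                        // skip subdirectories
        }
        byte[] buf = new byte[(int) f.length()];           // 10-100KB fits in memory
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, buf, 0, buf.length);
        } finally {
          in.close();
        }
        // one record per document: key = file name, value = raw bytes
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

With BLOCK compression the writer compresses batches of records together, so many small documents compress well while the resulting file stays splittable by MapReduce.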

Thanks,
-Stuart, altlaw.org

Re: Best practices for handling many small files

Posted by Chris K Wensel <ch...@wensel.net>.
Are the files to be stored on HDFS long term, or do they need to be
fetched from an external authoritative source?

A lot depends on how things are set up in your datacenter, etc.

You could aggregate them into one fat sequence file (or a few). Keep in
mind how long it would take to fetch the files and aggregate them (this
is a serial process) and how often the corpus changes (i.e., how often
you will need to rebuild these sequence files).

Another option is to make a manifest (a list of docs to fetch), feed
that to your mapper, and have it fetch each file individually. This
would be useful if the corpus is fairly arbitrary between runs and could
eliminate much of the load time, but it is painful if the data is
external to your datacenter and the cost to refetch is high.
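
A rough sketch of that manifest approach, against the old org.apache.hadoop.mapred API; the manifest is assumed to be a plain text file with one document path per line, and the class name is invented:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Each input line is one document path; the mapper fetches the document itself. */
public class ManifestFetchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, BytesWritable> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, BytesWritable> output, Reporter reporter)
      throws IOException {
    Path doc = new Path(line.toString().trim());
    FileSystem fs = doc.getFileSystem(conf);

    long len = fs.getFileStatus(doc).getLen();
    byte[] buf = new byte[(int) len];              // fine for 10-100KB documents
    FSDataInputStream in = fs.open(doc);
    try {
      IOUtils.readFully(in, buf, 0, buf.length);
    } finally {
      in.close();
    }
    // emit (document name, document bytes); real per-document processing goes here instead
    output.collect(new Text(doc.getName()), new BytesWritable(buf));
  }
}

The manifest itself is read with a plain TextInputFormat; to spread the fetching across many maps you would want the manifest broken into several splits (for example, several manifest files, or the NLineInputFormat of later releases).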

There really is no simple answer.

ckw


On Apr 23, 2008, at 9:16 AM, Joydeep Sen Sarma wrote:
> A million map processes are horrible. Aside from the overhead, don't
> do it if you share the cluster with other jobs (all other jobs will
> get killed whenever the million-map job finishes - see
> https://issues.apache.org/jira/browse/HADOOP-2393).
>
> Even for #2, it raises the question of how the packing itself will be
> parallelized.
>
> There's a MultiFileInputFormat that can be extended - it allows
> processing of multiple files in a single map task, but it needs
> improvement. For one, it's an abstract class, and a concrete
> implementation for (at least) text files would help. Also, the
> splitting logic is not very smart (from what I last saw). Ideally, it
> should take the million files and form them into N groups (say N is
> the size of your cluster), where each group has files local to the
> Nth machine, and then process them on that machine. Currently it
> doesn't do this (the groups are arbitrary), but it's still the way to
> go.
>
>
> -----Original Message-----
> From: the.stuart.sierra@gmail.com on behalf of Stuart Sierra
> Sent: Wed 4/23/2008 8:55 AM
> To: core-user@hadoop.apache.org
> Subject: Best practices for handling many small files
>
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys.  Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org
>

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/





RE: Best practices for handling many small files

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Wouldn't it be possible to take the RecordReader class as a config
variable and then have a concrete implementation that instantiates the
configured record reader (like StreamInputFormat does)?

What I meant about the splits wasn't so much the number of maps, but how
the allocation of files to each map task is done. Currently the logic
(in MultiFileInputFormat) doesn't take location into account. All we
need to do is sort the files by location in the getSplits() method and
then do the binning; that way, files in the same split will be
co-located. (OK, there are multiple locations for each file, but I would
think choosing _a_ location and binning based on that would be better
than doing so randomly.)
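
A sketch of that sort-then-bin idea in isolation (not a patch to MultiFileInputFormat): it picks _a_ host per file via FileSystem#getFileBlockLocations, a call that may postdate the Hadoop release this thread was written against, sorts by that host, and cuts the sorted list into contiguous groups.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sort files by the host of their first block, then cut the sorted list into
 *  contiguous bins so that most files in a bin share a location. */
public class LocationBinning {

  public static List<List<Path>> bin(FileSystem fs, FileStatus[] files, int numBins)
      throws IOException {
    // pick _a_ host per file: the first replica of the first block
    final String[] hosts = new String[files.length];
    Integer[] order = new Integer[files.length];
    for (int i = 0; i < files.length; i++) {
      BlockLocation[] blocks = fs.getFileBlockLocations(files[i], 0, files[i].getLen());
      hosts[i] = (blocks.length > 0 && blocks[0].getHosts().length > 0)
          ? blocks[0].getHosts()[0] : "";
      order[i] = i;
    }

    // sort file indexes by their chosen host so co-located files end up adjacent
    Arrays.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return hosts[a].compareTo(hosts[b]);
      }
    });

    // cut the sorted list into numBins contiguous groups (numBins > 0 assumed)
    List<List<Path>> bins = new ArrayList<List<Path>>();
    for (int b = 0; b < numBins; b++) {
      bins.add(new ArrayList<Path>());
    }
    for (int i = 0; i < order.length; i++) {
      int bin = (int) ((long) i * numBins / order.length);
      bins.get(bin).add(files[order[i]].getPath());
    }
    return bins;
  }
}

Each resulting bin would then become one split, with the shared host reported as that split's location.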

-----Original Message-----
From: Enis Soztutar [mailto:enis.soz.nutch@gmail.com] 
Sent: Thursday, April 24, 2008 12:27 AM
To: core-user@hadoop.apache.org
Subject: Re: Best practices for handling many small files

A shameless attempt to defend MultiFileInputFormat:

A concrete implementation of MultiFileInputFormat is not needed, since
every InputFormat relying on MultiFileInputFormat is expected to have
its own custom RecordReader implementation, and thus needs to override
getRecordReader(). An implementation which returns (more or less) a
LineRecordReader is under src/examples/.../MultiFileWordCount. However,
we may include one if a generic implementation (for example, one
returning SequenceFileRecordReader) pops up.

An InputFormat returns <numSplits> splits from getSplits(JobConf job,
int numSplits), where numSplits is the number of maps, not the number
of machines in the cluster.

Last of all, the MultiFileSplit class implements the getLocations()
method, which returns the files' locations, so it's the JobTracker's
job to assign tasks to leverage local processing.

Coming to the original question, I think #2 is better, provided the
construction of the sequence file is not a bottleneck. You may, for
example, create several sequence files in parallel and use all of them
as input without merging.


Joydeep Sen Sarma wrote:
> A million map processes are horrible. Aside from the overhead, don't
> do it if you share the cluster with other jobs (all other jobs will
> get killed whenever the million-map job finishes - see
> https://issues.apache.org/jira/browse/HADOOP-2393).
>
> Even for #2, it raises the question of how the packing itself will be
> parallelized.
>
> There's a MultiFileInputFormat that can be extended - it allows
> processing of multiple files in a single map task, but it needs
> improvement. For one, it's an abstract class, and a concrete
> implementation for (at least) text files would help. Also, the
> splitting logic is not very smart (from what I last saw). Ideally, it
> should take the million files and form them into N groups (say N is
> the size of your cluster), where each group has files local to the
> Nth machine, and then process them on that machine. Currently it
> doesn't do this (the groups are arbitrary), but it's still the way to
> go.
>
>
> -----Original Message-----
> From: the.stuart.sierra@gmail.com on behalf of Stuart Sierra
> Sent: Wed 4/23/2008 8:55 AM
> To: core-user@hadoop.apache.org
> Subject: Best practices for handling many small files
>  
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys.  Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org
>
>
>   

Re: Best practices for handling many small files

Posted by Enis Soztutar <en...@gmail.com>.
A shameless attempt to defend MultiFileInputFormat:

A concrete implementation of MultiFileInputFormat is not needed, since
every InputFormat relying on MultiFileInputFormat is expected to have
its own custom RecordReader implementation, and thus needs to override
getRecordReader(). An implementation which returns (more or less) a
LineRecordReader is under src/examples/.../MultiFileWordCount. However,
we may include one if a generic implementation (for example, one
returning SequenceFileRecordReader) pops up.
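
For readers following along, a rough sketch of what such a concrete subclass could look like against the old org.apache.hadoop.mapred API (as it appears in releases where MultiFileInputFormat is generic): one whole small file per record, key = file name, value = raw bytes. The class names are invented, this is not the MultiFileWordCount example, and it assumes every file fits in memory.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** One whole (small) file per record: key = file name, value = file contents. */
public class WholeFileInputFormat extends MultiFileInputFormat<Text, BytesWritable> {

  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((MultiFileSplit) split, job);
  }

  static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
    private final MultiFileSplit split;
    private final JobConf job;
    private int index = 0;

    WholeFileRecordReader(MultiFileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (index >= split.getNumPaths()) {
        return false;                              // no files left in this split
      }
      Path path = split.getPath(index);
      FileSystem fs = path.getFileSystem(job);
      int len = (int) fs.getFileStatus(path).getLen();   // small files fit in memory
      byte[] buf = new byte[len];

      FSDataInputStream in = fs.open(path);
      try {
        IOUtils.readFully(in, buf, 0, len);
      } finally {
        in.close();
      }
      key.set(path.toString());
      value.set(buf, 0, len);
      index++;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() throws IOException { return index; }
    public float getProgress() throws IOException {
      return split.getNumPaths() == 0 ? 1.0f : (float) index / split.getNumPaths();
    }
    public void close() throws IOException { }
  }
}

FileInputFormat.setInputPaths on the job would then point at the directory of small files, and each map task would receive one MultiFileSplit's worth of them.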

An InputFormat returns <numSplits> splits from getSplits(JobConf job,
int numSplits), where numSplits is the number of maps, not the number
of machines in the cluster.

Last of all, the MultiFileSplit class implements the getLocations()
method, which returns the files' locations, so it's the JobTracker's
job to assign tasks to leverage local processing.

Coming to the original question, I think #2 is better, provided the
construction of the sequence file is not a bottleneck. You may, for
example, create several sequence files in parallel and use all of them
as input without merging.
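
That last point needs no special support: with the old mapred API a job can simply take all of the sequence files (or the directory holding them) as input. A minimal sketch, with placeholder paths, a placeholder job name, and IdentityMapper standing in for real per-document processing:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ProcessPackedFiles {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ProcessPackedFiles.class);
    job.setJobName("process-small-files");                 // placeholder name

    // every SequenceFile under this directory becomes input; no merging needed
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("/user/stuart/packed/"));   // placeholder
    FileOutputFormat.setOutputPath(job, new Path("/user/stuart/output/"));  // placeholder

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    job.setMapperClass(IdentityMapper.class);              // stand-in for real processing

    JobClient.runJob(job);
  }
}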


Joydeep Sen Sarma wrote:
> A million map processes are horrible. Aside from the overhead, don't do it if you share the cluster with other jobs (all other jobs will get killed whenever the million-map job finishes - see https://issues.apache.org/jira/browse/HADOOP-2393).
>
> Even for #2, it raises the question of how the packing itself will be parallelized.
>
> There's a MultiFileInputFormat that can be extended - it allows processing of multiple files in a single map task, but it needs improvement. For one, it's an abstract class, and a concrete implementation for (at least) text files would help. Also, the splitting logic is not very smart (from what I last saw). Ideally, it should take the million files and form them into N groups (say N is the size of your cluster), where each group has files local to the Nth machine, and then process them on that machine. Currently it doesn't do this (the groups are arbitrary), but it's still the way to go.
>
>
> -----Original Message-----
> From: the.stuart.sierra@gmail.com on behalf of Stuart Sierra
> Sent: Wed 4/23/2008 8:55 AM
> To: core-user@hadoop.apache.org
> Subject: Best practices for handling many small files
>  
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys.  Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org
>
>
>   

RE: Best practices for handling many small files

Posted by Joydeep Sen Sarma <js...@facebook.com>.
A million map processes are horrible. Aside from the overhead, don't do it if you share the cluster with other jobs (all other jobs will get killed whenever the million-map job finishes - see https://issues.apache.org/jira/browse/HADOOP-2393).

Even for #2, it raises the question of how the packing itself will be parallelized.

There's a MultiFileInputFormat that can be extended - it allows processing of multiple files in a single map task, but it needs improvement. For one, it's an abstract class, and a concrete implementation for (at least) text files would help. Also, the splitting logic is not very smart (from what I last saw). Ideally, it should take the million files and form them into N groups (say N is the size of your cluster), where each group has files local to the Nth machine, and then process them on that machine. Currently it doesn't do this (the groups are arbitrary), but it's still the way to go.


-----Original Message-----
From: the.stuart.sierra@gmail.com on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files
 
Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?

Thanks,
-Stuart, altlaw.org


Re: Best practices for handling many small files

Posted by Doug Cutting <cu...@apache.org>.
Joydeep Sen Sarma wrote:
> There seem to be two problems with small files:
> 1. Namenode overhead (3307 seems like _a_ solution).
> 2. Map-reduce processing overhead and locality.
> 
> It's not clear from the 3307 description how the archives interface with
> map-reduce. How are the splits done? Will they solve problem #2?

Yes, I think 3307 will address (2).  Many small files will be packed 
into fewer, larger files, each typically substantially larger than a 
block.  A splitter can read the index files and then use 
MultiFileInputFormat, so that each split contains files that sit 
almost entirely within a single block.

Good MapReduce performance is a requirement for the design of 3307.

Doug

RE: Best practices for handling many small files

Posted by Joydeep Sen Sarma <js...@facebook.com>.
There seem to be two problems with small files:
1. Namenode overhead (3307 seems like _a_ solution).
2. Map-reduce processing overhead and locality.

It's not clear from the 3307 description how the archives interface with
map-reduce. How are the splits done? Will they solve problem #2?

To some extent, the goals of an archive and the (ideal)
MultiFileInputFormat (MFIF) differ. The archive wants to preserve the
identity of each of the sub-objects. MFIF offers a way for users to
homogenize small objects into larger ones (while processing); when a
user wants to use MFIF, they don't care about the identities of the
individual small files.

In the interest of layering, it seems we should keep these separate.
3307 can offer an efficient way to store small files, while a good MFIF
implementation can offer a way to do map-reduce efficiently on small
files (whether those small files are regular HDFS files or are backed
by an archive). This way the user also keeps the option of using a
regular FileInputFormat (say they _do_ care about which file is being
processed).


-----Original Message-----
From: Konstantin Shvachko [mailto:shv@yahoo-inc.com] 
Sent: Friday, April 25, 2008 10:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Best practices for handling many small files

Would the new archive feature HADOOP-3307, which is currently being
developed, help with this problem?
http://issues.apache.org/jira/browse/HADOOP-3307

--Konstantin

Subramaniam Krishnan wrote:
> 
> We have actually written a custom Multi File Splitter that collapses
> small files into a single split until the DFS block size is hit. We
> also handle big files by splitting them on the block size and adding
> all the remainders (if any) to a single split.
> 
> It works great for us... :-)
> We are working on optimizing it further to group all the small files
> on a single data node together, so that the map can have maximum
> local data.
> 
> We plan to share this (provided it's found acceptable, of course)
> once it is done.
> 
> Regards,
> Subru
> 
> Stuart Sierra wrote:
> 
>> Thanks for the advice, everyone.  I'm going to go with #2, packing my
>> million files into a small number of SequenceFiles.  This is slow, but
>> only has to be done once.  My "datacenter" is Amazon Web Services :),
>> so storing a few large, compressed files is the easiest way to go.
>>
>> My code, if anyone's interested, is here:
>> http://stuartsierra.com/2008/04/24/a-million-little-files
>>
>> -Stuart
>> altlaw.org
>>
>>
>> On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra 
>> <ma...@stuartsierra.com> wrote:
>>  
>>
>>> Hello all, Hadoop newbie here, asking: what's the preferred way to
>>>  handle large (~1 million) collections of small files (10 to 100KB) in
>>>  which each file is a single "record"?
>>>
>>>  1. Ignore it, let Hadoop create a million Map processes;
>>>  2. Pack all the files into a single SequenceFile; or
>>>  3. Something else?
>>>
>>>  I started writing code to do #2, transforming a big tar.bz2 into a
>>>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>>>  work?
>>>
>>>  Thanks,
>>>  -Stuart, altlaw.org
>>>
>>>     
> 
> 

Re: Best practices for handling many small files

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Would the new archive feature HADOOP-3307, which is currently being developed, help with this problem?
http://issues.apache.org/jira/browse/HADOOP-3307

--Konstantin

Subramaniam Krishnan wrote:
> 
> We have actually written a custom Multi File Splitter that collapses small
> files into a single split until the DFS block size is hit. We also handle
> big files by splitting them on the block size and adding all the remainders
> (if any) to a single split.
> 
> It works great for us... :-)
> We are working on optimizing it further to group all the small files on a
> single data node together, so that the map can have maximum local data.
> 
> We plan to share this (provided it's found acceptable, of course) once it
> is done.
> 
> Regards,
> Subru
> 
> Stuart Sierra wrote:
> 
>> Thanks for the advice, everyone.  I'm going to go with #2, packing my
>> million files into a small number of SequenceFiles.  This is slow, but
>> only has to be done once.  My "datacenter" is Amazon Web Services :),
>> so storing a few large, compressed files is the easiest way to go.
>>
>> My code, if anyone's interested, is here:
>> http://stuartsierra.com/2008/04/24/a-million-little-files
>>
>> -Stuart
>> altlaw.org
>>
>>
>> On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra 
>> <ma...@stuartsierra.com> wrote:
>>  
>>
>>> Hello all, Hadoop newbie here, asking: what's the preferred way to
>>>  handle large (~1 million) collections of small files (10 to 100KB) in
>>>  which each file is a single "record"?
>>>
>>>  1. Ignore it, let Hadoop create a million Map processes;
>>>  2. Pack all the files into a single SequenceFile; or
>>>  3. Something else?
>>>
>>>  I started writing code to do #2, transforming a big tar.bz2 into a
>>>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>>>  work?
>>>
>>>  Thanks,
>>>  -Stuart, altlaw.org
>>>
>>>     
> 
> 

Re: Best practices for handling many small files

Posted by Subramaniam Krishnan <su...@yahoo-inc.com>.
We have actually written a custom Multi File Splitter that collapses small 
files into a single split until the DFS block size is hit. We also handle 
big files by splitting them on the block size and adding all the remainders 
(if any) to a single split.
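
A very rough sketch of that packing rule, written against the plain FileSystem/Configuration API rather than as an actual InputFormat; this is not Subru's splitter, and the dfs.block.size lookup and the exact grouping of the remainders are assumptions about how such a splitter could work.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

/** Greedily packs small files (and the tail ends of big files) into groups of
 *  roughly one DFS block each; big files contribute one group per full block. */
public class BlockPacker {

  /** A chunk of one file: the byte range [offset, offset + length). */
  public static class Chunk {
    public final Path path;
    public final long offset;
    public final long length;
    Chunk(Path path, long offset, long length) {
      this.path = path; this.offset = offset; this.length = length;
    }
  }

  public static List<List<Chunk>> pack(FileStatus[] files, Configuration conf) {
    long blockSize = conf.getLong("dfs.block.size", 64 * 1024 * 1024);

    List<List<Chunk>> groups = new ArrayList<List<Chunk>>();
    List<Chunk> current = new ArrayList<Chunk>();   // the group being filled
    long currentSize = 0;

    for (FileStatus file : files) {
      long remaining = file.getLen();
      long offset = 0;
      // full blocks of big files become their own single-chunk groups
      while (remaining >= blockSize) {
        List<Chunk> full = new ArrayList<Chunk>();
        full.add(new Chunk(file.getPath(), offset, blockSize));
        groups.add(full);
        offset += blockSize;
        remaining -= blockSize;
      }
      // small files and leftover tails are packed together until a block fills up
      if (remaining > 0) {
        if (currentSize + remaining > blockSize && !current.isEmpty()) {
          groups.add(current);
          current = new ArrayList<Chunk>();
          currentSize = 0;
        }
        current.add(new Chunk(file.getPath(), offset, remaining));
        currentSize += remaining;
      }
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}

Each group would then become one split handed to a single map task.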

It works great for us... :-)
We are working on optimizing it further to group all the small files on a 
single data node together, so that the map can have maximum local data.

We plan to share this (provided it's found acceptable, of course) once it 
is done.

Regards,
Subru

Stuart Sierra wrote:
> Thanks for the advice, everyone.  I'm going to go with #2, packing my
> million files into a small number of SequenceFiles.  This is slow, but
> only has to be done once.  My "datacenter" is Amazon Web Services :),
> so storing a few large, compressed files is the easiest way to go.
>
> My code, if anyone's interested, is here:
> http://stuartsierra.com/2008/04/24/a-million-little-files
>
> -Stuart
> altlaw.org
>
>
> On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>   
>> Hello all, Hadoop newbie here, asking: what's the preferred way to
>>  handle large (~1 million) collections of small files (10 to 100KB) in
>>  which each file is a single "record"?
>>
>>  1. Ignore it, let Hadoop create a million Map processes;
>>  2. Pack all the files into a single SequenceFile; or
>>  3. Something else?
>>
>>  I started writing code to do #2, transforming a big tar.bz2 into a
>>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>>  work?
>>
>>  Thanks,
>>  -Stuart, altlaw.org
>>
>>     

Re: Best practices for handling many small files

Posted by Ted Dunning <td...@veoh.com>.
Stuart,

This will have the (slightly) desirable side effect of making your total
disk footprint smaller.  I don't suppose that matters all that much any
more, but it is still a nice thought.


On 4/24/08 8:28 AM, "Stuart Sierra" <ma...@stuartsierra.com> wrote:

> Thanks for the advice, everyone.  I'm going to go with #2, packing my
> million files into a small number of SequenceFiles.  This is slow, but
> only has to be done once.  My "datacenter" is Amazon Web Services :),
> so storing a few large, compressed files is the easiest way to go.
> 
> My code, if anyone's interested, is here:
> http://stuartsierra.com/2008/04/24/a-million-little-files
> 
> -Stuart
> altlaw.org
> 
> 
> On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>> Hello all, Hadoop newbie here, asking: what's the preferred way to
>>  handle large (~1 million) collections of small files (10 to 100KB) in
>>  which each file is a single "record"?
>> 
>>  1. Ignore it, let Hadoop create a million Map processes;
>>  2. Pack all the files into a single SequenceFile; or
>>  3. Something else?
>> 
>>  I started writing code to do #2, transforming a big tar.bz2 into a
>>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>>  work?
>> 
>>  Thanks,
>>  -Stuart, altlaw.org
>> 


Re: Best practices for handling many small files

Posted by Stuart Sierra <ma...@stuartsierra.com>.
Thanks for the advice, everyone.  I'm going to go with #2, packing my
million files into a small number of SequenceFiles.  This is slow, but
only has to be done once.  My "datacenter" is Amazon Web Services :),
so storing a few large, compressed files is the easiest way to go.

My code, if anyone's interested, is here:
http://stuartsierra.com/2008/04/24/a-million-little-files

-Stuart
altlaw.org


On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
> Hello all, Hadoop newbie here, asking: what's the preferred way to
>  handle large (~1 million) collections of small files (10 to 100KB) in
>  which each file is a single "record"?
>
>  1. Ignore it, let Hadoop create a million Map processes;
>  2. Pack all the files into a single SequenceFile; or
>  3. Something else?
>
>  I started writing code to do #2, transforming a big tar.bz2 into a
>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>  work?
>
>  Thanks,
>  -Stuart, altlaw.org
>

Re: Best practices for handling many small files

Posted by Ted Dunning <td...@veoh.com>.
Yes.  That (2) should work well.


On 4/23/08 8:55 AM, "Stuart Sierra" <ma...@stuartsierra.com> wrote:

> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
> 
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
> 
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys.  Will that
> work?
> 
> Thanks,
> -Stuart, altlaw.org