Posted to mapreduce-user@hadoop.apache.org by Per Steffensen <st...@designware.dk> on 2011/09/01 10:57:45 UTC

How do FileInputFormat sub-classes handle many small files

Hi

FileInputFormat sub-classes (TextInputFormat and 
SequenceFileInputFormat) are able to take all files in a folder and 
split the work of handling them into several sub-jobs (map-jobs). I know 
it can split a very big file into several sub-jobs, but how does it 
handle many small files in the folder? If there are 10000 small files, 
each with 100 data records, I would not like my sub-jobs to become too 
small (due to the overhead of starting a JVM for each sub-job etc.). I 
would like e.g. 100 sub-jobs each handling about 10000 data records, or 
maybe 10 sub-jobs each handling about 100000 data records, but I would 
not like 10000 sub-jobs each handling only 100 data records. For this to 
be possible, one split (the work to be done by one sub-job) will have to 
span more than one file. My question is whether FileInputFormat 
sub-classes are able to make such splits, or whether they always create 
at least one split = sub-job = map-job per file.

Another thing: I expect that FileInputFormat has to somehow list the 
files in the folder. How does this listing handle many, many files in 
the folder? Most OS's are bad at listing files in folders when there are 
a lot of files - at some point it becomes worse than O(n), where n is 
the number of files. Windows of course really sucks at this, and even 
Linux has problems with a very high number of files. How does HDFS 
handle listing of files in a folder with many, many files? Or maybe I 
should address this question to the hdfs mailing list?

Regards, Per Steffensen

Re: How do FileInputFormat sub-classes handle many small files

Posted by Harsh J <ha...@cloudera.com>.
Per,

On Fri, Sep 2, 2011 at 12:33 AM, Per Steffensen <st...@designware.dk> wrote:
> Yes, I found CombineFileInputFormat. It worries me a little, though, to see
> that it extends the deprecated FileInputFormat instead of the new
> FileInputFormat. Is that a problem?
> Also I notice that CombineFileInputFormat is abstract. Why is that? Is the
> extension shown on the following webpage a good way out of this:
> http://blog.yetitrails.com/2011/04/dealing-with-lots-of-small-files-in.html

It is abstract because it does not include a record reader, and needs
you to specify one for your files. Even FileInputFormat is unusable on
its own - you generally use the Text or Sequence InputFormats depending
on your file format. It's not difficult to extend it to fit your
requirements, though :)

That blog post looks good to me as an example. Do adapt it to the
proper record reader you require (LineRecordReader,
SequenceFile.Reader, etc.).
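
To make it concrete, here is a rough, untested sketch along the lines of
that post, written against the new (mapreduce) API - the class names
CombinedTextInputFormat and MultiFileLineRecordReader are just made up
for the example:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Packs many small text files into each split instead of one split per file.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // CombineFileRecordReader instantiates one per-file reader for every
    // file packed into the combined split and chains them together.
    return new CombineFileRecordReader<LongWritable, Text>(
        (CombineFileSplit) split, context, MultiFileLineRecordReader.class);
  }

  // Per-file reader: delegates to the stock LineRecordReader for the single
  // file at position 'index' inside the combined split.
  public static class MultiFileLineRecordReader
      extends RecordReader<LongWritable, Text> {

    private final CombineFileSplit split;
    private final int index;
    private final LineRecordReader delegate = new LineRecordReader();

    // This constructor signature is what CombineFileRecordReader expects.
    public MultiFileLineRecordReader(CombineFileSplit split,
        TaskAttemptContext context, Integer index) {
      this.split = split;
      this.index = index;
    }

    @Override
    public void initialize(InputSplit ignored, TaskAttemptContext context)
        throws IOException, InterruptedException {
      // Expose the single file as a plain FileSplit to the delegate.
      Path path = split.getPath(index);
      FileSplit fileSplit = new FileSplit(path, split.getOffset(index),
          split.getLength(index), split.getLocations());
      delegate.initialize(fileSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
      return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
      return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
      return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}

The important bit is the constructor signature of the inner reader -
CombineFileRecordReader looks it up via reflection.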

Regarding the stable/new API: for the 0.20 releases, please disregard
the deprecation of the stable (mapred) API. It was undeprecated later
and re-deemed stable. If you'd still like to use the new API for this
class, you may need to pull it from a higher version's sources, or use
a distro/release that incorporates it (e.g. I use CDH3 here, and it has
CombineFileInputFormat in both the new and the stable API classes,
thanks to its tested backporting).

-- 
Harsh J

Re: How do FileInputFormat sub-classes handle many small files

Posted by Per Steffensen <st...@designware.dk>.
Harsh J wrote:
> Hello Per,
>
> On Thu, Sep 1, 2011 at 2:27 PM, Per Steffensen <st...@designware.dk> wrote:
>   
>> Hi
>>
>> FileInputFormat sub-classes (TextInputFormat and SequenceFileInputFormat)
>> are able to take all files in a folder and split the work of handling them
>> into several sub-jobs (map-jobs). I know it can split a very big file into
>> several sub-jobs, but how does it handle many small files in the folder? If
>> there are 10000 small files, each with 100 data records, I would not like my
>> sub-jobs to become too small (due to the overhead of starting a JVM for each
>> sub-job etc.). I would like e.g. 100 sub-jobs each handling about 10000
>> data records, or maybe 10 sub-jobs each handling about 100000 data records,
>> but I would not like 10000 sub-jobs each handling only 100 data records. For
>> this to be possible, one split (the work to be done by one sub-job) will have
>> to span more than one file. My question is whether FileInputFormat sub-classes
>> are able to make such splits, or whether they always create at least one
>> split = sub-job = map-job per file.
>>     
>
> You need CombineFileInputFormat for your case, not the vanilla
> FileInputFormat which gives at least one split per file.
>   
Yes, I found CombineFileInputFormat. It worries me a little, though, to see 
that it extends the deprecated FileInputFormat instead of the new 
FileInputFormat. Is that a problem?
Also I notice that CombineFileInputFormat is abstract. Why is that? Is 
the extension shown on the following webpage a good way out of this:
http://blog.yetitrails.com/2011/04/dealing-with-lots-of-small-files-in.html
>   
>> Another thing: I expect that FileInputFormat has to somehow list the
>> files in the folder. How does this listing handle many, many files in the
>> folder? Most OS's are bad at listing files in folders when there are a lot
>> of files - at some point it becomes worse than O(n), where n is the number of
>> files. Windows of course really sucks at this, and even Linux has problems
>> with a very high number of files. How does HDFS handle listing of files in a
>> folder with many, many files? Or maybe I should address this question to the
>> hdfs mailing list?
>>     
>
> Listing would be one RPC call in HDFS, so it might take some time for
> millions of files under a single directory. Although the listing
> operation itself would be fast enough, since it is done in the
> NameNode's memory, transferring the results to the client for a large
> number of items may take some time -- and there is no way to page
> through them either. I do not think the complexity is worse than O(n),
> though, for files under the same dir - only the transfer cost should
> be your worry. But I have not measured these things, so I cannot give
> you concrete statements on this. Might be a good exercise?
>   
It would be nice to have some numbers on this, yes. But we will run our 
own performance test anyway.
> Also note that listing is only done on the front end (at job client
> submission) and not later on, so it is just a one-time cost.
>
>   


Re: How do FileInputFormat sub-classes handle many small files

Posted by Harsh J <ha...@cloudera.com>.
Hello Per,

On Thu, Sep 1, 2011 at 2:27 PM, Per Steffensen <st...@designware.dk> wrote:
> Hi
>
> FileInputFormat sub-classes (TextInputFormat and SequenceFileInputFormat)
> are able to take all files in a folder and split the work of handling them
> into several sub-jobs (map-jobs). I know it can split a very big file into
> several sub-jobs, but how does it handle many small files in the folder? If
> there are 10000 small files, each with 100 data records, I would not like my
> sub-jobs to become too small (due to the overhead of starting a JVM for each
> sub-job etc.). I would like e.g. 100 sub-jobs each handling about 10000
> data records, or maybe 10 sub-jobs each handling about 100000 data records,
> but I would not like 10000 sub-jobs each handling only 100 data records. For
> this to be possible, one split (the work to be done by one sub-job) will have
> to span more than one file. My question is whether FileInputFormat sub-classes
> are able to make such splits, or whether they always create at least one
> split = sub-job = map-job per file.

You need CombineFileInputFormat for your case, not the vanilla
FileInputFormat which gives at least one split per file.
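
On the driver side, it could look roughly like this (untested sketch;
SmallFilesDriver, CombinedTextInputFormat and the paths are made up, and
the exact split-size property name depends on your release, so both the
old and the new names are set here as an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cap each combined split at roughly 64 MB, so thousands of tiny files
    // collapse into a handful of map tasks instead of one map per file.
    // 0.20-era code reads mapred.max.split.size; newer code reads
    // mapreduce.input.fileinputformat.split.maxsize.
    long maxSplit = 64L * 1024 * 1024;
    conf.setLong("mapred.max.split.size", maxSplit);
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplit);

    Job job = new Job(conf, "small-files-example");
    job.setJarByClass(SmallFilesDriver.class);
    job.setInputFormatClass(CombinedTextInputFormat.class); // see the sketch earlier in the thread
    job.setMapperClass(Mapper.class);                       // identity mapper, map-only job
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}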

> Another thing: I expect that FileInputFormat has to somehow list the
> files in the folder. How does this listing handle many, many files in the
> folder? Most OS's are bad at listing files in folders when there are a lot
> of files - at some point it becomes worse than O(n), where n is the number of
> files. Windows of course really sucks at this, and even Linux has problems
> with a very high number of files. How does HDFS handle listing of files in a
> folder with many, many files? Or maybe I should address this question to the
> hdfs mailing list?

Listing would be one RPC call in HDFS, so it might take some time for
millions of files under a single directory. Although the listing
operation itself would be fast enough, since it is done in the
NameNode's memory, transferring the results to the client for a large
number of items may take some time -- and there is no way to page
through them either. I do not think the complexity is worse than O(n),
though, for files under the same dir - only the transfer cost should
be your worry. But I have not measured these things, so I cannot give
you concrete statements on this. Might be a good exercise?
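
If you do want to measure it, a quick throwaway test along these lines
(class name made up) would time the listStatus() call itself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListingTimer {
  public static void main(String[] args) throws Exception {
    // args[0] is the HDFS directory to list, e.g. /data/many-small-files
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path(args[0]);

    long start = System.currentTimeMillis();
    // Single client call; the NameNode answers from memory, but all the
    // FileStatus objects still have to travel back over the wire.
    FileStatus[] listing = fs.listStatus(dir);
    long millis = System.currentTimeMillis() - start;

    System.out.println(listing.length + " entries listed in " + millis + " ms");
  }
}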

Also note that listing is only done on the front end (at job client 
submission) and not later on, so it is just a one-time cost.

-- 
Harsh J