Posted to common-user@hadoop.apache.org by Peter McTaggart <pe...@gmail.com> on 2008/09/15 06:14:12 UTC

Small Filesizes

Hi All,



I am considering using HDFS for an application that potentially has many
small files, i.e. 10-100 million files with an estimated average file size
of 50-100 KB (perhaps smaller), and it is an online, interactive application.

All of the documentation I have seen suggests that a block size of 64-128 MB
works best for Hadoop/HDFS and that it is best suited to batch-oriented
applications.



Does anyone have any experience using it for files of this size in an
online application environment?

Is it worth pursuing HDFS for this type of application?



Thanks

Peter

Re: Small Filesizes

Posted by Stuart Sierra <ma...@stuartsierra.com>.
On Mon, Sep 15, 2008 at 8:23 AM, Brian Vargas <br...@ardvaark.net> wrote:
> A simple solution is to just load all of the small files into a sequence
> file, and process the sequence file instead.

I use this approach too.  I make SequenceFiles with
key= the file name (Text)
value= the contents of the file (BytesWritable)
and use BLOCK compression.

-Stuart
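
A minimal sketch of the packing step Stuart describes, assuming the Hadoop
SequenceFile.Writer API of that era (Text key, BytesWritable value, BLOCK
compression); the class name, paths, and directory layout are illustrative
and error handling is kept to a bare minimum:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustrative: pack every file in a directory into one block-compressed
// SequenceFile, keyed by file name, valued by the raw file contents.
public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);    // directory of small files
    Path packed = new Path(args[1]);    // output SequenceFile

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (FileStatus stat : fs.listStatus(srcDir)) {
        if (stat.isDir()) {
          continue;                     // skip subdirectories
        }
        byte[] contents = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(contents);
        } finally {
          in.close();
        }
        // key = file name (Text), value = file contents (BytesWritable)
        writer.append(new Text(stat.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}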

Re: Small Filesizes

Posted by Mafish Liu <ma...@gmail.com>.
Hi,
  I'm working on exactly the situation you describe, with millions of small
files of around 10 KB each.
  My idea is to compact these files into larger ones and create indexes for
them. It is effectively a file system layered on top of a file system, and it
supports append updates and lazy deletion.
  Hope this helps.

-- 
Mafish@gmail.com
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
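
Purely as an illustration of that idea (not Mafish's actual code), the index
can be as simple as a map from file name to an (offset, length) pair within
the packed container file, with lazy deletion done by flagging entries; every
name below is made up:

import java.util.HashMap;
import java.util.Map;

// Illustrative index for small files packed into one large container file:
// maps each logical file name to its byte offset and length in the container.
public class PackedFileIndex {
  public static class Entry {
    public final long offset;     // where the file's bytes start
    public final int length;      // how many bytes belong to this file
    public final boolean deleted; // lazy delete: flag now, reclaim space later
    public Entry(long offset, int length, boolean deleted) {
      this.offset = offset;
      this.length = length;
      this.deleted = deleted;
    }
  }

  private final Map<String, Entry> index = new HashMap<String, Entry>();
  private long nextOffset = 0;

  // Append a new file of the given size; returns the offset to write it at.
  public long append(String name, int length) {
    long offset = nextOffset;
    index.put(name, new Entry(offset, length, false));
    nextOffset += length;
    return offset;
  }

  // Lazy delete: only the index entry is flagged; the container is untouched.
  public void delete(String name) {
    Entry e = index.get(name);
    if (e != null) {
      index.put(name, new Entry(e.offset, e.length, true));
    }
  }

  // Returns null if the file is unknown or has been lazily deleted.
  public Entry lookup(String name) {
    Entry e = index.get(name);
    return (e == null || e.deleted) ? null : e;
  }
}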

Re: Small Filesizes

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Peter,

You are likely to hit memory limitations on the name-node.
With 100 million small files it will need to support roughly 200 million
objects (each file plus its single block), which will require roughly 30 GB
of RAM on the name-node.
You may also consider Hadoop Archives, or present your files as a
collection of records and use Pig, Hive, etc.

--Konstantin
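
A back-of-the-envelope version of that estimate, assuming the commonly cited
rule of thumb of roughly 150 bytes of name-node heap per file or block object
(the per-object figure is an assumption, not something given in this thread):

// Rough name-node heap estimate for 100 million small files.
public class NameNodeMemoryEstimate {
  public static void main(String[] args) {
    long files = 100L * 1000 * 1000;   // 100 million files
    long blocks = files;               // each small file fits in a single block
    long objects = files + blocks;     // ~200 million namespace objects
    long bytesPerObject = 150;         // assumed heap cost per object
    double gb = objects * bytesPerObject / (1024.0 * 1024 * 1024);
    // Prints roughly 28 GB, in line with the ~30 GB figure above.
    System.out.printf("~%.0f GB of name-node heap%n", gb);
  }
}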

Re: Small Filesizes

Posted by Brian Vargas <br...@ardvaark.net>.
Peter,

In my testing with files of that size (well, larger, but still well
below the block size), it was impossible to achieve any real throughput
on the data because of the overhead of looking up the locations of all
those files on the NameNode.  Your application spends so much time
looking up file names that most of the CPUs sit idle.

A simple solution is to just load all of the small files into a sequence
file, and process the sequence file instead.

Brian
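
To make the "process the sequence file instead" part concrete, here is a
minimal sketch of reading such a file back sequentially, assuming the
Text/BytesWritable layout described earlier in the thread (a MapReduce job
would typically read it through SequenceFileInputFormat instead); the class
name and output are illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustrative: scan a packed SequenceFile and report each original file.
public class ScanPackedFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path packed = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, packed, conf);
    try {
      Text name = new Text();                       // original file name
      BytesWritable contents = new BytesWritable(); // original file bytes
      while (reader.next(name, contents)) {
        // contents.getBytes() holds the file body; only the first
        // contents.getLength() bytes are valid.
        System.out.println(name + "\t" + contents.getLength() + " bytes");
      }
    } finally {
      reader.close();
    }
  }
}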
