Posted to hdfs-user@hadoop.apache.org by Brendan cheng <cc...@hotmail.com> on 2012/05/22 11:39:26 UTC

Storing millions of small files

Hi,
I read the HDFS architecture doc and it said HDFS is tuned for storing large files, typically gigabytes to terabytes. What is the downside of storing millions of small files of <10MB? Or what settings of HDFS are suitable for storing small files?
Actually, I plan to find a distributed file system for storing many millions of files.
Brendan

Re: Storing millions of small files

Posted by Ted Dunning <td...@maprtech.com>.
Mongo has the best out-of-the-box experience of anything, but can be limited in
terms of how far it will scale.

HBase is a bit tricky to manage if you don't have expertise in managing
Hadoop.

Neither is a great idea if your data objects can be as large as 10MB.

On Wed, May 23, 2012 at 8:30 AM, Brendan cheng <cc...@hotmail.com> wrote:

>
> Thanks for your advice, guys! I have to mention more about my use case:
> (1) millions of files to store
> (2) 99% static, no change once written
> (3) fast download, or highly available
> (4) cost effective
> (5) in the future, I would like to extend a versioning system over the files
> Of course, from an administrative point of view, most Hadoop functionality
> works for me.
> I checked out HBase a little and I want to compare it with MongoDB, as both
> are also kind of key-value stores, but MongoDB gives me more functionality
> than I need at the moment.
> What do you think?
>
> ________________________________
> > Date: Tue, 22 May 2012 21:56:31 -0700
> > Subject: Re: Storing millions of small files
> > From: mcsrivas@gmail.com
> > To: hdfs-user@hadoop.apache.org
> >
> > Brendan, since you are looking for a distributed file system that can store
> > many millions of files, try out MapR.  A few customers have actually
> > crossed over 1 trillion files without hitting problems.  Small files or
> > large files are handled equally well.
> >
> > Of course, if you are doing map-reduce, it is better to process more
> > data per mapper (I'd say the sweet spot is between 64MB and 256MB of data),
> > so it might make sense to process many small files per mapper.
> >
> > On Tue, May 22, 2012 at 2:39 AM, Brendan cheng
> > <cc...@hotmail.com> wrote:
> >
> > Hi,
> > I read the HDFS architecture doc and it said HDFS is tuned for storing
> > large files, typically gigabytes to terabytes. What is the downside of
> > storing millions of small files of <10MB? Or what settings of HDFS are
> > suitable for storing small files?
> > Actually, I plan to find a distributed file system for storing many
> > millions of files.
> > Brendan
> >
>

RE: Storing millions of small files

Posted by Brendan cheng <cc...@hotmail.com>.
Thanks for your advice, guys! I have to mention more about my use case:
(1) millions of files to store
(2) 99% static, no change once written
(3) fast download, or highly available
(4) cost effective
(5) in the future, I would like to extend a versioning system over the files
Of course, from an administrative point of view, most Hadoop functionality works for me.
I checked out HBase a little and I want to compare it with MongoDB, as both are also kind of key-value stores, but MongoDB gives me more functionality than I need at the moment.
What do you think?

________________________________
> Date: Tue, 22 May 2012 21:56:31 -0700 
> Subject: Re: Storing millions of small files 
> From: mcsrivas@gmail.com 
> To: hdfs-user@hadoop.apache.org 
>  
> Brendan, since you are looking for a distributed file system that can store
> many millions of files, try out MapR.  A few customers have actually
> crossed over 1 trillion files without hitting problems.  Small files or
> large files are handled equally well.
>
> Of course, if you are doing map-reduce, it is better to process more
> data per mapper (I'd say the sweet spot is between 64MB and 256MB of data),
> so it might make sense to process many small files per mapper.
>  
> On Tue, May 22, 2012 at 2:39 AM, Brendan cheng  
> <cc...@hotmail.com> wrote:
>  
> Hi, 
> I read the HDFS architecture doc and it said HDFS is tuned for storing
> large files, typically gigabytes to terabytes. What is the downside of
> storing millions of small files of <10MB? Or what settings of HDFS are
> suitable for storing small files?
> Actually, I plan to find a distributed file system for storing many
> millions of files.
> Brendan 
>  

Re: Storing millions of small files

Posted by "M. C. Srivas" <mc...@gmail.com>.
Brendan, since you are looking for a distributed file system that can store many
millions of files, try out MapR.  A few customers have actually crossed
over 1 trillion files without hitting problems.  Small files or large files
are handled equally well.

Of course, if you are doing map-reduce, it is better to process more data
per mapper (I'd say the sweet spot is between 64MB and 256MB of data), so it
might make sense to process many small files per mapper.
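
One stock way to land each mapper in that 64MB-256MB range when the input is
many small files is CombineFileInputFormat, which packs several files into
each split. A minimal driver-side sketch (the job name and the input/output
paths are made up for illustration, not taken from anyone's setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combine-small-files");
    job.setJarByClass(CombineSmallFilesJob.class);
    // Pack many small files into each split, aiming for roughly 256MB per mapper.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    job.setMapperClass(Mapper.class);              // identity mapper as a placeholder
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/data/small-files"));        // made-up path
    FileOutputFormat.setOutputPath(job, new Path("/data/combined-output"));  // made-up path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}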

On Tue, May 22, 2012 at 2:39 AM, Brendan cheng <cc...@hotmail.com> wrote:

>
> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing
> large files, typically gigabytes to terabytes. What is the downside of storing
> millions of small files of <10MB? Or what settings of HDFS are suitable for
> storing small files?
> Actually, I plan to find a distributed file system for storing many
> millions of files.
> Brendan

Re: Storing millions of small files

Posted by Keith Wiley <kw...@keithwiley.com>.
In addition to the responses already provided, there is another downside to using Hadoop with numerous files: it takes much longer to run a Hadoop job!  Starting a Hadoop job involves communication between the driver (which runs on a client machine outside the cluster) and the namenode to locate all of the input files.  Each and every individual file is located with a set of RPCs between the client and the cluster, and this is done in an entirely serial fashion.  In experiments we ran (and gave a talk on at the Hadoop Summit in 2010) we concluded that this overhead dominated our Hadoop jobs.  By reducing the number of files (by using sequence files) we could greatly decrease the overall job time, simply by reducing the overhead of locating all of the files, even though the actual MapReduce time was unaffected.

Here's a link to the slides from my talk:
http://www.slideshare.net/ydn/8-image-stackinghadoopsummit2010
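
For anyone who wants to try the sequence-file approach, here is a minimal
sketch of packing a directory of small local files into one SequenceFile on
HDFS, with the file name as the key and the raw bytes as the value (the input
directory and output path are made up for illustration):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("/user/brendan/packed.seq");              // made-up output path
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    for (File f : new File("/data/small-files").listFiles()) {    // made-up input dir
      byte[] bytes = Files.readAllBytes(f.toPath());
      // One record per original file: name as the key, contents as the value.
      writer.append(new Text(f.getName()), new BytesWritable(bytes));
    }
    writer.close();
  }
}

The job then opens a handful of sequence files instead of millions of
individual inputs, so the serial file-location RPCs described above drop
accordingly.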

Cheers!

On May 22, 2012, at 02:39 , Brendan cheng wrote:

> 
> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing large files, typically gigabytes to terabytes. What is the downside of storing millions of small files of <10MB? Or what settings of HDFS are suitable for storing small files?
> Actually, I plan to find a distributed file system for storing many millions of files.
> Brendan


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
                                           --  Keith Wiley
________________________________________________________________________________


Re: Storing millions of small files

Posted by Wasif Riaz Malik <wm...@gmail.com>.
Hi Brendan,

The number of files that can be stored in HDFS is limited by the size of
the NameNode's RAM. The downside of storing small files is that you would
saturate the NameNode's RAM with a small data set (the sum of the sizes of all
your small files). However, you can store around 100 million files (at
least) using 60GB of RAM at the NameNode. The downside of having a large
namespace is that the NameNode might take up to an hour to recover from
failures, but you can overcome this issue by using the HA NameNode.
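
As a rough back-of-the-envelope check on those numbers (the ~150 bytes per
namespace object is the commonly quoted rule of thumb, and the 2x headroom
factor is just an assumption for GC and other in-memory structures, not a
measurement):

public class NameNodeHeapEstimate {
  public static void main(String[] args) {
    long files = 100_000_000L;     // 100 million small files
    long objects = files * 2L;     // roughly one inode plus one block per <10MB file
    long bytesPerObject = 150L;    // commonly quoted rule of thumb per namespace object
    double rawGb = objects * (double) bytesPerObject / (1L << 30);
    // Real heaps are sized with extra headroom, which is roughly how you
    // arrive at a figure in the tens of GB for 100 million files.
    System.out.printf("raw object data: ~%.0f GB, with 2x headroom: ~%.0f GB%n",
        rawGb, rawGb * 2);
  }
}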

Are you planning to store more than 100 million files?

Regards,
Wasif Riaz Malik



On Tue, May 22, 2012 at 11:39 AM, Brendan cheng <cc...@hotmail.com> wrote:

>
> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing
> large files, typically gigabytes to terabytes. What is the downside of storing
> millions of small files of <10MB? Or what settings of HDFS are suitable for
> storing small files?
> Actually, I plan to find a distributed file system for storing many
> millions of files.
> Brendan

Re: Storing millions of small files

Posted by Harsh J <ha...@cloudera.com>.
Brendan,

The issue with using lots of small files is that your processing
overhead increases (repeated, avoidable file open-read(little)-close
calls). HDFS is also used by those who wish to heavily process
the data they've stored, and with a huge number of files such processing
is not going to be quick to cut through. RAM is just
another factor, due to the design of the NameNode. But ideally you do not
want to end up having to go through millions of files when you
wish to process them all, as they can be stored more efficiently for
those purposes via several tools/formats/etc.

You can probably utilize HBase for such storage. It will allow you to
store large amounts of data in compact files while at the same time
allowing random access to them, if that's needed by your use case as
well. Check out this previous discussion on the topic at
http://search-hadoop.com/m/j95CxojSOC, which was related to storing
image files; it should apply to your question as well. Head over to
user@hbase.apache.org if you have further questions on Apache HBase.
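
As a concrete illustration, here is a rough sketch of writing a small file's
bytes into HBase under its path as the row key and reading it back, using the
HBase Java client. The table name "files", the column family "f", the row key
and the payload are all made up, and the table is assumed to exist already:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSmallFileStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("files"))) {
      byte[] row = Bytes.toBytes("/images/2012/05/photo-0001.jpg"); // made-up row key
      byte[] contents = Bytes.toBytes("small file payload");        // stand-in for file bytes

      // Write the file contents as a single cell keyed by its path.
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), contents);
      table.put(put);

      // Random read back by key, which a pile of raw HDFS files won't give you cheaply.
      Result result = table.get(new Get(row));
      byte[] data = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("data"));
      System.out.println("read " + data.length + " bytes");
    }
  }
}

If the objects really do approach 10MB each, keep in mind the caution voiced
elsewhere in the thread about storing such large values in HBase cells.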

On Tue, May 22, 2012 at 3:09 PM, Brendan cheng <cc...@hotmail.com> wrote:
>
> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing large files, typically gigabytes to terabytes. What is the downside of storing millions of small files of <10MB? Or what settings of HDFS are suitable for storing small files?
> Actually, I plan to find a distributed file system for storing many millions of files.
> Brendan



-- 
Harsh J

Re: Storing millions of small files

Posted by Mohammad Tariq <do...@gmail.com>.
Hi Brendan,
      Every file, directory and block in HDFS is represented as an
object in the namenode's memory, each of which occupies about 150 bytes. When
we store many small files in HDFS, these small files occupy a
large portion of the namespace (a large overhead on the namenode). As a
consequence, the disk space is underutilized because of the namespace
limitation. If you want to handle small files, you should go for
Hadoop sequence files or HAR files, depending upon your use
case. HBase is also an option, but again it depends upon your use
case. I would suggest you go through this blog:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/ - a must
read for people managing large numbers of small files.
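
For the HAR option, the archive is built with the "hadoop archive" command
line tool, and the files inside stay individually readable through the har://
filesystem with the ordinary FileSystem API. A minimal read-side sketch (the
host, port and paths are made up for illustration):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Archive created beforehand with something like:
    //   hadoop archive -archiveName files.har -p /user/brendan/small /user/brendan
    // har URI layout: har://<underlying scheme>-<host>:<port>/<path to .har>/<file inside>
    String inside = "har://hdfs-namenode:8020/user/brendan/files.har/img/0001.jpg";
    FileSystem fs = FileSystem.get(URI.create(inside), conf);
    FSDataInputStream in = fs.open(new Path(inside));
    byte[] buf = new byte[4096];
    int n = in.read(buf);
    System.out.println("read " + n + " bytes from inside the archive");
    in.close();
  }
}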

Regards,
    Mohammad Tariq


On Tue, May 22, 2012 at 3:09 PM, Brendan cheng <cc...@hotmail.com> wrote:
>
> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing large files, typically gigabytes to terabytes. What is the downside of storing millions of small files of <10MB? Or what settings of HDFS are suitable for storing small files?
> Actually, I plan to find a distributed file system for storing many millions of files.
> Brendan