Posted to common-user@hadoop.apache.org by 陈桂芬 <ch...@163.com> on 2009/05/07 04:33:55 UTC

how to improve the Hadoop's capability of dealing with small files

Hi:

In my application there are many small files, but Hadoop is designed to deal with many large files.

I want to know why Hadoop doesn't support small files very well and where the bottleneck is, and what I can do to improve Hadoop's capability of dealing with small files.

Thanks.


Re: how to improve the Hadoop's capability of dealing with small files

Posted by imcaptor <im...@gmail.com>.
Please try -D dfs.block.size=4096000
The value must be given in bytes.
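
For example, combined with the -put command you tried (quoted below; 4096000
bytes is roughly 4 MB, adjust to taste):

  hadoop dfs -D dfs.block.size=4096000 -put file /dest/

If you are writing the file from Java instead, the FileSystem.create()
overloads that take a blockSize argument should let you set a per-file block
size the same way.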


On Tue, May 5, 2009 at 4:47 AM, Christian Ulrik Søttrup <soettrup@nbi.dk> wrote:

> Hi all,
>
> I have a job that creates very big local files, so I need to split it across
> as many mappers as possible. With the DFS block size I'm
> using, this job is only split into 3 mappers. I don't want to
> change the HDFS-wide block size because it works for my other jobs.
>
> Is there a way to give a specific file a different block size? The
> documentation says there is, but does not explain how.
> I've tried:
> hadoop dfs -D dfs.block.size=4M -put file  /dest/
>
> But that does not work.
>
> Any help would be appreciated.
>
> Cheers,
> Chrulle
>

2009/5/7 陈桂芬 <ch...@163.com>

> Hi:
>
> In my application, there are many small files. But the hadoop is designed
> to deal with many large files.
>
> I want to know why hadoop doesn’t support small files very well and where
> is the bottleneck. And what can I do to improve the Hadoop’s capability of
> dealing with small files.
>
> Thanks.
>
>

Re: how to improve the Hadoop's capability of dealing with small files

Posted by Rasit OZDAS <ra...@gmail.com>.
I have a similar situation with very small files.
I have never tried HBase (I want to), but you can also group the small files
and write (let's say) 20-30 of them into one big file, so that every small
file becomes a key in that big file.

There are methods in the API with which you can write an object as a file
into HDFS and read it back to get the original object. Keeping a list of
items in that object can solve this problem.
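
A minimal sketch of that idea using Hadoop's SequenceFile, with the original
file name as the key and the raw bytes as the value (the paths, class name,
and single-read loop are illustrative assumptions, not from this thread):

  import java.io.File;
  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Pack a local directory of small files into one SequenceFile on HDFS.
  public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
      try {
        for (File f : new File(args[0]).listFiles()) {
          byte[] buf = new byte[(int) f.length()];
          FileInputStream in = new FileInputStream(f);
          try {
            in.read(buf);  // files are tiny, so a single read is enough here
          } finally {
            in.close();
          }
          // the original file name becomes the key, its contents the value
          writer.append(new Text(f.getName()), new BytesWritable(buf));
        }
      } finally {
        writer.close();
      }
    }
  }

A MapReduce job can then read the packed file with SequenceFileInputFormat
and receive one key/value pair per original small file.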

Re: how to improve the Hadoop's capability of dealing with small files

Posted by jason hadoop <ja...@gmail.com>.
The way I typically address that is to write a zip file using the zip
utilities, commonly for output.
HDFS is not optimized for low latency, but for high throughput on bulk
operations.
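
A rough sketch of that approach, writing the zip straight into HDFS with the
standard java.util.zip classes wrapped around an HDFS output stream (the
paths, class name, and buffer size are illustrative assumptions):

  import java.io.File;
  import java.io.FileInputStream;
  import java.util.zip.ZipEntry;
  import java.util.zip.ZipOutputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Bundle a local directory of small files into a single zip stored on HDFS.
  public class ZipToHdfs {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      ZipOutputStream zip = new ZipOutputStream(fs.create(new Path(args[1])));
      byte[] buf = new byte[4096];
      try {
        for (File f : new File(args[0]).listFiles()) {
          zip.putNextEntry(new ZipEntry(f.getName()));  // one entry per small file
          FileInputStream in = new FileInputStream(f);
          try {
            int n;
            while ((n = in.read(buf)) > 0) {
              zip.write(buf, 0, n);
            }
          } finally {
            in.close();
          }
          zip.closeEntry();
        }
      } finally {
        zip.close();  // flushes the zip central directory out to HDFS
      }
    }
  }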

2009/5/7 Edward Capriolo <ed...@gmail.com>

> 2009/5/7 Jeff Hammerbacher <ha...@cloudera.com>:
> > Hey,
> >
> > You can read more about why small files are difficult for HDFS at
> > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
> >
> > Regards,
> > Jeff
> >
> > 2009/5/7 Piotr Praczyk <pi...@gmail.com>
> >
> >> If You want to use many small files, they are probably having the same
> >> purpose and structure?
> >> Why not use HBase instead of a raw HDFS ? Many small files would be
> packed
> >> together and the problem would disappear.
> >>
> >> cheers
> >> Piotr
> >>
> >> 2009/5/7 Jonathan Cao <jo...@rockyou.com>
> >>
> >> > There are at least two design choices in Hadoop that have implications
> >> for
> >> > your scenario.
> >> > 1. All the HDFS meta data is stored in name node memory -- the memory
> >> size
> >> > is one limitation on how many "small" files you can have
> >> >
> >> > 2. The efficiency of map/reduce paradigm dictates that each
> >> mapper/reducer
> >> > job has enough work to offset the overhead of spawning the job.  It
> >> relies
> >> > on each task reading contiguous chunk of data (typically 64MB), your
> >> small
> >> > file situation will change those efficient sequential reads to larger
> >> > number
> >> > of inefficient random reads.
> >> >
> >> > Of course, small is a relative term?
> >> >
> >> > Jonathan
> >> >
> >> > 2009/5/6 陈桂芬 <ch...@163.com>
> >> >
> >> > > Hi:
> >> > >
> >> > > In my application, there are many small files. But the hadoop is
> >> designed
> >> > > to deal with many large files.
> >> > >
> >> > > I want to know why hadoop doesn't support small files very well and
> >> where
> >> > > is the bottleneck. And what can I do to improve the Hadoop's
> capability
> >> > of
> >> > > dealing with small files.
> >> > >
> >> > > Thanks.
> >> > >
> >> > >
> >> >
> >>
> >
> When the small file problem comes up most of the talk centers around
> the inode table being in memory. The cloudera blog points out
> something:
>
> Furthermore, HDFS is not geared up to efficiently accessing small
> files: it is primarily designed for streaming access of large files.
> Reading through small files normally causes lots of seeks and lots of
> hopping from datanode to datanode to retrieve each small file, all of
> which is an inefficient data access pattern.
>
> My application attempted to load 9,000 6 KB files using a single-threaded
> application and FSDataOutputStream objects to write directly
> to Hadoop files. My plan was to have Hadoop merge these files in the
> next step. I had to abandon this plan because the process was taking
> hours. I knew HDFS had a "small file problem", but I never realized
> that I could not approach the problem the 'old fashioned way'. I merged the
> files locally instead, and uploading a small number of merged files gave
> great throughput. Small files are not just a permanent-storage issue; they
> are a serious performance concern.
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals

Re: how to improve the Hadoop's capability of dealing with small files

Posted by Edward Capriolo <ed...@gmail.com>.
2009/5/7 Jeff Hammerbacher <ha...@cloudera.com>:
> Hey,
>
> You can read more about why small files are difficult for HDFS at
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
>
> Regards,
> Jeff
>
> 2009/5/7 Piotr Praczyk <pi...@gmail.com>
>
>> If You want to use many small files, they are probably having the same
>> purpose and structure?
>> Why not use HBase instead of a raw HDFS ? Many small files would be packed
>> together and the problem would disappear.
>>
>> cheers
>> Piotr
>>
>> 2009/5/7 Jonathan Cao <jo...@rockyou.com>
>>
>> > There are at least two design choices in Hadoop that have implications
>> for
>> > your scenario.
>> > 1. All the HDFS meta data is stored in name node memory -- the memory
>> size
>> > is one limitation on how many "small" files you can have
>> >
>> > 2. The efficiency of map/reduce paradigm dictates that each
>> mapper/reducer
>> > job has enough work to offset the overhead of spawning the job.  It
>> relies
>> > on each task reading contiguous chunk of data (typically 64MB), your
>> small
>> > file situation will change those efficient sequential reads to larger
>> > number
>> > of inefficient random reads.
>> >
>> > Of course, small is a relative term?
>> >
>> > Jonathan
>> >
>> > 2009/5/6 陈桂芬 <ch...@163.com>
>> >
>> > > Hi:
>> > >
>> > > In my application, there are many small files. But the hadoop is
>> designed
>> > > to deal with many large files.
>> > >
>> > > I want to know why hadoop doesn't support small files very well and
>> where
>> > > is the bottleneck. And what can I do to improve the Hadoop's capability
>> > of
>> > > dealing with small files.
>> > >
>> > > Thanks.
>> > >
>> > >
>> >
>>
>
When the small file problem comes up most of the talk centers around
the inode table being in memory. The cloudera blog points out
something:

Furthermore, HDFS is not geared up to efficiently accessing small
files: it is primarily designed for streaming access of large files.
Reading through small files normally causes lots of seeks and lots of
hopping from datanode to datanode to retrieve each small file, all of
which is an inefficient data access pattern.

My application attempted to load 9,000 6 KB files using a single-threaded
application and FSDataOutputStream objects to write directly
to Hadoop files. My plan was to have Hadoop merge these files in the
next step. I had to abandon this plan because the process was taking
hours. I knew HDFS had a "small file problem", but I never realized
that I could not approach the problem the 'old fashioned way'. I merged the
files locally instead, and uploading a small number of merged files gave
great throughput. Small files are not just a permanent-storage issue; they
are a serious performance concern.

Re: how to improve the Hadoop's capability of dealing with small files

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey,

You can read more about why small files are difficult for HDFS at
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Regards,
Jeff

2009/5/7 Piotr Praczyk <pi...@gmail.com>

> If You want to use many small files, they are probably having the same
> purpose and structure?
> Why not use HBase instead of a raw HDFS ? Many small files would be packed
> together and the problem would disappear.
>
> cheers
> Piotr
>
> 2009/5/7 Jonathan Cao <jo...@rockyou.com>
>
> > There are at least two design choices in Hadoop that have implications
> for
> > your scenario.
> > 1. All the HDFS meta data is stored in name node memory -- the memory
> size
> > is one limitation on how many "small" files you can have
> >
> > 2. The efficiency of map/reduce paradigm dictates that each
> mapper/reducer
> > job has enough work to offset the overhead of spawning the job.  It
> relies
> > on each task reading contiguous chunk of data (typically 64MB), your small
> small
> > file situation will change those efficient sequential reads to larger
> > number
> > of inefficient random reads.
> >
> > Of course, small is a relative term?
> >
> > Jonathan
> >
> > 2009/5/6 陈桂芬 <ch...@163.com>
> >
> > > Hi:
> > >
> > > In my application, there are many small files. But the hadoop is
> designed
> > > to deal with many large files.
> > >
> > > I want to know why hadoop doesn’t support small files very well and
> where
> > > is the bottleneck. And what can I do to improve the Hadoop’s capability
> > of
> > > dealing with small files.
> > >
> > > Thanks.
> > >
> > >
> >
>

Re: how to improve the Hadoop's capability of dealing with small files

Posted by Piotr Praczyk <pi...@gmail.com>.
If you want to use many small files, they probably all have the same
purpose and structure?
Why not use HBase instead of raw HDFS? Many small files would be packed
together and the problem would disappear.
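
For what it's worth, a hypothetical sketch of storing each small file as a
cell in HBase (the table and column names are invented, and this uses the
HTable/Put client API, which may differ from the client version you run):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  // Hypothetical: store one small file's bytes as a cell in an existing
  // HBase table "small_files" that has a column family "data".
  public class SmallFileStore {
    public static void store(String fileName, byte[] fileBytes) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "small_files");
      try {
        Put put = new Put(Bytes.toBytes(fileName));  // row key = file name
        put.add(Bytes.toBytes("data"), Bytes.toBytes("raw"), fileBytes);
        table.put(put);
      } finally {
        table.close();
      }
    }
  }

Lookups by file name then become single-row gets instead of reads of many
separate HDFS files.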

cheers
Piotr

2009/5/7 Jonathan Cao <jo...@rockyou.com>

> There are at least two design choices in Hadoop that have implications for
> your scenario.
> 1. All the HDFS meta data is stored in name node memory -- the memory size
> is one limitation on how many "small" files you can have
>
> 2. The efficiency of map/reduce paradigm dictates that each mapper/reducer
> job has enough work to offset the overhead of spawning the job.  It relies
> on each task reading contiguous chunk of data (typically 64MB), your small
> file situation will change those efficient sequential reads to larger
> number
> of inefficient random reads.
>
> Of course, small is a relative term?
>
> Jonathan
>
> 2009/5/6 陈桂芬 <ch...@163.com>
>
> > Hi:
> >
> > In my application, there are many small files. But the hadoop is designed
> > to deal with many large files.
> >
> > I want to know why hadoop doesn’t support small files very well and where
> > is the bottleneck. And what can I do to improve the Hadoop’s capability
> of
> > dealing with small files.
> >
> > Thanks.
> >
> >
>

Re: how to improve the Hadoop's capability of dealing with small files

Posted by Jonathan Cao <jo...@rockyou.com>.
There are at least two design choices in Hadoop that have implications for
your scenario.
1. All the HDFS metadata is stored in namenode memory -- the available memory
is one limit on how many "small" files you can have.

2. The efficiency of the map/reduce paradigm depends on each mapper/reducer
task having enough work to offset the overhead of spawning it. It relies
on each task reading a contiguous chunk of data (typically 64 MB); with many
small files, those efficient sequential reads turn into a larger number
of inefficient random reads.
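
As a rough illustration of point 1 (using the ~150 bytes of namenode heap per
file, directory, or block object that the Cloudera post linked elsewhere in
this thread cites as a rule of thumb): 10 million small files, each fitting
in a single block, work out to about 10,000,000 x 2 objects x 150 bytes,
i.e. roughly 3 GB of namenode memory, before anything else is counted.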

Of course, "small" is a relative term.

Jonathan

2009/5/6 陈桂芬 <ch...@163.com>

> Hi:
>
> In my application, there are many small files. But the hadoop is designed
> to deal with many large files.
>
> I want to know why hadoop doesn’t support small files very well and where
> is the bottleneck. And what can I do to improve the Hadoop’s capability of
> dealing with small files.
>
> Thanks.
>
>